[PROPOSAL] Effective storage of duplicates in B-tree index.

Started by Anastasia Lubennikovaover 10 years ago145 messages

a.lubennikova@postgrespro.ru

over 10 years ago

Hi, hackers!
I'm going to begin work on effective storage of duplicate keys in B-tree
index.
The main idea is to implement posting lists and posting trees for B-tree
index pages as it's already done for GIN.

In a nutshell, effective storing of duplicates in GIN is organised as
follows.
Index stores single index tuple for each unique key. That index tuple
points to posting list which contains pointers to heap tuples (TIDs). If
too many rows having the same key, multiple pages are allocated for the
TIDs and these constitute so called posting tree.
You can find wonderful detailed descriptions in gin readme
<https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README>
and articles <http://www.cybertec.at/gin-just-an-index-type/>.
It also makes possible to apply compression algorithm to posting
list/tree and significantly decrease index size. Read more in
presentation (part 1)
<http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf>.

Now new B-tree index tuple must be inserted for each table row that we
index.
It can possibly cause page split. Because of MVCC even unique index
could contain duplicates.
Storing duplicates in posting list/tree helps to avoid superfluous splits.

So it seems to be very useful improvement. Of course it requires a lot
of changes in B-tree implementation, so I need approval from community.

1. Compatibility.
It's important to save compatibility with older index versions.
I'm going to change BTREE_VERSION to 3.
And use new (posting) features for v3, saving old implementation for v2.
Any objections?

2. There are several tricks to handle non-unique keys in B-tree.
More info in btree readme
<https://github.com/postgres/postgres/blob/master/src/backend/access/nbtree/README>
(chapter - Differences to the Lehman & Yao algorithm).
In the new version they'll become useless. Am I right?

3. Microvacuum.
Killed items are marked LP_DEAD and could be deleted from separate page
at time of insertion.
Now it's fine, because each item corresponds with separate TID. But
posting list implementation requires another way. I've got two ideas:
First is to mark LP_DEAD only those tuples where all TIDs are not visible.
Second is to add LP_DEAD flag to each TID in posting list(tree). This
way requires a bit more space, but allows to do microvacuum of posting
list/tree.
Which one is better?

--
Anastasia Lubennikova
Postgres Professional:http://www.postgrespro.com
The Russian Postgres Company

Tomas Vondra

tomas.vondra@2ndquadrant.com

over 10 years ago

In reply to: Anastasia Lubennikova (#1)

Re: [PROPOSAL] Effective storage of duplicates in B-tree index.

Hi,

On 08/31/2015 09:41 AM, Anastasia Lubennikova wrote:

Hi, hackers!
I'm going to begin work on effective storage of duplicate keys in B-tree
index.
The main idea is to implement posting lists and posting trees for B-tree
index pages as it's already done for GIN.

In a nutshell, effective storing of duplicates in GIN is organised as
follows.
Index stores single index tuple for each unique key. That index tuple
points to posting list which contains pointers to heap tuples (TIDs). If
too many rows having the same key, multiple pages are allocated for the
TIDs and these constitute so called posting tree.
You can find wonderful detailed descriptions in gin readme
<https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README>
and articles <http://www.cybertec.at/gin-just-an-index-type/>.
It also makes possible to apply compression algorithm to posting
list/tree and significantly decrease index size. Read more in
presentation (part 1)
<http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf>.

Now new B-tree index tuple must be inserted for each table row that we
index.
It can possibly cause page split. Because of MVCC even unique index
could contain duplicates.
Storing duplicates in posting list/tree helps to avoid superfluous splits.

So it seems to be very useful improvement. Of course it requires a lot
of changes in B-tree implementation, so I need approval from community.

In general, index size is often a serious issue - cases where indexes
need more space than tables are not quite uncommon in my experience. So
I think the efforts to lower space requirements for indexes are good.

But if we introduce posting lists into btree indexes, how different are
they from GIN? It seems to me that if I create a GIN index (using
btree_gin), I do get mostly the same thing you propose, no?

Sure, there are differences - GIN indexes don't handle UNIQUE indexes,
but the compression can only be effective when there are duplicate rows.
So either the index is not UNIQUE (so the b-tree feature is not needed),
or there are many updates.

Which brings me to the other benefit of btree indexes - they are
designed for high concurrency. How much is this going to be affected by
introducing the posting lists?

kind regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Alexander Korotkov

a.korotkov@postgrespro.ru

over 10 years ago

In reply to: Tomas Vondra (#2)

Re: [PROPOSAL] Effective storage of duplicates in B-tree index.

Hi, Tomas!

On Mon, Aug 31, 2015 at 6:26 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

On 08/31/2015 09:41 AM, Anastasia Lubennikova wrote:

I'm going to begin work on effective storage of duplicate keys in B-tree
index.
The main idea is to implement posting lists and posting trees for B-tree
index pages as it's already done for GIN.

In a nutshell, effective storing of duplicates in GIN is organised as
follows.
Index stores single index tuple for each unique key. That index tuple
points to posting list which contains pointers to heap tuples (TIDs). If
too many rows having the same key, multiple pages are allocated for the
TIDs and these constitute so called posting tree.
You can find wonderful detailed descriptions in gin readme
<
https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README

and articles <http://www.cybertec.at/gin-just-an-index-type/>.
It also makes possible to apply compression algorithm to posting
list/tree and significantly decrease index size. Read more in
presentation (part 1)
<http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf>.

Now new B-tree index tuple must be inserted for each table row that we
index.
It can possibly cause page split. Because of MVCC even unique index
could contain duplicates.
Storing duplicates in posting list/tree helps to avoid superfluous splits.

So it seems to be very useful improvement. Of course it requires a lot
of changes in B-tree implementation, so I need approval from community.

In general, index size is often a serious issue - cases where indexes need
more space than tables are not quite uncommon in my experience. So I think
the efforts to lower space requirements for indexes are good.

But if we introduce posting lists into btree indexes, how different are
they from GIN? It seems to me that if I create a GIN index (using
btree_gin), I do get mostly the same thing you propose, no?

Yes, In general GIN is a btree with effective duplicates handling + support
of splitting single datums into multiple keys.
This proposal is mostly porting duplicates handling from GIN to btree.

Sure, there are differences - GIN indexes don't handle UNIQUE indexes,

The difference between btree_gin and btree is not only UNIQUE feature.
1) There is no gingettuple in GIN. GIN supports only bitmap scans. And it's
not feasible to add gingettuple to GIN. At least with same semantics as it
is in btree.
2) GIN doesn't support multicolumn indexes in the way btree does.
Multicolumn GIN is more like set of separate singlecolumn GINs: it doesn't
have composite keys.
3) btree_gin can't effectively handle range searches. "a < x < b" would be
hangle as "a < x" intersect "x < b". That is extremely inefficient. It is
possible to fix. However, there is no clear proposal how to fit this case
into GIN interface, yet.

but the compression can only be effective when there are duplicate rows.
So either the index is not UNIQUE (so the b-tree feature is not needed), or
there are many updates.

From my observations users can use btree_gin only in some cases. They like
compression, but can't use btree_gin mostly because of #1.

Which brings me to the other benefit of btree indexes - they are designed

for high concurrency. How much is this going to be affected by introducing
the posting lists?

I'd notice that current duplicates handling in PostgreSQL is hack over
original btree. It is designed so in btree access method in PostgreSQL, not
btree in general.
Posting lists shouldn't change concurrency much. Currently, in btree you
have to lock one page exclusively when you're inserting new value.
When posting list is small and fits one page you have to do similar thing:
exclusive lock of one page to insert new value.
When you have posting tree, you have to do exclusive lock on one page of
posting tree.

One can say that concurrency would became worse because index would become
smaller and number of pages would became smaller too. Since number of pages
would be smaller, backends are more likely concur for the same page. But
this argument can be user against any compression and for any bloat.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Tomas Vondra

tomas.vondra@2ndquadrant.com

over 10 years ago

In reply to: Alexander Korotkov (#3)

Re: [PROPOSAL] Effective storage of duplicates in B-tree index.

On 09/01/2015 11:31 AM, Alexander Korotkov wrote:
...

Yes, In general GIN is a btree with effective duplicates handling +
support of splitting single datums into multiple keys.
This proposal is mostly porting duplicates handling from GIN to btree.

Sure, there are differences - GIN indexes don't handle UNIQUE indexes,

The difference between btree_gin and btree is not only UNIQUE feature.
1) There is no gingettuple in GIN. GIN supports only bitmap scans. And
it's not feasible to add gingettuple to GIN. At least with same
semantics as it is in btree.
2) GIN doesn't support multicolumn indexes in the way btree does.
Multicolumn GIN is more like set of separate singlecolumn GINs: it
doesn't have composite keys.
3) btree_gin can't effectively handle range searches. "a < x < b" would
be hangle as "a < x" intersect "x < b". That is extremely inefficient.
It is possible to fix. However, there is no clear proposal how to fit
this case into GIN interface, yet.

but the compression can only be effective when there are duplicate
rows. So either the index is not UNIQUE (so the b-tree feature is
not needed), or there are many updates.

From my observations users can use btree_gin only in some cases. They
like compression, but can't use btree_gin mostly because of #1.

Thanks for the explanation! I'm not that familiar with GIN internals,
but this mostly matches my understanding. I have only mentioned UNIQUE
because the lack of gettuple() method seems obvious - and it works fine
when GIN indexes are used as "bitmap indexes".

But you're right - we can't do index only scans on GIN indexes, which is
a huge benefit of btree indexes.

Which brings me to the other benefit of btree indexes - they are
designed for high concurrency. How much is this going to be affected
by introducing the posting lists?

I'd notice that current duplicates handling in PostgreSQL is hack over
original btree. It is designed so in btree access method in PostgreSQL,
not btree in general.
Posting lists shouldn't change concurrency much. Currently, in btree you
have to lock one page exclusively when you're inserting new value.
When posting list is small and fits one page you have to do similar
thing: exclusive lock of one page to insert new value.
When you have posting tree, you have to do exclusive lock on one page of
posting tree.

OK.

One can say that concurrency would became worse because index would
become smaller and number of pages would became smaller too. Since
number of pages would be smaller, backends are more likely concur for
the same page. But this argument can be user against any compression and
for any bloat.

Which might be a problem for some use cases, but I assume we could add
an option disabling this per-index. Probably having it "off" by default,
and only enabling the compression explicitly.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Peter Geoghegan

pg@heroku.com

over 10 years ago

In reply to: Anastasia Lubennikova (#1)

Re: [PROPOSAL] Effective storage of duplicates in B-tree index.

On Mon, Aug 31, 2015 at 12:41 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

Now new B-tree index tuple must be inserted for each table row that we
index.
It can possibly cause page split. Because of MVCC even unique index could
contain duplicates.
Storing duplicates in posting list/tree helps to avoid superfluous splits.

I'm glad someone is thinking about this, because it is certainly
needed. I thought about working on it myself, but there is always
something else to do. I should be able to assist with review, though.

So it seems to be very useful improvement. Of course it requires a lot of
changes in B-tree implementation, so I need approval from community.

1. Compatibility.
It's important to save compatibility with older index versions.
I'm going to change BTREE_VERSION to 3.
And use new (posting) features for v3, saving old implementation for v2.
Any objections?

It might be better to just have a flag bit for pages that are
compressed -- there are IIRC 8 free bits in the B-Tree page special
area flags variable. But no real opinion on this from me, yet. You
have plenty of bitspace to work with to mark B-Tree pages, in any
case.

2. There are several tricks to handle non-unique keys in B-tree.
More info in btree readme (chapter - Differences to the Lehman & Yao
algorithm).
In the new version they'll become useless. Am I right?

I think that the L&Y algorithm makes assumptions for the sake of
simplicity, rather than because they really believed that there were
real problems. For example, they say that deletion can occur offline
or something along those lines, even though that's clearly
impractical. They say that because they didn't want to write a paper
about deletion within B-Trees, I suppose.

See also, my opinion of how they claim to not need read locks [1]/messages/by-id/CAM3SWZT-T9o_dchK8E4_YbKQ+LPJTpd89E6dtPwhXnBV_5NE3Q@mail.gmail.com.
Also, note that despite the fact that the GIN README mentions "Lehman
& Yao style right links", it doesn't actually do the L&Y trick of
avoiding lock coupling -- the whole point of L&Y -- so that remark is
misleading. This must be why B-Tree has much better concurrency than
GIN in practice.

Anyway, the way that I always imagined this would work is a layer
"below" the current implementation. In other words, you could easily
have prefix compression with a prefix that could end at a point within
a reference IndexTuple. It could be any arbitrary point in the second
or subsequent attribute, and would not "care" about the structure of
the IndexTuple when it comes to where attributes begin and end, etc
(although, in reality, in probably would end up caring, because of the
complexity -- not caring is the ideal only, at least to me). As
Alexander pointed out, GIN does not care about composite keys.

That seems quite different to a GIN posting list (something that I
know way less about, FYI). So I'm really talking about a slightly
different thing -- prefix compression, rather than handling
duplicates. Whether or not you should do prefix compression instead of
deduplication is certainly not clear to me, but it should be
considered. Also, I always imagined that prefix compression would use
the highkey as the thing that is offset for each "real" IndexTuple,
because it's there anyway, and that's simple. However, I suppose that
that means that duplicate handling can't really work in a way that
makes duplicates have a fixed cost, which may be a particularly
important property to you.

3. Microvacuum.
Killed items are marked LP_DEAD and could be deleted from separate page at
time of insertion.
Now it's fine, because each item corresponds with separate TID. But posting
list implementation requires another way. I've got two ideas:
First is to mark LP_DEAD only those tuples where all TIDs are not visible.
Second is to add LP_DEAD flag to each TID in posting list(tree). This way
requires a bit more space, but allows to do microvacuum of posting
list/tree.

No real opinion on this point, except that I agree that doing
something is necessary.

Couple of further thoughts on this general topic:

* Currently, B-Tree must be able to store at least 3 items on each
page, for the benefit of the L&Y algorithm. You need room for 1
"highkey", plus 2 downlink IndexTuples. Obviously an internal B-Tree
page is redundant if you cannot get to any child page based on the
scanKey value differing one way or the other (so 2 downlinks are a
sensible minimum), plus a highkey is usually needed (just not on the
rightmost page). As you probably know, we enforce this by making sure
every IndexTuple is no more than 1/3 of the size that will fit.

You should start thinking about how to deal with this in a world where
the physical size could actually be quite variable. The solution is
probably to simply pretend that every IndexTuple is its original size.
This applies to both prefix compression and duplicate suppression, I
suppose.

* Since everything is aligned within B-Tree, it's probably worth
considering the alignment boundaries when doing prefix compression, if
you want to go that way. We can probably imagine a world where
alignment is not required for B-Tree, which would work on x86
machines, but I can't see it happening soon. It isn't worth
compressing unless it compresses enough to cross an "alignment
boundary", where we're not actually obliged to store as much data on
disk. This point may be obvious, not sure.

[1]: /messages/by-id/CAM3SWZT-T9o_dchK8E4_YbKQ+LPJTpd89E6dtPwhXnBV_5NE3Q@mail.gmail.com

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Anastasia Lubennikova

a.lubennikova@postgrespro.ru

over 10 years ago

In reply to: Peter Geoghegan (#5)

Re: [PROPOSAL] Effective storage of duplicates in B-tree index.

01.09.2015 21:23, Peter Geoghegan:

On Mon, Aug 31, 2015 at 12:41 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

Now new B-tree index tuple must be inserted for each table row that we
index.
It can possibly cause page split. Because of MVCC even unique index could
contain duplicates.
Storing duplicates in posting list/tree helps to avoid superfluous splits.

I'm glad someone is thinking about this, because it is certainly
needed. I thought about working on it myself, but there is always
something else to do. I should be able to assist with review, though.

Thank you)

So it seems to be very useful improvement. Of course it requires a lot of
changes in B-tree implementation, so I need approval from community.

1. Compatibility.
It's important to save compatibility with older index versions.
I'm going to change BTREE_VERSION to 3.
And use new (posting) features for v3, saving old implementation for v2.
Any objections?

It might be better to just have a flag bit for pages that are
compressed -- there are IIRC 8 free bits in the B-Tree page special
area flags variable. But no real opinion on this from me, yet. You
have plenty of bitspace to work with to mark B-Tree pages, in any
case.

Hmm.. If we are talking about storing duplicates in posting lists (and
trees) as in GIN, I don't see a way how to apply it to separate pages,
while not applying to others. Look some notes below .

2. There are several tricks to handle non-unique keys in B-tree.
More info in btree readme (chapter - Differences to the Lehman & Yao
algorithm).
In the new version they'll become useless. Am I right?

I think that the L&Y algorithm makes assumptions for the sake of
simplicity, rather than because they really believed that there were
real problems. For example, they say that deletion can occur offline
or something along those lines, even though that's clearly
impractical. They say that because they didn't want to write a paper
about deletion within B-Trees, I suppose.

See also, my opinion of how they claim to not need read locks [1].
Also, note that despite the fact that the GIN README mentions "Lehman
& Yao style right links", it doesn't actually do the L&Y trick of
avoiding lock coupling -- the whole point of L&Y -- so that remark is
misleading. This must be why B-Tree has much better concurrency than
GIN in practice.

Yes, thanks for extensive explanation.
I mean such tricks as moving right in _bt_findinsertloc(), for example.

/*----------
* If we will need to split the page to put the item on this page,
* check whether we can put the tuple somewhere to the right,
* instead. Keep scanning right until we
* (a) find a page with enough free space,
* (b) reach the last page where the tuple can legally go, or
* (c) get tired of searching.
* (c) is not flippant; it is important because if there are many
* pages' worth of equal keys, it's better to split one of the early
* pages than to scan all the way to the end of the run of equal keys
* on every insert. We implement "get tired" as a random choice,
* since stopping after scanning a fixed number of pages wouldn't work
* well (we'd never reach the right-hand side of previously split
* pages). Currently the probability of moving right is set at 0.99,
* which may seem too high to change the behavior much, but it does an
* excellent job of preventing O(N^2) behavior with many equal keys.
*----------
*/

If there is no multiple tuples with the same key, we shouldn't care
about it at all. It would be possible to skip these steps in "effective
B-tree implementation". That's why I want to change btree_version.

So I'm really talking about a slightly
different thing -- prefix compression, rather than handling
duplicates. Whether or not you should do prefix compression instead of
deduplication is certainly not clear to me, but it should be
considered. Also, I always imagined that prefix compression would use
the highkey as the thing that is offset for each "real" IndexTuple,
because it's there anyway, and that's simple. However, I suppose that
that means that duplicate handling can't really work in a way that
makes duplicates have a fixed cost, which may be a particularly
important property to you.

You're right, that is two different techniques.
1. Effective storing of duplicates, which I propose, works with equal
keys. And allow us to delete repeats.
Index tuples are stored like this:

IndexTupleData + Attrs (key) | IndexTupleData + Attrs (key) |
IndexTupleData + Attrs (key)

If all Attrs are equal, it seems reasonable not to repeat them. So we
can store it in following structure:

MetaData + Attrs (key) | IndexTupleData | IndexTupleData | IndexTupleData

It is a posting list. It doesn't require significant changes in index
page layout, because we can use ordinary IndexTupleData for meta
information. Each IndexTupleData has fixed size, so it's easy to handle
posting list as an array.

2. Prefix compression handles different keys and somehow compresses them.
I think that it will require non-trivial changes in btree index tuples
representation. Furthermore, any compression leads to extra
computations. Now, I don't have clear idea about how to implement this
technique.

* Currently, B-Tree must be able to store at least 3 items on each
page, for the benefit of the L&Y algorithm. You need room for 1
"highkey", plus 2 downlink IndexTuples. Obviously an internal B-Tree
page is redundant if you cannot get to any child page based on the
scanKey value differing one way or the other (so 2 downlinks are a
sensible minimum), plus a highkey is usually needed (just not on the
rightmost page). As you probably know, we enforce this by making sure
every IndexTuple is no more than 1/3 of the size that will fit.

That is the point where too big posting list transforms to a posting
tree. But I think, that in the first patch, I'll do it another way. Just
by splitting long posting list into 2 lists of appropriate length.

* Since everything is aligned within B-Tree, it's probably worth
considering the alignment boundaries when doing prefix compression, if
you want to go that way. We can probably imagine a world where
alignment is not required for B-Tree, which would work on x86
machines, but I can't see it happening soon. It isn't worth
compressing unless it compresses enough to cross an "alignment
boundary", where we're not actually obliged to store as much data on
disk. This point may be obvious, not sure.

That is another reason, why I doubt prefix compression, whereas
effective duplicate storage hasn't this problem.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Peter Geoghegan

pg@heroku.com

over 10 years ago

In reply to: Anastasia Lubennikova (#6)

Re: [PROPOSAL] Effective storage of duplicates in B-tree index.

On Thu, Sep 3, 2015 at 8:35 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

* Since everything is aligned within B-Tree, it's probably worth
considering the alignment boundaries when doing prefix compression, if
you want to go that way. We can probably imagine a world where
alignment is not required for B-Tree, which would work on x86
machines, but I can't see it happening soon. It isn't worth
compressing unless it compresses enough to cross an "alignment
boundary", where we're not actually obliged to store as much data on
disk. This point may be obvious, not sure.

That is another reason, why I doubt prefix compression, whereas effective
duplicate storage hasn't this problem.

Okay. That sounds reasonable. I think duplicate handling is a good project.

A good learning tool for Postgres B-Trees -- or at least one of the
better ones -- is my amcheck tool. See:

https://github.com/petergeoghegan/postgres/tree/amcheck

This is a tool for verifying B-Tree invariants hold, which is loosely
based on pageinspect. It checks that certain conditions hold for
B-Trees. A simple example is that all items on each page be in the
correct, logical order. Some invariants checked are far more
complicated, though, and span multiple pages or multiple levels. See
the source code for exact details. This tool works well when running
the regression tests (see stress.sql -- I used it with pgbench), with
no problems reported last I checked. It often only needs light locks
on relations, and single shared locks on buffers. (Buffers are copied
to local memory for the tool to operate on, much like
contrib/pageinspect).

While I have yet to formally submit amcheck to a CF (I once asked for
input on the goals for the project on -hackers), the comments are
fairly comprehensive, and it wouldn't be too hard to adopt this to
guide your work on duplicate handling. Maybe it'll happen for 9.6.
Feedback appreciated.

The tool calls _bt_compare() for many things currently, but doesn't
care about many lower level details, which is (very roughly speaking)
the level that duplicate handling will work at. You aren't actually
proposing to change anything about the fundamental structure that
B-Tree indexes have, so the tool could be quite useful and low-effort
for debugging your code during development.

Debugging this stuff is sometimes like keyhole surgery. If you could
just see at/get to the structure that you care about, it would be 10
times easier. Hopefully this tool makes it easier to identify problems.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Peter Geoghegan

pg@heroku.com

over 10 years ago

In reply to: Peter Geoghegan (#7)

Re: [PROPOSAL] Effective storage of duplicates in B-tree index.

On Sun, Sep 27, 2015 at 4:11 PM, Peter Geoghegan <pg@heroku.com> wrote:

Debugging this stuff is sometimes like keyhole surgery. If you could
just see at/get to the structure that you care about, it would be 10
times easier. Hopefully this tool makes it easier to identify problems.

I should add that the way that the L&Y technique works, and the way
that Postgres code is generally very robust/defensive can make direct
testing a difficult thing. I have seen cases where a completely messed
up B-Tree still gave correct results most of the time, and was just
slower. That can happen, for example, because the "move right" thing
results in a degenerate linear scan of the entire index. The
comparisons in the internal pages were totally messed up, but it
"didn't matter" once a scan could get to leaf pages and could move
right and find the value that way.

I wrote amcheck because I thought it was scary how B-Tree indexes
could be *completely* messed up without it being obvious; what hope is
there of a test finding a subtle problem in their structure, then?
Testing the invariants directly seemed like the only way to have a
chance of not introducing bugs when adding new stuff to the B-Tree
code. I believe that adding optimizations to the B-Tree code will be
important in the next couple of years, and there is no other way to
approach it IMV.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Anastasia Lubennikova

a.lubennikova@postgrespro.ru

almost 10 years ago

In reply to: Anastasia Lubennikova (#1)

2 attachment(s)

Re: [WIP] Effective storage of duplicates in B-tree index.

31.08.2015 10:41, Anastasia Lubennikova:

Hi, hackers!
I'm going to begin work on effective storage of duplicate keys in
B-tree index.
The main idea is to implement posting lists and posting trees for
B-tree index pages as it's already done for GIN.

In a nutshell, effective storing of duplicates in GIN is organised as
follows.
Index stores single index tuple for each unique key. That index tuple
points to posting list which contains pointers to heap tuples (TIDs).
If too many rows having the same key, multiple pages are allocated for
the TIDs and these constitute so called posting tree.
You can find wonderful detailed descriptions in gin readme
<https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README>
and articles <http://www.cybertec.at/gin-just-an-index-type/>.
It also makes possible to apply compression algorithm to posting
list/tree and significantly decrease index size. Read more in
presentation (part 1)
<http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf>.

Now new B-tree index tuple must be inserted for each table row that we
index.
It can possibly cause page split. Because of MVCC even unique index
could contain duplicates.
Storing duplicates in posting list/tree helps to avoid superfluous splits.

I'd like to share the progress of my work. So here is a WIP patch.
It provides effective duplicate handling using posting lists the same
way as GIN does it.

Layout of the tuples on the page is changed in the following way:
before:
TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key, TID
(ip_blkid, ip_posid) + key
with patch:
TID (N item pointers, posting list offset) + key, TID (ip_blkid,
ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)

It seems that backward compatibility works well without any changes. But
I haven't tested it properly yet.

Here are some test results. They are obtained by test functions
test_btbuild and test_ginbuild, which you can find in attached sql file.
i - number of distinct values in the index. So i=1 means that all rows
have the same key, and i=10000000 means that all keys are different.
The other columns contain the index size (MB).

i B-tree Old B-tree New GIN
1 214,234375 87,7109375 10,2109375
10 214,234375 87,7109375 10,71875
100 214,234375 87,4375 15,640625
1000 214,234375 86,2578125 31,296875
10000 214,234375 78,421875 104,3046875
100000 214,234375 65,359375 49,078125
1000000 214,234375 90,140625 106,8203125
10000000 214,234375 214,234375 534,0625

You can note that the last row contains the same index sizes for B-tree,
which is quite logical - there is no compression if all the keys are
distinct.
Other cases looks really nice to me.
Next thing to say is that I haven't implemented posting list compression
yet. So there is still potential to decrease size of compressed btree.

I'm almost sure, there are still some tiny bugs and missed functions,
but on the whole, the patch is ready for testing.
I'd like to get a feedback about the patch testing on some real
datasets. Any bug reports and suggestions are welcome.

Here is a couple of useful queries to inspect the data inside the index
pages:
create extension pageinspect;
select * from bt_metap('idx');
select bt.* from generate_series(1,1) as n, lateral bt_page_stats('idx',
n) as bt;
select n, bt.* from generate_series(1,1) as n, lateral
bt_page_items('idx', n) as bt;

And at last, the list of items I'm going to complete in the near future:
1. Add storage_parameter 'enable_compression' for btree access method
which specifies whether the index handles duplicates. default is 'off'
2. Bring back microvacuum functionality for compressed indexes.
3. Improve insertion speed. Insertions became significantly slower with
compressed btree, which is obviously not what we do want.
4. Clean the code and comments, add related documentation.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

btree_compression_1.0.patchtext/x-patch; name=btree_compression_1.0.patchDownload

diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 77c2fdf..3b61e8f 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -24,6 +24,7 @@
 #include "storage/predicate.h"
 #include "utils/tqual.h"
 
+#include "catalog/catalog.h"
 
 typedef struct
 {
@@ -60,7 +61,8 @@ static void _bt_findinsertloc(Relation rel,
 				  ScanKey scankey,
 				  IndexTuple newtup,
 				  BTStack stack,
-				  Relation heapRel);
+				  Relation heapRel,
+				  bool *updposing);
 static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
 			   BTStack stack,
 			   IndexTuple itup,
@@ -113,6 +115,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	BTStack		stack;
 	Buffer		buf;
 	OffsetNumber offset;
+	bool updposting = false;
 
 	/* we need an insertion scan key to do our search, so build one */
 	itup_scankey = _bt_mkscankey(rel, itup);
@@ -162,8 +165,9 @@ top:
 	{
 		TransactionId xwait;
 		uint32		speculativeToken;
+		bool fakeupdposting = false; /* Never update posting in unique index */
 
-		offset = _bt_binsrch(rel, buf, natts, itup_scankey, false);
+		offset = _bt_binsrch(rel, buf, natts, itup_scankey, false, &fakeupdposting);
 		xwait = _bt_check_unique(rel, itup, heapRel, buf, offset, itup_scankey,
 								 checkUnique, &is_unique, &speculativeToken);
 
@@ -200,8 +204,54 @@ top:
 		CheckForSerializableConflictIn(rel, NULL, buf);
 		/* do the insertion */
 		_bt_findinsertloc(rel, &buf, &offset, natts, itup_scankey, itup,
-						  stack, heapRel);
-		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+						  stack, heapRel, &updposting);
+
+		if (IsSystemRelation(rel))
+			updposting = false;
+
+		/*
+		 * New tuple has the same key with tuple at the page.
+		 * Unite them into one posting.
+		 */
+		if (updposting)
+		{
+			Page		page;
+			IndexTuple olditup, newitup;
+			ItemPointerData *ipd;
+			int nipd;
+
+			page = BufferGetPage(buf);
+			olditup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offset));
+
+			if (BtreeTupleIsPosting(olditup))
+				nipd = BtreeGetNPosting(olditup);
+			else
+				nipd = 1;
+
+			ipd = palloc0(sizeof(ItemPointerData)*(nipd + 1));
+			/* copy item pointers from old tuple into ipd */
+			if (BtreeTupleIsPosting(olditup))
+				memcpy(ipd, BtreeGetPosting(olditup), sizeof(ItemPointerData)*nipd);
+			else
+				memcpy(ipd, olditup, sizeof(ItemPointerData));
+
+			/* add item pointer of the new tuple into ipd */
+			memcpy(ipd+nipd, itup, sizeof(ItemPointerData));
+
+			/*
+			 * Form posting tuple, then delete old tuple and insert posting tuple.
+			 */
+			newitup = BtreeReformPackedTuple(itup, ipd, nipd+1);
+			PageIndexTupleDelete(page, offset);
+			_bt_insertonpg(rel, buf, InvalidBuffer, stack, newitup, offset, false);
+
+			pfree(ipd);
+			pfree(newitup);
+		}
+		else
+		{
+			_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+		}
 	}
 	else
 	{
@@ -306,6 +356,8 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 
 				/* okay, we gotta fetch the heap tuple ... */
 				curitup = (IndexTuple) PageGetItem(page, curitemid);
+
+				Assert (!BtreeTupleIsPosting(curitup));
 				htid = curitup->t_tid;
 
 				/*
@@ -535,7 +587,8 @@ _bt_findinsertloc(Relation rel,
 				  ScanKey scankey,
 				  IndexTuple newtup,
 				  BTStack stack,
-				  Relation heapRel)
+				  Relation heapRel,
+				  bool *updposting)
 {
 	Buffer		buf = *bufptr;
 	Page		page = BufferGetPage(buf);
@@ -681,7 +734,7 @@ _bt_findinsertloc(Relation rel,
 	else if (firstlegaloff != InvalidOffsetNumber && !vacuumed)
 		newitemoff = firstlegaloff;
 	else
-		newitemoff = _bt_binsrch(rel, buf, keysz, scankey, false);
+		newitemoff = _bt_binsrch(rel, buf, keysz, scankey, false, updposting);
 
 	*bufptr = buf;
 	*offsetptr = newitemoff;
@@ -1042,6 +1095,9 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+
+		Assert(!BtreeTupleIsPosting(item));
+
 		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
 						false, false) == InvalidOffsetNumber)
 		{
@@ -1072,13 +1128,40 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 	}
-	if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
+
+	if (BtreeTupleIsPosting(item))
+	{
+		Size hikeysize =  BtreeGetPostingOffset(item);
+		IndexTuple hikey = palloc0(hikeysize);
+		/*
+		 * Truncate posting before insert it as a hikey.
+		 */
+		memcpy (hikey, item, hikeysize);
+		hikey->t_info &= ~INDEX_SIZE_MASK;
+		hikey->t_info |= hikeysize;
+		ItemPointerSet(&(hikey->t_tid), origpagenumber, P_HIKEY);
+
+		if (PageAddItem(leftpage, (Item) hikey, hikeysize, leftoff,
 					false, false) == InvalidOffsetNumber)
+		{
+			memset(rightpage, 0, BufferGetPageSize(rbuf));
+			elog(ERROR, "failed to add hikey to the left sibling"
+				" while splitting block %u of index \"%s\"",
+				origpagenumber, RelationGetRelationName(rel));
+		}
+
+		pfree(hikey);
+	}
+	else
 	{
-		memset(rightpage, 0, BufferGetPageSize(rbuf));
-		elog(ERROR, "failed to add hikey to the left sibling"
-			 " while splitting block %u of index \"%s\"",
-			 origpagenumber, RelationGetRelationName(rel));
+		if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
+						false, false) == InvalidOffsetNumber)
+		{
+			memset(rightpage, 0, BufferGetPageSize(rbuf));
+			elog(ERROR, "failed to add hikey to the left sibling"
+				" while splitting block %u of index \"%s\"",
+				origpagenumber, RelationGetRelationName(rel));
+		}
 	}
 	leftoff = OffsetNumberNext(leftoff);
 
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index cf4a6dc..1a3c82b 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -74,6 +74,9 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 			 BlockNumber orig_blkno);
 
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, ItemPointerData *items,
+					  int nitem, int *nremaining);
 
 /*
  *	btbuild() -- build a new btree index.
@@ -948,6 +951,7 @@ restart:
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
+		IndexTupleData *remaining;
 
 		/*
 		 * Trade in the initial read lock for a super-exclusive write lock on
@@ -997,31 +1001,62 @@ restart:
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
-
-				/*
-				 * During Hot Standby we currently assume that
-				 * XLOG_BTREE_VACUUM records do not produce conflicts. That is
-				 * only true as long as the callback function depends only
-				 * upon whether the index tuple refers to heap tuples removed
-				 * in the initial heap scan. When vacuum starts it derives a
-				 * value of OldestXmin. Backends taking later snapshots could
-				 * have a RecentGlobalXmin with a later xid than the vacuum's
-				 * OldestXmin, so it is possible that row versions deleted
-				 * after OldestXmin could be marked as killed by other
-				 * backends. The callback function *could* look at the index
-				 * tuple state in isolation and decide to delete the index
-				 * tuple, though currently it does not. If it ever did, we
-				 * would need to reconsider whether XLOG_BTREE_VACUUM records
-				 * should cause conflicts. If they did cause conflicts they
-				 * would be fairly harsh conflicts, since we haven't yet
-				 * worked out a way to pass a useful value for
-				 * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
-				 * applies to *any* type of index that marks index tuples as
-				 * killed.
-				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				if(BtreeTupleIsPosting(itup))
+				{
+					int nipd, nnewipd;
+					ItemPointer newipd;
+
+					nipd = BtreeGetNPosting(itup);
+					newipd = btreevacuumPosting(vstate, BtreeGetPosting(itup), nipd, &nnewipd);
+
+					if (newipd != NULL)
+					{
+						if (nnewipd > 0)
+						{
+							/* There are still some live tuples in the posting.
+							 * 1) form new posting tuple, that contains remaining ipds
+							 * 2) delete "old" posting
+							 * 3) insert new posting back to the page
+							 */
+							remaining = BtreeReformPackedTuple(itup, newipd, nnewipd);
+							PageIndexTupleDelete(page, offnum);
+
+							if (PageAddItem(page, (Item) remaining, IndexTupleSize(remaining), offnum, false, false) != offnum)
+								elog(ERROR, "failed to add vacuumed posting tuple to index page in \"%s\"",
+										RelationGetRelationName(info->index));
+						}
+						else
+							deletable[ndeletable++] = offnum;
+					}
+				}
+				else
+				{
+					htup = &(itup->t_tid);
+
+					/*
+					* During Hot Standby we currently assume that
+					* XLOG_BTREE_VACUUM records do not produce conflicts. That is
+					* only true as long as the callback function depends only
+					* upon whether the index tuple refers to heap tuples removed
+					* in the initial heap scan. When vacuum starts it derives a
+					* value of OldestXmin. Backends taking later snapshots could
+					* have a RecentGlobalXmin with a later xid than the vacuum's
+					* OldestXmin, so it is possible that row versions deleted
+					* after OldestXmin could be marked as killed by other
+					* backends. The callback function *could* look at the index
+					* tuple state in isolation and decide to delete the index
+					* tuple, though currently it does not. If it ever did, we
+					* would need to reconsider whether XLOG_BTREE_VACUUM records
+					* should cause conflicts. If they did cause conflicts they
+					* would be fairly harsh conflicts, since we haven't yet
+					* worked out a way to pass a useful value for
+					* latestRemovedXid on the XLOG_BTREE_VACUUM records. This
+					* applies to *any* type of index that marks index tuples as
+					* killed.
+					*/
+					if (callback(htup, callback_state))
+						deletable[ndeletable++] = offnum;
+				}
 			}
 		}
 
@@ -1132,3 +1167,51 @@ btcanreturn(PG_FUNCTION_ARGS)
 {
 	PG_RETURN_BOOL(true);
 }
+
+
+/*
+ * Vacuums a posting list. The size of the list must be specified
+ * via number of items (nitems).
+ *
+ * If none of the items need to be removed, returns NULL. Otherwise returns
+ * a new palloc'd array with the remaining items. The number of remaining
+ * items is returned via nremaining.
+ */
+ItemPointer
+btreevacuumPosting(BTVacState *vstate, ItemPointerData *items,
+					  int nitem, int *nremaining)
+{
+	int			i,
+				remaining = 0;
+	ItemPointer tmpitems = NULL;
+	IndexBulkDeleteCallback callback = vstate->callback;
+	void	   *callback_state = vstate->callback_state;
+
+	/*
+	 * Iterate over TIDs array
+	 */
+	for (i = 0; i < nitem; i++)
+	{
+		if (callback(items + i, callback_state))
+		{
+			if (!tmpitems)
+			{
+				/*
+				 * First TID to be deleted: allocate memory to hold the
+				 * remaining items.
+				 */
+				tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+				memcpy(tmpitems, items, sizeof(ItemPointerData) * i);
+			}
+		}
+		else
+		{
+			if (tmpitems)
+				tmpitems[remaining] = items[i];
+			remaining++;
+		}
+	}
+
+	*nremaining = remaining;
+	return tmpitems;
+}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index d69a057..ef220b2 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -29,6 +29,8 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 			 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 			 OffsetNumber offnum, IndexTuple itup);
+static void _bt_savePostingitem(BTScanOpaque so, int itemIndex,
+			 OffsetNumber offnum, ItemPointer iptr, IndexTuple itup, int i);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static Buffer _bt_walk_left(Relation rel, Buffer buf);
 static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
@@ -90,6 +92,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		   Buffer *bufP, int access)
 {
 	BTStack		stack_in = NULL;
+	bool fakeupdposting = false; /* fake variable for _bt_binsrch */
 
 	/* Get the root page to start with */
 	*bufP = _bt_getroot(rel, access);
@@ -136,7 +139,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * Find the appropriate item on the internal page, and get the child
 		 * page that it points to.
 		 */
-		offnum = _bt_binsrch(rel, *bufP, keysz, scankey, nextkey);
+		offnum = _bt_binsrch(rel, *bufP, keysz, scankey, nextkey, &fakeupdposting);
 		itemid = PageGetItemId(page, offnum);
 		itup = (IndexTuple) PageGetItem(page, itemid);
 		blkno = ItemPointerGetBlockNumber(&(itup->t_tid));
@@ -310,7 +313,8 @@ _bt_binsrch(Relation rel,
 			Buffer buf,
 			int keysz,
 			ScanKey scankey,
-			bool nextkey)
+			bool nextkey,
+			bool *updposing)
 {
 	Page		page;
 	BTPageOpaque opaque;
@@ -373,7 +377,17 @@ _bt_binsrch(Relation rel,
 	 * scan key), which could be the last slot + 1.
 	 */
 	if (P_ISLEAF(opaque))
+	{
+		if (low <= PageGetMaxOffsetNumber(page))
+		{
+			IndexTuple oitup = (IndexTuple) PageGetItem(page, PageGetItemId(page, low));
+			/* one excessive check of equality. for possible posting tuple update or creation */
+			if ((_bt_compare(rel, keysz, scankey, page, low) == 0)
+				&& (IndexTupleSize(oitup) + sizeof(ItemPointerData) < BTMaxItemSize(page)))
+				*updposing = true;
+		}
 		return low;
+	}
 
 	/*
 	 * On a non-leaf page, return the last key < scan key (resp. <= scan key).
@@ -536,6 +550,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	int			i;
 	StrategyNumber strat_total;
 	BTScanPosItem *currItem;
+	bool fakeupdposing = false; /* fake variable for _bt_binsrch */
 
 	Assert(!BTScanPosIsValid(so->currPos));
 
@@ -1003,7 +1018,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	so->markItemIndex = -1;		/* ditto */
 
 	/* position to the precise item on the page */
-	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
+	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey, &fakeupdposing);
 
 	/*
 	 * If nextkey = false, we are positioned at the first item >= scan key, or
@@ -1161,6 +1176,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	int			itemIndex;
 	IndexTuple	itup;
 	bool		continuescan;
+	int i;
 
 	/*
 	 * We must have the buffer pinned and locked, but the usual macro can't be
@@ -1195,6 +1211,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	/* initialize tuple workspace to empty */
 	so->currPos.nextTupleOffset = 0;
+	so->currPos.prevTupleOffset = 0;
 
 	/*
 	 * Now that the current page has been made consistent, the macro should be
@@ -1215,8 +1232,19 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (itup != NULL)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				if (BtreeTupleIsPosting(itup))
+				{
+					for (i = 0; i < BtreeGetNPosting(itup); i++)
+					{
+						_bt_savePostingitem(so, itemIndex, offnum, BtreeGetPostingN(itup, i), itup, i);
+						itemIndex++;
+					}
+				}
+				else
+				{
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
 			}
 			if (!continuescan)
 			{
@@ -1228,7 +1256,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			offnum = OffsetNumberNext(offnum);
 		}
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxPackedIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1236,7 +1264,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxPackedIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1246,8 +1274,20 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (itup != NULL)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				if (BtreeTupleIsPosting(itup))
+				{
+					for (i = 0; i < BtreeGetNPosting(itup); i++)
+					{
+						itemIndex--;
+						_bt_savePostingitem(so, itemIndex, offnum, BtreeGetPostingN(itup, i), itup, i);
+					}
+				}
+				else
+				{
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+
 			}
 			if (!continuescan)
 			{
@@ -1261,8 +1301,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxPackedIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxPackedIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1275,6 +1315,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert (!BtreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1288,6 +1330,37 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 }
 
 /*
+ * Save an index item into so->currPos.items[itemIndex]
+ * Performing index-only scan, handle the first elem separately.
+ * Save the key once, and connect it with posting tids using tupleOffset.
+ */
+static void
+_bt_savePostingitem(BTScanOpaque so, int itemIndex,
+			 OffsetNumber offnum, ItemPointer iptr, IndexTuple itup, int i)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *iptr;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		if (i == 0)
+		{
+			/* save key. the same for all tuples in the posting */
+			Size		itupsz = BtreeGetPostingOffset(itup);
+			currItem->tupleOffset = so->currPos.nextTupleOffset;
+			memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+			so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+			so->currPos.prevTupleOffset = currItem->tupleOffset;
+		}
+		else
+			currItem->tupleOffset = so->currPos.prevTupleOffset;
+	}
+}
+
+
+/*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
  * On entry, if so->currPos.buf is valid the buffer is pinned but not locked;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f95f67a..79a737f 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -75,6 +75,7 @@
 #include "utils/rel.h"
 #include "utils/sortsupport.h"
 #include "utils/tuplesort.h"
+#include "catalog/catalog.h"
 
 
 /*
@@ -527,15 +528,120 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(last_off > P_FIRSTKEY);
 		ii = PageGetItemId(opage, last_off);
 		oitup = (IndexTuple) PageGetItem(opage, ii);
-		_bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
 
 		/*
-		 * Move 'last' into the high key position on opage
+		 * If the item is PostingTuple, we can cut it.
+		 * Because HIKEY is not considered as real data, and it needn't to keep any ItemPointerData at all.
+		 * And of course it needn't to keep a list of ipd.
+		 * But, if it had a big posting list, there will be plenty of free space on the opage.
+		 * So we must split Posting tuple into 2 pieces.
 		 */
-		hii = PageGetItemId(opage, P_HIKEY);
-		*hii = *ii;
-		ItemIdSetUnused(ii);	/* redundant */
-		((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+		 if (BtreeTupleIsPosting(oitup))
+		 {
+			int nipd, ntocut, ntoleave;
+			Size keytupsz;
+			IndexTuple keytup;
+			nipd = BtreeGetNPosting(oitup);
+			ntocut = (sizeof(ItemIdData) + BtreeGetPostingOffset(oitup))/sizeof(ItemPointerData);
+			ntocut++; /* round up to be sure that we cut enough */
+			ntoleave = nipd - ntocut;
+
+			/*
+			 * 0) Form key tuple, that doesn't contain any ipd.
+			 * NOTE: key tuple will have blkno & offset suitable for P_HIKEY.
+			 * any function that uses keytup should handle them itself.
+			 */
+			keytupsz =  BtreeGetPostingOffset(oitup);
+			keytup = palloc0(keytupsz);
+			memcpy (keytup, oitup, keytupsz);
+			keytup->t_info &= ~INDEX_SIZE_MASK;
+			keytup->t_info |= keytupsz;
+			ItemPointerSet(&(keytup->t_tid), oblkno, P_HIKEY);
+
+			if (ntocut < nipd)
+			{
+				ItemPointerData *newipd;
+				IndexTuple newitup, newlasttup;
+				/*
+				 * 1) Cut part of old tuple to shift to npage.
+				 * And insert it as P_FIRSTKEY.
+				 * This tuple is based on keytup.
+				 * Blkno & offnum are reset in BtreeFormPackedTuple.
+				 */
+				newipd = palloc0(sizeof(ItemPointerData)*ntocut);
+				/* Note, that we cut last 'ntocut' items */
+				memcpy(newipd, BtreeGetPosting(oitup)+ntoleave, sizeof(ItemPointerData)*ntocut);
+				newitup = BtreeFormPackedTuple(keytup, newipd, ntocut);
+
+				_bt_sortaddtup(npage, IndexTupleSize(newitup), newitup, P_FIRSTKEY);
+				pfree(newipd);
+				pfree(newitup);
+
+				/*
+				 * 2) set last item to the P_HIKEY linp
+				 * Move 'last' into the high key position on opage
+				 * NOTE: Do this because of indextuple deletion algorithm, which
+				 * doesn't allow to delete an item while we have unused one before it.
+				 */
+				hii = PageGetItemId(opage, P_HIKEY);
+				*hii = *ii;
+				ItemIdSetUnused(ii);	/* redundant */
+				((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+
+				/* 3) delete "wrong" high key */
+				PageIndexTupleDelete(opage, P_HIKEY);
+
+				/* 4)Insert keytup as P_HIKEY. */
+				_bt_sortaddtup(opage, IndexTupleSize(keytup), keytup,  P_HIKEY);
+
+				/* 5) form the part of old tuple with ntoleave ipds. And insert it as last tuple. */
+				newlasttup = BtreeFormPackedTuple(keytup, BtreeGetPosting(oitup), ntoleave);
+
+				_bt_sortaddtup(opage, IndexTupleSize(newlasttup), newlasttup, PageGetMaxOffsetNumber(opage)+1);
+
+				pfree(newlasttup);
+			}
+			else
+			{
+				/* The tuple isn't big enough to split it. Handle it as a normal tuple. */
+
+				/*
+				 * 1) Shift the last tuple to npage.
+				 * Insert it as P_FIRSTKEY.
+				 */
+				_bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
+
+				/* 2) set last item to the P_HIKEY linp */
+				/* Move 'last' into the high key position on opage */
+				hii = PageGetItemId(opage, P_HIKEY);
+				*hii = *ii;
+				ItemIdSetUnused(ii);	/* redundant */
+				((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+
+				/* 3) delete "wrong" high key */
+				PageIndexTupleDelete(opage, P_HIKEY);
+
+				/* 4)Insert keytup as P_HIKEY. */
+				_bt_sortaddtup(opage, IndexTupleSize(keytup), keytup,  P_HIKEY);
+
+			}
+			pfree(keytup);
+		 }
+		 else
+		 {
+			/*
+			 * 1) Shift the last tuple to npage.
+			 * Insert it as P_FIRSTKEY.
+			 */
+			_bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
+
+			/* 2) set last item to the P_HIKEY linp */
+			/* Move 'last' into the high key position on opage */
+			hii = PageGetItemId(opage, P_HIKEY);
+			*hii = *ii;
+			ItemIdSetUnused(ii);	/* redundant */
+			((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+		}
 
 		/*
 		 * Link the old page into its parent, using its minimum key. If we
@@ -547,6 +653,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 
 		Assert(state->btps_minkey != NULL);
 		ItemPointerSet(&(state->btps_minkey->t_tid), oblkno, P_HIKEY);
+
 		_bt_buildadd(wstate, state->btps_next, state->btps_minkey);
 		pfree(state->btps_minkey);
 
@@ -555,7 +662,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		 * it off the old page, not the new one, in case we are not at leaf
 		 * level.
 		 */
-		state->btps_minkey = CopyIndexTuple(oitup);
+		ItemId iihk = PageGetItemId(opage, P_HIKEY);
+		IndexTuple hikey = (IndexTuple) PageGetItem(opage, iihk);
+		state->btps_minkey = CopyIndexTuple(hikey);
 
 		/*
 		 * Set the sibling links for both pages.
@@ -590,7 +699,29 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	if (last_off == P_HIKEY)
 	{
 		Assert(state->btps_minkey == NULL);
-		state->btps_minkey = CopyIndexTuple(itup);
+
+		if (BtreeTupleIsPosting(itup))
+		{
+			Size keytupsz;
+			IndexTuple keytup;
+
+			/*
+			 * 0) Form key tuple, that doesn't contain any ipd.
+			 * NOTE: key tuple will have blkno & offset suitable for P_HIKEY.
+			 * any function that uses keytup should handle them itself.
+			 */
+			keytupsz =  BtreeGetPostingOffset(itup);
+			keytup = palloc0(keytupsz);
+			memcpy (keytup, itup, keytupsz);
+
+			keytup->t_info &= ~INDEX_SIZE_MASK;
+			keytup->t_info |= keytupsz;
+			ItemPointerSet(&(keytup->t_tid), nblkno, P_HIKEY);
+
+			state->btps_minkey = CopyIndexTuple(keytup);
+		}
+		else
+			state->btps_minkey = CopyIndexTuple(itup);
 	}
 
 	/*
@@ -670,6 +801,67 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 }
 
 /*
+ * Prepare SortSupport structure for indextuples comparison
+ */
+SortSupport
+_bt_prepare_SortSupport(BTWriteState *wstate, int keysz)
+{
+	/* Prepare SortSupport data for each column */
+	ScanKey		indexScanKey = _bt_mkscankey_nodata(wstate->index);
+	SortSupport sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
+	int i;
+
+	for (i = 0; i < keysz; i++)
+	{
+		SortSupport sortKey = sortKeys + i;
+		ScanKey		scanKey = indexScanKey + i;
+		int16		strategy;
+
+		sortKey->ssup_cxt = CurrentMemoryContext;
+		sortKey->ssup_collation = scanKey->sk_collation;
+		sortKey->ssup_nulls_first =
+			(scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
+		sortKey->ssup_attno = scanKey->sk_attno;
+		/* Abbreviation is not supported here */
+		sortKey->abbreviate = false;
+
+		AssertState(sortKey->ssup_attno != 0);
+
+		strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
+			BTGreaterStrategyNumber : BTLessStrategyNumber;
+
+		PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
+	}
+
+	_bt_freeskey(indexScanKey);
+	return sortKeys;
+}
+
+/*
+ * Compare two tuples using sortKey i
+ */
+int _bt_call_comparator(SortSupport sortKeys, int i,
+						 IndexTuple itup, IndexTuple itup2, TupleDesc tupdes)
+{
+		SortSupport entry;
+		Datum		attrDatum1,
+					attrDatum2;
+		bool		isNull1,
+					isNull2;
+		int32		compare;
+
+		entry = sortKeys + i - 1;
+		attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
+		attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+
+		compare = ApplySortComparator(attrDatum1, isNull1,
+										attrDatum2, isNull2,
+										entry);
+
+		return compare;
+}
+
+/*
  * Read tuples in correct sort order from tuplesort, and load them into
  * btree leaves.
  */
@@ -679,16 +871,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	BTPageState *state = NULL;
 	bool		merge = (btspool2 != NULL);
 	IndexTuple	itup,
-				itup2 = NULL;
+				itup2 = NULL,
+				itupprev = NULL;
 	bool		should_free,
 				should_free2,
 				load1;
 	TupleDesc	tupdes = RelationGetDescr(wstate->index);
 	int			i,
 				keysz = RelationGetNumberOfAttributes(wstate->index);
-	ScanKey		indexScanKey = NULL;
+	int			ntuples = 0;
 	SortSupport sortKeys;
 
+	/* Prepare SortSupport data */
+	sortKeys = (SortSupport)_bt_prepare_SortSupport(wstate, keysz);
+
 	if (merge)
 	{
 		/*
@@ -701,34 +897,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 									   true, &should_free);
 		itup2 = tuplesort_getindextuple(btspool2->sortstate,
 										true, &should_free2);
-		indexScanKey = _bt_mkscankey_nodata(wstate->index);
-
-		/* Prepare SortSupport data for each column */
-		sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
-
-		for (i = 0; i < keysz; i++)
-		{
-			SortSupport sortKey = sortKeys + i;
-			ScanKey		scanKey = indexScanKey + i;
-			int16		strategy;
-
-			sortKey->ssup_cxt = CurrentMemoryContext;
-			sortKey->ssup_collation = scanKey->sk_collation;
-			sortKey->ssup_nulls_first =
-				(scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
-			sortKey->ssup_attno = scanKey->sk_attno;
-			/* Abbreviation is not supported here */
-			sortKey->abbreviate = false;
-
-			AssertState(sortKey->ssup_attno != 0);
-
-			strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
-				BTGreaterStrategyNumber : BTLessStrategyNumber;
-
-			PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
-		}
-
-		_bt_freeskey(indexScanKey);
 
 		for (;;)
 		{
@@ -742,20 +910,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			{
 				for (i = 1; i <= keysz; i++)
 				{
-					SortSupport entry;
-					Datum		attrDatum1,
-								attrDatum2;
-					bool		isNull1,
-								isNull2;
-					int32		compare;
-
-					entry = sortKeys + i - 1;
-					attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
-					attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
-
-					compare = ApplySortComparator(attrDatum1, isNull1,
-												  attrDatum2, isNull2,
-												  entry);
+					int32 compare = _bt_call_comparator(sortKeys, i, itup, itup2, tupdes);
+
 					if (compare > 0)
 					{
 						load1 = false;
@@ -794,19 +950,137 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	else
 	{
 		/* merge is unnecessary */
-		while ((itup = tuplesort_getindextuple(btspool->sortstate,
+
+		Relation indexRelation = wstate->index;
+		Form_pg_index index = indexRelation->rd_index;
+
+		if (index->indisunique)
+		{
+			/* Do not use compression for unique indexes. */
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
 											   true, &should_free)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+					state = _bt_pagestate(wstate, 0);
+
+				_bt_buildadd(wstate, state, itup);
+				if (should_free)
+					pfree(itup);
+			}
+		}
+		else
 		{
-			/* When we see first tuple, create first index page */
-			if (state == NULL)
-				state = _bt_pagestate(wstate, 0);
+			ItemPointerData *ipd = NULL;
+			IndexTuple 		postingtuple;
+			Size			maxitemsize = 0,
+							maxpostingsize = 0;
+			int32 			compare = 0;
 
-			_bt_buildadd(wstate, state, itup);
-			if (should_free)
-				pfree(itup);
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+											   true, &should_free)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+				{
+					state = _bt_pagestate(wstate, 0);
+					maxitemsize = BTMaxItemSize(state->btps_page);
+				}
+
+				/*
+				 * Compare current tuple with previous one.
+				 * If tuples are equal, we can unite them into a posting list.
+				 */
+				if (itupprev != NULL)
+				{
+					/* compare tuples */
+					compare = 0;
+					for (i = 1; i <= keysz; i++)
+					{
+						compare = _bt_call_comparator(sortKeys, i, itup, itupprev, tupdes);
+						if (compare != 0)
+							break;
+					}
+
+					if (compare == 0)
+					{
+						/* Tuples are equal. Create or update posting */
+						if (ntuples == 0)
+						{
+							/*
+							 * We haven't suitable posting list yet, so allocate
+							 * it and save both itupprev and current tuple.
+							 */
+
+							ipd = palloc0(maxitemsize);
+
+							memcpy(ipd, itupprev, sizeof(ItemPointerData));
+							ntuples++;
+							memcpy(ipd + ntuples, itup, sizeof(ItemPointerData));
+							ntuples++;
+						}
+						else
+						{
+							if ((ntuples+1)*sizeof(ItemPointerData) < maxpostingsize)
+							{
+								memcpy(ipd + ntuples, itup, sizeof(ItemPointerData));
+								ntuples++;
+							}
+							else
+							{
+								postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+								_bt_buildadd(wstate, state, postingtuple);
+								ntuples = 0;
+								pfree(ipd);
+							}
+						}
+
+					}
+					else
+					{
+						/* Tuples aren't equal. Insert itupprev into index. */
+						if (ntuples == 0)
+							_bt_buildadd(wstate, state, itupprev);
+						else
+						{
+							postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+							_bt_buildadd(wstate, state, postingtuple);
+							ntuples = 0;
+							pfree(ipd);
+						}
+					}
+				}
+
+				/*
+				 * Copy the tuple into temp variable itupprev
+				 * to compare it with the following tuple
+				 * and maybe unite them into a posting tuple
+				 */
+				itupprev = CopyIndexTuple(itup);
+				if (should_free)
+					pfree(itup);
+
+				/* compute max size of ipd list */
+				maxpostingsize = maxitemsize - IndexInfoFindDataOffset(itupprev->t_info) - MAXALIGN(IndexTupleSize(itupprev));
+			}
+
+			/* Handle the last item.*/
+			if (ntuples == 0)
+			{
+				if (itupprev != NULL)
+					_bt_buildadd(wstate, state, itupprev);
+			}
+			else
+			{
+				Assert(ipd!=NULL);
+				Assert(itupprev != NULL);
+				postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+				_bt_buildadd(wstate, state, postingtuple);
+				ntuples = 0;
+				pfree(ipd);
+			}
 		}
 	}
-
 	/* Close down final pages and write the metapage */
 	_bt_uppershutdown(wstate, state);
 
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 91331ba..ed3dff7 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1821,7 +1821,9 @@ _bt_killitems(IndexScanDesc scan)
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			/* No microvacuum for posting tuples */
+			if (!BtreeTupleIsPosting(ituple)
+				&& (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
 			{
 				/* found the item */
 				ItemIdMarkDead(iid);
@@ -2070,3 +2072,71 @@ btoptions(PG_FUNCTION_ARGS)
 		PG_RETURN_BYTEA_P(result);
 	PG_RETURN_NULL();
 }
+
+
+/*
+ * Already have basic index tuple that contains key datum
+ */
+IndexTuple
+BtreeFormPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd)
+{
+	int i;
+	uint32		newsize;
+	IndexTuple itup = CopyIndexTuple(tuple);
+
+	/*
+	 * Determine and store offset to the posting list.
+	 */
+	newsize = IndexTupleSize(itup);
+	newsize = SHORTALIGN(newsize);
+
+	/*
+	 * Set meta info about the posting list.
+	 */
+	BtreeSetPostingOffset(itup, newsize);
+	BtreeSetNPosting(itup, nipd);
+	/*
+	 * Add space needed for posting list, if any.  Then check that the tuple
+	 * won't be too big to store.
+	 */
+	newsize += sizeof(ItemPointerData)*nipd;
+	newsize = MAXALIGN(newsize);
+
+	/*
+	 * Resize tuple if needed
+	 */
+	if (newsize != IndexTupleSize(itup))
+	{
+		itup = repalloc(itup, newsize);
+
+		/*
+		 * PostgreSQL 9.3 and earlier did not clear this new space, so we
+		 * might find uninitialized padding when reading tuples from disk.
+		 */
+		memset((char *) itup + IndexTupleSize(itup),
+			   0, newsize - IndexTupleSize(itup));
+		/* set new size in tuple header */
+		itup->t_info &= ~INDEX_SIZE_MASK;
+		itup->t_info |= newsize;
+	}
+
+	/*
+	 * Copy data into the posting tuple
+	 */
+	memcpy(BtreeGetPosting(itup), data, sizeof(ItemPointerData)*nipd);
+	return itup;
+}
+
+IndexTuple
+BtreeReformPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd)
+{
+	int size;
+	if (BtreeTupleIsPosting(tuple))
+	{
+		size = BtreeGetPostingOffset(tuple);
+		tuple->t_info &= ~INDEX_SIZE_MASK;
+		tuple->t_info |= size;
+	}
+
+	return BtreeFormPackedTuple(tuple, data, nipd);
+}
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index c997545..d79d5cd 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -137,7 +137,12 @@ typedef IndexAttributeBitMapData *IndexAttributeBitMap;
 #define MaxIndexTuplesPerPage	\
 	((int) ((BLCKSZ - SizeOfPageHeaderData) / \
 			(MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))))
-
+#define MaxPackedIndexTuplesPerPage	\
+	((int) ((BLCKSZ - SizeOfPageHeaderData) / \
+			(sizeof(ItemPointerData))))
+// #define MaxIndexTuplesPerPage	\
+// 	((int) ((BLCKSZ - SizeOfPageHeaderData) / \
+// 			(sizeof(ItemPointerData))))
 
 /* routines in indextuple.c */
 extern IndexTuple index_form_tuple(TupleDesc tupleDescriptor,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 9e48efd..8cf0edc 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -75,6 +75,7 @@ typedef BTPageOpaqueData *BTPageOpaque;
 #define BTP_SPLIT_END	(1 << 5)	/* rightmost page of split group */
 #define BTP_HAS_GARBAGE (1 << 6)	/* page has LP_DEAD tuples */
 #define BTP_INCOMPLETE_SPLIT (1 << 7)	/* right sibling's downlink is missing */
+#define BTP_HAS_POSTING (1 << 8)		/* page contains compressed duplicates (only for leaf pages) */
 
 /*
  * The max allowed value of a cycle ID is a bit less than 64K.  This is
@@ -181,6 +182,8 @@ typedef struct BTMetaPageData
 #define P_IGNORE(opaque)		((opaque)->btpo_flags & (BTP_DELETED|BTP_HALF_DEAD))
 #define P_HAS_GARBAGE(opaque)	((opaque)->btpo_flags & BTP_HAS_GARBAGE)
 #define P_INCOMPLETE_SPLIT(opaque)	((opaque)->btpo_flags & BTP_INCOMPLETE_SPLIT)
+#define P_HAS_POSTING(opaque)		((opaque)->btpo_flags & BTP_HAS_POSTING)
+
 
 /*
  *	Lehman and Yao's algorithm requires a ``high key'' on every non-rightmost
@@ -536,6 +539,8 @@ typedef struct BTScanPosData
 	 * location in the associated tuple storage workspace.
 	 */
 	int			nextTupleOffset;
+	/* prevTupleOffset is for Posting list handling*/
+	int			prevTupleOffset;
 
 	/*
 	 * The items array is always ordered in index order (ie, increasing
@@ -548,7 +553,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxPackedIndexTuplesPerPage]; /* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -649,6 +654,28 @@ typedef BTScanOpaqueData *BTScanOpaque;
 #define SK_BT_DESC			(INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
 #define SK_BT_NULLS_FIRST	(INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
 
+
+/*
+ * We use our own ItemPointerGet(BlockNumber|OffsetNumber)
+ * to avoid Asserts, since sometimes the ip_posid isn't "valid"
+ */
+#define BtreeItemPointerGetBlockNumber(pointer) \
+	BlockIdGetBlockNumber(&(pointer)->ip_blkid)
+
+#define BtreeItemPointerGetOffsetNumber(pointer) \
+	((pointer)->ip_posid)
+
+#define BT_POSTING (1<<31)
+#define BtreeGetNPosting(itup)			BtreeItemPointerGetOffsetNumber(&(itup)->t_tid)
+#define BtreeSetNPosting(itup,n)		ItemPointerSetOffsetNumber(&(itup)->t_tid,n)
+
+#define BtreeGetPostingOffset(itup)		(BtreeItemPointerGetBlockNumber(&(itup)->t_tid) & (~BT_POSTING))
+#define BtreeSetPostingOffset(itup,n)	ItemPointerSetBlockNumber(&(itup)->t_tid,(n)|BT_POSTING)
+#define BtreeTupleIsPosting(itup)    	(BtreeItemPointerGetBlockNumber(&(itup)->t_tid) & BT_POSTING)
+#define BtreeGetPosting(itup)			(ItemPointerData*) ((char*)(itup) + BtreeGetPostingOffset(itup))
+#define BtreeGetPostingN(itup,n)		(ItemPointerData*) (BtreeGetPosting(itup) + n)
+
+
 /*
  * prototypes for functions in nbtree.c (external entry points for btree)
  */
@@ -705,8 +732,8 @@ extern BTStack _bt_search(Relation rel,
 extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
 			  ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
 			  int access);
-extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
-			ScanKey scankey, bool nextkey);
+extern OffsetNumber _bt_binsrch( Relation rel, Buffer buf, int keysz,
+								ScanKey scankey, bool nextkey, bool* updposting);
 extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
 			Page page, OffsetNumber offnum);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
@@ -736,6 +763,8 @@ extern void _bt_end_vacuum(Relation rel);
 extern void _bt_end_vacuum_callback(int code, Datum arg);
 extern Size BTreeShmemSize(void);
 extern void BTreeShmemInit(void);
+extern IndexTuple BtreeFormPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd);
+extern IndexTuple BtreeReformPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd);
 
 /*
  * prototypes for functions in nbtsort.c

btree_compression_test.sqlapplication/sql; name=btree_compression_test.sqlDownload

#10

Thom Brown

thom@linux.com

almost 10 years ago

In reply to: Anastasia Lubennikova (#9)

Re: [WIP] Effective storage of duplicates in B-tree index.

On 28 January 2016 at 14:06, Anastasia Lubennikova <
a.lubennikova@postgrespro.ru> wrote:

31.08.2015 10:41, Anastasia Lubennikova:

Hi, hackers!
I'm going to begin work on effective storage of duplicate keys in B-tree
index.
The main idea is to implement posting lists and posting trees for B-tree
index pages as it's already done for GIN.

In a nutshell, effective storing of duplicates in GIN is organised as
follows.
Index stores single index tuple for each unique key. That index tuple
points to posting list which contains pointers to heap tuples (TIDs). If
too many rows having the same key, multiple pages are allocated for the
TIDs and these constitute so called posting tree.
You can find wonderful detailed descriptions in gin readme
<https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README>
and articles <http://www.cybertec.at/gin-just-an-index-type/>.
It also makes possible to apply compression algorithm to posting list/tree
and significantly decrease index size. Read more in presentation (part 1)
<http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf>.

Now new B-tree index tuple must be inserted for each table row that we
index.
It can possibly cause page split. Because of MVCC even unique index could
contain duplicates.
Storing duplicates in posting list/tree helps to avoid superfluous splits.

I'd like to share the progress of my work. So here is a WIP patch.
It provides effective duplicate handling using posting lists the same way
as GIN does it.

Layout of the tuples on the page is changed in the following way:
before:
TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key, TID
(ip_blkid, ip_posid) + key
with patch:
TID (N item pointers, posting list offset) + key, TID (ip_blkid,
ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)

It seems that backward compatibility works well without any changes. But I
haven't tested it properly yet.

Here are some test results. They are obtained by test functions
test_btbuild and test_ginbuild, which you can find in attached sql file.
i - number of distinct values in the index. So i=1 means that all rows
have the same key, and i=10000000 means that all keys are different.
The other columns contain the index size (MB).

i B-tree Old B-tree New GIN
1 214,234375 87,7109375 10,2109375
10 214,234375 87,7109375 10,71875
100 214,234375 87,4375 15,640625
1000 214,234375 86,2578125 31,296875
10000 214,234375 78,421875 104,3046875
100000 214,234375 65,359375 49,078125
1000000 214,234375 90,140625 106,8203125
10000000 214,234375 214,234375 534,0625
You can note that the last row contains the same index sizes for B-tree,
which is quite logical - there is no compression if all the keys are
distinct.
Other cases looks really nice to me.
Next thing to say is that I haven't implemented posting list compression
yet. So there is still potential to decrease size of compressed btree.

I'm almost sure, there are still some tiny bugs and missed functions, but
on the whole, the patch is ready for testing.
I'd like to get a feedback about the patch testing on some real datasets.
Any bug reports and suggestions are welcome.

Here is a couple of useful queries to inspect the data inside the index
pages:
create extension pageinspect;
select * from bt_metap('idx');
select bt.* from generate_series(1,1) as n, lateral bt_page_stats('idx',
n) as bt;
select n, bt.* from generate_series(1,1) as n, lateral
bt_page_items('idx', n) as bt;

And at last, the list of items I'm going to complete in the near future:
1. Add storage_parameter 'enable_compression' for btree access method
which specifies whether the index handles duplicates. default is 'off'
2. Bring back microvacuum functionality for compressed indexes.
3. Improve insertion speed. Insertions became significantly slower with
compressed btree, which is obviously not what we do want.
4. Clean the code and comments, add related documentation.

This doesn't apply cleanly against current git head. Have you caught up
past commit 65c5fcd35?

Thom

#11

Anastasia Lubennikova

a.lubennikova@postgrespro.ru

almost 10 years ago

In reply to: Thom Brown (#10)

1 attachment(s)

Re: [WIP] Effective storage of duplicates in B-tree index.

28.01.2016 18:12, Thom Brown:

On 28 January 2016 at 14:06, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru <mailto:a.lubennikova@postgrespro.ru>>
wrote:

31.08.2015 10:41, Anastasia Lubennikova:

Hi, hackers!
I'm going to begin work on effective storage of duplicate keys in
B-tree index.
The main idea is to implement posting lists and posting trees for
B-tree index pages as it's already done for GIN.

In a nutshell, effective storing of duplicates in GIN is
organised as follows.
Index stores single index tuple for each unique key. That index
tuple points to posting list which contains pointers to heap
tuples (TIDs). If too many rows having the same key, multiple
pages are allocated for the TIDs and these constitute so called
posting tree.
You can find wonderful detailed descriptions in gin readme
<https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README>
and articles <http://www.cybertec.at/gin-just-an-index-type/>.
It also makes possible to apply compression algorithm to posting
list/tree and significantly decrease index size. Read more in
presentation (part 1)
<http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf>.

Now new B-tree index tuple must be inserted for each table row
that we index.
It can possibly cause page split. Because of MVCC even unique
index could contain duplicates.
Storing duplicates in posting list/tree helps to avoid
superfluous splits.

I'd like to share the progress of my work. So here is a WIP patch.
It provides effective duplicate handling using posting lists the
same way as GIN does it.

Layout of the tuples on the page is changed in the following way:
before:
TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key,
TID (ip_blkid, ip_posid) + key
with patch:
TID (N item pointers, posting list offset) + key, TID (ip_blkid,
ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)

It seems that backward compatibility works well without any
changes. But I haven't tested it properly yet.

Here are some test results. They are obtained by test functions
test_btbuild and test_ginbuild, which you can find in attached sql
file.
i - number of distinct values in the index. So i=1 means that all
rows have the same key, and i=10000000 means that all keys are
different.
The other columns contain the index size (MB).

i B-tree Old B-tree New GIN
1 214,234375 87,7109375 10,2109375
10 214,234375 87,7109375 10,71875
100 214,234375 87,4375 15,640625
1000 214,234375 86,2578125 31,296875
10000 214,234375 78,421875 104,3046875
100000 214,234375 65,359375 49,078125
1000000 214,234375 90,140625 106,8203125
10000000 214,234375 214,234375 534,0625

You can note that the last row contains the same index sizes for
B-tree, which is quite logical - there is no compression if all
the keys are distinct.
Other cases looks really nice to me.
Next thing to say is that I haven't implemented posting list
compression yet. So there is still potential to decrease size of
compressed btree.

I'm almost sure, there are still some tiny bugs and missed
functions, but on the whole, the patch is ready for testing.
I'd like to get a feedback about the patch testing on some real
datasets. Any bug reports and suggestions are welcome.

Here is a couple of useful queries to inspect the data inside the
index pages:
create extension pageinspect;
select * from bt_metap('idx');
select bt.* from generate_series(1,1) as n, lateral
bt_page_stats('idx', n) as bt;
select n, bt.* from generate_series(1,1) as n, lateral
bt_page_items('idx', n) as bt;

And at last, the list of items I'm going to complete in the near
future:
1. Add storage_parameter 'enable_compression' for btree access
method which specifies whether the index handles duplicates.
default is 'off'
2. Bring back microvacuum functionality for compressed indexes.
3. Improve insertion speed. Insertions became significantly slower
with compressed btree, which is obviously not what we do want.
4. Clean the code and comments, add related documentation.

This doesn't apply cleanly against current git head. Have you caught
up past commit 65c5fcd35?

Thank you for the notice. New patch is attached.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

btree_compression_1.0(rebased).patchtext/x-patch; name="btree_compression_1.0(rebased).patch"Download

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 9673fe0..0c8e4fb 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -495,7 +495,7 @@ pgss_shmem_startup(void)
 	info.hash = pgss_hash_fn;
 	info.match = pgss_match_fn;
 	pgss_hash = ShmemInitHash("pg_stat_statements hash",
-							  pgss_max, pgss_max,
+							  pgss_max,
 							  &info,
 							  HASH_ELEM | HASH_FUNCTION | HASH_COMPARE);
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index e3c55eb..3908cc1 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -24,6 +24,7 @@
 #include "storage/predicate.h"
 #include "utils/tqual.h"
 
+#include "catalog/catalog.h"
 
 typedef struct
 {
@@ -60,7 +61,8 @@ static void _bt_findinsertloc(Relation rel,
 				  ScanKey scankey,
 				  IndexTuple newtup,
 				  BTStack stack,
-				  Relation heapRel);
+				  Relation heapRel,
+				  bool *updposing);
 static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
 			   BTStack stack,
 			   IndexTuple itup,
@@ -113,6 +115,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	BTStack		stack;
 	Buffer		buf;
 	OffsetNumber offset;
+	bool updposting = false;
 
 	/* we need an insertion scan key to do our search, so build one */
 	itup_scankey = _bt_mkscankey(rel, itup);
@@ -162,8 +165,9 @@ top:
 	{
 		TransactionId xwait;
 		uint32		speculativeToken;
+		bool fakeupdposting = false; /* Never update posting in unique index */
 
-		offset = _bt_binsrch(rel, buf, natts, itup_scankey, false);
+		offset = _bt_binsrch(rel, buf, natts, itup_scankey, false, &fakeupdposting);
 		xwait = _bt_check_unique(rel, itup, heapRel, buf, offset, itup_scankey,
 								 checkUnique, &is_unique, &speculativeToken);
 
@@ -200,8 +204,54 @@ top:
 		CheckForSerializableConflictIn(rel, NULL, buf);
 		/* do the insertion */
 		_bt_findinsertloc(rel, &buf, &offset, natts, itup_scankey, itup,
-						  stack, heapRel);
-		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+						  stack, heapRel, &updposting);
+
+		if (IsSystemRelation(rel))
+			updposting = false;
+
+		/*
+		 * New tuple has the same key with tuple at the page.
+		 * Unite them into one posting.
+		 */
+		if (updposting)
+		{
+			Page		page;
+			IndexTuple olditup, newitup;
+			ItemPointerData *ipd;
+			int nipd;
+
+			page = BufferGetPage(buf);
+			olditup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offset));
+
+			if (BtreeTupleIsPosting(olditup))
+				nipd = BtreeGetNPosting(olditup);
+			else
+				nipd = 1;
+
+			ipd = palloc0(sizeof(ItemPointerData)*(nipd + 1));
+			/* copy item pointers from old tuple into ipd */
+			if (BtreeTupleIsPosting(olditup))
+				memcpy(ipd, BtreeGetPosting(olditup), sizeof(ItemPointerData)*nipd);
+			else
+				memcpy(ipd, olditup, sizeof(ItemPointerData));
+
+			/* add item pointer of the new tuple into ipd */
+			memcpy(ipd+nipd, itup, sizeof(ItemPointerData));
+
+			/*
+			 * Form posting tuple, then delete old tuple and insert posting tuple.
+			 */
+			newitup = BtreeReformPackedTuple(itup, ipd, nipd+1);
+			PageIndexTupleDelete(page, offset);
+			_bt_insertonpg(rel, buf, InvalidBuffer, stack, newitup, offset, false);
+
+			pfree(ipd);
+			pfree(newitup);
+		}
+		else
+		{
+			_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+		}
 	}
 	else
 	{
@@ -306,6 +356,8 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 
 				/* okay, we gotta fetch the heap tuple ... */
 				curitup = (IndexTuple) PageGetItem(page, curitemid);
+
+				Assert (!BtreeTupleIsPosting(curitup));
 				htid = curitup->t_tid;
 
 				/*
@@ -535,7 +587,8 @@ _bt_findinsertloc(Relation rel,
 				  ScanKey scankey,
 				  IndexTuple newtup,
 				  BTStack stack,
-				  Relation heapRel)
+				  Relation heapRel,
+				  bool *updposting)
 {
 	Buffer		buf = *bufptr;
 	Page		page = BufferGetPage(buf);
@@ -681,7 +734,7 @@ _bt_findinsertloc(Relation rel,
 	else if (firstlegaloff != InvalidOffsetNumber && !vacuumed)
 		newitemoff = firstlegaloff;
 	else
-		newitemoff = _bt_binsrch(rel, buf, keysz, scankey, false);
+		newitemoff = _bt_binsrch(rel, buf, keysz, scankey, false, updposting);
 
 	*bufptr = buf;
 	*offsetptr = newitemoff;
@@ -1042,6 +1095,9 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+
+		Assert(!BtreeTupleIsPosting(item));
+
 		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
 						false, false) == InvalidOffsetNumber)
 		{
@@ -1072,13 +1128,40 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 	}
-	if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
+
+	if (BtreeTupleIsPosting(item))
+	{
+		Size hikeysize =  BtreeGetPostingOffset(item);
+		IndexTuple hikey = palloc0(hikeysize);
+		/*
+		 * Truncate posting before insert it as a hikey.
+		 */
+		memcpy (hikey, item, hikeysize);
+		hikey->t_info &= ~INDEX_SIZE_MASK;
+		hikey->t_info |= hikeysize;
+		ItemPointerSet(&(hikey->t_tid), origpagenumber, P_HIKEY);
+
+		if (PageAddItem(leftpage, (Item) hikey, hikeysize, leftoff,
 					false, false) == InvalidOffsetNumber)
+		{
+			memset(rightpage, 0, BufferGetPageSize(rbuf));
+			elog(ERROR, "failed to add hikey to the left sibling"
+				" while splitting block %u of index \"%s\"",
+				origpagenumber, RelationGetRelationName(rel));
+		}
+
+		pfree(hikey);
+	}
+	else
 	{
-		memset(rightpage, 0, BufferGetPageSize(rbuf));
-		elog(ERROR, "failed to add hikey to the left sibling"
-			 " while splitting block %u of index \"%s\"",
-			 origpagenumber, RelationGetRelationName(rel));
+		if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
+						false, false) == InvalidOffsetNumber)
+		{
+			memset(rightpage, 0, BufferGetPageSize(rbuf));
+			elog(ERROR, "failed to add hikey to the left sibling"
+				" while splitting block %u of index \"%s\"",
+				origpagenumber, RelationGetRelationName(rel));
+		}
 	}
 	leftoff = OffsetNumberNext(leftoff);
 
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index f2905cb..f56c90f 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -75,6 +75,9 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 			 BlockNumber orig_blkno);
 
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, ItemPointerData *items,
+					  int nitem, int *nremaining);
 
 /*
  * Btree handler function: return IndexAmRoutine with access method parameters
@@ -962,6 +965,7 @@ restart:
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
+		IndexTupleData *remaining;
 
 		/*
 		 * Trade in the initial read lock for a super-exclusive write lock on
@@ -1011,31 +1015,62 @@ restart:
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
-
-				/*
-				 * During Hot Standby we currently assume that
-				 * XLOG_BTREE_VACUUM records do not produce conflicts. That is
-				 * only true as long as the callback function depends only
-				 * upon whether the index tuple refers to heap tuples removed
-				 * in the initial heap scan. When vacuum starts it derives a
-				 * value of OldestXmin. Backends taking later snapshots could
-				 * have a RecentGlobalXmin with a later xid than the vacuum's
-				 * OldestXmin, so it is possible that row versions deleted
-				 * after OldestXmin could be marked as killed by other
-				 * backends. The callback function *could* look at the index
-				 * tuple state in isolation and decide to delete the index
-				 * tuple, though currently it does not. If it ever did, we
-				 * would need to reconsider whether XLOG_BTREE_VACUUM records
-				 * should cause conflicts. If they did cause conflicts they
-				 * would be fairly harsh conflicts, since we haven't yet
-				 * worked out a way to pass a useful value for
-				 * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
-				 * applies to *any* type of index that marks index tuples as
-				 * killed.
-				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				if(BtreeTupleIsPosting(itup))
+				{
+					int nipd, nnewipd;
+					ItemPointer newipd;
+
+					nipd = BtreeGetNPosting(itup);
+					newipd = btreevacuumPosting(vstate, BtreeGetPosting(itup), nipd, &nnewipd);
+
+					if (newipd != NULL)
+					{
+						if (nnewipd > 0)
+						{
+							/* There are still some live tuples in the posting.
+							 * 1) form new posting tuple, that contains remaining ipds
+							 * 2) delete "old" posting
+							 * 3) insert new posting back to the page
+							 */
+							remaining = BtreeReformPackedTuple(itup, newipd, nnewipd);
+							PageIndexTupleDelete(page, offnum);
+
+							if (PageAddItem(page, (Item) remaining, IndexTupleSize(remaining), offnum, false, false) != offnum)
+								elog(ERROR, "failed to add vacuumed posting tuple to index page in \"%s\"",
+										RelationGetRelationName(info->index));
+						}
+						else
+							deletable[ndeletable++] = offnum;
+					}
+				}
+				else
+				{
+					htup = &(itup->t_tid);
+
+					/*
+					* During Hot Standby we currently assume that
+					* XLOG_BTREE_VACUUM records do not produce conflicts. That is
+					* only true as long as the callback function depends only
+					* upon whether the index tuple refers to heap tuples removed
+					* in the initial heap scan. When vacuum starts it derives a
+					* value of OldestXmin. Backends taking later snapshots could
+					* have a RecentGlobalXmin with a later xid than the vacuum's
+					* OldestXmin, so it is possible that row versions deleted
+					* after OldestXmin could be marked as killed by other
+					* backends. The callback function *could* look at the index
+					* tuple state in isolation and decide to delete the index
+					* tuple, though currently it does not. If it ever did, we
+					* would need to reconsider whether XLOG_BTREE_VACUUM records
+					* should cause conflicts. If they did cause conflicts they
+					* would be fairly harsh conflicts, since we haven't yet
+					* worked out a way to pass a useful value for
+					* latestRemovedXid on the XLOG_BTREE_VACUUM records. This
+					* applies to *any* type of index that marks index tuples as
+					* killed.
+					*/
+					if (callback(htup, callback_state))
+						deletable[ndeletable++] = offnum;
+				}
 			}
 		}
 
@@ -1160,3 +1195,51 @@ btcanreturn(Relation index, int attno)
 {
 	return true;
 }
+
+
+/*
+ * Vacuums a posting list. The size of the list must be specified
+ * via number of items (nitems).
+ *
+ * If none of the items need to be removed, returns NULL. Otherwise returns
+ * a new palloc'd array with the remaining items. The number of remaining
+ * items is returned via nremaining.
+ */
+ItemPointer
+btreevacuumPosting(BTVacState *vstate, ItemPointerData *items,
+					  int nitem, int *nremaining)
+{
+	int			i,
+				remaining = 0;
+	ItemPointer tmpitems = NULL;
+	IndexBulkDeleteCallback callback = vstate->callback;
+	void	   *callback_state = vstate->callback_state;
+
+	/*
+	 * Iterate over TIDs array
+	 */
+	for (i = 0; i < nitem; i++)
+	{
+		if (callback(items + i, callback_state))
+		{
+			if (!tmpitems)
+			{
+				/*
+				 * First TID to be deleted: allocate memory to hold the
+				 * remaining items.
+				 */
+				tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+				memcpy(tmpitems, items, sizeof(ItemPointerData) * i);
+			}
+		}
+		else
+		{
+			if (tmpitems)
+				tmpitems[remaining] = items[i];
+			remaining++;
+		}
+	}
+
+	*nremaining = remaining;
+	return tmpitems;
+}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 3db32e8..0428f04 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -29,6 +29,8 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 			 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 			 OffsetNumber offnum, IndexTuple itup);
+static void _bt_savePostingitem(BTScanOpaque so, int itemIndex,
+			 OffsetNumber offnum, ItemPointer iptr, IndexTuple itup, int i);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static Buffer _bt_walk_left(Relation rel, Buffer buf);
 static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
@@ -90,6 +92,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		   Buffer *bufP, int access)
 {
 	BTStack		stack_in = NULL;
+	bool fakeupdposting = false; /* fake variable for _bt_binsrch */
 
 	/* Get the root page to start with */
 	*bufP = _bt_getroot(rel, access);
@@ -136,7 +139,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * Find the appropriate item on the internal page, and get the child
 		 * page that it points to.
 		 */
-		offnum = _bt_binsrch(rel, *bufP, keysz, scankey, nextkey);
+		offnum = _bt_binsrch(rel, *bufP, keysz, scankey, nextkey, &fakeupdposting);
 		itemid = PageGetItemId(page, offnum);
 		itup = (IndexTuple) PageGetItem(page, itemid);
 		blkno = ItemPointerGetBlockNumber(&(itup->t_tid));
@@ -310,7 +313,8 @@ _bt_binsrch(Relation rel,
 			Buffer buf,
 			int keysz,
 			ScanKey scankey,
-			bool nextkey)
+			bool nextkey,
+			bool *updposing)
 {
 	Page		page;
 	BTPageOpaque opaque;
@@ -373,7 +377,17 @@ _bt_binsrch(Relation rel,
 	 * scan key), which could be the last slot + 1.
 	 */
 	if (P_ISLEAF(opaque))
+	{
+		if (low <= PageGetMaxOffsetNumber(page))
+		{
+			IndexTuple oitup = (IndexTuple) PageGetItem(page, PageGetItemId(page, low));
+			/* one excessive check of equality. for possible posting tuple update or creation */
+			if ((_bt_compare(rel, keysz, scankey, page, low) == 0)
+				&& (IndexTupleSize(oitup) + sizeof(ItemPointerData) < BTMaxItemSize(page)))
+				*updposing = true;
+		}
 		return low;
+	}
 
 	/*
 	 * On a non-leaf page, return the last key < scan key (resp. <= scan key).
@@ -536,6 +550,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	int			i;
 	StrategyNumber strat_total;
 	BTScanPosItem *currItem;
+	bool fakeupdposing = false; /* fake variable for _bt_binsrch */
 
 	Assert(!BTScanPosIsValid(so->currPos));
 
@@ -1003,7 +1018,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	so->markItemIndex = -1;		/* ditto */
 
 	/* position to the precise item on the page */
-	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
+	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey, &fakeupdposing);
 
 	/*
 	 * If nextkey = false, we are positioned at the first item >= scan key, or
@@ -1161,6 +1176,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	int			itemIndex;
 	IndexTuple	itup;
 	bool		continuescan;
+	int i;
 
 	/*
 	 * We must have the buffer pinned and locked, but the usual macro can't be
@@ -1195,6 +1211,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	/* initialize tuple workspace to empty */
 	so->currPos.nextTupleOffset = 0;
+	so->currPos.prevTupleOffset = 0;
 
 	/*
 	 * Now that the current page has been made consistent, the macro should be
@@ -1215,8 +1232,19 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (itup != NULL)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				if (BtreeTupleIsPosting(itup))
+				{
+					for (i = 0; i < BtreeGetNPosting(itup); i++)
+					{
+						_bt_savePostingitem(so, itemIndex, offnum, BtreeGetPostingN(itup, i), itup, i);
+						itemIndex++;
+					}
+				}
+				else
+				{
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
 			}
 			if (!continuescan)
 			{
@@ -1228,7 +1256,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			offnum = OffsetNumberNext(offnum);
 		}
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxPackedIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1236,7 +1264,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxPackedIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1246,8 +1274,20 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (itup != NULL)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				if (BtreeTupleIsPosting(itup))
+				{
+					for (i = 0; i < BtreeGetNPosting(itup); i++)
+					{
+						itemIndex--;
+						_bt_savePostingitem(so, itemIndex, offnum, BtreeGetPostingN(itup, i), itup, i);
+					}
+				}
+				else
+				{
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+
 			}
 			if (!continuescan)
 			{
@@ -1261,8 +1301,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxPackedIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxPackedIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1275,6 +1315,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert (!BtreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1288,6 +1330,37 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 }
 
 /*
+ * Save an index item into so->currPos.items[itemIndex]
+ * Performing index-only scan, handle the first elem separately.
+ * Save the key once, and connect it with posting tids using tupleOffset.
+ */
+static void
+_bt_savePostingitem(BTScanOpaque so, int itemIndex,
+			 OffsetNumber offnum, ItemPointer iptr, IndexTuple itup, int i)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *iptr;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		if (i == 0)
+		{
+			/* save key. the same for all tuples in the posting */
+			Size		itupsz = BtreeGetPostingOffset(itup);
+			currItem->tupleOffset = so->currPos.nextTupleOffset;
+			memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+			so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+			so->currPos.prevTupleOffset = currItem->tupleOffset;
+		}
+		else
+			currItem->tupleOffset = so->currPos.prevTupleOffset;
+	}
+}
+
+
+/*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
  * On entry, if so->currPos.buf is valid the buffer is pinned but not locked;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 99a014e..e29d63f 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -75,6 +75,7 @@
 #include "utils/rel.h"
 #include "utils/sortsupport.h"
 #include "utils/tuplesort.h"
+#include "catalog/catalog.h"
 
 
 /*
@@ -527,15 +528,120 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(last_off > P_FIRSTKEY);
 		ii = PageGetItemId(opage, last_off);
 		oitup = (IndexTuple) PageGetItem(opage, ii);
-		_bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
 
 		/*
-		 * Move 'last' into the high key position on opage
+		 * If the item is PostingTuple, we can cut it.
+		 * Because HIKEY is not considered as real data, and it needn't to keep any ItemPointerData at all.
+		 * And of course it needn't to keep a list of ipd.
+		 * But, if it had a big posting list, there will be plenty of free space on the opage.
+		 * So we must split Posting tuple into 2 pieces.
 		 */
-		hii = PageGetItemId(opage, P_HIKEY);
-		*hii = *ii;
-		ItemIdSetUnused(ii);	/* redundant */
-		((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+		 if (BtreeTupleIsPosting(oitup))
+		 {
+			int nipd, ntocut, ntoleave;
+			Size keytupsz;
+			IndexTuple keytup;
+			nipd = BtreeGetNPosting(oitup);
+			ntocut = (sizeof(ItemIdData) + BtreeGetPostingOffset(oitup))/sizeof(ItemPointerData);
+			ntocut++; /* round up to be sure that we cut enough */
+			ntoleave = nipd - ntocut;
+
+			/*
+			 * 0) Form key tuple, that doesn't contain any ipd.
+			 * NOTE: key tuple will have blkno & offset suitable for P_HIKEY.
+			 * any function that uses keytup should handle them itself.
+			 */
+			keytupsz =  BtreeGetPostingOffset(oitup);
+			keytup = palloc0(keytupsz);
+			memcpy (keytup, oitup, keytupsz);
+			keytup->t_info &= ~INDEX_SIZE_MASK;
+			keytup->t_info |= keytupsz;
+			ItemPointerSet(&(keytup->t_tid), oblkno, P_HIKEY);
+
+			if (ntocut < nipd)
+			{
+				ItemPointerData *newipd;
+				IndexTuple newitup, newlasttup;
+				/*
+				 * 1) Cut part of old tuple to shift to npage.
+				 * And insert it as P_FIRSTKEY.
+				 * This tuple is based on keytup.
+				 * Blkno & offnum are reset in BtreeFormPackedTuple.
+				 */
+				newipd = palloc0(sizeof(ItemPointerData)*ntocut);
+				/* Note, that we cut last 'ntocut' items */
+				memcpy(newipd, BtreeGetPosting(oitup)+ntoleave, sizeof(ItemPointerData)*ntocut);
+				newitup = BtreeFormPackedTuple(keytup, newipd, ntocut);
+
+				_bt_sortaddtup(npage, IndexTupleSize(newitup), newitup, P_FIRSTKEY);
+				pfree(newipd);
+				pfree(newitup);
+
+				/*
+				 * 2) set last item to the P_HIKEY linp
+				 * Move 'last' into the high key position on opage
+				 * NOTE: Do this because of indextuple deletion algorithm, which
+				 * doesn't allow to delete an item while we have unused one before it.
+				 */
+				hii = PageGetItemId(opage, P_HIKEY);
+				*hii = *ii;
+				ItemIdSetUnused(ii);	/* redundant */
+				((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+
+				/* 3) delete "wrong" high key */
+				PageIndexTupleDelete(opage, P_HIKEY);
+
+				/* 4)Insert keytup as P_HIKEY. */
+				_bt_sortaddtup(opage, IndexTupleSize(keytup), keytup,  P_HIKEY);
+
+				/* 5) form the part of old tuple with ntoleave ipds. And insert it as last tuple. */
+				newlasttup = BtreeFormPackedTuple(keytup, BtreeGetPosting(oitup), ntoleave);
+
+				_bt_sortaddtup(opage, IndexTupleSize(newlasttup), newlasttup, PageGetMaxOffsetNumber(opage)+1);
+
+				pfree(newlasttup);
+			}
+			else
+			{
+				/* The tuple isn't big enough to split it. Handle it as a normal tuple. */
+
+				/*
+				 * 1) Shift the last tuple to npage.
+				 * Insert it as P_FIRSTKEY.
+				 */
+				_bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
+
+				/* 2) set last item to the P_HIKEY linp */
+				/* Move 'last' into the high key position on opage */
+				hii = PageGetItemId(opage, P_HIKEY);
+				*hii = *ii;
+				ItemIdSetUnused(ii);	/* redundant */
+				((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+
+				/* 3) delete "wrong" high key */
+				PageIndexTupleDelete(opage, P_HIKEY);
+
+				/* 4)Insert keytup as P_HIKEY. */
+				_bt_sortaddtup(opage, IndexTupleSize(keytup), keytup,  P_HIKEY);
+
+			}
+			pfree(keytup);
+		 }
+		 else
+		 {
+			/*
+			 * 1) Shift the last tuple to npage.
+			 * Insert it as P_FIRSTKEY.
+			 */
+			_bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
+
+			/* 2) set last item to the P_HIKEY linp */
+			/* Move 'last' into the high key position on opage */
+			hii = PageGetItemId(opage, P_HIKEY);
+			*hii = *ii;
+			ItemIdSetUnused(ii);	/* redundant */
+			((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+		}
 
 		/*
 		 * Link the old page into its parent, using its minimum key. If we
@@ -547,6 +653,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 
 		Assert(state->btps_minkey != NULL);
 		ItemPointerSet(&(state->btps_minkey->t_tid), oblkno, P_HIKEY);
+
 		_bt_buildadd(wstate, state->btps_next, state->btps_minkey);
 		pfree(state->btps_minkey);
 
@@ -555,7 +662,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		 * it off the old page, not the new one, in case we are not at leaf
 		 * level.
 		 */
-		state->btps_minkey = CopyIndexTuple(oitup);
+		ItemId iihk = PageGetItemId(opage, P_HIKEY);
+		IndexTuple hikey = (IndexTuple) PageGetItem(opage, iihk);
+		state->btps_minkey = CopyIndexTuple(hikey);
 
 		/*
 		 * Set the sibling links for both pages.
@@ -590,7 +699,29 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	if (last_off == P_HIKEY)
 	{
 		Assert(state->btps_minkey == NULL);
-		state->btps_minkey = CopyIndexTuple(itup);
+
+		if (BtreeTupleIsPosting(itup))
+		{
+			Size keytupsz;
+			IndexTuple keytup;
+
+			/*
+			 * 0) Form key tuple, that doesn't contain any ipd.
+			 * NOTE: key tuple will have blkno & offset suitable for P_HIKEY.
+			 * any function that uses keytup should handle them itself.
+			 */
+			keytupsz =  BtreeGetPostingOffset(itup);
+			keytup = palloc0(keytupsz);
+			memcpy (keytup, itup, keytupsz);
+
+			keytup->t_info &= ~INDEX_SIZE_MASK;
+			keytup->t_info |= keytupsz;
+			ItemPointerSet(&(keytup->t_tid), nblkno, P_HIKEY);
+
+			state->btps_minkey = CopyIndexTuple(keytup);
+		}
+		else
+			state->btps_minkey = CopyIndexTuple(itup);
 	}
 
 	/*
@@ -670,6 +801,67 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 }
 
 /*
+ * Prepare SortSupport structure for indextuples comparison
+ */
+SortSupport
+_bt_prepare_SortSupport(BTWriteState *wstate, int keysz)
+{
+	/* Prepare SortSupport data for each column */
+	ScanKey		indexScanKey = _bt_mkscankey_nodata(wstate->index);
+	SortSupport sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
+	int i;
+
+	for (i = 0; i < keysz; i++)
+	{
+		SortSupport sortKey = sortKeys + i;
+		ScanKey		scanKey = indexScanKey + i;
+		int16		strategy;
+
+		sortKey->ssup_cxt = CurrentMemoryContext;
+		sortKey->ssup_collation = scanKey->sk_collation;
+		sortKey->ssup_nulls_first =
+			(scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
+		sortKey->ssup_attno = scanKey->sk_attno;
+		/* Abbreviation is not supported here */
+		sortKey->abbreviate = false;
+
+		AssertState(sortKey->ssup_attno != 0);
+
+		strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
+			BTGreaterStrategyNumber : BTLessStrategyNumber;
+
+		PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
+	}
+
+	_bt_freeskey(indexScanKey);
+	return sortKeys;
+}
+
+/*
+ * Compare two tuples using sortKey i
+ */
+int _bt_call_comparator(SortSupport sortKeys, int i,
+						 IndexTuple itup, IndexTuple itup2, TupleDesc tupdes)
+{
+		SortSupport entry;
+		Datum		attrDatum1,
+					attrDatum2;
+		bool		isNull1,
+					isNull2;
+		int32		compare;
+
+		entry = sortKeys + i - 1;
+		attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
+		attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+
+		compare = ApplySortComparator(attrDatum1, isNull1,
+										attrDatum2, isNull2,
+										entry);
+
+		return compare;
+}
+
+/*
  * Read tuples in correct sort order from tuplesort, and load them into
  * btree leaves.
  */
@@ -679,16 +871,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	BTPageState *state = NULL;
 	bool		merge = (btspool2 != NULL);
 	IndexTuple	itup,
-				itup2 = NULL;
+				itup2 = NULL,
+				itupprev = NULL;
 	bool		should_free,
 				should_free2,
 				load1;
 	TupleDesc	tupdes = RelationGetDescr(wstate->index);
 	int			i,
 				keysz = RelationGetNumberOfAttributes(wstate->index);
-	ScanKey		indexScanKey = NULL;
+	int			ntuples = 0;
 	SortSupport sortKeys;
 
+	/* Prepare SortSupport data */
+	sortKeys = (SortSupport)_bt_prepare_SortSupport(wstate, keysz);
+
 	if (merge)
 	{
 		/*
@@ -701,34 +897,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 									   true, &should_free);
 		itup2 = tuplesort_getindextuple(btspool2->sortstate,
 										true, &should_free2);
-		indexScanKey = _bt_mkscankey_nodata(wstate->index);
-
-		/* Prepare SortSupport data for each column */
-		sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
-
-		for (i = 0; i < keysz; i++)
-		{
-			SortSupport sortKey = sortKeys + i;
-			ScanKey		scanKey = indexScanKey + i;
-			int16		strategy;
-
-			sortKey->ssup_cxt = CurrentMemoryContext;
-			sortKey->ssup_collation = scanKey->sk_collation;
-			sortKey->ssup_nulls_first =
-				(scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
-			sortKey->ssup_attno = scanKey->sk_attno;
-			/* Abbreviation is not supported here */
-			sortKey->abbreviate = false;
-
-			AssertState(sortKey->ssup_attno != 0);
-
-			strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
-				BTGreaterStrategyNumber : BTLessStrategyNumber;
-
-			PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
-		}
-
-		_bt_freeskey(indexScanKey);
 
 		for (;;)
 		{
@@ -742,20 +910,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			{
 				for (i = 1; i <= keysz; i++)
 				{
-					SortSupport entry;
-					Datum		attrDatum1,
-								attrDatum2;
-					bool		isNull1,
-								isNull2;
-					int32		compare;
-
-					entry = sortKeys + i - 1;
-					attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
-					attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
-
-					compare = ApplySortComparator(attrDatum1, isNull1,
-												  attrDatum2, isNull2,
-												  entry);
+					int32 compare = _bt_call_comparator(sortKeys, i, itup, itup2, tupdes);
+
 					if (compare > 0)
 					{
 						load1 = false;
@@ -794,19 +950,137 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	else
 	{
 		/* merge is unnecessary */
-		while ((itup = tuplesort_getindextuple(btspool->sortstate,
+
+		Relation indexRelation = wstate->index;
+		Form_pg_index index = indexRelation->rd_index;
+
+		if (index->indisunique)
+		{
+			/* Do not use compression for unique indexes. */
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
 											   true, &should_free)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+					state = _bt_pagestate(wstate, 0);
+
+				_bt_buildadd(wstate, state, itup);
+				if (should_free)
+					pfree(itup);
+			}
+		}
+		else
 		{
-			/* When we see first tuple, create first index page */
-			if (state == NULL)
-				state = _bt_pagestate(wstate, 0);
+			ItemPointerData *ipd = NULL;
+			IndexTuple 		postingtuple;
+			Size			maxitemsize = 0,
+							maxpostingsize = 0;
+			int32 			compare = 0;
 
-			_bt_buildadd(wstate, state, itup);
-			if (should_free)
-				pfree(itup);
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+											   true, &should_free)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+				{
+					state = _bt_pagestate(wstate, 0);
+					maxitemsize = BTMaxItemSize(state->btps_page);
+				}
+
+				/*
+				 * Compare current tuple with previous one.
+				 * If tuples are equal, we can unite them into a posting list.
+				 */
+				if (itupprev != NULL)
+				{
+					/* compare tuples */
+					compare = 0;
+					for (i = 1; i <= keysz; i++)
+					{
+						compare = _bt_call_comparator(sortKeys, i, itup, itupprev, tupdes);
+						if (compare != 0)
+							break;
+					}
+
+					if (compare == 0)
+					{
+						/* Tuples are equal. Create or update posting */
+						if (ntuples == 0)
+						{
+							/*
+							 * We haven't suitable posting list yet, so allocate
+							 * it and save both itupprev and current tuple.
+							 */
+
+							ipd = palloc0(maxitemsize);
+
+							memcpy(ipd, itupprev, sizeof(ItemPointerData));
+							ntuples++;
+							memcpy(ipd + ntuples, itup, sizeof(ItemPointerData));
+							ntuples++;
+						}
+						else
+						{
+							if ((ntuples+1)*sizeof(ItemPointerData) < maxpostingsize)
+							{
+								memcpy(ipd + ntuples, itup, sizeof(ItemPointerData));
+								ntuples++;
+							}
+							else
+							{
+								postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+								_bt_buildadd(wstate, state, postingtuple);
+								ntuples = 0;
+								pfree(ipd);
+							}
+						}
+
+					}
+					else
+					{
+						/* Tuples aren't equal. Insert itupprev into index. */
+						if (ntuples == 0)
+							_bt_buildadd(wstate, state, itupprev);
+						else
+						{
+							postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+							_bt_buildadd(wstate, state, postingtuple);
+							ntuples = 0;
+							pfree(ipd);
+						}
+					}
+				}
+
+				/*
+				 * Copy the tuple into temp variable itupprev
+				 * to compare it with the following tuple
+				 * and maybe unite them into a posting tuple
+				 */
+				itupprev = CopyIndexTuple(itup);
+				if (should_free)
+					pfree(itup);
+
+				/* compute max size of ipd list */
+				maxpostingsize = maxitemsize - IndexInfoFindDataOffset(itupprev->t_info) - MAXALIGN(IndexTupleSize(itupprev));
+			}
+
+			/* Handle the last item.*/
+			if (ntuples == 0)
+			{
+				if (itupprev != NULL)
+					_bt_buildadd(wstate, state, itupprev);
+			}
+			else
+			{
+				Assert(ipd!=NULL);
+				Assert(itupprev != NULL);
+				postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+				_bt_buildadd(wstate, state, postingtuple);
+				ntuples = 0;
+				pfree(ipd);
+			}
 		}
 	}
-
 	/* Close down final pages and write the metapage */
 	_bt_uppershutdown(wstate, state);
 
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index c850b48..0291342 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1821,7 +1821,9 @@ _bt_killitems(IndexScanDesc scan)
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			/* No microvacuum for posting tuples */
+			if (!BtreeTupleIsPosting(ituple)
+				&& (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
 			{
 				/* found the item */
 				ItemIdMarkDead(iid);
@@ -2063,3 +2065,71 @@ btoptions(Datum reloptions, bool validate)
 {
 	return default_reloptions(reloptions, validate, RELOPT_KIND_BTREE);
 }
+
+
+/*
+ * Already have basic index tuple that contains key datum
+ */
+IndexTuple
+BtreeFormPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd)
+{
+	int i;
+	uint32		newsize;
+	IndexTuple itup = CopyIndexTuple(tuple);
+
+	/*
+	 * Determine and store offset to the posting list.
+	 */
+	newsize = IndexTupleSize(itup);
+	newsize = SHORTALIGN(newsize);
+
+	/*
+	 * Set meta info about the posting list.
+	 */
+	BtreeSetPostingOffset(itup, newsize);
+	BtreeSetNPosting(itup, nipd);
+	/*
+	 * Add space needed for posting list, if any.  Then check that the tuple
+	 * won't be too big to store.
+	 */
+	newsize += sizeof(ItemPointerData)*nipd;
+	newsize = MAXALIGN(newsize);
+
+	/*
+	 * Resize tuple if needed
+	 */
+	if (newsize != IndexTupleSize(itup))
+	{
+		itup = repalloc(itup, newsize);
+
+		/*
+		 * PostgreSQL 9.3 and earlier did not clear this new space, so we
+		 * might find uninitialized padding when reading tuples from disk.
+		 */
+		memset((char *) itup + IndexTupleSize(itup),
+			   0, newsize - IndexTupleSize(itup));
+		/* set new size in tuple header */
+		itup->t_info &= ~INDEX_SIZE_MASK;
+		itup->t_info |= newsize;
+	}
+
+	/*
+	 * Copy data into the posting tuple
+	 */
+	memcpy(BtreeGetPosting(itup), data, sizeof(ItemPointerData)*nipd);
+	return itup;
+}
+
+IndexTuple
+BtreeReformPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd)
+{
+	int size;
+	if (BtreeTupleIsPosting(tuple))
+	{
+		size = BtreeGetPostingOffset(tuple);
+		tuple->t_info &= ~INDEX_SIZE_MASK;
+		tuple->t_info |= size;
+	}
+
+	return BtreeFormPackedTuple(tuple, data, nipd);
+}
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index 39e8baf..dd5acb7 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -62,7 +62,7 @@ InitBufTable(int size)
 	info.num_partitions = NUM_BUFFER_PARTITIONS;
 
 	SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
-								  size, size,
+								  size,
 								  &info,
 								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
 }
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 81506ea..4c18701 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -237,7 +237,7 @@ InitShmemIndex(void)
 	hash_flags = HASH_ELEM;
 
 	ShmemIndex = ShmemInitHash("ShmemIndex",
-							   SHMEM_INDEX_SIZE, SHMEM_INDEX_SIZE,
+							   SHMEM_INDEX_SIZE,
 							   &info, hash_flags);
 }
 
@@ -255,17 +255,12 @@ InitShmemIndex(void)
  * exceeded substantially (since it's used to compute directory size and
  * the hash table buckets will get overfull).
  *
- * init_size is the number of hashtable entries to preallocate.  For a table
- * whose maximum size is certain, this should be equal to max_size; that
- * ensures that no run-time out-of-shared-memory failures can occur.
- *
  * Note: before Postgres 9.0, this function returned NULL for some failure
  * cases.  Now, it always throws error instead, so callers need not check
  * for NULL.
  */
 HTAB *
 ShmemInitHash(const char *name, /* table string name for shmem index */
-			  long init_size,	/* initial table size */
 			  long max_size,	/* max size of the table */
 			  HASHCTL *infoP,	/* info about key and bucket size */
 			  int hash_flags)	/* info about infoP */
@@ -299,7 +294,7 @@ ShmemInitHash(const char *name, /* table string name for shmem index */
 	/* Pass location of hashtable header to hash_create */
 	infoP->hctl = (HASHHDR *) location;
 
-	return hash_create(name, init_size, infoP, hash_flags);
+	return hash_create(name, max_size, infoP, hash_flags);
 }
 
 /*
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 9c2e49c..8d9b36a 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -373,8 +373,7 @@ void
 InitLocks(void)
 {
 	HASHCTL		info;
-	long		init_table_size,
-				max_table_size;
+	long		max_table_size;
 	bool		found;
 
 	/*
@@ -382,7 +381,6 @@ InitLocks(void)
 	 * calculations must agree with LockShmemSize!
 	 */
 	max_table_size = NLOCKENTS();
-	init_table_size = max_table_size / 2;
 
 	/*
 	 * Allocate hash table for LOCK structs.  This stores per-locked-object
@@ -394,14 +392,12 @@ InitLocks(void)
 	info.num_partitions = NUM_LOCK_PARTITIONS;
 
 	LockMethodLockHash = ShmemInitHash("LOCK hash",
-									   init_table_size,
 									   max_table_size,
 									   &info,
 									HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
 
 	/* Assume an average of 2 holders per lock */
 	max_table_size *= 2;
-	init_table_size *= 2;
 
 	/*
 	 * Allocate hash table for PROCLOCK structs.  This stores
@@ -413,7 +409,6 @@ InitLocks(void)
 	info.num_partitions = NUM_LOCK_PARTITIONS;
 
 	LockMethodProcLockHash = ShmemInitHash("PROCLOCK hash",
-										   init_table_size,
 										   max_table_size,
 										   &info,
 								 HASH_ELEM | HASH_FUNCTION | HASH_PARTITION);
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index d9d4e22..fc72d2d 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1116,7 +1116,6 @@ InitPredicateLocks(void)
 
 	PredicateLockTargetHash = ShmemInitHash("PREDICATELOCKTARGET hash",
 											max_table_size,
-											max_table_size,
 											&info,
 											HASH_ELEM | HASH_BLOBS |
 											HASH_PARTITION | HASH_FIXED_SIZE);
@@ -1144,7 +1143,6 @@ InitPredicateLocks(void)
 
 	PredicateLockHash = ShmemInitHash("PREDICATELOCK hash",
 									  max_table_size,
-									  max_table_size,
 									  &info,
 									  HASH_ELEM | HASH_FUNCTION |
 									  HASH_PARTITION | HASH_FIXED_SIZE);
@@ -1225,7 +1223,6 @@ InitPredicateLocks(void)
 
 	SerializableXidHash = ShmemInitHash("SERIALIZABLEXID hash",
 										max_table_size,
-										max_table_size,
 										&info,
 										HASH_ELEM | HASH_BLOBS |
 										HASH_FIXED_SIZE);
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 24a53da..ce9bb9c 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -15,7 +15,7 @@
  * to hash_create.  This prevents any attempt to split buckets on-the-fly.
  * Therefore, each hash bucket chain operates independently, and no fields
  * of the hash header change after init except nentries and freeList.
- * A partitioned table uses a spinlock to guard changes of those two fields.
+ * A partitioned table uses spinlocks to guard changes of those fields.
  * This lets any subset of the hash buckets be treated as a separately
  * lockable partition.  We expect callers to use the low-order bits of a
  * lookup key's hash value as a partition number --- this will work because
@@ -87,6 +87,7 @@
 #include "access/xact.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
+#include "storage/lock.h"
 #include "utils/dynahash.h"
 #include "utils/memutils.h"
 
@@ -128,12 +129,26 @@ typedef HASHBUCKET *HASHSEGMENT;
  */
 struct HASHHDR
 {
-	/* In a partitioned table, take this lock to touch nentries or freeList */
-	slock_t		mutex;			/* unused if not partitioned table */
-
-	/* These fields change during entry addition/deletion */
-	long		nentries;		/* number of entries in hash table */
-	HASHELEMENT *freeList;		/* linked list of free elements */
+	/*
+	 * There are two fields declared below: nentries and freeList. nentries
+	 * stores current number of entries in a hash table. freeList is a linked
+	 * list of free elements.
+	 *
+	 * To keep these fields consistent in a partitioned table we need to
+	 * synchronize access to them using a spinlock. But it turned out that a
+	 * single spinlock can create a bottleneck. To prevent lock contention an
+	 * array of NUM_LOCK_PARTITIONS spinlocks is used. Each spinlock
+	 * corresponds to a single table partition (see PARTITION_IDX definition)
+	 * and protects one element of nentries and freeList arrays. Since
+	 * partitions are locked on a calling side depending on lower bits of a
+	 * hash value this particular number of spinlocks prevents deadlocks.
+	 *
+	 * If hash table is not partitioned only nentries[0] and freeList[0] are
+	 * used and spinlocks are not used at all.
+	 */
+	slock_t		mutex[NUM_LOCK_PARTITIONS];		/* array of spinlocks */
+	long		nentries[NUM_LOCK_PARTITIONS];	/* number of entries */
+	HASHELEMENT *freeList[NUM_LOCK_PARTITIONS]; /* lists of free elements */
 
 	/* These fields can change, but not in a partitioned table */
 	/* Also, dsize can't change in a shared table, even if unpartitioned */
@@ -166,6 +181,8 @@ struct HASHHDR
 
 #define IS_PARTITIONED(hctl)  ((hctl)->num_partitions != 0)
 
+#define PARTITION_IDX(hctl, hashcode) (IS_PARTITIONED(hctl) ? LockHashPartition(hashcode) : 0)
+
 /*
  * Top control structure for a hashtable --- in a shared table, each backend
  * has its own copy (OK since no fields change at runtime)
@@ -219,10 +236,10 @@ static long hash_accesses,
  */
 static void *DynaHashAlloc(Size size);
 static HASHSEGMENT seg_alloc(HTAB *hashp);
-static bool element_alloc(HTAB *hashp, int nelem);
+static bool element_alloc(HTAB *hashp, int nelem, int partition_idx);
 static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
-static HASHBUCKET get_hash_entry(HTAB *hashp);
+static HASHBUCKET get_hash_entry(HTAB *hashp, int partition_idx);
 static void hdefault(HTAB *hashp);
 static int	choose_nelem_alloc(Size entrysize);
 static bool init_htab(HTAB *hashp, long nelem);
@@ -282,6 +299,9 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
 {
 	HTAB	   *hashp;
 	HASHHDR    *hctl;
+	int			i,
+				partitions_number,
+				nelem_alloc;
 
 	/*
 	 * For shared hash tables, we have a local hash header (HTAB struct) that
@@ -482,10 +502,24 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
 	if ((flags & HASH_SHARED_MEM) ||
 		nelem < hctl->nelem_alloc)
 	{
-		if (!element_alloc(hashp, (int) nelem))
-			ereport(ERROR,
-					(errcode(ERRCODE_OUT_OF_MEMORY),
-					 errmsg("out of memory")));
+		/*
+		 * If hash table is partitioned all freeLists have equal number of
+		 * elements. Otherwise only freeList[0] is used.
+		 */
+		if (IS_PARTITIONED(hashp->hctl))
+			partitions_number = NUM_LOCK_PARTITIONS;
+		else
+			partitions_number = 1;
+
+		nelem_alloc = ((int) nelem) / partitions_number;
+		if (nelem_alloc == 0)
+			nelem_alloc = 1;
+
+		for (i = 0; i < partitions_number; i++)
+			if (!element_alloc(hashp, nelem_alloc, i))
+				ereport(ERROR,
+						(errcode(ERRCODE_OUT_OF_MEMORY),
+						 errmsg("out of memory")));
 	}
 
 	if (flags & HASH_FIXED_SIZE)
@@ -503,9 +537,6 @@ hdefault(HTAB *hashp)
 
 	MemSet(hctl, 0, sizeof(HASHHDR));
 
-	hctl->nentries = 0;
-	hctl->freeList = NULL;
-
 	hctl->dsize = DEF_DIRSIZE;
 	hctl->nsegs = 0;
 
@@ -572,12 +603,14 @@ init_htab(HTAB *hashp, long nelem)
 	HASHSEGMENT *segp;
 	int			nbuckets;
 	int			nsegs;
+	int			i;
 
 	/*
 	 * initialize mutex if it's a partitioned table
 	 */
 	if (IS_PARTITIONED(hctl))
-		SpinLockInit(&hctl->mutex);
+		for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
+			SpinLockInit(&(hctl->mutex[i]));
 
 	/*
 	 * Divide number of elements by the fill factor to determine a desired
@@ -648,7 +681,7 @@ init_htab(HTAB *hashp, long nelem)
 			"HIGH MASK       ", hctl->high_mask,
 			"LOW  MASK       ", hctl->low_mask,
 			"NSEGS           ", hctl->nsegs,
-			"NENTRIES        ", hctl->nentries);
+			"NENTRIES        ", hash_get_num_entries(hctl));
 #endif
 	return true;
 }
@@ -769,7 +802,7 @@ hash_stats(const char *where, HTAB *hashp)
 			where, hashp->hctl->accesses, hashp->hctl->collisions);
 
 	fprintf(stderr, "hash_stats: entries %ld keysize %ld maxp %u segmentcount %ld\n",
-			hashp->hctl->nentries, (long) hashp->hctl->keysize,
+			hash_get_num_entries(hashp), (long) hashp->hctl->keysize,
 			hashp->hctl->max_bucket, hashp->hctl->nsegs);
 	fprintf(stderr, "%s: total accesses %ld total collisions %ld\n",
 			where, hash_accesses, hash_collisions);
@@ -863,6 +896,7 @@ hash_search_with_hash_value(HTAB *hashp,
 	HASHBUCKET	currBucket;
 	HASHBUCKET *prevBucketPtr;
 	HashCompareFunc match;
+	int			partition_idx = PARTITION_IDX(hctl, hashvalue);
 
 #if HASH_STATISTICS
 	hash_accesses++;
@@ -885,7 +919,7 @@ hash_search_with_hash_value(HTAB *hashp,
 		 * order of these tests is to try to check cheaper conditions first.
 		 */
 		if (!IS_PARTITIONED(hctl) && !hashp->frozen &&
-			hctl->nentries / (long) (hctl->max_bucket + 1) >= hctl->ffactor &&
+		hctl->nentries[0] / (long) (hctl->max_bucket + 1) >= hctl->ffactor &&
 			!has_seq_scans(hashp))
 			(void) expand_table(hashp);
 	}
@@ -943,20 +977,20 @@ hash_search_with_hash_value(HTAB *hashp,
 			{
 				/* if partitioned, must lock to touch nentries and freeList */
 				if (IS_PARTITIONED(hctl))
-					SpinLockAcquire(&hctl->mutex);
+					SpinLockAcquire(&(hctl->mutex[partition_idx]));
 
-				Assert(hctl->nentries > 0);
-				hctl->nentries--;
+				Assert(hctl->nentries[partition_idx] > 0);
+				hctl->nentries[partition_idx]--;
 
 				/* remove record from hash bucket's chain. */
 				*prevBucketPtr = currBucket->link;
 
 				/* add the record to the freelist for this table.  */
-				currBucket->link = hctl->freeList;
-				hctl->freeList = currBucket;
+				currBucket->link = hctl->freeList[partition_idx];
+				hctl->freeList[partition_idx] = currBucket;
 
 				if (IS_PARTITIONED(hctl))
-					SpinLockRelease(&hctl->mutex);
+					SpinLockRelease(&hctl->mutex[partition_idx]);
 
 				/*
 				 * better hope the caller is synchronizing access to this
@@ -982,7 +1016,7 @@ hash_search_with_hash_value(HTAB *hashp,
 				elog(ERROR, "cannot insert into frozen hashtable \"%s\"",
 					 hashp->tabname);
 
-			currBucket = get_hash_entry(hashp);
+			currBucket = get_hash_entry(hashp, partition_idx);
 			if (currBucket == NULL)
 			{
 				/* out of memory */
@@ -1175,41 +1209,71 @@ hash_update_hash_key(HTAB *hashp,
  * create a new entry if possible
  */
 static HASHBUCKET
-get_hash_entry(HTAB *hashp)
+get_hash_entry(HTAB *hashp, int partition_idx)
 {
-	HASHHDR *hctl = hashp->hctl;
+	HASHHDR    *hctl = hashp->hctl;
 	HASHBUCKET	newElement;
+	int			i,
+				borrow_from_idx;
 
 	for (;;)
 	{
 		/* if partitioned, must lock to touch nentries and freeList */
 		if (IS_PARTITIONED(hctl))
-			SpinLockAcquire(&hctl->mutex);
+			SpinLockAcquire(&hctl->mutex[partition_idx]);
 
 		/* try to get an entry from the freelist */
-		newElement = hctl->freeList;
+		newElement = hctl->freeList[partition_idx];
+
 		if (newElement != NULL)
-			break;
+		{
+			/* remove entry from freelist, bump nentries */
+			hctl->freeList[partition_idx] = newElement->link;
+			hctl->nentries[partition_idx]++;
+			if (IS_PARTITIONED(hctl))
+				SpinLockRelease(&hctl->mutex[partition_idx]);
+
+			return newElement;
+		}
 
-		/* no free elements.  allocate another chunk of buckets */
 		if (IS_PARTITIONED(hctl))
-			SpinLockRelease(&hctl->mutex);
+			SpinLockRelease(&hctl->mutex[partition_idx]);
 
-		if (!element_alloc(hashp, hctl->nelem_alloc))
+		/* no free elements.  allocate another chunk of buckets */
+		if (!element_alloc(hashp, hctl->nelem_alloc, partition_idx))
 		{
-			/* out of memory */
-			return NULL;
-		}
-	}
+			if (!IS_PARTITIONED(hctl))
+				return NULL;	/* out of memory */
 
-	/* remove entry from freelist, bump nentries */
-	hctl->freeList = newElement->link;
-	hctl->nentries++;
+			/* try to borrow element from another partition */
+			borrow_from_idx = partition_idx;
+			for (;;)
+			{
+				borrow_from_idx = (borrow_from_idx + 1) % NUM_LOCK_PARTITIONS;
+				if (borrow_from_idx == partition_idx)
+					break;
 
-	if (IS_PARTITIONED(hctl))
-		SpinLockRelease(&hctl->mutex);
+				SpinLockAcquire(&(hctl->mutex[borrow_from_idx]));
+				newElement = hctl->freeList[borrow_from_idx];
+
+				if (newElement != NULL)
+				{
+					hctl->freeList[borrow_from_idx] = newElement->link;
+					SpinLockRelease(&(hctl->mutex[borrow_from_idx]));
+
+					SpinLockAcquire(&hctl->mutex[partition_idx]);
+					hctl->nentries[partition_idx]++;
+					SpinLockRelease(&hctl->mutex[partition_idx]);
+
+					break;
+				}
 
-	return newElement;
+				SpinLockRelease(&(hctl->mutex[borrow_from_idx]));
+			}
+
+			return newElement;
+		}
+	}
 }
 
 /*
@@ -1218,11 +1282,21 @@ get_hash_entry(HTAB *hashp)
 long
 hash_get_num_entries(HTAB *hashp)
 {
+	int			i;
+	long		sum = hashp->hctl->nentries[0];
+
 	/*
 	 * We currently don't bother with the mutex; it's only sensible to call
 	 * this function if you've got lock on all partitions of the table.
 	 */
-	return hashp->hctl->nentries;
+
+	if (!IS_PARTITIONED(hashp->hctl))
+		return sum;
+
+	for (i = 1; i < NUM_LOCK_PARTITIONS; i++)
+		sum += hashp->hctl->nentries[i];
+
+	return sum;
 }
 
 /*
@@ -1530,9 +1604,9 @@ seg_alloc(HTAB *hashp)
  * allocate some new elements and link them into the free list
  */
 static bool
-element_alloc(HTAB *hashp, int nelem)
+element_alloc(HTAB *hashp, int nelem, int partition_idx)
 {
-	HASHHDR *hctl = hashp->hctl;
+	HASHHDR    *hctl = hashp->hctl;
 	Size		elementSize;
 	HASHELEMENT *firstElement;
 	HASHELEMENT *tmpElement;
@@ -1563,14 +1637,14 @@ element_alloc(HTAB *hashp, int nelem)
 
 	/* if partitioned, must lock to touch freeList */
 	if (IS_PARTITIONED(hctl))
-		SpinLockAcquire(&hctl->mutex);
+		SpinLockAcquire(&hctl->mutex[partition_idx]);
 
 	/* freelist could be nonempty if two backends did this concurrently */
-	firstElement->link = hctl->freeList;
-	hctl->freeList = prevElement;
+	firstElement->link = hctl->freeList[partition_idx];
+	hctl->freeList[partition_idx] = prevElement;
 
 	if (IS_PARTITIONED(hctl))
-		SpinLockRelease(&hctl->mutex);
+		SpinLockRelease(&hctl->mutex[partition_idx]);
 
 	return true;
 }
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 8350fa0..eb4467a 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -137,7 +137,12 @@ typedef IndexAttributeBitMapData *IndexAttributeBitMap;
 #define MaxIndexTuplesPerPage	\
 	((int) ((BLCKSZ - SizeOfPageHeaderData) / \
 			(MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))))
-
+#define MaxPackedIndexTuplesPerPage	\
+	((int) ((BLCKSZ - SizeOfPageHeaderData) / \
+			(sizeof(ItemPointerData))))
+// #define MaxIndexTuplesPerPage	\
+// 	((int) ((BLCKSZ - SizeOfPageHeaderData) / \
+// 			(sizeof(ItemPointerData))))
 
 /* routines in indextuple.c */
 extern IndexTuple index_form_tuple(TupleDesc tupleDescriptor,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 06822fa..41e407d 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -75,6 +75,7 @@ typedef BTPageOpaqueData *BTPageOpaque;
 #define BTP_SPLIT_END	(1 << 5)	/* rightmost page of split group */
 #define BTP_HAS_GARBAGE (1 << 6)	/* page has LP_DEAD tuples */
 #define BTP_INCOMPLETE_SPLIT (1 << 7)	/* right sibling's downlink is missing */
+#define BTP_HAS_POSTING (1 << 8)		/* page contains compressed duplicates (only for leaf pages) */
 
 /*
  * The max allowed value of a cycle ID is a bit less than 64K.  This is
@@ -181,6 +182,8 @@ typedef struct BTMetaPageData
 #define P_IGNORE(opaque)		((opaque)->btpo_flags & (BTP_DELETED|BTP_HALF_DEAD))
 #define P_HAS_GARBAGE(opaque)	((opaque)->btpo_flags & BTP_HAS_GARBAGE)
 #define P_INCOMPLETE_SPLIT(opaque)	((opaque)->btpo_flags & BTP_INCOMPLETE_SPLIT)
+#define P_HAS_POSTING(opaque)		((opaque)->btpo_flags & BTP_HAS_POSTING)
+
 
 /*
  *	Lehman and Yao's algorithm requires a ``high key'' on every non-rightmost
@@ -538,6 +541,8 @@ typedef struct BTScanPosData
 	 * location in the associated tuple storage workspace.
 	 */
 	int			nextTupleOffset;
+	/* prevTupleOffset is for Posting list handling*/
+	int			prevTupleOffset;
 
 	/*
 	 * The items array is always ordered in index order (ie, increasing
@@ -550,7 +555,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxPackedIndexTuplesPerPage]; /* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -651,6 +656,28 @@ typedef BTScanOpaqueData *BTScanOpaque;
 #define SK_BT_DESC			(INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
 #define SK_BT_NULLS_FIRST	(INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
 
+
+/*
+ * We use our own ItemPointerGet(BlockNumber|OffsetNumber)
+ * to avoid Asserts, since sometimes the ip_posid isn't "valid"
+ */
+#define BtreeItemPointerGetBlockNumber(pointer) \
+	BlockIdGetBlockNumber(&(pointer)->ip_blkid)
+
+#define BtreeItemPointerGetOffsetNumber(pointer) \
+	((pointer)->ip_posid)
+
+#define BT_POSTING (1<<31)
+#define BtreeGetNPosting(itup)			BtreeItemPointerGetOffsetNumber(&(itup)->t_tid)
+#define BtreeSetNPosting(itup,n)		ItemPointerSetOffsetNumber(&(itup)->t_tid,n)
+
+#define BtreeGetPostingOffset(itup)		(BtreeItemPointerGetBlockNumber(&(itup)->t_tid) & (~BT_POSTING))
+#define BtreeSetPostingOffset(itup,n)	ItemPointerSetBlockNumber(&(itup)->t_tid,(n)|BT_POSTING)
+#define BtreeTupleIsPosting(itup)    	(BtreeItemPointerGetBlockNumber(&(itup)->t_tid) & BT_POSTING)
+#define BtreeGetPosting(itup)			(ItemPointerData*) ((char*)(itup) + BtreeGetPostingOffset(itup))
+#define BtreeGetPostingN(itup,n)		(ItemPointerData*) (BtreeGetPosting(itup) + n)
+
+
 /*
  * prototypes for functions in nbtree.c (external entry points for btree)
  */
@@ -715,8 +742,8 @@ extern BTStack _bt_search(Relation rel,
 extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
 			  ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
 			  int access);
-extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
-			ScanKey scankey, bool nextkey);
+extern OffsetNumber _bt_binsrch( Relation rel, Buffer buf, int keysz,
+								ScanKey scankey, bool nextkey, bool* updposting);
 extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
 			Page page, OffsetNumber offnum);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
@@ -747,6 +774,8 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
 extern Size BTreeShmemSize(void);
 extern void BTreeShmemInit(void);
 extern bytea *btoptions(Datum reloptions, bool validate);
+extern IndexTuple BtreeFormPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd);
+extern IndexTuple BtreeReformPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 5e8825e..177371b 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -128,13 +128,19 @@ extern char *MainLWLockNames[];
  * having this file include lock.h or bufmgr.h would be backwards.
  */
 
-/* Number of partitions of the shared buffer mapping hashtable */
-#define NUM_BUFFER_PARTITIONS  128
-
-/* Number of partitions the shared lock tables are divided into */
-#define LOG2_NUM_LOCK_PARTITIONS  4
+/*
+ * Number of partitions the shared lock tables are divided into.
+ *
+ * This particular number of partitions significantly reduces lock contention
+ * in partitioned hash tables, almost if partitioned tables didn't use any
+ * locking at all.
+ */
+#define LOG2_NUM_LOCK_PARTITIONS  7
 #define NUM_LOCK_PARTITIONS  (1 << LOG2_NUM_LOCK_PARTITIONS)
 
+/* Number of partitions of the shared buffer mapping hashtable */
+#define NUM_BUFFER_PARTITIONS NUM_LOCK_PARTITIONS
+
 /* Number of partitions the shared predicate lock tables are divided into */
 #define LOG2_NUM_PREDICATELOCK_PARTITIONS  4
 #define NUM_PREDICATELOCK_PARTITIONS  (1 << LOG2_NUM_PREDICATELOCK_PARTITIONS)
diff --git a/src/include/storage/shmem.h b/src/include/storage/shmem.h
index 6468e66..50cf928 100644
--- a/src/include/storage/shmem.h
+++ b/src/include/storage/shmem.h
@@ -37,7 +37,7 @@ extern void InitShmemAllocation(void);
 extern void *ShmemAlloc(Size size);
 extern bool ShmemAddrIsValid(const void *addr);
 extern void InitShmemIndex(void);
-extern HTAB *ShmemInitHash(const char *name, long init_size, long max_size,
+extern HTAB *ShmemInitHash(const char *name, long max_size,
 			  HASHCTL *infoP, int hash_flags);
 extern void *ShmemInitStruct(const char *name, Size size, bool *foundPtr);
 extern Size add_size(Size s1, Size s2);

#12

Thom Brown

thom@linux.com

almost 10 years ago

In reply to: Anastasia Lubennikova (#11)

Re: [WIP] Effective storage of duplicates in B-tree index.

On 28 January 2016 at 16:12, Anastasia Lubennikova <
a.lubennikova@postgrespro.ru> wrote:

28.01.2016 18:12, Thom Brown:

On 28 January 2016 at 14:06, Anastasia Lubennikova <
a.lubennikova@postgrespro.ru> wrote:

31.08.2015 10:41, Anastasia Lubennikova:

Hi, hackers!
I'm going to begin work on effective storage of duplicate keys in B-tree
index.
The main idea is to implement posting lists and posting trees for B-tree
index pages as it's already done for GIN.

In a nutshell, effective storing of duplicates in GIN is organised as
follows.
Index stores single index tuple for each unique key. That index tuple
points to posting list which contains pointers to heap tuples (TIDs). If
too many rows having the same key, multiple pages are allocated for the
TIDs and these constitute so called posting tree.
You can find wonderful detailed descriptions in gin readme
<https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README>
and articles <http://www.cybertec.at/gin-just-an-index-type/>.
It also makes possible to apply compression algorithm to posting
list/tree and significantly decrease index size. Read more in presentation
(part 1)
<http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf>.

Now new B-tree index tuple must be inserted for each table row that we
index.
It can possibly cause page split. Because of MVCC even unique index could
contain duplicates.
Storing duplicates in posting list/tree helps to avoid superfluous splits.

I'd like to share the progress of my work. So here is a WIP patch.
It provides effective duplicate handling using posting lists the same way
as GIN does it.

Layout of the tuples on the page is changed in the following way:
before:
TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key, TID
(ip_blkid, ip_posid) + key
with patch:
TID (N item pointers, posting list offset) + key, TID (ip_blkid,
ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)

It seems that backward compatibility works well without any changes. But
I haven't tested it properly yet.

Here are some test results. They are obtained by test functions
test_btbuild and test_ginbuild, which you can find in attached sql file.
i - number of distinct values in the index. So i=1 means that all rows
have the same key, and i=10000000 means that all keys are different.
The other columns contain the index size (MB).

i B-tree Old B-tree New GIN
1 214,234375 87,7109375 10,2109375
10 214,234375 87,7109375 10,71875
100 214,234375 87,4375 15,640625
1000 214,234375 86,2578125 31,296875
10000 214,234375 78,421875 104,3046875
100000 214,234375 65,359375 49,078125
1000000 214,234375 90,140625 106,8203125
10000000 214,234375 214,234375 534,0625
You can note that the last row contains the same index sizes for B-tree,
which is quite logical - there is no compression if all the keys are
distinct.
Other cases looks really nice to me.
Next thing to say is that I haven't implemented posting list compression
yet. So there is still potential to decrease size of compressed btree.

I'm almost sure, there are still some tiny bugs and missed functions, but
on the whole, the patch is ready for testing.
I'd like to get a feedback about the patch testing on some real datasets.
Any bug reports and suggestions are welcome.

Here is a couple of useful queries to inspect the data inside the index
pages:
create extension pageinspect;
select * from bt_metap('idx');
select bt.* from generate_series(1,1) as n, lateral bt_page_stats('idx',
n) as bt;
select n, bt.* from generate_series(1,1) as n, lateral
bt_page_items('idx', n) as bt;

And at last, the list of items I'm going to complete in the near future:
1. Add storage_parameter 'enable_compression' for btree access method
which specifies whether the index handles duplicates. default is 'off'
2. Bring back microvacuum functionality for compressed indexes.
3. Improve insertion speed. Insertions became significantly slower with
compressed btree, which is obviously not what we do want.
4. Clean the code and comments, add related documentation.

This doesn't apply cleanly against current git head. Have you caught up
past commit 65c5fcd35?

Thank you for the notice. New patch is attached.

Thanks for the quick rebase.

Okay, a quick check with pgbench:

CREATE INDEX ON pgbench_accounts(bid);

Timing
Scale: master / patch
100: 10657ms / 13555ms (rechecked and got 9745ms)
500: 56909ms / 56985ms

Size
Scale: master / patch
100: 214MB / 87MB (40.7%)
500: 1071MB / 437MB (40.8%)

No performance issues from what I can tell.

I'm surprised that efficiencies can't be realised beyond this point. Your
results show a sweet spot at around 1000 / 10000000, with it getting
slightly worse beyond that. I kind of expected a lot of efficiency where
all the values are the same, but perhaps that's due to my lack of
understanding regarding the way they're being stored.

Thom

#13

Peter Geoghegan

pg@heroku.com

almost 10 years ago

In reply to: Thom Brown (#12)

Re: [WIP] Effective storage of duplicates in B-tree index.

On Thu, Jan 28, 2016 at 9:03 AM, Thom Brown <thom@linux.com> wrote:

I'm surprised that efficiencies can't be realised beyond this point. Your results show a sweet spot at around 1000 / 10000000, with it getting slightly worse beyond that. I kind of expected a lot of efficiency where all the values are the same, but perhaps that's due to my lack of understanding regarding the way they're being stored.

I think that you'd need an I/O bound workload to see significant
benefits. That seems unsurprising. I believe that random I/O from
index writes is a big problem for us.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14

Thom Brown

thom@linux.com

almost 10 years ago

In reply to: Peter Geoghegan (#13)

Re: [WIP] Effective storage of duplicates in B-tree index.

On 28 January 2016 at 17:09, Peter Geoghegan <pg@heroku.com> wrote:

On Thu, Jan 28, 2016 at 9:03 AM, Thom Brown <thom@linux.com> wrote:

I'm surprised that efficiencies can't be realised beyond this point. Your results show a sweet spot at around 1000 / 10000000, with it getting slightly worse beyond that. I kind of expected a lot of efficiency where all the values are the same, but perhaps that's due to my lack of understanding regarding the way they're being stored.

I think that you'd need an I/O bound workload to see significant
benefits. That seems unsurprising. I believe that random I/O from
index writes is a big problem for us.

I was thinking more from the point of view of the index size. An
index containing 10 million duplicate values is around 40% of the size
of an index with 10 million unique values.

Thom

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15

Thom Brown

thom@linux.com

almost 10 years ago

In reply to: Thom Brown (#12)

Re: [WIP] Effective storage of duplicates in B-tree index.

On 28 January 2016 at 17:03, Thom Brown <thom@linux.com> wrote:

On 28 January 2016 at 16:12, Anastasia Lubennikova <
a.lubennikova@postgrespro.ru> wrote:

28.01.2016 18:12, Thom Brown:

On 28 January 2016 at 14:06, Anastasia Lubennikova <
a.lubennikova@postgrespro.ru> wrote:

31.08.2015 10:41, Anastasia Lubennikova:

Hi, hackers!
I'm going to begin work on effective storage of duplicate keys in B-tree
index.
The main idea is to implement posting lists and posting trees for B-tree
index pages as it's already done for GIN.

In a nutshell, effective storing of duplicates in GIN is organised as
follows.
Index stores single index tuple for each unique key. That index tuple
points to posting list which contains pointers to heap tuples (TIDs). If
too many rows having the same key, multiple pages are allocated for the
TIDs and these constitute so called posting tree.
You can find wonderful detailed descriptions in gin readme
<https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README>
and articles <http://www.cybertec.at/gin-just-an-index-type/>.
It also makes possible to apply compression algorithm to posting
list/tree and significantly decrease index size. Read more in presentation
(part 1)
<http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf>.

Now new B-tree index tuple must be inserted for each table row that we
index.
It can possibly cause page split. Because of MVCC even unique index
could contain duplicates.
Storing duplicates in posting list/tree helps to avoid superfluous
splits.

I'd like to share the progress of my work. So here is a WIP patch.
It provides effective duplicate handling using posting lists the same
way as GIN does it.

Layout of the tuples on the page is changed in the following way:
before:
TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key, TID
(ip_blkid, ip_posid) + key
with patch:
TID (N item pointers, posting list offset) + key, TID (ip_blkid,
ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)

It seems that backward compatibility works well without any changes. But
I haven't tested it properly yet.

Here are some test results. They are obtained by test functions
test_btbuild and test_ginbuild, which you can find in attached sql file.
i - number of distinct values in the index. So i=1 means that all rows
have the same key, and i=10000000 means that all keys are different.
The other columns contain the index size (MB).

i B-tree Old B-tree New GIN
1 214,234375 87,7109375 10,2109375
10 214,234375 87,7109375 10,71875
100 214,234375 87,4375 15,640625
1000 214,234375 86,2578125 31,296875
10000 214,234375 78,421875 104,3046875
100000 214,234375 65,359375 49,078125
1000000 214,234375 90,140625 106,8203125
10000000 214,234375 214,234375 534,0625
You can note that the last row contains the same index sizes for B-tree,
which is quite logical - there is no compression if all the keys are
distinct.
Other cases looks really nice to me.
Next thing to say is that I haven't implemented posting list compression
yet. So there is still potential to decrease size of compressed btree.

I'm almost sure, there are still some tiny bugs and missed functions,
but on the whole, the patch is ready for testing.
I'd like to get a feedback about the patch testing on some real
datasets. Any bug reports and suggestions are welcome.

Here is a couple of useful queries to inspect the data inside the index
pages:
create extension pageinspect;
select * from bt_metap('idx');
select bt.* from generate_series(1,1) as n, lateral bt_page_stats('idx',
n) as bt;
select n, bt.* from generate_series(1,1) as n, lateral
bt_page_items('idx', n) as bt;

And at last, the list of items I'm going to complete in the near future:
1. Add storage_parameter 'enable_compression' for btree access method
which specifies whether the index handles duplicates. default is 'off'
2. Bring back microvacuum functionality for compressed indexes.
3. Improve insertion speed. Insertions became significantly slower with
compressed btree, which is obviously not what we do want.
4. Clean the code and comments, add related documentation.

This doesn't apply cleanly against current git head. Have you caught up
past commit 65c5fcd35?

Thank you for the notice. New patch is attached.

Thanks for the quick rebase.

Okay, a quick check with pgbench:

CREATE INDEX ON pgbench_accounts(bid);

Timing
Scale: master / patch
100: 10657ms / 13555ms (rechecked and got 9745ms)
500: 56909ms / 56985ms

Size
Scale: master / patch
100: 214MB / 87MB (40.7%)
500: 1071MB / 437MB (40.8%)

No performance issues from what I can tell.

I'm surprised that efficiencies can't be realised beyond this point. Your
results show a sweet spot at around 1000 / 10000000, with it getting
slightly worse beyond that. I kind of expected a lot of efficiency where
all the values are the same, but perhaps that's due to my lack of
understanding regarding the way they're being stored.

Okay, now for some badness. I've restored a database containing 2 tables,
one 318MB, another 24kB. The 318MB table contains 5 million rows with a
sequential id column. I get a problem if I try to delete many rows from it:

# delete from contacts where id % 3 != 0 ;
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory

The query completes, but I get this message a lot before it does.

This happens even if I drop the primary key and foreign key constraints, so
somehow the memory usage has massively increased with this patch.

Thom

#16

Anastasia Lubennikova

a.lubennikova@postgrespro.ru

almost 10 years ago

In reply to: Thom Brown (#12)

Re: [WIP] Effective storage of duplicates in B-tree index.

28.01.2016 20:03, Thom Brown:

On 28 January 2016 at 16:12, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru <mailto:a.lubennikova@postgrespro.ru>>
wrote:

28.01.2016 18:12, Thom Brown:

On 28 January 2016 at 14:06, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru
<mailto:a.lubennikova@postgrespro.ru>> wrote:

31.08.2015 10:41, Anastasia Lubennikova:

Hi, hackers!
I'm going to begin work on effective storage of duplicate
keys in B-tree index.
The main idea is to implement posting lists and posting
trees for B-tree index pages as it's already done for GIN.

In a nutshell, effective storing of duplicates in GIN is
organised as follows.
Index stores single index tuple for each unique key. That
index tuple points to posting list which contains pointers
to heap tuples (TIDs). If too many rows having the same key,
multiple pages are allocated for the TIDs and these
constitute so called posting tree.
You can find wonderful detailed descriptions in gin readme
<https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README>
and articles <http://www.cybertec.at/gin-just-an-index-type/>.
It also makes possible to apply compression algorithm to
posting list/tree and significantly decrease index size.
Read more in presentation (part 1)
<http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf>.

Now new B-tree index tuple must be inserted for each table
row that we index.
It can possibly cause page split. Because of MVCC even
unique index could contain duplicates.
Storing duplicates in posting list/tree helps to avoid
superfluous splits.

I'd like to share the progress of my work. So here is a WIP
patch.
It provides effective duplicate handling using posting lists
the same way as GIN does it.

Layout of the tuples on the page is changed in the following way:
before:
TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) +
key, TID (ip_blkid, ip_posid) + key
with patch:
TID (N item pointers, posting list offset) + key, TID
(ip_blkid, ip_posid), TID (ip_blkid, ip_posid), TID
(ip_blkid, ip_posid)

It seems that backward compatibility works well without any
changes. But I haven't tested it properly yet.

Here are some test results. They are obtained by test
functions test_btbuild and test_ginbuild, which you can find
in attached sql file.
i - number of distinct values in the index. So i=1 means that
all rows have the same key, and i=10000000 means that all
keys are different.
The other columns contain the index size (MB).

i B-tree Old B-tree New GIN
1 214,234375 87,7109375 10,2109375
10 214,234375 87,7109375 10,71875
100 214,234375 87,4375 15,640625
1000 214,234375 86,2578125 31,296875
10000 214,234375 78,421875 104,3046875
100000 214,234375 65,359375 49,078125
1000000 214,234375 90,140625 106,8203125
10000000 214,234375 214,234375 534,0625

You can note that the last row contains the same index sizes
for B-tree, which is quite logical - there is no compression
if all the keys are distinct.
Other cases looks really nice to me.
Next thing to say is that I haven't implemented posting list
compression yet. So there is still potential to decrease size
of compressed btree.

I'm almost sure, there are still some tiny bugs and missed
functions, but on the whole, the patch is ready for testing.
I'd like to get a feedback about the patch testing on some
real datasets. Any bug reports and suggestions are welcome.

Here is a couple of useful queries to inspect the data inside
the index pages:
create extension pageinspect;
select * from bt_metap('idx');
select bt.* from generate_series(1,1) as n, lateral
bt_page_stats('idx', n) as bt;
select n, bt.* from generate_series(1,1) as n, lateral
bt_page_items('idx', n) as bt;

And at last, the list of items I'm going to complete in the
near future:
1. Add storage_parameter 'enable_compression' for btree
access method which specifies whether the index handles
duplicates. default is 'off'
2. Bring back microvacuum functionality for compressed indexes.
3. Improve insertion speed. Insertions became significantly
slower with compressed btree, which is obviously not what we
do want.
4. Clean the code and comments, add related documentation.

This doesn't apply cleanly against current git head. Have you
caught up past commit 65c5fcd35?

Thank you for the notice. New patch is attached.

Thanks for the quick rebase.

Okay, a quick check with pgbench:

CREATE INDEX ON pgbench_accounts(bid);

Timing
Scale: master / patch
100: 10657ms / 13555ms (rechecked and got 9745ms)
500: 56909ms / 56985ms

Size
Scale: master / patch
100: 214MB / 87MB (40.7%)
500: 1071MB / 437MB (40.8%)

No performance issues from what I can tell.

I'm surprised that efficiencies can't be realised beyond this point.
Your results show a sweet spot at around 1000 / 10000000, with it
getting slightly worse beyond that. I kind of expected a lot of
efficiency where all the values are the same, but perhaps that's due
to my lack of understanding regarding the way they're being stored.

Thank you for the prompt reply. I see what you're confused about. I'll
try to clarify it.

First of all, what is implemented in the patch is not actually
compression. It's more about index page layout changes to compact
ItemPointers (TIDs).
Instead of TID+key, TID+key, we store now META+key+List_of_TIDs (also
known as Posting list).

before:
TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key, TID
(ip_blkid, ip_posid) + key
with patch:
TID (N item pointers, posting list offset) + key, TID (ip_blkid,
ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)

TID (N item pointers, posting list offset) - this is the meta
information. So, we have to store this meta information in addition to
useful data.

Next point is the requirement of having minimum three tuples in a page.
We need at least two tuples to point the children and the highkey as well.
This requirement leads to the limitation of the max index tuple size.

/*
* Maximum size of a btree index entry, including its tuple header.
*
* We actually need to be able to fit three items on every page,
* so restrict any one item to 1/3 the per-page available space.
*/
#define BTMaxItemSize(page) \
MAXALIGN_DOWN((PageGetPageSize(page) - \
MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
MAXALIGN(sizeof(BTPageOpaqueData))) / 3)

Although, I thought just now that this size could be increased for
compressed tuples, at least for leaf pages.

That's the reason, why we have to store more meta information than meets
the eye.

For example, we have 100000 of duplicates with the same key. It seems
that compression should be really significant.
Something like 1 Meta + 1 key instead of 100000 keys --> 6 bytes (size
of meta TID) + keysize instead of 600000.
But, we have to split one huge posting list into the smallest ones to
fit it into the index page.

It depends on the key size, of course. As I can see form pageisnpect the
index on the single integer key have to split the tuples into the pieces
with the size 2704 containing 447 TIDs in one posting list.
So we have 1 Meta + 1 key instead of 447 keys. As you can see, that is
really less impressive than expected.

There is an idea of posting trees in GIN. Key is stored just once, and
posting list which doesn't fit into the page becomes a tree.
You can find incredible article about it here
http://www.cybertec.at/2013/03/gin-just-an-index-type/
But I think, that it's not the best way for the btree am, because it’s
not supposed to handle concurrent insertions.

As I mentioned before I'm going to implement prefix compression of
posting list, which must be efficient and quite simple, since it's
already implemented in GIN. You can find the presentation about it here
https://www.pgcon.org/2014/schedule/events/698.en.html

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#17

Aleksander Alekseev

a.alekseev@postgrespro.ru

almost 10 years ago

In reply to: Anastasia Lubennikova (#16)

Re: [WIP] Effective storage of duplicates in B-tree index.

I tested this patch on x64 and ARM servers for a few hours today. The
only problem I could find is that INSERT works considerably slower after
applying a patch. Beside that everything looks fine - no crashes, tests
pass, memory doesn't seem to leak, etc.

Okay, now for some badness. I've restored a database containing 2
tables, one 318MB, another 24kB. The 318MB table contains 5 million
rows with a sequential id column. I get a problem if I try to delete
many rows from it:
# delete from contacts where id % 3 != 0 ;
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory

I didn't manage to reproduce this. Thom, could you describe exact steps
to reproduce this issue please?

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18

Thom Brown

thom@linux.com

almost 10 years ago

In reply to: Aleksander Alekseev (#17)

Re: [WIP] Effective storage of duplicates in B-tree index.

On 29 January 2016 at 15:47, Aleksander Alekseev
<a.alekseev@postgrespro.ru> wrote:

I tested this patch on x64 and ARM servers for a few hours today. The
only problem I could find is that INSERT works considerably slower after
applying a patch. Beside that everything looks fine - no crashes, tests
pass, memory doesn't seem to leak, etc.

Okay, now for some badness. I've restored a database containing 2
tables, one 318MB, another 24kB. The 318MB table contains 5 million
rows with a sequential id column. I get a problem if I try to delete
many rows from it:
# delete from contacts where id % 3 != 0 ;
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory

I didn't manage to reproduce this. Thom, could you describe exact steps
to reproduce this issue please?

Sure, I used my pg_rep_test tool to create a primary (pg_rep_test
-r0), which creates an instance with a custom config, which is as
follows:

shared_buffers = 8MB
max_connections = 7
wal_level = 'hot_standby'
cluster_name = 'primary'
max_wal_senders = 3
wal_keep_segments = 6

Then create a pgbench data set (I didn't originally use pgbench, but
you can get the same results with it):

createdb -p 5530 pgbench
pgbench -p 5530 -i -s 100 pgbench

And delete some stuff:

thom@swift:~/Development/test$ psql -p 5530 pgbench
Timing is on.
psql (9.6devel)
Type "help" for help.

➤ psql://thom@[local]:5530/pgbench

# DELETE FROM pgbench_accounts WHERE aid % 3 != 0;
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
...
WARNING: out of shared memory
WARNING: out of shared memory
DELETE 6666667
Time: 22218.804 ms

There were 358 lines of that warning message. I don't get these
messages without the patch.

Thom

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19

Anastasia Lubennikova

a.lubennikova@postgrespro.ru

almost 10 years ago

In reply to: Thom Brown (#18)

1 attachment(s)

Re: [WIP] Effective storage of duplicates in B-tree index.

29.01.2016 19:01, Thom Brown:

On 29 January 2016 at 15:47, Aleksander Alekseev
<a.alekseev@postgrespro.ru> wrote:

I tested this patch on x64 and ARM servers for a few hours today. The
only problem I could find is that INSERT works considerably slower after
applying a patch. Beside that everything looks fine - no crashes, tests
pass, memory doesn't seem to leak, etc.

Thank you for testing. I rechecked that, and insertions are really very
very very slow. It seems like a bug.

Okay, now for some badness. I've restored a database containing 2
tables, one 318MB, another 24kB. The 318MB table contains 5 million
rows with a sequential id column. I get a problem if I try to delete
many rows from it:
# delete from contacts where id % 3 != 0 ;
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory

I didn't manage to reproduce this. Thom, could you describe exact steps
to reproduce this issue please?

Sure, I used my pg_rep_test tool to create a primary (pg_rep_test
-r0), which creates an instance with a custom config, which is as
follows:

shared_buffers = 8MB
max_connections = 7
wal_level = 'hot_standby'
cluster_name = 'primary'
max_wal_senders = 3
wal_keep_segments = 6

Then create a pgbench data set (I didn't originally use pgbench, but
you can get the same results with it):

createdb -p 5530 pgbench
pgbench -p 5530 -i -s 100 pgbench

And delete some stuff:

thom@swift:~/Development/test$ psql -p 5530 pgbench
Timing is on.
psql (9.6devel)
Type "help" for help.

➤ psql://thom@[local]:5530/pgbench

# DELETE FROM pgbench_accounts WHERE aid % 3 != 0;
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
...
WARNING: out of shared memory
WARNING: out of shared memory
DELETE 6666667
Time: 22218.804 ms

There were 358 lines of that warning message. I don't get these
messages without the patch.

Thom

Thank you for this report.
I tried to reproduce it, but I couldn't. Debug will be much easier now.

I hope I'll fix these issueswithin the next few days.

BTW, I found a dummy mistake, the previous patch contains some unrelated
changes. I fixed it in the new version (attached).

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

btree_compression_2.0.patchtext/x-patch; name=btree_compression_2.0.patchDownload

diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index e3c55eb..3908cc1 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -24,6 +24,7 @@
 #include "storage/predicate.h"
 #include "utils/tqual.h"
 
+#include "catalog/catalog.h"
 
 typedef struct
 {
@@ -60,7 +61,8 @@ static void _bt_findinsertloc(Relation rel,
 				  ScanKey scankey,
 				  IndexTuple newtup,
 				  BTStack stack,
-				  Relation heapRel);
+				  Relation heapRel,
+				  bool *updposing);
 static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
 			   BTStack stack,
 			   IndexTuple itup,
@@ -113,6 +115,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	BTStack		stack;
 	Buffer		buf;
 	OffsetNumber offset;
+	bool updposting = false;
 
 	/* we need an insertion scan key to do our search, so build one */
 	itup_scankey = _bt_mkscankey(rel, itup);
@@ -162,8 +165,9 @@ top:
 	{
 		TransactionId xwait;
 		uint32		speculativeToken;
+		bool fakeupdposting = false; /* Never update posting in unique index */
 
-		offset = _bt_binsrch(rel, buf, natts, itup_scankey, false);
+		offset = _bt_binsrch(rel, buf, natts, itup_scankey, false, &fakeupdposting);
 		xwait = _bt_check_unique(rel, itup, heapRel, buf, offset, itup_scankey,
 								 checkUnique, &is_unique, &speculativeToken);
 
@@ -200,8 +204,54 @@ top:
 		CheckForSerializableConflictIn(rel, NULL, buf);
 		/* do the insertion */
 		_bt_findinsertloc(rel, &buf, &offset, natts, itup_scankey, itup,
-						  stack, heapRel);
-		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+						  stack, heapRel, &updposting);
+
+		if (IsSystemRelation(rel))
+			updposting = false;
+
+		/*
+		 * New tuple has the same key with tuple at the page.
+		 * Unite them into one posting.
+		 */
+		if (updposting)
+		{
+			Page		page;
+			IndexTuple olditup, newitup;
+			ItemPointerData *ipd;
+			int nipd;
+
+			page = BufferGetPage(buf);
+			olditup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offset));
+
+			if (BtreeTupleIsPosting(olditup))
+				nipd = BtreeGetNPosting(olditup);
+			else
+				nipd = 1;
+
+			ipd = palloc0(sizeof(ItemPointerData)*(nipd + 1));
+			/* copy item pointers from old tuple into ipd */
+			if (BtreeTupleIsPosting(olditup))
+				memcpy(ipd, BtreeGetPosting(olditup), sizeof(ItemPointerData)*nipd);
+			else
+				memcpy(ipd, olditup, sizeof(ItemPointerData));
+
+			/* add item pointer of the new tuple into ipd */
+			memcpy(ipd+nipd, itup, sizeof(ItemPointerData));
+
+			/*
+			 * Form posting tuple, then delete old tuple and insert posting tuple.
+			 */
+			newitup = BtreeReformPackedTuple(itup, ipd, nipd+1);
+			PageIndexTupleDelete(page, offset);
+			_bt_insertonpg(rel, buf, InvalidBuffer, stack, newitup, offset, false);
+
+			pfree(ipd);
+			pfree(newitup);
+		}
+		else
+		{
+			_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+		}
 	}
 	else
 	{
@@ -306,6 +356,8 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 
 				/* okay, we gotta fetch the heap tuple ... */
 				curitup = (IndexTuple) PageGetItem(page, curitemid);
+
+				Assert (!BtreeTupleIsPosting(curitup));
 				htid = curitup->t_tid;
 
 				/*
@@ -535,7 +587,8 @@ _bt_findinsertloc(Relation rel,
 				  ScanKey scankey,
 				  IndexTuple newtup,
 				  BTStack stack,
-				  Relation heapRel)
+				  Relation heapRel,
+				  bool *updposting)
 {
 	Buffer		buf = *bufptr;
 	Page		page = BufferGetPage(buf);
@@ -681,7 +734,7 @@ _bt_findinsertloc(Relation rel,
 	else if (firstlegaloff != InvalidOffsetNumber && !vacuumed)
 		newitemoff = firstlegaloff;
 	else
-		newitemoff = _bt_binsrch(rel, buf, keysz, scankey, false);
+		newitemoff = _bt_binsrch(rel, buf, keysz, scankey, false, updposting);
 
 	*bufptr = buf;
 	*offsetptr = newitemoff;
@@ -1042,6 +1095,9 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+
+		Assert(!BtreeTupleIsPosting(item));
+
 		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
 						false, false) == InvalidOffsetNumber)
 		{
@@ -1072,13 +1128,40 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 	}
-	if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
+
+	if (BtreeTupleIsPosting(item))
+	{
+		Size hikeysize =  BtreeGetPostingOffset(item);
+		IndexTuple hikey = palloc0(hikeysize);
+		/*
+		 * Truncate posting before insert it as a hikey.
+		 */
+		memcpy (hikey, item, hikeysize);
+		hikey->t_info &= ~INDEX_SIZE_MASK;
+		hikey->t_info |= hikeysize;
+		ItemPointerSet(&(hikey->t_tid), origpagenumber, P_HIKEY);
+
+		if (PageAddItem(leftpage, (Item) hikey, hikeysize, leftoff,
 					false, false) == InvalidOffsetNumber)
+		{
+			memset(rightpage, 0, BufferGetPageSize(rbuf));
+			elog(ERROR, "failed to add hikey to the left sibling"
+				" while splitting block %u of index \"%s\"",
+				origpagenumber, RelationGetRelationName(rel));
+		}
+
+		pfree(hikey);
+	}
+	else
 	{
-		memset(rightpage, 0, BufferGetPageSize(rbuf));
-		elog(ERROR, "failed to add hikey to the left sibling"
-			 " while splitting block %u of index \"%s\"",
-			 origpagenumber, RelationGetRelationName(rel));
+		if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
+						false, false) == InvalidOffsetNumber)
+		{
+			memset(rightpage, 0, BufferGetPageSize(rbuf));
+			elog(ERROR, "failed to add hikey to the left sibling"
+				" while splitting block %u of index \"%s\"",
+				origpagenumber, RelationGetRelationName(rel));
+		}
 	}
 	leftoff = OffsetNumberNext(leftoff);
 
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index f2905cb..f56c90f 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -75,6 +75,9 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 			 BlockNumber orig_blkno);
 
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, ItemPointerData *items,
+					  int nitem, int *nremaining);
 
 /*
  * Btree handler function: return IndexAmRoutine with access method parameters
@@ -962,6 +965,7 @@ restart:
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
+		IndexTupleData *remaining;
 
 		/*
 		 * Trade in the initial read lock for a super-exclusive write lock on
@@ -1011,31 +1015,62 @@ restart:
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
-
-				/*
-				 * During Hot Standby we currently assume that
-				 * XLOG_BTREE_VACUUM records do not produce conflicts. That is
-				 * only true as long as the callback function depends only
-				 * upon whether the index tuple refers to heap tuples removed
-				 * in the initial heap scan. When vacuum starts it derives a
-				 * value of OldestXmin. Backends taking later snapshots could
-				 * have a RecentGlobalXmin with a later xid than the vacuum's
-				 * OldestXmin, so it is possible that row versions deleted
-				 * after OldestXmin could be marked as killed by other
-				 * backends. The callback function *could* look at the index
-				 * tuple state in isolation and decide to delete the index
-				 * tuple, though currently it does not. If it ever did, we
-				 * would need to reconsider whether XLOG_BTREE_VACUUM records
-				 * should cause conflicts. If they did cause conflicts they
-				 * would be fairly harsh conflicts, since we haven't yet
-				 * worked out a way to pass a useful value for
-				 * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
-				 * applies to *any* type of index that marks index tuples as
-				 * killed.
-				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				if(BtreeTupleIsPosting(itup))
+				{
+					int nipd, nnewipd;
+					ItemPointer newipd;
+
+					nipd = BtreeGetNPosting(itup);
+					newipd = btreevacuumPosting(vstate, BtreeGetPosting(itup), nipd, &nnewipd);
+
+					if (newipd != NULL)
+					{
+						if (nnewipd > 0)
+						{
+							/* There are still some live tuples in the posting.
+							 * 1) form new posting tuple, that contains remaining ipds
+							 * 2) delete "old" posting
+							 * 3) insert new posting back to the page
+							 */
+							remaining = BtreeReformPackedTuple(itup, newipd, nnewipd);
+							PageIndexTupleDelete(page, offnum);
+
+							if (PageAddItem(page, (Item) remaining, IndexTupleSize(remaining), offnum, false, false) != offnum)
+								elog(ERROR, "failed to add vacuumed posting tuple to index page in \"%s\"",
+										RelationGetRelationName(info->index));
+						}
+						else
+							deletable[ndeletable++] = offnum;
+					}
+				}
+				else
+				{
+					htup = &(itup->t_tid);
+
+					/*
+					* During Hot Standby we currently assume that
+					* XLOG_BTREE_VACUUM records do not produce conflicts. That is
+					* only true as long as the callback function depends only
+					* upon whether the index tuple refers to heap tuples removed
+					* in the initial heap scan. When vacuum starts it derives a
+					* value of OldestXmin. Backends taking later snapshots could
+					* have a RecentGlobalXmin with a later xid than the vacuum's
+					* OldestXmin, so it is possible that row versions deleted
+					* after OldestXmin could be marked as killed by other
+					* backends. The callback function *could* look at the index
+					* tuple state in isolation and decide to delete the index
+					* tuple, though currently it does not. If it ever did, we
+					* would need to reconsider whether XLOG_BTREE_VACUUM records
+					* should cause conflicts. If they did cause conflicts they
+					* would be fairly harsh conflicts, since we haven't yet
+					* worked out a way to pass a useful value for
+					* latestRemovedXid on the XLOG_BTREE_VACUUM records. This
+					* applies to *any* type of index that marks index tuples as
+					* killed.
+					*/
+					if (callback(htup, callback_state))
+						deletable[ndeletable++] = offnum;
+				}
 			}
 		}
 
@@ -1160,3 +1195,51 @@ btcanreturn(Relation index, int attno)
 {
 	return true;
 }
+
+
+/*
+ * Vacuums a posting list. The size of the list must be specified
+ * via number of items (nitems).
+ *
+ * If none of the items need to be removed, returns NULL. Otherwise returns
+ * a new palloc'd array with the remaining items. The number of remaining
+ * items is returned via nremaining.
+ */
+ItemPointer
+btreevacuumPosting(BTVacState *vstate, ItemPointerData *items,
+					  int nitem, int *nremaining)
+{
+	int			i,
+				remaining = 0;
+	ItemPointer tmpitems = NULL;
+	IndexBulkDeleteCallback callback = vstate->callback;
+	void	   *callback_state = vstate->callback_state;
+
+	/*
+	 * Iterate over TIDs array
+	 */
+	for (i = 0; i < nitem; i++)
+	{
+		if (callback(items + i, callback_state))
+		{
+			if (!tmpitems)
+			{
+				/*
+				 * First TID to be deleted: allocate memory to hold the
+				 * remaining items.
+				 */
+				tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+				memcpy(tmpitems, items, sizeof(ItemPointerData) * i);
+			}
+		}
+		else
+		{
+			if (tmpitems)
+				tmpitems[remaining] = items[i];
+			remaining++;
+		}
+	}
+
+	*nremaining = remaining;
+	return tmpitems;
+}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 3db32e8..0428f04 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -29,6 +29,8 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 			 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 			 OffsetNumber offnum, IndexTuple itup);
+static void _bt_savePostingitem(BTScanOpaque so, int itemIndex,
+			 OffsetNumber offnum, ItemPointer iptr, IndexTuple itup, int i);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static Buffer _bt_walk_left(Relation rel, Buffer buf);
 static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
@@ -90,6 +92,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		   Buffer *bufP, int access)
 {
 	BTStack		stack_in = NULL;
+	bool fakeupdposting = false; /* fake variable for _bt_binsrch */
 
 	/* Get the root page to start with */
 	*bufP = _bt_getroot(rel, access);
@@ -136,7 +139,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * Find the appropriate item on the internal page, and get the child
 		 * page that it points to.
 		 */
-		offnum = _bt_binsrch(rel, *bufP, keysz, scankey, nextkey);
+		offnum = _bt_binsrch(rel, *bufP, keysz, scankey, nextkey, &fakeupdposting);
 		itemid = PageGetItemId(page, offnum);
 		itup = (IndexTuple) PageGetItem(page, itemid);
 		blkno = ItemPointerGetBlockNumber(&(itup->t_tid));
@@ -310,7 +313,8 @@ _bt_binsrch(Relation rel,
 			Buffer buf,
 			int keysz,
 			ScanKey scankey,
-			bool nextkey)
+			bool nextkey,
+			bool *updposing)
 {
 	Page		page;
 	BTPageOpaque opaque;
@@ -373,7 +377,17 @@ _bt_binsrch(Relation rel,
 	 * scan key), which could be the last slot + 1.
 	 */
 	if (P_ISLEAF(opaque))
+	{
+		if (low <= PageGetMaxOffsetNumber(page))
+		{
+			IndexTuple oitup = (IndexTuple) PageGetItem(page, PageGetItemId(page, low));
+			/* one excessive check of equality. for possible posting tuple update or creation */
+			if ((_bt_compare(rel, keysz, scankey, page, low) == 0)
+				&& (IndexTupleSize(oitup) + sizeof(ItemPointerData) < BTMaxItemSize(page)))
+				*updposing = true;
+		}
 		return low;
+	}
 
 	/*
 	 * On a non-leaf page, return the last key < scan key (resp. <= scan key).
@@ -536,6 +550,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	int			i;
 	StrategyNumber strat_total;
 	BTScanPosItem *currItem;
+	bool fakeupdposing = false; /* fake variable for _bt_binsrch */
 
 	Assert(!BTScanPosIsValid(so->currPos));
 
@@ -1003,7 +1018,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	so->markItemIndex = -1;		/* ditto */
 
 	/* position to the precise item on the page */
-	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
+	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey, &fakeupdposing);
 
 	/*
 	 * If nextkey = false, we are positioned at the first item >= scan key, or
@@ -1161,6 +1176,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	int			itemIndex;
 	IndexTuple	itup;
 	bool		continuescan;
+	int i;
 
 	/*
 	 * We must have the buffer pinned and locked, but the usual macro can't be
@@ -1195,6 +1211,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	/* initialize tuple workspace to empty */
 	so->currPos.nextTupleOffset = 0;
+	so->currPos.prevTupleOffset = 0;
 
 	/*
 	 * Now that the current page has been made consistent, the macro should be
@@ -1215,8 +1232,19 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (itup != NULL)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				if (BtreeTupleIsPosting(itup))
+				{
+					for (i = 0; i < BtreeGetNPosting(itup); i++)
+					{
+						_bt_savePostingitem(so, itemIndex, offnum, BtreeGetPostingN(itup, i), itup, i);
+						itemIndex++;
+					}
+				}
+				else
+				{
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
 			}
 			if (!continuescan)
 			{
@@ -1228,7 +1256,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			offnum = OffsetNumberNext(offnum);
 		}
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxPackedIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1236,7 +1264,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxPackedIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1246,8 +1274,20 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (itup != NULL)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				if (BtreeTupleIsPosting(itup))
+				{
+					for (i = 0; i < BtreeGetNPosting(itup); i++)
+					{
+						itemIndex--;
+						_bt_savePostingitem(so, itemIndex, offnum, BtreeGetPostingN(itup, i), itup, i);
+					}
+				}
+				else
+				{
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+
 			}
 			if (!continuescan)
 			{
@@ -1261,8 +1301,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxPackedIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxPackedIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1275,6 +1315,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert (!BtreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1288,6 +1330,37 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 }
 
 /*
+ * Save an index item into so->currPos.items[itemIndex]
+ * Performing index-only scan, handle the first elem separately.
+ * Save the key once, and connect it with posting tids using tupleOffset.
+ */
+static void
+_bt_savePostingitem(BTScanOpaque so, int itemIndex,
+			 OffsetNumber offnum, ItemPointer iptr, IndexTuple itup, int i)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *iptr;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		if (i == 0)
+		{
+			/* save key. the same for all tuples in the posting */
+			Size		itupsz = BtreeGetPostingOffset(itup);
+			currItem->tupleOffset = so->currPos.nextTupleOffset;
+			memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+			so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+			so->currPos.prevTupleOffset = currItem->tupleOffset;
+		}
+		else
+			currItem->tupleOffset = so->currPos.prevTupleOffset;
+	}
+}
+
+
+/*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
  * On entry, if so->currPos.buf is valid the buffer is pinned but not locked;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 99a014e..e29d63f 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -75,6 +75,7 @@
 #include "utils/rel.h"
 #include "utils/sortsupport.h"
 #include "utils/tuplesort.h"
+#include "catalog/catalog.h"
 
 
 /*
@@ -527,15 +528,120 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(last_off > P_FIRSTKEY);
 		ii = PageGetItemId(opage, last_off);
 		oitup = (IndexTuple) PageGetItem(opage, ii);
-		_bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
 
 		/*
-		 * Move 'last' into the high key position on opage
+		 * If the item is PostingTuple, we can cut it.
+		 * Because HIKEY is not considered as real data, and it needn't to keep any ItemPointerData at all.
+		 * And of course it needn't to keep a list of ipd.
+		 * But, if it had a big posting list, there will be plenty of free space on the opage.
+		 * So we must split Posting tuple into 2 pieces.
 		 */
-		hii = PageGetItemId(opage, P_HIKEY);
-		*hii = *ii;
-		ItemIdSetUnused(ii);	/* redundant */
-		((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+		 if (BtreeTupleIsPosting(oitup))
+		 {
+			int nipd, ntocut, ntoleave;
+			Size keytupsz;
+			IndexTuple keytup;
+			nipd = BtreeGetNPosting(oitup);
+			ntocut = (sizeof(ItemIdData) + BtreeGetPostingOffset(oitup))/sizeof(ItemPointerData);
+			ntocut++; /* round up to be sure that we cut enough */
+			ntoleave = nipd - ntocut;
+
+			/*
+			 * 0) Form key tuple, that doesn't contain any ipd.
+			 * NOTE: key tuple will have blkno & offset suitable for P_HIKEY.
+			 * any function that uses keytup should handle them itself.
+			 */
+			keytupsz =  BtreeGetPostingOffset(oitup);
+			keytup = palloc0(keytupsz);
+			memcpy (keytup, oitup, keytupsz);
+			keytup->t_info &= ~INDEX_SIZE_MASK;
+			keytup->t_info |= keytupsz;
+			ItemPointerSet(&(keytup->t_tid), oblkno, P_HIKEY);
+
+			if (ntocut < nipd)
+			{
+				ItemPointerData *newipd;
+				IndexTuple newitup, newlasttup;
+				/*
+				 * 1) Cut part of old tuple to shift to npage.
+				 * And insert it as P_FIRSTKEY.
+				 * This tuple is based on keytup.
+				 * Blkno & offnum are reset in BtreeFormPackedTuple.
+				 */
+				newipd = palloc0(sizeof(ItemPointerData)*ntocut);
+				/* Note, that we cut last 'ntocut' items */
+				memcpy(newipd, BtreeGetPosting(oitup)+ntoleave, sizeof(ItemPointerData)*ntocut);
+				newitup = BtreeFormPackedTuple(keytup, newipd, ntocut);
+
+				_bt_sortaddtup(npage, IndexTupleSize(newitup), newitup, P_FIRSTKEY);
+				pfree(newipd);
+				pfree(newitup);
+
+				/*
+				 * 2) set last item to the P_HIKEY linp
+				 * Move 'last' into the high key position on opage
+				 * NOTE: Do this because of indextuple deletion algorithm, which
+				 * doesn't allow to delete an item while we have unused one before it.
+				 */
+				hii = PageGetItemId(opage, P_HIKEY);
+				*hii = *ii;
+				ItemIdSetUnused(ii);	/* redundant */
+				((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+
+				/* 3) delete "wrong" high key */
+				PageIndexTupleDelete(opage, P_HIKEY);
+
+				/* 4)Insert keytup as P_HIKEY. */
+				_bt_sortaddtup(opage, IndexTupleSize(keytup), keytup,  P_HIKEY);
+
+				/* 5) form the part of old tuple with ntoleave ipds. And insert it as last tuple. */
+				newlasttup = BtreeFormPackedTuple(keytup, BtreeGetPosting(oitup), ntoleave);
+
+				_bt_sortaddtup(opage, IndexTupleSize(newlasttup), newlasttup, PageGetMaxOffsetNumber(opage)+1);
+
+				pfree(newlasttup);
+			}
+			else
+			{
+				/* The tuple isn't big enough to split it. Handle it as a normal tuple. */
+
+				/*
+				 * 1) Shift the last tuple to npage.
+				 * Insert it as P_FIRSTKEY.
+				 */
+				_bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
+
+				/* 2) set last item to the P_HIKEY linp */
+				/* Move 'last' into the high key position on opage */
+				hii = PageGetItemId(opage, P_HIKEY);
+				*hii = *ii;
+				ItemIdSetUnused(ii);	/* redundant */
+				((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+
+				/* 3) delete "wrong" high key */
+				PageIndexTupleDelete(opage, P_HIKEY);
+
+				/* 4)Insert keytup as P_HIKEY. */
+				_bt_sortaddtup(opage, IndexTupleSize(keytup), keytup,  P_HIKEY);
+
+			}
+			pfree(keytup);
+		 }
+		 else
+		 {
+			/*
+			 * 1) Shift the last tuple to npage.
+			 * Insert it as P_FIRSTKEY.
+			 */
+			_bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
+
+			/* 2) set last item to the P_HIKEY linp */
+			/* Move 'last' into the high key position on opage */
+			hii = PageGetItemId(opage, P_HIKEY);
+			*hii = *ii;
+			ItemIdSetUnused(ii);	/* redundant */
+			((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+		}
 
 		/*
 		 * Link the old page into its parent, using its minimum key. If we
@@ -547,6 +653,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 
 		Assert(state->btps_minkey != NULL);
 		ItemPointerSet(&(state->btps_minkey->t_tid), oblkno, P_HIKEY);
+
 		_bt_buildadd(wstate, state->btps_next, state->btps_minkey);
 		pfree(state->btps_minkey);
 
@@ -555,7 +662,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		 * it off the old page, not the new one, in case we are not at leaf
 		 * level.
 		 */
-		state->btps_minkey = CopyIndexTuple(oitup);
+		ItemId iihk = PageGetItemId(opage, P_HIKEY);
+		IndexTuple hikey = (IndexTuple) PageGetItem(opage, iihk);
+		state->btps_minkey = CopyIndexTuple(hikey);
 
 		/*
 		 * Set the sibling links for both pages.
@@ -590,7 +699,29 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	if (last_off == P_HIKEY)
 	{
 		Assert(state->btps_minkey == NULL);
-		state->btps_minkey = CopyIndexTuple(itup);
+
+		if (BtreeTupleIsPosting(itup))
+		{
+			Size keytupsz;
+			IndexTuple keytup;
+
+			/*
+			 * 0) Form key tuple, that doesn't contain any ipd.
+			 * NOTE: key tuple will have blkno & offset suitable for P_HIKEY.
+			 * any function that uses keytup should handle them itself.
+			 */
+			keytupsz =  BtreeGetPostingOffset(itup);
+			keytup = palloc0(keytupsz);
+			memcpy (keytup, itup, keytupsz);
+
+			keytup->t_info &= ~INDEX_SIZE_MASK;
+			keytup->t_info |= keytupsz;
+			ItemPointerSet(&(keytup->t_tid), nblkno, P_HIKEY);
+
+			state->btps_minkey = CopyIndexTuple(keytup);
+		}
+		else
+			state->btps_minkey = CopyIndexTuple(itup);
 	}
 
 	/*
@@ -670,6 +801,67 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 }
 
 /*
+ * Prepare SortSupport structure for indextuples comparison
+ */
+SortSupport
+_bt_prepare_SortSupport(BTWriteState *wstate, int keysz)
+{
+	/* Prepare SortSupport data for each column */
+	ScanKey		indexScanKey = _bt_mkscankey_nodata(wstate->index);
+	SortSupport sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
+	int i;
+
+	for (i = 0; i < keysz; i++)
+	{
+		SortSupport sortKey = sortKeys + i;
+		ScanKey		scanKey = indexScanKey + i;
+		int16		strategy;
+
+		sortKey->ssup_cxt = CurrentMemoryContext;
+		sortKey->ssup_collation = scanKey->sk_collation;
+		sortKey->ssup_nulls_first =
+			(scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
+		sortKey->ssup_attno = scanKey->sk_attno;
+		/* Abbreviation is not supported here */
+		sortKey->abbreviate = false;
+
+		AssertState(sortKey->ssup_attno != 0);
+
+		strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
+			BTGreaterStrategyNumber : BTLessStrategyNumber;
+
+		PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
+	}
+
+	_bt_freeskey(indexScanKey);
+	return sortKeys;
+}
+
+/*
+ * Compare two tuples using sortKey i
+ */
+int _bt_call_comparator(SortSupport sortKeys, int i,
+						 IndexTuple itup, IndexTuple itup2, TupleDesc tupdes)
+{
+		SortSupport entry;
+		Datum		attrDatum1,
+					attrDatum2;
+		bool		isNull1,
+					isNull2;
+		int32		compare;
+
+		entry = sortKeys + i - 1;
+		attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
+		attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+
+		compare = ApplySortComparator(attrDatum1, isNull1,
+										attrDatum2, isNull2,
+										entry);
+
+		return compare;
+}
+
+/*
  * Read tuples in correct sort order from tuplesort, and load them into
  * btree leaves.
  */
@@ -679,16 +871,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	BTPageState *state = NULL;
 	bool		merge = (btspool2 != NULL);
 	IndexTuple	itup,
-				itup2 = NULL;
+				itup2 = NULL,
+				itupprev = NULL;
 	bool		should_free,
 				should_free2,
 				load1;
 	TupleDesc	tupdes = RelationGetDescr(wstate->index);
 	int			i,
 				keysz = RelationGetNumberOfAttributes(wstate->index);
-	ScanKey		indexScanKey = NULL;
+	int			ntuples = 0;
 	SortSupport sortKeys;
 
+	/* Prepare SortSupport data */
+	sortKeys = (SortSupport)_bt_prepare_SortSupport(wstate, keysz);
+
 	if (merge)
 	{
 		/*
@@ -701,34 +897,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 									   true, &should_free);
 		itup2 = tuplesort_getindextuple(btspool2->sortstate,
 										true, &should_free2);
-		indexScanKey = _bt_mkscankey_nodata(wstate->index);
-
-		/* Prepare SortSupport data for each column */
-		sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
-
-		for (i = 0; i < keysz; i++)
-		{
-			SortSupport sortKey = sortKeys + i;
-			ScanKey		scanKey = indexScanKey + i;
-			int16		strategy;
-
-			sortKey->ssup_cxt = CurrentMemoryContext;
-			sortKey->ssup_collation = scanKey->sk_collation;
-			sortKey->ssup_nulls_first =
-				(scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
-			sortKey->ssup_attno = scanKey->sk_attno;
-			/* Abbreviation is not supported here */
-			sortKey->abbreviate = false;
-
-			AssertState(sortKey->ssup_attno != 0);
-
-			strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
-				BTGreaterStrategyNumber : BTLessStrategyNumber;
-
-			PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
-		}
-
-		_bt_freeskey(indexScanKey);
 
 		for (;;)
 		{
@@ -742,20 +910,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			{
 				for (i = 1; i <= keysz; i++)
 				{
-					SortSupport entry;
-					Datum		attrDatum1,
-								attrDatum2;
-					bool		isNull1,
-								isNull2;
-					int32		compare;
-
-					entry = sortKeys + i - 1;
-					attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
-					attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
-
-					compare = ApplySortComparator(attrDatum1, isNull1,
-												  attrDatum2, isNull2,
-												  entry);
+					int32 compare = _bt_call_comparator(sortKeys, i, itup, itup2, tupdes);
+
 					if (compare > 0)
 					{
 						load1 = false;
@@ -794,19 +950,137 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	else
 	{
 		/* merge is unnecessary */
-		while ((itup = tuplesort_getindextuple(btspool->sortstate,
+
+		Relation indexRelation = wstate->index;
+		Form_pg_index index = indexRelation->rd_index;
+
+		if (index->indisunique)
+		{
+			/* Do not use compression for unique indexes. */
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
 											   true, &should_free)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+					state = _bt_pagestate(wstate, 0);
+
+				_bt_buildadd(wstate, state, itup);
+				if (should_free)
+					pfree(itup);
+			}
+		}
+		else
 		{
-			/* When we see first tuple, create first index page */
-			if (state == NULL)
-				state = _bt_pagestate(wstate, 0);
+			ItemPointerData *ipd = NULL;
+			IndexTuple 		postingtuple;
+			Size			maxitemsize = 0,
+							maxpostingsize = 0;
+			int32 			compare = 0;
 
-			_bt_buildadd(wstate, state, itup);
-			if (should_free)
-				pfree(itup);
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+											   true, &should_free)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+				{
+					state = _bt_pagestate(wstate, 0);
+					maxitemsize = BTMaxItemSize(state->btps_page);
+				}
+
+				/*
+				 * Compare current tuple with previous one.
+				 * If tuples are equal, we can unite them into a posting list.
+				 */
+				if (itupprev != NULL)
+				{
+					/* compare tuples */
+					compare = 0;
+					for (i = 1; i <= keysz; i++)
+					{
+						compare = _bt_call_comparator(sortKeys, i, itup, itupprev, tupdes);
+						if (compare != 0)
+							break;
+					}
+
+					if (compare == 0)
+					{
+						/* Tuples are equal. Create or update posting */
+						if (ntuples == 0)
+						{
+							/*
+							 * We haven't suitable posting list yet, so allocate
+							 * it and save both itupprev and current tuple.
+							 */
+
+							ipd = palloc0(maxitemsize);
+
+							memcpy(ipd, itupprev, sizeof(ItemPointerData));
+							ntuples++;
+							memcpy(ipd + ntuples, itup, sizeof(ItemPointerData));
+							ntuples++;
+						}
+						else
+						{
+							if ((ntuples+1)*sizeof(ItemPointerData) < maxpostingsize)
+							{
+								memcpy(ipd + ntuples, itup, sizeof(ItemPointerData));
+								ntuples++;
+							}
+							else
+							{
+								postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+								_bt_buildadd(wstate, state, postingtuple);
+								ntuples = 0;
+								pfree(ipd);
+							}
+						}
+
+					}
+					else
+					{
+						/* Tuples aren't equal. Insert itupprev into index. */
+						if (ntuples == 0)
+							_bt_buildadd(wstate, state, itupprev);
+						else
+						{
+							postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+							_bt_buildadd(wstate, state, postingtuple);
+							ntuples = 0;
+							pfree(ipd);
+						}
+					}
+				}
+
+				/*
+				 * Copy the tuple into temp variable itupprev
+				 * to compare it with the following tuple
+				 * and maybe unite them into a posting tuple
+				 */
+				itupprev = CopyIndexTuple(itup);
+				if (should_free)
+					pfree(itup);
+
+				/* compute max size of ipd list */
+				maxpostingsize = maxitemsize - IndexInfoFindDataOffset(itupprev->t_info) - MAXALIGN(IndexTupleSize(itupprev));
+			}
+
+			/* Handle the last item.*/
+			if (ntuples == 0)
+			{
+				if (itupprev != NULL)
+					_bt_buildadd(wstate, state, itupprev);
+			}
+			else
+			{
+				Assert(ipd!=NULL);
+				Assert(itupprev != NULL);
+				postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+				_bt_buildadd(wstate, state, postingtuple);
+				ntuples = 0;
+				pfree(ipd);
+			}
 		}
 	}
-
 	/* Close down final pages and write the metapage */
 	_bt_uppershutdown(wstate, state);
 
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index c850b48..0291342 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1821,7 +1821,9 @@ _bt_killitems(IndexScanDesc scan)
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			/* No microvacuum for posting tuples */
+			if (!BtreeTupleIsPosting(ituple)
+				&& (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
 			{
 				/* found the item */
 				ItemIdMarkDead(iid);
@@ -2063,3 +2065,71 @@ btoptions(Datum reloptions, bool validate)
 {
 	return default_reloptions(reloptions, validate, RELOPT_KIND_BTREE);
 }
+
+
+/*
+ * Already have basic index tuple that contains key datum
+ */
+IndexTuple
+BtreeFormPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd)
+{
+	int i;
+	uint32		newsize;
+	IndexTuple itup = CopyIndexTuple(tuple);
+
+	/*
+	 * Determine and store offset to the posting list.
+	 */
+	newsize = IndexTupleSize(itup);
+	newsize = SHORTALIGN(newsize);
+
+	/*
+	 * Set meta info about the posting list.
+	 */
+	BtreeSetPostingOffset(itup, newsize);
+	BtreeSetNPosting(itup, nipd);
+	/*
+	 * Add space needed for posting list, if any.  Then check that the tuple
+	 * won't be too big to store.
+	 */
+	newsize += sizeof(ItemPointerData)*nipd;
+	newsize = MAXALIGN(newsize);
+
+	/*
+	 * Resize tuple if needed
+	 */
+	if (newsize != IndexTupleSize(itup))
+	{
+		itup = repalloc(itup, newsize);
+
+		/*
+		 * PostgreSQL 9.3 and earlier did not clear this new space, so we
+		 * might find uninitialized padding when reading tuples from disk.
+		 */
+		memset((char *) itup + IndexTupleSize(itup),
+			   0, newsize - IndexTupleSize(itup));
+		/* set new size in tuple header */
+		itup->t_info &= ~INDEX_SIZE_MASK;
+		itup->t_info |= newsize;
+	}
+
+	/*
+	 * Copy data into the posting tuple
+	 */
+	memcpy(BtreeGetPosting(itup), data, sizeof(ItemPointerData)*nipd);
+	return itup;
+}
+
+IndexTuple
+BtreeReformPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd)
+{
+	int size;
+	if (BtreeTupleIsPosting(tuple))
+	{
+		size = BtreeGetPostingOffset(tuple);
+		tuple->t_info &= ~INDEX_SIZE_MASK;
+		tuple->t_info |= size;
+	}
+
+	return BtreeFormPackedTuple(tuple, data, nipd);
+}
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 8350fa0..eb4467a 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -137,7 +137,12 @@ typedef IndexAttributeBitMapData *IndexAttributeBitMap;
 #define MaxIndexTuplesPerPage	\
 	((int) ((BLCKSZ - SizeOfPageHeaderData) / \
 			(MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))))
-
+#define MaxPackedIndexTuplesPerPage	\
+	((int) ((BLCKSZ - SizeOfPageHeaderData) / \
+			(sizeof(ItemPointerData))))
+// #define MaxIndexTuplesPerPage	\
+// 	((int) ((BLCKSZ - SizeOfPageHeaderData) / \
+// 			(sizeof(ItemPointerData))))
 
 /* routines in indextuple.c */
 extern IndexTuple index_form_tuple(TupleDesc tupleDescriptor,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 06822fa..41e407d 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -75,6 +75,7 @@ typedef BTPageOpaqueData *BTPageOpaque;
 #define BTP_SPLIT_END	(1 << 5)	/* rightmost page of split group */
 #define BTP_HAS_GARBAGE (1 << 6)	/* page has LP_DEAD tuples */
 #define BTP_INCOMPLETE_SPLIT (1 << 7)	/* right sibling's downlink is missing */
+#define BTP_HAS_POSTING (1 << 8)		/* page contains compressed duplicates (only for leaf pages) */
 
 /*
  * The max allowed value of a cycle ID is a bit less than 64K.  This is
@@ -181,6 +182,8 @@ typedef struct BTMetaPageData
 #define P_IGNORE(opaque)		((opaque)->btpo_flags & (BTP_DELETED|BTP_HALF_DEAD))
 #define P_HAS_GARBAGE(opaque)	((opaque)->btpo_flags & BTP_HAS_GARBAGE)
 #define P_INCOMPLETE_SPLIT(opaque)	((opaque)->btpo_flags & BTP_INCOMPLETE_SPLIT)
+#define P_HAS_POSTING(opaque)		((opaque)->btpo_flags & BTP_HAS_POSTING)
+
 
 /*
  *	Lehman and Yao's algorithm requires a ``high key'' on every non-rightmost
@@ -538,6 +541,8 @@ typedef struct BTScanPosData
 	 * location in the associated tuple storage workspace.
 	 */
 	int			nextTupleOffset;
+	/* prevTupleOffset is for Posting list handling*/
+	int			prevTupleOffset;
 
 	/*
 	 * The items array is always ordered in index order (ie, increasing
@@ -550,7 +555,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxPackedIndexTuplesPerPage]; /* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -651,6 +656,28 @@ typedef BTScanOpaqueData *BTScanOpaque;
 #define SK_BT_DESC			(INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
 #define SK_BT_NULLS_FIRST	(INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
 
+
+/*
+ * We use our own ItemPointerGet(BlockNumber|OffsetNumber)
+ * to avoid Asserts, since sometimes the ip_posid isn't "valid"
+ */
+#define BtreeItemPointerGetBlockNumber(pointer) \
+	BlockIdGetBlockNumber(&(pointer)->ip_blkid)
+
+#define BtreeItemPointerGetOffsetNumber(pointer) \
+	((pointer)->ip_posid)
+
+#define BT_POSTING (1<<31)
+#define BtreeGetNPosting(itup)			BtreeItemPointerGetOffsetNumber(&(itup)->t_tid)
+#define BtreeSetNPosting(itup,n)		ItemPointerSetOffsetNumber(&(itup)->t_tid,n)
+
+#define BtreeGetPostingOffset(itup)		(BtreeItemPointerGetBlockNumber(&(itup)->t_tid) & (~BT_POSTING))
+#define BtreeSetPostingOffset(itup,n)	ItemPointerSetBlockNumber(&(itup)->t_tid,(n)|BT_POSTING)
+#define BtreeTupleIsPosting(itup)    	(BtreeItemPointerGetBlockNumber(&(itup)->t_tid) & BT_POSTING)
+#define BtreeGetPosting(itup)			(ItemPointerData*) ((char*)(itup) + BtreeGetPostingOffset(itup))
+#define BtreeGetPostingN(itup,n)		(ItemPointerData*) (BtreeGetPosting(itup) + n)
+
+
 /*
  * prototypes for functions in nbtree.c (external entry points for btree)
  */
@@ -715,8 +742,8 @@ extern BTStack _bt_search(Relation rel,
 extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
 			  ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
 			  int access);
-extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
-			ScanKey scankey, bool nextkey);
+extern OffsetNumber _bt_binsrch( Relation rel, Buffer buf, int keysz,
+								ScanKey scankey, bool nextkey, bool* updposting);
 extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
 			Page page, OffsetNumber offnum);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
@@ -747,6 +774,8 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
 extern Size BTreeShmemSize(void);
 extern void BTreeShmemInit(void);
 extern bytea *btoptions(Datum reloptions, bool validate);
+extern IndexTuple BtreeFormPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd);
+extern IndexTuple BtreeReformPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd);
 
 /*
  * prototypes for functions in nbtvalidate.c

#20

Thom Brown

thom@linux.com

almost 10 years ago

In reply to: Anastasia Lubennikova (#19)

Re: [WIP] Effective storage of duplicates in B-tree index.

On 29 January 2016 at 16:50, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

29.01.2016 19:01, Thom Brown:

On 29 January 2016 at 15:47, Aleksander Alekseev
<a.alekseev@postgrespro.ru> wrote:

I tested this patch on x64 and ARM servers for a few hours today. The
only problem I could find is that INSERT works considerably slower after
applying a patch. Beside that everything looks fine - no crashes, tests
pass, memory doesn't seem to leak, etc.

Thank you for testing. I rechecked that, and insertions are really very very
very slow. It seems like a bug.

Okay, now for some badness. I've restored a database containing 2
tables, one 318MB, another 24kB. The 318MB table contains 5 million
rows with a sequential id column. I get a problem if I try to delete
many rows from it:
# delete from contacts where id % 3 != 0 ;
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory

I didn't manage to reproduce this. Thom, could you describe exact steps
to reproduce this issue please?

Sure, I used my pg_rep_test tool to create a primary (pg_rep_test
-r0), which creates an instance with a custom config, which is as
follows:

shared_buffers = 8MB
max_connections = 7
wal_level = 'hot_standby'
cluster_name = 'primary'
max_wal_senders = 3
wal_keep_segments = 6

Then create a pgbench data set (I didn't originally use pgbench, but
you can get the same results with it):

createdb -p 5530 pgbench
pgbench -p 5530 -i -s 100 pgbench

And delete some stuff:

thom@swift:~/Development/test$ psql -p 5530 pgbench
Timing is on.
psql (9.6devel)
Type "help" for help.

➤ psql://thom@[local]:5530/pgbench

# DELETE FROM pgbench_accounts WHERE aid % 3 != 0;
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
...
WARNING: out of shared memory
WARNING: out of shared memory
DELETE 6666667
Time: 22218.804 ms

There were 358 lines of that warning message. I don't get these
messages without the patch.

Thom

Thank you for this report.
I tried to reproduce it, but I couldn't. Debug will be much easier now.

I hope I'll fix these issueswithin the next few days.

BTW, I found a dummy mistake, the previous patch contains some unrelated
changes. I fixed it in the new version (attached).

Thanks. Well I've tested this latest patch, and the warnings are no
longer generated. However, the index sizes show that the patch
doesn't seem to be doing its job, so I'm wondering if you removed too
much from it.

Thom

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21

Anastasia Lubennikova

a.lubennikova@postgrespro.ru

almost 10 years ago

In reply to: Thom Brown (#20)

Re: [WIP] Effective storage of duplicates in B-tree index.

29.01.2016 20:43, Thom Brown:

On 29 January 2016 at 16:50, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

29.01.2016 19:01, Thom Brown:

On 29 January 2016 at 15:47, Aleksander Alekseev
<a.alekseev@postgrespro.ru> wrote:

I tested this patch on x64 and ARM servers for a few hours today. The
only problem I could find is that INSERT works considerably slower after
applying a patch. Beside that everything looks fine - no crashes, tests
pass, memory doesn't seem to leak, etc.

Thank you for testing. I rechecked that, and insertions are really very very
very slow. It seems like a bug.

Okay, now for some badness. I've restored a database containing 2
tables, one 318MB, another 24kB. The 318MB table contains 5 million
rows with a sequential id column. I get a problem if I try to delete
many rows from it:
# delete from contacts where id % 3 != 0 ;
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory

I didn't manage to reproduce this. Thom, could you describe exact steps
to reproduce this issue please?

Sure, I used my pg_rep_test tool to create a primary (pg_rep_test
-r0), which creates an instance with a custom config, which is as
follows:

shared_buffers = 8MB
max_connections = 7
wal_level = 'hot_standby'
cluster_name = 'primary'
max_wal_senders = 3
wal_keep_segments = 6

Then create a pgbench data set (I didn't originally use pgbench, but
you can get the same results with it):

createdb -p 5530 pgbench
pgbench -p 5530 -i -s 100 pgbench

And delete some stuff:

thom@swift:~/Development/test$ psql -p 5530 pgbench
Timing is on.
psql (9.6devel)
Type "help" for help.

➤ psql://thom@[local]:5530/pgbench

# DELETE FROM pgbench_accounts WHERE aid % 3 != 0;
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
...
WARNING: out of shared memory
WARNING: out of shared memory
DELETE 6666667
Time: 22218.804 ms

There were 358 lines of that warning message. I don't get these
messages without the patch.

Thom

Thank you for this report.
I tried to reproduce it, but I couldn't. Debug will be much easier now.

I hope I'll fix these issueswithin the next few days.

BTW, I found a dummy mistake, the previous patch contains some unrelated
changes. I fixed it in the new version (attached).

Thanks. Well I've tested this latest patch, and the warnings are no
longer generated. However, the index sizes show that the patch
doesn't seem to be doing its job, so I'm wondering if you removed too
much from it.

Huh, this patch seems to be enchanted) It works fine for me. Did you
perform "make distclean"?
Anyway, I'll send a new version soon.
I just write here to say that I do not disappear and I do remember about
the issue.
I even almost fixed the insert speed problem. But I'm very very busy
this week.
I'll send an updated patch next week as soon as possible.

Thank you for attention to this work.

--
Anastasia Lubennikova
Postgres Professional:http://www.postgrespro.com
The Russian Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#22

Thom Brown

thom@linux.com

almost 10 years ago

In reply to: Anastasia Lubennikova (#21)

Re: [WIP] Effective storage of duplicates in B-tree index.

On 2 February 2016 at 11:47, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

29.01.2016 20:43, Thom Brown:

On 29 January 2016 at 16:50, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

29.01.2016 19:01, Thom Brown:

On 29 January 2016 at 15:47, Aleksander Alekseev
<a.alekseev@postgrespro.ru> wrote:

I tested this patch on x64 and ARM servers for a few hours today. The
only problem I could find is that INSERT works considerably slower
after
applying a patch. Beside that everything looks fine - no crashes, tests
pass, memory doesn't seem to leak, etc.

Thank you for testing. I rechecked that, and insertions are really very
very
very slow. It seems like a bug.

Okay, now for some badness. I've restored a database containing 2
tables, one 318MB, another 24kB. The 318MB table contains 5 million
rows with a sequential id column. I get a problem if I try to delete
many rows from it:
# delete from contacts where id % 3 != 0 ;
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory

I didn't manage to reproduce this. Thom, could you describe exact steps
to reproduce this issue please?

Sure, I used my pg_rep_test tool to create a primary (pg_rep_test
-r0), which creates an instance with a custom config, which is as
follows:

shared_buffers = 8MB
max_connections = 7
wal_level = 'hot_standby'
cluster_name = 'primary'
max_wal_senders = 3
wal_keep_segments = 6

Then create a pgbench data set (I didn't originally use pgbench, but
you can get the same results with it):

createdb -p 5530 pgbench
pgbench -p 5530 -i -s 100 pgbench

And delete some stuff:

thom@swift:~/Development/test$ psql -p 5530 pgbench
Timing is on.
psql (9.6devel)
Type "help" for help.

➤ psql://thom@[local]:5530/pgbench

# DELETE FROM pgbench_accounts WHERE aid % 3 != 0;
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
...
WARNING: out of shared memory
WARNING: out of shared memory
DELETE 6666667
Time: 22218.804 ms

There were 358 lines of that warning message. I don't get these
messages without the patch.

Thom

Thank you for this report.
I tried to reproduce it, but I couldn't. Debug will be much easier now.

I hope I'll fix these issueswithin the next few days.

BTW, I found a dummy mistake, the previous patch contains some unrelated
changes. I fixed it in the new version (attached).

Thanks. Well I've tested this latest patch, and the warnings are no
longer generated. However, the index sizes show that the patch
doesn't seem to be doing its job, so I'm wondering if you removed too
much from it.

Huh, this patch seems to be enchanted) It works fine for me. Did you perform
"make distclean"?

Yes. Just tried it again:

git clean -fd
git stash
make distclean
patch -p1 < ~/Downloads/btree_compression_2.0.patch
../dopg.sh (script I've always used to build with)
pg_ctl start
createdb pgbench
pgbench -i -s 100 pgbench

$ psql pgbench
Timing is on.
psql (9.6devel)
Type "help" for help.

➤ psql://thom@[local]:5488/pgbench

Previously, this would show an index size of 87MB for pgbench_accounts_pkey.

Anyway, I'll send a new version soon.
I just write here to say that I do not disappear and I do remember about the
issue.
I even almost fixed the insert speed problem. But I'm very very busy this
week.
I'll send an updated patch next week as soon as possible.

Thanks.

Thank you for attention to this work.

Thanks for your awesome patches.

Thom

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#23

Peter Geoghegan

pg@heroku.com

almost 10 years ago

In reply to: Thom Brown (#22)

Re: [WIP] Effective storage of duplicates in B-tree index.

On Tue, Feb 2, 2016 at 3:59 AM, Thom Brown <thom@linux.com> wrote:

public | pgbench_accounts_pkey | index | thom | pgbench_accounts | 214 MB |
public | pgbench_branches_pkey | index | thom | pgbench_branches | 24 kB |
public | pgbench_tellers_pkey | index | thom | pgbench_tellers | 48 kB |

I see the same.

I use my regular SQL query to see the breakdown of leaf/internal/root pages:

postgres=# with tots as (
SELECT count(*) c,
avg(live_items) avg_live_items,
avg(dead_items) avg_dead_items,
u.type,
r.oid
from (select c.oid,
c.relpages,
generate_series(1, c.relpages - 1) i
from pg_index i
join pg_opclass op on i.indclass[0] = op.oid
join pg_am am on op.opcmethod = am.oid
join pg_class c on i.indexrelid = c.oid
where am.amname = 'btree') r,
lateral (select * from bt_page_stats(r.oid::regclass::text, i)) u
group by r.oid, type)
select ct.relname table_name,
tots.oid::regclass::text index_name,
(select relpages - 1 from pg_class c where c.oid = tots.oid) non_meta_pages,
upper(type) page_type,
c npages,
to_char(avg_live_items, '990.999'),
to_char(avg_dead_items, '990.999'),
to_char(c/sum(c) over(partition by tots.oid) * 100, '990.999') || '
%' as prop_of_index
from tots
join pg_index i on i.indexrelid = tots.oid
join pg_class ct on ct.oid = i.indrelid
where tots.oid = 'pgbench_accounts_pkey'::regclass
order by ct.relnamespace, table_name, index_name, npages, type;
table_name │ index_name │ non_meta_pages │ page_type
│ npages │ to_char │ to_char │ prop_of_index
──────────────────┼───────────────────────┼────────────────┼───────────┼────────┼──────────┼──────────┼───────────────
pgbench_accounts │ pgbench_accounts_pkey │ 27,421 │ R
│ 1 │ 97.000 │ 0.000 │ 0.004 %
pgbench_accounts │ pgbench_accounts_pkey │ 27,421 │ I
│ 97 │ 282.670 │ 0.000 │ 0.354 %
pgbench_accounts │ pgbench_accounts_pkey │ 27,421 │ L
│ 27,323 │ 366.992 │ 0.000 │ 99.643 %
(3 rows)

But this looks healthy -- I see the same with master. And since the
accounts table is listed as 1281 MB, this looks like a plausible ratio
in the size of the table to its primary index (which I would not say
is true of an 87MB primary key index).

Are you sure you have the details right, Thom?
--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#24

Thom Brown

thom@linux.com

almost 10 years ago

In reply to: Peter Geoghegan (#23)

Re: [WIP] Effective storage of duplicates in B-tree index.

On 4 February 2016 at 15:07, Peter Geoghegan <pg@heroku.com> wrote:

On Tue, Feb 2, 2016 at 3:59 AM, Thom Brown <thom@linux.com> wrote:

public | pgbench_accounts_pkey | index | thom | pgbench_accounts | 214 MB |
public | pgbench_branches_pkey | index | thom | pgbench_branches | 24 kB |
public | pgbench_tellers_pkey | index | thom | pgbench_tellers | 48 kB |

I see the same.

I use my regular SQL query to see the breakdown of leaf/internal/root pages:

postgres=# with tots as (
SELECT count(*) c,
avg(live_items) avg_live_items,
avg(dead_items) avg_dead_items,
u.type,
r.oid
from (select c.oid,
c.relpages,
generate_series(1, c.relpages - 1) i
from pg_index i
join pg_opclass op on i.indclass[0] = op.oid
join pg_am am on op.opcmethod = am.oid
join pg_class c on i.indexrelid = c.oid
where am.amname = 'btree') r,
lateral (select * from bt_page_stats(r.oid::regclass::text, i)) u
group by r.oid, type)
select ct.relname table_name,
tots.oid::regclass::text index_name,
(select relpages - 1 from pg_class c where c.oid = tots.oid) non_meta_pages,
upper(type) page_type,
c npages,
to_char(avg_live_items, '990.999'),
to_char(avg_dead_items, '990.999'),
to_char(c/sum(c) over(partition by tots.oid) * 100, '990.999') || '
%' as prop_of_index
from tots
join pg_index i on i.indexrelid = tots.oid
join pg_class ct on ct.oid = i.indrelid
where tots.oid = 'pgbench_accounts_pkey'::regclass
order by ct.relnamespace, table_name, index_name, npages, type;
table_name │ index_name │ non_meta_pages │ page_type
│ npages │ to_char │ to_char │ prop_of_index
──────────────────┼───────────────────────┼────────────────┼───────────┼────────┼──────────┼──────────┼───────────────
pgbench_accounts │ pgbench_accounts_pkey │ 27,421 │ R
│ 1 │ 97.000 │ 0.000 │ 0.004 %
pgbench_accounts │ pgbench_accounts_pkey │ 27,421 │ I
│ 97 │ 282.670 │ 0.000 │ 0.354 %
pgbench_accounts │ pgbench_accounts_pkey │ 27,421 │ L
│ 27,323 │ 366.992 │ 0.000 │ 99.643 %
(3 rows)

But this looks healthy -- I see the same with master. And since the
accounts table is listed as 1281 MB, this looks like a plausible ratio
in the size of the table to its primary index (which I would not say
is true of an 87MB primary key index).

Are you sure you have the details right, Thom?

*facepalm*

No, I'm not. I've just realised that all I've been checking is the
primary key expecting it to change in size, which is, of course,
nonsense. I should have been creating an index on the bid field of
pgbench_accounts and reviewing the size of that.

Now I've checked it with the latest patch, and can see it working
fine. Apologies for the confusion.

Thom

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#25

Peter Geoghegan

pg@heroku.com

almost 10 years ago

In reply to: Thom Brown (#24)

Re: [WIP] Effective storage of duplicates in B-tree index.

On Thu, Feb 4, 2016 at 8:25 AM, Thom Brown <thom@linux.com> wrote:

No, I'm not. I've just realised that all I've been checking is the
primary key expecting it to change in size, which is, of course,
nonsense. I should have been creating an index on the bid field of
pgbench_accounts and reviewing the size of that.

Right. Because, apart from everything else, unique indexes are not
currently supported.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#26

Peter Geoghegan

pg@heroku.com

almost 10 years ago

In reply to: Anastasia Lubennikova (#19)

Re: [WIP] Effective storage of duplicates in B-tree index.

On Fri, Jan 29, 2016 at 8:50 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

I fixed it in the new version (attached).

Some quick remarks on your V2.0:

* Seems unnecessary that _bt_binsrch() is passed a real pointer by all
callers. Maybe the one current posting list caller
_bt_findinsertloc(), or its caller, _bt_doinsert(), should do this
work itself:

@@ -373,7 +377,17 @@ _bt_binsrch(Relation rel,
     * scan key), which could be the last slot + 1.
     */
    if (P_ISLEAF(opaque))
+   {
+       if (low <= PageGetMaxOffsetNumber(page))
+       {
+           IndexTuple oitup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, low));
+           /* one excessive check of equality. for possible posting
tuple update or creation */
+           if ((_bt_compare(rel, keysz, scankey, page, low) == 0)
+               && (IndexTupleSize(oitup) + sizeof(ItemPointerData) <
BTMaxItemSize(page)))
+               *updposing = true;
+       }
        return low;
+   }

* ISTM that you should not use _bt_compare() above, in any case. Consider this:

postgres=# select 5.0 = 5.000;
?column?
──────────
t
(1 row)

B-Tree operator class indicates equality here. And yet, users will
expect to see the original value in an index-only scan, including the
trailing zeroes as they were originally input. So this should be a bit
closer to HeapSatisfiesHOTandKeyUpdate() (actually,
heap_tuple_attr_equals()), which looks for strict binary equality for
similar reasons.

* Is this correct?:

@@ -555,7 +662,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState
*state, IndexTuple itup)
         * it off the old page, not the new one, in case we are not at leaf
         * level.
         */
-       state->btps_minkey = CopyIndexTuple(oitup);
+       ItemId iihk = PageGetItemId(opage, P_HIKEY);
+       IndexTuple hikey = (IndexTuple) PageGetItem(opage, iihk);
+       state->btps_minkey = CopyIndexTuple(hikey);

How this code has changed from the master branch is not clear to me.

I understand that this code in incomplete/draft:

+#define MaxPackedIndexTuplesPerPage    \
+   ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
+           (sizeof(ItemPointerData))))

But why is it different to the old (actually unchanged)
MaxIndexTuplesPerPage? I would like to see comments explaining your
understanding, even if they are quite rough. Why did GIN never require
this change to a generic header (itup.h)? Should such a change live in
that generic header file, and not another one more localized to
nbtree?

* More explanation of the design would be nice. I suggest modifying
the nbtree README file, so it's easy to tell what the current design
is. It's hard to follow this from the thread. When I reviewed Heikki's
B-Tree patches from a couple of years ago, we spent ~75% of the time
on design, and only ~25% on code.

* I have a paranoid feeling that the deletion locking protocol
(VACUUMing index tuples concurrently and safely) may need special
consideration here. Basically, with the B-Tree code, there are several
complicated locking protocols, like for page splits, page deletion,
and interlocking with vacuum ("super exclusive lock" stuff). These are
why the B-Tree code is complicated in general, and it's very important
to pin down exactly how we deal with each. Ideally, you'd have an
explanation for why your code was correct in each of these existing
cases (especially deletion). With very complicated and important code
like this, it's often wise to be very clear about when we are talking
about your design, and when we are talking about your code. It's
generally too hard to review both at the same time.

Ideally, when you talk about your design, you'll be able to say things
like "it's clear that this existing thing is correct; at least we have
no complaints from the field. Therefore, it must be true that my new
technique is also correct, because it makes that general situation no
worse". Obviously that kind of rigor is just something we aspire to,
and still fall short of at times. Still, it would be nice to
specifically see a reason why the new code isn't special from the
point of view of the super-exclusive lock thing (which is what I mean
by deletion locking protocol + special consideration). Or why it is
special, but that's okay, or whatever. This style of review is normal
when writing B-Tree code. Some other things don't need this rigor, or
have no invariants that need to be respected/used. Maybe this is
obvious to you already, but it isn't obvious to me.

It's okay if you don't know why, but knowing that you don't have a
strong opinion about something is itself useful information.

* I see you disabled the LP_DEAD thing; why? Just because that made
bugs go away?

* Have you done much stress testing? Using pgbench with many
concurrent VACUUM FREEZE operations would be a good idea, if you
haven't already, because that is insistent about getting super
exclusive locks, unlike regular VACUUM.

* Are you keeping the restriction of 1/3 of a buffer page, but that
just includes the posting list now? That's the kind of detail I'd like
to see in the README now.

* Why not support unique indexes? The obvious answer is that it isn't
worth it, but why? How useful would that be (a bit, just not enough)?
What's the trade-off?

Anyway, this is really cool work; I have often thought that we don't
have nearly enough people thinking about how to optimize B-Tree
indexing. It is hard, but so is anything worthwhile.

That's all I have for now. Just a quick review focused on code and
correctness (and not on the benefits). I want to do more on this,
especially the benefits, because it deserves more attention.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#27

Anastasia Lubennikova

a.lubennikova@postgrespro.ru

almost 10 years ago

In reply to: Peter Geoghegan (#26)

2 attachment(s)

Re: [WIP] Effective storage of duplicates in B-tree index.

04.02.2016 20:16, Peter Geoghegan:

On Fri, Jan 29, 2016 at 8:50 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

I fixed it in the new version (attached).

Thank you for the review.
At last, there is a new patch version 3.0. After some refactoring it
looks much better.
I described all details of the compression in this document
https://goo.gl/50O8Q0
<https://vk.com/away.php?to=https%3A%2F%2Fgoo.gl%2F50O8Q0> (the same
text without pictures is attached in btc_readme_1.0.txt).
Consider it as a rough copy of readme. It contains some notes about
tricky moments of implementation and questions about future work.
Please don't hesitate to comment it.

Some quick remarks on your V2.0:

* Seems unnecessary that _bt_binsrch() is passed a real pointer by all
callers. Maybe the one current posting list caller
_bt_findinsertloc(), or its caller, _bt_doinsert(), should do this
work itself:
@@ -373,7 +377,17 @@ _bt_binsrch(Relation rel,
* scan key), which could be the last slot + 1.
*/
if (P_ISLEAF(opaque))
+   {
+       if (low <= PageGetMaxOffsetNumber(page))
+       {
+           IndexTuple oitup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, low));
+           /* one excessive check of equality. for possible posting
tuple update or creation */
+           if ((_bt_compare(rel, keysz, scankey, page, low) == 0)
+               && (IndexTupleSize(oitup) + sizeof(ItemPointerData) <
BTMaxItemSize(page)))
+               *updposing = true;
+       }
return low;
+   }
* ISTM that you should not use _bt_compare() above, in any case. Consider this:

postgres=# select 5.0 = 5.000;
?column?
──────────
t
(1 row)

B-Tree operator class indicates equality here. And yet, users will
expect to see the original value in an index-only scan, including the
trailing zeroes as they were originally input. So this should be a bit
closer to HeapSatisfiesHOTandKeyUpdate() (actually,
heap_tuple_attr_equals()), which looks for strict binary equality for
similar reasons.

Thank you for the notice. Fixed.

* Is this correct?:

@@ -555,7 +662,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState
*state, IndexTuple itup)
* it off the old page, not the new one, in case we are not at leaf
* level.
*/
-       state->btps_minkey = CopyIndexTuple(oitup);
+       ItemId iihk = PageGetItemId(opage, P_HIKEY);
+       IndexTuple hikey = (IndexTuple) PageGetItem(opage, iihk);
+       state->btps_minkey = CopyIndexTuple(hikey);

How this code has changed from the master branch is not clear to me.

Yes, it is. I completed the comment above.

I understand that this code in incomplete/draft:
+#define MaxPackedIndexTuplesPerPage    \
+   ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
+           (sizeof(ItemPointerData))))
But why is it different to the old (actually unchanged)
MaxIndexTuplesPerPage? I would like to see comments explaining your
understanding, even if they are quite rough. Why did GIN never require
this change to a generic header (itup.h)? Should such a change live in
that generic header file, and not another one more localized to
nbtree?

I agree.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

btc_readme_1.0.patchtext/x-patch; name=btc_readme_1.0.patchDownload

Compression. To be correct, it’s not actually compression, but just effective layout of ItemPointers on an index page.
compressed tuple  = IndexTuple (with metadata in TID field+ key) + PostingList


1. Gin index fits extremely good for really large sets of repeating keys, but on the other hand, it completely fails to handle unique keys. To btree it is essential to have good performance and concurrency in any corner cases with any number of duplicates. That’s why we can’t just copy the gin implementation of  item pointers compression. The first difference is that btree algorithm performs compression (or, in other words, changes index tuple layout) only if there’s more than one tuple with this key. It allows us to avoid the overhead of storing useless metadata for mostly different keys (see picture below). It seems that compression could be useful for unique indexes under heavy write/update load (because of MVCC copies), but I don’t sure whether this use-case really exists. Those tuples should be deleted by microvacuum as soon as possible. Anyway, I think that it’s worth to add storage_parameter for btree which enables/disables compression for each particular index. And set compression of unique indexes to off by default. System indexes do not support compression for several reasons. First of all because of WIP state of the patch (debugging system catalog isn’t a big pleasure). The next reason is that I know many places in the code where hardcode or some non-obvious syscache routines are used. I do not feel brave enough to change this code. And last but not least, I don’t see good reasons to do that.

2. If the index key is very small (smaller than metadata) and the number of duplicates is small, compression could lead to index bloat instead of index size decrease (see picture below). I don’t sure whether it’s worth to handle this case separately because it’s really rare and I consider that it’s the DBA’s job to disable compression on such indexes. But if you see any clear way to do this, it would be great.

3. For GIN indexes, if a posting list is too large, a posting tree is created. It proceeded on assumptions that:
Indexed keys are never deleted. It makes all tree algorithms much easier.
There are always many duplicates. Otherwise, gin becomes really inefficient.
There’s no big concurrent rate. In order to add a new entry into a posting tree, we hold a lock on its root, so only 1 backend at a time can perform insertion. 

In btree we can’t afford these assumptions. So we should handle big posting lists in another way. If there are too many ItemPointers to fit into a single posting list, we will just create another one. The overhead of this approach is that we have to store a duplicate of the key and metadata. It leads to the problem of big keys. If the keysize is close to BTMaxItemSize, compression will give us really small benefit, if any at all (see picture below).

4. The more item pointers fit into the single posting list, the rare we have to split it and repeat the key. Therefore, the bigger BTMaxItemSize is the better. The comment in nbtree.h says: “We actually need to be able to fit three items on every page, so restrict any one item to 1/3 the per-page available space.” That is quite right for regular items, but if the index tuple is compressed it already contains more than one item. Taking it into account, we can assert that BTMaxItemSize ~ ⅓ pagesize for regular items, and ~ ½ pagesize for compressed items. Are there any objections? I wonder if we can increase BTMaxItemSize with some other assumption? The problem I see here is that varlena highkey could be as big as the compressed tuple.

5. CREATE INDEX. _bt_load. The algorithm of btree build is following: do the heap scan, add tuples into spool, sort the data, insert ordered data from spool into leaf index pages (_bt_load), build inner pages and root. The main changes are applied to _bt_load function. While loading tuples, we do not insert them one by one, but instead, compare each tuple with the previous one, and if they are equal we put them into posting list. If the posting list is large enough to fit into an index tuple (maxposting size id computed as BTMaxItemSize - size of regular index tuple) or if the following tuple is not equal to the previous, we should create packed tuple using BtreeFormPackedTuple on posting list  (if any) and insert it into a page. The same we do if there are no more elements in the spool.

6. High key is not a real data, but just an upper bound of the keys that allowed on the page. So there’s no need to compress it. While copying a posting tuple into a high key, we should to get rid of posting list. A posting tuple should be truncated to length of a regular tuple, and the metadata in its TID field should be set with appropriate values. It’s worth to mention here a very specific point in _bt_buildadd(). If current page is full (there is no room for a new tuple), we copy the last item on the page into the new page, and then rearrange the old page so that the 'last item' becomes its high key rather than a true data item. If the last tuple was compressed, we can truncate it before setting as a high key. But, if it had a big posting list, there will be plenty of free space on the original page. So we must split Posting tuple into 2 pieces. see the picture below and comments in the code. I’m not sure about correctness of locking here, but I assume that there are no possible concurrent operations while building index. Is it right?

7. Another difference between gin and btree is that item pointers in gin posting list/tree are always ordered while btree doesn’t require this strictly. If there are many duplicates in btree, we don’t bother to find the ideal place to keep TIDs ordered. The insertion has a choice whether or not to move right. Currently, we just try to find a page where there is room for the new key. The next TODO item is to keep item pointers in posting list ordered. The advantage here is that the best compression of posting list could be reached on sorted TIDs.  What do you think about it?

8. Insertion. After we found the sutable place for insertion, check, whether the previous item has the same key. If so, and if there is enough room to add a pointer into the page, we can add it into item. There are two possible cases. If old item is a regular tuple, we should form new compressed tuple. Note, that this case requires to have enough space for two TIDs (metadata and new TID). Otherwise, we just add the pointer into existing posting list. Then delete old tuple and insert the new one.

9. Search. Fortunately, it’s quite easy to change search algorithm. If compressed tuple is found, just go over all TIDs and return them. If an index-only scan is processed, just return the same tuple N times in a row. To avoid storing duplicates in currTuples array, save the key once and then connect it with posting TIDs using tupleOffset. It’s clear that if compression is applied, the page could contain more tuples than if it has only uncompressed tuples. That is why MaxPackedIndexTuplesPerPage appears. Array items (which actually has currTuples and tupleOffset) in BTScanPos is preallocated with length = MaxPackedIndexTuplesPerPage, because we must be sure that all items would fit into the array.

10. Split. The only change in this section is posting list truncation before insert the tuple as a high key.

11. Vacuum. Check all TIDs in a posting list. If there are no live items in the compressed tuple, delete the tuple. Otherwise do the following: form new posting tuple, that contains remaining item pointers;  delete "old" posting;  insert new posting back to the page.  Microvacuum  of compressed tuples is not implemented yet. It’s possible to use high bit of offset field of item pointer to flag killed items. But it requires additional performance testing. 

12. Locking. Compressed index tuples use the same functions of insertion and deletion as regular index tuples. Most of the operations are performed inside standart functions and don’t need any specific locks. Although this issue defenitely requires more properly testing and review. All the operations where posting tuple is updated in place (deleted and then inserted again with new set of item pointers in posting list) are performed with special function _bt_pgupdtup().  As well as operation, where we want to replace one tuple with another one e.g. in btvacuumpage() and _bt_buildadd (see issue related to high key). 

13. Xlog. TODO.

btree_compression_3.0.patchtext/x-patch; name=btree_compression_3.0.patchDownload

diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index e3c55eb..d6922d5 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -24,6 +24,8 @@
 #include "storage/predicate.h"
 #include "utils/tqual.h"
 
+#include "catalog/catalog.h"
+#include "utils/datum.h"
 
 typedef struct
 {
@@ -82,6 +84,7 @@ static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
 static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
 			int keysz, ScanKey scankey);
+
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
 
 
@@ -113,6 +116,11 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	BTStack		stack;
 	Buffer		buf;
 	OffsetNumber offset;
+	Page 		page;
+	TupleDesc	itupdesc;
+	int			nipd;
+	IndexTuple 	olditup;
+	Size 		sizetoadd;
 
 	/* we need an insertion scan key to do our search, so build one */
 	itup_scankey = _bt_mkscankey(rel, itup);
@@ -190,6 +198,7 @@ top:
 
 	if (checkUnique != UNIQUE_CHECK_EXISTING)
 	{
+		bool updposting = false;
 		/*
 		 * The only conflict predicate locking cares about for indexes is when
 		 * an index tuple insert conflicts with an existing lock.  Since the
@@ -201,7 +210,45 @@ top:
 		/* do the insertion */
 		_bt_findinsertloc(rel, &buf, &offset, natts, itup_scankey, itup,
 						  stack, heapRel);
-		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+
+		/*
+		 * Decide, whether we can apply compression
+		 */
+		page = BufferGetPage(buf);
+
+		if(!IsSystemRelation(rel)
+			&& !rel->rd_index->indisunique
+			&& offset != InvalidOffsetNumber
+			&& offset <= PageGetMaxOffsetNumber(page))
+		{
+			itupdesc = RelationGetDescr(rel);
+			sizetoadd = sizeof(ItemPointerData);
+			olditup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offset));
+
+			if(_bt_isbinaryequal(itupdesc, olditup,
+									rel->rd_index->indnatts, itup))
+			{
+				if (!BtreeTupleIsPosting(olditup))
+				{
+					nipd = 1;
+					sizetoadd = sizetoadd*2;
+				}
+				else
+					nipd = BtreeGetNPosting(olditup);
+
+				if ((IndexTupleSize(olditup) + sizetoadd) <= BTMaxItemSize(page)
+					&& PageGetFreeSpace(page) > sizetoadd)
+					updposting = true;
+			}
+		}
+
+		if (updposting)
+		{
+			_bt_pgupdtup(rel, page, offset, itup, true, olditup, nipd);
+			_bt_relbuf(rel, buf);
+		}
+		else
+			_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
 	}
 	else
 	{
@@ -1042,6 +1089,7 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+
 		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
 						false, false) == InvalidOffsetNumber)
 		{
@@ -1072,13 +1120,39 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 	}
-	if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
+
+	if (BtreeTupleIsPosting(item))
+	{
+		Size hikeysize =  BtreeGetPostingOffset(item);
+		IndexTuple hikey = palloc0(hikeysize);
+
+		/* Truncate posting before insert it as a hikey. */
+		memcpy (hikey, item, hikeysize);
+		hikey->t_info &= ~INDEX_SIZE_MASK;
+		hikey->t_info |= hikeysize;
+		ItemPointerSet(&(hikey->t_tid), origpagenumber, P_HIKEY);
+
+		if (PageAddItem(leftpage, (Item) hikey, hikeysize, leftoff,
 					false, false) == InvalidOffsetNumber)
+		{
+			memset(rightpage, 0, BufferGetPageSize(rbuf));
+			elog(ERROR, "failed to add hikey to the left sibling"
+				" while splitting block %u of index \"%s\"",
+				origpagenumber, RelationGetRelationName(rel));
+		}
+
+		pfree(hikey);
+	}
+	else
 	{
-		memset(rightpage, 0, BufferGetPageSize(rbuf));
-		elog(ERROR, "failed to add hikey to the left sibling"
-			 " while splitting block %u of index \"%s\"",
-			 origpagenumber, RelationGetRelationName(rel));
+		if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
+						false, false) == InvalidOffsetNumber)
+		{
+			memset(rightpage, 0, BufferGetPageSize(rbuf));
+			elog(ERROR, "failed to add hikey to the left sibling"
+				" while splitting block %u of index \"%s\"",
+				origpagenumber, RelationGetRelationName(rel));
+		}
 	}
 	leftoff = OffsetNumberNext(leftoff);
 
@@ -2103,6 +2177,76 @@ _bt_pgaddtup(Page page,
 }
 
 /*
+ * _bt_pgupdtup() -- update a tuple in place.
+ * This function is used for purposes of deduplication of item pointers.
+ * If new tuple to insert is equal to the tuple that already exists on the page,
+ * we can avoid key insertion and just add new item pointer.
+ *
+ * offset is the position of olditup on the page.
+ * itup is the new tuple to insert
+ * concat - this flag shows, whether we should add new item to existing one
+ * or just replace old tuple with the new value. If concat is false, the
+ * following fields are senseless.
+ * nipd is the number of item pointers in old tuple.
+ * The caller is responsible for checking of free space on the page.
+ */
+void
+_bt_pgupdtup(Relation rel, Page page, OffsetNumber offset, IndexTuple itup,
+			 bool concat, IndexTuple olditup, int nipd)
+{
+	ItemPointerData *ipd;
+	IndexTuple 		newitup;
+	Size 			newitupsz;
+
+	if (concat)
+	{
+		ipd = palloc0(sizeof(ItemPointerData)*(nipd + 1));
+
+		/* copy item pointers from old tuple into ipd */
+		if (BtreeTupleIsPosting(olditup))
+			memcpy(ipd, BtreeGetPosting(olditup), sizeof(ItemPointerData)*nipd);
+		else
+			memcpy(ipd, olditup, sizeof(ItemPointerData));
+
+		/* add item pointer of the new tuple into ipd */
+		memcpy(ipd+nipd, itup, sizeof(ItemPointerData));
+
+		newitup = BtreeReformPackedTuple(itup, ipd, nipd+1);
+
+		/*
+		* Update the tuple in place. We have already checked that the
+		* new tuple would fit into this page, so it's safe to delete
+		* old tuple and insert the new one without any side effects.
+		*/
+		newitupsz = IndexTupleDSize(*newitup);
+		newitupsz = MAXALIGN(newitupsz);
+	}
+	else
+	{
+		newitup = itup;
+		newitupsz = IndexTupleSize(itup);
+	}
+
+	START_CRIT_SECTION();
+
+	PageIndexTupleDelete(page, offset);
+
+	if (!_bt_pgaddtup(page, newitupsz, newitup, offset))
+		elog(ERROR, "failed to insert compressed item in index \"%s\"",
+			 RelationGetRelationName(rel));
+
+	//TODO add Xlog stuff
+
+	END_CRIT_SECTION();
+
+	if (concat)
+	{
+		pfree(ipd);
+		pfree(newitup);
+	}
+}
+
+/*
  * _bt_isequal - used in _bt_doinsert in check for duplicates.
  *
  * This is very similar to _bt_compare, except for NULL handling.
@@ -2151,6 +2295,63 @@ _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
 }
 
 /*
+ * _bt_isbinaryequal -  used in _bt_doinsert and _bt_load
+ * in check for duplicates. This is very similar to heap_tuple_attr_equals
+ * subroutine. And this function differs from _bt_isequal
+ * because here we require strict binary equality of tuples.
+ */
+bool
+_bt_isbinaryequal(TupleDesc itupdesc, IndexTuple itup,
+			int nindatts, IndexTuple ituptoinsert)
+{
+	AttrNumber	attno;
+
+	for (attno = 1; attno <= nindatts; attno++)
+	{
+		Datum		datum1,
+					datum2;
+		bool		isnull1,
+					isnull2;
+		Form_pg_attribute att;
+
+		datum1 = index_getattr(itup, attno, itupdesc, &isnull1);
+		datum2 = index_getattr(ituptoinsert, attno, itupdesc, &isnull2);
+
+		/*
+		 * If one value is NULL and other is not, then they are certainly not
+		 * equal
+		 */
+		if (isnull1 != isnull2)
+			return false;
+		/*
+		 * We do simple binary comparison of the two datums.  This may be overly
+		 * strict because there can be multiple binary representations for the
+		 * same logical value.  But we should be OK as long as there are no false
+		 * positives.  Using a type-specific equality operator is messy because
+		 * there could be multiple notions of equality in different operator
+		 * classes; furthermore, we cannot safely invoke user-defined functions
+		 * while holding exclusive buffer lock.
+		 */
+		if (attno <= 0)
+		{
+			/* The only allowed system columns are OIDs, so do this */
+			if (DatumGetObjectId(datum1) != DatumGetObjectId(datum2))
+				return false;
+		}
+		else
+		{
+			Assert(attno <= itupdesc->natts);
+			att = itupdesc->attrs[attno - 1];
+			if(!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+				return false;
+		}
+	}
+
+	/* if we get here, the keys are equal */
+	return true;
+}
+
+/*
  * _bt_vacuum_one_page - vacuum just one index page.
  *
  * Try to remove LP_DEAD items from the given page.  The passed buffer
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index f2905cb..a08c500 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -74,7 +74,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 			 BTCycleId cycleid);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 			 BlockNumber orig_blkno);
-
+static ItemPointer btreevacuumPosting(BTVacState *vstate,
+						ItemPointerData *items,int nitem, int *nremaining);
 
 /*
  * Btree handler function: return IndexAmRoutine with access method parameters
@@ -962,6 +963,7 @@ restart:
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
+		IndexTuple	remaining;
 
 		/*
 		 * Trade in the initial read lock for a super-exclusive write lock on
@@ -1011,31 +1013,58 @@ restart:
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
-
-				/*
-				 * During Hot Standby we currently assume that
-				 * XLOG_BTREE_VACUUM records do not produce conflicts. That is
-				 * only true as long as the callback function depends only
-				 * upon whether the index tuple refers to heap tuples removed
-				 * in the initial heap scan. When vacuum starts it derives a
-				 * value of OldestXmin. Backends taking later snapshots could
-				 * have a RecentGlobalXmin with a later xid than the vacuum's
-				 * OldestXmin, so it is possible that row versions deleted
-				 * after OldestXmin could be marked as killed by other
-				 * backends. The callback function *could* look at the index
-				 * tuple state in isolation and decide to delete the index
-				 * tuple, though currently it does not. If it ever did, we
-				 * would need to reconsider whether XLOG_BTREE_VACUUM records
-				 * should cause conflicts. If they did cause conflicts they
-				 * would be fairly harsh conflicts, since we haven't yet
-				 * worked out a way to pass a useful value for
-				 * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
-				 * applies to *any* type of index that marks index tuples as
-				 * killed.
-				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				if(BtreeTupleIsPosting(itup))
+				{
+					ItemPointer newipd;
+					int 		nipd,
+								nnewipd;
+
+					nipd = BtreeGetNPosting(itup);
+					newipd = btreevacuumPosting(vstate, BtreeGetPosting(itup), nipd, &nnewipd);
+
+					if (newipd != NULL)
+					{
+						if (nnewipd > 0)
+						{
+							/* There are still some live tuples in the posting.
+							 * 1) form new posting tuple, that contains remaining ipds
+							 * 2) delete "old" posting and insert new posting back to the page
+							 */
+							remaining = BtreeReformPackedTuple(itup, newipd, nnewipd);
+							_bt_pgupdtup(info->index, page, offnum, remaining, false, NULL, 0);
+						}
+						else
+							deletable[ndeletable++] = offnum;
+					}
+				}
+				else
+				{
+					htup = &(itup->t_tid);
+
+					/*
+					* During Hot Standby we currently assume that
+					* XLOG_BTREE_VACUUM records do not produce conflicts. That is
+					* only true as long as the callback function depends only
+					* upon whether the index tuple refers to heap tuples removed
+					* in the initial heap scan. When vacuum starts it derives a
+					* value of OldestXmin. Backends taking later snapshots could
+					* have a RecentGlobalXmin with a later xid than the vacuum's
+					* OldestXmin, so it is possible that row versions deleted
+					* after OldestXmin could be marked as killed by other
+					* backends. The callback function *could* look at the index
+					* tuple state in isolation and decide to delete the index
+					* tuple, though currently it does not. If it ever did, we
+					* would need to reconsider whether XLOG_BTREE_VACUUM records
+					* should cause conflicts. If they did cause conflicts they
+					* would be fairly harsh conflicts, since we haven't yet
+					* worked out a way to pass a useful value for
+					* latestRemovedXid on the XLOG_BTREE_VACUUM records. This
+					* applies to *any* type of index that marks index tuples as
+					* killed.
+					*/
+					if (callback(htup, callback_state))
+						deletable[ndeletable++] = offnum;
+				}
 			}
 		}
 
@@ -1160,3 +1189,50 @@ btcanreturn(Relation index, int attno)
 {
 	return true;
 }
+
+/*
+ * btreevacuumPosting() -- vacuums a posting list.
+ * The size of the list must be specified via number of items (nitems).
+ *
+ * If none of the items need to be removed, returns NULL. Otherwise returns
+ * a new palloc'd array with the remaining items. The number of remaining
+ * items is returned via nremaining.
+ */
+ItemPointer
+btreevacuumPosting(BTVacState *vstate, ItemPointerData *items,
+				   int nitem, int *nremaining)
+{
+	int			i,
+				remaining = 0;
+	ItemPointer tmpitems = NULL;
+	IndexBulkDeleteCallback callback = vstate->callback;
+	void	   *callback_state = vstate->callback_state;
+
+	/*
+	 * Iterate over TIDs array
+	 */
+	for (i = 0; i < nitem; i++)
+	{
+		if (callback(items + i, callback_state))
+		{
+			if (!tmpitems)
+			{
+				/*
+				 * First TID to be deleted: allocate memory to hold the
+				 * remaining items.
+				 */
+				tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+				memcpy(tmpitems, items, sizeof(ItemPointerData) * i);
+			}
+		}
+		else
+		{
+			if (tmpitems)
+				tmpitems[remaining] = items[i];
+			remaining++;
+		}
+	}
+
+	*nremaining = remaining;
+	return tmpitems;
+}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 3db32e8..301c019 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -29,6 +29,8 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 			 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 			 OffsetNumber offnum, IndexTuple itup);
+static void _bt_savePostingitem(BTScanOpaque so, int itemIndex,
+			 OffsetNumber offnum, ItemPointer iptr, IndexTuple itup, int i);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static Buffer _bt_walk_left(Relation rel, Buffer buf);
 static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
@@ -1161,6 +1163,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	int			itemIndex;
 	IndexTuple	itup;
 	bool		continuescan;
+	int 		i;
 
 	/*
 	 * We must have the buffer pinned and locked, but the usual macro can't be
@@ -1195,6 +1198,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	/* initialize tuple workspace to empty */
 	so->currPos.nextTupleOffset = 0;
+	so->currPos.prevTupleOffset = 0;
 
 	/*
 	 * Now that the current page has been made consistent, the macro should be
@@ -1215,8 +1219,19 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (itup != NULL)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				if (BtreeTupleIsPosting(itup))
+				{
+					for (i = 0; i < BtreeGetNPosting(itup); i++)
+					{
+						_bt_savePostingitem(so, itemIndex, offnum, BtreeGetPostingN(itup, i), itup, i);
+						itemIndex++;
+					}
+				}
+				else
+				{
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
 			}
 			if (!continuescan)
 			{
@@ -1228,7 +1243,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			offnum = OffsetNumberNext(offnum);
 		}
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxPackedIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1236,7 +1251,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxPackedIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1246,8 +1261,20 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (itup != NULL)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				if (BtreeTupleIsPosting(itup))
+				{
+					for (i = 0; i < BtreeGetNPosting(itup); i++)
+					{
+						itemIndex--;
+						_bt_savePostingitem(so, itemIndex, offnum, BtreeGetPostingN(itup, i), itup, i);
+					}
+				}
+				else
+				{
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+
 			}
 			if (!continuescan)
 			{
@@ -1261,8 +1288,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxPackedIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxPackedIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1288,6 +1315,37 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 }
 
 /*
+ * Save an index item into so->currPos.items[itemIndex]
+ * Performing index-only scan, handle the first elem separately.
+ * Save the key once, and connect it with posting tids using tupleOffset.
+ */
+static void
+_bt_savePostingitem(BTScanOpaque so, int itemIndex,
+			 OffsetNumber offnum, ItemPointer iptr, IndexTuple itup, int i)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *iptr;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		if (i == 0)
+		{
+			/* save key. the same for all tuples in the posting */
+			Size itupsz = BtreeGetPostingOffset(itup);
+			currItem->tupleOffset = so->currPos.nextTupleOffset;
+			memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+			so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+			so->currPos.prevTupleOffset = currItem->tupleOffset;
+		}
+		else
+			currItem->tupleOffset = so->currPos.prevTupleOffset;
+	}
+}
+
+
+/*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
  * On entry, if so->currPos.buf is valid the buffer is pinned but not locked;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 99a014e..e46930b 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -75,7 +75,7 @@
 #include "utils/rel.h"
 #include "utils/sortsupport.h"
 #include "utils/tuplesort.h"
-
+#include "catalog/catalog.h"
 
 /*
  * Status record for spooling/sorting phase.  (Note we may have two of
@@ -136,6 +136,9 @@ static void _bt_sortaddtup(Page page, Size itemsize,
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 			 IndexTuple itup);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static SortSupport _bt_prepare_SortSupport(BTWriteState *wstate, int keysz);
+static int	_bt_call_comparator(SortSupport sortKeys, int i,
+				IndexTuple itup, IndexTuple itup2, TupleDesc tupdes);
 static void _bt_load(BTWriteState *wstate,
 		 BTSpool *btspool, BTSpool *btspool2);
 
@@ -527,15 +530,120 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(last_off > P_FIRSTKEY);
 		ii = PageGetItemId(opage, last_off);
 		oitup = (IndexTuple) PageGetItem(opage, ii);
-		_bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
 
 		/*
-		 * Move 'last' into the high key position on opage
+		 * If the item is PostingTuple, we can cut it, because HIKEY
+		 * is not considered as real data, and it need not to keep any
+		 * ItemPointerData at all. And of course it need not to keep
+		 * a list of ipd.
+		 * But, if it had a big posting list, there will be plenty of
+		 * free space on the opage. In that case we must split posting
+		 * tuple into 2 pieces.
 		 */
-		hii = PageGetItemId(opage, P_HIKEY);
-		*hii = *ii;
-		ItemIdSetUnused(ii);	/* redundant */
-		((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+		 if (BtreeTupleIsPosting(oitup))
+		 {
+			IndexTuple  keytup;
+			Size 		keytupsz;
+			int 		nipd,
+						ntocut,
+						ntoleave;
+
+			nipd = BtreeGetNPosting(oitup);
+			ntocut = (sizeof(ItemIdData) + BtreeGetPostingOffset(oitup))/sizeof(ItemPointerData);
+			ntocut++; /* round up to be sure that we cut enough */
+			ntoleave = nipd - ntocut;
+
+			/*
+			 * 0) Form key tuple, that doesn't contain any ipd.
+			 * NOTE: key tuple will have blkno & offset suitable for P_HIKEY.
+			 * any function that uses keytup should handle them itself.
+			 */
+			keytupsz =  BtreeGetPostingOffset(oitup);
+			keytup = palloc0(keytupsz);
+			memcpy (keytup, oitup, keytupsz);
+			keytup->t_info &= ~INDEX_SIZE_MASK;
+			keytup->t_info |= keytupsz;
+			ItemPointerSet(&(keytup->t_tid), oblkno, P_HIKEY);
+
+			if (ntocut < nipd)
+			{
+				ItemPointerData *newipd;
+				IndexTuple		newitup,
+								newlasttup;
+				/*
+				 * 1) Cut part of old tuple to shift to npage.
+				 * And insert it as P_FIRSTKEY.
+				 * This tuple is based on keytup.
+				 * Blkno & offnum are reset in BtreeFormPackedTuple.
+				 */
+				newipd = palloc0(sizeof(ItemPointerData)*ntocut);
+				/* Note, that we cut last 'ntocut' items */
+				memcpy(newipd, BtreeGetPosting(oitup)+ntoleave, sizeof(ItemPointerData)*ntocut);
+				newitup = BtreeFormPackedTuple(keytup, newipd, ntocut);
+
+				_bt_sortaddtup(npage, IndexTupleSize(newitup), newitup, P_FIRSTKEY);
+				pfree(newipd);
+				pfree(newitup);
+
+				/*
+				 * 2) set last item to the P_HIKEY linp
+				 * Move 'last' into the high key position on opage
+				 * NOTE: Do this because of indextuple deletion algorithm, which
+				 * doesn't allow to delete an item while we have unused one before it.
+				 */
+				hii = PageGetItemId(opage, P_HIKEY);
+				*hii = *ii;
+				ItemIdSetUnused(ii);	/* redundant */
+				((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+
+				/* 3) delete "wrong" high key, insert keytup as P_HIKEY. */
+				_bt_pgupdtup(wstate->index, opage, P_HIKEY, keytup, false, NULL, 0);
+
+				/* 4) form the part of old tuple with ntoleave ipds. And insert it as last tuple. */
+				newlasttup = BtreeFormPackedTuple(keytup, BtreeGetPosting(oitup), ntoleave);
+
+				_bt_sortaddtup(opage, IndexTupleSize(newlasttup), newlasttup, PageGetMaxOffsetNumber(opage)+1);
+
+				pfree(newlasttup);
+			}
+			else
+			{
+				/* The tuple isn't big enough to split it. Handle it as a regular tuple. */
+
+				/*
+				 * 1) Shift the last tuple to npage.
+				 * Insert it as P_FIRSTKEY.
+				 */
+				_bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
+
+				/* 2) set last item to the P_HIKEY linp */
+				/* Move 'last' into the high key position on opage */
+				hii = PageGetItemId(opage, P_HIKEY);
+				*hii = *ii;
+				ItemIdSetUnused(ii);	/* redundant */
+				((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+
+				/* 3) delete "wrong" high key, insert keytup as P_HIKEY. */
+				_bt_pgupdtup(wstate->index, opage, P_HIKEY, keytup, false, NULL, 0);
+
+			}
+			pfree(keytup);
+		 }
+		 else
+		 {
+			/*
+			 * 1) Shift the last tuple to npage.
+			 * Insert it as P_FIRSTKEY.
+			 */
+			_bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
+
+			/* 2) set last item to the P_HIKEY linp */
+			/* Move 'last' into the high key position on opage */
+			hii = PageGetItemId(opage, P_HIKEY);
+			*hii = *ii;
+			ItemIdSetUnused(ii);	/* redundant */
+			((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+		}
 
 		/*
 		 * Link the old page into its parent, using its minimum key. If we
@@ -547,6 +655,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 
 		Assert(state->btps_minkey != NULL);
 		ItemPointerSet(&(state->btps_minkey->t_tid), oblkno, P_HIKEY);
+
 		_bt_buildadd(wstate, state->btps_next, state->btps_minkey);
 		pfree(state->btps_minkey);
 
@@ -554,8 +663,12 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		 * Save a copy of the minimum key for the new page.  We have to copy
 		 * it off the old page, not the new one, in case we are not at leaf
 		 * level.
+		 * We can not just copy oitup, because it could be posting tuple
+		 * and it's more safe just to get new inserted hikey.
 		 */
-		state->btps_minkey = CopyIndexTuple(oitup);
+		ItemId iihk = PageGetItemId(opage, P_HIKEY);
+		IndexTuple hikey = (IndexTuple) PageGetItem(opage, iihk);
+		state->btps_minkey = CopyIndexTuple(hikey);
 
 		/*
 		 * Set the sibling links for both pages.
@@ -590,7 +703,29 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	if (last_off == P_HIKEY)
 	{
 		Assert(state->btps_minkey == NULL);
-		state->btps_minkey = CopyIndexTuple(itup);
+
+		if (BtreeTupleIsPosting(itup))
+		{
+			Size		keytupsz;
+			IndexTuple  keytup;
+
+			/*
+			 * 0) Form key tuple, that doesn't contain any ipd.
+			 * NOTE: key tuple will have blkno & offset suitable for P_HIKEY.
+			 * any function that uses keytup should handle them itself.
+			 */
+			keytupsz =  BtreeGetPostingOffset(itup);
+			keytup = palloc0(keytupsz);
+			memcpy (keytup, itup, keytupsz);
+
+			keytup->t_info &= ~INDEX_SIZE_MASK;
+			keytup->t_info |= keytupsz;
+			ItemPointerSet(&(keytup->t_tid), nblkno, P_HIKEY);
+
+			state->btps_minkey = CopyIndexTuple(keytup);
+		}
+		else
+			state->btps_minkey = CopyIndexTuple(itup);
 	}
 
 	/*
@@ -670,6 +805,71 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 }
 
 /*
+ * Prepare SortSupport structure for indextuples comparison
+ */
+static SortSupport
+_bt_prepare_SortSupport(BTWriteState *wstate, int keysz)
+{
+	ScanKey		indexScanKey;
+	SortSupport sortKeys;
+	int 		i;
+
+	/* Prepare SortSupport data for each column */
+	indexScanKey = _bt_mkscankey_nodata(wstate->index);
+	sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
+
+	for (i = 0; i < keysz; i++)
+	{
+		SortSupport sortKey = sortKeys + i;
+		ScanKey		scanKey = indexScanKey + i;
+		int16		strategy;
+
+		sortKey->ssup_cxt = CurrentMemoryContext;
+		sortKey->ssup_collation = scanKey->sk_collation;
+		sortKey->ssup_nulls_first =
+			(scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
+		sortKey->ssup_attno = scanKey->sk_attno;
+		/* Abbreviation is not supported here */
+		sortKey->abbreviate = false;
+
+		AssertState(sortKey->ssup_attno != 0);
+
+		strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
+			BTGreaterStrategyNumber : BTLessStrategyNumber;
+
+		PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
+	}
+
+	_bt_freeskey(indexScanKey);
+	return sortKeys;
+}
+
+/*
+ * Compare two tuples using sortKey on attribute i
+ */
+static int
+_bt_call_comparator(SortSupport sortKeys, int i,
+						 IndexTuple itup, IndexTuple itup2, TupleDesc tupdes)
+{
+		SortSupport entry;
+		Datum		attrDatum1,
+					attrDatum2;
+		bool		isNull1,
+					isNull2;
+		int32		compare;
+
+		entry = sortKeys + i - 1;
+		attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
+		attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+
+		compare = ApplySortComparator(attrDatum1, isNull1,
+										attrDatum2, isNull2,
+										entry);
+
+		return compare;
+}
+
+/*
  * Read tuples in correct sort order from tuplesort, and load them into
  * btree leaves.
  */
@@ -679,16 +879,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	BTPageState *state = NULL;
 	bool		merge = (btspool2 != NULL);
 	IndexTuple	itup,
-				itup2 = NULL;
+				itup2 = NULL,
+				itupprev = NULL;
 	bool		should_free,
 				should_free2,
 				load1;
 	TupleDesc	tupdes = RelationGetDescr(wstate->index);
 	int			i,
 				keysz = RelationGetNumberOfAttributes(wstate->index);
-	ScanKey		indexScanKey = NULL;
+	int			ntuples = 0;
 	SortSupport sortKeys;
 
+	/* Prepare SortSupport structure for indextuples comparison */
+	sortKeys = (SortSupport)_bt_prepare_SortSupport(wstate, keysz);
+
 	if (merge)
 	{
 		/*
@@ -701,34 +905,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 									   true, &should_free);
 		itup2 = tuplesort_getindextuple(btspool2->sortstate,
 										true, &should_free2);
-		indexScanKey = _bt_mkscankey_nodata(wstate->index);
-
-		/* Prepare SortSupport data for each column */
-		sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
-
-		for (i = 0; i < keysz; i++)
-		{
-			SortSupport sortKey = sortKeys + i;
-			ScanKey		scanKey = indexScanKey + i;
-			int16		strategy;
-
-			sortKey->ssup_cxt = CurrentMemoryContext;
-			sortKey->ssup_collation = scanKey->sk_collation;
-			sortKey->ssup_nulls_first =
-				(scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
-			sortKey->ssup_attno = scanKey->sk_attno;
-			/* Abbreviation is not supported here */
-			sortKey->abbreviate = false;
-
-			AssertState(sortKey->ssup_attno != 0);
-
-			strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
-				BTGreaterStrategyNumber : BTLessStrategyNumber;
-
-			PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
-		}
-
-		_bt_freeskey(indexScanKey);
 
 		for (;;)
 		{
@@ -742,20 +918,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			{
 				for (i = 1; i <= keysz; i++)
 				{
-					SortSupport entry;
-					Datum		attrDatum1,
-								attrDatum2;
-					bool		isNull1,
-								isNull2;
-					int32		compare;
-
-					entry = sortKeys + i - 1;
-					attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
-					attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
-
-					compare = ApplySortComparator(attrDatum1, isNull1,
-												  attrDatum2, isNull2,
-												  entry);
+					int32 compare = _bt_call_comparator(sortKeys, i, itup, itup2, tupdes);
+
 					if (compare > 0)
 					{
 						load1 = false;
@@ -794,16 +958,123 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	else
 	{
 		/* merge is unnecessary */
-		while ((itup = tuplesort_getindextuple(btspool->sortstate,
+		Relation indexRelation = wstate->index;
+		Form_pg_index index = indexRelation->rd_index;
+
+		if (IsSystemRelation(indexRelation) || index->indisunique)
+		{
+			/* Do not use compression. */
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
 											   true, &should_free)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+					state = _bt_pagestate(wstate, 0);
+
+				_bt_buildadd(wstate, state, itup);
+				if (should_free)
+					pfree(itup);
+			}
+		}
+		else
 		{
-			/* When we see first tuple, create first index page */
-			if (state == NULL)
-				state = _bt_pagestate(wstate, 0);
+			ItemPointerData *ipd = NULL;
+			IndexTuple 		postingtuple;
+			Size			maxitemsize = 0,
+							maxpostingsize = 0;
 
-			_bt_buildadd(wstate, state, itup);
-			if (should_free)
-				pfree(itup);
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+											   true, &should_free)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+				{
+					state = _bt_pagestate(wstate, 0);
+					maxitemsize = BTMaxItemSize(state->btps_page);
+				}
+
+				/*
+				 * Compare current tuple with previous one.
+				 * If tuples are equal, we can unite them into a posting list.
+				 */
+				if (itupprev != NULL)
+				{
+					if (_bt_isbinaryequal(tupdes, itupprev, index->indnatts, itup))
+					{
+						/* Tuples are equal. Create or update posting */
+						if (ntuples == 0)
+						{
+							/*
+							 * We haven't suitable posting list yet, so allocate
+							 * it and save both itupprev and current tuple.
+							 */
+							ipd = palloc0(maxitemsize);
+
+							memcpy(ipd, itupprev, sizeof(ItemPointerData));
+							ntuples++;
+							memcpy(ipd + ntuples, itup, sizeof(ItemPointerData));
+							ntuples++;
+						}
+						else
+						{
+							if ((ntuples+1)*sizeof(ItemPointerData) < maxpostingsize)
+							{
+								memcpy(ipd + ntuples, itup, sizeof(ItemPointerData));
+								ntuples++;
+							}
+							else
+							{
+								postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+								_bt_buildadd(wstate, state, postingtuple);
+								ntuples = 0;
+								pfree(ipd);
+							}
+						}
+
+					}
+					else
+					{
+						/* Tuples are not equal. Insert itupprev into index. */
+						if (ntuples == 0)
+							_bt_buildadd(wstate, state, itupprev);
+						else
+						{
+							postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+							_bt_buildadd(wstate, state, postingtuple);
+							ntuples = 0;
+							pfree(ipd);
+						}
+					}
+				}
+
+				/*
+				 * Copy the tuple into temp variable itupprev
+				 * to compare it with the following tuple
+				 * and maybe unite them into a posting tuple
+				 */
+				itupprev = CopyIndexTuple(itup);
+				if (should_free)
+					pfree(itup);
+
+				/* compute max size of ipd list */
+				maxpostingsize = maxitemsize - IndexInfoFindDataOffset(itupprev->t_info) - MAXALIGN(IndexTupleSize(itupprev));
+			}
+
+			/* Handle the last item.*/
+			if (ntuples == 0)
+			{
+				if (itupprev != NULL)
+					_bt_buildadd(wstate, state, itupprev);
+			}
+			else
+			{
+				Assert(ipd!=NULL);
+				Assert(itupprev != NULL);
+				postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+				_bt_buildadd(wstate, state, postingtuple);
+				ntuples = 0;
+				pfree(ipd);
+			}
 		}
 	}
 
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index c850b48..8c9dda1 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1821,7 +1821,9 @@ _bt_killitems(IndexScanDesc scan)
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			/* No microvacuum for posting tuples */
+			if (!BtreeTupleIsPosting(ituple)
+				&& (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
 			{
 				/* found the item */
 				ItemIdMarkDead(iid);
@@ -2063,3 +2065,69 @@ btoptions(Datum reloptions, bool validate)
 {
 	return default_reloptions(reloptions, validate, RELOPT_KIND_BTREE);
 }
+
+/*
+ * Already have basic index tuple that contains key datum
+ */
+IndexTuple
+BtreeFormPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd)
+{
+	uint32	   newsize;
+	IndexTuple itup = CopyIndexTuple(tuple);
+
+	/*
+	 * Determine and store offset to the posting list.
+	 */
+	newsize = IndexTupleSize(itup);
+	newsize = SHORTALIGN(newsize);
+
+	/*
+	 * Set meta info about the posting list.
+	 */
+	BtreeSetPostingOffset(itup, newsize);
+	BtreeSetNPosting(itup, nipd);
+	/*
+	 * Add space needed for posting list, if any.  Then check that the tuple
+	 * won't be too big to store.
+	 */
+	newsize += sizeof(ItemPointerData)*nipd;
+	newsize = MAXALIGN(newsize);
+
+	/*
+	 * Resize tuple if needed
+	 */
+	if (newsize != IndexTupleSize(itup))
+	{
+		itup = repalloc(itup, newsize);
+
+		/*
+		 * PostgreSQL 9.3 and earlier did not clear this new space, so we
+		 * might find uninitialized padding when reading tuples from disk.
+		 */
+		memset((char *) itup + IndexTupleSize(itup),
+			   0, newsize - IndexTupleSize(itup));
+		/* set new size in tuple header */
+		itup->t_info &= ~INDEX_SIZE_MASK;
+		itup->t_info |= newsize;
+	}
+
+	/*
+	 * Copy data into the posting tuple
+	 */
+	memcpy(BtreeGetPosting(itup), data, sizeof(ItemPointerData)*nipd);
+	return itup;
+}
+
+IndexTuple
+BtreeReformPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd)
+{
+	int size;
+	if (BtreeTupleIsPosting(tuple))
+	{
+		size = BtreeGetPostingOffset(tuple);
+		tuple->t_info &= ~INDEX_SIZE_MASK;
+		tuple->t_info |= size;
+	}
+
+	return BtreeFormPackedTuple(tuple, data, nipd);
+}
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 8350fa0..3dd19c0 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -138,7 +138,6 @@ typedef IndexAttributeBitMapData *IndexAttributeBitMap;
 	((int) ((BLCKSZ - SizeOfPageHeaderData) / \
 			(MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))))
 
-
 /* routines in indextuple.c */
 extern IndexTuple index_form_tuple(TupleDesc tupleDescriptor,
 				 Datum *values, bool *isnull);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 06822fa..16a23b2 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -538,6 +538,8 @@ typedef struct BTScanPosData
 	 * location in the associated tuple storage workspace.
 	 */
 	int			nextTupleOffset;
+	/* prevTupleOffset is for Posting list handling*/
+	int			prevTupleOffset;
 
 	/*
 	 * The items array is always ordered in index order (ie, increasing
@@ -550,7 +552,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxPackedIndexTuplesPerPage]; /* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -651,6 +653,36 @@ typedef BTScanOpaqueData *BTScanOpaque;
 #define SK_BT_DESC			(INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
 #define SK_BT_NULLS_FIRST	(INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
 
+
+/*
+ * We use our own ItemPointerGet(BlockNumber|OffsetNumber)
+ * to avoid Asserts, since sometimes the ip_posid isn't "valid"
+ */
+#define BtreeItemPointerGetBlockNumber(pointer) \
+	BlockIdGetBlockNumber(&(pointer)->ip_blkid)
+
+#define BtreeItemPointerGetOffsetNumber(pointer) \
+	((pointer)->ip_posid)
+
+#define BT_POSTING (1<<31)
+#define BtreeGetNPosting(itup)			BtreeItemPointerGetOffsetNumber(&(itup)->t_tid)
+#define BtreeSetNPosting(itup,n)		ItemPointerSetOffsetNumber(&(itup)->t_tid,n)
+
+#define BtreeGetPostingOffset(itup)		(BtreeItemPointerGetBlockNumber(&(itup)->t_tid) & (~BT_POSTING))
+#define BtreeSetPostingOffset(itup,n)	ItemPointerSetBlockNumber(&(itup)->t_tid,(n)|BT_POSTING)
+#define BtreeTupleIsPosting(itup)    	(BtreeItemPointerGetBlockNumber(&(itup)->t_tid) & BT_POSTING)
+#define BtreeGetPosting(itup)			(ItemPointerData*) ((char*)(itup) + BtreeGetPostingOffset(itup))
+#define BtreeGetPostingN(itup,n)		(ItemPointerData*) (BtreeGetPosting(itup) + n)
+
+/*
+ * If compression is applied, the page could contain more tuples
+ * than if it has only uncompressed tuples, so we need new max value.
+ * Note that it is a rough upper estimate.
+ */
+#define MaxPackedIndexTuplesPerPage	\
+	((int) ((BLCKSZ - SizeOfPageHeaderData) / \
+			(sizeof(ItemPointerData))))
+
 /*
  * prototypes for functions in nbtree.c (external entry points for btree)
  */
@@ -684,6 +716,9 @@ extern bool _bt_doinsert(Relation rel, IndexTuple itup,
 			 IndexUniqueCheck checkUnique, Relation heapRel);
 extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, int access);
 extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
+extern void _bt_pgupdtup(Relation rel, Page page, OffsetNumber offset, IndexTuple itup, 
+						 bool concat, IndexTuple olditup, int nipd);
+extern bool _bt_isbinaryequal(TupleDesc itupdesc, IndexTuple itup, int nindatts, IndexTuple ituptoinsert);
 
 /*
  * prototypes for functions in nbtpage.c
@@ -715,8 +750,8 @@ extern BTStack _bt_search(Relation rel,
 extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
 			  ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
 			  int access);
-extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
-			ScanKey scankey, bool nextkey);
+extern OffsetNumber _bt_binsrch( Relation rel, Buffer buf, int keysz,
+								ScanKey scankey, bool nextkey);
 extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
 			Page page, OffsetNumber offnum);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
@@ -747,6 +782,8 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
 extern Size BTreeShmemSize(void);
 extern void BTreeShmemInit(void);
 extern bytea *btoptions(Datum reloptions, bool validate);
+extern IndexTuple BtreeFormPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd);
+extern IndexTuple BtreeReformPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd);
 
 /*
  * prototypes for functions in nbtvalidate.c

#28

Anastasia Lubennikova

a.lubennikova@postgrespro.ru

almost 10 years ago

In reply to: Anastasia Lubennikova (#27)

2 attachment(s)

Re: [WIP] Effective storage of duplicates in B-tree index.

18.02.2016 20:18, Anastasia Lubennikova:

04.02.2016 20:16, Peter Geoghegan:

On Fri, Jan 29, 2016 at 8:50 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

I fixed it in the new version (attached).

Thank you for the review.
At last, there is a new patch version 3.0. After some refactoring it
looks much better.
I described all details of the compression in this document
https://goo.gl/50O8Q0 (the same text without pictures is attached in
btc_readme_1.0.txt).
Consider it as a rough copy of readme. It contains some notes about
tricky moments of implementation and questions about future work.
Please don't hesitate to comment it.

Sorry, previous patch was dirty. Hotfix is attached.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

btc_readme_1.0.patchtext/x-patch; name=btc_readme_1.0.patchDownload

Compression. To be correct, it’s not actually compression, but just effective layout of ItemPointers on an index page.
compressed tuple  = IndexTuple (with metadata in TID field+ key) + PostingList


1. Gin index fits extremely good for really large sets of repeating keys, but on the other hand, it completely fails to handle unique keys. To btree it is essential to have good performance and concurrency in any corner cases with any number of duplicates. That’s why we can’t just copy the gin implementation of  item pointers compression. The first difference is that btree algorithm performs compression (or, in other words, changes index tuple layout) only if there’s more than one tuple with this key. It allows us to avoid the overhead of storing useless metadata for mostly different keys (see picture below). It seems that compression could be useful for unique indexes under heavy write/update load (because of MVCC copies), but I don’t sure whether this use-case really exists. Those tuples should be deleted by microvacuum as soon as possible. Anyway, I think that it’s worth to add storage_parameter for btree which enables/disables compression for each particular index. And set compression of unique indexes to off by default. System indexes do not support compression for several reasons. First of all because of WIP state of the patch (debugging system catalog isn’t a big pleasure). The next reason is that I know many places in the code where hardcode or some non-obvious syscache routines are used. I do not feel brave enough to change this code. And last but not least, I don’t see good reasons to do that.

2. If the index key is very small (smaller than metadata) and the number of duplicates is small, compression could lead to index bloat instead of index size decrease (see picture below). I don’t sure whether it’s worth to handle this case separately because it’s really rare and I consider that it’s the DBA’s job to disable compression on such indexes. But if you see any clear way to do this, it would be great.

3. For GIN indexes, if a posting list is too large, a posting tree is created. It proceeded on assumptions that:
Indexed keys are never deleted. It makes all tree algorithms much easier.
There are always many duplicates. Otherwise, gin becomes really inefficient.
There’s no big concurrent rate. In order to add a new entry into a posting tree, we hold a lock on its root, so only 1 backend at a time can perform insertion. 

In btree we can’t afford these assumptions. So we should handle big posting lists in another way. If there are too many ItemPointers to fit into a single posting list, we will just create another one. The overhead of this approach is that we have to store a duplicate of the key and metadata. It leads to the problem of big keys. If the keysize is close to BTMaxItemSize, compression will give us really small benefit, if any at all (see picture below).

4. The more item pointers fit into the single posting list, the rare we have to split it and repeat the key. Therefore, the bigger BTMaxItemSize is the better. The comment in nbtree.h says: “We actually need to be able to fit three items on every page, so restrict any one item to 1/3 the per-page available space.” That is quite right for regular items, but if the index tuple is compressed it already contains more than one item. Taking it into account, we can assert that BTMaxItemSize ~ ⅓ pagesize for regular items, and ~ ½ pagesize for compressed items. Are there any objections? I wonder if we can increase BTMaxItemSize with some other assumption? The problem I see here is that varlena highkey could be as big as the compressed tuple.

5. CREATE INDEX. _bt_load. The algorithm of btree build is following: do the heap scan, add tuples into spool, sort the data, insert ordered data from spool into leaf index pages (_bt_load), build inner pages and root. The main changes are applied to _bt_load function. While loading tuples, we do not insert them one by one, but instead, compare each tuple with the previous one, and if they are equal we put them into posting list. If the posting list is large enough to fit into an index tuple (maxposting size id computed as BTMaxItemSize - size of regular index tuple) or if the following tuple is not equal to the previous, we should create packed tuple using BtreeFormPackedTuple on posting list  (if any) and insert it into a page. The same we do if there are no more elements in the spool.

6. High key is not a real data, but just an upper bound of the keys that allowed on the page. So there’s no need to compress it. While copying a posting tuple into a high key, we should to get rid of posting list. A posting tuple should be truncated to length of a regular tuple, and the metadata in its TID field should be set with appropriate values. It’s worth to mention here a very specific point in _bt_buildadd(). If current page is full (there is no room for a new tuple), we copy the last item on the page into the new page, and then rearrange the old page so that the 'last item' becomes its high key rather than a true data item. If the last tuple was compressed, we can truncate it before setting as a high key. But, if it had a big posting list, there will be plenty of free space on the original page. So we must split Posting tuple into 2 pieces. see the picture below and comments in the code. I’m not sure about correctness of locking here, but I assume that there are no possible concurrent operations while building index. Is it right?

7. Another difference between gin and btree is that item pointers in gin posting list/tree are always ordered while btree doesn’t require this strictly. If there are many duplicates in btree, we don’t bother to find the ideal place to keep TIDs ordered. The insertion has a choice whether or not to move right. Currently, we just try to find a page where there is room for the new key. The next TODO item is to keep item pointers in posting list ordered. The advantage here is that the best compression of posting list could be reached on sorted TIDs.  What do you think about it?

8. Insertion. After we found the sutable place for insertion, check, whether the previous item has the same key. If so, and if there is enough room to add a pointer into the page, we can add it into item. There are two possible cases. If old item is a regular tuple, we should form new compressed tuple. Note, that this case requires to have enough space for two TIDs (metadata and new TID). Otherwise, we just add the pointer into existing posting list. Then delete old tuple and insert the new one.

9. Search. Fortunately, it’s quite easy to change search algorithm. If compressed tuple is found, just go over all TIDs and return them. If an index-only scan is processed, just return the same tuple N times in a row. To avoid storing duplicates in currTuples array, save the key once and then connect it with posting TIDs using tupleOffset. It’s clear that if compression is applied, the page could contain more tuples than if it has only uncompressed tuples. That is why MaxPackedIndexTuplesPerPage appears. Array items (which actually has currTuples and tupleOffset) in BTScanPos is preallocated with length = MaxPackedIndexTuplesPerPage, because we must be sure that all items would fit into the array.

10. Split. The only change in this section is posting list truncation before insert the tuple as a high key.

11. Vacuum. Check all TIDs in a posting list. If there are no live items in the compressed tuple, delete the tuple. Otherwise do the following: form new posting tuple, that contains remaining item pointers;  delete "old" posting;  insert new posting back to the page.  Microvacuum  of compressed tuples is not implemented yet. It’s possible to use high bit of offset field of item pointer to flag killed items. But it requires additional performance testing. 

12. Locking. Compressed index tuples use the same functions of insertion and deletion as regular index tuples. Most of the operations are performed inside standart functions and don’t need any specific locks. Although this issue defenitely requires more properly testing and review. All the operations where posting tuple is updated in place (deleted and then inserted again with new set of item pointers in posting list) are performed with special function _bt_pgupdtup().  As well as operation, where we want to replace one tuple with another one e.g. in btvacuumpage() and _bt_buildadd (see issue related to high key). 

13. Xlog. TODO.

btree_compression_3.1.patchtext/x-patch; name=btree_compression_3.1.patchDownload

diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index e3c55eb..d6922d5 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -24,6 +24,8 @@
 #include "storage/predicate.h"
 #include "utils/tqual.h"
 
+#include "catalog/catalog.h"
+#include "utils/datum.h"
 
 typedef struct
 {
@@ -82,6 +84,7 @@ static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
 static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
 			int keysz, ScanKey scankey);
+
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
 
 
@@ -113,6 +116,11 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	BTStack		stack;
 	Buffer		buf;
 	OffsetNumber offset;
+	Page 		page;
+	TupleDesc	itupdesc;
+	int			nipd;
+	IndexTuple 	olditup;
+	Size 		sizetoadd;
 
 	/* we need an insertion scan key to do our search, so build one */
 	itup_scankey = _bt_mkscankey(rel, itup);
@@ -190,6 +198,7 @@ top:
 
 	if (checkUnique != UNIQUE_CHECK_EXISTING)
 	{
+		bool updposting = false;
 		/*
 		 * The only conflict predicate locking cares about for indexes is when
 		 * an index tuple insert conflicts with an existing lock.  Since the
@@ -201,7 +210,45 @@ top:
 		/* do the insertion */
 		_bt_findinsertloc(rel, &buf, &offset, natts, itup_scankey, itup,
 						  stack, heapRel);
-		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+
+		/*
+		 * Decide, whether we can apply compression
+		 */
+		page = BufferGetPage(buf);
+
+		if(!IsSystemRelation(rel)
+			&& !rel->rd_index->indisunique
+			&& offset != InvalidOffsetNumber
+			&& offset <= PageGetMaxOffsetNumber(page))
+		{
+			itupdesc = RelationGetDescr(rel);
+			sizetoadd = sizeof(ItemPointerData);
+			olditup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offset));
+
+			if(_bt_isbinaryequal(itupdesc, olditup,
+									rel->rd_index->indnatts, itup))
+			{
+				if (!BtreeTupleIsPosting(olditup))
+				{
+					nipd = 1;
+					sizetoadd = sizetoadd*2;
+				}
+				else
+					nipd = BtreeGetNPosting(olditup);
+
+				if ((IndexTupleSize(olditup) + sizetoadd) <= BTMaxItemSize(page)
+					&& PageGetFreeSpace(page) > sizetoadd)
+					updposting = true;
+			}
+		}
+
+		if (updposting)
+		{
+			_bt_pgupdtup(rel, page, offset, itup, true, olditup, nipd);
+			_bt_relbuf(rel, buf);
+		}
+		else
+			_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
 	}
 	else
 	{
@@ -1042,6 +1089,7 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+
 		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
 						false, false) == InvalidOffsetNumber)
 		{
@@ -1072,13 +1120,39 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 	}
-	if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
+
+	if (BtreeTupleIsPosting(item))
+	{
+		Size hikeysize =  BtreeGetPostingOffset(item);
+		IndexTuple hikey = palloc0(hikeysize);
+
+		/* Truncate posting before insert it as a hikey. */
+		memcpy (hikey, item, hikeysize);
+		hikey->t_info &= ~INDEX_SIZE_MASK;
+		hikey->t_info |= hikeysize;
+		ItemPointerSet(&(hikey->t_tid), origpagenumber, P_HIKEY);
+
+		if (PageAddItem(leftpage, (Item) hikey, hikeysize, leftoff,
 					false, false) == InvalidOffsetNumber)
+		{
+			memset(rightpage, 0, BufferGetPageSize(rbuf));
+			elog(ERROR, "failed to add hikey to the left sibling"
+				" while splitting block %u of index \"%s\"",
+				origpagenumber, RelationGetRelationName(rel));
+		}
+
+		pfree(hikey);
+	}
+	else
 	{
-		memset(rightpage, 0, BufferGetPageSize(rbuf));
-		elog(ERROR, "failed to add hikey to the left sibling"
-			 " while splitting block %u of index \"%s\"",
-			 origpagenumber, RelationGetRelationName(rel));
+		if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
+						false, false) == InvalidOffsetNumber)
+		{
+			memset(rightpage, 0, BufferGetPageSize(rbuf));
+			elog(ERROR, "failed to add hikey to the left sibling"
+				" while splitting block %u of index \"%s\"",
+				origpagenumber, RelationGetRelationName(rel));
+		}
 	}
 	leftoff = OffsetNumberNext(leftoff);
 
@@ -2103,6 +2177,76 @@ _bt_pgaddtup(Page page,
 }
 
 /*
+ * _bt_pgupdtup() -- update a tuple in place.
+ * This function is used for purposes of deduplication of item pointers.
+ * If new tuple to insert is equal to the tuple that already exists on the page,
+ * we can avoid key insertion and just add new item pointer.
+ *
+ * offset is the position of olditup on the page.
+ * itup is the new tuple to insert
+ * concat - this flag shows, whether we should add new item to existing one
+ * or just replace old tuple with the new value. If concat is false, the
+ * following fields are senseless.
+ * nipd is the number of item pointers in old tuple.
+ * The caller is responsible for checking of free space on the page.
+ */
+void
+_bt_pgupdtup(Relation rel, Page page, OffsetNumber offset, IndexTuple itup,
+			 bool concat, IndexTuple olditup, int nipd)
+{
+	ItemPointerData *ipd;
+	IndexTuple 		newitup;
+	Size 			newitupsz;
+
+	if (concat)
+	{
+		ipd = palloc0(sizeof(ItemPointerData)*(nipd + 1));
+
+		/* copy item pointers from old tuple into ipd */
+		if (BtreeTupleIsPosting(olditup))
+			memcpy(ipd, BtreeGetPosting(olditup), sizeof(ItemPointerData)*nipd);
+		else
+			memcpy(ipd, olditup, sizeof(ItemPointerData));
+
+		/* add item pointer of the new tuple into ipd */
+		memcpy(ipd+nipd, itup, sizeof(ItemPointerData));
+
+		newitup = BtreeReformPackedTuple(itup, ipd, nipd+1);
+
+		/*
+		* Update the tuple in place. We have already checked that the
+		* new tuple would fit into this page, so it's safe to delete
+		* old tuple and insert the new one without any side effects.
+		*/
+		newitupsz = IndexTupleDSize(*newitup);
+		newitupsz = MAXALIGN(newitupsz);
+	}
+	else
+	{
+		newitup = itup;
+		newitupsz = IndexTupleSize(itup);
+	}
+
+	START_CRIT_SECTION();
+
+	PageIndexTupleDelete(page, offset);
+
+	if (!_bt_pgaddtup(page, newitupsz, newitup, offset))
+		elog(ERROR, "failed to insert compressed item in index \"%s\"",
+			 RelationGetRelationName(rel));
+
+	//TODO add Xlog stuff
+
+	END_CRIT_SECTION();
+
+	if (concat)
+	{
+		pfree(ipd);
+		pfree(newitup);
+	}
+}
+
+/*
  * _bt_isequal - used in _bt_doinsert in check for duplicates.
  *
  * This is very similar to _bt_compare, except for NULL handling.
@@ -2151,6 +2295,63 @@ _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
 }
 
 /*
+ * _bt_isbinaryequal -  used in _bt_doinsert and _bt_load
+ * in check for duplicates. This is very similar to heap_tuple_attr_equals
+ * subroutine. And this function differs from _bt_isequal
+ * because here we require strict binary equality of tuples.
+ */
+bool
+_bt_isbinaryequal(TupleDesc itupdesc, IndexTuple itup,
+			int nindatts, IndexTuple ituptoinsert)
+{
+	AttrNumber	attno;
+
+	for (attno = 1; attno <= nindatts; attno++)
+	{
+		Datum		datum1,
+					datum2;
+		bool		isnull1,
+					isnull2;
+		Form_pg_attribute att;
+
+		datum1 = index_getattr(itup, attno, itupdesc, &isnull1);
+		datum2 = index_getattr(ituptoinsert, attno, itupdesc, &isnull2);
+
+		/*
+		 * If one value is NULL and other is not, then they are certainly not
+		 * equal
+		 */
+		if (isnull1 != isnull2)
+			return false;
+		/*
+		 * We do simple binary comparison of the two datums.  This may be overly
+		 * strict because there can be multiple binary representations for the
+		 * same logical value.  But we should be OK as long as there are no false
+		 * positives.  Using a type-specific equality operator is messy because
+		 * there could be multiple notions of equality in different operator
+		 * classes; furthermore, we cannot safely invoke user-defined functions
+		 * while holding exclusive buffer lock.
+		 */
+		if (attno <= 0)
+		{
+			/* The only allowed system columns are OIDs, so do this */
+			if (DatumGetObjectId(datum1) != DatumGetObjectId(datum2))
+				return false;
+		}
+		else
+		{
+			Assert(attno <= itupdesc->natts);
+			att = itupdesc->attrs[attno - 1];
+			if(!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+				return false;
+		}
+	}
+
+	/* if we get here, the keys are equal */
+	return true;
+}
+
+/*
  * _bt_vacuum_one_page - vacuum just one index page.
  *
  * Try to remove LP_DEAD items from the given page.  The passed buffer
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index f2905cb..a08c500 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -74,7 +74,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 			 BTCycleId cycleid);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 			 BlockNumber orig_blkno);
-
+static ItemPointer btreevacuumPosting(BTVacState *vstate,
+						ItemPointerData *items,int nitem, int *nremaining);
 
 /*
  * Btree handler function: return IndexAmRoutine with access method parameters
@@ -962,6 +963,7 @@ restart:
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
+		IndexTuple	remaining;
 
 		/*
 		 * Trade in the initial read lock for a super-exclusive write lock on
@@ -1011,31 +1013,58 @@ restart:
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
-
-				/*
-				 * During Hot Standby we currently assume that
-				 * XLOG_BTREE_VACUUM records do not produce conflicts. That is
-				 * only true as long as the callback function depends only
-				 * upon whether the index tuple refers to heap tuples removed
-				 * in the initial heap scan. When vacuum starts it derives a
-				 * value of OldestXmin. Backends taking later snapshots could
-				 * have a RecentGlobalXmin with a later xid than the vacuum's
-				 * OldestXmin, so it is possible that row versions deleted
-				 * after OldestXmin could be marked as killed by other
-				 * backends. The callback function *could* look at the index
-				 * tuple state in isolation and decide to delete the index
-				 * tuple, though currently it does not. If it ever did, we
-				 * would need to reconsider whether XLOG_BTREE_VACUUM records
-				 * should cause conflicts. If they did cause conflicts they
-				 * would be fairly harsh conflicts, since we haven't yet
-				 * worked out a way to pass a useful value for
-				 * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
-				 * applies to *any* type of index that marks index tuples as
-				 * killed.
-				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				if(BtreeTupleIsPosting(itup))
+				{
+					ItemPointer newipd;
+					int 		nipd,
+								nnewipd;
+
+					nipd = BtreeGetNPosting(itup);
+					newipd = btreevacuumPosting(vstate, BtreeGetPosting(itup), nipd, &nnewipd);
+
+					if (newipd != NULL)
+					{
+						if (nnewipd > 0)
+						{
+							/* There are still some live tuples in the posting.
+							 * 1) form new posting tuple, that contains remaining ipds
+							 * 2) delete "old" posting and insert new posting back to the page
+							 */
+							remaining = BtreeReformPackedTuple(itup, newipd, nnewipd);
+							_bt_pgupdtup(info->index, page, offnum, remaining, false, NULL, 0);
+						}
+						else
+							deletable[ndeletable++] = offnum;
+					}
+				}
+				else
+				{
+					htup = &(itup->t_tid);
+
+					/*
+					* During Hot Standby we currently assume that
+					* XLOG_BTREE_VACUUM records do not produce conflicts. That is
+					* only true as long as the callback function depends only
+					* upon whether the index tuple refers to heap tuples removed
+					* in the initial heap scan. When vacuum starts it derives a
+					* value of OldestXmin. Backends taking later snapshots could
+					* have a RecentGlobalXmin with a later xid than the vacuum's
+					* OldestXmin, so it is possible that row versions deleted
+					* after OldestXmin could be marked as killed by other
+					* backends. The callback function *could* look at the index
+					* tuple state in isolation and decide to delete the index
+					* tuple, though currently it does not. If it ever did, we
+					* would need to reconsider whether XLOG_BTREE_VACUUM records
+					* should cause conflicts. If they did cause conflicts they
+					* would be fairly harsh conflicts, since we haven't yet
+					* worked out a way to pass a useful value for
+					* latestRemovedXid on the XLOG_BTREE_VACUUM records. This
+					* applies to *any* type of index that marks index tuples as
+					* killed.
+					*/
+					if (callback(htup, callback_state))
+						deletable[ndeletable++] = offnum;
+				}
 			}
 		}
 
@@ -1160,3 +1189,50 @@ btcanreturn(Relation index, int attno)
 {
 	return true;
 }
+
+/*
+ * btreevacuumPosting() -- vacuums a posting list.
+ * The size of the list must be specified via number of items (nitems).
+ *
+ * If none of the items need to be removed, returns NULL. Otherwise returns
+ * a new palloc'd array with the remaining items. The number of remaining
+ * items is returned via nremaining.
+ */
+ItemPointer
+btreevacuumPosting(BTVacState *vstate, ItemPointerData *items,
+				   int nitem, int *nremaining)
+{
+	int			i,
+				remaining = 0;
+	ItemPointer tmpitems = NULL;
+	IndexBulkDeleteCallback callback = vstate->callback;
+	void	   *callback_state = vstate->callback_state;
+
+	/*
+	 * Iterate over TIDs array
+	 */
+	for (i = 0; i < nitem; i++)
+	{
+		if (callback(items + i, callback_state))
+		{
+			if (!tmpitems)
+			{
+				/*
+				 * First TID to be deleted: allocate memory to hold the
+				 * remaining items.
+				 */
+				tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+				memcpy(tmpitems, items, sizeof(ItemPointerData) * i);
+			}
+		}
+		else
+		{
+			if (tmpitems)
+				tmpitems[remaining] = items[i];
+			remaining++;
+		}
+	}
+
+	*nremaining = remaining;
+	return tmpitems;
+}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 3db32e8..301c019 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -29,6 +29,8 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 			 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 			 OffsetNumber offnum, IndexTuple itup);
+static void _bt_savePostingitem(BTScanOpaque so, int itemIndex,
+			 OffsetNumber offnum, ItemPointer iptr, IndexTuple itup, int i);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static Buffer _bt_walk_left(Relation rel, Buffer buf);
 static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
@@ -1161,6 +1163,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	int			itemIndex;
 	IndexTuple	itup;
 	bool		continuescan;
+	int 		i;
 
 	/*
 	 * We must have the buffer pinned and locked, but the usual macro can't be
@@ -1195,6 +1198,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	/* initialize tuple workspace to empty */
 	so->currPos.nextTupleOffset = 0;
+	so->currPos.prevTupleOffset = 0;
 
 	/*
 	 * Now that the current page has been made consistent, the macro should be
@@ -1215,8 +1219,19 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (itup != NULL)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				if (BtreeTupleIsPosting(itup))
+				{
+					for (i = 0; i < BtreeGetNPosting(itup); i++)
+					{
+						_bt_savePostingitem(so, itemIndex, offnum, BtreeGetPostingN(itup, i), itup, i);
+						itemIndex++;
+					}
+				}
+				else
+				{
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
 			}
 			if (!continuescan)
 			{
@@ -1228,7 +1243,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			offnum = OffsetNumberNext(offnum);
 		}
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxPackedIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1236,7 +1251,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxPackedIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1246,8 +1261,20 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (itup != NULL)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				if (BtreeTupleIsPosting(itup))
+				{
+					for (i = 0; i < BtreeGetNPosting(itup); i++)
+					{
+						itemIndex--;
+						_bt_savePostingitem(so, itemIndex, offnum, BtreeGetPostingN(itup, i), itup, i);
+					}
+				}
+				else
+				{
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+
 			}
 			if (!continuescan)
 			{
@@ -1261,8 +1288,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxPackedIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxPackedIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1288,6 +1315,37 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 }
 
 /*
+ * Save an index item into so->currPos.items[itemIndex]
+ * Performing index-only scan, handle the first elem separately.
+ * Save the key once, and connect it with posting tids using tupleOffset.
+ */
+static void
+_bt_savePostingitem(BTScanOpaque so, int itemIndex,
+			 OffsetNumber offnum, ItemPointer iptr, IndexTuple itup, int i)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *iptr;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		if (i == 0)
+		{
+			/* save key. the same for all tuples in the posting */
+			Size itupsz = BtreeGetPostingOffset(itup);
+			currItem->tupleOffset = so->currPos.nextTupleOffset;
+			memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+			so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+			so->currPos.prevTupleOffset = currItem->tupleOffset;
+		}
+		else
+			currItem->tupleOffset = so->currPos.prevTupleOffset;
+	}
+}
+
+
+/*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
  * On entry, if so->currPos.buf is valid the buffer is pinned but not locked;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 99a014e..e46930b 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -75,7 +75,7 @@
 #include "utils/rel.h"
 #include "utils/sortsupport.h"
 #include "utils/tuplesort.h"
-
+#include "catalog/catalog.h"
 
 /*
  * Status record for spooling/sorting phase.  (Note we may have two of
@@ -136,6 +136,9 @@ static void _bt_sortaddtup(Page page, Size itemsize,
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 			 IndexTuple itup);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static SortSupport _bt_prepare_SortSupport(BTWriteState *wstate, int keysz);
+static int	_bt_call_comparator(SortSupport sortKeys, int i,
+				IndexTuple itup, IndexTuple itup2, TupleDesc tupdes);
 static void _bt_load(BTWriteState *wstate,
 		 BTSpool *btspool, BTSpool *btspool2);
 
@@ -527,15 +530,120 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(last_off > P_FIRSTKEY);
 		ii = PageGetItemId(opage, last_off);
 		oitup = (IndexTuple) PageGetItem(opage, ii);
-		_bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
 
 		/*
-		 * Move 'last' into the high key position on opage
+		 * If the item is PostingTuple, we can cut it, because HIKEY
+		 * is not considered as real data, and it need not to keep any
+		 * ItemPointerData at all. And of course it need not to keep
+		 * a list of ipd.
+		 * But, if it had a big posting list, there will be plenty of
+		 * free space on the opage. In that case we must split posting
+		 * tuple into 2 pieces.
 		 */
-		hii = PageGetItemId(opage, P_HIKEY);
-		*hii = *ii;
-		ItemIdSetUnused(ii);	/* redundant */
-		((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+		 if (BtreeTupleIsPosting(oitup))
+		 {
+			IndexTuple  keytup;
+			Size 		keytupsz;
+			int 		nipd,
+						ntocut,
+						ntoleave;
+
+			nipd = BtreeGetNPosting(oitup);
+			ntocut = (sizeof(ItemIdData) + BtreeGetPostingOffset(oitup))/sizeof(ItemPointerData);
+			ntocut++; /* round up to be sure that we cut enough */
+			ntoleave = nipd - ntocut;
+
+			/*
+			 * 0) Form key tuple, that doesn't contain any ipd.
+			 * NOTE: key tuple will have blkno & offset suitable for P_HIKEY.
+			 * any function that uses keytup should handle them itself.
+			 */
+			keytupsz =  BtreeGetPostingOffset(oitup);
+			keytup = palloc0(keytupsz);
+			memcpy (keytup, oitup, keytupsz);
+			keytup->t_info &= ~INDEX_SIZE_MASK;
+			keytup->t_info |= keytupsz;
+			ItemPointerSet(&(keytup->t_tid), oblkno, P_HIKEY);
+
+			if (ntocut < nipd)
+			{
+				ItemPointerData *newipd;
+				IndexTuple		newitup,
+								newlasttup;
+				/*
+				 * 1) Cut part of old tuple to shift to npage.
+				 * And insert it as P_FIRSTKEY.
+				 * This tuple is based on keytup.
+				 * Blkno & offnum are reset in BtreeFormPackedTuple.
+				 */
+				newipd = palloc0(sizeof(ItemPointerData)*ntocut);
+				/* Note, that we cut last 'ntocut' items */
+				memcpy(newipd, BtreeGetPosting(oitup)+ntoleave, sizeof(ItemPointerData)*ntocut);
+				newitup = BtreeFormPackedTuple(keytup, newipd, ntocut);
+
+				_bt_sortaddtup(npage, IndexTupleSize(newitup), newitup, P_FIRSTKEY);
+				pfree(newipd);
+				pfree(newitup);
+
+				/*
+				 * 2) set last item to the P_HIKEY linp
+				 * Move 'last' into the high key position on opage
+				 * NOTE: Do this because of indextuple deletion algorithm, which
+				 * doesn't allow to delete an item while we have unused one before it.
+				 */
+				hii = PageGetItemId(opage, P_HIKEY);
+				*hii = *ii;
+				ItemIdSetUnused(ii);	/* redundant */
+				((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+
+				/* 3) delete "wrong" high key, insert keytup as P_HIKEY. */
+				_bt_pgupdtup(wstate->index, opage, P_HIKEY, keytup, false, NULL, 0);
+
+				/* 4) form the part of old tuple with ntoleave ipds. And insert it as last tuple. */
+				newlasttup = BtreeFormPackedTuple(keytup, BtreeGetPosting(oitup), ntoleave);
+
+				_bt_sortaddtup(opage, IndexTupleSize(newlasttup), newlasttup, PageGetMaxOffsetNumber(opage)+1);
+
+				pfree(newlasttup);
+			}
+			else
+			{
+				/* The tuple isn't big enough to split it. Handle it as a regular tuple. */
+
+				/*
+				 * 1) Shift the last tuple to npage.
+				 * Insert it as P_FIRSTKEY.
+				 */
+				_bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
+
+				/* 2) set last item to the P_HIKEY linp */
+				/* Move 'last' into the high key position on opage */
+				hii = PageGetItemId(opage, P_HIKEY);
+				*hii = *ii;
+				ItemIdSetUnused(ii);	/* redundant */
+				((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+
+				/* 3) delete "wrong" high key, insert keytup as P_HIKEY. */
+				_bt_pgupdtup(wstate->index, opage, P_HIKEY, keytup, false, NULL, 0);
+
+			}
+			pfree(keytup);
+		 }
+		 else
+		 {
+			/*
+			 * 1) Shift the last tuple to npage.
+			 * Insert it as P_FIRSTKEY.
+			 */
+			_bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
+
+			/* 2) set last item to the P_HIKEY linp */
+			/* Move 'last' into the high key position on opage */
+			hii = PageGetItemId(opage, P_HIKEY);
+			*hii = *ii;
+			ItemIdSetUnused(ii);	/* redundant */
+			((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+		}
 
 		/*
 		 * Link the old page into its parent, using its minimum key. If we
@@ -547,6 +655,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 
 		Assert(state->btps_minkey != NULL);
 		ItemPointerSet(&(state->btps_minkey->t_tid), oblkno, P_HIKEY);
+
 		_bt_buildadd(wstate, state->btps_next, state->btps_minkey);
 		pfree(state->btps_minkey);
 
@@ -554,8 +663,12 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		 * Save a copy of the minimum key for the new page.  We have to copy
 		 * it off the old page, not the new one, in case we are not at leaf
 		 * level.
+		 * We can not just copy oitup, because it could be posting tuple
+		 * and it's more safe just to get new inserted hikey.
 		 */
-		state->btps_minkey = CopyIndexTuple(oitup);
+		ItemId iihk = PageGetItemId(opage, P_HIKEY);
+		IndexTuple hikey = (IndexTuple) PageGetItem(opage, iihk);
+		state->btps_minkey = CopyIndexTuple(hikey);
 
 		/*
 		 * Set the sibling links for both pages.
@@ -590,7 +703,29 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	if (last_off == P_HIKEY)
 	{
 		Assert(state->btps_minkey == NULL);
-		state->btps_minkey = CopyIndexTuple(itup);
+
+		if (BtreeTupleIsPosting(itup))
+		{
+			Size		keytupsz;
+			IndexTuple  keytup;
+
+			/*
+			 * 0) Form key tuple, that doesn't contain any ipd.
+			 * NOTE: key tuple will have blkno & offset suitable for P_HIKEY.
+			 * any function that uses keytup should handle them itself.
+			 */
+			keytupsz =  BtreeGetPostingOffset(itup);
+			keytup = palloc0(keytupsz);
+			memcpy (keytup, itup, keytupsz);
+
+			keytup->t_info &= ~INDEX_SIZE_MASK;
+			keytup->t_info |= keytupsz;
+			ItemPointerSet(&(keytup->t_tid), nblkno, P_HIKEY);
+
+			state->btps_minkey = CopyIndexTuple(keytup);
+		}
+		else
+			state->btps_minkey = CopyIndexTuple(itup);
 	}
 
 	/*
@@ -670,6 +805,71 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 }
 
 /*
+ * Prepare SortSupport structure for indextuples comparison
+ */
+static SortSupport
+_bt_prepare_SortSupport(BTWriteState *wstate, int keysz)
+{
+	ScanKey		indexScanKey;
+	SortSupport sortKeys;
+	int 		i;
+
+	/* Prepare SortSupport data for each column */
+	indexScanKey = _bt_mkscankey_nodata(wstate->index);
+	sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
+
+	for (i = 0; i < keysz; i++)
+	{
+		SortSupport sortKey = sortKeys + i;
+		ScanKey		scanKey = indexScanKey + i;
+		int16		strategy;
+
+		sortKey->ssup_cxt = CurrentMemoryContext;
+		sortKey->ssup_collation = scanKey->sk_collation;
+		sortKey->ssup_nulls_first =
+			(scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
+		sortKey->ssup_attno = scanKey->sk_attno;
+		/* Abbreviation is not supported here */
+		sortKey->abbreviate = false;
+
+		AssertState(sortKey->ssup_attno != 0);
+
+		strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
+			BTGreaterStrategyNumber : BTLessStrategyNumber;
+
+		PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
+	}
+
+	_bt_freeskey(indexScanKey);
+	return sortKeys;
+}
+
+/*
+ * Compare two tuples using sortKey on attribute i
+ */
+static int
+_bt_call_comparator(SortSupport sortKeys, int i,
+						 IndexTuple itup, IndexTuple itup2, TupleDesc tupdes)
+{
+		SortSupport entry;
+		Datum		attrDatum1,
+					attrDatum2;
+		bool		isNull1,
+					isNull2;
+		int32		compare;
+
+		entry = sortKeys + i - 1;
+		attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
+		attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+
+		compare = ApplySortComparator(attrDatum1, isNull1,
+										attrDatum2, isNull2,
+										entry);
+
+		return compare;
+}
+
+/*
  * Read tuples in correct sort order from tuplesort, and load them into
  * btree leaves.
  */
@@ -679,16 +879,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	BTPageState *state = NULL;
 	bool		merge = (btspool2 != NULL);
 	IndexTuple	itup,
-				itup2 = NULL;
+				itup2 = NULL,
+				itupprev = NULL;
 	bool		should_free,
 				should_free2,
 				load1;
 	TupleDesc	tupdes = RelationGetDescr(wstate->index);
 	int			i,
 				keysz = RelationGetNumberOfAttributes(wstate->index);
-	ScanKey		indexScanKey = NULL;
+	int			ntuples = 0;
 	SortSupport sortKeys;
 
+	/* Prepare SortSupport structure for indextuples comparison */
+	sortKeys = (SortSupport)_bt_prepare_SortSupport(wstate, keysz);
+
 	if (merge)
 	{
 		/*
@@ -701,34 +905,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 									   true, &should_free);
 		itup2 = tuplesort_getindextuple(btspool2->sortstate,
 										true, &should_free2);
-		indexScanKey = _bt_mkscankey_nodata(wstate->index);
-
-		/* Prepare SortSupport data for each column */
-		sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
-
-		for (i = 0; i < keysz; i++)
-		{
-			SortSupport sortKey = sortKeys + i;
-			ScanKey		scanKey = indexScanKey + i;
-			int16		strategy;
-
-			sortKey->ssup_cxt = CurrentMemoryContext;
-			sortKey->ssup_collation = scanKey->sk_collation;
-			sortKey->ssup_nulls_first =
-				(scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
-			sortKey->ssup_attno = scanKey->sk_attno;
-			/* Abbreviation is not supported here */
-			sortKey->abbreviate = false;
-
-			AssertState(sortKey->ssup_attno != 0);
-
-			strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
-				BTGreaterStrategyNumber : BTLessStrategyNumber;
-
-			PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
-		}
-
-		_bt_freeskey(indexScanKey);
 
 		for (;;)
 		{
@@ -742,20 +918,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			{
 				for (i = 1; i <= keysz; i++)
 				{
-					SortSupport entry;
-					Datum		attrDatum1,
-								attrDatum2;
-					bool		isNull1,
-								isNull2;
-					int32		compare;
-
-					entry = sortKeys + i - 1;
-					attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
-					attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
-
-					compare = ApplySortComparator(attrDatum1, isNull1,
-												  attrDatum2, isNull2,
-												  entry);
+					int32 compare = _bt_call_comparator(sortKeys, i, itup, itup2, tupdes);
+
 					if (compare > 0)
 					{
 						load1 = false;
@@ -794,16 +958,123 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	else
 	{
 		/* merge is unnecessary */
-		while ((itup = tuplesort_getindextuple(btspool->sortstate,
+		Relation indexRelation = wstate->index;
+		Form_pg_index index = indexRelation->rd_index;
+
+		if (IsSystemRelation(indexRelation) || index->indisunique)
+		{
+			/* Do not use compression. */
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
 											   true, &should_free)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+					state = _bt_pagestate(wstate, 0);
+
+				_bt_buildadd(wstate, state, itup);
+				if (should_free)
+					pfree(itup);
+			}
+		}
+		else
 		{
-			/* When we see first tuple, create first index page */
-			if (state == NULL)
-				state = _bt_pagestate(wstate, 0);
+			ItemPointerData *ipd = NULL;
+			IndexTuple 		postingtuple;
+			Size			maxitemsize = 0,
+							maxpostingsize = 0;
 
-			_bt_buildadd(wstate, state, itup);
-			if (should_free)
-				pfree(itup);
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+											   true, &should_free)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+				{
+					state = _bt_pagestate(wstate, 0);
+					maxitemsize = BTMaxItemSize(state->btps_page);
+				}
+
+				/*
+				 * Compare current tuple with previous one.
+				 * If tuples are equal, we can unite them into a posting list.
+				 */
+				if (itupprev != NULL)
+				{
+					if (_bt_isbinaryequal(tupdes, itupprev, index->indnatts, itup))
+					{
+						/* Tuples are equal. Create or update posting */
+						if (ntuples == 0)
+						{
+							/*
+							 * We haven't suitable posting list yet, so allocate
+							 * it and save both itupprev and current tuple.
+							 */
+							ipd = palloc0(maxitemsize);
+
+							memcpy(ipd, itupprev, sizeof(ItemPointerData));
+							ntuples++;
+							memcpy(ipd + ntuples, itup, sizeof(ItemPointerData));
+							ntuples++;
+						}
+						else
+						{
+							if ((ntuples+1)*sizeof(ItemPointerData) < maxpostingsize)
+							{
+								memcpy(ipd + ntuples, itup, sizeof(ItemPointerData));
+								ntuples++;
+							}
+							else
+							{
+								postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+								_bt_buildadd(wstate, state, postingtuple);
+								ntuples = 0;
+								pfree(ipd);
+							}
+						}
+
+					}
+					else
+					{
+						/* Tuples are not equal. Insert itupprev into index. */
+						if (ntuples == 0)
+							_bt_buildadd(wstate, state, itupprev);
+						else
+						{
+							postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+							_bt_buildadd(wstate, state, postingtuple);
+							ntuples = 0;
+							pfree(ipd);
+						}
+					}
+				}
+
+				/*
+				 * Copy the tuple into temp variable itupprev
+				 * to compare it with the following tuple
+				 * and maybe unite them into a posting tuple
+				 */
+				itupprev = CopyIndexTuple(itup);
+				if (should_free)
+					pfree(itup);
+
+				/* compute max size of ipd list */
+				maxpostingsize = maxitemsize - IndexInfoFindDataOffset(itupprev->t_info) - MAXALIGN(IndexTupleSize(itupprev));
+			}
+
+			/* Handle the last item.*/
+			if (ntuples == 0)
+			{
+				if (itupprev != NULL)
+					_bt_buildadd(wstate, state, itupprev);
+			}
+			else
+			{
+				Assert(ipd!=NULL);
+				Assert(itupprev != NULL);
+				postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+				_bt_buildadd(wstate, state, postingtuple);
+				ntuples = 0;
+				pfree(ipd);
+			}
 		}
 	}
 
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index c850b48..8c9dda1 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1821,7 +1821,9 @@ _bt_killitems(IndexScanDesc scan)
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			/* No microvacuum for posting tuples */
+			if (!BtreeTupleIsPosting(ituple)
+				&& (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
 			{
 				/* found the item */
 				ItemIdMarkDead(iid);
@@ -2063,3 +2065,69 @@ btoptions(Datum reloptions, bool validate)
 {
 	return default_reloptions(reloptions, validate, RELOPT_KIND_BTREE);
 }
+
+/*
+ * Already have basic index tuple that contains key datum
+ */
+IndexTuple
+BtreeFormPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd)
+{
+	uint32	   newsize;
+	IndexTuple itup = CopyIndexTuple(tuple);
+
+	/*
+	 * Determine and store offset to the posting list.
+	 */
+	newsize = IndexTupleSize(itup);
+	newsize = SHORTALIGN(newsize);
+
+	/*
+	 * Set meta info about the posting list.
+	 */
+	BtreeSetPostingOffset(itup, newsize);
+	BtreeSetNPosting(itup, nipd);
+	/*
+	 * Add space needed for posting list, if any.  Then check that the tuple
+	 * won't be too big to store.
+	 */
+	newsize += sizeof(ItemPointerData)*nipd;
+	newsize = MAXALIGN(newsize);
+
+	/*
+	 * Resize tuple if needed
+	 */
+	if (newsize != IndexTupleSize(itup))
+	{
+		itup = repalloc(itup, newsize);
+
+		/*
+		 * PostgreSQL 9.3 and earlier did not clear this new space, so we
+		 * might find uninitialized padding when reading tuples from disk.
+		 */
+		memset((char *) itup + IndexTupleSize(itup),
+			   0, newsize - IndexTupleSize(itup));
+		/* set new size in tuple header */
+		itup->t_info &= ~INDEX_SIZE_MASK;
+		itup->t_info |= newsize;
+	}
+
+	/*
+	 * Copy data into the posting tuple
+	 */
+	memcpy(BtreeGetPosting(itup), data, sizeof(ItemPointerData)*nipd);
+	return itup;
+}
+
+IndexTuple
+BtreeReformPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd)
+{
+	int size;
+	if (BtreeTupleIsPosting(tuple))
+	{
+		size = BtreeGetPostingOffset(tuple);
+		tuple->t_info &= ~INDEX_SIZE_MASK;
+		tuple->t_info |= size;
+	}
+
+	return BtreeFormPackedTuple(tuple, data, nipd);
+}
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 8350fa0..3dd19c0 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -138,7 +138,6 @@ typedef IndexAttributeBitMapData *IndexAttributeBitMap;
 	((int) ((BLCKSZ - SizeOfPageHeaderData) / \
 			(MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))))
 
-
 /* routines in indextuple.c */
 extern IndexTuple index_form_tuple(TupleDesc tupleDescriptor,
 				 Datum *values, bool *isnull);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 06822fa..dc82ce7 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -122,6 +122,15 @@ typedef struct BTMetaPageData
 				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
 
 /*
+ * If compression is applied, the page could contain more tuples
+ * than if it has only uncompressed tuples, so we need new max value.
+ * Note that it is a rough upper estimate.
+ */
+#define MaxPackedIndexTuplesPerPage	\
+	((int) ((BLCKSZ - SizeOfPageHeaderData) / \
+			(sizeof(ItemPointerData))))
+
+/*
  * The leaf-page fillfactor defaults to 90% but is user-adjustable.
  * For pages above the leaf level, we use a fixed 70% fillfactor.
  * The fillfactor is applied during index build and when splitting
@@ -538,6 +547,8 @@ typedef struct BTScanPosData
 	 * location in the associated tuple storage workspace.
 	 */
 	int			nextTupleOffset;
+	/* prevTupleOffset is for Posting list handling*/
+	int			prevTupleOffset;
 
 	/*
 	 * The items array is always ordered in index order (ie, increasing
@@ -550,7 +561,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxPackedIndexTuplesPerPage]; /* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -651,6 +662,27 @@ typedef BTScanOpaqueData *BTScanOpaque;
 #define SK_BT_DESC			(INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
 #define SK_BT_NULLS_FIRST	(INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
 
+
+/*
+ * We use our own ItemPointerGet(BlockNumber|OffsetNumber)
+ * to avoid Asserts, since sometimes the ip_posid isn't "valid"
+ */
+#define BtreeItemPointerGetBlockNumber(pointer) \
+	BlockIdGetBlockNumber(&(pointer)->ip_blkid)
+
+#define BtreeItemPointerGetOffsetNumber(pointer) \
+	((pointer)->ip_posid)
+
+#define BT_POSTING (1<<31)
+#define BtreeGetNPosting(itup)			BtreeItemPointerGetOffsetNumber(&(itup)->t_tid)
+#define BtreeSetNPosting(itup,n)		ItemPointerSetOffsetNumber(&(itup)->t_tid,n)
+
+#define BtreeGetPostingOffset(itup)		(BtreeItemPointerGetBlockNumber(&(itup)->t_tid) & (~BT_POSTING))
+#define BtreeSetPostingOffset(itup,n)	ItemPointerSetBlockNumber(&(itup)->t_tid,(n)|BT_POSTING)
+#define BtreeTupleIsPosting(itup)    	(BtreeItemPointerGetBlockNumber(&(itup)->t_tid) & BT_POSTING)
+#define BtreeGetPosting(itup)			(ItemPointerData*) ((char*)(itup) + BtreeGetPostingOffset(itup))
+#define BtreeGetPostingN(itup,n)		(ItemPointerData*) (BtreeGetPosting(itup) + n)
+
 /*
  * prototypes for functions in nbtree.c (external entry points for btree)
  */
@@ -684,6 +716,9 @@ extern bool _bt_doinsert(Relation rel, IndexTuple itup,
 			 IndexUniqueCheck checkUnique, Relation heapRel);
 extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, int access);
 extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
+extern void _bt_pgupdtup(Relation rel, Page page, OffsetNumber offset, IndexTuple itup, 
+						 bool concat, IndexTuple olditup, int nipd);
+extern bool _bt_isbinaryequal(TupleDesc itupdesc, IndexTuple itup, int nindatts, IndexTuple ituptoinsert);
 
 /*
  * prototypes for functions in nbtpage.c
@@ -715,8 +750,8 @@ extern BTStack _bt_search(Relation rel,
 extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
 			  ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
 			  int access);
-extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
-			ScanKey scankey, bool nextkey);
+extern OffsetNumber _bt_binsrch( Relation rel, Buffer buf, int keysz,
+								ScanKey scankey, bool nextkey);
 extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
 			Page page, OffsetNumber offnum);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
@@ -747,6 +782,8 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
 extern Size BTreeShmemSize(void);
 extern void BTreeShmemInit(void);
 extern bytea *btoptions(Datum reloptions, bool validate);
+extern IndexTuple BtreeFormPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd);
+extern IndexTuple BtreeReformPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd);
 
 /*
  * prototypes for functions in nbtvalidate.c

#29

David Steele

david@pgmasters.net

almost 10 years ago

In reply to: Anastasia Lubennikova (#28)

Re: [WIP] Effective storage of duplicates in B-tree index.

Hi Anastasia,

On 2/18/16 12:29 PM, Anastasia Lubennikova wrote:

18.02.2016 20:18, Anastasia Lubennikova:

04.02.2016 20:16, Peter Geoghegan:

On Fri, Jan 29, 2016 at 8:50 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

I fixed it in the new version (attached).

Thank you for the review.
At last, there is a new patch version 3.0. After some refactoring it
looks much better.
I described all details of the compression in this document
https://goo.gl/50O8Q0 (the same text without pictures is attached in
btc_readme_1.0.txt).
Consider it as a rough copy of readme. It contains some notes about
tricky moments of implementation and questions about future work.
Please don't hesitate to comment it.

Sorry, previous patch was dirty. Hotfix is attached.

This looks like an extremely valuable optimization for btree indexes but
unfortunately it is not getting a lot of attention. It still applies
cleanly for anyone interested in reviewing.

It's not clear to me that you answered all of Peter's questions in [1]/messages/by-id/CAM3SWZQ3_PLQCH4w7uQ8q_f2t4HEseKTr2n0rQ5pxA18OeRTJw@mail.gmail.com.
I understand that you've provided a README but it may not be clear if
the answers are in there (and where).

Also, at the end of the README it says:

13. Xlog. TODO.

Does that mean the patch is not yet complete?

Thanks,
--
-David
david@pgmasters.net

[1]: /messages/by-id/CAM3SWZQ3_PLQCH4w7uQ8q_f2t4HEseKTr2n0rQ5pxA18OeRTJw@mail.gmail.com
/messages/by-id/CAM3SWZQ3_PLQCH4w7uQ8q_f2t4HEseKTr2n0rQ5pxA18OeRTJw@mail.gmail.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#30

Anastasia Lubennikova

a.lubennikova@postgrespro.ru

almost 10 years ago

In reply to: David Steele (#29)

Re: [WIP] Effective storage of duplicates in B-tree index.

14.03.2016 16:02, David Steele:

Hi Anastasia,

On 2/18/16 12:29 PM, Anastasia Lubennikova wrote:

18.02.2016 20:18, Anastasia Lubennikova:

04.02.2016 20:16, Peter Geoghegan:

On Fri, Jan 29, 2016 at 8:50 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

I fixed it in the new version (attached).

Thank you for the review.
At last, there is a new patch version 3.0. After some refactoring it
looks much better.
I described all details of the compression in this document
https://goo.gl/50O8Q0 (the same text without pictures is attached in
btc_readme_1.0.txt).
Consider it as a rough copy of readme. It contains some notes about
tricky moments of implementation and questions about future work.
Please don't hesitate to comment it.

Sorry, previous patch was dirty. Hotfix is attached.

This looks like an extremely valuable optimization for btree indexes
but unfortunately it is not getting a lot of attention. It still
applies cleanly for anyone interested in reviewing.

Thank you for attention.
I would be indebted to all reviewers, who can just try this patch on
real data and workload (except WAL for now).
B-tree needs very much testing.

It's not clear to me that you answered all of Peter's questions in
[1]. I understand that you've provided a README but it may not be
clear if the answers are in there (and where).

I described in README all the points Peter asked.
But I see that it'd be better to answer directly.
Thanks for reminding, I'll do it tomorrow.

Also, at the end of the README it says:

13. Xlog. TODO.

Does that mean the patch is not yet complete?

Yes, you're right.
Frankly speaking, I supposed that someone will help me with that stuff,
but now I almost completed it. I'll send updated patch in the next letter.

I'm still doubtful about some patch details. I mentioned them in readme
(bold type).
But they are mostly about future improvements.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#31

Anastasia Lubennikova

a.lubennikova@postgrespro.ru

almost 10 years ago

In reply to: Anastasia Lubennikova (#30)

1 attachment(s)

Re: [WIP] Effective storage of duplicates in B-tree index.

Please, find the new version of the patch attached. Now it has WAL
functionality.

Detailed description of the feature you can find in README draft
https://goo.gl/50O8Q0

This patch is pretty complicated, so I ask everyone, who interested in
this feature,
to help with reviewing and testing it. I will be grateful for any feedback.
But please, don't complain about code style, it is still work in progress.

Next things I'm going to do:
1. More debugging and testing. I'm going to attach in next message
couple of sql scripts for testing.
2. Fix NULLs processing
3. Add a flag into pg_index, that allows to enable/disable compression
for each particular index.
4. Recheck locking considerations. I tried to write code as less
invasive as possible, but we need to make sure that algorithm is still
correct.
5. Change BTMaxItemSize
6. Bring back microvacuum functionality.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

btree_compression_4.0.patchtext/x-patch; name=btree_compression_4.0.patchDownload

diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index e3c55eb..72acc0f 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -24,6 +24,8 @@
 #include "storage/predicate.h"
 #include "utils/tqual.h"
 
+#include "catalog/catalog.h"
+#include "utils/datum.h"
 
 typedef struct
 {
@@ -82,6 +84,7 @@ static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
 static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
 			int keysz, ScanKey scankey);
+
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
 
 
@@ -113,6 +116,11 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	BTStack		stack;
 	Buffer		buf;
 	OffsetNumber offset;
+	Page 		page;
+	TupleDesc	itupdesc;
+	int			nipd;
+	IndexTuple 	olditup;
+	Size 		sizetoadd;
 
 	/* we need an insertion scan key to do our search, so build one */
 	itup_scankey = _bt_mkscankey(rel, itup);
@@ -190,6 +198,7 @@ top:
 
 	if (checkUnique != UNIQUE_CHECK_EXISTING)
 	{
+		bool updposting = false;
 		/*
 		 * The only conflict predicate locking cares about for indexes is when
 		 * an index tuple insert conflicts with an existing lock.  Since the
@@ -201,7 +210,42 @@ top:
 		/* do the insertion */
 		_bt_findinsertloc(rel, &buf, &offset, natts, itup_scankey, itup,
 						  stack, heapRel);
-		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+
+		/*
+		 * Decide, whether we can apply compression
+		 */
+		page = BufferGetPage(buf);
+
+		if(!IsSystemRelation(rel)
+			&& !rel->rd_index->indisunique
+			&& offset != InvalidOffsetNumber
+			&& offset <= PageGetMaxOffsetNumber(page))
+		{
+			itupdesc = RelationGetDescr(rel);
+			sizetoadd = sizeof(ItemPointerData);
+			olditup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offset));
+
+			if(_bt_isbinaryequal(itupdesc, olditup,
+									rel->rd_index->indnatts, itup))
+			{
+				if (!BtreeTupleIsPosting(olditup))
+				{
+					nipd = 1;
+					sizetoadd = sizetoadd*2;
+				}
+				else
+					nipd = BtreeGetNPosting(olditup);
+
+				if ((IndexTupleSize(olditup) + sizetoadd) <= BTMaxItemSize(page)
+					&& PageGetFreeSpace(page) > sizetoadd)
+					updposting = true;
+			}
+		}
+
+		if (updposting)
+			_bt_pgupdtup(rel, buf, offset, itup, olditup, nipd);
+		else
+			_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
 	}
 	else
 	{
@@ -1042,6 +1086,7 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+
 		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
 						false, false) == InvalidOffsetNumber)
 		{
@@ -1072,13 +1117,39 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 	}
-	if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
+
+	if (BtreeTupleIsPosting(item))
+	{
+		Size hikeysize =  BtreeGetPostingOffset(item);
+		IndexTuple hikey = palloc0(hikeysize);
+
+		/* Truncate posting before insert it as a hikey. */
+		memcpy (hikey, item, hikeysize);
+		hikey->t_info &= ~INDEX_SIZE_MASK;
+		hikey->t_info |= hikeysize;
+		ItemPointerSet(&(hikey->t_tid), origpagenumber, P_HIKEY);
+
+		if (PageAddItem(leftpage, (Item) hikey, hikeysize, leftoff,
 					false, false) == InvalidOffsetNumber)
+		{
+			memset(rightpage, 0, BufferGetPageSize(rbuf));
+			elog(ERROR, "failed to add hikey to the left sibling"
+				" while splitting block %u of index \"%s\"",
+				origpagenumber, RelationGetRelationName(rel));
+		}
+
+		pfree(hikey);
+	}
+	else
 	{
-		memset(rightpage, 0, BufferGetPageSize(rbuf));
-		elog(ERROR, "failed to add hikey to the left sibling"
-			 " while splitting block %u of index \"%s\"",
-			 origpagenumber, RelationGetRelationName(rel));
+		if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
+						false, false) == InvalidOffsetNumber)
+		{
+			memset(rightpage, 0, BufferGetPageSize(rbuf));
+			elog(ERROR, "failed to add hikey to the left sibling"
+				" while splitting block %u of index \"%s\"",
+				origpagenumber, RelationGetRelationName(rel));
+		}
 	}
 	leftoff = OffsetNumberNext(leftoff);
 
@@ -2103,6 +2174,120 @@ _bt_pgaddtup(Page page,
 }
 
 /*
+ * _bt_pgupdtup() -- update a tuple in place.
+ * This function is used for purposes of deduplication of item pointers.
+ *
+ * If new tuple to insert is equal to the tuple that already exists
+ * on the page, we can avoid key insertion and just add new item pointer.
+ *
+ * offset is the position of olditup on the page.
+ * itup is the new tuple to insert.
+ * olditup is the old tuple itself.
+ * nipd is the number of item pointers in old tuple.
+ * The caller is responsible for checking of free space on the page.
+ */
+void
+_bt_pgupdtup(Relation rel, Buffer buf, OffsetNumber offset, IndexTuple itup,
+			 IndexTuple olditup, int nipd)
+{
+	ItemPointerData *ipd;
+	IndexTuple 		newitup;
+	Size 			newitupsz;
+	Page			page;
+
+	page = BufferGetPage(buf);
+
+	ipd = palloc0(sizeof(ItemPointerData)*(nipd + 1));
+
+	/* copy item pointers from old tuple into ipd */
+	if (BtreeTupleIsPosting(olditup))
+		memcpy(ipd, BtreeGetPosting(olditup), sizeof(ItemPointerData)*nipd);
+	else
+		memcpy(ipd, olditup, sizeof(ItemPointerData));
+
+	/* add item pointer of the new tuple into ipd */
+	memcpy(ipd+nipd, itup, sizeof(ItemPointerData));
+
+	newitup = BtreeReformPackedTuple(itup, ipd, nipd+1);
+
+	/*
+	* Update the tuple in place. We have already checked that the
+	* new tuple would fit into this page, so it's safe to delete
+	* old tuple and insert the new one without any side effects.
+	*/
+	newitupsz = IndexTupleDSize(*newitup);
+	newitupsz = MAXALIGN(newitupsz);
+
+
+	START_CRIT_SECTION();
+
+	PageIndexTupleDelete(page, offset);
+
+	if (!_bt_pgaddtup(page, newitupsz, newitup, offset))
+		elog(ERROR, "failed to insert compressed item in index \"%s\"",
+			 RelationGetRelationName(rel));
+
+	MarkBufferDirty(buf);
+
+	/* Xlog stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_btree_insert xlrec;
+		uint8		xlinfo;
+		XLogRecPtr	recptr;
+		BTPageOpaque pageop = (BTPageOpaque) PageGetSpecialPointer(page);
+
+		xlrec.offnum = offset;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
+
+		/* TODO add some Xlog stuff for inner pages?
+		 * Don't sure if we really need it?*/
+		Assert(P_ISLEAF(pageop));
+		xlinfo = XLOG_BTREE_UPDATE_TUPLE;
+
+		/* Read comments in _bt_pgaddtup */
+		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+
+		XLogRegisterBufData(0, (char *) itup, IndexTupleDSize(*itup));
+
+		recptr = XLogInsert(RM_BTREE_ID, xlinfo);
+
+		PageSetLSN(page, recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	pfree(ipd);
+	pfree(newitup);
+	_bt_relbuf(rel, buf);
+}
+
+/*
+ * _bt_pgrewritetup() -- update a tuple in place.
+ * This function is used for handling compressed tuples.
+ * It is used to update compressed tuple after vacuuming.
+ * and to rewirite hikey while building index.
+ * offset is the position of olditup on the page.
+ * itup is the new tuple to insert
+ * The caller is responsible for checking of free space on the page.
+ */
+void
+_bt_pgrewritetup(Relation rel, Buffer buf, Page page, OffsetNumber offset, IndexTuple itup)
+{
+	START_CRIT_SECTION();
+
+	PageIndexTupleDelete(page, offset);
+
+	if (!_bt_pgaddtup(page, IndexTupleSize(itup), itup, offset))
+		elog(ERROR, "failed to rewrite compressed item in index \"%s\"",
+			 RelationGetRelationName(rel));
+
+	END_CRIT_SECTION();
+}
+
+/*
  * _bt_isequal - used in _bt_doinsert in check for duplicates.
  *
  * This is very similar to _bt_compare, except for NULL handling.
@@ -2151,6 +2336,63 @@ _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
 }
 
 /*
+ * _bt_isbinaryequal -  used in _bt_doinsert and _bt_load
+ * in check for duplicates. This is very similar to heap_tuple_attr_equals
+ * subroutine. And this function differs from _bt_isequal
+ * because here we require strict binary equality of tuples.
+ */
+bool
+_bt_isbinaryequal(TupleDesc itupdesc, IndexTuple itup,
+			int nindatts, IndexTuple ituptoinsert)
+{
+	AttrNumber	attno;
+
+	for (attno = 1; attno <= nindatts; attno++)
+	{
+		Datum		datum1,
+					datum2;
+		bool		isnull1,
+					isnull2;
+		Form_pg_attribute att;
+
+		datum1 = index_getattr(itup, attno, itupdesc, &isnull1);
+		datum2 = index_getattr(ituptoinsert, attno, itupdesc, &isnull2);
+
+		/*
+		 * If one value is NULL and other is not, then they are certainly not
+		 * equal
+		 */
+		if (isnull1 != isnull2)
+			return false;
+		/*
+		 * We do simple binary comparison of the two datums.  This may be overly
+		 * strict because there can be multiple binary representations for the
+		 * same logical value.  But we should be OK as long as there are no false
+		 * positives.  Using a type-specific equality operator is messy because
+		 * there could be multiple notions of equality in different operator
+		 * classes; furthermore, we cannot safely invoke user-defined functions
+		 * while holding exclusive buffer lock.
+		 */
+		if (attno <= 0)
+		{
+			/* The only allowed system columns are OIDs, so do this */
+			if (DatumGetObjectId(datum1) != DatumGetObjectId(datum2))
+				return false;
+		}
+		else
+		{
+			Assert(attno <= itupdesc->natts);
+			att = itupdesc->attrs[attno - 1];
+			if(!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+				return false;
+		}
+	}
+
+	/* if we get here, the keys are equal */
+	return true;
+}
+
+/*
  * _bt_vacuum_one_page - vacuum just one index page.
  *
  * Try to remove LP_DEAD items from the given page.  The passed buffer
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 67755d7..53c30d2 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -787,15 +787,36 @@ _bt_page_recyclable(Page page)
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
 					OffsetNumber *itemnos, int nitems,
+					OffsetNumber *remainingoffset, IndexTuple *remaining, int nremaining,
 					BlockNumber lastBlockVacuumed)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	int i;
+	Size itemsz;
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
-	/* Fix the page */
+	/* Handle compressed tuples here. */
+	for (i = 0; i < nremaining; i++)
+	{
+		/* At first, delete the old tuple.*/
+		PageIndexTupleDelete(page, remainingoffset[i]);
+
+		itemsz = IndexTupleSize(remaining[i]);
+		itemsz = MAXALIGN(itemsz);
+
+		/* Add tuple with remaining ItemPointers to the page.*/
+		if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "failed to rewrite compressed item in index while doing vacuum");
+	}
+
+	/* Fix the page.
+	 * After dealing with posting tuples,
+	 * just delete all tuples to be deleted.
+	 */
 	if (nitems > 0)
 		PageIndexMultiDelete(page, itemnos, nitems);
 
@@ -824,12 +845,28 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+		xlrec_vacuum.nremaining = nremaining;
+		xlrec_vacuum.ndeleted = nitems;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
 		XLogRegisterData((char *) &xlrec_vacuum, SizeOfBtreeVacuum);
 
 		/*
+		 * Here we should save offnums and remaining tuples themselves.
+		 * It's important to restore them in correct order.
+		 * At first, we must handle remaining tuples and only after that
+		 * other deleted items.
+		 */
+		if (nremaining > 0)
+		{
+			int i;
+			XLogRegisterBufData(0, (char *) remainingoffset, nremaining * sizeof(OffsetNumber));
+			for (i = 0; i < nremaining; i++)
+				XLogRegisterBufData(0, (char *) remaining[i], IndexTupleSize(remaining[i]));
+		}
+
+		/*
 		 * The target-offsets array is not in the buffer, but pretend that it
 		 * is.  When XLogInsert stores the whole buffer, the offsets array
 		 * need not be stored too.
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index f2905cb..39e125f 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -74,7 +74,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 			 BTCycleId cycleid);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 			 BlockNumber orig_blkno);
-
+static ItemPointer btreevacuumPosting(BTVacState *vstate,
+						ItemPointerData *items,int nitem, int *nremaining);
 
 /*
  * Btree handler function: return IndexAmRoutine with access method parameters
@@ -861,7 +862,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 								 RBM_NORMAL, info->strategy);
 		LockBufferForCleanup(buf);
 		_bt_checkpage(rel, buf);
-		_bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+		_bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0, vstate.lastBlockVacuumed);
 		_bt_relbuf(rel, buf);
 	}
 
@@ -962,6 +963,9 @@ restart:
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
+		IndexTuple	remaining[MaxOffsetNumber];
+		OffsetNumber remainingoffset[MaxOffsetNumber];
+		int			nremaining;
 
 		/*
 		 * Trade in the initial read lock for a super-exclusive write lock on
@@ -998,6 +1002,7 @@ restart:
 		 * callback function.
 		 */
 		ndeletable = 0;
+		nremaining = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
 		if (callback)
@@ -1011,31 +1016,75 @@ restart:
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
-
-				/*
-				 * During Hot Standby we currently assume that
-				 * XLOG_BTREE_VACUUM records do not produce conflicts. That is
-				 * only true as long as the callback function depends only
-				 * upon whether the index tuple refers to heap tuples removed
-				 * in the initial heap scan. When vacuum starts it derives a
-				 * value of OldestXmin. Backends taking later snapshots could
-				 * have a RecentGlobalXmin with a later xid than the vacuum's
-				 * OldestXmin, so it is possible that row versions deleted
-				 * after OldestXmin could be marked as killed by other
-				 * backends. The callback function *could* look at the index
-				 * tuple state in isolation and decide to delete the index
-				 * tuple, though currently it does not. If it ever did, we
-				 * would need to reconsider whether XLOG_BTREE_VACUUM records
-				 * should cause conflicts. If they did cause conflicts they
-				 * would be fairly harsh conflicts, since we haven't yet
-				 * worked out a way to pass a useful value for
-				 * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
-				 * applies to *any* type of index that marks index tuples as
-				 * killed.
-				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				if(BtreeTupleIsPosting(itup))
+				{
+					ItemPointer newipd;
+					int 		nipd,
+								nnewipd;
+
+					nipd = BtreeGetNPosting(itup);
+
+					/*
+					 * Delete from the posting list all ItemPointers
+					 * which are no more valid. newipd contains list of remainig
+					 * ItemPointers or NULL if none of the items need to be removed.
+					 */
+					newipd = btreevacuumPosting(vstate, BtreeGetPosting(itup), nipd, &nnewipd);
+
+					if (newipd != NULL)
+					{
+						if (nnewipd > 0)
+						{
+							/*
+							 * There are still some live tuples in the posting.
+							 * We should update this tuple in place. It'll be done later
+							 * in _bt_delitems_vacuum(). To do that we need to save
+							 * information about the tuple. remainingoffset - offset of the
+							 * old tuple to be deleted. And new tuple to insert on the same
+							 * position, which contains remaining ItemPointers.
+							 */
+							remainingoffset[nremaining] = offnum;
+							remaining[nremaining] = BtreeReformPackedTuple(itup, newipd, nnewipd);
+							nremaining++;
+						}
+						else
+						{
+							/*
+							 * If all ItemPointers should be deleted,
+							 * we can delete this tuple in a regular way.
+							 */
+							deletable[ndeletable++] = offnum;
+						}
+					}
+				}
+				else
+				{
+					htup = &(itup->t_tid);
+
+					/*
+					* During Hot Standby we currently assume that
+					* XLOG_BTREE_VACUUM records do not produce conflicts. That is
+					* only true as long as the callback function depends only
+					* upon whether the index tuple refers to heap tuples removed
+					* in the initial heap scan. When vacuum starts it derives a
+					* value of OldestXmin. Backends taking later snapshots could
+					* have a RecentGlobalXmin with a later xid than the vacuum's
+					* OldestXmin, so it is possible that row versions deleted
+					* after OldestXmin could be marked as killed by other
+					* backends. The callback function *could* look at the index
+					* tuple state in isolation and decide to delete the index
+					* tuple, though currently it does not. If it ever did, we
+					* would need to reconsider whether XLOG_BTREE_VACUUM records
+					* should cause conflicts. If they did cause conflicts they
+					* would be fairly harsh conflicts, since we haven't yet
+					* worked out a way to pass a useful value for
+					* latestRemovedXid on the XLOG_BTREE_VACUUM records. This
+					* applies to *any* type of index that marks index tuples as
+					* killed.
+					*/
+					if (callback(htup, callback_state))
+						deletable[ndeletable++] = offnum;
+				}
 			}
 		}
 
@@ -1043,7 +1092,7 @@ restart:
 		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
 		 * call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nremaining > 0)
 		{
 			BlockNumber	lastBlockVacuumed = InvalidBlockNumber;
 
@@ -1070,7 +1119,7 @@ restart:
 			 * doesn't seem worth the amount of bookkeeping it'd take to avoid
 			 * that.
 			 */
-			_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+			_bt_delitems_vacuum(rel, buf, deletable, ndeletable, remainingoffset, remaining, nremaining,
 								lastBlockVacuumed);
 
 			/*
@@ -1160,3 +1209,50 @@ btcanreturn(Relation index, int attno)
 {
 	return true;
 }
+
+/*
+ * btreevacuumPosting() -- vacuums a posting list.
+ * The size of the list must be specified via number of items (nitems).
+ *
+ * If none of the items need to be removed, returns NULL. Otherwise returns
+ * a new palloc'd array with the remaining items. The number of remaining
+ * items is returned via nremaining.
+ */
+ItemPointer
+btreevacuumPosting(BTVacState *vstate, ItemPointerData *items,
+				   int nitem, int *nremaining)
+{
+	int			i,
+				remaining = 0;
+	ItemPointer tmpitems = NULL;
+	IndexBulkDeleteCallback callback = vstate->callback;
+	void	   *callback_state = vstate->callback_state;
+
+	/*
+	 * Iterate over TIDs array
+	 */
+	for (i = 0; i < nitem; i++)
+	{
+		if (callback(items + i, callback_state))
+		{
+			if (!tmpitems)
+			{
+				/*
+				 * First TID to be deleted: allocate memory to hold the
+				 * remaining items.
+				 */
+				tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+				memcpy(tmpitems, items, sizeof(ItemPointerData) * i);
+			}
+		}
+		else
+		{
+			if (tmpitems)
+				tmpitems[remaining] = items[i];
+			remaining++;
+		}
+	}
+
+	*nremaining = remaining;
+	return tmpitems;
+}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 14dffe0..2cb1769 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -29,6 +29,8 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 			 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 			 OffsetNumber offnum, IndexTuple itup);
+static void _bt_savePostingitem(BTScanOpaque so, int itemIndex,
+			 OffsetNumber offnum, ItemPointer iptr, IndexTuple itup, int i);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static Buffer _bt_walk_left(Relation rel, Buffer buf);
 static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
@@ -1134,6 +1136,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	int			itemIndex;
 	IndexTuple	itup;
 	bool		continuescan;
+	int 		i;
 
 	/*
 	 * We must have the buffer pinned and locked, but the usual macro can't be
@@ -1168,6 +1171,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	/* initialize tuple workspace to empty */
 	so->currPos.nextTupleOffset = 0;
+	so->currPos.prevTupleOffset = 0;
 
 	/*
 	 * Now that the current page has been made consistent, the macro should be
@@ -1188,8 +1192,19 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (itup != NULL)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				if (BtreeTupleIsPosting(itup))
+				{
+					for (i = 0; i < BtreeGetNPosting(itup); i++)
+					{
+						_bt_savePostingitem(so, itemIndex, offnum, BtreeGetPostingN(itup, i), itup, i);
+						itemIndex++;
+					}
+				}
+				else
+				{
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
 			}
 			if (!continuescan)
 			{
@@ -1201,7 +1216,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			offnum = OffsetNumberNext(offnum);
 		}
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxPackedIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1209,7 +1224,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxPackedIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1219,8 +1234,20 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (itup != NULL)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				if (BtreeTupleIsPosting(itup))
+				{
+					for (i = 0; i < BtreeGetNPosting(itup); i++)
+					{
+						itemIndex--;
+						_bt_savePostingitem(so, itemIndex, offnum, BtreeGetPostingN(itup, i), itup, i);
+					}
+				}
+				else
+				{
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+
 			}
 			if (!continuescan)
 			{
@@ -1234,8 +1261,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxPackedIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxPackedIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1261,6 +1288,37 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 }
 
 /*
+ * Save an index item into so->currPos.items[itemIndex]
+ * Performing index-only scan, handle the first elem separately.
+ * Save the key once, and connect it with posting tids using tupleOffset.
+ */
+static void
+_bt_savePostingitem(BTScanOpaque so, int itemIndex,
+			 OffsetNumber offnum, ItemPointer iptr, IndexTuple itup, int i)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *iptr;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		if (i == 0)
+		{
+			/* save key. the same for all tuples in the posting */
+			Size itupsz = BtreeGetPostingOffset(itup);
+			currItem->tupleOffset = so->currPos.nextTupleOffset;
+			memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+			so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+			so->currPos.prevTupleOffset = currItem->tupleOffset;
+		}
+		else
+			currItem->tupleOffset = so->currPos.prevTupleOffset;
+	}
+}
+
+
+/*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
  * On entry, if so->currPos.buf is valid the buffer is pinned but not locked;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 99a014e..906b9df 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -75,7 +75,7 @@
 #include "utils/rel.h"
 #include "utils/sortsupport.h"
 #include "utils/tuplesort.h"
-
+#include "catalog/catalog.h"
 
 /*
  * Status record for spooling/sorting phase.  (Note we may have two of
@@ -136,6 +136,9 @@ static void _bt_sortaddtup(Page page, Size itemsize,
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 			 IndexTuple itup);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static SortSupport _bt_prepare_SortSupport(BTWriteState *wstate, int keysz);
+static int	_bt_call_comparator(SortSupport sortKeys, int i,
+				IndexTuple itup, IndexTuple itup2, TupleDesc tupdes);
 static void _bt_load(BTWriteState *wstate,
 		 BTSpool *btspool, BTSpool *btspool2);
 
@@ -527,15 +530,120 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(last_off > P_FIRSTKEY);
 		ii = PageGetItemId(opage, last_off);
 		oitup = (IndexTuple) PageGetItem(opage, ii);
-		_bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
 
 		/*
-		 * Move 'last' into the high key position on opage
+		 * If the item is PostingTuple, we can cut it, because HIKEY
+		 * is not considered as real data, and it need not to keep any
+		 * ItemPointerData at all. And of course it need not to keep
+		 * a list of ipd.
+		 * But, if it had a big posting list, there will be plenty of
+		 * free space on the opage. In that case we must split posting
+		 * tuple into 2 pieces.
 		 */
-		hii = PageGetItemId(opage, P_HIKEY);
-		*hii = *ii;
-		ItemIdSetUnused(ii);	/* redundant */
-		((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+		 if (BtreeTupleIsPosting(oitup))
+		 {
+			IndexTuple  keytup;
+			Size 		keytupsz;
+			int 		nipd,
+						ntocut,
+						ntoleave;
+
+			nipd = BtreeGetNPosting(oitup);
+			ntocut = (sizeof(ItemIdData) + BtreeGetPostingOffset(oitup))/sizeof(ItemPointerData);
+			ntocut++; /* round up to be sure that we cut enough */
+			ntoleave = nipd - ntocut;
+
+			/*
+			 * 0) Form key tuple, that doesn't contain any ipd.
+			 * NOTE: key tuple will have blkno & offset suitable for P_HIKEY.
+			 * any function that uses keytup should handle them itself.
+			 */
+			keytupsz =  BtreeGetPostingOffset(oitup);
+			keytup = palloc0(keytupsz);
+			memcpy (keytup, oitup, keytupsz);
+			keytup->t_info &= ~INDEX_SIZE_MASK;
+			keytup->t_info |= keytupsz;
+			ItemPointerSet(&(keytup->t_tid), oblkno, P_HIKEY);
+
+			if (ntocut < nipd)
+			{
+				ItemPointerData *newipd;
+				IndexTuple		newitup,
+								newlasttup;
+				/*
+				 * 1) Cut part of old tuple to shift to npage.
+				 * And insert it as P_FIRSTKEY.
+				 * This tuple is based on keytup.
+				 * Blkno & offnum are reset in BtreeFormPackedTuple.
+				 */
+				newipd = palloc0(sizeof(ItemPointerData)*ntocut);
+				/* Note, that we cut last 'ntocut' items */
+				memcpy(newipd, BtreeGetPosting(oitup)+ntoleave, sizeof(ItemPointerData)*ntocut);
+				newitup = BtreeFormPackedTuple(keytup, newipd, ntocut);
+
+				_bt_sortaddtup(npage, IndexTupleSize(newitup), newitup, P_FIRSTKEY);
+				pfree(newipd);
+				pfree(newitup);
+
+				/*
+				 * 2) set last item to the P_HIKEY linp
+				 * Move 'last' into the high key position on opage
+				 * NOTE: Do this because of indextuple deletion algorithm, which
+				 * doesn't allow to delete an item while we have unused one before it.
+				 */
+				hii = PageGetItemId(opage, P_HIKEY);
+				*hii = *ii;
+				ItemIdSetUnused(ii);	/* redundant */
+				((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+
+				/* 3) delete "wrong" high key, insert keytup as P_HIKEY. */
+				_bt_pgrewritetup(wstate->index, InvalidBuffer, opage, P_HIKEY, keytup);
+
+				/* 4) form the part of old tuple with ntoleave ipds. And insert it as last tuple. */
+				newlasttup = BtreeFormPackedTuple(keytup, BtreeGetPosting(oitup), ntoleave);
+
+				_bt_sortaddtup(opage, IndexTupleSize(newlasttup), newlasttup, PageGetMaxOffsetNumber(opage)+1);
+
+				pfree(newlasttup);
+			}
+			else
+			{
+				/* The tuple isn't big enough to split it. Handle it as a regular tuple. */
+
+				/*
+				 * 1) Shift the last tuple to npage.
+				 * Insert it as P_FIRSTKEY.
+				 */
+				_bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
+
+				/* 2) set last item to the P_HIKEY linp */
+				/* Move 'last' into the high key position on opage */
+				hii = PageGetItemId(opage, P_HIKEY);
+				*hii = *ii;
+				ItemIdSetUnused(ii);	/* redundant */
+				((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+
+				/* 3) delete "wrong" high key, insert keytup as P_HIKEY. */
+				_bt_pgrewritetup(wstate->index, InvalidBuffer, opage, P_HIKEY, keytup);
+
+			}
+			pfree(keytup);
+		 }
+		 else
+		 {
+			/*
+			 * 1) Shift the last tuple to npage.
+			 * Insert it as P_FIRSTKEY.
+			 */
+			_bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
+
+			/* 2) set last item to the P_HIKEY linp */
+			/* Move 'last' into the high key position on opage */
+			hii = PageGetItemId(opage, P_HIKEY);
+			*hii = *ii;
+			ItemIdSetUnused(ii);	/* redundant */
+			((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+		}
 
 		/*
 		 * Link the old page into its parent, using its minimum key. If we
@@ -547,6 +655,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 
 		Assert(state->btps_minkey != NULL);
 		ItemPointerSet(&(state->btps_minkey->t_tid), oblkno, P_HIKEY);
+
 		_bt_buildadd(wstate, state->btps_next, state->btps_minkey);
 		pfree(state->btps_minkey);
 
@@ -554,8 +663,12 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		 * Save a copy of the minimum key for the new page.  We have to copy
 		 * it off the old page, not the new one, in case we are not at leaf
 		 * level.
+		 * We can not just copy oitup, because it could be posting tuple
+		 * and it's more safe just to get new inserted hikey.
 		 */
-		state->btps_minkey = CopyIndexTuple(oitup);
+		ItemId iihk = PageGetItemId(opage, P_HIKEY);
+		IndexTuple hikey = (IndexTuple) PageGetItem(opage, iihk);
+		state->btps_minkey = CopyIndexTuple(hikey);
 
 		/*
 		 * Set the sibling links for both pages.
@@ -590,7 +703,29 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	if (last_off == P_HIKEY)
 	{
 		Assert(state->btps_minkey == NULL);
-		state->btps_minkey = CopyIndexTuple(itup);
+
+		if (BtreeTupleIsPosting(itup))
+		{
+			Size		keytupsz;
+			IndexTuple  keytup;
+
+			/*
+			 * 0) Form key tuple, that doesn't contain any ipd.
+			 * NOTE: key tuple will have blkno & offset suitable for P_HIKEY.
+			 * any function that uses keytup should handle them itself.
+			 */
+			keytupsz =  BtreeGetPostingOffset(itup);
+			keytup = palloc0(keytupsz);
+			memcpy (keytup, itup, keytupsz);
+
+			keytup->t_info &= ~INDEX_SIZE_MASK;
+			keytup->t_info |= keytupsz;
+			ItemPointerSet(&(keytup->t_tid), nblkno, P_HIKEY);
+
+			state->btps_minkey = CopyIndexTuple(keytup);
+		}
+		else
+			state->btps_minkey = CopyIndexTuple(itup);
 	}
 
 	/*
@@ -670,6 +805,71 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 }
 
 /*
+ * Prepare SortSupport structure for indextuples comparison
+ */
+static SortSupport
+_bt_prepare_SortSupport(BTWriteState *wstate, int keysz)
+{
+	ScanKey		indexScanKey;
+	SortSupport sortKeys;
+	int 		i;
+
+	/* Prepare SortSupport data for each column */
+	indexScanKey = _bt_mkscankey_nodata(wstate->index);
+	sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
+
+	for (i = 0; i < keysz; i++)
+	{
+		SortSupport sortKey = sortKeys + i;
+		ScanKey		scanKey = indexScanKey + i;
+		int16		strategy;
+
+		sortKey->ssup_cxt = CurrentMemoryContext;
+		sortKey->ssup_collation = scanKey->sk_collation;
+		sortKey->ssup_nulls_first =
+			(scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
+		sortKey->ssup_attno = scanKey->sk_attno;
+		/* Abbreviation is not supported here */
+		sortKey->abbreviate = false;
+
+		AssertState(sortKey->ssup_attno != 0);
+
+		strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
+			BTGreaterStrategyNumber : BTLessStrategyNumber;
+
+		PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
+	}
+
+	_bt_freeskey(indexScanKey);
+	return sortKeys;
+}
+
+/*
+ * Compare two tuples using sortKey on attribute i
+ */
+static int
+_bt_call_comparator(SortSupport sortKeys, int i,
+						 IndexTuple itup, IndexTuple itup2, TupleDesc tupdes)
+{
+		SortSupport entry;
+		Datum		attrDatum1,
+					attrDatum2;
+		bool		isNull1,
+					isNull2;
+		int32		compare;
+
+		entry = sortKeys + i - 1;
+		attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
+		attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+
+		compare = ApplySortComparator(attrDatum1, isNull1,
+										attrDatum2, isNull2,
+										entry);
+
+		return compare;
+}
+
+/*
  * Read tuples in correct sort order from tuplesort, and load them into
  * btree leaves.
  */
@@ -679,16 +879,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	BTPageState *state = NULL;
 	bool		merge = (btspool2 != NULL);
 	IndexTuple	itup,
-				itup2 = NULL;
+				itup2 = NULL,
+				itupprev = NULL;
 	bool		should_free,
 				should_free2,
 				load1;
 	TupleDesc	tupdes = RelationGetDescr(wstate->index);
 	int			i,
 				keysz = RelationGetNumberOfAttributes(wstate->index);
-	ScanKey		indexScanKey = NULL;
+	int			ntuples = 0;
 	SortSupport sortKeys;
 
+	/* Prepare SortSupport structure for indextuples comparison */
+	sortKeys = (SortSupport)_bt_prepare_SortSupport(wstate, keysz);
+
 	if (merge)
 	{
 		/*
@@ -701,34 +905,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 									   true, &should_free);
 		itup2 = tuplesort_getindextuple(btspool2->sortstate,
 										true, &should_free2);
-		indexScanKey = _bt_mkscankey_nodata(wstate->index);
-
-		/* Prepare SortSupport data for each column */
-		sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
-
-		for (i = 0; i < keysz; i++)
-		{
-			SortSupport sortKey = sortKeys + i;
-			ScanKey		scanKey = indexScanKey + i;
-			int16		strategy;
-
-			sortKey->ssup_cxt = CurrentMemoryContext;
-			sortKey->ssup_collation = scanKey->sk_collation;
-			sortKey->ssup_nulls_first =
-				(scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
-			sortKey->ssup_attno = scanKey->sk_attno;
-			/* Abbreviation is not supported here */
-			sortKey->abbreviate = false;
-
-			AssertState(sortKey->ssup_attno != 0);
-
-			strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
-				BTGreaterStrategyNumber : BTLessStrategyNumber;
-
-			PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
-		}
-
-		_bt_freeskey(indexScanKey);
 
 		for (;;)
 		{
@@ -742,20 +918,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			{
 				for (i = 1; i <= keysz; i++)
 				{
-					SortSupport entry;
-					Datum		attrDatum1,
-								attrDatum2;
-					bool		isNull1,
-								isNull2;
-					int32		compare;
-
-					entry = sortKeys + i - 1;
-					attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
-					attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
-
-					compare = ApplySortComparator(attrDatum1, isNull1,
-												  attrDatum2, isNull2,
-												  entry);
+					int32 compare = _bt_call_comparator(sortKeys, i, itup, itup2, tupdes);
+
 					if (compare > 0)
 					{
 						load1 = false;
@@ -794,16 +958,123 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	else
 	{
 		/* merge is unnecessary */
-		while ((itup = tuplesort_getindextuple(btspool->sortstate,
+		Relation indexRelation = wstate->index;
+		Form_pg_index index = indexRelation->rd_index;
+
+		if (IsSystemRelation(indexRelation) || index->indisunique)
+		{
+			/* Do not use compression. */
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
 											   true, &should_free)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+					state = _bt_pagestate(wstate, 0);
+
+				_bt_buildadd(wstate, state, itup);
+				if (should_free)
+					pfree(itup);
+			}
+		}
+		else
 		{
-			/* When we see first tuple, create first index page */
-			if (state == NULL)
-				state = _bt_pagestate(wstate, 0);
+			ItemPointerData *ipd = NULL;
+			IndexTuple 		postingtuple;
+			Size			maxitemsize = 0,
+							maxpostingsize = 0;
 
-			_bt_buildadd(wstate, state, itup);
-			if (should_free)
-				pfree(itup);
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+											   true, &should_free)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+				{
+					state = _bt_pagestate(wstate, 0);
+					maxitemsize = BTMaxItemSize(state->btps_page);
+				}
+
+				/*
+				 * Compare current tuple with previous one.
+				 * If tuples are equal, we can unite them into a posting list.
+				 */
+				if (itupprev != NULL)
+				{
+					if (_bt_isbinaryequal(tupdes, itupprev, index->indnatts, itup))
+					{
+						/* Tuples are equal. Create or update posting */
+						if (ntuples == 0)
+						{
+							/*
+							 * We haven't suitable posting list yet, so allocate
+							 * it and save both itupprev and current tuple.
+							 */
+							ipd = palloc0(maxitemsize);
+
+							memcpy(ipd, itupprev, sizeof(ItemPointerData));
+							ntuples++;
+							memcpy(ipd + ntuples, itup, sizeof(ItemPointerData));
+							ntuples++;
+						}
+						else
+						{
+							if ((ntuples+1)*sizeof(ItemPointerData) < maxpostingsize)
+							{
+								memcpy(ipd + ntuples, itup, sizeof(ItemPointerData));
+								ntuples++;
+							}
+							else
+							{
+								postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+								_bt_buildadd(wstate, state, postingtuple);
+								ntuples = 0;
+								pfree(ipd);
+							}
+						}
+
+					}
+					else
+					{
+						/* Tuples are not equal. Insert itupprev into index. */
+						if (ntuples == 0)
+							_bt_buildadd(wstate, state, itupprev);
+						else
+						{
+							postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+							_bt_buildadd(wstate, state, postingtuple);
+							ntuples = 0;
+							pfree(ipd);
+						}
+					}
+				}
+
+				/*
+				 * Copy the tuple into temp variable itupprev
+				 * to compare it with the following tuple
+				 * and maybe unite them into a posting tuple
+				 */
+				itupprev = CopyIndexTuple(itup);
+				if (should_free)
+					pfree(itup);
+
+				/* compute max size of ipd list */
+				maxpostingsize = maxitemsize - IndexInfoFindDataOffset(itupprev->t_info) - MAXALIGN(IndexTupleSize(itupprev));
+			}
+
+			/* Handle the last item.*/
+			if (ntuples == 0)
+			{
+				if (itupprev != NULL)
+					_bt_buildadd(wstate, state, itupprev);
+			}
+			else
+			{
+				Assert(ipd!=NULL);
+				Assert(itupprev != NULL);
+				postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+				_bt_buildadd(wstate, state, postingtuple);
+				ntuples = 0;
+				pfree(ipd);
+			}
 		}
 	}
 
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index b714b2c..53fcbcc 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1814,7 +1814,9 @@ _bt_killitems(IndexScanDesc scan)
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			/* No microvacuum for posting tuples */
+			if (!BtreeTupleIsPosting(ituple)
+				&& (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
 			{
 				/* found the item */
 				ItemIdMarkDead(iid);
@@ -2056,3 +2058,69 @@ btoptions(Datum reloptions, bool validate)
 {
 	return default_reloptions(reloptions, validate, RELOPT_KIND_BTREE);
 }
+
+/*
+ * Already have basic index tuple that contains key datum
+ */
+IndexTuple
+BtreeFormPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd)
+{
+	uint32	   newsize;
+	IndexTuple itup = CopyIndexTuple(tuple);
+
+	/*
+	 * Determine and store offset to the posting list.
+	 */
+	newsize = IndexTupleSize(itup);
+	newsize = SHORTALIGN(newsize);
+
+	/*
+	 * Set meta info about the posting list.
+	 */
+	BtreeSetPostingOffset(itup, newsize);
+	BtreeSetNPosting(itup, nipd);
+	/*
+	 * Add space needed for posting list, if any.  Then check that the tuple
+	 * won't be too big to store.
+	 */
+	newsize += sizeof(ItemPointerData)*nipd;
+	newsize = MAXALIGN(newsize);
+
+	/*
+	 * Resize tuple if needed
+	 */
+	if (newsize != IndexTupleSize(itup))
+	{
+		itup = repalloc(itup, newsize);
+
+		/*
+		 * PostgreSQL 9.3 and earlier did not clear this new space, so we
+		 * might find uninitialized padding when reading tuples from disk.
+		 */
+		memset((char *) itup + IndexTupleSize(itup),
+			   0, newsize - IndexTupleSize(itup));
+		/* set new size in tuple header */
+		itup->t_info &= ~INDEX_SIZE_MASK;
+		itup->t_info |= newsize;
+	}
+
+	/*
+	 * Copy data into the posting tuple
+	 */
+	memcpy(BtreeGetPosting(itup), data, sizeof(ItemPointerData)*nipd);
+	return itup;
+}
+
+IndexTuple
+BtreeReformPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd)
+{
+	int size;
+	if (BtreeTupleIsPosting(tuple))
+	{
+		size = BtreeGetPostingOffset(tuple);
+		tuple->t_info &= ~INDEX_SIZE_MASK;
+		tuple->t_info |= size;
+	}
+
+	return BtreeFormPackedTuple(tuple, data, nipd);
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 0d094ca..6ced76c 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -475,14 +475,40 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		if (len > 0)
 		{
-			OffsetNumber *unused;
-			OffsetNumber *unend;
+			OffsetNumber *offset;
+			IndexTuple 	remaining;
+			int 		i;
+			Size		itemsz;
 
-			unused = (OffsetNumber *) ptr;
-			unend = (OffsetNumber *) ((char *) ptr + len);
+			offset = (OffsetNumber *) ptr;
+			remaining = (IndexTuple)(ptr + xlrec->nremaining*sizeof(OffsetNumber));
+
+			/* Handle posting tuples */
+			for(i = 0; i < xlrec->nremaining; i++)
+			{
+				PageIndexTupleDelete(page, offset[i]);
+
+				itemsz = IndexTupleSize(remaining);
+				itemsz = MAXALIGN(itemsz);
+
+				if (PageAddItem(page, (Item) remaining, itemsz, offset[i],
+						false, false) == InvalidOffsetNumber)
+						elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
 
-			if ((unend - unused) > 0)
-				PageIndexMultiDelete(page, unused, unend - unused);
+				remaining = (IndexTuple)((char*) remaining + itemsz);
+			}
+
+			if (xlrec->ndeleted > 0)
+			{
+				OffsetNumber *unused;
+				OffsetNumber *unend;
+
+				unused = (OffsetNumber *) ((char *)remaining);
+				unend = (OffsetNumber *) ((char *) ptr + len);
+
+				if ((unend - unused) > 0)
+					PageIndexMultiDelete(page, unused, unend - unused);
+			}
 		}
 
 		/*
@@ -713,6 +739,75 @@ btree_xlog_delete(XLogReaderState *record)
 		UnlockReleaseBuffer(buffer);
 }
 
+/*
+ * Applies changes performed by _bt_pgupdtup().
+ * TODO Add some stuff for inner pages. Don't sure if we really need it?
+ * See comment in _bt_pgupdtup().
+ */
+static void
+btree_xlog_update(bool isleaf, XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_btree_insert *xlrec = (xl_btree_insert *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		Size		datalen;
+		char	   *datapos = XLogRecGetBlockData(record, 0, &datalen);
+		ItemPointerData *ipd;
+		IndexTuple 		olditup,
+						newitup;
+		Size 			newitupsz;
+		int				nipd;
+
+		/*TODO Following code needs some refactoring. Maybe one more function.*/
+		page = BufferGetPage(buffer);
+
+		olditup = (IndexTuple) PageGetItem(page, PageGetItemId(page, xlrec->offnum));
+
+		if (!BtreeTupleIsPosting(olditup))
+			nipd = 1;
+		else
+			nipd = BtreeGetNPosting(olditup);
+
+		ipd = palloc0(sizeof(ItemPointerData)*(nipd + 1));
+
+		/* copy item pointers from old tuple into ipd */
+		if (BtreeTupleIsPosting(olditup))
+			memcpy(ipd, BtreeGetPosting(olditup), sizeof(ItemPointerData)*nipd);
+		else
+			memcpy(ipd, olditup, sizeof(ItemPointerData));
+
+		/* add item pointer of the new tuple into ipd */
+		memcpy(ipd+nipd, (Item) datapos, sizeof(ItemPointerData));
+
+		newitup = BtreeReformPackedTuple((Item) datapos, ipd, nipd+1);
+
+		/*
+		* Update the tuple in place. We have already checked that the
+		* new tuple would fit into this page, so it's safe to delete
+		* old tuple and insert the new one without any side effects.
+		*/
+		newitupsz = IndexTupleDSize(*newitup);
+		newitupsz = MAXALIGN(newitupsz);
+
+		PageIndexTupleDelete(page, xlrec->offnum);
+
+		if (PageAddItem(page, (Item) newitup, newitupsz, xlrec->offnum,
+						false, false) == InvalidOffsetNumber)
+			elog(PANIC, "failed to update compressed tuple while doing recovery");
+
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
+
 static void
 btree_xlog_mark_page_halfdead(uint8 info, XLogReaderState *record)
 {
@@ -988,6 +1083,9 @@ btree_redo(XLogReaderState *record)
 		case XLOG_BTREE_INSERT_META:
 			btree_xlog_insert(false, true, record);
 			break;
+		case XLOG_BTREE_UPDATE_TUPLE:
+			btree_xlog_update(true, record);
+			break;
 		case XLOG_BTREE_SPLIT_L:
 			btree_xlog_split(true, false, record);
 			break;
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 8350fa0..3dd19c0 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -138,7 +138,6 @@ typedef IndexAttributeBitMapData *IndexAttributeBitMap;
 	((int) ((BLCKSZ - SizeOfPageHeaderData) / \
 			(MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))))
 
-
 /* routines in indextuple.c */
 extern IndexTuple index_form_tuple(TupleDesc tupleDescriptor,
 				 Datum *values, bool *isnull);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 9046b16..5496e94 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -122,6 +122,15 @@ typedef struct BTMetaPageData
 				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
 
 /*
+ * If compression is applied, the page could contain more tuples
+ * than if it has only uncompressed tuples, so we need new max value.
+ * Note that it is a rough upper estimate.
+ */
+#define MaxPackedIndexTuplesPerPage	\
+	((int) ((BLCKSZ - SizeOfPageHeaderData) / \
+			(sizeof(ItemPointerData))))
+
+/*
  * The leaf-page fillfactor defaults to 90% but is user-adjustable.
  * For pages above the leaf level, we use a fixed 70% fillfactor.
  * The fillfactor is applied during index build and when splitting
@@ -226,6 +235,7 @@ typedef struct BTMetaPageData
 										 * vacuum */
 #define XLOG_BTREE_REUSE_PAGE	0xD0	/* old page is about to be reused from
 										 * FSM */
+#define XLOG_BTREE_UPDATE_TUPLE	0xE0	/* update index tuple in place */
 
 /*
  * All that we need to regenerate the meta-data page
@@ -348,15 +358,31 @@ typedef struct xl_btree_reuse_page
  *
  * Note that the *last* WAL record in any vacuum of an index is allowed to
  * have a zero length array of offsets. Earlier records must have at least one.
+ * TODO: update this comment
  */
 typedef struct xl_btree_vacuum
 {
 	BlockNumber lastBlockVacuumed;
 
-	/* TARGET OFFSET NUMBERS FOLLOW */
+	/* 
+	 * This filed helps us to find beginning of the remining tuples
+	 * which follow array of offset numbers.
+	 */
+	int			nremaining;
+
+	/*
+	 * TODO: Don't sure if we do need following variable,
+	 * maybe just a flag would be enough to determine
+	 * if there is some data about deleted tuples
+	 */
+	int			ndeleted;
+
+	/* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+	/* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
+	/* TARGET OFFSET NUMBERS FOLLOW (if any) */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber) + 2*sizeof(int))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.
@@ -538,6 +564,8 @@ typedef struct BTScanPosData
 	 * location in the associated tuple storage workspace.
 	 */
 	int			nextTupleOffset;
+	/* prevTupleOffset is for Posting list handling*/
+	int			prevTupleOffset;
 
 	/*
 	 * The items array is always ordered in index order (ie, increasing
@@ -550,7 +578,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxPackedIndexTuplesPerPage]; /* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -650,6 +678,27 @@ typedef BTScanOpaqueData *BTScanOpaque;
 #define SK_BT_DESC			(INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
 #define SK_BT_NULLS_FIRST	(INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
 
+
+/*
+ * We use our own ItemPointerGet(BlockNumber|OffsetNumber)
+ * to avoid Asserts, since sometimes the ip_posid isn't "valid"
+ */
+#define BtreeItemPointerGetBlockNumber(pointer) \
+	BlockIdGetBlockNumber(&(pointer)->ip_blkid)
+
+#define BtreeItemPointerGetOffsetNumber(pointer) \
+	((pointer)->ip_posid)
+
+#define BT_POSTING (1<<31)
+#define BtreeGetNPosting(itup)			BtreeItemPointerGetOffsetNumber(&(itup)->t_tid)
+#define BtreeSetNPosting(itup,n)		ItemPointerSetOffsetNumber(&(itup)->t_tid,n)
+
+#define BtreeGetPostingOffset(itup)		(BtreeItemPointerGetBlockNumber(&(itup)->t_tid) & (~BT_POSTING))
+#define BtreeSetPostingOffset(itup,n)	ItemPointerSetBlockNumber(&(itup)->t_tid,(n)|BT_POSTING)
+#define BtreeTupleIsPosting(itup)    	(BtreeItemPointerGetBlockNumber(&(itup)->t_tid) & BT_POSTING)
+#define BtreeGetPosting(itup)			(ItemPointerData*) ((char*)(itup) + BtreeGetPostingOffset(itup))
+#define BtreeGetPostingN(itup,n)		(ItemPointerData*) (BtreeGetPosting(itup) + n)
+
 /*
  * prototypes for functions in nbtree.c (external entry points for btree)
  */
@@ -683,6 +732,10 @@ extern bool _bt_doinsert(Relation rel, IndexTuple itup,
 			 IndexUniqueCheck checkUnique, Relation heapRel);
 extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, int access);
 extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
+extern void _bt_pgupdtup(Relation rel, Buffer buf, OffsetNumber offset, IndexTuple itup,
+			 IndexTuple olditup, int nipd);
+extern void _bt_pgrewritetup(Relation rel, Buffer buf, Page page, OffsetNumber offset, IndexTuple itup);
+extern bool _bt_isbinaryequal(TupleDesc itupdesc, IndexTuple itup, int nindatts, IndexTuple ituptoinsert);
 
 /*
  * prototypes for functions in nbtpage.c
@@ -702,6 +755,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
 					OffsetNumber *itemnos, int nitems, Relation heapRel);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
 					OffsetNumber *itemnos, int nitems,
+					OffsetNumber *remainingoffset,
+					IndexTuple *remaining, int nremaining,
 					BlockNumber lastBlockVacuumed);
 extern int	_bt_pagedel(Relation rel, Buffer buf);
 
@@ -714,8 +769,8 @@ extern BTStack _bt_search(Relation rel,
 extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
 			  ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
 			  int access);
-extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
-			ScanKey scankey, bool nextkey);
+extern OffsetNumber _bt_binsrch( Relation rel, Buffer buf, int keysz,
+								ScanKey scankey, bool nextkey);
 extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
 			Page page, OffsetNumber offnum);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
@@ -746,6 +801,8 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
 extern Size BTreeShmemSize(void);
 extern void BTreeShmemInit(void);
 extern bytea *btoptions(Datum reloptions, bool validate);
+extern IndexTuple BtreeFormPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd);
+extern IndexTuple BtreeReformPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd);
 
 /*
  * prototypes for functions in nbtvalidate.c

#32

Alexandr Popov

a.popov@postgrespro.ru

almost 10 years ago

In reply to: Anastasia Lubennikova (#31)

Re: [WIP] Effective storage of duplicates in B-tree index.

On 18.03.2016 20:19, Anastasia Lubennikova wrote:

Please, find the new version of the patch attached. Now it has WAL
functionality.

Detailed description of the feature you can find in README draft
https://goo.gl/50O8Q0

This patch is pretty complicated, so I ask everyone, who interested in
this feature,
to help with reviewing and testing it. I will be grateful for any
feedback.
But please, don't complain about code style, it is still work in
progress.

Next things I'm going to do:
1. More debugging and testing. I'm going to attach in next message
couple of sql scripts for testing.
2. Fix NULLs processing
3. Add a flag into pg_index, that allows to enable/disable compression
for each particular index.
4. Recheck locking considerations. I tried to write code as less
invasive as possible, but we need to make sure that algorithm is still
correct.
5. Change BTMaxItemSize
6. Bring back microvacuum functionality.

Hi, hackers.

It's my first review, so do not be strict to me.

I have tested this patch on the next table:
create table message
(
id serial,
usr_id integer,
text text
);
CREATE INDEX message_usr_id ON message (usr_id);
The table has 10000000 records.

I found the following:
The less unique keys the less size of the table.

Next 2 tablas demonstrates it.
New B-tree
Count of unique keys (usr_id), indexï¿½s size , time of creation
10000000 ;"214 MB" ;"00:00:34.193441"
3333333 ;"214 MB" ;"00:00:45.731173"
2000000 ;"129 MB" ;"00:00:41.445876"
1000000 ;"129 MB" ;"00:00:38.455616"
100000 ;"86 MB" ;"00:00:40.887626"
10000 ;"79 MB" ;"00:00:47.199774"

Old B-tree
Count of unique keys (usr_id), indexï¿½s size , time of creation
10000000 ;"214 MB" ;"00:00:35.043677"
3333333 ;"286 MB" ;"00:00:40.922845"
2000000 ;"300 MB" ;"00:00:46.454846"
1000000 ;"278 MB" ;"00:00:42.323525"
100000 ;"287 MB" ;"00:00:47.438132"
10000 ;"280 MB" ;"00:01:00.307873"

I inserted data randomly and sequentially, it did not influence the
index's size.
Time of select, insert and update random rows is not changed. It is
great, but certainly it needs some more detailed study.

Alexander Popov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#33

Robert Haas

robertmhaas@gmail.com

almost 10 years ago

In reply to: Anastasia Lubennikova (#31)

Re: [WIP] Effective storage of duplicates in B-tree index.

On Fri, Mar 18, 2016 at 1:19 PM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

Please, find the new version of the patch attached. Now it has WAL
functionality.

Detailed description of the feature you can find in README draft
https://goo.gl/50O8Q0

This patch is pretty complicated, so I ask everyone, who interested in this
feature,
to help with reviewing and testing it. I will be grateful for any feedback.
But please, don't complain about code style, it is still work in progress.

Next things I'm going to do:
1. More debugging and testing. I'm going to attach in next message couple of
sql scripts for testing.
2. Fix NULLs processing
3. Add a flag into pg_index, that allows to enable/disable compression for
each particular index.
4. Recheck locking considerations. I tried to write code as less invasive as
possible, but we need to make sure that algorithm is still correct.
5. Change BTMaxItemSize
6. Bring back microvacuum functionality.

I really like this idea, and the performance results seem impressive,
but I think we should push this out to 9.7. A btree patch that didn't
have WAL support until two and a half weeks into the final CommitFest
just doesn't seem to me like a good candidate. First, as a general
matter, if a patch isn't code-complete at the start of a CommitFest,
it's reasonable to say that it should be reviewed but not necessarily
committed in that CommitFest. This patch has had some review, but I'm
not sure how deep that review is, and I think it's had no code review
at all of the WAL logging changes, which were submitted only a week
ago, well after the CF deadline. Second, the btree AM is a
particularly poor place to introduce possibly destabilizing changes.
Everybody depends on it, all the time, for everything. And despite
new tools like amcheck, it's not a particularly easy thing to debug.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#34

Alexander Korotkov

a.korotkov@postgrespro.ru

almost 10 years ago

In reply to: Robert Haas (#33)

Re: [WIP] Effective storage of duplicates in B-tree index.

On Thu, Mar 24, 2016 at 5:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Mar 18, 2016 at 1:19 PM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

Please, find the new version of the patch attached. Now it has WAL
functionality.

Detailed description of the feature you can find in README draft
https://goo.gl/50O8Q0

This patch is pretty complicated, so I ask everyone, who interested in

this

feature,
to help with reviewing and testing it. I will be grateful for any

feedback.

But please, don't complain about code style, it is still work in

progress.

Next things I'm going to do:
1. More debugging and testing. I'm going to attach in next message

couple of

sql scripts for testing.
2. Fix NULLs processing
3. Add a flag into pg_index, that allows to enable/disable compression

for

each particular index.
4. Recheck locking considerations. I tried to write code as less

invasive as

possible, but we need to make sure that algorithm is still correct.
5. Change BTMaxItemSize
6. Bring back microvacuum functionality.

I really like this idea, and the performance results seem impressive,
but I think we should push this out to 9.7. A btree patch that didn't
have WAL support until two and a half weeks into the final CommitFest
just doesn't seem to me like a good candidate. First, as a general
matter, if a patch isn't code-complete at the start of a CommitFest,
it's reasonable to say that it should be reviewed but not necessarily
committed in that CommitFest. This patch has had some review, but I'm
not sure how deep that review is, and I think it's had no code review
at all of the WAL logging changes, which were submitted only a week
ago, well after the CF deadline. Second, the btree AM is a
particularly poor place to introduce possibly destabilizing changes.
Everybody depends on it, all the time, for everything. And despite
new tools like amcheck, it's not a particularly easy thing to debug.

It's all true. But:
1) It's a great feature many users dream about.
2) Patch is not very big.
3) Patch doesn't introduce significant infrastructural changes. It just
change some well-isolated placed.

Let's give it a chance. I've signed as additional reviewer and I'll do my
best in spotting all possible issues in this patch.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#35

Peter Geoghegan

pg@heroku.com

almost 10 years ago

In reply to: Robert Haas (#33)

Re: [WIP] Effective storage of duplicates in B-tree index.

On Thu, Mar 24, 2016 at 7:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I really like this idea, and the performance results seem impressive,
but I think we should push this out to 9.7. A btree patch that didn't
have WAL support until two and a half weeks into the final CommitFest
just doesn't seem to me like a good candidate. First, as a general
matter, if a patch isn't code-complete at the start of a CommitFest,
it's reasonable to say that it should be reviewed but not necessarily
committed in that CommitFest. This patch has had some review, but I'm
not sure how deep that review is, and I think it's had no code review
at all of the WAL logging changes, which were submitted only a week
ago, well after the CF deadline. Second, the btree AM is a
particularly poor place to introduce possibly destabilizing changes.
Everybody depends on it, all the time, for everything. And despite
new tools like amcheck, it's not a particularly easy thing to debug.

Regrettably, I must agree. I don't see a plausible path to commit for
this patch in the ongoing CF.

I think that Anastasia did an excellent job here, and I wish I could
have been of greater help sooner. Nevertheless, it would be unwise to
commit this given the maturity of the code. There have been very few
instances of performance improvements to the B-Tree code for as long
as I've been interested, because it's so hard, and the standard is so
high. The only example I can think of from the last few years is
Kevin's commit 2ed5b87f96 and Tom's commit 1a77f8b63d both of which
were far less invasive, and Simon's commit c7111d11b1, which we just
outright reverted from 9.5 due to subtle bugs (and even that was
significantly less invasive than this patch). Improving nbtree is
something that requires several rounds of expert review, and that's
something that's in short supply for the B-Tree code in particular. I
think that a new testing strategy is needed to make this easier, and I
hope to get that going with amcheck. I need help with formalizing a
"testing first" approach for improving the B-Tree code, because I
think it's the only way that we can move forward with projects like
this. It's *incredibly* hard to push forward patches like this given
our current, limited testing strategy.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#36

Jim Nasby

Jim.Nasby@BlueTreble.com

almost 10 years ago

In reply to: Alexander Korotkov (#34)

Re: [WIP] Effective storage of duplicates in B-tree index.

On 3/24/16 10:21 AM, Alexander Korotkov wrote:

1) It's a great feature many users dream about.

Doesn't matter if it starts eating their data...

2) Patch is not very big.
3) Patch doesn't introduce significant infrastructural changes. It just
change some well-isolated placed.

It doesn't really matter how big the patch is, it's a question of "What
did the patch fail to consider?". With something as complicated as the
btree code, there's ample opportunities for missing things. (And FWIW,
I'd argue that a 51kB patch is certainly not small, and a patch that is
doing things in critical sections isn't terribly isolated).

I do think this will be a great addition, but it's just too late to be
adding this to 9.6.

(BTW, I'm getting bounces from a.lebedev@postgrespro.ru, as well as
postmaster@. I emailed info@postgrespro.ru about this but never heard back.)
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#37

Anastasia Lubennikova

a.lubennikova@postgrespro.ru

almost 10 years ago

In reply to: Peter Geoghegan (#35)

Re: [WIP] Effective storage of duplicates in B-tree index.

25.03.2016 01:12, Peter Geoghegan:

On Thu, Mar 24, 2016 at 7:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I really like this idea, and the performance results seem impressive,
but I think we should push this out to 9.7. A btree patch that didn't
have WAL support until two and a half weeks into the final CommitFest
just doesn't seem to me like a good candidate. First, as a general
matter, if a patch isn't code-complete at the start of a CommitFest,
it's reasonable to say that it should be reviewed but not necessarily
committed in that CommitFest.

You're right.
Frankly, I thought that someone will help me with the path, but I had to
finish it myself.
*off-topic*
I wonder, if we can add new flag to commitfest. Something like "Needs
assistance",
which will be used to mark big and complicated patches in progress.
While "Needs review" means that the patch is almost ready and only
requires the final review.

This patch has had some review, but I'm
not sure how deep that review is, and I think it's had no code review
at all of the WAL logging changes, which were submitted only a week
ago, well after the CF deadline. Second, the btree AM is a
particularly poor place to introduce possibly destabilizing changes.
Everybody depends on it, all the time, for everything. And despite
new tools like amcheck, it's not a particularly easy thing to debug.

Regrettably, I must agree. I don't see a plausible path to commit for
this patch in the ongoing CF.

I think that Anastasia did an excellent job here, and I wish I could
have been of greater help sooner. Nevertheless, it would be unwise to
commit this given the maturity of the code. There have been very few
instances of performance improvements to the B-Tree code for as long
as I've been interested, because it's so hard, and the standard is so
high. The only example I can think of from the last few years is
Kevin's commit 2ed5b87f96 and Tom's commit 1a77f8b63d both of which
were far less invasive, and Simon's commit c7111d11b1, which we just
outright reverted from 9.5 due to subtle bugs (and even that was
significantly less invasive than this patch). Improving nbtree is
something that requires several rounds of expert review, and that's
something that's in short supply for the B-Tree code in particular. I
think that a new testing strategy is needed to make this easier, and I
hope to get that going with amcheck. I need help with formalizing a
"testing first" approach for improving the B-Tree code, because I
think it's the only way that we can move forward with projects like
this. It's *incredibly* hard to push forward patches like this given
our current, limited testing strategy.

Unfortunately, I must agree. This patch seems to be far from final
version until the feature freeze.
I'll move it to the future commitfest.

Anyway it means, that now we have more time to improve the patch.
If you have any ideas related to this patch like prefix/suffix
compression, I'll be glad to discuss them.
Same for any other ideas of B-tree optimization.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#38

Claudio Freire

klaussfreire@gmail.com

almost 10 years ago

In reply to: Peter Geoghegan (#35)

Re: [WIP] Effective storage of duplicates in B-tree index.

On Thu, Mar 24, 2016 at 7:12 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Thu, Mar 24, 2016 at 7:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I really like this idea, and the performance results seem impressive,
but I think we should push this out to 9.7. A btree patch that didn't
have WAL support until two and a half weeks into the final CommitFest
just doesn't seem to me like a good candidate. First, as a general
matter, if a patch isn't code-complete at the start of a CommitFest,
it's reasonable to say that it should be reviewed but not necessarily
committed in that CommitFest. This patch has had some review, but I'm
not sure how deep that review is, and I think it's had no code review
at all of the WAL logging changes, which were submitted only a week
ago, well after the CF deadline. Second, the btree AM is a
particularly poor place to introduce possibly destabilizing changes.
Everybody depends on it, all the time, for everything. And despite
new tools like amcheck, it's not a particularly easy thing to debug.

Regrettably, I must agree. I don't see a plausible path to commit for
this patch in the ongoing CF.

I think that Anastasia did an excellent job here, and I wish I could
have been of greater help sooner. Nevertheless, it would be unwise to
commit this given the maturity of the code. There have been very few
instances of performance improvements to the B-Tree code for as long
as I've been interested, because it's so hard, and the standard is so
high. The only example I can think of from the last few years is
Kevin's commit 2ed5b87f96 and Tom's commit 1a77f8b63d both of which
were far less invasive, and Simon's commit c7111d11b1, which we just
outright reverted from 9.5 due to subtle bugs (and even that was
significantly less invasive than this patch). Improving nbtree is
something that requires several rounds of expert review, and that's
something that's in short supply for the B-Tree code in particular. I
think that a new testing strategy is needed to make this easier, and I
hope to get that going with amcheck. I need help with formalizing a
"testing first" approach for improving the B-Tree code, because I
think it's the only way that we can move forward with projects like
this. It's *incredibly* hard to push forward patches like this given
our current, limited testing strategy.

I've been toying (having gotten nowhere concrete really) with prefix
compression myself, I agree that messing with btree code is quite
harder than it ought to be.

Perhaps trying experimental format changes in a separate experimental
am wouldn't be all that bad (say, nxbtree?). People could opt-in to
those, by creating the indexes with nxbtree instead of plain btree
(say in development environments) and get some testing going without
risking much.

Normally the same effect should be achievable with mere flags, but
since format changes to btree tend to be rather invasive, ensuring the
patch doesn't change behavior with the flag off is hard as well, hence
the wholly separate am idea.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#39

Heikki Linnakangas

hlinnaka@iki.fi

over 9 years ago

In reply to: Anastasia Lubennikova (#31)

Re: [WIP] Effective storage of duplicates in B-tree index.

On 18/03/16 19:19, Anastasia Lubennikova wrote:

Please, find the new version of the patch attached. Now it has WAL
functionality.

Detailed description of the feature you can find in README draft
https://goo.gl/50O8Q0

This patch is pretty complicated, so I ask everyone, who interested in
this feature,
to help with reviewing and testing it. I will be grateful for any feedback.
But please, don't complain about code style, it is still work in progress.

Next things I'm going to do:
1. More debugging and testing. I'm going to attach in next message
couple of sql scripts for testing.
2. Fix NULLs processing
3. Add a flag into pg_index, that allows to enable/disable compression
for each particular index.
4. Recheck locking considerations. I tried to write code as less
invasive as possible, but we need to make sure that algorithm is still
correct.
5. Change BTMaxItemSize
6. Bring back microvacuum functionality.

I think we should pack the TIDs more tightly, like GIN does with the
varbyte encoding. It's tempting to commit this without it for now, and
add the compression later, but I'd like to avoid having to deal with
multiple binary-format upgrades, so let's figure out the final on-disk
format that we want, right from the beginning.

It would be nice to reuse the varbyte encoding code from GIN, but we
might not want to use that exact scheme for B-tree. Firstly, an
important criteria when we designed GIN's encoding scheme was to avoid
expanding on-disk size for any data set, which meant that a TID had to
always be encoded in 6 bytes or less. We don't have that limitation with
B-tree, because in B-tree, each item is currently stored as a separate
IndexTuple, which is much larger. So we are free to choose an encoding
scheme that's better at packing some values, at the expense of using
more bytes for other values, if we want to. Some analysis on what we
want would be nice. (It's still important that removing a TID from the
list never makes the list larger, for VACUUM.)

Secondly, to be able to just always enable this feature, without a GUC
or reloption, we might need something that's faster for random access
than GIN's posting lists. Or can we just add the setting, but it would
be nice to have some more analysis on the worst-case performance before
we decide on that.

I find the macros in nbtree.h in the patch quite confusing. They're
similar to what we did in GIN, but again we might want to choose
differently here. So some discussion on the desired IndexTuple layout is
in order. (One clear bug is that using the high bit of BlockNumber for
the BT_POSTING flag will fail for a table larger than 2^31 blocks.)

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#40

Peter Geoghegan

pg@heroku.com

about 9 years ago

In reply to: Heikki Linnakangas (#39)

Re: [WIP] Effective storage of duplicates in B-tree index.

On Mon, Jul 4, 2016 at 2:30 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I think we should pack the TIDs more tightly, like GIN does with the varbyte
encoding. It's tempting to commit this without it for now, and add the
compression later, but I'd like to avoid having to deal with multiple
binary-format upgrades, so let's figure out the final on-disk format that we
want, right from the beginning.

While the idea of duplicate storage is pretty obviously compelling,
there could be other, non-obvious benefits. I think that it could
bring further benefits if we could use duplicate storage to change
this property of nbtree (this is from the README):

"""
Lehman and Yao assume that the key range for a subtree S is described
by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
page. This does not work for nonunique keys (for example, if we have
enough equal keys to spread across several leaf pages, there *must* be
some equal bounding keys in the first level up). Therefore we assume
Ki <= v <= Ki+1 instead. A search that finds exact equality to a
bounding key in an upper tree level must descend to the left of that
key to ensure it finds any equal keys in the preceding page. An
insertion that sees the high key of its target page is equal to the key
to be inserted has a choice whether or not to move right, since the new
key could go on either page. (Currently, we try to find a page where
there is room for the new key without a split.)

"""

If we could *guarantee* that all keys in the index are unique, then we
could maintain the keyspace as L&Y originally described.

The practical benefits to this would be:

* We wouldn't need to take the extra step described above -- finding a
bounding key/separator key that's fully equal to our scankey would no
longer necessitate a probably-useless descent to the left of that key.
(BTW, I wonder if we could get away with not inserting a downlink into
parent when a leaf page split finds an identical IndexTuple in parent,
*without* changing the keyspace invariant I mention -- if we're always
going to go to the left of an equal-to-scankey key in an internal
page, why even have more than one?)

* This would make suffix truncation of internal index tuples easier,
and that's important.

The traditional reason why suffix truncation is important is that it
can keep the tree a lot shorter than it would otherwise be. These
days, that might not seem that important, because even if you have
twice the number of internal pages than strictly necessary, that still
isn't that many relative to typical main memory size (and even CPU
cache sizes, perhaps).

The reason I think it's important these days is that not having suffix
truncation makes our "separator keys" overly prescriptive about what
part of the keyspace is owned by each internal page. With a pristine
index (following REINDEX), this doesn't matter much. But, I think that
we get much bigger problems with index bloat due to the poor fan-out
that we sometimes see due to not having suffix truncation, *combined*
with the page deletion algorithms restriction on deleting internal
pages (it can only be done for internal pages with *no* children).

Adding another level or two to the B-Tree makes it so that your
workload's "sparse deletion patterns" really don't need to be that
sparse in order to bloat the B-Tree badly, necessitating a REINDEX to
get back to acceptable performance (VACUUM won't do it). To avoid
this, we should make the internal pages represent the key space in the
least restrictive way possible, by applying suffix truncation so that
it's much more likely that things will *stay* balanced as churn
occurs. This is probably a really bad problem with things like
composite indexes over text columns, or indexes with many NULL values.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#41

Anastasia Lubennikova

a.lubennikova@postgrespro.ru

over 6 years ago

In reply to: Anastasia Lubennikova (#31)

2 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

The new version of the patch is attached.
This version is even simpler than the previous one,
thanks to the recent btree design changes and all the feedback I received.
I consider it ready for review and testing.

[feature overview]
This patch implements the deduplication of btree non-pivot tuples on
leaf pages
in a manner similar to GIN index "posting lists".

Non-pivot posting tuple has following format:
t_tid | t_info | key values | posting_list[]

Where t_tid and t_info fields are used to store meta info
about tuple's posting list.
posting list is an array of ItemPointerData.

Currently, compression is applied to all indexes except system indexes,
unique
indexes, and indexes with included columns.

On insertion, compression applied not to each tuple, but to the page before
split. If the target page is full, we try to compress it.

[benchmark results]
idx ON tbl(c1);
index contains 10000000 integer values

i - number of distinct values in the index.
So i=1 means that all rows have the same key,
and i=10000000 means that all keys are different.

i / old size (MB) / new size (MB)
1ï¿½ï¿½ï¿½ ï¿½ï¿½ï¿½ ï¿½ï¿½ï¿½ 215ï¿½ï¿½ï¿½ 88
1000ï¿½ï¿½ï¿½ ï¿½ï¿½ï¿½ 215ï¿½ï¿½ï¿½ 90
100000ï¿½ï¿½ï¿½ ï¿½ï¿½ï¿½ 215ï¿½ï¿½ï¿½ 71
10000000ï¿½ï¿½ï¿½ 214ï¿½ï¿½ï¿½ 214

For more, see the attached diagram with test results.

[future work]
Many things can be improved in this feature.
Personally, I'd prefer to keep this patch as small as possible
and work on other improvements after a basic part is committed.
Though, I understand that some of these can be considered essential
for this patch to be approved.

1. Implement a split of the posting tuples on a page split.
2. Implement microvacuum of posting tuples.
3. Add a flag into pg_index, which allows enabling/disabling compression
for a particular index.
4. Implement posting list compression.

--
Anastasia Lubennikova
Postgres Professional:http://www.postgrespro.com
The Russian Postgres Company

Attachments:

btree_compression_pg12_v1.patchtext/x-patch; name=btree_compression_pg12_v1.patchDownload

diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 602f884..fce499b 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -20,6 +20,7 @@
 #include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xloginsert.h"
+#include "catalog/catalog.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
@@ -56,6 +57,8 @@ static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 						 OffsetNumber itup_off);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static bool insert_itupprev_to_page(Page page, BTCompressState *compressState);
+static void _bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel);
 
 /*
  *	_bt_doinsert() -- Handle insertion of a single index tuple in the tree.
@@ -759,6 +762,12 @@ _bt_findinsertloc(Relation rel,
 			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
 			insertstate->bounds_valid = false;
 		}
+
+		/*
+		 * If the target page is full, try to compress the page
+		 */
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
+			_bt_compress_one_page(rel, insertstate->buf, heapRel);
 	}
 	else
 	{
@@ -806,6 +815,11 @@ _bt_findinsertloc(Relation rel,
 			}
 
 			/*
+			 * Before considering moving right, try to compress the page
+			 */
+			_bt_compress_one_page(rel, insertstate->buf, heapRel);
+
+			/*
 			 * Nope, so check conditions (b) and (c) enumerated above
 			 *
 			 * The earlier _bt_check_unique() call may well have established a
@@ -2286,3 +2300,232 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * the page.
 	 */
 }
+
+/*
+ * Add new item (compressed or not) to the page, while compressing it.
+ * If insertion failed, return false.
+ * Caller should consider this as compression failure and
+ * leave page uncompressed.
+ */
+static bool
+insert_itupprev_to_page(Page page, BTCompressState *compressState)
+{
+	IndexTuple to_insert;
+	OffsetNumber offnum =  PageGetMaxOffsetNumber(page);
+
+	if (compressState->ntuples == 0)
+		to_insert = compressState->itupprev;
+	else
+	{
+		IndexTuple postingtuple;
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+											compressState->ipd,
+											compressState->ntuples);
+		to_insert = postingtuple;
+		pfree(compressState->ipd);
+	}
+
+	/* Add the new item into the page */
+	offnum = OffsetNumberNext(offnum);
+
+	elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d IndexTupleSize %zu free %zu",
+			 compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+
+	if (PageAddItem(page, (Item) to_insert, IndexTupleSize(to_insert),
+					offnum, false, false) == InvalidOffsetNumber)
+	{
+		elog(DEBUG4, "insert_itupprev_to_page. failed");
+		/*
+		 * this may happen if tuple is bigger than freespace
+		 * fallback to uncompressed page case
+		 */
+		if (compressState->ntuples > 0)
+			pfree(to_insert);
+		return false;
+	}
+
+	if (compressState->ntuples > 0)
+		pfree(to_insert);
+	compressState->ntuples = 0;
+	return true;
+}
+
+/*
+ * Before splitting the page, try to compress items to free some space.
+ * If compression didn't succeed, buffer will contain old state of the page.
+ * This function should be called after lp_dead items
+ * were removed by _bt_vacuum_one_page().
+ */
+static void
+_bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buffer);
+	Page		newpage;
+	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	bool use_compression = false;
+	BTCompressState *compressState = NULL;
+	int n_posting_on_page = 0;
+	int natts = IndexRelationGetNumberOfAttributes(rel);
+
+	/*
+	 * Don't use compression for indexes with INCLUDEd columns,
+	 * system indexes and unique indexes.
+	 */
+	use_compression = ((IndexRelationGetNumberOfKeyAttributes(rel) ==
+									  IndexRelationGetNumberOfAttributes(rel))
+									  && (!IsSystemRelation(rel))
+									  && (!rel->rd_index->indisunique));
+	if (!use_compression)
+		return;
+
+	/* init compress state needed to build posting tuples */
+	compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+	compressState->ipd = NULL;
+	compressState->ntuples = 0;
+	compressState->itupprev = NULL;
+	compressState->maxitemsize = BTMaxItemSize(page);
+	compressState->maxpostingsize = 0;
+
+	/*
+	 * Scan over all items to see which ones can be compressed
+	 */
+	minoff = P_FIRSTDATAKEY(opaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/*
+	 * Heuristic to avoid trying to compress page
+	 * that has already contain mostly compressed items
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId	itemid = PageGetItemId(page, P_HIKEY);
+		IndexTuple item = (IndexTuple) PageGetItem(page, itemid);
+
+		if (BTreeTupleIsPosting(item))
+			n_posting_on_page++;
+	}
+	/*
+	 * If we have only 10 uncompressed items on the full page,
+	 * it probably won't worth to compress them.
+	 */
+	if (maxoff - n_posting_on_page < 10)
+		return;
+
+	newpage = PageGetTempPageCopySpecial(page);
+	elog(DEBUG4, "_bt_compress_one_page rel: %s,blkno: %u",
+		 RelationGetRelationName(rel), BufferGetBlockNumber(buffer));
+
+	/* Copy High Key if any */
+	if (!P_RIGHTMOST(opaque))
+	{
+		ItemId	itemid = PageGetItemId(page, P_HIKEY);
+		Size itemsz = ItemIdGetLength(itemid);
+		IndexTuple item = (IndexTuple) PageGetItem(page, itemid);
+
+		if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+						false, false) == InvalidOffsetNumber)
+		{
+			/*
+			 * Should never happen. Anyway, fallback gently to scenario of
+			 * incompressible page and just return from function.
+			 */
+			elog(DEBUG4, "_bt_compress_one_page. failed to insert highkey to newpage");
+			return;
+		}
+	}
+
+	/* Iterate over tuples on the page, try to compress them into posting lists
+	 * and insert into new page.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemId = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemId);
+
+		/*
+		 * We do not expect to meet any DEAD items, since this
+		 * function is called right after _bt_vacuum_one_page().
+		 * If for some reason we found dead item, don't compress it,
+		 * to allow upcoming microvacuum or vacuum clean it up.
+		 */
+		if(ItemIdIsDead(itemId))
+			continue;
+
+		if (compressState->itupprev != NULL)
+		{
+			int n_equal_atts = _bt_keep_natts_fast(rel,
+														compressState->itupprev, itup);
+			int itup_ntuples = BTreeTupleIsPosting(itup)?BTreeTupleGetNPosting(itup):1;
+
+			if (n_equal_atts > natts)
+			{
+				/* Tuples are equal. Create or update posting. */
+				if (compressState->maxitemsize >
+					MAXALIGN(((IndexTupleSize(compressState->itupprev)
+							   + (compressState->ntuples + itup_ntuples+1)*sizeof(ItemPointerData)))))
+					add_item_to_posting(compressState, itup);
+				else
+					/* If posting is too big, insert it on page and continue.*/
+					if (!insert_itupprev_to_page(newpage, compressState))
+					{
+						elog(DEBUG4, "_bt_compress_one_page. failed to insert posting");
+						return;
+					}
+			}
+			else
+			{
+				/*
+				 * Tuples are not equal. Insert itupprev into index.
+				 * Save current tuple for the next iteration.
+				 */
+				if (!insert_itupprev_to_page(newpage, compressState))
+				{
+					elog(DEBUG4, "_bt_compress_one_page. failed to insert posting");
+					return;
+				}
+			}
+		}
+
+		/*
+		 * Copy the tuple into temp variable itupprev
+		 * to compare it with the following tuple
+		 * and maybe unite them into a posting tuple
+		 */
+		if (compressState->itupprev)
+			pfree(compressState->itupprev);
+		compressState->itupprev = CopyIndexTuple(itup);
+
+		Assert(IndexTupleSize(compressState->itupprev) <=  compressState->maxitemsize);
+	}
+
+	/* Handle the last item.*/
+	if (!insert_itupprev_to_page(newpage, compressState))
+	{
+		elog(DEBUG4, "_bt_compress_one_page. failed to insert posting for last item");
+		return;
+	}
+
+	START_CRIT_SECTION();
+	PageRestoreTempPage(newpage, page);
+	MarkBufferDirty(buffer);
+
+	/* Log full page write */
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr recptr;
+		recptr = log_newpage_buffer(buffer, true);
+		PageSetLSN(page, recptr);
+	}
+	END_CRIT_SECTION();
+
+	elog(DEBUG4, "_bt_compress_one_page. success");
+	return;
+}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index de4d4ef..681077f 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1024,14 +1024,54 @@ _bt_page_recyclable(Page page)
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
 					OffsetNumber *itemnos, int nitems,
+					OffsetNumber *remainingoffset,
+					IndexTuple *remaining, int nremaining,
 					BlockNumber lastBlockVacuumed)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	int		i;
+	Size	itemsz;
+	Size	remaining_sz = 0;
+	char   *remaining_buf = NULL;
+
+	/* XLOG stuff, buffer for remainings */
+	if (nremaining && RelationNeedsWAL(rel))
+	{
+		Size offset = 0;
+
+		for (i = 0; i < nremaining; i++)
+			remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+		remaining_buf = palloc0(remaining_sz);
+		for (i = 0; i < nremaining; i++)
+		{
+			itemsz = IndexTupleSize(remaining[i]);
+			memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+			offset += MAXALIGN(itemsz);
+		}
+		Assert(offset == remaining_sz);
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
+	/* Handle posting tuples here */
+	for (i = 0; i < nremaining; i++)
+	{
+		/* At first, delete the old tuple.*/
+		PageIndexTupleDelete(page, remainingoffset[i]);
+
+		itemsz = IndexTupleSize(remaining[i]);
+		itemsz = MAXALIGN(itemsz);
+
+		/* Add tuple with remaining ItemPointers to the page.*/
+		if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+							false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to rewrite compressed item in index while doing vacuum");
+	}
+
+
 	/* Fix the page */
 	if (nitems > 0)
 		PageIndexMultiDelete(page, itemnos, nitems);
@@ -1061,6 +1101,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+		xlrec_vacuum.nremaining = nremaining;
+		xlrec_vacuum.ndeleted = nitems;
+
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1074,6 +1117,20 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		if (nitems > 0)
 			XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
 
+		/*
+		 * Here we should save offnums and remaining tuples themselves.
+		 * It's important to restore them in correct order.
+		 * At first, we must handle remaining tuples and only after that
+		 * other deleted items.
+		 */
+		if (nremaining > 0)
+		{
+			Assert(remaining_buf != NULL);
+			XLogRegisterBufData(0, (char *) remainingoffset,
+								nremaining * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, remaining_buf, remaining_sz);
+		}
+
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
 		PageSetLSN(page, recptr);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 85e54ac..5a7d7bd 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,8 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
-
-
+static ItemPointer btreevacuumPosting(BTVacState *vstate,
+									  IndexTuple itup, int *nremaining);
 /*
  * Btree handler function: return IndexAmRoutine with access method parameters
  * and callbacks.
@@ -1069,7 +1069,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 								 RBM_NORMAL, info->strategy);
 		LockBufferForCleanup(buf);
 		_bt_checkpage(rel, buf);
-		_bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+		_bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0, vstate.lastBlockVacuumed);
 		_bt_relbuf(rel, buf);
 	}
 
@@ -1193,6 +1193,9 @@ restart:
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
+		IndexTuple	remaining[MaxOffsetNumber];
+		OffsetNumber remainingoffset[MaxOffsetNumber];
+		int			nremaining;
 
 		/*
 		 * Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1232,7 @@ restart:
 		 * callback function.
 		 */
 		ndeletable = 0;
+		nremaining = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
 		if (callback)
@@ -1242,31 +1246,77 @@ restart:
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
-				/*
-				 * During Hot Standby we currently assume that
-				 * XLOG_BTREE_VACUUM records do not produce conflicts. That is
-				 * only true as long as the callback function depends only
-				 * upon whether the index tuple refers to heap tuples removed
-				 * in the initial heap scan. When vacuum starts it derives a
-				 * value of OldestXmin. Backends taking later snapshots could
-				 * have a RecentGlobalXmin with a later xid than the vacuum's
-				 * OldestXmin, so it is possible that row versions deleted
-				 * after OldestXmin could be marked as killed by other
-				 * backends. The callback function *could* look at the index
-				 * tuple state in isolation and decide to delete the index
-				 * tuple, though currently it does not. If it ever did, we
-				 * would need to reconsider whether XLOG_BTREE_VACUUM records
-				 * should cause conflicts. If they did cause conflicts they
-				 * would be fairly harsh conflicts, since we haven't yet
-				 * worked out a way to pass a useful value for
-				 * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
-				 * applies to *any* type of index that marks index tuples as
-				 * killed.
-				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				if (BTreeTupleIsPosting(itup))
+				{
+					int 		nnewipd = 0;
+					ItemPointer newipd = NULL;
+
+					newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+					if (nnewipd == 0)
+					{
+						/*
+						 * All TIDs from posting list must be deleted,
+						 * we can delete whole tuple in a regular way.
+						 */
+						deletable[ndeletable++] = offnum;
+					}
+					else if (nnewipd == BTreeTupleGetNPosting(itup))
+					{
+						/*
+						 * All TIDs from posting tuple must remain.
+						 * Do nothing, just cleanup.
+						 */
+						pfree(newipd);
+					}
+					else if (nnewipd < BTreeTupleGetNPosting(itup))
+					{
+						/* Some TIDs from posting tuple must remain. */
+						Assert(nnewipd > 0);
+						Assert(newipd != NULL);
+
+						/*
+						 * Form new tuple that contains only remaining TIDs.
+						 * Remember this tuple and the offset of the old tuple
+						 * to update it in place.
+						 */
+						remainingoffset[nremaining] = offnum;
+						remaining[nremaining] = BTreeFormPostingTuple(itup, newipd, nnewipd);
+						nremaining++;
+						pfree(newipd);
+
+						Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+					}
+				}
+				else
+				{
+					htup = &(itup->t_tid);
+
+					/*
+					* During Hot Standby we currently assume that
+					* XLOG_BTREE_VACUUM records do not produce conflicts. That is
+					* only true as long as the callback function depends only
+					* upon whether the index tuple refers to heap tuples removed
+					* in the initial heap scan. When vacuum starts it derives a
+					* value of OldestXmin. Backends taking later snapshots could
+					* have a RecentGlobalXmin with a later xid than the vacuum's
+					* OldestXmin, so it is possible that row versions deleted
+					* after OldestXmin could be marked as killed by other
+					* backends. The callback function *could* look at the index
+					* tuple state in isolation and decide to delete the index
+					* tuple, though currently it does not. If it ever did, we
+					* would need to reconsider whether XLOG_BTREE_VACUUM records
+					* should cause conflicts. If they did cause conflicts they
+					* would be fairly harsh conflicts, since we haven't yet
+					* worked out a way to pass a useful value for
+					* latestRemovedXid on the XLOG_BTREE_VACUUM records. This
+					* applies to *any* type of index that marks index tuples as
+					* killed.
+					*/
+					if (callback(htup, callback_state))
+						deletable[ndeletable++] = offnum;
+				}
 			}
 		}
 
@@ -1274,7 +1324,7 @@ restart:
 		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
 		 * call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nremaining > 0)
 		{
 			/*
 			 * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1341,7 @@ restart:
 			 * that.
 			 */
 			_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+								remainingoffset, remaining, nremaining,
 								vstate->lastBlockVacuumed);
 
 			/*
@@ -1376,6 +1427,43 @@ restart:
 }
 
 /*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+	int			i,
+				remaining	= 0;
+	int			nitem		= BTreeTupleGetNPosting(itup);
+	ItemPointer tmpitems	= NULL,
+				items		= BTreeTupleGetPosting(itup);
+
+	/*
+	 * Check each tuple in the posting list,
+	 * save alive tuples into tmpitems
+	 */
+	for (i = 0; i < nitem; i++)
+	{
+		if (vstate->callback(items + i, vstate->callback_state))
+			continue;
+
+		if (tmpitems == NULL)
+			tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+		tmpitems[remaining++] = items[i];
+	}
+
+	*nremaining = remaining;
+	return tmpitems;
+}
+
+/*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
  * btrees always do, so this is trivial.
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c655dad..594936d 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -30,6 +30,8 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static void _bt_savePostingitem(BTScanOpaque so, int itemIndex,
+			OffsetNumber offnum, ItemPointer iptr, IndexTuple itup, int i);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -1410,6 +1412,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	int			itemIndex;
 	bool		continuescan;
 	int			indnatts;
+	int 		i;
 
 	/*
 	 * We must have the buffer pinned and locked, but the usual macro can't be
@@ -1456,6 +1459,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	/* initialize tuple workspace to empty */
 	so->currPos.nextTupleOffset = 0;
+	so->currPos.prevTupleOffset = 0;
 
 	/*
 	 * Now that the current page has been made consistent, the macro should be
@@ -1490,8 +1494,22 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				if (BTreeTupleIsPosting(itup))
+				{
+					for (i = 0; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savePostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											itup, i);
+						itemIndex++;
+					}
+				}
+				else
+				{
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1524,7 +1542,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1532,7 +1550,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxPostingIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1574,8 +1592,22 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (passes_quals && tuple_alive)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				if (BTreeTupleIsPosting(itup))
+				{
+					for (i = 0; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						itemIndex--;
+						_bt_savePostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											itup, i);
+					}
+				}
+				else
+				{
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+
 			}
 			if (!continuescan)
 			{
@@ -1589,8 +1621,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1603,6 +1635,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1615,6 +1649,34 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 	}
 }
 
+/* Save an index item into so->currPos.items[itemIndex] for posting tuples. */
+static void
+_bt_savePostingitem(BTScanOpaque so, int itemIndex,
+			 OffsetNumber offnum, ItemPointer iptr, IndexTuple itup, int i)
+{
+	BTScanPosItem *currItem =  &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *iptr;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		if (i == 0)
+		{
+			/* save key. the same for all tuples in the posting */
+			Size itupsz = BTreeTupleGetPostingOffset(itup);
+
+			currItem->tupleOffset = so->currPos.nextTupleOffset;
+			memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+			so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+			so->currPos.prevTupleOffset = currItem->tupleOffset;
+		}
+		else
+			currItem->tupleOffset = so->currPos.prevTupleOffset;
+	}
+}
+
+
 /*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
@@ -2221,6 +2283,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 
 	/* OK, itemIndex says what to return */
 	currItem = &so->currPos.items[so->currPos.itemIndex];
+
 	scan->xs_heaptid = currItem->heapTid;
 	if (scan->xs_want_itup)
 		scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index d0b9013..59f702b 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -65,6 +65,7 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "catalog/catalog.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "miscadmin.h"
@@ -76,6 +77,7 @@
 #include "utils/tuplesort.h"
 
 
+
 /* Magic numbers for parallel state sharing */
 #define PARALLEL_KEY_BTREE_SHARED		UINT64CONST(0xA000000000000001)
 #define PARALLEL_KEY_TUPLESORT			UINT64CONST(0xA000000000000002)
@@ -288,6 +290,8 @@ static void _bt_sortaddtup(Page page, Size itemsize,
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 						 IndexTuple itup);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void insert_itupprev_to_page_buildadd(BTWriteState *wstate,
+						BTPageState *state, BTCompressState *compressState);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
 static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -972,6 +976,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * only shift the line pointer array back and forth, and overwrite
 			 * the tuple space previously occupied by oitup.  This is fairly
 			 * cheap.
+			 *
+			 * If lastleft tuple was a posting tuple,
+			 * we'll truncate its posting list in _bt_truncate as well.
+			 * Note that it is also applicable only to leaf pages,
+			 * since internal pages never contain posting tuples.
 			 */
 			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
 			lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1011,6 +1020,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		 * the minimum key for the new page.
 		 */
 		state->btps_minkey = CopyIndexTuple(oitup);
+		Assert(!BTreeTupleIsPosting(state->btps_minkey));
 
 		/*
 		 * Set the sibling links for both pages.
@@ -1050,8 +1060,35 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	if (last_off == P_HIKEY)
 	{
 		Assert(state->btps_minkey == NULL);
-		state->btps_minkey = CopyIndexTuple(itup);
-		/* _bt_sortaddtup() will perform full truncation later */
+
+		/*
+		 * Stashed copy must be a non-posting tuple,
+		 * with truncated posting list and correct t_tid
+		 * since we're going to use it to build downlink.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			Size		keytupsz;
+			IndexTuple  keytup;
+
+			/*
+			 * Form key tuple, that doesn't contain any ipd.
+			 * NOTE: since we'll need TID later, set t_tid to
+			 * the first t_tid from posting list.
+			 */
+			keytupsz =  BTreeTupleGetPostingOffset(itup);
+			keytup = palloc0(keytupsz);
+			memcpy(keytup, itup, keytupsz);
+
+			keytup->t_info &= ~INDEX_SIZE_MASK;
+			keytup->t_info |= keytupsz;
+			ItemPointerCopy(BTreeTupleGetPosting(itup), &keytup->t_tid);
+			state->btps_minkey = CopyIndexTuple(keytup);
+			pfree(keytup);
+		}
+		else
+			state->btps_minkey = CopyIndexTuple(itup);		/* _bt_sortaddtup() will perform full truncation later */
+
 		BTreeTupleSetNAtts(state->btps_minkey, 0);
 	}
 
@@ -1137,6 +1174,87 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 }
 
 /*
+ * Add new tuple (posting or non-posting) to the page, while building index.
+ */
+void
+insert_itupprev_to_page_buildadd(BTWriteState *wstate, BTPageState *state,
+								 BTCompressState *compressState)
+{
+	IndexTuple to_insert;
+
+	/* Return, if there is no tuple to insert */
+	if (state == NULL)
+		return;
+
+	if (compressState->ntuples == 0)
+		to_insert = compressState->itupprev;
+	else
+	{
+		IndexTuple postingtuple;
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+											 compressState->ipd,
+											 compressState->ntuples);
+		to_insert = postingtuple;
+		pfree(compressState->ipd);
+	}
+
+	_bt_buildadd(wstate, state, to_insert);
+
+	if (compressState->ntuples > 0)
+		pfree(to_insert);
+	compressState->ntuples = 0;
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in compressState.
+ * Helper function for bt_load() and _bt_compress_one_page().
+ *
+ * Note: caller is responsible for size check to ensure that
+ * resulting tuple won't exceed BTMaxItemSize.
+ */
+void
+add_item_to_posting(BTCompressState *compressState, IndexTuple itup)
+{
+	int nposting = 0;
+
+	if (compressState->ntuples == 0)
+	{
+		compressState->ipd = palloc0(compressState->maxitemsize);
+
+		if (BTreeTupleIsPosting(compressState->itupprev))
+		{
+			/* if itupprev is posting, add all its TIDs to the posting list */
+			nposting = BTreeTupleGetNPosting(compressState->itupprev);
+			memcpy(compressState->ipd, BTreeTupleGetPosting(compressState->itupprev),
+				   sizeof(ItemPointerData)*nposting);
+			compressState->ntuples += nposting;
+		}
+		else
+		{
+			memcpy(compressState->ipd, compressState->itupprev,
+				   sizeof(ItemPointerData));
+			compressState->ntuples++;
+		}
+	}
+
+	if (BTreeTupleIsPosting(itup))
+	{
+		/* if tuple is posting, add all its TIDs to the posting list */
+		nposting = BTreeTupleGetNPosting(itup);
+		memcpy(compressState->ipd + compressState->ntuples,
+			   BTreeTupleGetPosting(itup), sizeof(ItemPointerData)*nposting);
+		compressState->ntuples += nposting;
+	}
+	else
+	{
+		memcpy(compressState->ipd + compressState->ntuples, itup,
+			   sizeof(ItemPointerData));
+		compressState->ntuples++;
+	}
+}
+
+/*
  * Read tuples in correct sort order from tuplesort, and load them into
  * btree leaves.
  */
@@ -1150,9 +1268,21 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	bool		load1;
 	TupleDesc	tupdes = RelationGetDescr(wstate->index);
 	int			i,
-				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+				natts = IndexRelationGetNumberOfAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		use_compression = false;
+	BTCompressState *compressState = NULL;
+
+	/*
+	 * Don't use compression for indexes with INCLUDEd columns,
+	 * system indexes and unique indexes.
+	 */
+	use_compression = ((IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+									  IndexRelationGetNumberOfAttributes(wstate->index))
+									  && (!IsSystemRelation(wstate->index))
+									  && (!wstate->index->rd_index->indisunique));
 
 	if (merge)
 	{
@@ -1266,19 +1396,83 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	}
 	else
 	{
-		/* merge is unnecessary */
-		while ((itup = tuplesort_getindextuple(btspool->sortstate,
-											   true)) != NULL)
+		if (!use_compression)
 		{
-			/* When we see first tuple, create first index page */
-			if (state == NULL)
-				state = _bt_pagestate(wstate, 0);
+			/* merge is unnecessary */
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+					state = _bt_pagestate(wstate, 0);
 
-			_bt_buildadd(wstate, state, itup);
+				_bt_buildadd(wstate, state, itup);
 
-			/* Report progress */
-			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
-										 ++tuples_done);
+				/* Report progress */
+				pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+											++tuples_done);
+			}
+		}
+		else
+		{
+			/* init compress state needed to build posting tuples */
+			compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+			compressState->ipd = NULL;
+			compressState->ntuples = 0;
+			compressState->itupprev = NULL;
+			compressState->maxitemsize = 0;
+			compressState->maxpostingsize = 0;
+	
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+				{
+					state = _bt_pagestate(wstate, 0);
+					compressState->maxitemsize = BTMaxItemSize(state->btps_page);
+				}
+
+				if (compressState->itupprev != NULL)
+				{
+					int n_equal_atts = _bt_keep_natts_fast(wstate->index,
+														   compressState->itupprev, itup);
+
+					if (n_equal_atts > natts)
+					{
+						/* Tuples are equal. Create or update posting. */
+						if ((compressState->ntuples+1)*sizeof(ItemPointerData) < compressState->maxpostingsize)
+							add_item_to_posting(compressState, itup);
+						else
+							/* If posting is too big, insert it on page and continue.*/
+							insert_itupprev_to_page_buildadd(wstate, state, compressState);
+					}
+					else
+					{
+						/*
+						 * Tuples are not equal. Insert itupprev into index.
+						 * Save current tuple for the next iteration.
+						 */
+						insert_itupprev_to_page_buildadd(wstate, state, compressState);
+					}
+				}
+
+				/*
+				 * Save the tuple to compare it with the next one
+				 * and maybe unite them into a posting tuple.
+				 */
+				if (compressState->itupprev)
+					pfree(compressState->itupprev);
+				compressState->itupprev = CopyIndexTuple(itup);
+
+				/* compute max size of posting list */
+				 compressState->maxpostingsize = compressState->maxitemsize -
+								IndexInfoFindDataOffset(compressState->itupprev->t_info) -
+								MAXALIGN(IndexTupleSize(compressState->itupprev));
+			}
+
+			/* Handle the last item */
+			insert_itupprev_to_page_buildadd(wstate, state, compressState);
 		}
 	}
 
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 93fab26..8b77b69 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1787,7 +1787,9 @@ _bt_killitems(IndexScanDesc scan)
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			/* No microvacuum for posting tuples */
+			if (!BTreeTupleIsPosting(ituple) &&
+				(ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
 			{
 				/* found the item */
 				ItemIdMarkDead(iid);
@@ -2145,6 +2147,16 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(firstright))
+		{
+			BTreeTupleClearBtIsPosting(pivot);
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			pivot->t_info &= ~INDEX_SIZE_MASK;
+			pivot->t_info |= BTreeTupleGetPostingOffset(firstright);
+		}
+
+		Assert(!BTreeTupleIsPosting(pivot));
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2168,6 +2180,26 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		pfree(pivot);
 		pivot = tidpivot;
 	}
+	else if (BTreeTupleIsPosting(firstright))
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal.
+		 * But the tuple is a compressed tuple with a posting list,
+		 * so we still must truncate it.
+		 * 
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		newsize = BTreeTupleGetPostingOffset(firstright) + MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+		pivot->t_info &= ~INDEX_SIZE_MASK;
+		pivot->t_info |= newsize;
+		BTreeTupleClearBtIsPosting(pivot);
+		BTreeTupleSetAltHeapTID(pivot);
+	
+		Assert(!BTreeTupleIsPosting(pivot));
+	}
 	else
 	{
 		/*
@@ -2205,7 +2237,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2216,9 +2248,9 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft), BTreeTupleGetMinTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid, BTreeTupleGetMinTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid, BTreeTupleGetMinTID(firstright)) < 0);
 #else
 
 	/*
@@ -2231,7 +2263,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMinTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2240,7 +2272,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid, BTreeTupleGetMinTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2330,6 +2362,10 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * leaving excessive amounts of free space on either side of page split.
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts.
+ *
+ * To build a posting tuple we need to ensure that all attributes
+ * of both tuples are equal. Use this function to compare them.
+ * TODO: maybe it's worth to rename the function.
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2415,7 +2451,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * Non-pivot tuples currently never use alternative heap TID
 			 * representation -- even those within heapkeyspace indexes
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
@@ -2470,7 +2506,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * that to decide if the tuple is a pre-v11 tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
+				(!BTreeTupleIsPivot(itup) &&
 				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
 		}
 		else
@@ -2497,7 +2533,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
 		return false;
 
 	/*
@@ -2549,6 +2585,7 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 	if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
 		return;
 
+	/* TODO correct error messages for posting tuples */
 	/*
 	 * Internal page insertions cannot fail here, because that would mean that
 	 * an earlier leaf level insertion that should have failed didn't
@@ -2575,3 +2612,59 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Given a basic tuple that contains key datum and posting list,
+ * build a posting tuple.
+ *
+ * Basic tuple can be a posting tuple, but we only use key part of it,
+ * all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1 fallback to building a non-posting tuple.
+ * It is necessary to avoid storage overhead after posting tuple was vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+	uint32	   keysize, newsize;
+	IndexTuple itup;
+
+	/* We only need key part of the tuple */
+	if (BTreeTupleIsPosting(tuple))
+		keysize = BTreeTupleGetPostingOffset(tuple);
+	else
+		keysize = IndexTupleSize(tuple);
+
+	Assert (nipd > 0);
+
+	/* Add space needed for posting list */
+	if (nipd > 1)
+		newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+	
+	newsize = MAXALIGN(newsize);
+	itup = palloc0(newsize);
+	memcpy(itup, tuple, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+
+
+	if (nipd > 1)
+	{
+		/* Form posting tuple, fill posting fields */
+
+		/* Set meta info about the posting list */
+		itup->t_info |= INDEX_ALT_TID_MASK;
+		BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+		/* Copy posting list into the posting tuple */
+		memcpy(BTreeTupleGetPosting(itup), ipd,
+			   sizeof(ItemPointerData) * nipd);
+	}
+	else
+	{
+		/* To finish building of a non-posting tuple, copy TID from ipd */
+		ItemPointerCopy(ipd, &itup->t_tid);
+	}
+
+	return itup;
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 6532a25..16224b4 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -384,8 +384,8 @@ btree_xlog_vacuum(XLogReaderState *record)
 	Buffer		buffer;
 	Page		page;
 	BTPageOpaque opaque;
-#ifdef UNUSED
 	xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
 
 	/*
 	 * This section of code is thought to be no longer needed, after analysis
@@ -476,14 +476,36 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		if (len > 0)
 		{
-			OffsetNumber *unused;
-			OffsetNumber *unend;
+			if (xlrec->nremaining)
+			{
+				int				i;
+				OffsetNumber   *remainingoffset;
+				IndexTuple		remaining;
+				Size			itemsz;
+
+				remainingoffset = (OffsetNumber *)
+					(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+				remaining = (IndexTuple) ((char *) remainingoffset +
+						xlrec->nremaining * sizeof(OffsetNumber));
 
-			unused = (OffsetNumber *) ptr;
-			unend = (OffsetNumber *) ((char *) ptr + len);
+				/* Handle posting tuples */
+				for (i = 0; i < xlrec->nremaining; i++)
+				{
+					PageIndexTupleDelete(page, remainingoffset[i]);
+
+					itemsz = MAXALIGN(IndexTupleSize(remaining));
+
+					if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+							false, false) == InvalidOffsetNumber)
+							elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+					remaining = (IndexTuple)((char*) remaining + itemsz);
+				}
+			}
 
-			if ((unend - unused) > 0)
-				PageIndexMultiDelete(page, unused, unend - unused);
+			
+			if (xlrec->ndeleted)
+				PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 		}
 
 		/*
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 744ffb6..85ee040 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -141,6 +141,11 @@ typedef IndexAttributeBitMapData * IndexAttributeBitMap;
  * On such a page, N tuples could take one MAXALIGN quantum less space than
  * estimated here, seemingly allowing one more tuple than estimated here.
  * But such a page always has at least MAXALIGN special space, so we're safe.
+ *
+ * Note: btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so they may contain more tuples.
+ * Use MaxPostingIndexTuplesPerPage instead.
+
  */
 #define MaxIndexTuplesPerPage	\
 	((int) ((BLCKSZ - SizeOfPageHeaderData) / \
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index a3583f2..57ee21e 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
  * All other types of index tuples ("pivot" tuples) only have key columns,
  * since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,39 @@ typedef struct BTMetaPageData
  * omitted rather than truncated, since its representation is different to
  * the non-pivot representation.)
  *
+ * Non-pivot posting tuple format:
+ *  t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicated keys more effectively,
+ * BTREE_VERSION 5 introduced new format of tuples - posting tuples.
+ * posting_list is an array of ItemPointerData.
+ *
+ * This type of compression never applies to system indexes, unique indexes
+ * or indexes with INCLUDEd columns.
+ *
+ * To differ posting tuples we use INDEX_ALT_TID_MASK flag in t_info and
+ * BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's tid:
+ * - t_tid.ip_blkid contains offset of the posting list.
+ * - t_tid offset field contains number of posting items this tuple contain
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * tuples, which is constrainted by BTMaxItemSize.
+
+ * If page contains so many duplicates, that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), page may contain several posting
+ * tuples with the same key.
+ * Also page can contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
  * Pivot tuple format:
  *
  *  t_tid | t_info | key values | [heap TID]
@@ -281,23 +313,149 @@ typedef struct BTMetaPageData
  * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
  * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
  * number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinct posting tuples from pivot tuples.
  *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple
+ * or non-pivot posting tuple,
  * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
  * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_N_POSTING_OFFSET_MASK	0x0FFF
 #define BT_HEAP_TID_ATTR			0x1000
+#define BT_IS_POSTING				0x2000
+
+#define BTreeTupleIsPosting(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+	)
 
-/* Get/set downlink block number */
+#define BTreeTupleIsPivot(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+	)
+
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ * 
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger then
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+	((int) ((BLCKSZ - SizeOfPageHeaderData - \
+			3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+			(sizeof(ItemPointerData)))
+
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * Iterating over tuples during index build or applying compression to a
+ * single page, we remember a tuple in itupprev, then compare the next one
+ * with it. If tuples are equal, save their TIDs in the posting list.
+ * ntuples contains the size of the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTCompressState
+{
+	Size			maxitemsize;
+	Size			maxpostingsize;
+	IndexTuple 		itupprev;
+	int 			ntuples;
+	ItemPointerData	*ipd;
+} BTCompressState;
+
+/* macros to work with posting tuples *BEGIN* */
+#define BTreeTupleSetBtIsPosting(itup) \
+	do { \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleGetNPosting(itup)	\
+	( \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+	)
+
+#define BTreeTupleSetNPosting(itup, n) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+		BTreeTupleSetBtIsPosting(itup); \
+	} while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list.
+ * Caller is responsible for checking BTreeTupleIsPosting to ensure that
+ * he will get what he expects
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
+#define BTreeTupleSetPostingOffset(itup, offset) \
+	ItemPointerSetBlockNumber(&((itup)->t_tid), (offset))
+
+#define BTreeSetPostingMeta(itup, nposting, off) \
+	do { \
+		BTreeTupleSetNPosting(itup, nposting); \
+		BTreeTupleSetPostingOffset(itup, off); \
+	} while(0)
+
+#define BTreeTupleGetPosting(itup) \
+	(ItemPointerData*) ((char*)(itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+	(ItemPointerData*) (BTreeTupleGetPosting(itup) + (n))
+
+/*
+ * Posting tuples always contain several TIDs.
+ * Some functions that use TID as a tiebreaker,
+ * to ensure correct order of TID keys they can use two macros below:
+ */
+#define BTreeTupleGetMinTID(itup) \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+		( \
+			(ItemPointer) BTreeTupleGetPosting(itup) \
+		) \
+		: \
+		(ItemPointer) &((itup)->t_tid) \
+	)
+#define BTreeTupleGetMaxTID(itup) \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+		( \
+			(ItemPointer) (BTreeTupleGetPosting(itup) + (BTreeTupleGetNPosting(itup)-1)) \
+		) \
+		: \
+		(ItemPointer) &((itup)->t_tid) \
+	)
+/* macros to work with posting tuples *END* */
+
+/* Get/set downlink block number  */
 #define BTreeInnerTupleGetDownLink(itup) \
 	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
 #define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,15 +484,18 @@ typedef struct BTMetaPageData
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
 		( \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
 		IndexRelationGetNumberOfAttributes(rel) \
 	)
+
 #define BTreeTupleSetNAtts(itup, n) \
 	do { \
+		Assert(!BTreeTupleIsPosting(itup)); \
 		(itup)->t_info |= INDEX_ALT_TID_MASK; \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
 	} while(0)
@@ -342,6 +503,8 @@ typedef struct BTMetaPageData
 /*
  * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
  * and non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * For non-pivot posting tuple it returns the first tid from posting list.
  */
 #define BTreeTupleGetHeapTID(itup) \
 	( \
@@ -351,7 +514,10 @@ typedef struct BTMetaPageData
 		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
 					   sizeof(ItemPointerData)) \
 	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+	  : (itup)->t_info & INDEX_ALT_TID_MASK ? \
+		(((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0) ? \
+		(ItemPointer) BTreeTupleGetPosting(itup) : NULL) \
+		: (ItemPointer) &((itup)->t_tid) \
 	)
 /*
  * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
@@ -360,6 +526,7 @@ typedef struct BTMetaPageData
 #define BTreeTupleSetAltHeapTID(itup) \
 	do { \
 		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
 								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
 	} while(0)
@@ -567,6 +734,8 @@ typedef struct BTScanPosData
 	 * location in the associated tuple storage workspace.
 	 */
 	int			nextTupleOffset;
+	/* prevTupleOffset is for posting list handling*/
+	int			prevTupleOffset;
 
 	/*
 	 * The items array is always ordered in index order (ie, increasing
@@ -579,7 +748,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxPostingIndexTuplesPerPage]; /* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -763,6 +932,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems, Relation heapRel);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems,
+								OffsetNumber *remainingoffset,
+								IndexTuple *remaining, int nremaining,
 								BlockNumber lastBlockVacuumed);
 extern int	_bt_pagedel(Relation rel, Buffer buf);
 
@@ -813,7 +984,8 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
-
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple,
+									   ItemPointerData *ipd, int nipd);
 /*
  * prototypes for functions in nbtvalidate.c
  */
@@ -825,5 +997,6 @@ extern bool btvalidate(Oid opclassoid);
 extern IndexBuildResult *btbuild(Relation heap, Relation index,
 								 struct IndexInfo *indexInfo);
 extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
-
+extern void add_item_to_posting(BTCompressState *compressState,
+								IndexTuple itup);
 #endif							/* NBTREE_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 9beccc8..c213bfa 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -172,11 +172,19 @@ typedef struct xl_btree_reuse_page
 typedef struct xl_btree_vacuum
 {
 	BlockNumber lastBlockVacuumed;
+	/*
+	 * This field helps us to find beginning of the remaining tuples
+	 * from postings which follow array of offset numbers.
+	 */
+	uint32		nremaining;
+	uint32		ndeleted;
 
-	/* TARGET OFFSET NUMBERS FOLLOW */
+	/* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+	/* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
+	/* TARGET OFFSET NUMBERS FOLLOW (if any) */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(BlockNumber))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.

btree_compression_test_result.pngimage/png; name=btree_compression_test_result.pngDownload

#42

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Anastasia Lubennikova (#41)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Thu, Jul 4, 2019 at 5:06 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

i - number of distinct values in the index.
So i=1 means that all rows have the same key,
and i=10000000 means that all keys are different.

i / old size (MB) / new size (MB)
1 215 88
1000 215 90
100000 215 71
10000000 214 214

For more, see the attached diagram with test results.

I tried this on my own "UK land registry" test data [1]https://https:/postgr.es/m/CAH2-Wzn_NAyK4pR0HRWO0StwHmxjP5qyu+X8vppt030XpqrO6w@mail.gmail.com, which was
originally used for the v12 nbtree work. My test case has a low
cardinality, multi-column text index. I chose this test case because
it was convenient for me.

On v12/master, the index is 1100MB. Whereas with your patch, it ends
up being 196MB -- over 5.5x smaller!

I also tried it out with the "Mouse genome informatics" database [2]http://www.informatics.jax.org/software.shtml -- Peter Geoghegan,
which was already improved considerably by the v12 work on duplicates.
This is helped tremendously by your patch. It's not quite 5.5x across
the board, of course. There are 187 indexes (on 28 tables), and almost
all of the indexes are smaller. Actually, *most* of the indexes are
*much* smaller. Very often 50% smaller.

I don't have time to do an in-depth analysis of these results today,
but clearly the patch is very effective on real world data. I think
that we tend to underestimate just how common indexes with a huge
number of duplicates are.

[1]: https://https:/postgr.es/m/CAH2-Wzn_NAyK4pR0HRWO0StwHmxjP5qyu+X8vppt030XpqrO6w@mail.gmail.com
[2]: http://www.informatics.jax.org/software.shtml -- Peter Geoghegan
--
Peter Geoghegan

#43

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Peter Geoghegan (#42)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Thu, Jul 4, 2019 at 10:38 AM Peter Geoghegan <pg@bowt.ie> wrote:

I tried this on my own "UK land registry" test data [1], which was
originally used for the v12 nbtree work. My test case has a low
cardinality, multi-column text index. I chose this test case because
it was convenient for me.

On v12/master, the index is 1100MB. Whereas with your patch, it ends
up being 196MB -- over 5.5x smaller!

I also see a huge and consistent space saving for TPC-H. All 9 indexes
are significantly smaller. The lineitem orderkey index is "just" 1/3
smaller, which is the smallest improvement among TPC-H indexes in my
index bloat test case. The two largest indexes after the initial bulk
load are *much* smaller: the lineitem parts supplier index is ~2.7x
smaller, while the lineitem ship date index is a massive ~4.2x
smaller. Also, the orders customer key index is ~2.8x smaller, and the
order date index is ~2.43x smaller. Note that the test involved retail
insertions, not CREATE INDEX.

I haven't seen any regression in the size of any index so far,
including when the number of internal pages is all that we measure.
Actually, there seems to be cases where there is a noticeably larger
reduction in internal pages than in leaf pages, probably because of
interactions with suffix truncation.

This result is very impressive. We'll need to revisit what the right
trade-off is for the compression scheme, which Heikki had some
thoughts on when we left off 3 years ago, but that should be a lot
easier now. I am very encouraged by the fact that this relatively
simple approach already works quite nicely. It's also great to see
that bulk insertions with lots of compression are very clearly faster
with this latest revision of your patch, unlike earlier versions from
2016 that made those cases slower (though I haven't tested indexes
that don't really use compression). I think that this is because you
now do the compression lazily, at the point where it looks like we may
need to split the page. Previous versions of the patch had to perform
compression eagerly, just like GIN, which is not really appropriate
for nbtree.

--
Peter Geoghegan

#44

Bruce Momjian

bruce@momjian.us

over 6 years ago

In reply to: Peter Geoghegan (#43)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Thu, Jul 4, 2019 at 05:06:09PM -0700, Peter Geoghegan wrote:

This result is very impressive. We'll need to revisit what the right
trade-off is for the compression scheme, which Heikki had some
thoughts on when we left off 3 years ago, but that should be a lot
easier now. I am very encouraged by the fact that this relatively
simple approach already works quite nicely. It's also great to see
that bulk insertions with lots of compression are very clearly faster
with this latest revision of your patch, unlike earlier versions from
2016 that made those cases slower (though I haven't tested indexes
that don't really use compression). I think that this is because you
now do the compression lazily, at the point where it looks like we may
need to split the page. Previous versions of the patch had to perform
compression eagerly, just like GIN, which is not really appropriate
for nbtree.

I am also encouraged and am happy we can finally move this duplicate
optimization forward.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +

#45

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Anastasia Lubennikova (#41)

1 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Thu, Jul 4, 2019 at 5:06 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

The new version of the patch is attached.
This version is even simpler than the previous one,
thanks to the recent btree design changes and all the feedback I received.
I consider it ready for review and testing.

I took a closer look at this patch, and have some general thoughts on
its design, and specific feedback on the implementation.

Preserving the *logical contents* of B-Tree indexes that use
compression is very important -- that should not change in a way that
outside code can notice. The heap TID itself should count as logical
contents here, since we want to be able to implement retail index
tuple deletion in the future. Even without retail index tuple
deletion, amcheck's "rootdescend" verification assumes that it can
find one specific tuple (which could now just be one specific "logical
tuple") using specific key values from the heap, including the heap
tuple's heap TID. This requirement makes things a bit harder for your
patch, because you have to deal with one or two edge-cases that you
currently don't handle: insertion of new duplicates that fall inside
the min/max range of some existing posting list. That should be rare
enough in practice, so the performance penalty won't be too bad. This
probably means that code within _bt_findinsertloc() and/or
_bt_binsrch_insert() will need to think about a logical tuple as a
distinct thing from a physical tuple, though that won't be necessary
in most places.

The need to "preserve the logical contents" also means that the patch
will need to recognize when indexes are not safe as a target for
compression/deduplication (maybe we should call this feature
deduplilcation, so it's clear how it differs from TOAST?). For
example, if we have a case-insensitive ICU collation, then it is not
okay to treat an opclass-equal pair of text strings that use the
collation as having the same value when considering merging the two
into one. You don't actually do that in the patch, but you also don't
try to deal with the fact that such a pair of strings are equal, and
so must have their final positions determined by the heap TID column
(deduplication within _bt_compress_one_page() must respect that).
Possibly equal-but-distinct values seems like a problem that's not
worth truly fixing, but it will be necessary to store metadata about
whether or not we're willing to do deduplication in the meta page,
based on operator class and collation details. That seems like a
restriction that we're just going to have to accept, though I'm not
too worried about exactly what that will look like right now. We can
work it out later.

I think that the need to be careful about the logical contents of
indexes already causes bugs, even with "safe for compression" indexes.
For example, I can sometimes see an assertion failure
within_bt_truncate(), at the point where we check if heap TID values
are safe:

/*
* Lehman and Yao require that the downlink to the right page, which is to
* be inserted into the parent page in the second phase of a page split be
* a strict lower bound on items on the right page, and a non-strict upper
* bound for items on the left page. Assert that heap TIDs follow these
* invariants, since a heap TID value is apparently needed as a
* tiebreaker.
*/
#ifndef DEBUG_NO_TRUNCATE
Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
BTreeTupleGetMinTID(firstright)) < 0);
...

This bug is not that easy to see, but it will happen with a big index,
even without updates or deletes. I think that this happens because
compression can allow the "logical tuples" to be in the wrong heap TID
order when there are multiple posting lists for the same value. As I
said, I think that it's necessary to see a posting list as being
comprised of multiple logical tuples in the context of inserting new
tuples, even when you're not performing compression or splitting the
page. I also see that amcheck's bt_index_parent_check() function
fails, though bt_index_check() does not fail when I don't use any of
its extra verification options. (You haven't updated amcheck, but I
don't think that you need to update it for these basic checks to
continue to work.)

Other feedback on specific things:

* A good way to assess whether or not the "logical tuple versus
physical tuple" thing works is to make sure that amcheck's
"rootdescend" verification works with a variety of indexes. As I said,
it has the same requirements for nbtree as retail index tuple deletion
will.

* _bt_findinsertloc() should not call _bt_compress_one_page() for
!heapkeyspace (version 3) indexes -- the second call to
_bt_compress_one_page() should be removed.

* Why can't compression be used on system catalog indexes? I
understand that they are not a compelling case, but we tend to do
things the same way with catalog tables and indexes unless there is a
very good reason not to (e.g. HOT, suffix truncation). I see that the
tests fail when that restriction is removed, but I don't think that
that has anything to do with system catalogs. I think that that's due
to a bug somewhere else. Why have this restriction at all?

* It looks like we could be less conservative in nbtsplitloc.c to good
effect. We know for sure that a posting list will be truncated down to
one heap TID even in the worst case, so we can safely assume that the
new high key will be a lot smaller than the firstright tuple that it
is based on when it has a posting list. We only have to keep one TID.
This will allow us to leave more tuples on the left half of the page
in certain cases, further improving space utilization.

* Don't you need to update nbtdesc.c?

* Maybe we could do compression with unique indexes when inserting
values with NULLs? Note that we now treat an insertion of a tuple with
NULLs into a unique index as if it wasn't even a unique index -- see
the "checkingunique" optimization at the beginning of _bt_doinsert().
Having many NULL values in a unique index is probably fairly common.

* It looks like amcheck's heapallindexed verification needs to have
normalization added, to avoid false positives. This situation is
specifically anticipated by existing comments above
bt_normalize_tuple(). Again, being careful about "logical versus
physical tuple" seems necessary.

* Doesn't the nbtsearch.c/_bt_readpage() code that deals with
backwards scans need to return posting lists backwards, not forwards?
It seems like a good idea to try to "preserve the logical contents"
here too, just to be conservative.

Within nbtsort.c:

* Is the new code in _bt_buildadd() actually needed? If so, why?

* insert_itupprev_to_page_buildadd() is only called within nbtsort.c,
and so should be static. The name also seems very long.

* add_item_to_posting() is called within both nbtsort.c and
nbtinsert.c, and so should remain non-static, but have less generic
(and shorter) name. (Use the usual _bt_* style instead.)

* Is nbtsort.c the right place for these functions, anyway? (Maybe,
but maybe not, IMV.)

I ran pgindent on the patch, and made some small manual whitespace
adjustments, which is attached. There are no real changes, but some of
the formatting in the original version you posted was hard to read.
Please work off this for your next revision.

--
Peter Geoghegan

Attachments:

0001-btree_compression_pg12_v1.patch-with-pg_indent-run.patchapplication/octet-stream; name=0001-btree_compression_pg12_v1.patch-with-pg_indent-run.patchDownload

From b66157e0ec6aedca19bb4d91a67bff275780c11b Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 4 Jul 2019 09:48:51 -0700
Subject: [PATCH 1/4] btree_compression_pg12_v1.patch with pg_indent run

---
 src/backend/access/nbtree/nbtinsert.c | 252 ++++++++++++++++++++++++++
 src/backend/access/nbtree/nbtpage.c   |  54 ++++++
 src/backend/access/nbtree/nbtree.c    | 143 ++++++++++++---
 src/backend/access/nbtree/nbtsearch.c |  78 +++++++-
 src/backend/access/nbtree/nbtsort.c   | 228 +++++++++++++++++++++--
 src/backend/access/nbtree/nbtutils.c  | 119 +++++++++++-
 src/backend/access/nbtree/nbtxlog.c   |  35 +++-
 src/include/access/itup.h             |   5 +
 src/include/access/nbtree.h           | 197 ++++++++++++++++++--
 src/include/access/nbtxlog.h          |  13 +-
 10 files changed, 1046 insertions(+), 78 deletions(-)

diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 602f8849d4..600dafe73a 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -20,6 +20,7 @@
 #include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xloginsert.h"
+#include "catalog/catalog.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
@@ -56,6 +57,8 @@ static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 						 OffsetNumber itup_off);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static bool insert_itupprev_to_page(Page page, BTCompressState *compressState);
+static void _bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel);
 
 /*
  *	_bt_doinsert() -- Handle insertion of a single index tuple in the tree.
@@ -759,6 +762,12 @@ _bt_findinsertloc(Relation rel,
 			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
 			insertstate->bounds_valid = false;
 		}
+
+		/*
+		 * If the target page is full, try to compress the page
+		 */
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
+			_bt_compress_one_page(rel, insertstate->buf, heapRel);
 	}
 	else
 	{
@@ -805,6 +814,11 @@ _bt_findinsertloc(Relation rel,
 					break;		/* OK, now we have enough space */
 			}
 
+			/*
+			 * Before considering moving right, try to compress the page
+			 */
+			_bt_compress_one_page(rel, insertstate->buf, heapRel);
+
 			/*
 			 * Nope, so check conditions (b) and (c) enumerated above
 			 *
@@ -2286,3 +2300,241 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * the page.
 	 */
 }
+
+/*
+ * Add new item (compressed or not) to the page, while compressing it.
+ * If insertion failed, return false.
+ * Caller should consider this as compression failure and
+ * leave page uncompressed.
+ */
+static bool
+insert_itupprev_to_page(Page page, BTCompressState *compressState)
+{
+	IndexTuple	to_insert;
+	OffsetNumber offnum = PageGetMaxOffsetNumber(page);
+
+	if (compressState->ntuples == 0)
+		to_insert = compressState->itupprev;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+											 compressState->ipd,
+											 compressState->ntuples);
+		to_insert = postingtuple;
+		pfree(compressState->ipd);
+	}
+
+	/* Add the new item into the page */
+	offnum = OffsetNumberNext(offnum);
+
+	elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d IndexTupleSize %zu free %zu",
+		 compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+
+	if (PageAddItem(page, (Item) to_insert, IndexTupleSize(to_insert),
+					offnum, false, false) == InvalidOffsetNumber)
+	{
+		elog(DEBUG4, "insert_itupprev_to_page. failed");
+
+		/*
+		 * this may happen if tuple is bigger than freespace fallback to
+		 * uncompressed page case
+		 */
+		if (compressState->ntuples > 0)
+			pfree(to_insert);
+		return false;
+	}
+
+	if (compressState->ntuples > 0)
+		pfree(to_insert);
+	compressState->ntuples = 0;
+	return true;
+}
+
+/*
+ * Before splitting the page, try to compress items to free some space.
+ * If compression didn't succeed, buffer will contain old state of the page.
+ * This function should be called after lp_dead items
+ * were removed by _bt_vacuum_one_page().
+ */
+static void
+_bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buffer);
+	Page		newpage;
+	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	bool		use_compression = false;
+	BTCompressState *compressState = NULL;
+	int			n_posting_on_page = 0;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+
+	/*
+	 * Don't use compression for indexes with INCLUDEd columns, system indexes
+	 * and unique indexes.
+	 */
+	use_compression = ((IndexRelationGetNumberOfKeyAttributes(rel) ==
+						IndexRelationGetNumberOfAttributes(rel))
+					   && (!IsSystemRelation(rel))
+					   && (!rel->rd_index->indisunique));
+	if (!use_compression)
+		return;
+
+	/* init compress state needed to build posting tuples */
+	compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+	compressState->ipd = NULL;
+	compressState->ntuples = 0;
+	compressState->itupprev = NULL;
+	compressState->maxitemsize = BTMaxItemSize(page);
+	compressState->maxpostingsize = 0;
+
+	/*
+	 * Scan over all items to see which ones can be compressed
+	 */
+	minoff = P_FIRSTDATAKEY(opaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/*
+	 * Heuristic to avoid trying to compress page that has already contain
+	 * mostly compressed items
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, P_HIKEY);
+		IndexTuple	item = (IndexTuple) PageGetItem(page, itemid);
+
+		if (BTreeTupleIsPosting(item))
+			n_posting_on_page++;
+	}
+
+	/*
+	 * If we have only 10 uncompressed items on the full page, it probably
+	 * won't worth to compress them.
+	 */
+	if (maxoff - n_posting_on_page < 10)
+		return;
+
+	newpage = PageGetTempPageCopySpecial(page);
+	elog(DEBUG4, "_bt_compress_one_page rel: %s,blkno: %u",
+		 RelationGetRelationName(rel), BufferGetBlockNumber(buffer));
+
+	/* Copy High Key if any */
+	if (!P_RIGHTMOST(opaque))
+	{
+		ItemId		itemid = PageGetItemId(page, P_HIKEY);
+		Size		itemsz = ItemIdGetLength(itemid);
+		IndexTuple	item = (IndexTuple) PageGetItem(page, itemid);
+
+		if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+						false, false) == InvalidOffsetNumber)
+		{
+			/*
+			 * Should never happen. Anyway, fallback gently to scenario of
+			 * incompressible page and just return from function.
+			 */
+			elog(DEBUG4, "_bt_compress_one_page. failed to insert highkey to newpage");
+			return;
+		}
+	}
+
+	/*
+	 * Iterate over tuples on the page, try to compress them into posting
+	 * lists and insert into new page.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemId = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemId);
+
+		/*
+		 * We do not expect to meet any DEAD items, since this function is
+		 * called right after _bt_vacuum_one_page(). If for some reason we
+		 * found dead item, don't compress it, to allow upcoming microvacuum
+		 * or vacuum clean it up.
+		 */
+		if (ItemIdIsDead(itemId))
+			continue;
+
+		if (compressState->itupprev != NULL)
+		{
+			int			n_equal_atts =
+			_bt_keep_natts_fast(rel, compressState->itupprev, itup);
+			int			itup_ntuples = BTreeTupleIsPosting(itup) ?
+			BTreeTupleGetNPosting(itup) : 1;
+
+			if (n_equal_atts > natts)
+			{
+				/*
+				 * When tuples are equal, create or update posting.
+				 *
+				 * If posting is too big, insert it on page and continue.
+				 */
+				if (compressState->maxitemsize >
+					MAXALIGN(((IndexTupleSize(compressState->itupprev)
+							   + (compressState->ntuples + itup_ntuples + 1) * sizeof(ItemPointerData)))))
+				{
+					add_item_to_posting(compressState, itup);
+				}
+				else if (!insert_itupprev_to_page(newpage, compressState))
+				{
+					elog(DEBUG4, "_bt_compress_one_page. failed to insert posting");
+					return;
+				}
+			}
+			else
+			{
+				/*
+				 * Tuples are not equal. Insert itupprev into index. Save
+				 * current tuple for the next iteration.
+				 */
+				if (!insert_itupprev_to_page(newpage, compressState))
+				{
+					elog(DEBUG4, "_bt_compress_one_page. failed to insert posting");
+					return;
+				}
+			}
+		}
+
+		/*
+		 * Copy the tuple into temp variable itupprev to compare it with the
+		 * following tuple and maybe unite them into a posting tuple
+		 */
+		if (compressState->itupprev)
+			pfree(compressState->itupprev);
+		compressState->itupprev = CopyIndexTuple(itup);
+
+		Assert(IndexTupleSize(compressState->itupprev) <= compressState->maxitemsize);
+	}
+
+	/* Handle the last item. */
+	if (!insert_itupprev_to_page(newpage, compressState))
+	{
+		elog(DEBUG4, "_bt_compress_one_page. failed to insert posting for last item");
+		return;
+	}
+
+	START_CRIT_SECTION();
+	PageRestoreTempPage(newpage, page);
+	MarkBufferDirty(buffer);
+
+	/* Log full page write */
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+
+		recptr = log_newpage_buffer(buffer, true);
+		PageSetLSN(page, recptr);
+	}
+	END_CRIT_SECTION();
+
+	elog(DEBUG4, "_bt_compress_one_page. success");
+	return;
+}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 50455db9af..dff506d595 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1022,14 +1022,53 @@ _bt_page_recyclable(Page page)
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
 					OffsetNumber *itemnos, int nitems,
+					OffsetNumber *remainingoffset,
+					IndexTuple *remaining, int nremaining,
 					BlockNumber lastBlockVacuumed)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	int			i;
+	Size		itemsz;
+	Size		remaining_sz = 0;
+	char	   *remaining_buf = NULL;
+
+	/* XLOG stuff, buffer for remainings */
+	if (nremaining && RelationNeedsWAL(rel))
+	{
+		Size		offset = 0;
+
+		for (i = 0; i < nremaining; i++)
+			remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+		remaining_buf = palloc0(remaining_sz);
+		for (i = 0; i < nremaining; i++)
+		{
+			itemsz = IndexTupleSize(remaining[i]);
+			memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+			offset += MAXALIGN(itemsz);
+		}
+		Assert(offset == remaining_sz);
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
+	/* Handle posting tuples here */
+	for (i = 0; i < nremaining; i++)
+	{
+		/* At first, delete the old tuple. */
+		PageIndexTupleDelete(page, remainingoffset[i]);
+
+		itemsz = IndexTupleSize(remaining[i]);
+		itemsz = MAXALIGN(itemsz);
+
+		/* Add tuple with remaining ItemPointers to the page. */
+		if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to rewrite compressed item in index while doing vacuum");
+	}
+
 	/* Fix the page */
 	if (nitems > 0)
 		PageIndexMultiDelete(page, itemnos, nitems);
@@ -1059,6 +1098,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+		xlrec_vacuum.nremaining = nremaining;
+		xlrec_vacuum.ndeleted = nitems;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1072,6 +1113,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		if (nitems > 0)
 			XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
 
+		/*
+		 * Here we should save offnums and remaining tuples themselves. It's
+		 * important to restore them in correct order. At first, we must
+		 * handle remaining tuples and only after that other deleted items.
+		 */
+		if (nremaining > 0)
+		{
+			Assert(remaining_buf != NULL);
+			XLogRegisterBufData(0, (char *) remainingoffset,
+								nremaining * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, remaining_buf, remaining_sz);
+		}
+
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
 		PageSetLSN(page, recptr);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd5289ad..11e45c891d 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static ItemPointer btreevacuumPosting(BTVacState *vstate, IndexTuple itup,
+									  int *nremaining);
 
 
 /*
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 								 RBM_NORMAL, info->strategy);
 		LockBufferForCleanup(buf);
 		_bt_checkpage(rel, buf);
-		_bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+		_bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+							vstate.lastBlockVacuumed);
 		_bt_relbuf(rel, buf);
 	}
 
@@ -1193,6 +1196,9 @@ restart:
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
+		IndexTuple	remaining[MaxOffsetNumber];
+		OffsetNumber remainingoffset[MaxOffsetNumber];
+		int			nremaining;
 
 		/*
 		 * Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1235,7 @@ restart:
 		 * callback function.
 		 */
 		ndeletable = 0;
+		nremaining = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
 		if (callback)
@@ -1242,31 +1249,78 @@ restart:
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
-				/*
-				 * During Hot Standby we currently assume that
-				 * XLOG_BTREE_VACUUM records do not produce conflicts. That is
-				 * only true as long as the callback function depends only
-				 * upon whether the index tuple refers to heap tuples removed
-				 * in the initial heap scan. When vacuum starts it derives a
-				 * value of OldestXmin. Backends taking later snapshots could
-				 * have a RecentGlobalXmin with a later xid than the vacuum's
-				 * OldestXmin, so it is possible that row versions deleted
-				 * after OldestXmin could be marked as killed by other
-				 * backends. The callback function *could* look at the index
-				 * tuple state in isolation and decide to delete the index
-				 * tuple, though currently it does not. If it ever did, we
-				 * would need to reconsider whether XLOG_BTREE_VACUUM records
-				 * should cause conflicts. If they did cause conflicts they
-				 * would be fairly harsh conflicts, since we haven't yet
-				 * worked out a way to pass a useful value for
-				 * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
-				 * applies to *any* type of index that marks index tuples as
-				 * killed.
-				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				if (BTreeTupleIsPosting(itup))
+				{
+					int			nnewipd = 0;
+					ItemPointer newipd = NULL;
+
+					newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+					if (nnewipd == 0)
+					{
+						/*
+						 * All TIDs from posting list must be deleted, we can
+						 * delete whole tuple in a regular way.
+						 */
+						deletable[ndeletable++] = offnum;
+					}
+					else if (nnewipd == BTreeTupleGetNPosting(itup))
+					{
+						/*
+						 * All TIDs from posting tuple must remain. Do
+						 * nothing, just cleanup.
+						 */
+						pfree(newipd);
+					}
+					else if (nnewipd < BTreeTupleGetNPosting(itup))
+					{
+						/* Some TIDs from posting tuple must remain. */
+						Assert(nnewipd > 0);
+						Assert(newipd != NULL);
+
+						/*
+						 * Form new tuple that contains only remaining TIDs.
+						 * Remember this tuple and the offset of the old tuple
+						 * to update it in place.
+						 */
+						remainingoffset[nremaining] = offnum;
+						remaining[nremaining] = BTreeFormPostingTuple(itup, newipd, nnewipd);
+						nremaining++;
+						pfree(newipd);
+
+						Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+					}
+				}
+				else
+				{
+					htup = &(itup->t_tid);
+
+					/*
+					 * During Hot Standby we currently assume that
+					 * XLOG_BTREE_VACUUM records do not produce conflicts.
+					 * That is only true as long as the callback function
+					 * depends only upon whether the index tuple refers to
+					 * heap tuples removed in the initial heap scan. When
+					 * vacuum starts it derives a value of OldestXmin.
+					 * Backends taking later snapshots could have a
+					 * RecentGlobalXmin with a later xid than the vacuum's
+					 * OldestXmin, so it is possible that row versions deleted
+					 * after OldestXmin could be marked as killed by other
+					 * backends. The callback function *could* look at the
+					 * index tuple state in isolation and decide to delete the
+					 * index tuple, though currently it does not. If it ever
+					 * did, we would need to reconsider whether
+					 * XLOG_BTREE_VACUUM records should cause conflicts. If
+					 * they did cause conflicts they would be fairly harsh
+					 * conflicts, since we haven't yet worked out a way to
+					 * pass a useful value for latestRemovedXid on the
+					 * XLOG_BTREE_VACUUM records. This applies to *any* type
+					 * of index that marks index tuples as killed.
+					 */
+					if (callback(htup, callback_state))
+						deletable[ndeletable++] = offnum;
+				}
 			}
 		}
 
@@ -1274,7 +1328,7 @@ restart:
 		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
 		 * call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nremaining > 0)
 		{
 			/*
 			 * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1345,7 @@ restart:
 			 * that.
 			 */
 			_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+								remainingoffset, remaining, nremaining,
 								vstate->lastBlockVacuumed);
 
 			/*
@@ -1375,6 +1430,42 @@ restart:
 	}
 }
 
+/*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+	int			i,
+				remaining = 0;
+	int			nitem = BTreeTupleGetNPosting(itup);
+	ItemPointer tmpitems = NULL,
+				items = BTreeTupleGetPosting(itup);
+
+	/*
+	 * Check each tuple in the posting list, save alive tuples into tmpitems
+	 */
+	for (i = 0; i < nitem; i++)
+	{
+		if (vstate->callback(items + i, vstate->callback_state))
+			continue;
+
+		if (tmpitems == NULL)
+			tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+		tmpitems[remaining++] = items[i];
+	}
+
+	*nremaining = remaining;
+	return tmpitems;
+}
+
 /*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c655dadb96..1d36035253 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -30,6 +30,9 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static void _bt_savePostingitem(BTScanOpaque so, int itemIndex,
+								OffsetNumber offnum, ItemPointer iptr,
+								IndexTuple itup, int i);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -1410,6 +1413,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	int			itemIndex;
 	bool		continuescan;
 	int			indnatts;
+	int			i;
 
 	/*
 	 * We must have the buffer pinned and locked, but the usual macro can't be
@@ -1456,6 +1460,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	/* initialize tuple workspace to empty */
 	so->currPos.nextTupleOffset = 0;
+	so->currPos.prevTupleOffset = 0;
 
 	/*
 	 * Now that the current page has been made consistent, the macro should be
@@ -1490,8 +1495,22 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				if (BTreeTupleIsPosting(itup))
+				{
+					for (i = 0; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savePostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											itup, i);
+						itemIndex++;
+					}
+				}
+				else
+				{
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1524,7 +1543,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1532,7 +1551,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxPostingIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1574,8 +1593,22 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (passes_quals && tuple_alive)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				if (BTreeTupleIsPosting(itup))
+				{
+					for (i = 0; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						itemIndex--;
+						_bt_savePostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											itup, i);
+					}
+				}
+				else
+				{
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+
 			}
 			if (!continuescan)
 			{
@@ -1589,8 +1622,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1603,6 +1636,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1615,6 +1650,33 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 	}
 }
 
+/* Save an index item into so->currPos.items[itemIndex] for posting tuples. */
+static void
+_bt_savePostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer iptr, IndexTuple itup, int i)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *iptr;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		if (i == 0)
+		{
+			/* save key. the same for all tuples in the posting */
+			Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+			currItem->tupleOffset = so->currPos.nextTupleOffset;
+			memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+			so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+			so->currPos.prevTupleOffset = currItem->tupleOffset;
+		}
+		else
+			currItem->tupleOffset = so->currPos.prevTupleOffset;
+	}
+}
+
 /*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index d0b9013caf..955a6285ef 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -65,6 +65,7 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "catalog/catalog.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "miscadmin.h"
@@ -288,6 +289,9 @@ static void _bt_sortaddtup(Page page, Size itemsize,
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 						 IndexTuple itup);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void insert_itupprev_to_page_buildadd(BTWriteState *wstate,
+											 BTPageState *state,
+											 BTCompressState *compressState);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
 static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -972,6 +976,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * only shift the line pointer array back and forth, and overwrite
 			 * the tuple space previously occupied by oitup.  This is fairly
 			 * cheap.
+			 *
+			 * If lastleft tuple was a posting tuple, we'll truncate its
+			 * posting list in _bt_truncate as well. Note that it is also
+			 * applicable only to leaf pages, since internal pages never
+			 * contain posting tuples.
 			 */
 			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
 			lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1011,6 +1020,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		 * the minimum key for the new page.
 		 */
 		state->btps_minkey = CopyIndexTuple(oitup);
+		Assert(!BTreeTupleIsPosting(state->btps_minkey));
 
 		/*
 		 * Set the sibling links for both pages.
@@ -1050,8 +1060,36 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	if (last_off == P_HIKEY)
 	{
 		Assert(state->btps_minkey == NULL);
-		state->btps_minkey = CopyIndexTuple(itup);
-		/* _bt_sortaddtup() will perform full truncation later */
+
+		/*
+		 * Stashed copy must be a non-posting tuple, with truncated posting
+		 * list and correct t_tid since we're going to use it to build
+		 * downlink.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			Size		keytupsz;
+			IndexTuple	keytup;
+
+			/*
+			 * Form key tuple, that doesn't contain any ipd. NOTE: since we'll
+			 * need TID later, set t_tid to the first t_tid from posting list.
+			 */
+			keytupsz = BTreeTupleGetPostingOffset(itup);
+			keytup = palloc0(keytupsz);
+			memcpy(keytup, itup, keytupsz);
+
+			keytup->t_info &= ~INDEX_SIZE_MASK;
+			keytup->t_info |= keytupsz;
+			ItemPointerCopy(BTreeTupleGetPosting(itup), &keytup->t_tid);
+			state->btps_minkey = CopyIndexTuple(keytup);
+			pfree(keytup);
+		}
+		else
+			state->btps_minkey = CopyIndexTuple(itup);	/* _bt_sortaddtup() will
+														 * perform full
+														 * truncation later */
+
 		BTreeTupleSetNAtts(state->btps_minkey, 0);
 	}
 
@@ -1136,6 +1174,89 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
 }
 
+/*
+ * Add new tuple (posting or non-posting) to the page, while building index.
+ */
+void
+insert_itupprev_to_page_buildadd(BTWriteState *wstate, BTPageState *state,
+								 BTCompressState *compressState)
+{
+	IndexTuple	to_insert;
+
+	/* Return, if there is no tuple to insert */
+	if (state == NULL)
+		return;
+
+	if (compressState->ntuples == 0)
+		to_insert = compressState->itupprev;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+											 compressState->ipd,
+											 compressState->ntuples);
+		to_insert = postingtuple;
+		pfree(compressState->ipd);
+	}
+
+	_bt_buildadd(wstate, state, to_insert);
+
+	if (compressState->ntuples > 0)
+		pfree(to_insert);
+	compressState->ntuples = 0;
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in compressState.
+ * Helper function for bt_load() and _bt_compress_one_page().
+ *
+ * Note: caller is responsible for size check to ensure that
+ * resulting tuple won't exceed BTMaxItemSize.
+ */
+void
+add_item_to_posting(BTCompressState *compressState, IndexTuple itup)
+{
+	int			nposting = 0;
+
+	if (compressState->ntuples == 0)
+	{
+		compressState->ipd = palloc0(compressState->maxitemsize);
+
+		if (BTreeTupleIsPosting(compressState->itupprev))
+		{
+			/* if itupprev is posting, add all its TIDs to the posting list */
+			nposting = BTreeTupleGetNPosting(compressState->itupprev);
+			memcpy(compressState->ipd, BTreeTupleGetPosting(compressState->itupprev),
+				   sizeof(ItemPointerData) * nposting);
+			compressState->ntuples += nposting;
+		}
+		else
+		{
+			memcpy(compressState->ipd, compressState->itupprev,
+				   sizeof(ItemPointerData));
+			compressState->ntuples++;
+		}
+	}
+
+	if (BTreeTupleIsPosting(itup))
+	{
+		/* if tuple is posting, add all its TIDs to the posting list */
+		nposting = BTreeTupleGetNPosting(itup);
+		memcpy(compressState->ipd + compressState->ntuples,
+			   BTreeTupleGetPosting(itup),
+			   sizeof(ItemPointerData) * nposting);
+		compressState->ntuples += nposting;
+	}
+	else
+	{
+		memcpy(compressState->ipd + compressState->ntuples, itup,
+			   sizeof(ItemPointerData));
+		compressState->ntuples++;
+	}
+}
+
 /*
  * Read tuples in correct sort order from tuplesort, and load them into
  * btree leaves.
@@ -1150,9 +1271,21 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	bool		load1;
 	TupleDesc	tupdes = RelationGetDescr(wstate->index);
 	int			i,
-				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+				natts = IndexRelationGetNumberOfAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		use_compression = false;
+	BTCompressState *compressState = NULL;
+
+	/*
+	 * Don't use compression for indexes with INCLUDEd columns, system indexes
+	 * and unique indexes.
+	 */
+	use_compression = ((IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+						IndexRelationGetNumberOfAttributes(wstate->index))
+					   && (!IsSystemRelation(wstate->index))
+					   && (!wstate->index->rd_index->indisunique));
 
 	if (merge)
 	{
@@ -1266,19 +1399,88 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	}
 	else
 	{
-		/* merge is unnecessary */
-		while ((itup = tuplesort_getindextuple(btspool->sortstate,
-											   true)) != NULL)
+		if (!use_compression)
 		{
-			/* When we see first tuple, create first index page */
-			if (state == NULL)
-				state = _bt_pagestate(wstate, 0);
+			/* merge is unnecessary */
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+					state = _bt_pagestate(wstate, 0);
 
-			_bt_buildadd(wstate, state, itup);
+				_bt_buildadd(wstate, state, itup);
 
-			/* Report progress */
-			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
-										 ++tuples_done);
+				/* Report progress */
+				pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+											 ++tuples_done);
+			}
+		}
+		else
+		{
+			/* init compress state needed to build posting tuples */
+			compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+			compressState->ipd = NULL;
+			compressState->ntuples = 0;
+			compressState->itupprev = NULL;
+			compressState->maxitemsize = 0;
+			compressState->maxpostingsize = 0;
+
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+				{
+					state = _bt_pagestate(wstate, 0);
+					compressState->maxitemsize = BTMaxItemSize(state->btps_page);
+				}
+
+				if (compressState->itupprev != NULL)
+				{
+					int			n_equal_atts = _bt_keep_natts_fast(wstate->index,
+																   compressState->itupprev, itup);
+
+					if (n_equal_atts > natts)
+					{
+						/*
+						 * Tuples are equal. Create or update posting.
+						 *
+						 * Else If posting is too big, insert it on page and
+						 * continue.
+						 */
+						if ((compressState->ntuples + 1) * sizeof(ItemPointerData) <
+							compressState->maxpostingsize)
+							add_item_to_posting(compressState, itup);
+						else
+							insert_itupprev_to_page_buildadd(wstate, state, compressState);
+					}
+					else
+					{
+						/*
+						 * Tuples are not equal. Insert itupprev into index.
+						 * Save current tuple for the next iteration.
+						 */
+						insert_itupprev_to_page_buildadd(wstate, state, compressState);
+					}
+				}
+
+				/*
+				 * Save the tuple to compare it with the next one and maybe
+				 * unite them into a posting tuple.
+				 */
+				if (compressState->itupprev)
+					pfree(compressState->itupprev);
+				compressState->itupprev = CopyIndexTuple(itup);
+
+				/* compute max size of posting list */
+				compressState->maxpostingsize = compressState->maxitemsize -
+					IndexInfoFindDataOffset(compressState->itupprev->t_info) -
+					MAXALIGN(IndexTupleSize(compressState->itupprev));
+			}
+
+			/* Handle the last item */
+			insert_itupprev_to_page_buildadd(wstate, state, compressState);
 		}
 	}
 
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 93fab264ae..22ffcbc8be 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1787,7 +1787,9 @@ _bt_killitems(IndexScanDesc scan)
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			/* No microvacuum for posting tuples */
+			if (!BTreeTupleIsPosting(ituple) &&
+				(ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
 			{
 				/* found the item */
 				ItemIdMarkDead(iid);
@@ -2145,6 +2147,16 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(firstright))
+		{
+			BTreeTupleClearBtIsPosting(pivot);
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			pivot->t_info &= ~INDEX_SIZE_MASK;
+			pivot->t_info |= BTreeTupleGetPostingOffset(firstright);
+		}
+
+		Assert(!BTreeTupleIsPosting(pivot));
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2168,6 +2180,27 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		pfree(pivot);
 		pivot = tidpivot;
 	}
+	else if (BTreeTupleIsPosting(firstright))
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal. But
+		 * the tuple is a compressed tuple with a posting list, so we still
+		 * must truncate it.
+		 *
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		newsize = BTreeTupleGetPostingOffset(firstright) +
+			MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+		pivot->t_info &= ~INDEX_SIZE_MASK;
+		pivot->t_info |= newsize;
+		BTreeTupleClearBtIsPosting(pivot);
+		BTreeTupleSetAltHeapTID(pivot);
+
+		Assert(!BTreeTupleIsPosting(pivot));
+	}
 	else
 	{
 		/*
@@ -2205,7 +2238,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2216,9 +2249,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+							  BTreeTupleGetMinTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetMinTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetMinTID(firstright)) < 0);
 #else
 
 	/*
@@ -2231,7 +2267,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMinTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2240,7 +2276,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetMinTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2330,6 +2367,10 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * leaving excessive amounts of free space on either side of page split.
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts.
+ *
+ * To build a posting tuple we need to ensure that all attributes
+ * of both tuples are equal. Use this function to compare them.
+ * TODO: maybe it's worth to rename the function.
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2415,7 +2456,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * Non-pivot tuples currently never use alternative heap TID
 			 * representation -- even those within heapkeyspace indexes
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
@@ -2470,7 +2511,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * that to decide if the tuple is a pre-v11 tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
+				(!BTreeTupleIsPivot(itup) &&
 				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
 		}
 		else
@@ -2497,7 +2538,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
 		return false;
 
 	/*
@@ -2549,6 +2590,8 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 	if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
 		return;
 
+	/* TODO correct error messages for posting tuples */
+
 	/*
 	 * Internal page insertions cannot fail here, because that would mean that
 	 * an earlier leaf level insertion that should have failed didn't
@@ -2575,3 +2618,59 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Given a basic tuple that contains key datum and posting list,
+ * build a posting tuple.
+ *
+ * Basic tuple can be a posting tuple, but we only use key part of it,
+ * all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1 fallback to building a non-posting tuple.
+ * It is necessary to avoid storage overhead after posting tuple was vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+	uint32		keysize,
+				newsize = 0;
+	IndexTuple	itup;
+
+	/* We only need key part of the tuple */
+	if (BTreeTupleIsPosting(tuple))
+		keysize = BTreeTupleGetPostingOffset(tuple);
+	else
+		keysize = IndexTupleSize(tuple);
+
+	Assert(nipd > 0);
+
+	/* Add space needed for posting list */
+	if (nipd > 1)
+		newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+
+	newsize = MAXALIGN(newsize);
+	itup = palloc0(newsize);
+	memcpy(itup, tuple, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+
+	if (nipd > 1)
+	{
+		/* Form posting tuple, fill posting fields */
+
+		/* Set meta info about the posting list */
+		itup->t_info |= INDEX_ALT_TID_MASK;
+		BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+		/* Copy posting list into the posting tuple */
+		memcpy(BTreeTupleGetPosting(itup), ipd,
+			   sizeof(ItemPointerData) * nipd);
+	}
+	else
+	{
+		/* To finish building of a non-posting tuple, copy TID from ipd */
+		ItemPointerCopy(ipd, &itup->t_tid);
+	}
+
+	return itup;
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 3147ea4726..7daadc9cd5 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -384,8 +384,8 @@ btree_xlog_vacuum(XLogReaderState *record)
 	Buffer		buffer;
 	Page		page;
 	BTPageOpaque opaque;
-#ifdef UNUSED
 	xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
 
 	/*
 	 * This section of code is thought to be no longer needed, after analysis
@@ -476,14 +476,35 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		if (len > 0)
 		{
-			OffsetNumber *unused;
-			OffsetNumber *unend;
+			if (xlrec->nremaining)
+			{
+				int			i;
+				OffsetNumber *remainingoffset;
+				IndexTuple	remaining;
+				Size		itemsz;
 
-			unused = (OffsetNumber *) ptr;
-			unend = (OffsetNumber *) ((char *) ptr + len);
+				remainingoffset = (OffsetNumber *)
+					(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+				remaining = (IndexTuple) ((char *) remainingoffset +
+										  xlrec->nremaining * sizeof(OffsetNumber));
 
-			if ((unend - unused) > 0)
-				PageIndexMultiDelete(page, unused, unend - unused);
+				/* Handle posting tuples */
+				for (i = 0; i < xlrec->nremaining; i++)
+				{
+					PageIndexTupleDelete(page, remainingoffset[i]);
+
+					itemsz = MAXALIGN(IndexTupleSize(remaining));
+
+					if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+									false, false) == InvalidOffsetNumber)
+						elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+					remaining = (IndexTuple) ((char *) remaining + itemsz);
+				}
+			}
+
+			if (xlrec->ndeleted)
+				PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 		}
 
 		/*
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 744ffb6c61..85ee040428 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -141,6 +141,11 @@ typedef IndexAttributeBitMapData * IndexAttributeBitMap;
  * On such a page, N tuples could take one MAXALIGN quantum less space than
  * estimated here, seemingly allowing one more tuple than estimated here.
  * But such a page always has at least MAXALIGN special space, so we're safe.
+ *
+ * Note: btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so they may contain more tuples.
+ * Use MaxPostingIndexTuplesPerPage instead.
+
  */
 #define MaxIndexTuplesPerPage	\
 	((int) ((BLCKSZ - SizeOfPageHeaderData) / \
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index a3583f225b..0749e64b11 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
  * All other types of index tuples ("pivot" tuples) only have key columns,
  * since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,39 @@ typedef struct BTMetaPageData
  * omitted rather than truncated, since its representation is different to
  * the non-pivot representation.)
  *
+ * Non-pivot posting tuple format:
+ *  t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicated keys more effectively,
+ * BTREE_VERSION 5 introduced new format of tuples - posting tuples.
+ * posting_list is an array of ItemPointerData.
+ *
+ * This type of compression never applies to system indexes, unique indexes
+ * or indexes with INCLUDEd columns.
+ *
+ * To differ posting tuples we use INDEX_ALT_TID_MASK flag in t_info and
+ * BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's tid:
+ * - t_tid.ip_blkid contains offset of the posting list.
+ * - t_tid offset field contains number of posting items this tuple contain
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * tuples, which is constrainted by BTMaxItemSize.
+
+ * If page contains so many duplicates, that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), page may contain several posting
+ * tuples with the same key.
+ * Also page can contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
  * Pivot tuple format:
  *
  *  t_tid | t_info | key values | [heap TID]
@@ -281,23 +313,148 @@ typedef struct BTMetaPageData
  * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
  * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
  * number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinct posting tuples from pivot tuples.
  *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_N_POSTING_OFFSET_MASK	0x0FFF
 #define BT_HEAP_TID_ATTR			0x1000
+#define BT_IS_POSTING				0x2000
 
-/* Get/set downlink block number */
+#define BTreeTupleIsPosting(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+	)
+
+#define BTreeTupleIsPivot(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+	)
+
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger then
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+	((int) ((BLCKSZ - SizeOfPageHeaderData - \
+			3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+			(sizeof(ItemPointerData)))
+
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * Iterating over tuples during index build or applying compression to a
+ * single page, we remember a tuple in itupprev, then compare the next one
+ * with it. If tuples are equal, save their TIDs in the posting list.
+ * ntuples contains the size of the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTCompressState
+{
+	Size		maxitemsize;
+	Size		maxpostingsize;
+	IndexTuple	itupprev;
+	int			ntuples;
+	ItemPointerData *ipd;
+} BTCompressState;
+
+/* macros to work with posting tuples *BEGIN* */
+#define BTreeTupleSetBtIsPosting(itup) \
+	do { \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleGetNPosting(itup)	\
+	( \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+	)
+
+#define BTreeTupleSetNPosting(itup, n) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+		BTreeTupleSetBtIsPosting(itup); \
+	} while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list.
+ * Caller is responsible for checking BTreeTupleIsPosting to ensure that
+ * he will get what he expects
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
+#define BTreeTupleSetPostingOffset(itup, offset) \
+	ItemPointerSetBlockNumber(&((itup)->t_tid), (offset))
+
+#define BTreeSetPostingMeta(itup, nposting, off) \
+	do { \
+		BTreeTupleSetNPosting(itup, nposting); \
+		BTreeTupleSetPostingOffset(itup, off); \
+	} while(0)
+
+#define BTreeTupleGetPosting(itup) \
+	(ItemPointerData*) ((char*)(itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+	(ItemPointerData*) (BTreeTupleGetPosting(itup) + (n))
+
+/*
+ * Posting tuples always contain several TIDs.
+ * Some functions that use TID as a tiebreaker,
+ * to ensure correct order of TID keys they can use two macros below:
+ */
+#define BTreeTupleGetMinTID(itup) \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+		( \
+			(ItemPointer) BTreeTupleGetPosting(itup) \
+		) \
+		: \
+		(ItemPointer) &((itup)->t_tid) \
+	)
+#define BTreeTupleGetMaxTID(itup) \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+		( \
+			(ItemPointer) (BTreeTupleGetPosting(itup) + (BTreeTupleGetNPosting(itup)-1)) \
+		) \
+		: \
+		(ItemPointer) &((itup)->t_tid) \
+	)
+/* macros to work with posting tuples *END* */
+
+/* Get/set downlink block number  */
 #define BTreeInnerTupleGetDownLink(itup) \
 	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
 #define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,7 +483,8 @@ typedef struct BTMetaPageData
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
 		( \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
@@ -335,6 +493,7 @@ typedef struct BTMetaPageData
 	)
 #define BTreeTupleSetNAtts(itup, n) \
 	do { \
+		Assert(!BTreeTupleIsPosting(itup)); \
 		(itup)->t_info |= INDEX_ALT_TID_MASK; \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
 	} while(0)
@@ -342,6 +501,8 @@ typedef struct BTMetaPageData
 /*
  * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
  * and non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * For non-pivot posting tuple it returns the first tid from posting list.
  */
 #define BTreeTupleGetHeapTID(itup) \
 	( \
@@ -351,7 +512,10 @@ typedef struct BTMetaPageData
 		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
 					   sizeof(ItemPointerData)) \
 	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+	  : (itup)->t_info & INDEX_ALT_TID_MASK ? \
+		(((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0) ? \
+		(ItemPointer) BTreeTupleGetPosting(itup) : NULL) \
+		: (ItemPointer) &((itup)->t_tid) \
 	)
 /*
  * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
@@ -360,6 +524,7 @@ typedef struct BTMetaPageData
 #define BTreeTupleSetAltHeapTID(itup) \
 	do { \
 		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
 								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
 	} while(0)
@@ -567,6 +732,8 @@ typedef struct BTScanPosData
 	 * location in the associated tuple storage workspace.
 	 */
 	int			nextTupleOffset;
+	/* prevTupleOffset is for posting list handling */
+	int			prevTupleOffset;
 
 	/*
 	 * The items array is always ordered in index order (ie, increasing
@@ -579,7 +746,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxPostingIndexTuplesPerPage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -763,6 +930,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems, Relation heapRel);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems,
+								OffsetNumber *remainingoffset,
+								IndexTuple *remaining, int nremaining,
 								BlockNumber lastBlockVacuumed);
 extern int	_bt_pagedel(Relation rel, Buffer buf);
 
@@ -813,6 +982,8 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd,
+										int nipd);
 
 /*
  * prototypes for functions in nbtvalidate.c
@@ -825,5 +996,7 @@ extern bool btvalidate(Oid opclassoid);
 extern IndexBuildResult *btbuild(Relation heap, Relation index,
 								 struct IndexInfo *indexInfo);
 extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+extern void add_item_to_posting(BTCompressState *compressState,
+								IndexTuple itup);
 
 #endif							/* NBTREE_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 9beccc86ea..6f60ca5f7b 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -173,10 +173,19 @@ typedef struct xl_btree_vacuum
 {
 	BlockNumber lastBlockVacuumed;
 
-	/* TARGET OFFSET NUMBERS FOLLOW */
+	/*
+	 * This field helps us to find beginning of the remaining tuples from
+	 * postings which follow array of offset numbers.
+	 */
+	uint32		nremaining;
+	uint32		ndeleted;
+
+	/* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+	/* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
+	/* TARGET OFFSET NUMBERS FOLLOW (if any) */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(BlockNumber))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.
-- 
2.17.1

#46

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Peter Geoghegan (#45)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Sat, Jul 6, 2019 at 4:08 PM Peter Geoghegan <pg@bowt.ie> wrote:

I took a closer look at this patch, and have some general thoughts on
its design, and specific feedback on the implementation.

I have some high level concerns about how the patch might increase
contention, which could make queries slower. Apparently that is a real
problem in other systems that use MVCC when their bitmap index feature
is used -- they are never really supposed to be used with OLTP apps.
This patch makes nbtree behave rather a lot like a bitmap index.
That's not exactly true, because you're not storing a bitmap or
compressing the TID lists, but they're definitely quite similar. It's
easy to imagine a hybrid approach, that starts with a B-Tree with
deduplication/TID lists, and eventually becomes a bitmap index as more
duplicates are added [1]http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.98.3159&rep=rep1&type=pdf.

It doesn't seem like it would be practical for these other MVCC
database systems to have standard B-Tree secondary indexes that
compress duplicates gracefully in the way that you propose to with
this patch, because lock contention would presumably be a big problem
for the same reason as it is with their bitmap indexes (whatever the
true reason actually is). Is it really possible to have something
that's adaptive, offering the best of both worlds?

Having dug into it some more, I think that the answer for us might
actually be "yes, we can have it both ways". Other database systems
that are also based on MVCC still probably use a limited form of index
locking, even in READ COMMITTED mode, though this isn't very widely
known. They need this for unique indexes, but they also need it for
transaction rollback, to remove old entries from the index when the
transaction must abort. The section "6.7 Standard Practice" from the
paper "Architecture of a Database System" [2]http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf -- Peter Geoghegan goes into this, saying:

"All production databases today support ACID transactions. As a rule,
they use write-ahead logging for durability, and two-phase locking for
concurrency control. An exception is PostgreSQL, which uses
multiversion concurrency control throughout."

I suggest reading "6.7 Standard Practice" in full.

Anyway, I think that *hundreds* or even *thousands* of rows are
effectively locked all at once when a bitmap index needs to be updated
in these other systems -- and I mean a heavyweight lock that lasts
until the xact commits or aborts, like a Postgres row lock. As I said,
this is necessary simply because the transaction might need to roll
back. Of course, your patch never needs to do anything like that --
the only risk is that buffer lock contention will be increased. Maybe
VACUUM isn't so bad after all!

Doing deduplication adaptively and automatically in nbtree seems like
it might play to the strengths of Postgres, while also ameliorating
its weaknesses. As the same paper goes on to say, it's actually quite
unusual that PostgreSQL has *transactional* full text search built in
(using GIN), and offers transactional, high concurrency spatial
indexing (using GiST). Actually, this is an additional advantages of
our "pure" approach to MVCC -- we can add new high concurrency,
transactional access methods relatively easily.

[1]: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.98.3159&rep=rep1&type=pdf
[2]: http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf -- Peter Geoghegan
--
Peter Geoghegan

#47

Bruce Momjian

bruce@momjian.us

over 6 years ago

In reply to: Peter Geoghegan (#46)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Wed, Jul 10, 2019 at 09:53:04PM -0700, Peter Geoghegan wrote:

Anyway, I think that *hundreds* or even *thousands* of rows are
effectively locked all at once when a bitmap index needs to be updated
in these other systems -- and I mean a heavyweight lock that lasts
until the xact commits or aborts, like a Postgres row lock. As I said,
this is necessary simply because the transaction might need to roll
back. Of course, your patch never needs to do anything like that --
the only risk is that buffer lock contention will be increased. Maybe
VACUUM isn't so bad after all!

Doing deduplication adaptively and automatically in nbtree seems like
it might play to the strengths of Postgres, while also ameliorating
its weaknesses. As the same paper goes on to say, it's actually quite
unusual that PostgreSQL has *transactional* full text search built in
(using GIN), and offers transactional, high concurrency spatial
indexing (using GiST). Actually, this is an additional advantages of
our "pure" approach to MVCC -- we can add new high concurrency,
transactional access methods relatively easily.

Wow, I never thought of that. The only things I know we lock until
transaction end are rows we update (against concurrent updates), and
additions to unique indexes. By definition, indexes with many
duplicates are not unique, so that doesn't apply.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +

#48

Alexander Korotkov

a.korotkov@postgrespro.ru

over 6 years ago

In reply to: Peter Geoghegan (#45)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

Hi Peter,

Thank you very much for your attention to this patch. Let me comment
some points of your message.

On Sun, Jul 7, 2019 at 2:09 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Thu, Jul 4, 2019 at 5:06 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

The new version of the patch is attached.
This version is even simpler than the previous one,
thanks to the recent btree design changes and all the feedback I received.
I consider it ready for review and testing.

I took a closer look at this patch, and have some general thoughts on
its design, and specific feedback on the implementation.

Preserving the *logical contents* of B-Tree indexes that use
compression is very important -- that should not change in a way that
outside code can notice. The heap TID itself should count as logical
contents here, since we want to be able to implement retail index
tuple deletion in the future. Even without retail index tuple
deletion, amcheck's "rootdescend" verification assumes that it can
find one specific tuple (which could now just be one specific "logical
tuple") using specific key values from the heap, including the heap
tuple's heap TID. This requirement makes things a bit harder for your
patch, because you have to deal with one or two edge-cases that you
currently don't handle: insertion of new duplicates that fall inside
the min/max range of some existing posting list. That should be rare
enough in practice, so the performance penalty won't be too bad. This
probably means that code within _bt_findinsertloc() and/or
_bt_binsrch_insert() will need to think about a logical tuple as a
distinct thing from a physical tuple, though that won't be necessary
in most places.

Could you please elaborate more on preserving the logical contents? I
can understand it as following: "B-Tree should have the same structure
and invariants as if each TID in posting list be a separate tuple".
So, if we imagine each TID to become separate tuple it would be the
same B-tree, which just can magically sometimes store more tuples in
page. Is my understanding correct? But outside code will still
notice changes as soon as it directly accesses B-tree pages (like
contrib/amcheck does). Do you mean we need an API for accessing
logical B-tree tuples or something?

The need to "preserve the logical contents" also means that the patch
will need to recognize when indexes are not safe as a target for
compression/deduplication (maybe we should call this feature
deduplilcation, so it's clear how it differs from TOAST?). For
example, if we have a case-insensitive ICU collation, then it is not
okay to treat an opclass-equal pair of text strings that use the
collation as having the same value when considering merging the two
into one. You don't actually do that in the patch, but you also don't
try to deal with the fact that such a pair of strings are equal, and
so must have their final positions determined by the heap TID column
(deduplication within _bt_compress_one_page() must respect that).
Possibly equal-but-distinct values seems like a problem that's not
worth truly fixing, but it will be necessary to store metadata about
whether or not we're willing to do deduplication in the meta page,
based on operator class and collation details. That seems like a
restriction that we're just going to have to accept, though I'm not
too worried about exactly what that will look like right now. We can
work it out later.

I think in order to deduplicate "equal but distinct" values we need at
least to give up with index only scans. Because we have no
restriction that equal according to B-tree opclass values are same for
other operations and/or user output.

I think that the need to be careful about the logical contents of
indexes already causes bugs, even with "safe for compression" indexes.
For example, I can sometimes see an assertion failure
within_bt_truncate(), at the point where we check if heap TID values
are safe:

/*
* Lehman and Yao require that the downlink to the right page, which is to
* be inserted into the parent page in the second phase of a page split be
* a strict lower bound on items on the right page, and a non-strict upper
* bound for items on the left page. Assert that heap TIDs follow these
* invariants, since a heap TID value is apparently needed as a
* tiebreaker.
*/
#ifndef DEBUG_NO_TRUNCATE
Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
BTreeTupleGetMinTID(firstright)) < 0);
...

This bug is not that easy to see, but it will happen with a big index,
even without updates or deletes. I think that this happens because
compression can allow the "logical tuples" to be in the wrong heap TID
order when there are multiple posting lists for the same value. As I
said, I think that it's necessary to see a posting list as being
comprised of multiple logical tuples in the context of inserting new
tuples, even when you're not performing compression or splitting the
page. I also see that amcheck's bt_index_parent_check() function
fails, though bt_index_check() does not fail when I don't use any of
its extra verification options. (You haven't updated amcheck, but I
don't think that you need to update it for these basic checks to
continue to work.)

Do I understand correctly that current patch may produce posting lists
of the same value with overlapping ranges of TIDs? If so, it's
definitely wrong.

* Maybe we could do compression with unique indexes when inserting
values with NULLs? Note that we now treat an insertion of a tuple with
NULLs into a unique index as if it wasn't even a unique index -- see
the "checkingunique" optimization at the beginning of _bt_doinsert().
Having many NULL values in a unique index is probably fairly common.

I think unique indexes may benefit from deduplication not only because
of NULL values. Non-HOT updates produce duplicates of non-NULL values
in unique indexes. And those duplicates can take significant space.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#49

Alexander Korotkov

a.korotkov@postgrespro.ru

over 6 years ago

In reply to: Peter Geoghegan (#46)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Thu, Jul 11, 2019 at 7:53 AM Peter Geoghegan <pg@bowt.ie> wrote:

Anyway, I think that *hundreds* or even *thousands* of rows are
effectively locked all at once when a bitmap index needs to be updated
in these other systems -- and I mean a heavyweight lock that lasts
until the xact commits or aborts, like a Postgres row lock. As I said,
this is necessary simply because the transaction might need to roll
back. Of course, your patch never needs to do anything like that --
the only risk is that buffer lock contention will be increased. Maybe
VACUUM isn't so bad after all!

Doing deduplication adaptively and automatically in nbtree seems like
it might play to the strengths of Postgres, while also ameliorating
its weaknesses. As the same paper goes on to say, it's actually quite
unusual that PostgreSQL has *transactional* full text search built in
(using GIN), and offers transactional, high concurrency spatial
indexing (using GiST). Actually, this is an additional advantages of
our "pure" approach to MVCC -- we can add new high concurrency,
transactional access methods relatively easily.

Good finding, thank you!

BTW, I think deduplication could cause some small performance
degradation in some particular cases, because page-level locks became
more coarse grained once pages hold more tuples. However, this
doesn't seem like something we should much care about. Providing an
option to turn deduplication off looks enough for me.

Regarding bitmap indexes itself, I think our BRIN could provide them.
However, it would be useful to have opclass parameters to make them
tunable.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#50

Rafia Sabih

rafia.pghackers@gmail.com

over 6 years ago

In reply to: Anastasia Lubennikova (#1)

Fwd: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Sun, 7 Jul 2019 at 01:08, Peter Geoghegan <pg@bowt.ie> wrote:

* Maybe we could do compression with unique indexes when inserting
values with NULLs? Note that we now treat an insertion of a tuple with

I tried this patch and found the improvements impressive. However,
when I tried with multi-column indexes it wasn't giving any
improvement, is it the known limitation of the patch?
I am surprised to find that such a patch is on radar since quite some
years now and not yet committed.

Going through the patch, here are a few comments from me,

 /* Add the new item into the page */
+ offnum = OffsetNumberNext(offnum);
+
+ elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d
IndexTupleSize %zu free %zu",
+ compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+
and other such DEBUG4 statements are meant to be removed, right...?
Just because I didn't find any other such statements in this API and
there are many in this patch, so not sure how much are they needed.

/*
* If we have only 10 uncompressed items on the full page, it probably
* won't worth to compress them.
*/
if (maxoff - n_posting_on_page < 10)
return;

Is this a magic number...?

/*
* We do not expect to meet any DEAD items, since this function is
* called right after _bt_vacuum_one_page(). If for some reason we
* found dead item, don't compress it, to allow upcoming microvacuum
* or vacuum clean it up.
*/
if (ItemIdIsDead(itemId))
continue;

This makes me wonder about those 'some' reasons.

Caller is responsible for checking BTreeTupleIsPosting to ensure that
+ * he will get what he expects

This can be re-framed to make the caller more gender neutral.

Other than that, I am curious about the plans for its backward compatibility.

--
Regards,
Rafia Sabih

Import Notes

Reply to msg id not found: CA+FpmFfG6nsjE9BbPhU95SXhBon3mtdmMhwMeo3SEiAwjKuD3Q@mail.gmail.com

#51

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Bruce Momjian (#47)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Thu, Jul 11, 2019 at 7:30 AM Bruce Momjian <bruce@momjian.us> wrote:

Wow, I never thought of that. The only things I know we lock until
transaction end are rows we update (against concurrent updates), and
additions to unique indexes. By definition, indexes with many
duplicates are not unique, so that doesn't apply.

Right. Another advantage of their approach is that you can make
queries like this work:

UPDATE tab SET unique_col = unique_col + 1

This will not throw a unique violation error on most/all other DB
systems when the updated column (in this case "unique_col") has a
unique constraint/is the primary key. This behavior is actually
required by the SQL standard. An SQL statement is supposed to be
all-or-nothing, which Postgres doesn't quite manage here.

The section "6.6 Interdependencies of Transactional Storage" from the
paper "Architecture of a Database System" provides additional
background information (I should have suggested reading both 6.6 and
6.7 together).

--
Peter Geoghegan

#52

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Alexander Korotkov (#48)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Thu, Jul 11, 2019 at 8:02 AM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

Could you please elaborate more on preserving the logical contents? I
can understand it as following: "B-Tree should have the same structure
and invariants as if each TID in posting list be a separate tuple".

That's exactly what I mean.

So, if we imagine each TID to become separate tuple it would be the
same B-tree, which just can magically sometimes store more tuples in
page. Is my understanding correct?

Yes.

But outside code will still
notice changes as soon as it directly accesses B-tree pages (like
contrib/amcheck does). Do you mean we need an API for accessing
logical B-tree tuples or something?

Well, contrib/amcheck isn't really outside code. But amcheck's
"rootdescend" option will still need to be able to supply a heap TID
as just another column, and get back zero or one logical tuples from
the index. This is important because retail index tuple deletion needs
to be able to think about logical tuples in the same way. I also think
that it might be useful for the planner to expect to get back
duplicates in heap TID order in the future (or in reverse order in the
case of a backwards scan). Query execution and VACUUM code outside of
nbtree should be able to pretend that there is no such thing as a
posting list.

The main thing that the patch is missing that is needed to "preserve
logical contents" is the ability to update/expand an *existing*
posting list due to a retail insertion of a new duplicate that happens
to be within the range of that existing posting list. This will
usually be a non-HOT update that doesn't change the value for the row
in the index -- that must change the posting list, even when there is
available space on the page without recompressing. We must still
occasionally be eager, like GIN always is, though in practice we'll
almost always add to posting lists in a lazy fashion, when it looks
like we might have to split the page -- the lazy approach seems to
perform best.

I think in order to deduplicate "equal but distinct" values we need at
least to give up with index only scans. Because we have no
restriction that equal according to B-tree opclass values are same for
other operations and/or user output.

We can either prevent index-only scans in the case of affected
indexes, or prevent compression, or give the user a choice. I'm not
too worried about how that will work for users just yet.

Do I understand correctly that current patch may produce posting lists
of the same value with overlapping ranges of TIDs? If so, it's
definitely wrong.

Yes, it can, since the assertion fails. It looks like the assertion
itself was changed to match what I expect, so I assume that this bug
will be fixed in the next version of the patch. It fails with a fairly
big index on text for me.

* Maybe we could do compression with unique indexes when inserting
values with NULLs? Note that we now treat an insertion of a tuple with
NULLs into a unique index as if it wasn't even a unique index -- see
the "checkingunique" optimization at the beginning of _bt_doinsert().
Having many NULL values in a unique index is probably fairly common.

I think unique indexes may benefit from deduplication not only because
of NULL values. Non-HOT updates produce duplicates of non-NULL values
in unique indexes. And those duplicates can take significant space.

I agree that we should definitely have an open mind about unique
indexes, even with non-NULL values. If we can prevent a page split by
deduplicating the contents of a unique index page, then we'll probably
win. Why not try? This will need to be tested.

--
Peter Geoghegan

#53

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Alexander Korotkov (#49)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Thu, Jul 11, 2019 at 8:09 AM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

BTW, I think deduplication could cause some small performance
degradation in some particular cases, because page-level locks became
more coarse grained once pages hold more tuples. However, this
doesn't seem like something we should much care about. Providing an
option to turn deduplication off looks enough for me.

There was an issue like this with my v12 work on nbtree, with the
TPC-C indexes. They were always ~40% smaller, but there was a
regression when TPC-C was used with a small number of warehouses, when
the data could easily fit in memory (which is not allowed by the TPC-C
spec, in effect). TPC-C is very write-heavy, which combined with
everything else causes this problem. I wasn't doing anything too fancy
there -- the regression seemed to happen simply because the index was
smaller, not because of the overhead of doing page splits differently
or anything like that (there were far fewer splits).

I expect there to be some regression for workloads like this. I am
willing to accept that provided it's not too noticeable, and doesn't
have an impact on other workloads. I am optimistic about it.

Regarding bitmap indexes itself, I think our BRIN could provide them.
However, it would be useful to have opclass parameters to make them
tunable.

I thought that we might implement them in nbtree myself. But we don't
need to decide now.

--
Peter Geoghegan

#54

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Rafia Sabih (#50)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Thu, Jul 11, 2019 at 8:34 AM Rafia Sabih <rafia.pghackers@gmail.com> wrote:

I tried this patch and found the improvements impressive. However,
when I tried with multi-column indexes it wasn't giving any
improvement, is it the known limitation of the patch?

It'll only deduplicate full duplicates. It works with multi-column
indexes, provided the entire set of values in duplicated -- not just a
prefix. Prefix compression is possible, but it's more complicated. It
seems to generally require the DBA to specify a prefix length,
expressed as a number of prefix columns.

I am surprised to find that such a patch is on radar since quite some
years now and not yet committed.

The v12 work on nbtree (making heap TID a tiebreaker column) seems to
have made the general approach a lot more effective. Compression is
performed lazily, not eagerly, which seems to work a lot better.

+ elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d
IndexTupleSize %zu free %zu",
+ compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+
and other such DEBUG4 statements are meant to be removed, right...?

I hope so too.

/*
* If we have only 10 uncompressed items on the full page, it probably
* won't worth to compress them.
*/
if (maxoff - n_posting_on_page < 10)
return;

Is this a magic number...?

I think that this should be a constant or something.

/*
* We do not expect to meet any DEAD items, since this function is
* called right after _bt_vacuum_one_page(). If for some reason we
* found dead item, don't compress it, to allow upcoming microvacuum
* or vacuum clean it up.
*/
if (ItemIdIsDead(itemId))
continue;

This makes me wonder about those 'some' reasons.

I think that this is just defensive. Note that _bt_vacuum_one_page()
is prepared to find no dead items, even when the BTP_HAS_GARBAGE flag
is set for the page.

Caller is responsible for checking BTreeTupleIsPosting to ensure that
+ * he will get what he expects

This can be re-framed to make the caller more gender neutral.

Agreed. I also don't like anthropomorphizing code like this.

Other than that, I am curious about the plans for its backward compatibility.

Me too. There is something about a new version 5 in comments in
nbtree.h, but the version number isn't changed. I think that we may be
able to get away with not increasing the B-Tree version from 4 to 5,
actually. Deduplication is performed lazily when it looks like we
might have to split the page, so there isn't any expectation that
tuples will either be compressed or uncompressed in any context.

--
Peter Geoghegan

#55

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Peter Geoghegan (#52)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Thu, Jul 11, 2019 at 10:42 AM Peter Geoghegan <pg@bowt.ie> wrote:

I think unique indexes may benefit from deduplication not only because
of NULL values. Non-HOT updates produce duplicates of non-NULL values
in unique indexes. And those duplicates can take significant space.

I agree that we should definitely have an open mind about unique
indexes, even with non-NULL values. If we can prevent a page split by
deduplicating the contents of a unique index page, then we'll probably
win. Why not try? This will need to be tested.

I thought about this some more. I believe that the LP_DEAD bit setting
within _bt_check_unique() is generally more important than the more
complicated kill_prior_tuple mechanism for setting LP_DEAD bits, even
though the _bt_check_unique() thing can only be used with unique
indexes. Also, I have often thought that we don't do enough to take
advantage of the special characteristics of unique indexes -- they
really are quite different. I believe that other database systems do
this in various ways. Maybe we should too.

Unique indexes are special because there can only ever be zero or one
tuples of the same value that are visible to any possible MVCC
snapshot. Within the index AM, there is little difference between an
UPDATE by a transaction and a DELETE + INSERT of the same value by a
transaction. If there are 3 or 5 duplicates within a unique index,
then there is a strong chance that VACUUM could reclaim some of them,
given the chance. It is worth going to a little effort to find out.

In a traditional serial/bigserial primary key, the key space that is
typically "owned" by the left half of a rightmost page split describes
a range of about ~366 items, with few or no gaps for other values that
didn't exist at the time of the split (i.e. the two pivot tuples on
each side cover a range that is equal to the number of items itself).
If the page ever splits again, the chances of it being due to non-HOT
updates is perhaps 100%. Maybe VACUUM just didn't get around to the
index in time, or maybe there is a long running xact, or whatever. If
we can delay page splits in indexes like this, then we could easily
prevent them from *ever* happening.

Our first line of defense against page splits within unique indexes
will probably always be LP_DEAD bits set within _bt_check_unique(),
because it costs so little -- same as today. We could also add a
second line of defense: deduplication -- same as with non-unique
indexes with the patch. But we can even add a third line of defense on
top of those two: more aggressive reclaiming of posting list space, by
going to the heap to check the visibility status of earlier posting
list entries. We can do this optimistically when there is no LP_DEAD
bit set, based on heuristics.

The high level principle here is that we can justify going to a small
amount of extra effort for the chance to avoid a page split, and maybe
even more than a small amount. Our chances of reversing the split by
merging pages later on are almost zero. The two halves of the split
will probably each get dirtied again and again in the future if we
cannot avoid it, plus we have to dirty the parent page, and the old
sibling page (to update its left link). In general, a page split is
already really expensive. We could do something like amortize the cost
of accessing the heap a second time for tuples that we won't have
considered setting the LP_DEAD bit on within _bt_check_unique() by
trying the *same* heap page a *second* time where possible (distinct
values are likely to be nearby on the same page). I think that an
approach like this could work quite well for many workloads. You only
pay a cost (visiting the heap an extra time) when it looks like you'll
get a benefit (not splitting the page).

As you know, Andres already changed nbtree to get an XID for conflict
purposes on the primary by visiting the heap a second time (see commit
558a9165e08), when we need to actually reclaim LP_DEAD space. I
anticipated that we could extend this to do more clever/eager/lazy
cleanup of additional items before that went in, which is a closely
related idea. See:

/messages/by-id/CAH2-Wznx8ZEuXu7BMr6cVpJ26G8OSqdVo6Lx_e3HSOOAU86YoQ@mail.gmail.com

I know that this is a bit hand-wavy; the details certainly need to be
worked out. However, it is not so different to the "ghost bit" design
that other systems use with their non-unique indexes (though this idea
applies specifically to unique indexes in our case). The main
difference is that we're going to the heap rather than to UNDO,
because that's where we store our visibility information. That doesn't
seem like such a big difference -- we are also reasonably confident
that we'll find that the TID is dead, even without LP_DEAD bits being
set, because we only do the extra stuff with unique indexes. And, we
do it lazily.

--
Peter Geoghegan

#56

Anastasia Lubennikova

a.lubennikova@postgrespro.ru

over 6 years ago

In reply to: Peter Geoghegan (#54)

1 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

11.07.2019 21:19, Peter Geoghegan wrote:

On Thu, Jul 11, 2019 at 8:34 AM Rafia Sabih <rafia.pghackers@gmail.com> wrote:

Hi,
Peter, Rafia, thanks for the review. New version is attached.

+ elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d
IndexTupleSize %zu free %zu",
+ compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+
and other such DEBUG4 statements are meant to be removed, right...?

I hope so too.

Yes, these messages are only for debugging.
I haven't delete them since this is still work in progress
and it's handy to be able to print inner details.
Maybe I should also write a patch for pageinspect.

/*
* If we have only 10 uncompressed items on the full page, it probably
* won't worth to compress them.
*/
if (maxoff - n_posting_on_page < 10)
return;

Is this a magic number...?

I think that this should be a constant or something.

Fixed. Now this is a constant in nbtree.h. I'm not 100% sure about the
value.
When the code will stabilize we can benchmark it and find optimal value.

/*
* We do not expect to meet any DEAD items, since this function is
* called right after _bt_vacuum_one_page(). If for some reason we
* found dead item, don't compress it, to allow upcoming microvacuum
* or vacuum clean it up.
*/
if (ItemIdIsDead(itemId))
continue;

This makes me wonder about those 'some' reasons.

I think that this is just defensive. Note that _bt_vacuum_one_page()
is prepared to find no dead items, even when the BTP_HAS_GARBAGE flag
is set for the page.

You are right, now it is impossible to meet dead items in this function.
Though it can change in the future if, for example, _bt_vacuum_one_page
will behave lazily.
So this is just a sanity check. Maybe it's worth to move it to Assert.

Caller is responsible for checking BTreeTupleIsPosting to ensure that
+ * he will get what he expects

This can be re-framed to make the caller more gender neutral.

Agreed. I also don't like anthropomorphizing code like this.

Fixed.

Other than that, I am curious about the plans for its backward compatibility.

Me too. There is something about a new version 5 in comments in
nbtree.h, but the version number isn't changed. I think that we may be
able to get away with not increasing the B-Tree version from 4 to 5,
actually. Deduplication is performed lazily when it looks like we
might have to split the page, so there isn't any expectation that
tuples will either be compressed or uncompressed in any context.

Current implementation is backward compatible.
To distinguish posting tuples, it only adds one new flag combination.
This combination was never possible before. Comment about version 5 is
deleted.

I also added a patch for amcheck.

There is one major issue left - preserving TID order in posting lists.
For a start, I added a sort into BTreeFormPostingTuple function.
It turned out to be not very helpful, because we cannot check this
invariant lazily.

Now I work on patching _bt_binsrch_insert() and _bt_insertonpg() to
implement
insertion into the middle of the posting list. I'll send a new version
this week.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

0001-btree_compression_pg12_v2.patchtext/x-patch; name=0001-btree_compression_pg12_v2.patchDownload

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 9126c18..2b05b1e 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -1033,12 +1033,34 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				IndexTuple	onetup;
+				int i;
+
+				/* Fingerprint all elements of posting tuple one by one */
+				for (i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+					norm = bt_normalize_tuple(state, onetup);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != onetup)
+						pfree(norm);
+					pfree(onetup);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 602f884..26ddf32 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -20,6 +20,7 @@
 #include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xloginsert.h"
+#include "catalog/catalog.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
@@ -56,6 +57,8 @@ static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 						 OffsetNumber itup_off);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static bool insert_itupprev_to_page(Page page, BTCompressState *compressState);
+static void _bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel);
 
 /*
  *	_bt_doinsert() -- Handle insertion of a single index tuple in the tree.
@@ -759,6 +762,12 @@ _bt_findinsertloc(Relation rel,
 			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
 			insertstate->bounds_valid = false;
 		}
+
+		/*
+		 * If the target page is full, try to compress the page
+		 */
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
+			_bt_compress_one_page(rel, insertstate->buf, heapRel);
 	}
 	else
 	{
@@ -806,6 +815,11 @@ _bt_findinsertloc(Relation rel,
 			}
 
 			/*
+			 * Before considering moving right, try to compress the page
+			 */
+			_bt_compress_one_page(rel, insertstate->buf, heapRel);
+
+			/*
 			 * Nope, so check conditions (b) and (c) enumerated above
 			 *
 			 * The earlier _bt_check_unique() call may well have established a
@@ -2286,3 +2300,241 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * the page.
 	 */
 }
+
+/*
+ * Add new item (compressed or not) to the page, while compressing it.
+ * If insertion failed, return false.
+ * Caller should consider this as compression failure and
+ * leave page uncompressed.
+ */
+static bool
+insert_itupprev_to_page(Page page, BTCompressState *compressState)
+{
+	IndexTuple	to_insert;
+	OffsetNumber offnum = PageGetMaxOffsetNumber(page);
+
+	if (compressState->ntuples == 0)
+		to_insert = compressState->itupprev;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+											 compressState->ipd,
+											 compressState->ntuples);
+		to_insert = postingtuple;
+		pfree(compressState->ipd);
+	}
+
+	/* Add the new item into the page */
+	offnum = OffsetNumberNext(offnum);
+
+	elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d IndexTupleSize %zu free %zu",
+		 compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+
+	if (PageAddItem(page, (Item) to_insert, IndexTupleSize(to_insert),
+					offnum, false, false) == InvalidOffsetNumber)
+	{
+		elog(DEBUG4, "insert_itupprev_to_page. failed");
+
+		/*
+		 * this may happen if tuple is bigger than freespace fallback to
+		 * uncompressed page case
+		 */
+		if (compressState->ntuples > 0)
+			pfree(to_insert);
+		return false;
+	}
+
+	if (compressState->ntuples > 0)
+		pfree(to_insert);
+	compressState->ntuples = 0;
+	return true;
+}
+
+/*
+ * Before splitting the page, try to compress items to free some space.
+ * If compression didn't succeed, buffer will contain old state of the page.
+ * This function should be called after lp_dead items
+ * were removed by _bt_vacuum_one_page().
+ */
+static void
+_bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buffer);
+	Page		newpage;
+	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	bool		use_compression = false;
+	BTCompressState *compressState = NULL;
+	int			n_posting_on_page = 0;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+
+	/*
+	 * Don't use compression for indexes with INCLUDEd columns, system indexes
+	 * and unique indexes.
+	 */
+	use_compression = ((IndexRelationGetNumberOfKeyAttributes(rel) ==
+						IndexRelationGetNumberOfAttributes(rel))
+					   && (!IsSystemRelation(rel))
+					   && (!rel->rd_index->indisunique));
+	if (!use_compression)
+		return;
+
+	/* init compress state needed to build posting tuples */
+	compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+	compressState->ipd = NULL;
+	compressState->ntuples = 0;
+	compressState->itupprev = NULL;
+	compressState->maxitemsize = BTMaxItemSize(page);
+	compressState->maxpostingsize = 0;
+
+	/*
+	 * Scan over all items to see which ones can be compressed
+	 */
+	minoff = P_FIRSTDATAKEY(opaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/*
+	 * Heuristic to avoid trying to compress page that has already contain
+	 * mostly compressed items
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, P_HIKEY);
+		IndexTuple	item = (IndexTuple) PageGetItem(page, itemid);
+
+		if (BTreeTupleIsPosting(item))
+			n_posting_on_page++;
+	}
+
+	/*
+	 * If we have only a few uncompressed items on the full page,
+	 * it isn't worth to compress them
+	 */
+	if (maxoff - n_posting_on_page < BT_COMPRESS_THRESHOLD)
+		return;
+
+	newpage = PageGetTempPageCopySpecial(page);
+	elog(DEBUG4, "_bt_compress_one_page rel: %s,blkno: %u",
+		 RelationGetRelationName(rel), BufferGetBlockNumber(buffer));
+
+	/* Copy High Key if any */
+	if (!P_RIGHTMOST(opaque))
+	{
+		ItemId		itemid = PageGetItemId(page, P_HIKEY);
+		Size		itemsz = ItemIdGetLength(itemid);
+		IndexTuple	item = (IndexTuple) PageGetItem(page, itemid);
+
+		if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+						false, false) == InvalidOffsetNumber)
+		{
+			/*
+			 * Should never happen. Anyway, fallback gently to scenario of
+			 * incompressible page and just return from function.
+			 */
+			elog(DEBUG4, "_bt_compress_one_page. failed to insert highkey to newpage");
+			return;
+		}
+	}
+
+	/*
+	 * Iterate over tuples on the page, try to compress them into posting
+	 * lists and insert into new page.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemId = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemId);
+
+		/*
+		 * We do not expect to meet any DEAD items, since this function is
+		 * called right after _bt_vacuum_one_page(). If for some reason we
+		 * found dead item, don't compress it, to allow upcoming microvacuum
+		 * or vacuum clean it up.
+		 */
+		if (ItemIdIsDead(itemId))
+			continue;
+
+		if (compressState->itupprev != NULL)
+		{
+			int			n_equal_atts =
+			_bt_keep_natts_fast(rel, compressState->itupprev, itup);
+			int			itup_ntuples = BTreeTupleIsPosting(itup) ?
+			BTreeTupleGetNPosting(itup) : 1;
+
+			if (n_equal_atts > natts)
+			{
+				/*
+				 * When tuples are equal, create or update posting.
+				 *
+				 * If posting is too big, insert it on page and continue.
+				 */
+				if (compressState->maxitemsize >
+					MAXALIGN(((IndexTupleSize(compressState->itupprev)
+							   + (compressState->ntuples + itup_ntuples + 1) * sizeof(ItemPointerData)))))
+				{
+					add_item_to_posting(compressState, itup);
+				}
+				else if (!insert_itupprev_to_page(newpage, compressState))
+				{
+					elog(DEBUG4, "_bt_compress_one_page. failed to insert posting");
+					return;
+				}
+			}
+			else
+			{
+				/*
+				 * Tuples are not equal. Insert itupprev into index. Save
+				 * current tuple for the next iteration.
+				 */
+				if (!insert_itupprev_to_page(newpage, compressState))
+				{
+					elog(DEBUG4, "_bt_compress_one_page. failed to insert posting");
+					return;
+				}
+			}
+		}
+
+		/*
+		 * Copy the tuple into temp variable itupprev to compare it with the
+		 * following tuple and maybe unite them into a posting tuple
+		 */
+		if (compressState->itupprev)
+			pfree(compressState->itupprev);
+		compressState->itupprev = CopyIndexTuple(itup);
+
+		Assert(IndexTupleSize(compressState->itupprev) <= compressState->maxitemsize);
+	}
+
+	/* Handle the last item. */
+	if (!insert_itupprev_to_page(newpage, compressState))
+	{
+		elog(DEBUG4, "_bt_compress_one_page. failed to insert posting for last item");
+		return;
+	}
+
+	START_CRIT_SECTION();
+	PageRestoreTempPage(newpage, page);
+	MarkBufferDirty(buffer);
+
+	/* Log full page write */
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+
+		recptr = log_newpage_buffer(buffer, true);
+		PageSetLSN(page, recptr);
+	}
+	END_CRIT_SECTION();
+
+	elog(DEBUG4, "_bt_compress_one_page. success");
+	return;
+}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 50455db..dff506d 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1022,14 +1022,53 @@ _bt_page_recyclable(Page page)
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
 					OffsetNumber *itemnos, int nitems,
+					OffsetNumber *remainingoffset,
+					IndexTuple *remaining, int nremaining,
 					BlockNumber lastBlockVacuumed)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	int			i;
+	Size		itemsz;
+	Size		remaining_sz = 0;
+	char	   *remaining_buf = NULL;
+
+	/* XLOG stuff, buffer for remainings */
+	if (nremaining && RelationNeedsWAL(rel))
+	{
+		Size		offset = 0;
+
+		for (i = 0; i < nremaining; i++)
+			remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+		remaining_buf = palloc0(remaining_sz);
+		for (i = 0; i < nremaining; i++)
+		{
+			itemsz = IndexTupleSize(remaining[i]);
+			memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+			offset += MAXALIGN(itemsz);
+		}
+		Assert(offset == remaining_sz);
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
+	/* Handle posting tuples here */
+	for (i = 0; i < nremaining; i++)
+	{
+		/* At first, delete the old tuple. */
+		PageIndexTupleDelete(page, remainingoffset[i]);
+
+		itemsz = IndexTupleSize(remaining[i]);
+		itemsz = MAXALIGN(itemsz);
+
+		/* Add tuple with remaining ItemPointers to the page. */
+		if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to rewrite compressed item in index while doing vacuum");
+	}
+
 	/* Fix the page */
 	if (nitems > 0)
 		PageIndexMultiDelete(page, itemnos, nitems);
@@ -1059,6 +1098,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+		xlrec_vacuum.nremaining = nremaining;
+		xlrec_vacuum.ndeleted = nitems;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1072,6 +1113,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		if (nitems > 0)
 			XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
 
+		/*
+		 * Here we should save offnums and remaining tuples themselves. It's
+		 * important to restore them in correct order. At first, we must
+		 * handle remaining tuples and only after that other deleted items.
+		 */
+		if (nremaining > 0)
+		{
+			Assert(remaining_buf != NULL);
+			XLogRegisterBufData(0, (char *) remainingoffset,
+								nremaining * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, remaining_buf, remaining_sz);
+		}
+
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
 		PageSetLSN(page, recptr);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd528..11e45c8 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static ItemPointer btreevacuumPosting(BTVacState *vstate, IndexTuple itup,
+									  int *nremaining);
 
 
 /*
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 								 RBM_NORMAL, info->strategy);
 		LockBufferForCleanup(buf);
 		_bt_checkpage(rel, buf);
-		_bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+		_bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+							vstate.lastBlockVacuumed);
 		_bt_relbuf(rel, buf);
 	}
 
@@ -1193,6 +1196,9 @@ restart:
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
+		IndexTuple	remaining[MaxOffsetNumber];
+		OffsetNumber remainingoffset[MaxOffsetNumber];
+		int			nremaining;
 
 		/*
 		 * Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1235,7 @@ restart:
 		 * callback function.
 		 */
 		ndeletable = 0;
+		nremaining = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
 		if (callback)
@@ -1242,31 +1249,78 @@ restart:
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
-				/*
-				 * During Hot Standby we currently assume that
-				 * XLOG_BTREE_VACUUM records do not produce conflicts. That is
-				 * only true as long as the callback function depends only
-				 * upon whether the index tuple refers to heap tuples removed
-				 * in the initial heap scan. When vacuum starts it derives a
-				 * value of OldestXmin. Backends taking later snapshots could
-				 * have a RecentGlobalXmin with a later xid than the vacuum's
-				 * OldestXmin, so it is possible that row versions deleted
-				 * after OldestXmin could be marked as killed by other
-				 * backends. The callback function *could* look at the index
-				 * tuple state in isolation and decide to delete the index
-				 * tuple, though currently it does not. If it ever did, we
-				 * would need to reconsider whether XLOG_BTREE_VACUUM records
-				 * should cause conflicts. If they did cause conflicts they
-				 * would be fairly harsh conflicts, since we haven't yet
-				 * worked out a way to pass a useful value for
-				 * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
-				 * applies to *any* type of index that marks index tuples as
-				 * killed.
-				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				if (BTreeTupleIsPosting(itup))
+				{
+					int			nnewipd = 0;
+					ItemPointer newipd = NULL;
+
+					newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+					if (nnewipd == 0)
+					{
+						/*
+						 * All TIDs from posting list must be deleted, we can
+						 * delete whole tuple in a regular way.
+						 */
+						deletable[ndeletable++] = offnum;
+					}
+					else if (nnewipd == BTreeTupleGetNPosting(itup))
+					{
+						/*
+						 * All TIDs from posting tuple must remain. Do
+						 * nothing, just cleanup.
+						 */
+						pfree(newipd);
+					}
+					else if (nnewipd < BTreeTupleGetNPosting(itup))
+					{
+						/* Some TIDs from posting tuple must remain. */
+						Assert(nnewipd > 0);
+						Assert(newipd != NULL);
+
+						/*
+						 * Form new tuple that contains only remaining TIDs.
+						 * Remember this tuple and the offset of the old tuple
+						 * to update it in place.
+						 */
+						remainingoffset[nremaining] = offnum;
+						remaining[nremaining] = BTreeFormPostingTuple(itup, newipd, nnewipd);
+						nremaining++;
+						pfree(newipd);
+
+						Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+					}
+				}
+				else
+				{
+					htup = &(itup->t_tid);
+
+					/*
+					 * During Hot Standby we currently assume that
+					 * XLOG_BTREE_VACUUM records do not produce conflicts.
+					 * That is only true as long as the callback function
+					 * depends only upon whether the index tuple refers to
+					 * heap tuples removed in the initial heap scan. When
+					 * vacuum starts it derives a value of OldestXmin.
+					 * Backends taking later snapshots could have a
+					 * RecentGlobalXmin with a later xid than the vacuum's
+					 * OldestXmin, so it is possible that row versions deleted
+					 * after OldestXmin could be marked as killed by other
+					 * backends. The callback function *could* look at the
+					 * index tuple state in isolation and decide to delete the
+					 * index tuple, though currently it does not. If it ever
+					 * did, we would need to reconsider whether
+					 * XLOG_BTREE_VACUUM records should cause conflicts. If
+					 * they did cause conflicts they would be fairly harsh
+					 * conflicts, since we haven't yet worked out a way to
+					 * pass a useful value for latestRemovedXid on the
+					 * XLOG_BTREE_VACUUM records. This applies to *any* type
+					 * of index that marks index tuples as killed.
+					 */
+					if (callback(htup, callback_state))
+						deletable[ndeletable++] = offnum;
+				}
 			}
 		}
 
@@ -1274,7 +1328,7 @@ restart:
 		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
 		 * call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nremaining > 0)
 		{
 			/*
 			 * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1345,7 @@ restart:
 			 * that.
 			 */
 			_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+								remainingoffset, remaining, nremaining,
 								vstate->lastBlockVacuumed);
 
 			/*
@@ -1376,6 +1431,42 @@ restart:
 }
 
 /*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+	int			i,
+				remaining = 0;
+	int			nitem = BTreeTupleGetNPosting(itup);
+	ItemPointer tmpitems = NULL,
+				items = BTreeTupleGetPosting(itup);
+
+	/*
+	 * Check each tuple in the posting list, save alive tuples into tmpitems
+	 */
+	for (i = 0; i < nitem; i++)
+	{
+		if (vstate->callback(items + i, vstate->callback_state))
+			continue;
+
+		if (tmpitems == NULL)
+			tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+		tmpitems[remaining++] = items[i];
+	}
+
+	*nremaining = remaining;
+	return tmpitems;
+}
+
+/*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
  * btrees always do, so this is trivial.
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c655dad..49a1aae 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -30,6 +30,9 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static void _bt_savePostingitem(BTScanOpaque so, int itemIndex,
+								OffsetNumber offnum, ItemPointer iptr,
+								IndexTuple itup, int i);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -665,6 +668,9 @@ _bt_compare(Relation rel,
 	 * Use the heap TID attribute and scantid to try to break the tie.  The
 	 * rules are the same as any other key attribute -- only the
 	 * representation differs.
+	 * TODO when itup is a posting tuple, the check becomes more complex.
+	 * we have an option that key nor smaller, nor larger than the tuple,
+	 * but exactly in between of BTreeTupleGetMinTID to BTreeTupleGetMaxTID.
 	 */
 	heapTid = BTreeTupleGetHeapTID(itup);
 	if (key->scantid == NULL)
@@ -1410,6 +1416,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	int			itemIndex;
 	bool		continuescan;
 	int			indnatts;
+	int			i;
 
 	/*
 	 * We must have the buffer pinned and locked, but the usual macro can't be
@@ -1456,6 +1463,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	/* initialize tuple workspace to empty */
 	so->currPos.nextTupleOffset = 0;
+	so->currPos.prevTupleOffset = 0;
 
 	/*
 	 * Now that the current page has been made consistent, the macro should be
@@ -1490,8 +1498,22 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				if (BTreeTupleIsPosting(itup))
+				{
+					for (i = 0; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savePostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											itup, i);
+						itemIndex++;
+					}
+				}
+				else
+				{
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1524,7 +1546,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1532,7 +1554,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxPostingIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1574,8 +1596,22 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (passes_quals && tuple_alive)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				if (BTreeTupleIsPosting(itup))
+				{
+					for (i = 0; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						itemIndex--;
+						_bt_savePostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											itup, i);
+					}
+				}
+				else
+				{
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+
 			}
 			if (!continuescan)
 			{
@@ -1589,8 +1625,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1603,6 +1639,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1615,6 +1653,33 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 	}
 }
 
+/* Save an index item into so->currPos.items[itemIndex] for posting tuples. */
+static void
+_bt_savePostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer iptr, IndexTuple itup, int i)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *iptr;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		if (i == 0)
+		{
+			/* save key. the same for all tuples in the posting */
+			Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+			currItem->tupleOffset = so->currPos.nextTupleOffset;
+			memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+			so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+			so->currPos.prevTupleOffset = currItem->tupleOffset;
+		}
+		else
+			currItem->tupleOffset = so->currPos.prevTupleOffset;
+	}
+}
+
 /*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index d0b9013..955a628 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -65,6 +65,7 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "catalog/catalog.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "miscadmin.h"
@@ -288,6 +289,9 @@ static void _bt_sortaddtup(Page page, Size itemsize,
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 						 IndexTuple itup);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void insert_itupprev_to_page_buildadd(BTWriteState *wstate,
+											 BTPageState *state,
+											 BTCompressState *compressState);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
 static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -972,6 +976,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * only shift the line pointer array back and forth, and overwrite
 			 * the tuple space previously occupied by oitup.  This is fairly
 			 * cheap.
+			 *
+			 * If lastleft tuple was a posting tuple, we'll truncate its
+			 * posting list in _bt_truncate as well. Note that it is also
+			 * applicable only to leaf pages, since internal pages never
+			 * contain posting tuples.
 			 */
 			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
 			lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1011,6 +1020,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		 * the minimum key for the new page.
 		 */
 		state->btps_minkey = CopyIndexTuple(oitup);
+		Assert(!BTreeTupleIsPosting(state->btps_minkey));
 
 		/*
 		 * Set the sibling links for both pages.
@@ -1050,8 +1060,36 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	if (last_off == P_HIKEY)
 	{
 		Assert(state->btps_minkey == NULL);
-		state->btps_minkey = CopyIndexTuple(itup);
-		/* _bt_sortaddtup() will perform full truncation later */
+
+		/*
+		 * Stashed copy must be a non-posting tuple, with truncated posting
+		 * list and correct t_tid since we're going to use it to build
+		 * downlink.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			Size		keytupsz;
+			IndexTuple	keytup;
+
+			/*
+			 * Form key tuple, that doesn't contain any ipd. NOTE: since we'll
+			 * need TID later, set t_tid to the first t_tid from posting list.
+			 */
+			keytupsz = BTreeTupleGetPostingOffset(itup);
+			keytup = palloc0(keytupsz);
+			memcpy(keytup, itup, keytupsz);
+
+			keytup->t_info &= ~INDEX_SIZE_MASK;
+			keytup->t_info |= keytupsz;
+			ItemPointerCopy(BTreeTupleGetPosting(itup), &keytup->t_tid);
+			state->btps_minkey = CopyIndexTuple(keytup);
+			pfree(keytup);
+		}
+		else
+			state->btps_minkey = CopyIndexTuple(itup);	/* _bt_sortaddtup() will
+														 * perform full
+														 * truncation later */
+
 		BTreeTupleSetNAtts(state->btps_minkey, 0);
 	}
 
@@ -1137,6 +1175,89 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 }
 
 /*
+ * Add new tuple (posting or non-posting) to the page, while building index.
+ */
+void
+insert_itupprev_to_page_buildadd(BTWriteState *wstate, BTPageState *state,
+								 BTCompressState *compressState)
+{
+	IndexTuple	to_insert;
+
+	/* Return, if there is no tuple to insert */
+	if (state == NULL)
+		return;
+
+	if (compressState->ntuples == 0)
+		to_insert = compressState->itupprev;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+											 compressState->ipd,
+											 compressState->ntuples);
+		to_insert = postingtuple;
+		pfree(compressState->ipd);
+	}
+
+	_bt_buildadd(wstate, state, to_insert);
+
+	if (compressState->ntuples > 0)
+		pfree(to_insert);
+	compressState->ntuples = 0;
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in compressState.
+ * Helper function for bt_load() and _bt_compress_one_page().
+ *
+ * Note: caller is responsible for size check to ensure that
+ * resulting tuple won't exceed BTMaxItemSize.
+ */
+void
+add_item_to_posting(BTCompressState *compressState, IndexTuple itup)
+{
+	int			nposting = 0;
+
+	if (compressState->ntuples == 0)
+	{
+		compressState->ipd = palloc0(compressState->maxitemsize);
+
+		if (BTreeTupleIsPosting(compressState->itupprev))
+		{
+			/* if itupprev is posting, add all its TIDs to the posting list */
+			nposting = BTreeTupleGetNPosting(compressState->itupprev);
+			memcpy(compressState->ipd, BTreeTupleGetPosting(compressState->itupprev),
+				   sizeof(ItemPointerData) * nposting);
+			compressState->ntuples += nposting;
+		}
+		else
+		{
+			memcpy(compressState->ipd, compressState->itupprev,
+				   sizeof(ItemPointerData));
+			compressState->ntuples++;
+		}
+	}
+
+	if (BTreeTupleIsPosting(itup))
+	{
+		/* if tuple is posting, add all its TIDs to the posting list */
+		nposting = BTreeTupleGetNPosting(itup);
+		memcpy(compressState->ipd + compressState->ntuples,
+			   BTreeTupleGetPosting(itup),
+			   sizeof(ItemPointerData) * nposting);
+		compressState->ntuples += nposting;
+	}
+	else
+	{
+		memcpy(compressState->ipd + compressState->ntuples, itup,
+			   sizeof(ItemPointerData));
+		compressState->ntuples++;
+	}
+}
+
+/*
  * Read tuples in correct sort order from tuplesort, and load them into
  * btree leaves.
  */
@@ -1150,9 +1271,21 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	bool		load1;
 	TupleDesc	tupdes = RelationGetDescr(wstate->index);
 	int			i,
-				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+				natts = IndexRelationGetNumberOfAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		use_compression = false;
+	BTCompressState *compressState = NULL;
+
+	/*
+	 * Don't use compression for indexes with INCLUDEd columns, system indexes
+	 * and unique indexes.
+	 */
+	use_compression = ((IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+						IndexRelationGetNumberOfAttributes(wstate->index))
+					   && (!IsSystemRelation(wstate->index))
+					   && (!wstate->index->rd_index->indisunique));
 
 	if (merge)
 	{
@@ -1266,19 +1399,88 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	}
 	else
 	{
-		/* merge is unnecessary */
-		while ((itup = tuplesort_getindextuple(btspool->sortstate,
-											   true)) != NULL)
+		if (!use_compression)
 		{
-			/* When we see first tuple, create first index page */
-			if (state == NULL)
-				state = _bt_pagestate(wstate, 0);
+			/* merge is unnecessary */
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+					state = _bt_pagestate(wstate, 0);
 
-			_bt_buildadd(wstate, state, itup);
+				_bt_buildadd(wstate, state, itup);
 
-			/* Report progress */
-			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
-										 ++tuples_done);
+				/* Report progress */
+				pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+											 ++tuples_done);
+			}
+		}
+		else
+		{
+			/* init compress state needed to build posting tuples */
+			compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+			compressState->ipd = NULL;
+			compressState->ntuples = 0;
+			compressState->itupprev = NULL;
+			compressState->maxitemsize = 0;
+			compressState->maxpostingsize = 0;
+
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+				{
+					state = _bt_pagestate(wstate, 0);
+					compressState->maxitemsize = BTMaxItemSize(state->btps_page);
+				}
+
+				if (compressState->itupprev != NULL)
+				{
+					int			n_equal_atts = _bt_keep_natts_fast(wstate->index,
+																   compressState->itupprev, itup);
+
+					if (n_equal_atts > natts)
+					{
+						/*
+						 * Tuples are equal. Create or update posting.
+						 *
+						 * Else If posting is too big, insert it on page and
+						 * continue.
+						 */
+						if ((compressState->ntuples + 1) * sizeof(ItemPointerData) <
+							compressState->maxpostingsize)
+							add_item_to_posting(compressState, itup);
+						else
+							insert_itupprev_to_page_buildadd(wstate, state, compressState);
+					}
+					else
+					{
+						/*
+						 * Tuples are not equal. Insert itupprev into index.
+						 * Save current tuple for the next iteration.
+						 */
+						insert_itupprev_to_page_buildadd(wstate, state, compressState);
+					}
+				}
+
+				/*
+				 * Save the tuple to compare it with the next one and maybe
+				 * unite them into a posting tuple.
+				 */
+				if (compressState->itupprev)
+					pfree(compressState->itupprev);
+				compressState->itupprev = CopyIndexTuple(itup);
+
+				/* compute max size of posting list */
+				compressState->maxpostingsize = compressState->maxitemsize -
+					IndexInfoFindDataOffset(compressState->itupprev->t_info) -
+					MAXALIGN(IndexTupleSize(compressState->itupprev));
+			}
+
+			/* Handle the last item */
+			insert_itupprev_to_page_buildadd(wstate, state, compressState);
 		}
 	}
 
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 93fab26..0da6fa8 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1787,7 +1787,9 @@ _bt_killitems(IndexScanDesc scan)
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			/* No microvacuum for posting tuples */
+			if (!BTreeTupleIsPosting(ituple) &&
+				(ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
 			{
 				/* found the item */
 				ItemIdMarkDead(iid);
@@ -2145,6 +2147,16 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(firstright))
+		{
+			BTreeTupleClearBtIsPosting(pivot);
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			pivot->t_info &= ~INDEX_SIZE_MASK;
+			pivot->t_info |= BTreeTupleGetPostingOffset(firstright);
+		}
+
+		Assert(!BTreeTupleIsPosting(pivot));
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2168,6 +2180,27 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		pfree(pivot);
 		pivot = tidpivot;
 	}
+	else if (BTreeTupleIsPosting(firstright))
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal. But
+		 * the tuple is a compressed tuple with a posting list, so we still
+		 * must truncate it.
+		 *
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		newsize = BTreeTupleGetPostingOffset(firstright) +
+			MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+		pivot->t_info &= ~INDEX_SIZE_MASK;
+		pivot->t_info |= newsize;
+		BTreeTupleClearBtIsPosting(pivot);
+		BTreeTupleSetAltHeapTID(pivot);
+
+		Assert(!BTreeTupleIsPosting(pivot));
+	}
 	else
 	{
 		/*
@@ -2205,7 +2238,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2216,9 +2249,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+							  BTreeTupleGetMinTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetMinTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetMinTID(firstright)) < 0);
 #else
 
 	/*
@@ -2231,7 +2267,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMinTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2240,7 +2276,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetMinTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2330,6 +2367,10 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * leaving excessive amounts of free space on either side of page split.
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts.
+ *
+ * To build a posting tuple we need to ensure that all attributes
+ * of both tuples are equal. Use this function to compare them.
+ * TODO: maybe it's worth to rename the function.
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2415,7 +2456,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * Non-pivot tuples currently never use alternative heap TID
 			 * representation -- even those within heapkeyspace indexes
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
@@ -2470,7 +2511,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * that to decide if the tuple is a pre-v11 tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
+				(!BTreeTupleIsPivot(itup) &&
 				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
 		}
 		else
@@ -2497,7 +2538,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
 		return false;
 
 	/*
@@ -2549,6 +2590,8 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 	if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
 		return;
 
+	/* TODO correct error messages for posting tuples */
+
 	/*
 	 * Internal page insertions cannot fail here, because that would mean that
 	 * an earlier leaf level insertion that should have failed didn't
@@ -2575,3 +2618,79 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Given a basic tuple that contains key datum and posting list,
+ * build a posting tuple.
+ *
+ * Basic tuple can be a posting tuple, but we only use key part of it,
+ * all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1 fallback to building a non-posting tuple.
+ * It is necessary to avoid storage overhead after posting tuple was vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+	uint32		keysize,
+				newsize = 0;
+	IndexTuple	itup;
+
+	/* We only need key part of the tuple */
+	if (BTreeTupleIsPosting(tuple))
+		keysize = BTreeTupleGetPostingOffset(tuple);
+	else
+		keysize = IndexTupleSize(tuple);
+
+	Assert(nipd > 0);
+
+	/* Add space needed for posting list */
+	if (nipd > 1)
+		newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+	else
+		newsize = keysize;
+
+	newsize = MAXALIGN(newsize);
+	itup = palloc0(newsize);
+	memcpy(itup, tuple, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+
+	if (nipd > 1)
+	{
+		/* Form posting tuple, fill posting fields */
+
+		/* Set meta info about the posting list */
+		itup->t_info |= INDEX_ALT_TID_MASK;
+		BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+		/* sort the list to preserve TID order invariant */
+		qsort((void *) ipd, nipd, sizeof(ItemPointerData),
+		  (int (*) (const void *, const void *)) ItemPointerCompare);
+
+		/* Copy posting list into the posting tuple */
+		memcpy(BTreeTupleGetPosting(itup), ipd,
+			   sizeof(ItemPointerData) * nipd);
+	}
+	else
+	{
+		/* To finish building of a non-posting tuple, copy TID from ipd */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		ItemPointerCopy(ipd, &itup->t_tid);
+	}
+
+	return itup;
+}
+
+/*
+ * Opposite of BTreeFormPostingTuple.
+ * returns regular tuple that contains the key,
+ * the tid of the new tuple is the nth tid of original tuple's posting list
+ * result tuple palloc'd in a caller's context.
+ */
+IndexTuple
+BTreeGetNthTupleOfPosting(IndexTuple tuple, int n)
+{
+	Assert(BTreeTupleIsPosting(tuple));
+	return BTreeFormPostingTuple(tuple, BTreeTupleGetPostingN(tuple, n), 1);
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 3147ea4..7daadc9 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -384,8 +384,8 @@ btree_xlog_vacuum(XLogReaderState *record)
 	Buffer		buffer;
 	Page		page;
 	BTPageOpaque opaque;
-#ifdef UNUSED
 	xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
 
 	/*
 	 * This section of code is thought to be no longer needed, after analysis
@@ -476,14 +476,35 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		if (len > 0)
 		{
-			OffsetNumber *unused;
-			OffsetNumber *unend;
+			if (xlrec->nremaining)
+			{
+				int			i;
+				OffsetNumber *remainingoffset;
+				IndexTuple	remaining;
+				Size		itemsz;
+
+				remainingoffset = (OffsetNumber *)
+					(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+				remaining = (IndexTuple) ((char *) remainingoffset +
+										  xlrec->nremaining * sizeof(OffsetNumber));
 
-			unused = (OffsetNumber *) ptr;
-			unend = (OffsetNumber *) ((char *) ptr + len);
+				/* Handle posting tuples */
+				for (i = 0; i < xlrec->nremaining; i++)
+				{
+					PageIndexTupleDelete(page, remainingoffset[i]);
+
+					itemsz = MAXALIGN(IndexTupleSize(remaining));
+
+					if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+									false, false) == InvalidOffsetNumber)
+						elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+					remaining = (IndexTuple) ((char *) remaining + itemsz);
+				}
+			}
 
-			if ((unend - unused) > 0)
-				PageIndexMultiDelete(page, unused, unend - unused);
+			if (xlrec->ndeleted)
+				PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 		}
 
 		/*
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 744ffb6..85ee040 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -141,6 +141,11 @@ typedef IndexAttributeBitMapData * IndexAttributeBitMap;
  * On such a page, N tuples could take one MAXALIGN quantum less space than
  * estimated here, seemingly allowing one more tuple than estimated here.
  * But such a page always has at least MAXALIGN special space, so we're safe.
+ *
+ * Note: btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so they may contain more tuples.
+ * Use MaxPostingIndexTuplesPerPage instead.
+
  */
 #define MaxIndexTuplesPerPage	\
 	((int) ((BLCKSZ - SizeOfPageHeaderData) / \
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index a3583f2..7d0d456 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
  * All other types of index tuples ("pivot" tuples) only have key columns,
  * since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,39 @@ typedef struct BTMetaPageData
  * omitted rather than truncated, since its representation is different to
  * the non-pivot representation.)
  *
+ * Non-pivot posting tuple format:
+ *  t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicated keys more effectively,
+ * we use special format of tuples - posting tuples.
+ * posting_list is an array of ItemPointerData.
+ *
+ * This type of compression never applies to system indexes, unique indexes
+ * or indexes with INCLUDEd columns.
+ *
+ * To differ posting tuples we use INDEX_ALT_TID_MASK flag in t_info and
+ * BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's tid:
+ * - t_tid.ip_blkid contains offset of the posting list.
+ * - t_tid offset field contains number of posting items this tuple contain
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * tuples, which is constrainted by BTMaxItemSize.
+
+ * If page contains so many duplicates, that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), page may contain several posting
+ * tuples with the same key.
+ * Also page can contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
  * Pivot tuple format:
  *
  *  t_tid | t_info | key values | [heap TID]
@@ -281,23 +313,157 @@ typedef struct BTMetaPageData
  * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
  * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
  * number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinct posting tuples from pivot tuples.
  *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_N_POSTING_OFFSET_MASK	0x0FFF
 #define BT_HEAP_TID_ATTR			0x1000
+#define BT_IS_POSTING				0x2000
+
+#define BTreeTupleIsPosting(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+	)
+
+#define BTreeTupleIsPivot(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+	)
+
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger then
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+	((int) ((BLCKSZ - SizeOfPageHeaderData - \
+			3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+			(sizeof(ItemPointerData)))
 
-/* Get/set downlink block number */
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * Iterating over tuples during index build or applying compression to a
+ * single page, we remember a tuple in itupprev, then compare the next one
+ * with it. If tuples are equal, save their TIDs in the posting list.
+ * ntuples contains the size of the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTCompressState
+{
+	Size		maxitemsize;
+	Size		maxpostingsize;
+	IndexTuple	itupprev;
+	int			ntuples;
+	ItemPointerData *ipd;
+} BTCompressState;
+
+/*
+ * For use in _bt_compress_one_page().
+ * If there is only a few uncompressed items on a page,
+ * it isn't worth to apply compression.
+ * Currently it is just a magic number,
+ * proper benchmarking will probably help to choose better value.
+ */
+#define BT_COMPRESS_THRESHOLD 10
+
+/* macros to work with posting tuples *BEGIN* */
+#define BTreeTupleSetBtIsPosting(itup) \
+	do { \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleGetNPosting(itup)	\
+	( \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+	)
+
+#define BTreeTupleSetNPosting(itup, n) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+		BTreeTupleSetBtIsPosting(itup); \
+	} while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list.
+ * Caller is responsible for checking BTreeTupleIsPosting to ensure that
+ * it will get what he expects
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
+#define BTreeTupleSetPostingOffset(itup, offset) \
+	ItemPointerSetBlockNumber(&((itup)->t_tid), (offset))
+
+#define BTreeSetPostingMeta(itup, nposting, off) \
+	do { \
+		BTreeTupleSetNPosting(itup, nposting); \
+		BTreeTupleSetPostingOffset(itup, off); \
+	} while(0)
+
+#define BTreeTupleGetPosting(itup) \
+	(ItemPointerData*) ((char*)(itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+	(ItemPointerData*) (BTreeTupleGetPosting(itup) + (n))
+
+/*
+ * Posting tuples always contain several TIDs.
+ * Some functions that use TID as a tiebreaker,
+ * to ensure correct order of TID keys they can use two macros below:
+ */
+#define BTreeTupleGetMinTID(itup) \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+		( \
+			(ItemPointer) BTreeTupleGetPosting(itup) \
+		) \
+		: \
+		(ItemPointer) &((itup)->t_tid) \
+	)
+#define BTreeTupleGetMaxTID(itup) \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+		( \
+			(ItemPointer) (BTreeTupleGetPosting(itup) + (BTreeTupleGetNPosting(itup)-1)) \
+		) \
+		: \
+		(ItemPointer) &((itup)->t_tid) \
+	)
+/* macros to work with posting tuples *END* */
+
+/* Get/set downlink block number  */
 #define BTreeInnerTupleGetDownLink(itup) \
 	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
 #define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,7 +492,8 @@ typedef struct BTMetaPageData
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
 		( \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
@@ -335,6 +502,7 @@ typedef struct BTMetaPageData
 	)
 #define BTreeTupleSetNAtts(itup, n) \
 	do { \
+		Assert(!BTreeTupleIsPosting(itup)); \
 		(itup)->t_info |= INDEX_ALT_TID_MASK; \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
 	} while(0)
@@ -342,6 +510,8 @@ typedef struct BTMetaPageData
 /*
  * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
  * and non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * For non-pivot posting tuple it returns the first tid from posting list.
  */
 #define BTreeTupleGetHeapTID(itup) \
 	( \
@@ -351,7 +521,10 @@ typedef struct BTMetaPageData
 		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
 					   sizeof(ItemPointerData)) \
 	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+	  : (itup)->t_info & INDEX_ALT_TID_MASK ? \
+		(((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0) ? \
+		(ItemPointer) BTreeTupleGetPosting(itup) : NULL) \
+		: (ItemPointer) &((itup)->t_tid) \
 	)
 /*
  * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
@@ -360,6 +533,7 @@ typedef struct BTMetaPageData
 #define BTreeTupleSetAltHeapTID(itup) \
 	do { \
 		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
 								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
 	} while(0)
@@ -567,6 +741,8 @@ typedef struct BTScanPosData
 	 * location in the associated tuple storage workspace.
 	 */
 	int			nextTupleOffset;
+	/* prevTupleOffset is for posting list handling */
+	int			prevTupleOffset;
 
 	/*
 	 * The items array is always ordered in index order (ie, increasing
@@ -579,7 +755,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxPostingIndexTuplesPerPage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -763,6 +939,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems, Relation heapRel);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems,
+								OffsetNumber *remainingoffset,
+								IndexTuple *remaining, int nremaining,
 								BlockNumber lastBlockVacuumed);
 extern int	_bt_pagedel(Relation rel, Buffer buf);
 
@@ -813,6 +991,9 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd,
+										int nipd);
+extern IndexTuple BTreeGetNthTupleOfPosting(IndexTuple tuple, int n);
 
 /*
  * prototypes for functions in nbtvalidate.c
@@ -825,5 +1006,7 @@ extern bool btvalidate(Oid opclassoid);
 extern IndexBuildResult *btbuild(Relation heap, Relation index,
 								 struct IndexInfo *indexInfo);
 extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+extern void add_item_to_posting(BTCompressState *compressState,
+								IndexTuple itup);
 
 #endif							/* NBTREE_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 9beccc8..6f60ca5 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -173,10 +173,19 @@ typedef struct xl_btree_vacuum
 {
 	BlockNumber lastBlockVacuumed;
 
-	/* TARGET OFFSET NUMBERS FOLLOW */
+	/*
+	 * This field helps us to find beginning of the remaining tuples from
+	 * postings which follow array of offset numbers.
+	 */
+	uint32		nremaining;
+	uint32		ndeleted;
+
+	/* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+	/* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
+	/* TARGET OFFSET NUMBERS FOLLOW (if any) */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(BlockNumber))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.

#57

Anastasia Lubennikova

a.lubennikova@postgrespro.ru

over 6 years ago

In reply to: Anastasia Lubennikova (#56)

2 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

17.07.2019 19:36, Anastasia Lubennikova:

There is one major issue left - preserving TID order in posting lists.
For a start, I added a sort into BTreeFormPostingTuple function.
It turned out to be not very helpful, because we cannot check this
invariant lazily.

Now I work on patching _bt_binsrch_insert() and _bt_insertonpg() to
implement
insertion into the middle of the posting list. I'll send a new version
this week.

Patch 0002 (must be applied on top of 0001) implements preserving of
correct TID order
inside posting list when inserting new tuples.
This version passes all regression tests including amcheck test.
I also used following script to test insertion into the posting list:

set client_min_messages to debug4;
drop table tbl;
create table tbl (i1 int, i2 int);
insert into tbl select 1, i from generate_series(0,1000) as i;
insert into tbl select 1, i from generate_series(0,1000) as i;
create index idx on tbl (i1);
delete from tbl where i2 <500;
vacuum tbl ;
insert into tbl select 1, i from generate_series(1001, 1500) as i;

The last insert triggers several insertions that can be seen in debug
messages.
I suppose it is not the final version of the patch yet,
so I left some debug messages and TODO comments to ease review.

Please, in your review, pay particular attention to usage of
BTreeTupleGetHeapTID.
For posting tuples it returns the first tid from posting list like
BTreeTupleGetMinTID,
but maybe some callers are not ready for that and want
BTreeTupleGetMaxTID instead.
Incorrect usage of these macros may cause some subtle bugs,
which are probably not covered by tests. So, please double-check it.

Next week I'm going to check performance and try to find specific
scenarios where this
feature can lead to degradation and measure it, to understand if we need
to make this deduplication optional.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

0001-btree_compression_pg12_v2.patchtext/x-patch; name=0001-btree_compression_pg12_v2.patchDownload

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 9126c18..2b05b1e 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -1033,12 +1033,34 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				IndexTuple	onetup;
+				int i;
+
+				/* Fingerprint all elements of posting tuple one by one */
+				for (i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+					norm = bt_normalize_tuple(state, onetup);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != onetup)
+						pfree(norm);
+					pfree(onetup);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 602f884..26ddf32 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -20,6 +20,7 @@
 #include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xloginsert.h"
+#include "catalog/catalog.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
@@ -56,6 +57,8 @@ static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 						 OffsetNumber itup_off);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static bool insert_itupprev_to_page(Page page, BTCompressState *compressState);
+static void _bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel);
 
 /*
  *	_bt_doinsert() -- Handle insertion of a single index tuple in the tree.
@@ -759,6 +762,12 @@ _bt_findinsertloc(Relation rel,
 			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
 			insertstate->bounds_valid = false;
 		}
+
+		/*
+		 * If the target page is full, try to compress the page
+		 */
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
+			_bt_compress_one_page(rel, insertstate->buf, heapRel);
 	}
 	else
 	{
@@ -806,6 +815,11 @@ _bt_findinsertloc(Relation rel,
 			}
 
 			/*
+			 * Before considering moving right, try to compress the page
+			 */
+			_bt_compress_one_page(rel, insertstate->buf, heapRel);
+
+			/*
 			 * Nope, so check conditions (b) and (c) enumerated above
 			 *
 			 * The earlier _bt_check_unique() call may well have established a
@@ -2286,3 +2300,241 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * the page.
 	 */
 }
+
+/*
+ * Add new item (compressed or not) to the page, while compressing it.
+ * If insertion failed, return false.
+ * Caller should consider this as compression failure and
+ * leave page uncompressed.
+ */
+static bool
+insert_itupprev_to_page(Page page, BTCompressState *compressState)
+{
+	IndexTuple	to_insert;
+	OffsetNumber offnum = PageGetMaxOffsetNumber(page);
+
+	if (compressState->ntuples == 0)
+		to_insert = compressState->itupprev;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+											 compressState->ipd,
+											 compressState->ntuples);
+		to_insert = postingtuple;
+		pfree(compressState->ipd);
+	}
+
+	/* Add the new item into the page */
+	offnum = OffsetNumberNext(offnum);
+
+	elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d IndexTupleSize %zu free %zu",
+		 compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+
+	if (PageAddItem(page, (Item) to_insert, IndexTupleSize(to_insert),
+					offnum, false, false) == InvalidOffsetNumber)
+	{
+		elog(DEBUG4, "insert_itupprev_to_page. failed");
+
+		/*
+		 * this may happen if tuple is bigger than freespace fallback to
+		 * uncompressed page case
+		 */
+		if (compressState->ntuples > 0)
+			pfree(to_insert);
+		return false;
+	}
+
+	if (compressState->ntuples > 0)
+		pfree(to_insert);
+	compressState->ntuples = 0;
+	return true;
+}
+
+/*
+ * Before splitting the page, try to compress items to free some space.
+ * If compression didn't succeed, buffer will contain old state of the page.
+ * This function should be called after lp_dead items
+ * were removed by _bt_vacuum_one_page().
+ */
+static void
+_bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buffer);
+	Page		newpage;
+	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	bool		use_compression = false;
+	BTCompressState *compressState = NULL;
+	int			n_posting_on_page = 0;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+
+	/*
+	 * Don't use compression for indexes with INCLUDEd columns, system indexes
+	 * and unique indexes.
+	 */
+	use_compression = ((IndexRelationGetNumberOfKeyAttributes(rel) ==
+						IndexRelationGetNumberOfAttributes(rel))
+					   && (!IsSystemRelation(rel))
+					   && (!rel->rd_index->indisunique));
+	if (!use_compression)
+		return;
+
+	/* init compress state needed to build posting tuples */
+	compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+	compressState->ipd = NULL;
+	compressState->ntuples = 0;
+	compressState->itupprev = NULL;
+	compressState->maxitemsize = BTMaxItemSize(page);
+	compressState->maxpostingsize = 0;
+
+	/*
+	 * Scan over all items to see which ones can be compressed
+	 */
+	minoff = P_FIRSTDATAKEY(opaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/*
+	 * Heuristic to avoid trying to compress page that has already contain
+	 * mostly compressed items
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, P_HIKEY);
+		IndexTuple	item = (IndexTuple) PageGetItem(page, itemid);
+
+		if (BTreeTupleIsPosting(item))
+			n_posting_on_page++;
+	}
+
+	/*
+	 * If we have only a few uncompressed items on the full page,
+	 * it isn't worth to compress them
+	 */
+	if (maxoff - n_posting_on_page < BT_COMPRESS_THRESHOLD)
+		return;
+
+	newpage = PageGetTempPageCopySpecial(page);
+	elog(DEBUG4, "_bt_compress_one_page rel: %s,blkno: %u",
+		 RelationGetRelationName(rel), BufferGetBlockNumber(buffer));
+
+	/* Copy High Key if any */
+	if (!P_RIGHTMOST(opaque))
+	{
+		ItemId		itemid = PageGetItemId(page, P_HIKEY);
+		Size		itemsz = ItemIdGetLength(itemid);
+		IndexTuple	item = (IndexTuple) PageGetItem(page, itemid);
+
+		if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+						false, false) == InvalidOffsetNumber)
+		{
+			/*
+			 * Should never happen. Anyway, fallback gently to scenario of
+			 * incompressible page and just return from function.
+			 */
+			elog(DEBUG4, "_bt_compress_one_page. failed to insert highkey to newpage");
+			return;
+		}
+	}
+
+	/*
+	 * Iterate over tuples on the page, try to compress them into posting
+	 * lists and insert into new page.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemId = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemId);
+
+		/*
+		 * We do not expect to meet any DEAD items, since this function is
+		 * called right after _bt_vacuum_one_page(). If for some reason we
+		 * found dead item, don't compress it, to allow upcoming microvacuum
+		 * or vacuum clean it up.
+		 */
+		if (ItemIdIsDead(itemId))
+			continue;
+
+		if (compressState->itupprev != NULL)
+		{
+			int			n_equal_atts =
+			_bt_keep_natts_fast(rel, compressState->itupprev, itup);
+			int			itup_ntuples = BTreeTupleIsPosting(itup) ?
+			BTreeTupleGetNPosting(itup) : 1;
+
+			if (n_equal_atts > natts)
+			{
+				/*
+				 * When tuples are equal, create or update posting.
+				 *
+				 * If posting is too big, insert it on page and continue.
+				 */
+				if (compressState->maxitemsize >
+					MAXALIGN(((IndexTupleSize(compressState->itupprev)
+							   + (compressState->ntuples + itup_ntuples + 1) * sizeof(ItemPointerData)))))
+				{
+					add_item_to_posting(compressState, itup);
+				}
+				else if (!insert_itupprev_to_page(newpage, compressState))
+				{
+					elog(DEBUG4, "_bt_compress_one_page. failed to insert posting");
+					return;
+				}
+			}
+			else
+			{
+				/*
+				 * Tuples are not equal. Insert itupprev into index. Save
+				 * current tuple for the next iteration.
+				 */
+				if (!insert_itupprev_to_page(newpage, compressState))
+				{
+					elog(DEBUG4, "_bt_compress_one_page. failed to insert posting");
+					return;
+				}
+			}
+		}
+
+		/*
+		 * Copy the tuple into temp variable itupprev to compare it with the
+		 * following tuple and maybe unite them into a posting tuple
+		 */
+		if (compressState->itupprev)
+			pfree(compressState->itupprev);
+		compressState->itupprev = CopyIndexTuple(itup);
+
+		Assert(IndexTupleSize(compressState->itupprev) <= compressState->maxitemsize);
+	}
+
+	/* Handle the last item. */
+	if (!insert_itupprev_to_page(newpage, compressState))
+	{
+		elog(DEBUG4, "_bt_compress_one_page. failed to insert posting for last item");
+		return;
+	}
+
+	START_CRIT_SECTION();
+	PageRestoreTempPage(newpage, page);
+	MarkBufferDirty(buffer);
+
+	/* Log full page write */
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+
+		recptr = log_newpage_buffer(buffer, true);
+		PageSetLSN(page, recptr);
+	}
+	END_CRIT_SECTION();
+
+	elog(DEBUG4, "_bt_compress_one_page. success");
+	return;
+}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 50455db..dff506d 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1022,14 +1022,53 @@ _bt_page_recyclable(Page page)
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
 					OffsetNumber *itemnos, int nitems,
+					OffsetNumber *remainingoffset,
+					IndexTuple *remaining, int nremaining,
 					BlockNumber lastBlockVacuumed)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	int			i;
+	Size		itemsz;
+	Size		remaining_sz = 0;
+	char	   *remaining_buf = NULL;
+
+	/* XLOG stuff, buffer for remainings */
+	if (nremaining && RelationNeedsWAL(rel))
+	{
+		Size		offset = 0;
+
+		for (i = 0; i < nremaining; i++)
+			remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+		remaining_buf = palloc0(remaining_sz);
+		for (i = 0; i < nremaining; i++)
+		{
+			itemsz = IndexTupleSize(remaining[i]);
+			memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+			offset += MAXALIGN(itemsz);
+		}
+		Assert(offset == remaining_sz);
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
+	/* Handle posting tuples here */
+	for (i = 0; i < nremaining; i++)
+	{
+		/* At first, delete the old tuple. */
+		PageIndexTupleDelete(page, remainingoffset[i]);
+
+		itemsz = IndexTupleSize(remaining[i]);
+		itemsz = MAXALIGN(itemsz);
+
+		/* Add tuple with remaining ItemPointers to the page. */
+		if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to rewrite compressed item in index while doing vacuum");
+	}
+
 	/* Fix the page */
 	if (nitems > 0)
 		PageIndexMultiDelete(page, itemnos, nitems);
@@ -1059,6 +1098,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+		xlrec_vacuum.nremaining = nremaining;
+		xlrec_vacuum.ndeleted = nitems;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1072,6 +1113,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		if (nitems > 0)
 			XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
 
+		/*
+		 * Here we should save offnums and remaining tuples themselves. It's
+		 * important to restore them in correct order. At first, we must
+		 * handle remaining tuples and only after that other deleted items.
+		 */
+		if (nremaining > 0)
+		{
+			Assert(remaining_buf != NULL);
+			XLogRegisterBufData(0, (char *) remainingoffset,
+								nremaining * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, remaining_buf, remaining_sz);
+		}
+
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
 		PageSetLSN(page, recptr);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd528..11e45c8 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static ItemPointer btreevacuumPosting(BTVacState *vstate, IndexTuple itup,
+									  int *nremaining);
 
 
 /*
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 								 RBM_NORMAL, info->strategy);
 		LockBufferForCleanup(buf);
 		_bt_checkpage(rel, buf);
-		_bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+		_bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+							vstate.lastBlockVacuumed);
 		_bt_relbuf(rel, buf);
 	}
 
@@ -1193,6 +1196,9 @@ restart:
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
+		IndexTuple	remaining[MaxOffsetNumber];
+		OffsetNumber remainingoffset[MaxOffsetNumber];
+		int			nremaining;
 
 		/*
 		 * Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1235,7 @@ restart:
 		 * callback function.
 		 */
 		ndeletable = 0;
+		nremaining = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
 		if (callback)
@@ -1242,31 +1249,78 @@ restart:
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
-				/*
-				 * During Hot Standby we currently assume that
-				 * XLOG_BTREE_VACUUM records do not produce conflicts. That is
-				 * only true as long as the callback function depends only
-				 * upon whether the index tuple refers to heap tuples removed
-				 * in the initial heap scan. When vacuum starts it derives a
-				 * value of OldestXmin. Backends taking later snapshots could
-				 * have a RecentGlobalXmin with a later xid than the vacuum's
-				 * OldestXmin, so it is possible that row versions deleted
-				 * after OldestXmin could be marked as killed by other
-				 * backends. The callback function *could* look at the index
-				 * tuple state in isolation and decide to delete the index
-				 * tuple, though currently it does not. If it ever did, we
-				 * would need to reconsider whether XLOG_BTREE_VACUUM records
-				 * should cause conflicts. If they did cause conflicts they
-				 * would be fairly harsh conflicts, since we haven't yet
-				 * worked out a way to pass a useful value for
-				 * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
-				 * applies to *any* type of index that marks index tuples as
-				 * killed.
-				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				if (BTreeTupleIsPosting(itup))
+				{
+					int			nnewipd = 0;
+					ItemPointer newipd = NULL;
+
+					newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+					if (nnewipd == 0)
+					{
+						/*
+						 * All TIDs from posting list must be deleted, we can
+						 * delete whole tuple in a regular way.
+						 */
+						deletable[ndeletable++] = offnum;
+					}
+					else if (nnewipd == BTreeTupleGetNPosting(itup))
+					{
+						/*
+						 * All TIDs from posting tuple must remain. Do
+						 * nothing, just cleanup.
+						 */
+						pfree(newipd);
+					}
+					else if (nnewipd < BTreeTupleGetNPosting(itup))
+					{
+						/* Some TIDs from posting tuple must remain. */
+						Assert(nnewipd > 0);
+						Assert(newipd != NULL);
+
+						/*
+						 * Form new tuple that contains only remaining TIDs.
+						 * Remember this tuple and the offset of the old tuple
+						 * to update it in place.
+						 */
+						remainingoffset[nremaining] = offnum;
+						remaining[nremaining] = BTreeFormPostingTuple(itup, newipd, nnewipd);
+						nremaining++;
+						pfree(newipd);
+
+						Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+					}
+				}
+				else
+				{
+					htup = &(itup->t_tid);
+
+					/*
+					 * During Hot Standby we currently assume that
+					 * XLOG_BTREE_VACUUM records do not produce conflicts.
+					 * That is only true as long as the callback function
+					 * depends only upon whether the index tuple refers to
+					 * heap tuples removed in the initial heap scan. When
+					 * vacuum starts it derives a value of OldestXmin.
+					 * Backends taking later snapshots could have a
+					 * RecentGlobalXmin with a later xid than the vacuum's
+					 * OldestXmin, so it is possible that row versions deleted
+					 * after OldestXmin could be marked as killed by other
+					 * backends. The callback function *could* look at the
+					 * index tuple state in isolation and decide to delete the
+					 * index tuple, though currently it does not. If it ever
+					 * did, we would need to reconsider whether
+					 * XLOG_BTREE_VACUUM records should cause conflicts. If
+					 * they did cause conflicts they would be fairly harsh
+					 * conflicts, since we haven't yet worked out a way to
+					 * pass a useful value for latestRemovedXid on the
+					 * XLOG_BTREE_VACUUM records. This applies to *any* type
+					 * of index that marks index tuples as killed.
+					 */
+					if (callback(htup, callback_state))
+						deletable[ndeletable++] = offnum;
+				}
 			}
 		}
 
@@ -1274,7 +1328,7 @@ restart:
 		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
 		 * call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nremaining > 0)
 		{
 			/*
 			 * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1345,7 @@ restart:
 			 * that.
 			 */
 			_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+								remainingoffset, remaining, nremaining,
 								vstate->lastBlockVacuumed);
 
 			/*
@@ -1376,6 +1431,42 @@ restart:
 }
 
 /*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+	int			i,
+				remaining = 0;
+	int			nitem = BTreeTupleGetNPosting(itup);
+	ItemPointer tmpitems = NULL,
+				items = BTreeTupleGetPosting(itup);
+
+	/*
+	 * Check each tuple in the posting list, save alive tuples into tmpitems
+	 */
+	for (i = 0; i < nitem; i++)
+	{
+		if (vstate->callback(items + i, vstate->callback_state))
+			continue;
+
+		if (tmpitems == NULL)
+			tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+		tmpitems[remaining++] = items[i];
+	}
+
+	*nremaining = remaining;
+	return tmpitems;
+}
+
+/*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
  * btrees always do, so this is trivial.
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c655dad..49a1aae 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -30,6 +30,9 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static void _bt_savePostingitem(BTScanOpaque so, int itemIndex,
+								OffsetNumber offnum, ItemPointer iptr,
+								IndexTuple itup, int i);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -665,6 +668,9 @@ _bt_compare(Relation rel,
 	 * Use the heap TID attribute and scantid to try to break the tie.  The
 	 * rules are the same as any other key attribute -- only the
 	 * representation differs.
+	 * TODO when itup is a posting tuple, the check becomes more complex.
+	 * we have an option that key nor smaller, nor larger than the tuple,
+	 * but exactly in between of BTreeTupleGetMinTID to BTreeTupleGetMaxTID.
 	 */
 	heapTid = BTreeTupleGetHeapTID(itup);
 	if (key->scantid == NULL)
@@ -1410,6 +1416,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	int			itemIndex;
 	bool		continuescan;
 	int			indnatts;
+	int			i;
 
 	/*
 	 * We must have the buffer pinned and locked, but the usual macro can't be
@@ -1456,6 +1463,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	/* initialize tuple workspace to empty */
 	so->currPos.nextTupleOffset = 0;
+	so->currPos.prevTupleOffset = 0;
 
 	/*
 	 * Now that the current page has been made consistent, the macro should be
@@ -1490,8 +1498,22 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				if (BTreeTupleIsPosting(itup))
+				{
+					for (i = 0; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savePostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											itup, i);
+						itemIndex++;
+					}
+				}
+				else
+				{
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1524,7 +1546,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1532,7 +1554,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxPostingIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1574,8 +1596,22 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (passes_quals && tuple_alive)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				if (BTreeTupleIsPosting(itup))
+				{
+					for (i = 0; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						itemIndex--;
+						_bt_savePostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											itup, i);
+					}
+				}
+				else
+				{
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+
 			}
 			if (!continuescan)
 			{
@@ -1589,8 +1625,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1603,6 +1639,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1615,6 +1653,33 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 	}
 }
 
+/* Save an index item into so->currPos.items[itemIndex] for posting tuples. */
+static void
+_bt_savePostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer iptr, IndexTuple itup, int i)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *iptr;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		if (i == 0)
+		{
+			/* save key. the same for all tuples in the posting */
+			Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+			currItem->tupleOffset = so->currPos.nextTupleOffset;
+			memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+			so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+			so->currPos.prevTupleOffset = currItem->tupleOffset;
+		}
+		else
+			currItem->tupleOffset = so->currPos.prevTupleOffset;
+	}
+}
+
 /*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index d0b9013..955a628 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -65,6 +65,7 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "catalog/catalog.h"
 #include "catalog/index.h"
 #include "commands/progress.h"
 #include "miscadmin.h"
@@ -288,6 +289,9 @@ static void _bt_sortaddtup(Page page, Size itemsize,
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 						 IndexTuple itup);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void insert_itupprev_to_page_buildadd(BTWriteState *wstate,
+											 BTPageState *state,
+											 BTCompressState *compressState);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
 static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -972,6 +976,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * only shift the line pointer array back and forth, and overwrite
 			 * the tuple space previously occupied by oitup.  This is fairly
 			 * cheap.
+			 *
+			 * If lastleft tuple was a posting tuple, we'll truncate its
+			 * posting list in _bt_truncate as well. Note that it is also
+			 * applicable only to leaf pages, since internal pages never
+			 * contain posting tuples.
 			 */
 			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
 			lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1011,6 +1020,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		 * the minimum key for the new page.
 		 */
 		state->btps_minkey = CopyIndexTuple(oitup);
+		Assert(!BTreeTupleIsPosting(state->btps_minkey));
 
 		/*
 		 * Set the sibling links for both pages.
@@ -1050,8 +1060,36 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	if (last_off == P_HIKEY)
 	{
 		Assert(state->btps_minkey == NULL);
-		state->btps_minkey = CopyIndexTuple(itup);
-		/* _bt_sortaddtup() will perform full truncation later */
+
+		/*
+		 * Stashed copy must be a non-posting tuple, with truncated posting
+		 * list and correct t_tid since we're going to use it to build
+		 * downlink.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			Size		keytupsz;
+			IndexTuple	keytup;
+
+			/*
+			 * Form key tuple, that doesn't contain any ipd. NOTE: since we'll
+			 * need TID later, set t_tid to the first t_tid from posting list.
+			 */
+			keytupsz = BTreeTupleGetPostingOffset(itup);
+			keytup = palloc0(keytupsz);
+			memcpy(keytup, itup, keytupsz);
+
+			keytup->t_info &= ~INDEX_SIZE_MASK;
+			keytup->t_info |= keytupsz;
+			ItemPointerCopy(BTreeTupleGetPosting(itup), &keytup->t_tid);
+			state->btps_minkey = CopyIndexTuple(keytup);
+			pfree(keytup);
+		}
+		else
+			state->btps_minkey = CopyIndexTuple(itup);	/* _bt_sortaddtup() will
+														 * perform full
+														 * truncation later */
+
 		BTreeTupleSetNAtts(state->btps_minkey, 0);
 	}
 
@@ -1137,6 +1175,89 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 }
 
 /*
+ * Add new tuple (posting or non-posting) to the page, while building index.
+ */
+void
+insert_itupprev_to_page_buildadd(BTWriteState *wstate, BTPageState *state,
+								 BTCompressState *compressState)
+{
+	IndexTuple	to_insert;
+
+	/* Return, if there is no tuple to insert */
+	if (state == NULL)
+		return;
+
+	if (compressState->ntuples == 0)
+		to_insert = compressState->itupprev;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+											 compressState->ipd,
+											 compressState->ntuples);
+		to_insert = postingtuple;
+		pfree(compressState->ipd);
+	}
+
+	_bt_buildadd(wstate, state, to_insert);
+
+	if (compressState->ntuples > 0)
+		pfree(to_insert);
+	compressState->ntuples = 0;
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in compressState.
+ * Helper function for bt_load() and _bt_compress_one_page().
+ *
+ * Note: caller is responsible for size check to ensure that
+ * resulting tuple won't exceed BTMaxItemSize.
+ */
+void
+add_item_to_posting(BTCompressState *compressState, IndexTuple itup)
+{
+	int			nposting = 0;
+
+	if (compressState->ntuples == 0)
+	{
+		compressState->ipd = palloc0(compressState->maxitemsize);
+
+		if (BTreeTupleIsPosting(compressState->itupprev))
+		{
+			/* if itupprev is posting, add all its TIDs to the posting list */
+			nposting = BTreeTupleGetNPosting(compressState->itupprev);
+			memcpy(compressState->ipd, BTreeTupleGetPosting(compressState->itupprev),
+				   sizeof(ItemPointerData) * nposting);
+			compressState->ntuples += nposting;
+		}
+		else
+		{
+			memcpy(compressState->ipd, compressState->itupprev,
+				   sizeof(ItemPointerData));
+			compressState->ntuples++;
+		}
+	}
+
+	if (BTreeTupleIsPosting(itup))
+	{
+		/* if tuple is posting, add all its TIDs to the posting list */
+		nposting = BTreeTupleGetNPosting(itup);
+		memcpy(compressState->ipd + compressState->ntuples,
+			   BTreeTupleGetPosting(itup),
+			   sizeof(ItemPointerData) * nposting);
+		compressState->ntuples += nposting;
+	}
+	else
+	{
+		memcpy(compressState->ipd + compressState->ntuples, itup,
+			   sizeof(ItemPointerData));
+		compressState->ntuples++;
+	}
+}
+
+/*
  * Read tuples in correct sort order from tuplesort, and load them into
  * btree leaves.
  */
@@ -1150,9 +1271,21 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	bool		load1;
 	TupleDesc	tupdes = RelationGetDescr(wstate->index);
 	int			i,
-				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+				natts = IndexRelationGetNumberOfAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		use_compression = false;
+	BTCompressState *compressState = NULL;
+
+	/*
+	 * Don't use compression for indexes with INCLUDEd columns, system indexes
+	 * and unique indexes.
+	 */
+	use_compression = ((IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+						IndexRelationGetNumberOfAttributes(wstate->index))
+					   && (!IsSystemRelation(wstate->index))
+					   && (!wstate->index->rd_index->indisunique));
 
 	if (merge)
 	{
@@ -1266,19 +1399,88 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	}
 	else
 	{
-		/* merge is unnecessary */
-		while ((itup = tuplesort_getindextuple(btspool->sortstate,
-											   true)) != NULL)
+		if (!use_compression)
 		{
-			/* When we see first tuple, create first index page */
-			if (state == NULL)
-				state = _bt_pagestate(wstate, 0);
+			/* merge is unnecessary */
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+					state = _bt_pagestate(wstate, 0);
 
-			_bt_buildadd(wstate, state, itup);
+				_bt_buildadd(wstate, state, itup);
 
-			/* Report progress */
-			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
-										 ++tuples_done);
+				/* Report progress */
+				pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+											 ++tuples_done);
+			}
+		}
+		else
+		{
+			/* init compress state needed to build posting tuples */
+			compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+			compressState->ipd = NULL;
+			compressState->ntuples = 0;
+			compressState->itupprev = NULL;
+			compressState->maxitemsize = 0;
+			compressState->maxpostingsize = 0;
+
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+				{
+					state = _bt_pagestate(wstate, 0);
+					compressState->maxitemsize = BTMaxItemSize(state->btps_page);
+				}
+
+				if (compressState->itupprev != NULL)
+				{
+					int			n_equal_atts = _bt_keep_natts_fast(wstate->index,
+																   compressState->itupprev, itup);
+
+					if (n_equal_atts > natts)
+					{
+						/*
+						 * Tuples are equal. Create or update posting.
+						 *
+						 * Else If posting is too big, insert it on page and
+						 * continue.
+						 */
+						if ((compressState->ntuples + 1) * sizeof(ItemPointerData) <
+							compressState->maxpostingsize)
+							add_item_to_posting(compressState, itup);
+						else
+							insert_itupprev_to_page_buildadd(wstate, state, compressState);
+					}
+					else
+					{
+						/*
+						 * Tuples are not equal. Insert itupprev into index.
+						 * Save current tuple for the next iteration.
+						 */
+						insert_itupprev_to_page_buildadd(wstate, state, compressState);
+					}
+				}
+
+				/*
+				 * Save the tuple to compare it with the next one and maybe
+				 * unite them into a posting tuple.
+				 */
+				if (compressState->itupprev)
+					pfree(compressState->itupprev);
+				compressState->itupprev = CopyIndexTuple(itup);
+
+				/* compute max size of posting list */
+				compressState->maxpostingsize = compressState->maxitemsize -
+					IndexInfoFindDataOffset(compressState->itupprev->t_info) -
+					MAXALIGN(IndexTupleSize(compressState->itupprev));
+			}
+
+			/* Handle the last item */
+			insert_itupprev_to_page_buildadd(wstate, state, compressState);
 		}
 	}
 
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 93fab26..0da6fa8 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1787,7 +1787,9 @@ _bt_killitems(IndexScanDesc scan)
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			/* No microvacuum for posting tuples */
+			if (!BTreeTupleIsPosting(ituple) &&
+				(ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
 			{
 				/* found the item */
 				ItemIdMarkDead(iid);
@@ -2145,6 +2147,16 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(firstright))
+		{
+			BTreeTupleClearBtIsPosting(pivot);
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			pivot->t_info &= ~INDEX_SIZE_MASK;
+			pivot->t_info |= BTreeTupleGetPostingOffset(firstright);
+		}
+
+		Assert(!BTreeTupleIsPosting(pivot));
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2168,6 +2180,27 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		pfree(pivot);
 		pivot = tidpivot;
 	}
+	else if (BTreeTupleIsPosting(firstright))
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal. But
+		 * the tuple is a compressed tuple with a posting list, so we still
+		 * must truncate it.
+		 *
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		newsize = BTreeTupleGetPostingOffset(firstright) +
+			MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+		pivot->t_info &= ~INDEX_SIZE_MASK;
+		pivot->t_info |= newsize;
+		BTreeTupleClearBtIsPosting(pivot);
+		BTreeTupleSetAltHeapTID(pivot);
+
+		Assert(!BTreeTupleIsPosting(pivot));
+	}
 	else
 	{
 		/*
@@ -2205,7 +2238,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2216,9 +2249,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+							  BTreeTupleGetMinTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetMinTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetMinTID(firstright)) < 0);
 #else
 
 	/*
@@ -2231,7 +2267,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMinTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2240,7 +2276,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetMinTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2330,6 +2367,10 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * leaving excessive amounts of free space on either side of page split.
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts.
+ *
+ * To build a posting tuple we need to ensure that all attributes
+ * of both tuples are equal. Use this function to compare them.
+ * TODO: maybe it's worth to rename the function.
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2415,7 +2456,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * Non-pivot tuples currently never use alternative heap TID
 			 * representation -- even those within heapkeyspace indexes
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
@@ -2470,7 +2511,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * that to decide if the tuple is a pre-v11 tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
+				(!BTreeTupleIsPivot(itup) &&
 				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
 		}
 		else
@@ -2497,7 +2538,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
 		return false;
 
 	/*
@@ -2549,6 +2590,8 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 	if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
 		return;
 
+	/* TODO correct error messages for posting tuples */
+
 	/*
 	 * Internal page insertions cannot fail here, because that would mean that
 	 * an earlier leaf level insertion that should have failed didn't
@@ -2575,3 +2618,79 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Given a basic tuple that contains key datum and posting list,
+ * build a posting tuple.
+ *
+ * Basic tuple can be a posting tuple, but we only use key part of it,
+ * all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1 fallback to building a non-posting tuple.
+ * It is necessary to avoid storage overhead after posting tuple was vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+	uint32		keysize,
+				newsize = 0;
+	IndexTuple	itup;
+
+	/* We only need key part of the tuple */
+	if (BTreeTupleIsPosting(tuple))
+		keysize = BTreeTupleGetPostingOffset(tuple);
+	else
+		keysize = IndexTupleSize(tuple);
+
+	Assert(nipd > 0);
+
+	/* Add space needed for posting list */
+	if (nipd > 1)
+		newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+	else
+		newsize = keysize;
+
+	newsize = MAXALIGN(newsize);
+	itup = palloc0(newsize);
+	memcpy(itup, tuple, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+
+	if (nipd > 1)
+	{
+		/* Form posting tuple, fill posting fields */
+
+		/* Set meta info about the posting list */
+		itup->t_info |= INDEX_ALT_TID_MASK;
+		BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+		/* sort the list to preserve TID order invariant */
+		qsort((void *) ipd, nipd, sizeof(ItemPointerData),
+		  (int (*) (const void *, const void *)) ItemPointerCompare);
+
+		/* Copy posting list into the posting tuple */
+		memcpy(BTreeTupleGetPosting(itup), ipd,
+			   sizeof(ItemPointerData) * nipd);
+	}
+	else
+	{
+		/* To finish building of a non-posting tuple, copy TID from ipd */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		ItemPointerCopy(ipd, &itup->t_tid);
+	}
+
+	return itup;
+}
+
+/*
+ * Opposite of BTreeFormPostingTuple.
+ * returns regular tuple that contains the key,
+ * the tid of the new tuple is the nth tid of original tuple's posting list
+ * result tuple palloc'd in a caller's context.
+ */
+IndexTuple
+BTreeGetNthTupleOfPosting(IndexTuple tuple, int n)
+{
+	Assert(BTreeTupleIsPosting(tuple));
+	return BTreeFormPostingTuple(tuple, BTreeTupleGetPostingN(tuple, n), 1);
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 3147ea4..7daadc9 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -384,8 +384,8 @@ btree_xlog_vacuum(XLogReaderState *record)
 	Buffer		buffer;
 	Page		page;
 	BTPageOpaque opaque;
-#ifdef UNUSED
 	xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
 
 	/*
 	 * This section of code is thought to be no longer needed, after analysis
@@ -476,14 +476,35 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		if (len > 0)
 		{
-			OffsetNumber *unused;
-			OffsetNumber *unend;
+			if (xlrec->nremaining)
+			{
+				int			i;
+				OffsetNumber *remainingoffset;
+				IndexTuple	remaining;
+				Size		itemsz;
+
+				remainingoffset = (OffsetNumber *)
+					(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+				remaining = (IndexTuple) ((char *) remainingoffset +
+										  xlrec->nremaining * sizeof(OffsetNumber));
 
-			unused = (OffsetNumber *) ptr;
-			unend = (OffsetNumber *) ((char *) ptr + len);
+				/* Handle posting tuples */
+				for (i = 0; i < xlrec->nremaining; i++)
+				{
+					PageIndexTupleDelete(page, remainingoffset[i]);
+
+					itemsz = MAXALIGN(IndexTupleSize(remaining));
+
+					if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+									false, false) == InvalidOffsetNumber)
+						elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+					remaining = (IndexTuple) ((char *) remaining + itemsz);
+				}
+			}
 
-			if ((unend - unused) > 0)
-				PageIndexMultiDelete(page, unused, unend - unused);
+			if (xlrec->ndeleted)
+				PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 		}
 
 		/*
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 744ffb6..85ee040 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -141,6 +141,11 @@ typedef IndexAttributeBitMapData * IndexAttributeBitMap;
  * On such a page, N tuples could take one MAXALIGN quantum less space than
  * estimated here, seemingly allowing one more tuple than estimated here.
  * But such a page always has at least MAXALIGN special space, so we're safe.
+ *
+ * Note: btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so they may contain more tuples.
+ * Use MaxPostingIndexTuplesPerPage instead.
+
  */
 #define MaxIndexTuplesPerPage	\
 	((int) ((BLCKSZ - SizeOfPageHeaderData) / \
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index a3583f2..7d0d456 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
  * All other types of index tuples ("pivot" tuples) only have key columns,
  * since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,39 @@ typedef struct BTMetaPageData
  * omitted rather than truncated, since its representation is different to
  * the non-pivot representation.)
  *
+ * Non-pivot posting tuple format:
+ *  t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicated keys more effectively,
+ * we use special format of tuples - posting tuples.
+ * posting_list is an array of ItemPointerData.
+ *
+ * This type of compression never applies to system indexes, unique indexes
+ * or indexes with INCLUDEd columns.
+ *
+ * To differ posting tuples we use INDEX_ALT_TID_MASK flag in t_info and
+ * BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's tid:
+ * - t_tid.ip_blkid contains offset of the posting list.
+ * - t_tid offset field contains number of posting items this tuple contain
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * tuples, which is constrainted by BTMaxItemSize.
+
+ * If page contains so many duplicates, that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), page may contain several posting
+ * tuples with the same key.
+ * Also page can contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
  * Pivot tuple format:
  *
  *  t_tid | t_info | key values | [heap TID]
@@ -281,23 +313,157 @@ typedef struct BTMetaPageData
  * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
  * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
  * number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinct posting tuples from pivot tuples.
  *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_N_POSTING_OFFSET_MASK	0x0FFF
 #define BT_HEAP_TID_ATTR			0x1000
+#define BT_IS_POSTING				0x2000
+
+#define BTreeTupleIsPosting(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+	)
+
+#define BTreeTupleIsPivot(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+	)
+
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger then
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+	((int) ((BLCKSZ - SizeOfPageHeaderData - \
+			3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+			(sizeof(ItemPointerData)))
 
-/* Get/set downlink block number */
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * Iterating over tuples during index build or applying compression to a
+ * single page, we remember a tuple in itupprev, then compare the next one
+ * with it. If tuples are equal, save their TIDs in the posting list.
+ * ntuples contains the size of the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTCompressState
+{
+	Size		maxitemsize;
+	Size		maxpostingsize;
+	IndexTuple	itupprev;
+	int			ntuples;
+	ItemPointerData *ipd;
+} BTCompressState;
+
+/*
+ * For use in _bt_compress_one_page().
+ * If there is only a few uncompressed items on a page,
+ * it isn't worth to apply compression.
+ * Currently it is just a magic number,
+ * proper benchmarking will probably help to choose better value.
+ */
+#define BT_COMPRESS_THRESHOLD 10
+
+/* macros to work with posting tuples *BEGIN* */
+#define BTreeTupleSetBtIsPosting(itup) \
+	do { \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleGetNPosting(itup)	\
+	( \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+	)
+
+#define BTreeTupleSetNPosting(itup, n) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+		BTreeTupleSetBtIsPosting(itup); \
+	} while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list.
+ * Caller is responsible for checking BTreeTupleIsPosting to ensure that
+ * it will get what he expects
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
+#define BTreeTupleSetPostingOffset(itup, offset) \
+	ItemPointerSetBlockNumber(&((itup)->t_tid), (offset))
+
+#define BTreeSetPostingMeta(itup, nposting, off) \
+	do { \
+		BTreeTupleSetNPosting(itup, nposting); \
+		BTreeTupleSetPostingOffset(itup, off); \
+	} while(0)
+
+#define BTreeTupleGetPosting(itup) \
+	(ItemPointerData*) ((char*)(itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+	(ItemPointerData*) (BTreeTupleGetPosting(itup) + (n))
+
+/*
+ * Posting tuples always contain several TIDs.
+ * Some functions that use TID as a tiebreaker,
+ * to ensure correct order of TID keys they can use two macros below:
+ */
+#define BTreeTupleGetMinTID(itup) \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+		( \
+			(ItemPointer) BTreeTupleGetPosting(itup) \
+		) \
+		: \
+		(ItemPointer) &((itup)->t_tid) \
+	)
+#define BTreeTupleGetMaxTID(itup) \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+		( \
+			(ItemPointer) (BTreeTupleGetPosting(itup) + (BTreeTupleGetNPosting(itup)-1)) \
+		) \
+		: \
+		(ItemPointer) &((itup)->t_tid) \
+	)
+/* macros to work with posting tuples *END* */
+
+/* Get/set downlink block number  */
 #define BTreeInnerTupleGetDownLink(itup) \
 	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
 #define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,7 +492,8 @@ typedef struct BTMetaPageData
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
 		( \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
@@ -335,6 +502,7 @@ typedef struct BTMetaPageData
 	)
 #define BTreeTupleSetNAtts(itup, n) \
 	do { \
+		Assert(!BTreeTupleIsPosting(itup)); \
 		(itup)->t_info |= INDEX_ALT_TID_MASK; \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
 	} while(0)
@@ -342,6 +510,8 @@ typedef struct BTMetaPageData
 /*
  * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
  * and non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * For non-pivot posting tuple it returns the first tid from posting list.
  */
 #define BTreeTupleGetHeapTID(itup) \
 	( \
@@ -351,7 +521,10 @@ typedef struct BTMetaPageData
 		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
 					   sizeof(ItemPointerData)) \
 	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+	  : (itup)->t_info & INDEX_ALT_TID_MASK ? \
+		(((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0) ? \
+		(ItemPointer) BTreeTupleGetPosting(itup) : NULL) \
+		: (ItemPointer) &((itup)->t_tid) \
 	)
 /*
  * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
@@ -360,6 +533,7 @@ typedef struct BTMetaPageData
 #define BTreeTupleSetAltHeapTID(itup) \
 	do { \
 		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
 								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
 	} while(0)
@@ -567,6 +741,8 @@ typedef struct BTScanPosData
 	 * location in the associated tuple storage workspace.
 	 */
 	int			nextTupleOffset;
+	/* prevTupleOffset is for posting list handling */
+	int			prevTupleOffset;
 
 	/*
 	 * The items array is always ordered in index order (ie, increasing
@@ -579,7 +755,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxPostingIndexTuplesPerPage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -763,6 +939,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems, Relation heapRel);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems,
+								OffsetNumber *remainingoffset,
+								IndexTuple *remaining, int nremaining,
 								BlockNumber lastBlockVacuumed);
 extern int	_bt_pagedel(Relation rel, Buffer buf);
 
@@ -813,6 +991,9 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd,
+										int nipd);
+extern IndexTuple BTreeGetNthTupleOfPosting(IndexTuple tuple, int n);
 
 /*
  * prototypes for functions in nbtvalidate.c
@@ -825,5 +1006,7 @@ extern bool btvalidate(Oid opclassoid);
 extern IndexBuildResult *btbuild(Relation heap, Relation index,
 								 struct IndexInfo *indexInfo);
 extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+extern void add_item_to_posting(BTCompressState *compressState,
+								IndexTuple itup);
 
 #endif							/* NBTREE_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 9beccc8..6f60ca5 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -173,10 +173,19 @@ typedef struct xl_btree_vacuum
 {
 	BlockNumber lastBlockVacuumed;
 
-	/* TARGET OFFSET NUMBERS FOLLOW */
+	/*
+	 * This field helps us to find beginning of the remaining tuples from
+	 * postings which follow array of offset numbers.
+	 */
+	uint32		nremaining;
+	uint32		ndeleted;
+
+	/* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+	/* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
+	/* TARGET OFFSET NUMBERS FOLLOW (if any) */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(BlockNumber))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.

0002-btree_compression_pg12_v2.patchtext/x-patch; name=0002-btree_compression_pg12_v2.patchDownload

diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 26ddf32..c7bb25a 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -42,6 +42,17 @@ static OffsetNumber _bt_findinsertloc(Relation rel,
 									  BTStack stack,
 									  Relation heapRel);
 static void _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack);
+static void _bt_delete_and_insert(Relation rel,
+					Buffer buf,
+					IndexTuple newitup,
+					OffsetNumber newitemoff);
+static void _bt_insertonpg_in_posting(Relation rel, BTScanInsert itup_key,
+						   Buffer buf,
+						   Buffer cbuf,
+						   BTStack stack,
+						   IndexTuple itup,
+						   OffsetNumber newitemoff,
+						   bool split_only_page, int in_posting_offset);
 static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   Buffer buf,
 						   Buffer cbuf,
@@ -51,7 +62,7 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   bool split_only_page);
 static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 						Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
-						IndexTuple newitem);
+						IndexTuple newitem, int in_posting_offset);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 							  BTStack stack, bool is_root, bool is_only);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
@@ -300,10 +311,17 @@ top:
 		 * search bounds established within _bt_check_unique when insertion is
 		 * checkingunique.
 		 */
+		insertstate.in_posting_offset = 0;
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
-		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
-					   itup, newitemoff, false);
+
+		if (insertstate.in_posting_offset)
+			_bt_insertonpg_in_posting(rel, itup_key, insertstate.buf,
+									  InvalidBuffer, stack, itup, newitemoff,
+									  false, insertstate.in_posting_offset);
+		else
+			_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer,
+						   stack, itup, newitemoff, false);
 	}
 	else
 	{
@@ -914,6 +932,162 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
 	insertstate->bounds_valid = false;
 }
 
+/*
+ * Delete tuple on newitemoff offset and insert newitup at the same offset.
+ * All checks of free space must have been done before calling this function.
+ *
+ * For use in posting tuple's update.
+ */
+static void
+_bt_delete_and_insert(Relation rel,
+					Buffer buf,
+					IndexTuple newitup,
+					OffsetNumber newitemoff)
+{
+	Page page = BufferGetPage(buf);
+	Size newitupsz = IndexTupleSize(newitup);
+
+	newitupsz = MAXALIGN(newitupsz);
+
+	START_CRIT_SECTION();
+
+	PageIndexTupleDelete(page, newitemoff);
+
+	if (!_bt_pgaddtup(page, newitupsz, newitup, newitemoff))
+		elog(ERROR, "failed to insert compressed item in index \"%s\"",
+			RelationGetRelationName(rel));
+
+	MarkBufferDirty(buf);
+
+	/* Xlog stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_btree_insert xlrec;
+		XLogRecPtr	recptr;
+		BTPageOpaque pageop = (BTPageOpaque) PageGetSpecialPointer(page);
+
+		xlrec.offnum = newitemoff;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
+
+		Assert(P_ISLEAF(pageop));
+
+		/*
+		 * Force pull page write to keep code simple
+		 * TODO: think of using XLOG_BTREE_INSERT_LEAF with a new tuple's data
+		 */
+		XLogRegisterBuffer(0, buf, REGBUF_STANDARD | REGBUF_FORCE_IMAGE);
+		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_INSERT_LEAF);
+		PageSetLSN(page, recptr);
+	}
+
+	END_CRIT_SECTION();
+}
+
+/*
+ * _bt_insertonpg_in_posting() --
+ *		Insert a tuple on a particular page in the index
+ *		(compression aware version).
+ *
+ * If new tuple's key is equal to the key of a posting tuple that already
+ * exists on the page and it's TID falls inside the min/max range of
+ * existing posting list, update the posting tuple.
+ *
+ * It only can happen on leaf page.
+ *
+ * newitemoff - offset of the posting tuple we must update
+ * in_posting_offset - position of the new tuple's TID in posting list
+ *
+ * If necessary, split the page.
+ */
+static void
+_bt_insertonpg_in_posting(Relation rel,
+			   BTScanInsert itup_key,
+			   Buffer buf,
+			   Buffer cbuf,
+			   BTStack stack,
+			   IndexTuple itup,
+			   OffsetNumber newitemoff,
+			   bool split_only_page,
+			   int in_posting_offset)
+{
+	IndexTuple oldtup;
+	IndexTuple lefttup;
+	IndexTuple righttup;
+	ItemPointerData *ipd;
+	IndexTuple 		newitup;
+	Page			page;
+	int				nipd, nipd_right;
+
+	page = BufferGetPage(buf);
+	/* get old posting tuple */
+	oldtup = (IndexTuple) PageGetItem(page, PageGetItemId(page, newitemoff));
+	Assert(BTreeTupleIsPosting(oldtup));
+	nipd = BTreeTupleGetNPosting(oldtup);
+
+	/* At first, check if the new itempointer fits into the tuple's posting list.
+	*  Also check if new itempointer fits into the page.
+	 * If not, posting tuple's split is required in both cases.
+	 */
+	if ((BTMaxItemSize(page) < (IndexTupleSize(oldtup) + sizeof(ItemIdData))) ||
+		PageGetFreeSpace(page) < IndexTupleSize(oldtup) + sizeof(ItemPointerData))
+	{
+		/*
+		 * Split posting tuple into two halves.
+		 * Left tuple contains all item pointes less than the new one
+		 * and right tuple contains new item pointer and all to the right.
+		 * TODO Probably we can come up with more clever algorithm.
+		 */
+		lefttup = BTreeFormPostingTuple(oldtup, BTreeTupleGetPosting(oldtup), in_posting_offset);
+
+		nipd_right = nipd - in_posting_offset + 1;
+		ipd = palloc0(sizeof(ItemPointerData)*(nipd_right));
+		/* insert new item pointer */
+		memcpy(ipd, itup, sizeof(ItemPointerData));
+		/* copy item pointers from old tuple */
+		memcpy(ipd+1,
+			   BTreeTupleGetPostingN(oldtup, in_posting_offset),
+			   sizeof(ItemPointerData)*(nipd-in_posting_offset));
+
+		righttup = BTreeFormPostingTuple(oldtup, ipd, nipd_right);
+
+		/*
+		 * Replace old tuple with a left tuple on a page.
+		 * And insert righttuple using ordinary _bt_insertonpg() function
+		 * If split is required, _bt_insertonpg will handle it.
+		 */
+		_bt_delete_and_insert(rel, buf, lefttup, newitemoff);
+		_bt_insertonpg(rel, itup_key, buf, InvalidBuffer,
+						stack, righttup, newitemoff, false);
+
+		pfree(ipd);
+		pfree(lefttup);
+		pfree(righttup);
+	}
+	else
+	{
+		ipd = palloc0(sizeof(ItemPointerData)*(nipd + 1));
+
+		/* copy item pointers from old tuple into ipd */
+		memcpy(ipd, BTreeTupleGetPosting(oldtup), sizeof(ItemPointerData)*in_posting_offset);
+		/* add item pointer of the new tuple into ipd */
+		memcpy(ipd+in_posting_offset, itup, sizeof(ItemPointerData));
+		/* copy item pointers from old tuple into ipd */
+		memcpy(ipd+in_posting_offset+1,
+			BTreeTupleGetPostingN(oldtup, in_posting_offset),
+			sizeof(ItemPointerData)*(nipd-in_posting_offset));
+
+		newitup = BTreeFormPostingTuple(itup, ipd, nipd+1);
+
+		_bt_delete_and_insert(rel, buf, newitup, newitemoff);
+
+		pfree(ipd);
+		pfree(newitup);
+		_bt_relbuf(rel, buf);
+	}
+}
+
 /*----------
  *	_bt_insertonpg() -- Insert a tuple on a particular page in the index.
  *
@@ -1010,7 +1184,7 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup, 0);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1228,7 +1402,8 @@ _bt_insertonpg(Relation rel,
  */
 static Buffer
 _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+		  int in_posting_offset)
 {
 	Buffer		rbuf;
 	Page		origpage;
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 49a1aae..58a050f 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -507,7 +507,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, key, page, mid);
+		result = _bt_compare_posting(rel, key, page, mid, &(insertstate->in_posting_offset));
 
 		if (result >= cmpval)
 			low = mid + 1;
@@ -536,6 +536,45 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	return low;
 }
 
+/*
+ * Compare insertion-type scankey to tuple on a page,
+ * taking into account posting tuples.
+ * If the key of the posting tuple is equal to scankey,
+ * find exact postition inside the posting list,
+ * using TID as extra attribut.
+ */
+int32
+_bt_compare_posting(Relation rel,
+			BTScanInsert key,
+			Page page,
+			OffsetNumber offnum,
+			int *in_posting_offset)
+{
+	IndexTuple itup = (IndexTuple) PageGetItem(page,
+											   PageGetItemId(page, offnum));
+	int result = _bt_compare(rel, key, page, offnum);
+	if (BTreeTupleIsPosting(itup) && result == 0)
+	{
+		int low, high, mid, res;
+
+		low = 0;
+		high = BTreeTupleGetNPosting(itup);
+
+		while (high > low)
+		{
+			mid = low + ((high - low) / 2);
+			res = ItemPointerCompare(key->scantid, BTreeTupleGetPostingN(itup, mid));
+
+			if (res == -1)
+				high = mid;
+			else
+				low = mid + 1;
+		}
+		*in_posting_offset = mid;
+	}
+		return result;
+}
+
 /*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
@@ -668,64 +707,112 @@ _bt_compare(Relation rel,
 	 * Use the heap TID attribute and scantid to try to break the tie.  The
 	 * rules are the same as any other key attribute -- only the
 	 * representation differs.
-	 * TODO when itup is a posting tuple, the check becomes more complex.
-	 * we have an option that key nor smaller, nor larger than the tuple,
-	 * but exactly in between of BTreeTupleGetMinTID to BTreeTupleGetMaxTID.
+	 *
+	 * When itup is a posting tuple, the check becomes more complex.
+	 * It is possible that the scankey belongs to the tuple's posting list
+	 * TID range.
+	 * _bt_compare() is multipurpose, so it just returns 0 for a fact that
+	 * key matches tuple at this offset.
+	 * Use special _bt_compare_posting() wrapper function to handle this case
+	 * and perform recheck for posting tuple, finding exact position of the
+	 * scankey.
 	 */
-	heapTid = BTreeTupleGetHeapTID(itup);
-	if (key->scantid == NULL)
+	if (!BTreeTupleIsPosting(itup))
 	{
+		heapTid = BTreeTupleGetHeapTID(itup);
+		if (key->scantid == NULL)
+		{
+			/*
+			* Most searches have a scankey that is considered greater than a
+			* truncated pivot tuple if and when the scankey has equal values for
+			* attributes up to and including the least significant untruncated
+			* attribute in tuple.
+			*
+			* For example, if an index has the minimum two attributes (single
+			* user key attribute, plus heap TID attribute), and a page's high key
+			* is ('foo', -inf), and scankey is ('foo', <omitted>), the search
+			* will not descend to the page to the left.  The search will descend
+			* right instead.  The truncated attribute in pivot tuple means that
+			* all non-pivot tuples on the page to the left are strictly < 'foo',
+			* so it isn't necessary to descend left.  In other words, search
+			* doesn't have to descend left because it isn't interested in a match
+			* that has a heap TID value of -inf.
+			*
+			* However, some searches (pivotsearch searches) actually require that
+			* we descend left when this happens.  -inf is treated as a possible
+			* match for omitted scankey attribute(s).  This is needed by page
+			* deletion, which must re-find leaf pages that are targets for
+			* deletion using their high keys.
+			*
+			* Note: the heap TID part of the test ensures that scankey is being
+			* compared to a pivot tuple with one or more truncated key
+			* attributes.
+			*
+			* Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
+			* left here, since they have no heap TID attribute (and cannot have
+			* any -inf key values in any case, since truncation can only remove
+			* non-key attributes).  !heapkeyspace searches must always be
+			* prepared to deal with matches on both sides of the pivot once the
+			* leaf level is reached.
+			*/
+			if (key->heapkeyspace && !key->pivotsearch &&
+				key->keysz == ntupatts && heapTid == NULL)
+				return 1;
+
+			/* All provided scankey arguments found to be equal */
+			return 0;
+		}
+
 		/*
-		 * Most searches have a scankey that is considered greater than a
-		 * truncated pivot tuple if and when the scankey has equal values for
-		 * attributes up to and including the least significant untruncated
-		 * attribute in tuple.
-		 *
-		 * For example, if an index has the minimum two attributes (single
-		 * user key attribute, plus heap TID attribute), and a page's high key
-		 * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
-		 * will not descend to the page to the left.  The search will descend
-		 * right instead.  The truncated attribute in pivot tuple means that
-		 * all non-pivot tuples on the page to the left are strictly < 'foo',
-		 * so it isn't necessary to descend left.  In other words, search
-		 * doesn't have to descend left because it isn't interested in a match
-		 * that has a heap TID value of -inf.
-		 *
-		 * However, some searches (pivotsearch searches) actually require that
-		 * we descend left when this happens.  -inf is treated as a possible
-		 * match for omitted scankey attribute(s).  This is needed by page
-		 * deletion, which must re-find leaf pages that are targets for
-		 * deletion using their high keys.
-		 *
-		 * Note: the heap TID part of the test ensures that scankey is being
-		 * compared to a pivot tuple with one or more truncated key
-		 * attributes.
-		 *
-		 * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
-		 * left here, since they have no heap TID attribute (and cannot have
-		 * any -inf key values in any case, since truncation can only remove
-		 * non-key attributes).  !heapkeyspace searches must always be
-		 * prepared to deal with matches on both sides of the pivot once the
-		 * leaf level is reached.
-		 */
-		if (key->heapkeyspace && !key->pivotsearch &&
-			key->keysz == ntupatts && heapTid == NULL)
+		* Treat truncated heap TID as minus infinity, since scankey has a key
+		* attribute value (scantid) that would otherwise be compared directly
+		*/
+		Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+		if (heapTid == NULL)
 			return 1;
 
-		/* All provided scankey arguments found to be equal */
-		return 0;
+		Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+		return ItemPointerCompare(key->scantid, heapTid);
 	}
+	else
+	{
+		heapTid = BTreeTupleGetMinTID(itup);
+		if (key->scantid != NULL && heapTid != NULL)
+		{
+			int cmp = ItemPointerCompare(key->scantid, heapTid);
+			if (cmp == -1 || cmp == 0)
+			{
+				elog(DEBUG4, "offnum %d Scankey (%u,%u) is less than posting tuple (%u,%u)",
+							offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+							ItemPointerGetOffsetNumberNoCheck(key->scantid),
+							ItemPointerGetBlockNumberNoCheck(heapTid),
+							ItemPointerGetOffsetNumberNoCheck(heapTid));
+				return cmp;
+			}
 
-	/*
-	 * Treat truncated heap TID as minus infinity, since scankey has a key
-	 * attribute value (scantid) that would otherwise be compared directly
-	 */
-	Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
-	if (heapTid == NULL)
-		return 1;
+			heapTid = BTreeTupleGetMaxTID(itup);
+			cmp = ItemPointerCompare(key->scantid, heapTid);
+			if (cmp == 1)
+			{
+				elog(DEBUG4, "offnum %d Scankey (%u,%u) is greater than posting tuple (%u,%u)",
+							offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+							ItemPointerGetOffsetNumberNoCheck(key->scantid),
+							ItemPointerGetBlockNumberNoCheck(heapTid),
+							ItemPointerGetOffsetNumberNoCheck(heapTid));
+				return cmp;
+			}
 
-	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	return ItemPointerCompare(key->scantid, heapTid);
+			/* if we got here, scantid is inbetween of posting items of the tuple */
+			elog(DEBUG4, "offnum %d Scankey (%u,%u) is between posting items (%u,%u) and (%u,%u)",
+							offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+							ItemPointerGetOffsetNumberNoCheck(key->scantid),
+							ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMinTID(itup)),
+							ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMinTID(itup)),
+							ItemPointerGetBlockNumberNoCheck(heapTid),
+							ItemPointerGetOffsetNumberNoCheck(heapTid));
+			return 0;
+		}
+	}
 }
 
 /*
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 7d0d456..918043f 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -675,6 +675,13 @@ typedef struct BTInsertStateData
 	Buffer		buf;
 
 	/*
+	 * if _bt_binsrch_insert() found the location
+	 * inside existing posting list,
+	 * save the position inside the list.
+	 */
+	int	in_posting_offset;
+
+	/*
 	 * Cache of bounds within the current buffer.  Only used for insertions
 	 * where _bt_check_unique is called.  See _bt_binsrch_insert and
 	 * _bt_findinsertloc for details.
@@ -953,6 +960,8 @@ extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
 							bool forupdate, BTStack stack, int access, Snapshot snapshot);
 extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
 extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
+extern int32 _bt_compare_posting(Relation rel, BTScanInsert key, Page page,
+						 OffsetNumber offnum, int *in_posting_offset);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,

#58

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Anastasia Lubennikova (#57)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Fri, Jul 19, 2019 at 10:53 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

Patch 0002 (must be applied on top of 0001) implements preserving of
correct TID order
inside posting list when inserting new tuples.
This version passes all regression tests including amcheck test.
I also used following script to test insertion into the posting list:

Nice!

I suppose it is not the final version of the patch yet,
so I left some debug messages and TODO comments to ease review.

I'm fine with leaving them in. I have sometimes distributed a separate
patch with debug messages, but now that I think about it, that
probably wasn't a good use of time.

You will probably want to remove at least some of the debug messages
during performance testing. I'm thinking of code that appears in very
tight inner loops, such as the _bt_compare() code.

Please, in your review, pay particular attention to usage of
BTreeTupleGetHeapTID.
For posting tuples it returns the first tid from posting list like
BTreeTupleGetMinTID,
but maybe some callers are not ready for that and want
BTreeTupleGetMaxTID instead.
Incorrect usage of these macros may cause some subtle bugs,
which are probably not covered by tests. So, please double-check it.

One testing strategy that I plan to use for the patch is to
deliberately corrupt a compressed index in a subtle way using
pg_hexedit, and then see if amcheck detects the problem. For example,
I may swap the order of two TIDs in the middle of a posting list,
which is something that is unlikely to produce wrong answers to
queries, and won't even be detected by the "heapallindexed" check, but
is still wrong. If we can detect very subtle, adversarial corruption
like this, then we can detect any real-world problem.

Once we have confidence in amcheck's ability to detect problems with
posting lists in general, we can use it in many different contexts
without much thought. For example, we'll probably need to do long
running benchmarks to validate the performance of the patch. It's easy
to add amcheck testing at the end of each run. Every benchmark is now
also a correctness/stress test, for free.

Next week I'm going to check performance and try to find specific
scenarios where this
feature can lead to degradation and measure it, to understand if we need
to make this deduplication optional.

Sounds good, though I think it might be a bit too early to decide
whether or not it needs to be enabled by default. For one thing, the
approach to WAL-logging within _bt_compress_one_page() is probably
fairly inefficient, which may be a problem for certain workloads. It's
okay to leave it that way for now, because it is not relevant to the
core design of the patch. I'm sure that _bt_compress_one_page() can be
carefully optimized when the time comes.

My current focus is not on the raw performance itself. For now, I am
focussed on making sure that the compression works well, and that the
resulting indexes "look nice" in general. FWIW, the first few versions
of my v12 work on nbtree didn't actually make *anything* go faster. It
took a couple of months to fix the more important regressions, and a
few more months to fix all of them. I think that the work on this
patch may develop in a similar way. I am willing to accept regressions
in the unoptimized code during development because it seems likely
that you have the right idea about the data structure itself, which is
the one thing that I *really* care about. Once you get that right, the
remaining problems are very likely to either be fixable with further
work on optimizing specific code, or a price that users will mostly be
happy to pay to get the benefits.

--
Peter Geoghegan

#59

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Peter Geoghegan (#58)

1 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Fri, Jul 19, 2019 at 12:32 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Fri, Jul 19, 2019 at 10:53 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

Patch 0002 (must be applied on top of 0001) implements preserving of
correct TID order
inside posting list when inserting new tuples.
This version passes all regression tests including amcheck test.
I also used following script to test insertion into the posting list:

Nice!

Hmm. So, the attached test case fails amcheck verification for me with
the latest version of the patch:

$ psql -f amcheck-compress-test.sql
DROP TABLE
CREATE TABLE
CREATE INDEX
CREATE EXTENSION
INSERT 0 2001
psql:amcheck-compress-test.sql:6: ERROR: down-link lower bound
invariant violated for index "idx_desc_nl"
DETAIL: Parent block=3 child index tid=(2,2) parent page lsn=10/F87A3438.

Note that this test only has an INSERT statement. You have to use
bt_index_parent_check() to see the problem -- bt_index_check() will
not detect the problem.

--
Peter Geoghegan

#60

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Peter Geoghegan (#59)

2 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Fri, Jul 19, 2019 at 7:24 PM Peter Geoghegan <pg@bowt.ie> wrote:

Hmm. So, the attached test case fails amcheck verification for me with
the latest version of the patch:

Attached is a revised version of your v2 that fixes this issue -- I'll
call this v3. In general, my goal for the revision was to make sure
that all of my old tests from the v12 work passed, and to make sure
that amcheck can detect almost any possible problem. I tested the
amcheck changes by corrupting random state in a test index using
pg_hexedit, then making sure that amcheck actually complained in each
case.

I also fixed one or two bugs in passing, including the bug that caused
an assertion failure in _bt_truncate(). That was down to a subtle
off-by-one issue within _bt_insertonpg_in_posting(). Overall, I didn't
make that many changes to your v2. There are probably some things
about the patch that I still don't understand, or things that I have
misunderstood.

Other changes:

* We now support system catalog indexes. There is no reason not to support them.

* Removed unnecessary code from _bt_buildadd().

* Added my own new DEBUG4 trace to _bt_insertonpg_in_posting(), which
I used to fix that bug I mentioned. I agree that we should keep the
DEBUG4 traces around until the overall design settles down. I found
the ones that you added helpful, too.

* Added quite a few new assertions. For example, we need to still
support !heapkeyspace (pre Postgres 12) nbtree indexes, but we cannot
let them use compression -- new defensive assertions were added to
make this break loudly.

* Changed the custom binary search code within _bt_compare_posting()
to look more like _bt_binsrch() and _bt_binsrch_insert(). Do you know
of any reason not to do it that way?

* Added quite a few "FIXME"/"XXX" comments at various points, to
indicate where I have general concerns that need more discussion.

* Included my own pageinspect hack to visualize the minimum TIDs in
posting lists. It's broken out into a separate patch file. The code is
very rough, but it might help someone else, so I thought I'd include
it.

I also have some new concerns about the code in the patch that I will
point out now (though only as something to think about a solution on
-- I am unsure myself):

* It's a bad sign that compression involves calls to PageAddItem()
that are allowed to fail (we just give up on compression when that
happens). For one thing, all existing calls to PageAddItem() in
Postgres are never expected to fail -- if they do fail we get a "can't
happen" error that suggests corruption. It was a good idea to take
this approach to get the patch to work, and to prove the general idea,
but we now need to fully work out all the details about the use of
space. This includes complicated new questions around how alignment is
supposed to work.

Alignment in nbtree is already complicated today -- you're supposed to
MAXALIGN() everything in nbtree, so that the MAXALIGN() within
bufpage.c routines cannot be different to the lp_len/IndexTupleSize()
length (note that heapam can have tuples whose lp_len isn't aligned,
so nbtree could do it differently if it proved useful). Code within
nbtsplitloc.c fully understands the space requirements for the
bufpage.c routines, and is very careful about it. (The bufpage.c
details are supposed to be totally hidden from code like
nbtsplitloc.c, but I guess that that ideal isn't quite possible in
reality. Code comments don't really explain the situation today.)

I'm not sure what it would look like for this patch to be as precise
about free space as nbtsplitloc.c already is, even though that seems
desirable (I just know that it would mean you would trust
PageAddItem() to work in all cases). The patch is different to what we
already have today in that it tries to add *less than* a single
MAXALIGN() quantum at a time in some places (when a posting list needs
to grow by one item). The devil is in the details.

* As you know, the current approach to WAL logging is very
inefficient. It's okay for now, but we'll need a fine-grained approach
for the patch to be commitable. I think that this is subtly related to
the last item (i.e. the one about alignment). I have done basic
performance tests using unlogged tables. The patch seems to either
make big INSERT queries run as fast or faster than before when
inserting into unlogged tables, which is a very good start.

* Since we can now split a posting list in two, we may also have to
reconsider BTMaxItemSize, or some similar mechanism that worries about
extreme cases where it becomes impossible to split because even two
pages are not enough to fit everything. Think of what happens when
there is a tuple with a single large datum, that gets split in two
(the tuple is split, not the page), with each half receiving its own
copy of the datum. I haven't proven to myself that this is broken, but
that may just be because I haven't spent any time on it. OTOH, maybe
you already have it right, in which case it seems like it should be
explained somewhere. Possibly in nbtree.h. This is tricky stuff.

* I agree with all of your existing TODO items -- most of them seem
very important to me.

* Do we really need to keep BTreeTupleGetHeapTID(), now that we have
BTreeTupleGetMinTID()? Can't we combine the two macros into one, so
that callers don't need to think about the pivot vs posting list thing
themselves? See the new code added to _bt_mkscankey() by v3, for
example. It now handles both cases/macros at once, in order to keep
its amcheck caller happy. amcheck's verify_nbtree.c received similar
ugly code in v3.

* We should at least experiment with applying compression when
inserting into unique indexes. Like Alexander, I think that
compression in unique indexes might work well, given how they must
work in Postgres.

My next steps will be to study the design of the
_bt_insertonpg_in_posting() stuff some more. It seems like you already
have the right general idea there, but I would like to come up with a
way of making _bt_insertonpg_in_posting() understand how to work with
space on the page with total certainty, much like nbtsplitloc.c does
today. This should allow us to make WAL-logging more
precise/incremental.

--
Peter Geoghegan

Attachments:

v3-0002-DEBUG-Add-pageinspect-instrumentation.patchapplication/x-patch; name=v3-0002-DEBUG-Add-pageinspect-instrumentation.patchDownload

From bfa3121169f98d9bc8b8cce71502b98814c90f1f Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v3 2/2] DEBUG: Add pageinspect instrumentation.

Have pageinspect display user-visible attribute values.

This patch is not proposed for inclusion in PostgreSQL; it's included
for the convenience of reviewers.

The following query can be used with this hacked pageinspect, which
visualizes the internal pages:

"""

with recursive index_details as (
  select
    'my_test_index'::text idx
),
size_in_pages_index as (
  select
    (pg_relation_size(idx::regclass) / (2^13))::int4 size_pages
  from
    index_details
),
page_stats as (
  select
    index_details.*,
    stats.*
  from
    index_details,
    size_in_pages_index,
    lateral (select i from generate_series(1, size_pages - 1) i) series,
    lateral (select * from bt_page_stats(idx, i)) stats),
internal_page_stats as (
  select
    *
  from
    page_stats
  where
    type != 'l'),
meta_stats as (
  select
    *
  from
    index_details s,
    lateral (select * from bt_metap(s.idx)) meta),
internal_items as (
  select
    *
  from
    internal_page_stats
  order by
    btpo desc),
-- XXX: Note ordering dependency within this CTE, on internal_items
ordered_internal_items(item, blk, level) as (
  select
    1,
    blkno,
    btpo
  from
    internal_items
  where
    btpo_prev = 0
    and btpo = (select level from meta_stats)
  union
  select
    case when level = btpo then o.item + 1 else 1 end,
    blkno,
    btpo
  from
    internal_items i,
    ordered_internal_items o
  where
    i.btpo_prev = o.blk or (btpo_prev = 0 and btpo = o.level - 1)
)
select
  --idx,
  btpo as level,
  item as l_item,
  blkno,
  --btpo_prev,
  --btpo_next,
  btpo_flags,
  type,
  live_items,
  dead_items,
  avg_item_size,
  page_size,
  free_size,
  -- Only non-rightmost pages have high key.  Show heap TID for both pivot and non-pivot tuples here.
  case when btpo_next != 0 then (select data || coalesce(', (htid)=(''' || htid || ''')', '')
                                 from bt_page_items(idx, blkno) where itemoffset = 1) end as highkey
from
  ordered_internal_items o
  join internal_items i on o.blk = i.blkno
order by btpo desc, item;
"""
---
 contrib/pageinspect/btreefuncs.c              | 65 +++++++++++++++----
 contrib/pageinspect/expected/btree.out        |  3 +-
 contrib/pageinspect/pageinspect--1.6--1.7.sql | 22 +++++++
 3 files changed, 76 insertions(+), 14 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 8d27c9b0f6..30e2865076 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -29,6 +29,7 @@
 
 #include "pageinspect.h"
 
+#include "access/genam.h"
 #include "access/nbtree.h"
 #include "access/relation.h"
 #include "catalog/namespace.h"
@@ -243,6 +244,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
  */
 struct user_args
 {
+	Relation	rel;
 	Page		page;
 	OffsetNumber offset;
 };
@@ -254,9 +256,9 @@ struct user_args
  * ------------------------------------------------------
  */
 static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset, Relation rel)
 {
-	char	   *values[6];
+	char	   *values[7];
 	HeapTuple	tuple;
 	ItemId		id;
 	IndexTuple	itup;
@@ -265,6 +267,7 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	int			dlen;
 	char	   *dump;
 	char	   *ptr;
+	ItemPointer htid;
 
 	id = PageGetItemId(page, offset);
 
@@ -283,16 +286,51 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
-	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
-	dump = palloc0(dlen * 3 + 1);
-	values[j] = dump;
-	for (off = 0; off < dlen; off++)
+	if (rel)
 	{
-		if (off > 0)
-			*dump++ = ' ';
-		sprintf(dump, "%02x", *(ptr + off) & 0xff);
-		dump += 2;
+		TupleDesc	itupdesc = RelationGetDescr(rel);
+		Datum		datvalues[INDEX_MAX_KEYS];
+		bool		isnull[INDEX_MAX_KEYS];
+		int			natts;
+		int			indnkeyatts = rel->rd_index->indnkeyatts;
+
+		natts = BTreeTupleGetNAtts(itup, rel);
+
+		itupdesc->natts = Min(indnkeyatts, natts);
+		memset(&isnull, 0xFF, sizeof(isnull));
+		index_deform_tuple(itup, itupdesc, datvalues, isnull);
+		rel->rd_index->indnkeyatts = natts;
+		values[j++] = BuildIndexValueDescription(rel, datvalues, isnull);
+		itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+		rel->rd_index->indnkeyatts = indnkeyatts;
 	}
+	else
+	{
+		dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+		dump = palloc0(dlen * 3 + 1);
+		values[j++] = dump;
+		for (off = 0; off < dlen; off++)
+		{
+			if (off > 0)
+				*dump++ = ' ';
+			sprintf(dump, "%02x", *(ptr + off) & 0xff);
+			dump += 2;
+		}
+	}
+
+	if (!rel || !_bt_heapkeyspace(rel))
+		htid = NULL;
+	else if (!BTreeTupleIsPivot(itup))
+		htid = BTreeTupleGetMinTID(itup);
+	else
+		htid = BTreeTupleGetHeapTID(itup);
+
+	if (htid)
+		values[j] = psprintf("(%u,%u)",
+							 ItemPointerGetBlockNumberNoCheck(htid),
+							 ItemPointerGetOffsetNumberNoCheck(htid));
+	else
+		values[j] = NULL;
 
 	tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
 
@@ -366,11 +404,11 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = rel;
 		uargs->page = palloc(BLCKSZ);
 		memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
 
 		UnlockReleaseBuffer(buffer);
-		relation_close(rel, AccessShareLock);
 
 		uargs->offset = FirstOffsetNumber;
 
@@ -397,12 +435,13 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, uargs->rel);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
 	else
 	{
+		relation_close(uargs->rel, AccessShareLock);
 		pfree(uargs->page);
 		pfree(uargs);
 		SRF_RETURN_DONE(fctx);
@@ -482,7 +521,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, NULL);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..067e73f21a 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -40,7 +40,8 @@ ctid       | (0,1)
 itemlen    | 16
 nulls      | f
 vars       | f
-data       | 01 00 00 00 00 00 00 01
+data       | (a)=(72057594037927937)
+htid       | (0,1)
 
 SELECT * FROM bt_page_items('test1_a_idx', 2);
 ERROR:  block number out of range
diff --git a/contrib/pageinspect/pageinspect--1.6--1.7.sql b/contrib/pageinspect/pageinspect--1.6--1.7.sql
index 2433a21af2..9acbad1589 100644
--- a/contrib/pageinspect/pageinspect--1.6--1.7.sql
+++ b/contrib/pageinspect/pageinspect--1.6--1.7.sql
@@ -24,3 +24,25 @@ CREATE FUNCTION bt_metap(IN relname text,
     OUT last_cleanup_num_tuples real)
 AS 'MODULE_PATHNAME', 'bt_metap'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items()
+--
+DROP FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT htid tid)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
-- 
2.17.1

v3-0001-Compression-deduplication-in-nbtree.patchapplication/x-patch; name=v3-0001-Compression-deduplication-in-nbtree.patchDownload

From 1f5d732152bfbee6008249a9619d9e80f868e7f8 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 19 Jul 2019 18:57:31 -0700
Subject: [PATCH v3 1/2] Compression/deduplication in nbtree.

Version with some revisions by me.
---
 contrib/amcheck/verify_nbtree.c         | 140 ++++++--
 src/backend/access/nbtree/nbtinsert.c   | 455 +++++++++++++++++++++++-
 src/backend/access/nbtree/nbtpage.c     |  53 +++
 src/backend/access/nbtree/nbtree.c      | 142 ++++++--
 src/backend/access/nbtree/nbtsearch.c   | 283 ++++++++++++---
 src/backend/access/nbtree/nbtsort.c     | 197 +++++++++-
 src/backend/access/nbtree/nbtsplitloc.c |   7 +
 src/backend/access/nbtree/nbtutils.c    | 173 ++++++++-
 src/backend/access/nbtree/nbtxlog.c     |  35 +-
 src/backend/access/rmgrdesc/nbtdesc.c   |   6 +-
 src/include/access/itup.h               |   5 +
 src/include/access/nbtree.h             | 215 ++++++++++-
 src/include/access/nbtxlog.h            |  13 +-
 13 files changed, 1571 insertions(+), 153 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 55a3a4bbe0..19239410ff 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -889,6 +889,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer	scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -959,29 +960,79 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
 			char	   *itid,
 					   *htid;
+			ItemPointer tid = BTreeTupleGetMinTID(itup);
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+							ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("could not find tuple using search from root page in index \"%s\"",
 							RelationGetRelationName(state->rel)),
-					 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+					 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
 										itid, htid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is actually a posting list, make sure posting list TIDs
+		 * are in order.
+		 *
+		 * FIXME:  The calls to BTreeGetNthTupleOfPosting() allocate memory,
+		 * and are probably relatively expensive.  We should at least try to
+		 * make this happen at the same point that optional heapallindexed
+		 * verification needs to loop through each posting list.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			IndexTuple	onetup;
+			ItemPointerData last;
+
+			ItemPointerCopy(BTreeTupleGetMinTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+				onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+				if (ItemPointerCompare(&onetup->t_tid, &last) <= 0)
+				{
+					char	   *itid,
+							   *htid;
+
+					itid = psprintf("(%u,%u)", state->targetblock, offset);
+					htid = psprintf("(%u,%u)",
+									ItemPointerGetBlockNumberNoCheck(&(onetup->t_tid)),
+									ItemPointerGetOffsetNumberNoCheck(&(onetup->t_tid)));
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+												itid, htid,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(&onetup->t_tid, &last);
+				/* Be tidy */
+				pfree(onetup);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1039,12 +1090,33 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				IndexTuple	onetup;
+
+				/* Fingerprint all elements of posting tuple one by one */
+				for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+					norm = bt_normalize_tuple(state, onetup);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != onetup)
+						pfree(norm);
+					pfree(onetup);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
@@ -1052,7 +1124,8 @@ bt_target_page_check(BtreeCheckState *state)
 		 *
 		 * If there is a high key (if this is not the rightmost page on its
 		 * entire level), check that high key actually is upper bound on all
-		 * page items.
+		 * page items.  If this is a posting list tuple, we'll need to set
+		 * scantid to be highest TID in posting list.
 		 *
 		 * We prefer to check all items against high key rather than checking
 		 * just the last and trusting that the operator class obeys the
@@ -1092,6 +1165,9 @@ bt_target_page_check(BtreeCheckState *state)
 		 * tuple. (See also: "Notes About Data Representation" in the nbtree
 		 * README.)
 		 */
+		scantid = skey->scantid;
+		if (!BTreeTupleIsPivot(itup))
+			skey->scantid = BTreeTupleGetMaxTID(itup);
 		if (!P_RIGHTMOST(topaque) &&
 			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
 			  invariant_l_offset(state, skey, P_HIKEY)))
@@ -1115,6 +1191,7 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
+		skey->scantid = scantid;
 
 		/*
 		 * * Item order check *
@@ -1129,11 +1206,16 @@ bt_target_page_check(BtreeCheckState *state)
 					   *htid,
 					   *nitid,
 					   *nhtid;
+			ItemPointer tid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			if (!BTreeTupleIsPivot(itup))
+				tid = BTreeTupleGetMinTID(itup);
+			else
+				tid = BTreeTupleGetHeapTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 			nitid = psprintf("(%u,%u)", state->targetblock,
 							 OffsetNumberNext(offset));
 
@@ -1142,9 +1224,14 @@ bt_target_page_check(BtreeCheckState *state)
 										  state->target,
 										  OffsetNumberNext(offset));
 			itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+			if (!BTreeTupleIsPivot(itup))
+				tid = BTreeTupleGetMinTID(itup);
+			else
+				tid = BTreeTupleGetHeapTID(itup);
 			nhtid = psprintf("(%u,%u)",
-							 ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							 ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							 ItemPointerGetBlockNumberNoCheck(tid),
+							 ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1154,10 +1241,10 @@ bt_target_page_check(BtreeCheckState *state)
 										"higher index tid=%s (points to %s tid=%s) "
 										"page lsn=%X/%X.",
 										itid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										htid,
 										nitid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										nhtid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
@@ -1918,10 +2005,11 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
  * verification.  In particular, it won't try to normalize opclass-equal
  * datums with potentially distinct representations (e.g., btree/numeric_ops
  * index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though.  For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have their own posting
+ * list, since dummy CREATE INDEX callback code generates new tuples with the
+ * same normalized representation.  Compression is performed
+ * opportunistically, and in general there is no guarantee about how or when
+ * compression will be applied.
  */
 static IndexTuple
 bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -2525,14 +2613,20 @@ static inline ItemPointer
 BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
 							bool nonpivot)
 {
-	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	ItemPointer result;
 	BlockNumber targetblock = state->targetblock;
 
-	if (result == NULL && nonpivot)
+	if (BTreeTupleIsPivot(itup) == nonpivot)
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
 						targetblock, RelationGetRelationName(state->rel))));
 
+	/* XXX: Again, I wonder if we need both of these macros... */
+	if (!BTreeTupleIsPivot(itup))
+		result = BTreeTupleGetMinTID(itup);
+	else
+		result = BTreeTupleGetHeapTID(itup);
+
 	return result;
 }
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 602f8849d4..b6407b80b6 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -41,6 +41,17 @@ static OffsetNumber _bt_findinsertloc(Relation rel,
 									  BTStack stack,
 									  Relation heapRel);
 static void _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack);
+static void _bt_delete_and_insert(Relation rel,
+								  Buffer buf,
+								  IndexTuple newitup,
+								  OffsetNumber newitemoff);
+static void _bt_insertonpg_in_posting(Relation rel, BTScanInsert itup_key,
+									  Buffer buf,
+									  Buffer cbuf,
+									  BTStack stack,
+									  IndexTuple itup,
+									  OffsetNumber newitemoff,
+									  bool split_only_page, int in_posting_offset);
 static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   Buffer buf,
 						   Buffer cbuf,
@@ -56,6 +67,8 @@ static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 						 OffsetNumber itup_off);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static bool insert_itupprev_to_page(Page page, BTCompressState *compressState);
+static void _bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel);
 
 /*
  *	_bt_doinsert() -- Handle insertion of a single index tuple in the tree.
@@ -297,10 +310,17 @@ top:
 		 * search bounds established within _bt_check_unique when insertion is
 		 * checkingunique.
 		 */
+		insertstate.in_posting_offset = 0;
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
-		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
-					   itup, newitemoff, false);
+
+		if (insertstate.in_posting_offset)
+			_bt_insertonpg_in_posting(rel, itup_key, insertstate.buf,
+									  InvalidBuffer, stack, itup, newitemoff,
+									  false, insertstate.in_posting_offset);
+		else
+			_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer,
+						   stack, itup, newitemoff, false);
 	}
 	else
 	{
@@ -759,6 +779,12 @@ _bt_findinsertloc(Relation rel,
 			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
 			insertstate->bounds_valid = false;
 		}
+
+		/*
+		 * If the target page is full, try to compress the page
+		 */
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
+			_bt_compress_one_page(rel, insertstate->buf, heapRel);
 	}
 	else
 	{
@@ -900,6 +926,191 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
 	insertstate->bounds_valid = false;
 }
 
+/*
+ * Delete tuple on newitemoff offset and insert newitup at the same offset.
+ * All checks of free space must have been done before calling this function.
+ *
+ * For use in posting tuple's update.
+ */
+static void
+_bt_delete_and_insert(Relation rel,
+					  Buffer buf,
+					  IndexTuple newitup,
+					  OffsetNumber newitemoff)
+{
+	Page		page = BufferGetPage(buf);
+	Size		newitupsz = IndexTupleSize(newitup);
+
+	newitupsz = MAXALIGN(newitupsz);
+
+	START_CRIT_SECTION();
+
+	PageIndexTupleDelete(page, newitemoff);
+
+	if (!_bt_pgaddtup(page, newitupsz, newitup, newitemoff))
+		elog(ERROR, "failed to insert compressed item in index \"%s\"",
+			 RelationGetRelationName(rel));
+
+	MarkBufferDirty(buf);
+
+	/* Xlog stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_btree_insert xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.offnum = newitemoff;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
+
+		Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+
+		/*
+		 * Force full page write to keep code simple
+		 *
+		 * TODO: think of using XLOG_BTREE_INSERT_LEAF with a new tuple's data
+		 */
+		XLogRegisterBuffer(0, buf, REGBUF_STANDARD | REGBUF_FORCE_IMAGE);
+		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_INSERT_LEAF);
+		PageSetLSN(page, recptr);
+	}
+
+	END_CRIT_SECTION();
+}
+
+/*
+ * _bt_insertonpg_in_posting() --
+ *		Insert a tuple on a particular page in the index
+ *		(compression aware version).
+ *
+ * If new tuple's key is equal to the key of a posting tuple that already
+ * exists on the page and it's TID falls inside the min/max range of
+ * existing posting list, update the posting tuple.
+ *
+ * It only can happen on leaf page.
+ *
+ * newitemoff - offset of the posting tuple we must update
+ * in_posting_offset - position of the new tuple's TID in posting list
+ *
+ * If necessary, split the page.
+ */
+static void
+_bt_insertonpg_in_posting(Relation rel,
+						  BTScanInsert itup_key,
+						  Buffer buf,
+						  Buffer cbuf,
+						  BTStack stack,
+						  IndexTuple itup,
+						  OffsetNumber newitemoff,
+						  bool split_only_page,
+						  int in_posting_offset)
+{
+	IndexTuple	origtup;
+	IndexTuple	lefttup;
+	IndexTuple	righttup;
+	ItemPointerData *ipd;
+	IndexTuple	newitup;
+	Page		page;
+	int			nipd,
+				nipd_right;
+
+	page = BufferGetPage(buf);
+	/* get old posting tuple */
+	origtup = (IndexTuple) PageGetItem(page, PageGetItemId(page, newitemoff));
+	Assert(BTreeTupleIsPosting(origtup));
+	nipd = BTreeTupleGetNPosting(origtup);
+	Assert(in_posting_offset < nipd);
+	Assert(itup_key->scantid != NULL);
+	Assert(itup_key->heapkeyspace);
+
+	elog(DEBUG4, "(%u,%u) is min, (%u,%u) is max, (%u,%u) is new",
+		 ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMinTID(origtup)),
+		 ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMinTID(origtup)),
+		 ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(origtup)),
+		 ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(origtup)),
+		 ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(itup)),
+		 ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(itup)));
+
+	/*
+	 * At first, check if the new itempointer fits into the tuple's posting
+	 * list.
+	 *
+	 * Also check if new itempointer fits into the page.
+	 *
+	 * If not, posting tuple's split is required in both cases.
+	 *
+	 * XXX: Think some more about alignment - pg
+	 */
+	if (BTMaxItemSize(page) < MAXALIGN(IndexTupleSize(origtup)) + MAXALIGN(sizeof(ItemPointerData)) ||
+		PageGetFreeSpace(page) < MAXALIGN(IndexTupleSize(origtup)) + MAXALIGN(sizeof(ItemPointerData)))
+	{
+		/*
+		 * Split posting tuple into two halves.
+		 *
+		 * Left tuple contains all item pointes less than the new one and
+		 * right tuple contains new item pointer and all to the right.
+		 *
+		 * TODO Probably we can come up with more clever algorithm.
+		 */
+		lefttup = BTreeFormPostingTuple(origtup, BTreeTupleGetPosting(origtup),
+										in_posting_offset);
+
+		nipd_right = nipd - in_posting_offset + 1;
+		ipd = palloc0(sizeof(ItemPointerData) * nipd_right);
+		/* insert new item pointer */
+		memcpy(ipd, itup, sizeof(ItemPointerData));
+		/* copy item pointers from original tuple that belong on right */
+		memcpy(ipd + 1,
+			   BTreeTupleGetPostingN(origtup, in_posting_offset),
+			   sizeof(ItemPointerData) * (nipd - in_posting_offset));
+
+		righttup = BTreeFormPostingTuple(origtup, ipd, nipd_right);
+		elog(DEBUG4, "inserting inside posting list with split due to no space orig elements %d new off %d",
+			 nipd, in_posting_offset);
+
+		Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lefttup),
+								  BTreeTupleGetMinTID(righttup)) < 0);
+
+		/*
+		 * Replace old tuple with a left tuple on a page.
+		 *
+		 * And insert righttuple using ordinary _bt_insertonpg() function If
+		 * split is required, _bt_insertonpg will handle it.
+		 */
+		_bt_delete_and_insert(rel, buf, lefttup, newitemoff);
+		_bt_insertonpg(rel, itup_key, buf, InvalidBuffer,
+					   stack, righttup, newitemoff + 1, false);
+
+		pfree(ipd);
+		pfree(lefttup);
+		pfree(righttup);
+	}
+	else
+	{
+		ipd = palloc0(sizeof(ItemPointerData) * (nipd + 1));
+		elog(DEBUG4, "inserting inside posting list due to apparent overlap");
+
+		/* copy item pointers from original tuple into ipd */
+		memcpy(ipd, BTreeTupleGetPosting(origtup),
+			   sizeof(ItemPointerData) * in_posting_offset);
+		/* add item pointer of the new tuple into ipd */
+		memcpy(ipd + in_posting_offset, itup, sizeof(ItemPointerData));
+		/* copy item pointers from old tuple into ipd */
+		memcpy(ipd + in_posting_offset + 1,
+			   BTreeTupleGetPostingN(origtup, in_posting_offset),
+			   sizeof(ItemPointerData) * (nipd - in_posting_offset));
+
+		newitup = BTreeFormPostingTuple(itup, ipd, nipd + 1);
+
+		_bt_delete_and_insert(rel, buf, newitup, newitemoff);
+
+		pfree(ipd);
+		pfree(newitup);
+		_bt_relbuf(rel, buf);
+	}
+}
+
 /*----------
  *	_bt_insertonpg() -- Insert a tuple on a particular page in the index.
  *
@@ -2286,3 +2497,243 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * the page.
 	 */
 }
+
+/*
+ * Add new item (compressed or not) to the page, while compressing it.
+ * If insertion failed, return false.
+ * Caller should consider this as compression failure and
+ * leave page uncompressed.
+ */
+static bool
+insert_itupprev_to_page(Page page, BTCompressState *compressState)
+{
+	IndexTuple	to_insert;
+	OffsetNumber offnum = PageGetMaxOffsetNumber(page);
+
+	if (compressState->ntuples == 0)
+		to_insert = compressState->itupprev;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+											 compressState->ipd,
+											 compressState->ntuples);
+		to_insert = postingtuple;
+		pfree(compressState->ipd);
+	}
+
+	/* Add the new item into the page */
+	offnum = OffsetNumberNext(offnum);
+
+	elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d IndexTupleSize %zu free %zu",
+		 compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+
+	if (PageAddItem(page, (Item) to_insert, IndexTupleSize(to_insert),
+					offnum, false, false) == InvalidOffsetNumber)
+	{
+		elog(DEBUG4, "insert_itupprev_to_page. failed");
+
+		/*
+		 * this may happen if tuple is bigger than freespace fallback to
+		 * uncompressed page case
+		 */
+		if (compressState->ntuples > 0)
+			pfree(to_insert);
+
+		return false;
+	}
+
+	if (compressState->ntuples > 0)
+		pfree(to_insert);
+	compressState->ntuples = 0;
+
+	return true;
+}
+
+/*
+ * Before splitting the page, try to compress items to free some space.
+ * If compression didn't succeed, buffer will contain old state of the page.
+ * This function should be called after lp_dead items
+ * were removed by _bt_vacuum_one_page().
+ */
+static void
+_bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buffer);
+	Page		newpage;
+	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	bool		use_compression = false;
+	BTCompressState *compressState = NULL;
+	int			n_posting_on_page = 0;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+
+	/*
+	 * Don't use compression for indexes with INCLUDEd columns and unique
+	 * indexes.
+	 */
+	use_compression = (IndexRelationGetNumberOfKeyAttributes(rel) ==
+					   IndexRelationGetNumberOfAttributes(rel) &&
+					   !rel->rd_index->indisunique);
+	if (!use_compression)
+		return;
+
+	/* init compress state needed to build posting tuples */
+	compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+	compressState->ipd = NULL;
+	compressState->ntuples = 0;
+	compressState->itupprev = NULL;
+	compressState->maxitemsize = BTMaxItemSize(page);
+	compressState->maxpostingsize = 0;
+
+	/*
+	 * Scan over all items to see which ones can be compressed
+	 */
+	minoff = P_FIRSTDATAKEY(opaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/*
+	 * Heuristic to avoid trying to compress page that has already contain
+	 * mostly compressed items
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, P_HIKEY);
+		IndexTuple	item = (IndexTuple) PageGetItem(page, itemid);
+
+		if (BTreeTupleIsPosting(item))
+			n_posting_on_page++;
+	}
+
+	/*
+	 * If we have only a few uncompressed items on the full page, it isn't
+	 * worth to compress them
+	 */
+	if (maxoff - n_posting_on_page < BT_COMPRESS_THRESHOLD)
+		return;
+
+	newpage = PageGetTempPageCopySpecial(page);
+	elog(DEBUG4, "_bt_compress_one_page rel: %s,blkno: %u",
+		 RelationGetRelationName(rel), BufferGetBlockNumber(buffer));
+
+	/* Copy High Key if any */
+	if (!P_RIGHTMOST(opaque))
+	{
+		ItemId		itemid = PageGetItemId(page, P_HIKEY);
+		Size		itemsz = ItemIdGetLength(itemid);
+		IndexTuple	item = (IndexTuple) PageGetItem(page, itemid);
+
+		if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+						false, false) == InvalidOffsetNumber)
+		{
+			/*
+			 * Should never happen. Anyway, fallback gently to scenario of
+			 * incompressible page and just return from function.
+			 */
+			elog(DEBUG4, "_bt_compress_one_page. failed to insert highkey to newpage");
+			return;
+		}
+	}
+
+	/*
+	 * Iterate over tuples on the page, try to compress them into posting
+	 * lists and insert into new page.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemId = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemId);
+
+		/*
+		 * We do not expect to meet any DEAD items, since this function is
+		 * called right after _bt_vacuum_one_page(). If for some reason we
+		 * found dead item, don't compress it, to allow upcoming microvacuum
+		 * or vacuum clean it up.
+		 */
+		if (ItemIdIsDead(itemId))
+			continue;
+
+		if (compressState->itupprev != NULL)
+		{
+			int			n_equal_atts =
+			_bt_keep_natts_fast(rel, compressState->itupprev, itup);
+			int			itup_ntuples = BTreeTupleIsPosting(itup) ?
+			BTreeTupleGetNPosting(itup) : 1;
+
+			if (n_equal_atts > natts)
+			{
+				/*
+				 * When tuples are equal, create or update posting.
+				 *
+				 * If posting is too big, insert it on page and continue.
+				 */
+				if (compressState->maxitemsize >
+					MAXALIGN(((IndexTupleSize(compressState->itupprev)
+							   + (compressState->ntuples + itup_ntuples + 1) * sizeof(ItemPointerData)))))
+				{
+					_bt_add_posting_item(compressState, itup);
+				}
+				else if (!insert_itupprev_to_page(newpage, compressState))
+				{
+					elog(DEBUG4, "_bt_compress_one_page. failed to insert posting");
+					return;
+				}
+			}
+			else
+			{
+				/*
+				 * Tuples are not equal. Insert itupprev into index. Save
+				 * current tuple for the next iteration.
+				 */
+				if (!insert_itupprev_to_page(newpage, compressState))
+				{
+					elog(DEBUG4, "_bt_compress_one_page. failed to insert posting");
+					return;
+				}
+			}
+		}
+
+		/*
+		 * Copy the tuple into temp variable itupprev to compare it with the
+		 * following tuple and maybe unite them into a posting tuple
+		 */
+		if (compressState->itupprev)
+			pfree(compressState->itupprev);
+		compressState->itupprev = CopyIndexTuple(itup);
+
+		Assert(IndexTupleSize(compressState->itupprev) <= compressState->maxitemsize);
+	}
+
+	/* Handle the last item. */
+	if (!insert_itupprev_to_page(newpage, compressState))
+	{
+		elog(DEBUG4, "_bt_compress_one_page. failed to insert posting for last item");
+		return;
+	}
+
+	START_CRIT_SECTION();
+
+	PageRestoreTempPage(newpage, page);
+	MarkBufferDirty(buffer);
+
+	/* Log full page write */
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+
+		recptr = log_newpage_buffer(buffer, true);
+		PageSetLSN(page, recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	elog(DEBUG4, "_bt_compress_one_page. success");
+}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 5962126743..707a5d0fdb 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -983,14 +983,52 @@ _bt_page_recyclable(Page page)
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
 					OffsetNumber *itemnos, int nitems,
+					OffsetNumber *remainingoffset,
+					IndexTuple *remaining, int nremaining,
 					BlockNumber lastBlockVacuumed)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	Size		itemsz;
+	Size		remaining_sz = 0;
+	char	   *remaining_buf = NULL;
+
+	/* XLOG stuff, buffer for remainings */
+	if (nremaining && RelationNeedsWAL(rel))
+	{
+		Size		offset = 0;
+
+		for (int i = 0; i < nremaining; i++)
+			remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+		remaining_buf = palloc0(remaining_sz);
+		for (int i = 0; i < nremaining; i++)
+		{
+			itemsz = IndexTupleSize(remaining[i]);
+			memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+			offset += MAXALIGN(itemsz);
+		}
+		Assert(offset == remaining_sz);
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
+	/* Handle posting tuples here */
+	for (int i = 0; i < nremaining; i++)
+	{
+		/* At first, delete the old tuple. */
+		PageIndexTupleDelete(page, remainingoffset[i]);
+
+		itemsz = IndexTupleSize(remaining[i]);
+		itemsz = MAXALIGN(itemsz);
+
+		/* Add tuple with remaining ItemPointers to the page. */
+		if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to rewrite compressed item in index while doing vacuum");
+	}
+
 	/* Fix the page */
 	if (nitems > 0)
 		PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1058,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+		xlrec_vacuum.nremaining = nremaining;
+		xlrec_vacuum.ndeleted = nitems;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1073,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		if (nitems > 0)
 			XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
 
+		/*
+		 * Here we should save offnums and remaining tuples themselves. It's
+		 * important to restore them in correct order. At first, we must
+		 * handle remaining tuples and only after that other deleted items.
+		 */
+		if (nremaining > 0)
+		{
+			Assert(remaining_buf != NULL);
+			XLogRegisterBufData(0, (char *) remainingoffset,
+								nremaining * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, remaining_buf, remaining_sz);
+		}
+
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
 		PageSetLSN(page, recptr);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd5289ad..22fb228b81 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static ItemPointer btreevacuumPosting(BTVacState *vstate, IndexTuple itup,
+									  int *nremaining);
 
 
 /*
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 								 RBM_NORMAL, info->strategy);
 		LockBufferForCleanup(buf);
 		_bt_checkpage(rel, buf);
-		_bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+		_bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+							vstate.lastBlockVacuumed);
 		_bt_relbuf(rel, buf);
 	}
 
@@ -1193,6 +1196,9 @@ restart:
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
+		IndexTuple	remaining[MaxOffsetNumber];
+		OffsetNumber remainingoffset[MaxOffsetNumber];
+		int			nremaining;
 
 		/*
 		 * Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1235,7 @@ restart:
 		 * callback function.
 		 */
 		ndeletable = 0;
+		nremaining = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
 		if (callback)
@@ -1242,31 +1249,78 @@ restart:
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
-				/*
-				 * During Hot Standby we currently assume that
-				 * XLOG_BTREE_VACUUM records do not produce conflicts. That is
-				 * only true as long as the callback function depends only
-				 * upon whether the index tuple refers to heap tuples removed
-				 * in the initial heap scan. When vacuum starts it derives a
-				 * value of OldestXmin. Backends taking later snapshots could
-				 * have a RecentGlobalXmin with a later xid than the vacuum's
-				 * OldestXmin, so it is possible that row versions deleted
-				 * after OldestXmin could be marked as killed by other
-				 * backends. The callback function *could* look at the index
-				 * tuple state in isolation and decide to delete the index
-				 * tuple, though currently it does not. If it ever did, we
-				 * would need to reconsider whether XLOG_BTREE_VACUUM records
-				 * should cause conflicts. If they did cause conflicts they
-				 * would be fairly harsh conflicts, since we haven't yet
-				 * worked out a way to pass a useful value for
-				 * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
-				 * applies to *any* type of index that marks index tuples as
-				 * killed.
-				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				if (BTreeTupleIsPosting(itup))
+				{
+					int			nnewipd = 0;
+					ItemPointer newipd = NULL;
+
+					newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+					if (nnewipd == 0)
+					{
+						/*
+						 * All TIDs from posting list must be deleted, we can
+						 * delete whole tuple in a regular way.
+						 */
+						deletable[ndeletable++] = offnum;
+					}
+					else if (nnewipd == BTreeTupleGetNPosting(itup))
+					{
+						/*
+						 * All TIDs from posting tuple must remain. Do
+						 * nothing, just cleanup.
+						 */
+						pfree(newipd);
+					}
+					else if (nnewipd < BTreeTupleGetNPosting(itup))
+					{
+						/* Some TIDs from posting tuple must remain. */
+						Assert(nnewipd > 0);
+						Assert(newipd != NULL);
+
+						/*
+						 * Form new tuple that contains only remaining TIDs.
+						 * Remember this tuple and the offset of the old tuple
+						 * to update it in place.
+						 */
+						remainingoffset[nremaining] = offnum;
+						remaining[nremaining] = BTreeFormPostingTuple(itup, newipd, nnewipd);
+						nremaining++;
+						pfree(newipd);
+
+						Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+					}
+				}
+				else
+				{
+					htup = &(itup->t_tid);
+
+					/*
+					 * During Hot Standby we currently assume that
+					 * XLOG_BTREE_VACUUM records do not produce conflicts.
+					 * That is only true as long as the callback function
+					 * depends only upon whether the index tuple refers to
+					 * heap tuples removed in the initial heap scan. When
+					 * vacuum starts it derives a value of OldestXmin.
+					 * Backends taking later snapshots could have a
+					 * RecentGlobalXmin with a later xid than the vacuum's
+					 * OldestXmin, so it is possible that row versions deleted
+					 * after OldestXmin could be marked as killed by other
+					 * backends. The callback function *could* look at the
+					 * index tuple state in isolation and decide to delete the
+					 * index tuple, though currently it does not. If it ever
+					 * did, we would need to reconsider whether
+					 * XLOG_BTREE_VACUUM records should cause conflicts. If
+					 * they did cause conflicts they would be fairly harsh
+					 * conflicts, since we haven't yet worked out a way to
+					 * pass a useful value for latestRemovedXid on the
+					 * XLOG_BTREE_VACUUM records. This applies to *any* type
+					 * of index that marks index tuples as killed.
+					 */
+					if (callback(htup, callback_state))
+						deletable[ndeletable++] = offnum;
+				}
 			}
 		}
 
@@ -1274,7 +1328,7 @@ restart:
 		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
 		 * call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nremaining > 0)
 		{
 			/*
 			 * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1345,7 @@ restart:
 			 * that.
 			 */
 			_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+								remainingoffset, remaining, nremaining,
 								vstate->lastBlockVacuumed);
 
 			/*
@@ -1375,6 +1430,41 @@ restart:
 	}
 }
 
+/*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+	int			remaining = 0;
+	int			nitem = BTreeTupleGetNPosting(itup);
+	ItemPointer tmpitems = NULL,
+				items = BTreeTupleGetPosting(itup);
+
+	/*
+	 * Check each tuple in the posting list, save alive tuples into tmpitems
+	 */
+	for (int i = 0; i < nitem; i++)
+	{
+		if (vstate->callback(items + i, vstate->callback_state))
+			continue;
+
+		if (tmpitems == NULL)
+			tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+		tmpitems[remaining++] = items[i];
+	}
+
+	*nremaining = remaining;
+	return tmpitems;
+}
+
 /*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c655dadb96..3e53675c82 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -30,6 +30,9 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static void _bt_savePostingitem(BTScanOpaque so, int itemIndex,
+								OffsetNumber offnum, ItemPointer iptr,
+								IndexTuple itup, int i);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -504,7 +507,8 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, key, page, mid);
+		result = _bt_compare_posting(rel, key, page, mid,
+									 &(insertstate->in_posting_offset));
 
 		if (result >= cmpval)
 			low = mid + 1;
@@ -533,6 +537,55 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	return low;
 }
 
+/*
+ * Compare insertion-type scankey to tuple on a page,
+ * taking into account posting tuples.
+ * If the key of the posting tuple is equal to scankey,
+ * find exact position inside the posting list,
+ * using TID as extra attribute.
+ */
+int32
+_bt_compare_posting(Relation rel,
+					BTScanInsert key,
+					Page page,
+					OffsetNumber offnum,
+					int *in_posting_offset)
+{
+	IndexTuple	itup;
+	int			result;
+
+	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	result = _bt_compare(rel, key, page, offnum);
+
+	if (BTreeTupleIsPosting(itup) && result == 0)
+	{
+		int			low,
+					high,
+					mid,
+					res;
+
+		low = 0;
+		/* "high" is past end of posting list for loop invariant */
+		high = BTreeTupleGetNPosting(itup);
+
+		while (high > low)
+		{
+			mid = low + ((high - low) / 2);
+			res = ItemPointerCompare(key->scantid,
+									 BTreeTupleGetPostingN(itup, mid));
+
+			if (res >= 1)
+				low = mid + 1;
+			else
+				high = mid;
+		}
+
+		*in_posting_offset = high;
+	}
+
+	return result;
+}
+
 /*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
@@ -665,61 +718,120 @@ _bt_compare(Relation rel,
 	 * Use the heap TID attribute and scantid to try to break the tie.  The
 	 * rules are the same as any other key attribute -- only the
 	 * representation differs.
+	 *
+	 * When itup is a posting tuple, the check becomes more complex. It is
+	 * possible that the scankey belongs to the tuple's posting list TID
+	 * range.
+	 *
+	 * _bt_compare() is multipurpose, so it just returns 0 for a fact that key
+	 * matches tuple at this offset.
+	 *
+	 * Use special _bt_compare_posting() wrapper function to handle this case
+	 * and perform recheck for posting tuple, finding exact position of the
+	 * scankey.
 	 */
-	heapTid = BTreeTupleGetHeapTID(itup);
-	if (key->scantid == NULL)
+	if (!BTreeTupleIsPosting(itup))
 	{
+		heapTid = BTreeTupleGetHeapTID(itup);
+		if (key->scantid == NULL)
+		{
+			/*
+			 * Most searches have a scankey that is considered greater than a
+			 * truncated pivot tuple if and when the scankey has equal values
+			 * for attributes up to and including the least significant
+			 * untruncated attribute in tuple.
+			 *
+			 * For example, if an index has the minimum two attributes (single
+			 * user key attribute, plus heap TID attribute), and a page's high
+			 * key is ('foo', -inf), and scankey is ('foo', <omitted>), the
+			 * search will not descend to the page to the left.  The search
+			 * will descend right instead.  The truncated attribute in pivot
+			 * tuple means that all non-pivot tuples on the page to the left
+			 * are strictly < 'foo', so it isn't necessary to descend left. In
+			 * other words, search doesn't have to descend left because it
+			 * isn't interested in a match that has a heap TID value of -inf.
+			 *
+			 * However, some searches (pivotsearch searches) actually require
+			 * that we descend left when this happens.  -inf is treated as a
+			 * possible match for omitted scankey attribute(s).  This is
+			 * needed by page deletion, which must re-find leaf pages that are
+			 * targets for deletion using their high keys.
+			 *
+			 * Note: the heap TID part of the test ensures that scankey is
+			 * being compared to a pivot tuple with one or more truncated key
+			 * attributes.
+			 *
+			 * Note: pg_upgrade'd !heapkeyspace indexes must always descend to
+			 * the left here, since they have no heap TID attribute (and
+			 * cannot have any -inf key values in any case, since truncation
+			 * can only remove non-key attributes).  !heapkeyspace searches
+			 * must always be prepared to deal with matches on both sides of
+			 * the pivot once the leaf level is reached.
+			 */
+			if (key->heapkeyspace && !key->pivotsearch &&
+				key->keysz == ntupatts && heapTid == NULL)
+				return 1;
+
+			/* All provided scankey arguments found to be equal */
+			return 0;
+		}
+
 		/*
-		 * Most searches have a scankey that is considered greater than a
-		 * truncated pivot tuple if and when the scankey has equal values for
-		 * attributes up to and including the least significant untruncated
-		 * attribute in tuple.
-		 *
-		 * For example, if an index has the minimum two attributes (single
-		 * user key attribute, plus heap TID attribute), and a page's high key
-		 * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
-		 * will not descend to the page to the left.  The search will descend
-		 * right instead.  The truncated attribute in pivot tuple means that
-		 * all non-pivot tuples on the page to the left are strictly < 'foo',
-		 * so it isn't necessary to descend left.  In other words, search
-		 * doesn't have to descend left because it isn't interested in a match
-		 * that has a heap TID value of -inf.
-		 *
-		 * However, some searches (pivotsearch searches) actually require that
-		 * we descend left when this happens.  -inf is treated as a possible
-		 * match for omitted scankey attribute(s).  This is needed by page
-		 * deletion, which must re-find leaf pages that are targets for
-		 * deletion using their high keys.
-		 *
-		 * Note: the heap TID part of the test ensures that scankey is being
-		 * compared to a pivot tuple with one or more truncated key
-		 * attributes.
-		 *
-		 * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
-		 * left here, since they have no heap TID attribute (and cannot have
-		 * any -inf key values in any case, since truncation can only remove
-		 * non-key attributes).  !heapkeyspace searches must always be
-		 * prepared to deal with matches on both sides of the pivot once the
-		 * leaf level is reached.
+		 * Treat truncated heap TID as minus infinity, since scankey has a key
+		 * attribute value (scantid) that would otherwise be compared directly
 		 */
-		if (key->heapkeyspace && !key->pivotsearch &&
-			key->keysz == ntupatts && heapTid == NULL)
+		Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+		if (heapTid == NULL)
 			return 1;
 
-		/* All provided scankey arguments found to be equal */
-		return 0;
+		Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+		return ItemPointerCompare(key->scantid, heapTid);
+	}
+	else
+	{
+		heapTid = BTreeTupleGetMinTID(itup);
+		if (key->scantid != NULL && heapTid != NULL)
+		{
+			int			cmp = ItemPointerCompare(key->scantid, heapTid);
+
+			if (cmp == -1 || cmp == 0)
+			{
+				elog(DEBUG4, "offnum %d Scankey (%u,%u) is less than or equal to posting tuple (%u,%u)",
+					 offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+					 ItemPointerGetOffsetNumberNoCheck(key->scantid),
+					 ItemPointerGetBlockNumberNoCheck(heapTid),
+					 ItemPointerGetOffsetNumberNoCheck(heapTid));
+				return cmp;
+			}
+
+			heapTid = BTreeTupleGetMaxTID(itup);
+			cmp = ItemPointerCompare(key->scantid, heapTid);
+			if (cmp == 1)
+			{
+				elog(DEBUG4, "offnum %d Scankey (%u,%u) is greater than posting tuple (%u,%u)",
+					 offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+					 ItemPointerGetOffsetNumberNoCheck(key->scantid),
+					 ItemPointerGetBlockNumberNoCheck(heapTid),
+					 ItemPointerGetOffsetNumberNoCheck(heapTid));
+				return cmp;
+			}
+
+			/*
+			 * if we got here, scantid is inbetween of posting items of the
+			 * tuple
+			 */
+			elog(DEBUG4, "offnum %d Scankey (%u,%u) is between posting items (%u,%u) and (%u,%u)",
+				 offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+				 ItemPointerGetOffsetNumberNoCheck(key->scantid),
+				 ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMinTID(itup)),
+				 ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMinTID(itup)),
+				 ItemPointerGetBlockNumberNoCheck(heapTid),
+				 ItemPointerGetOffsetNumberNoCheck(heapTid));
+			return 0;
+		}
 	}
 
-	/*
-	 * Treat truncated heap TID as minus infinity, since scankey has a key
-	 * attribute value (scantid) that would otherwise be compared directly
-	 */
-	Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
-	if (heapTid == NULL)
-		return 1;
-
-	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	return ItemPointerCompare(key->scantid, heapTid);
+	return 0;
 }
 
 /*
@@ -1456,6 +1568,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	/* initialize tuple workspace to empty */
 	so->currPos.nextTupleOffset = 0;
+	so->currPos.prevTupleOffset = 0;
 
 	/*
 	 * Now that the current page has been made consistent, the macro should be
@@ -1490,8 +1603,22 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				if (!BTreeTupleIsPosting(itup))
+				{
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+				else
+				{
+					/* Return posting list "logical" tuples */
+					for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savePostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											itup, i);
+						itemIndex++;
+					}
+				}
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1524,7 +1651,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1532,7 +1659,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxPostingIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1574,8 +1701,23 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (passes_quals && tuple_alive)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				if (!BTreeTupleIsPosting(itup))
+				{
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+				else
+				{
+					/* Return posting list "logical" tuples */
+					/* XXX: Maybe this loop should be backwards? */
+					for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						itemIndex--;
+						_bt_savePostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											itup, i);
+					}
+				}
 			}
 			if (!continuescan)
 			{
@@ -1589,8 +1731,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1603,6 +1745,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1615,6 +1759,33 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 	}
 }
 
+/* Save an index item into so->currPos.items[itemIndex] for posting tuples. */
+static void
+_bt_savePostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer iptr, IndexTuple itup, int i)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *iptr;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		if (i == 0)
+		{
+			/* save key. the same for all tuples in the posting */
+			Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+			currItem->tupleOffset = so->currPos.nextTupleOffset;
+			memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+			so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+			so->currPos.prevTupleOffset = currItem->tupleOffset;
+		}
+		else
+			currItem->tupleOffset = so->currPos.prevTupleOffset;
+	}
+}
+
 /*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index d0b9013caf..5545465f92 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -288,6 +288,8 @@ static void _bt_sortaddtup(Page page, Size itemsize,
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 						 IndexTuple itup);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void _bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+								 BTCompressState *compressState);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
 static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -972,6 +974,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * only shift the line pointer array back and forth, and overwrite
 			 * the tuple space previously occupied by oitup.  This is fairly
 			 * cheap.
+			 *
+			 * If lastleft tuple was a posting tuple, we'll truncate its
+			 * posting list in _bt_truncate as well. Note that it is also
+			 * applicable only to leaf pages, since internal pages never
+			 * contain posting tuples.
 			 */
 			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
 			lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1011,6 +1018,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		 * the minimum key for the new page.
 		 */
 		state->btps_minkey = CopyIndexTuple(oitup);
+		Assert(BTreeTupleIsPivot(state->btps_minkey));
 
 		/*
 		 * Set the sibling links for both pages.
@@ -1052,6 +1060,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(state->btps_minkey == NULL);
 		state->btps_minkey = CopyIndexTuple(itup);
 		/* _bt_sortaddtup() will perform full truncation later */
+		BTreeTupleClearBtIsPosting(state->btps_minkey);
 		BTreeTupleSetNAtts(state->btps_minkey, 0);
 	}
 
@@ -1136,6 +1145,91 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
 }
 
+/*
+ * Add new tuple (posting or non-posting) to the page while building index.
+ */
+static void
+_bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+					 BTCompressState *compressState)
+{
+	IndexTuple	to_insert;
+
+	/* Return, if there is no tuple to insert */
+	if (state == NULL)
+		return;
+
+	if (compressState->ntuples == 0)
+		to_insert = compressState->itupprev;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+											 compressState->ipd,
+											 compressState->ntuples);
+		to_insert = postingtuple;
+		pfree(compressState->ipd);
+	}
+
+	_bt_buildadd(wstate, state, to_insert);
+
+	if (compressState->ntuples > 0)
+		pfree(to_insert);
+	compressState->ntuples = 0;
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in compressState.
+ *
+ * Helper function for _bt_load() and _bt_compress_one_page().
+ *
+ * Note: caller is responsible for size check to ensure that resulting tuple
+ * won't exceed BTMaxItemSize.
+ */
+void
+_bt_add_posting_item(BTCompressState *compressState, IndexTuple itup)
+{
+	int			nposting = 0;
+
+	if (compressState->ntuples == 0)
+	{
+		compressState->ipd = palloc0(compressState->maxitemsize);
+
+		if (BTreeTupleIsPosting(compressState->itupprev))
+		{
+			/* if itupprev is posting, add all its TIDs to the posting list */
+			nposting = BTreeTupleGetNPosting(compressState->itupprev);
+			memcpy(compressState->ipd,
+				   BTreeTupleGetPosting(compressState->itupprev),
+				   sizeof(ItemPointerData) * nposting);
+			compressState->ntuples += nposting;
+		}
+		else
+		{
+			memcpy(compressState->ipd, compressState->itupprev,
+				   sizeof(ItemPointerData));
+			compressState->ntuples++;
+		}
+	}
+
+	if (BTreeTupleIsPosting(itup))
+	{
+		/* if tuple is posting, add all its TIDs to the posting list */
+		nposting = BTreeTupleGetNPosting(itup);
+		memcpy(compressState->ipd + compressState->ntuples,
+			   BTreeTupleGetPosting(itup),
+			   sizeof(ItemPointerData) * nposting);
+		compressState->ntuples += nposting;
+	}
+	else
+	{
+		memcpy(compressState->ipd + compressState->ntuples, itup,
+			   sizeof(ItemPointerData));
+		compressState->ntuples++;
+	}
+}
+
 /*
  * Read tuples in correct sort order from tuplesort, and load them into
  * btree leaves.
@@ -1150,9 +1244,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	bool		load1;
 	TupleDesc	tupdes = RelationGetDescr(wstate->index);
 	int			i,
-				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+				natts = IndexRelationGetNumberOfAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		use_compression = false;
+	BTCompressState *compressState = NULL;
+
+	/*
+	 * Don't use compression for indexes with INCLUDEd columns and unique
+	 * indexes.
+	 */
+	use_compression = (IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+					   IndexRelationGetNumberOfAttributes(wstate->index) &&
+					   !wstate->index->rd_index->indisunique);
 
 	if (merge)
 	{
@@ -1266,19 +1371,89 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	}
 	else
 	{
-		/* merge is unnecessary */
-		while ((itup = tuplesort_getindextuple(btspool->sortstate,
-											   true)) != NULL)
+		if (!use_compression)
 		{
-			/* When we see first tuple, create first index page */
-			if (state == NULL)
-				state = _bt_pagestate(wstate, 0);
+			/* merge is unnecessary */
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+					state = _bt_pagestate(wstate, 0);
 
-			_bt_buildadd(wstate, state, itup);
+				_bt_buildadd(wstate, state, itup);
 
-			/* Report progress */
-			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
-										 ++tuples_done);
+				/* Report progress */
+				pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+											 ++tuples_done);
+			}
+		}
+		else
+		{
+			/* init compress state needed to build posting tuples */
+			compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+			compressState->ipd = NULL;
+			compressState->ntuples = 0;
+			compressState->itupprev = NULL;
+			compressState->maxitemsize = 0;
+			compressState->maxpostingsize = 0;
+
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+				{
+					state = _bt_pagestate(wstate, 0);
+					compressState->maxitemsize = BTMaxItemSize(state->btps_page);
+				}
+
+				if (compressState->itupprev != NULL)
+				{
+					int			n_equal_atts = _bt_keep_natts_fast(wstate->index,
+																   compressState->itupprev, itup);
+
+					if (n_equal_atts > natts)
+					{
+						/*
+						 * Tuples are equal. Create or update posting.
+						 *
+						 * Else If posting is too big, insert it on page and
+						 * continue.
+						 */
+						if ((compressState->ntuples + 1) * sizeof(ItemPointerData) <
+							compressState->maxpostingsize)
+							_bt_add_posting_item(compressState, itup);
+						else
+							_bt_buildadd_posting(wstate, state,
+												 compressState);
+					}
+					else
+					{
+						/*
+						 * Tuples are not equal. Insert itupprev into index.
+						 * Save current tuple for the next iteration.
+						 */
+						_bt_buildadd_posting(wstate, state, compressState);
+					}
+				}
+
+				/*
+				 * Save the tuple to compare it with the next one and maybe
+				 * unite them into a posting tuple.
+				 */
+				if (compressState->itupprev)
+					pfree(compressState->itupprev);
+				compressState->itupprev = CopyIndexTuple(itup);
+
+				/* compute max size of posting list */
+				compressState->maxpostingsize = compressState->maxitemsize -
+					IndexInfoFindDataOffset(compressState->itupprev->t_info) -
+					MAXALIGN(IndexTupleSize(compressState->itupprev));
+			}
+
+			/* Handle the last item */
+			_bt_buildadd_posting(wstate, state, compressState);
 		}
 	}
 
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index a7882fd874..fbb12dbff1 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -492,6 +492,13 @@ _bt_recsplitloc(FindSplitData *state,
 	 * adding a heap TID to the left half's new high key when splitting at the
 	 * leaf level.  In practice the new high key will often be smaller and
 	 * will rarely be larger, but conservatively assume the worst case.
+	 *
+	 * FIXME: We can make better choices about split points by being clever
+	 * about the BTreeTupleIsPosting() case here.  All we need to do is
+	 * subtract the whole size of the posting list, then add
+	 * MAXALIGN(sizeof(ItemPointerData)), since we know for sure that
+	 * _bt_truncate() won't make a final high key that is larger even in the
+	 * worst case.
 	 */
 	if (state->is_leaf)
 		leftfree -= (int16) (firstrightitemsz +
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 93fab264ae..a6eee1bcd4 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -111,8 +111,21 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	key->nextkey = false;
 	key->pivotsearch = false;
 	key->keysz = Min(indnkeyatts, tupnatts);
-	key->scantid = key->heapkeyspace && itup ?
-		BTreeTupleGetHeapTID(itup) : NULL;
+
+	/*
+	 * XXX: Do we need to have both BTreeTupleGetHeapTID() and
+	 * BTreeTupleGetMinTID()?
+	 */
+	if (itup && key->heapkeyspace)
+	{
+		if (!BTreeTupleIsPivot(itup))
+			key->scantid = BTreeTupleGetMinTID(itup);
+		else
+			key->scantid = BTreeTupleGetHeapTID(itup);
+	}
+	else
+		key->scantid = NULL;
+
 	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
@@ -1787,7 +1800,9 @@ _bt_killitems(IndexScanDesc scan)
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			/* No microvacuum for posting tuples */
+			if (!BTreeTupleIsPosting(ituple) &&
+				(ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
 			{
 				/* found the item */
 				ItemIdMarkDead(iid);
@@ -2145,6 +2160,16 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(firstright))
+		{
+			BTreeTupleClearBtIsPosting(pivot);
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			pivot->t_info &= ~INDEX_SIZE_MASK;
+			pivot->t_info |= BTreeTupleGetPostingOffset(firstright);
+		}
+
+		Assert(!BTreeTupleIsPosting(pivot));
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2161,6 +2186,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * attribute to the new pivot tuple.
 		 */
 		Assert(natts != nkeyatts);
+		Assert(!BTreeTupleIsPosting(lastleft));
+		Assert(!BTreeTupleIsPosting(firstright));
 		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
 		tidpivot = palloc0(newsize);
 		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2168,6 +2195,27 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		pfree(pivot);
 		pivot = tidpivot;
 	}
+	else if (BTreeTupleIsPosting(firstright))
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal. But
+		 * the tuple is a compressed tuple with a posting list, so we still
+		 * must truncate it.
+		 *
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		newsize = BTreeTupleGetPostingOffset(firstright) +
+			MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+		pivot->t_info &= ~INDEX_SIZE_MASK;
+		pivot->t_info |= newsize;
+		BTreeTupleClearBtIsPosting(pivot);
+		BTreeTupleSetAltHeapTID(pivot);
+
+		Assert(!BTreeTupleIsPosting(pivot));
+	}
 	else
 	{
 		/*
@@ -2205,7 +2253,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2216,9 +2264,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+							  BTreeTupleGetMinTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetMinTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetMinTID(firstright)) < 0);
 #else
 
 	/*
@@ -2231,7 +2282,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMinTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2240,7 +2291,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetMinTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2330,6 +2382,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * leaving excessive amounts of free space on either side of page split.
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts.
+ *
+ * To build a posting tuple we need to ensure that all attributes
+ * of both tuples are equal. Use this function to compare them.
+ * TODO: maybe it's worth to rename the function.
+ *
+ * XXX: Obviously we need infrastructure for making sure it is okay to use
+ * this for posting list stuff.  For example, non-deterministic collations
+ * cannot use compression, and will not work with what we have now.
+ *
+ * XXX: Even then, we probably also need to worry about TOAST as a special
+ * case.  Don't repeat bugs like the amcheck bug that was fixed in commit
+ * eba775345d23d2c999bbb412ae658b6dab36e3e8.  As the test case added in that
+ * commit shows, we need to worry about pg_attribute.attstorage changing in
+ * the underlying table due to an ALTER TABLE (and maybe a few other things
+ * like that).  In general, the "TOAST input state" of a TOASTable datum isn't
+ * something that we make many guarantees about today, so even with C
+ * collation text we could in theory get different answers from
+ * _bt_keep_natts_fast() and _bt_keep_natts().  This needs to be nailed down
+ * in some way.
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2415,7 +2486,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * Non-pivot tuples currently never use alternative heap TID
 			 * representation -- even those within heapkeyspace indexes
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
@@ -2470,7 +2541,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * that to decide if the tuple is a pre-v11 tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
+				(!BTreeTupleIsPivot(itup) &&
 				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
 		}
 		else
@@ -2497,7 +2568,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
 		return false;
 
 	/*
@@ -2549,6 +2620,8 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 	if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
 		return;
 
+	/* TODO correct error messages for posting tuples */
+
 	/*
 	 * Internal page insertions cannot fail here, because that would mean that
 	 * an earlier leaf level insertion that should have failed didn't
@@ -2575,3 +2648,79 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Given a basic tuple that contains key datum and posting list,
+ * build a posting tuple.
+ *
+ * Basic tuple can be a posting tuple, but we only use key part of it,
+ * all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1 fallback to building a non-posting tuple.
+ * It is necessary to avoid storage overhead after posting tuple was vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+	uint32		keysize,
+				newsize = 0;
+	IndexTuple	itup;
+
+	/* We only need key part of the tuple */
+	if (BTreeTupleIsPosting(tuple))
+		keysize = BTreeTupleGetPostingOffset(tuple);
+	else
+		keysize = IndexTupleSize(tuple);
+
+	Assert(nipd > 0);
+
+	/* Add space needed for posting list */
+	if (nipd > 1)
+		newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+	else
+		newsize = keysize;
+
+	newsize = MAXALIGN(newsize);
+	itup = palloc0(newsize);
+	memcpy(itup, tuple, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+
+	if (nipd > 1)
+	{
+		/* Form posting tuple, fill posting fields */
+
+		/* Set meta info about the posting list */
+		itup->t_info |= INDEX_ALT_TID_MASK;
+		BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+		/* sort the list to preserve TID order invariant */
+		qsort((void *) ipd, nipd, sizeof(ItemPointerData),
+			  (int (*) (const void *, const void *)) ItemPointerCompare);
+
+		/* Copy posting list into the posting tuple */
+		memcpy(BTreeTupleGetPosting(itup), ipd,
+			   sizeof(ItemPointerData) * nipd);
+	}
+	else
+	{
+		/* To finish building of a non-posting tuple, copy TID from ipd */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		ItemPointerCopy(ipd, &itup->t_tid);
+	}
+
+	return itup;
+}
+
+/*
+ * Opposite of BTreeFormPostingTuple.
+ * returns regular tuple that contains the key,
+ * the tid of the new tuple is the nth tid of original tuple's posting list
+ * result tuple palloc'd in a caller's context.
+ */
+IndexTuple
+BTreeGetNthTupleOfPosting(IndexTuple tuple, int n)
+{
+	Assert(BTreeTupleIsPosting(tuple));
+	return BTreeFormPostingTuple(tuple, BTreeTupleGetPostingN(tuple, n), 1);
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c1aa..5b30e36d27 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -386,8 +386,8 @@ btree_xlog_vacuum(XLogReaderState *record)
 	Buffer		buffer;
 	Page		page;
 	BTPageOpaque opaque;
-#ifdef UNUSED
 	xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
 
 	/*
 	 * This section of code is thought to be no longer needed, after analysis
@@ -478,14 +478,35 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		if (len > 0)
 		{
-			OffsetNumber *unused;
-			OffsetNumber *unend;
+			if (xlrec->nremaining)
+			{
+				int			i;
+				OffsetNumber *remainingoffset;
+				IndexTuple	remaining;
+				Size		itemsz;
 
-			unused = (OffsetNumber *) ptr;
-			unend = (OffsetNumber *) ((char *) ptr + len);
+				remainingoffset = (OffsetNumber *)
+					(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+				remaining = (IndexTuple) ((char *) remainingoffset +
+										  xlrec->nremaining * sizeof(OffsetNumber));
 
-			if ((unend - unused) > 0)
-				PageIndexMultiDelete(page, unused, unend - unused);
+				/* Handle posting tuples */
+				for (i = 0; i < xlrec->nremaining; i++)
+				{
+					PageIndexTupleDelete(page, remainingoffset[i]);
+
+					itemsz = MAXALIGN(IndexTupleSize(remaining));
+
+					if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+									false, false) == InvalidOffsetNumber)
+						elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+					remaining = (IndexTuple) ((char *) remaining + itemsz);
+				}
+			}
+
+			if (xlrec->ndeleted)
+				PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 		}
 
 		/*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index a14eb792ec..e4fa99ad27 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -46,8 +46,10 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "lastBlockVacuumed %u",
-								 xlrec->lastBlockVacuumed);
+				appendStringInfo(buf, "lastBlockVacuumed %u; nremaining %u; ndeleted %u",
+								 xlrec->lastBlockVacuumed,
+								 xlrec->nremaining,
+								 xlrec->ndeleted);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 744ffb6c61..85ee040428 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -141,6 +141,11 @@ typedef IndexAttributeBitMapData * IndexAttributeBitMap;
  * On such a page, N tuples could take one MAXALIGN quantum less space than
  * estimated here, seemingly allowing one more tuple than estimated here.
  * But such a page always has at least MAXALIGN special space, so we're safe.
+ *
+ * Note: btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so they may contain more tuples.
+ * Use MaxPostingIndexTuplesPerPage instead.
+
  */
 #define MaxIndexTuplesPerPage	\
 	((int) ((BLCKSZ - SizeOfPageHeaderData) / \
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 83e0e6c28e..d3e3cea60a 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
  * All other types of index tuples ("pivot" tuples) only have key columns,
  * since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,39 @@ typedef struct BTMetaPageData
  * omitted rather than truncated, since its representation is different to
  * the non-pivot representation.)
  *
+ * Non-pivot posting tuple format:
+ *  t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicated keys more effectively,
+ * we use special format of tuples - posting tuples.
+ * posting_list is an array of ItemPointerData.
+ *
+ * This type of compression never applies to system indexes, unique indexes
+ * or indexes with INCLUDEd columns.
+ *
+ * To differ posting tuples we use INDEX_ALT_TID_MASK flag in t_info and
+ * BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's tid:
+ * - t_tid.ip_blkid contains offset of the posting list.
+ * - t_tid offset field contains number of posting items this tuple contain
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * tuples, which is constrainted by BTMaxItemSize.
+
+ * If page contains so many duplicates, that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), page may contain several posting
+ * tuples with the same key.
+ * Also page can contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
  * Pivot tuple format:
  *
  *  t_tid | t_info | key values | [heap TID]
@@ -281,23 +313,157 @@ typedef struct BTMetaPageData
  * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
  * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
  * number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinct posting tuples from pivot tuples.
  *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_N_POSTING_OFFSET_MASK	0x0FFF
 #define BT_HEAP_TID_ATTR			0x1000
+#define BT_IS_POSTING				0x2000
 
-/* Get/set downlink block number */
+#define BTreeTupleIsPosting(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+	)
+
+#define BTreeTupleIsPivot(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+	)
+
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger then
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+	((int) ((BLCKSZ - SizeOfPageHeaderData - \
+			3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+			(sizeof(ItemPointerData)))
+
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * Iterating over tuples during index build or applying compression to a
+ * single page, we remember a tuple in itupprev, then compare the next one
+ * with it. If tuples are equal, save their TIDs in the posting list.
+ * ntuples contains the size of the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTCompressState
+{
+	Size		maxitemsize;
+	Size		maxpostingsize;
+	IndexTuple	itupprev;
+	int			ntuples;
+	ItemPointerData *ipd;
+} BTCompressState;
+
+/*
+ * For use in _bt_compress_one_page().
+ * If there is only a few uncompressed items on a page,
+ * it isn't worth to apply compression.
+ * Currently it is just a magic number,
+ * proper benchmarking will probably help to choose better value.
+ */
+#define BT_COMPRESS_THRESHOLD 10
+
+/* macros to work with posting tuples *BEGIN* */
+#define BTreeTupleSetBtIsPosting(itup) \
+	do { \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleGetNPosting(itup)	\
+	( \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+	)
+
+#define BTreeTupleSetNPosting(itup, n) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+		BTreeTupleSetBtIsPosting(itup); \
+	} while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list.
+ * Caller is responsible for checking BTreeTupleIsPosting to ensure that
+ * it will get what he expects
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
+#define BTreeTupleSetPostingOffset(itup, offset) \
+	ItemPointerSetBlockNumber(&((itup)->t_tid), (offset))
+
+#define BTreeSetPostingMeta(itup, nposting, off) \
+	do { \
+		BTreeTupleSetNPosting(itup, nposting); \
+		BTreeTupleSetPostingOffset(itup, off); \
+	} while(0)
+
+#define BTreeTupleGetPosting(itup) \
+	(ItemPointerData*) ((char*)(itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+	(ItemPointerData*) (BTreeTupleGetPosting(itup) + (n))
+
+/*
+ * Posting tuples always contain several TIDs.
+ * Some functions that use TID as a tiebreaker,
+ * to ensure correct order of TID keys they can use two macros below:
+ */
+#define BTreeTupleGetMinTID(itup) \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+		( \
+			(ItemPointer) BTreeTupleGetPosting(itup) \
+		) \
+		: \
+		(ItemPointer) &((itup)->t_tid) \
+	)
+#define BTreeTupleGetMaxTID(itup) \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+		( \
+			(ItemPointer) (BTreeTupleGetPosting(itup) + (BTreeTupleGetNPosting(itup)-1)) \
+		) \
+		: \
+		(ItemPointer) &((itup)->t_tid) \
+	)
+/* macros to work with posting tuples *END* */
+
+/* Get/set downlink block number  */
 #define BTreeInnerTupleGetDownLink(itup) \
 	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
 #define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,7 +492,8 @@ typedef struct BTMetaPageData
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
 		( \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
@@ -335,6 +502,7 @@ typedef struct BTMetaPageData
 	)
 #define BTreeTupleSetNAtts(itup, n) \
 	do { \
+		Assert(!BTreeTupleIsPosting(itup)); \
 		(itup)->t_info |= INDEX_ALT_TID_MASK; \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
 	} while(0)
@@ -342,6 +510,8 @@ typedef struct BTMetaPageData
 /*
  * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
  * and non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * For non-pivot posting tuple it returns the first tid from posting list.
  */
 #define BTreeTupleGetHeapTID(itup) \
 	( \
@@ -351,7 +521,10 @@ typedef struct BTMetaPageData
 		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
 					   sizeof(ItemPointerData)) \
 	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+	  : (itup)->t_info & INDEX_ALT_TID_MASK ? \
+		(((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0) ? \
+		(ItemPointer) BTreeTupleGetPosting(itup) : NULL) \
+		: (ItemPointer) &((itup)->t_tid) \
 	)
 /*
  * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
@@ -360,6 +533,7 @@ typedef struct BTMetaPageData
 #define BTreeTupleSetAltHeapTID(itup) \
 	do { \
 		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
 								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
 	} while(0)
@@ -500,6 +674,12 @@ typedef struct BTInsertStateData
 	/* Buffer containing leaf page we're likely to insert itup on */
 	Buffer		buf;
 
+	/*
+	 * if _bt_binsrch_insert() found the location inside existing posting
+	 * list, save the position inside the list.
+	 */
+	int			in_posting_offset;
+
 	/*
 	 * Cache of bounds within the current buffer.  Only used for insertions
 	 * where _bt_check_unique is called.  See _bt_binsrch_insert and
@@ -567,6 +747,8 @@ typedef struct BTScanPosData
 	 * location in the associated tuple storage workspace.
 	 */
 	int			nextTupleOffset;
+	/* prevTupleOffset is for posting list handling */
+	int			prevTupleOffset;
 
 	/*
 	 * The items array is always ordered in index order (ie, increasing
@@ -579,7 +761,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxPostingIndexTuplesPerPage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -763,6 +945,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems, Relation heapRel);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems,
+								OffsetNumber *remainingoffset,
+								IndexTuple *remaining, int nremaining,
 								BlockNumber lastBlockVacuumed);
 extern int	_bt_pagedel(Relation rel, Buffer buf);
 
@@ -775,6 +959,8 @@ extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
 							bool forupdate, BTStack stack, int access, Snapshot snapshot);
 extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
 extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
+extern int32 _bt_compare_posting(Relation rel, BTScanInsert key, Page page,
+								 OffsetNumber offnum, int *in_posting_offset);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -813,6 +999,9 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd,
+										int nipd);
+extern IndexTuple BTreeGetNthTupleOfPosting(IndexTuple tuple, int n);
 
 /*
  * prototypes for functions in nbtvalidate.c
@@ -825,5 +1014,7 @@ extern bool btvalidate(Oid opclassoid);
 extern IndexBuildResult *btbuild(Relation heap, Relation index,
 								 struct IndexInfo *indexInfo);
 extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+extern void _bt_add_posting_item(BTCompressState *compressState,
+								 IndexTuple itup);
 
 #endif							/* NBTREE_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 9beccc86ea..6f60ca5f7b 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -173,10 +173,19 @@ typedef struct xl_btree_vacuum
 {
 	BlockNumber lastBlockVacuumed;
 
-	/* TARGET OFFSET NUMBERS FOLLOW */
+	/*
+	 * This field helps us to find beginning of the remaining tuples from
+	 * postings which follow array of offset numbers.
+	 */
+	uint32		nremaining;
+	uint32		ndeleted;
+
+	/* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+	/* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
+	/* TARGET OFFSET NUMBERS FOLLOW (if any) */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(BlockNumber))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.
-- 
2.17.1

#61

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Peter Geoghegan (#60)

1 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Tue, Jul 23, 2019 at 6:22 PM Peter Geoghegan <pg@bowt.ie> wrote:

Attached is a revised version of your v2 that fixes this issue -- I'll
call this v3.

Remember that index that I said was 5.5x smaller with the patch
applied, following retail insertions (a single big INSERT ... SELECT
...)? Well, it's 6.5x faster with this small additional patch applied
on top of the v3 I posted yesterday. Many of the indexes in my test
suite are about ~20% smaller __in addition to__ very big size
reductions. Some are even ~30% smaller than they were with v3 of the
patch. For example, the fair use implementation of TPC-H that my test
data comes from has an index on the "orders" o_orderdate column, named
idx_orders_orderdate, which is made ~30% smaller by the addition of
this simple patch (once again, this is following a single big INSERT
... SELECT ...). This change makes idx_orders_orderdate ~3.3x smaller
than it is with master/Postgres 12, in case you were wondering.

This new patch teaches nbtsplitloc.c to subtract posting list overhead
when sizing the new high key for the left half of a candidate split
point, since we know for sure that _bt_truncate() will at least manage
to truncate away that much from the new high key, even in the worst
case. Since posting lists are often very large, this can make a big
difference. This is actually just a bugfix, not a new idea -- I merely
made nbtsplitloc.c understand how truncation works with posting lists.

There seems to be a kind of "synergy" between the nbtsplitloc.c
handling of pages that have lots of duplicates and posting list
compression. It seems as if the former mechanism "sets up the bowling
pins", while the latter mechanism "knocks them down", which is really
cool. We should try to gain a better understanding of how that works,
because it's possible that it could be even more effective in some
cases.

--
Peter Geoghegan

Attachments:

0003-Account-for-posting-list-overhead-during-splits.patchapplication/octet-stream; name=0003-Account-for-posting-list-overhead-during-splits.patchDownload

From 36147525a12101d8bde6c00a238759cd371eefcc Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 24 Jul 2019 14:35:13 -0700
Subject: [PATCH 3/3] Account for posting list overhead during splits.

---
 src/backend/access/nbtree/nbtsplitloc.c | 37 +++++++++++++++++++------
 1 file changed, 29 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index fbb12dbff1..77e1d46672 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -459,6 +459,7 @@ _bt_recsplitloc(FindSplitData *state,
 	int16		leftfree,
 				rightfree;
 	Size		firstrightitemsz;
+	Size		postingsubhikey = 0;
 	bool		newitemisfirstonright;
 
 	/* Is the new item going to be the first item on the right page? */
@@ -466,10 +467,33 @@ _bt_recsplitloc(FindSplitData *state,
 							 && !newitemonleft);
 
 	if (newitemisfirstonright)
+	{
 		firstrightitemsz = state->newitemsz;
+
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+			postingsubhikey = IndexTupleSize(state->newitem) -
+							  BTreeTupleGetPostingOffset(state->newitem);
+	}
 	else
+	{
 		firstrightitemsz = firstoldonrightsz;
 
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf)
+		{
+			ItemId	 itemid;
+			IndexTuple newhighkey;
+
+			itemid = PageGetItemId(state->page, firstoldonright);
+			newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+			if (BTreeTupleIsPosting(newhighkey))
+				postingsubhikey = IndexTupleSize(newhighkey) -
+								  BTreeTupleGetPostingOffset(newhighkey);
+		}
+	}
+
 	/* Account for all the old tuples */
 	leftfree = state->leftspace - olddataitemstoleft;
 	rightfree = state->rightspace -
@@ -492,16 +516,13 @@ _bt_recsplitloc(FindSplitData *state,
 	 * adding a heap TID to the left half's new high key when splitting at the
 	 * leaf level.  In practice the new high key will often be smaller and
 	 * will rarely be larger, but conservatively assume the worst case.
-	 *
-	 * FIXME: We can make better choices about split points by being clever
-	 * about the BTreeTupleIsPosting() case here.  All we need to do is
-	 * subtract the whole size of the posting list, then add
-	 * MAXALIGN(sizeof(ItemPointerData)), since we know for sure that
-	 * _bt_truncate() won't make a final high key that is larger even in the
-	 * worst case.
+	 * Truncation always truncates away any posting list that appears in the
+	 * first right tuple, though, so it's safe to subtract that overhead
+	 * (while still conservatively assuming that truncation might have to add
+	 * back a single heap TID using the pivot tuple heap TID representation).
 	 */
 	if (state->is_leaf)
-		leftfree -= (int16) (firstrightitemsz +
+		leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
 							 MAXALIGN(sizeof(ItemPointerData)));
 	else
 		leftfree -= (int16) firstrightitemsz;
-- 
2.17.1

#62

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Peter Geoghegan (#61)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Wed, Jul 24, 2019 at 3:06 PM Peter Geoghegan <pg@bowt.ie> wrote:

There seems to be a kind of "synergy" between the nbtsplitloc.c
handling of pages that have lots of duplicates and posting list
compression. It seems as if the former mechanism "sets up the bowling
pins", while the latter mechanism "knocks them down", which is really
cool. We should try to gain a better understanding of how that works,
because it's possible that it could be even more effective in some
cases.

I found another important way in which this synergy can fail to take
place, which I can fix.

By removing the BT_COMPRESS_THRESHOLD limit entirely, certain indexes
from my test suite become much smaller, while most are not affected.
These indexes were not helped too much by the patch before. For
example, the TPC-E i_t_st_id index is 50% smaller. It is entirely full
of duplicates of a single value (that's how it appears after an
initial TPC-E bulk load), as are a couple of other TPC-E indexes.
TPC-H's idx_partsupp_partkey index becomes ~18% smaller, while its
idx_lineitem_orderkey index becomes ~15% smaller.

I believe that this happened because rightmost page splits were an
inefficient case for compression. But rightmost page split heavy
indexes with lots of duplicates are not that uncommon. Think of any
index with many NULL values, for example.

I don't know for sure if BT_COMPRESS_THRESHOLD should be removed. I'm
not sure what the idea is behind it. My sense is that we're likely to
benefit by delaying page splits, no matter what. Though I am still
looking at it purely from a space utilization point of view, at least
for now.

--
Peter Geoghegan

#63

Rafia Sabih

rafia.pghackers@gmail.com

over 6 years ago

In reply to: Peter Geoghegan (#62)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Thu, 25 Jul 2019 at 05:49, Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Jul 24, 2019 at 3:06 PM Peter Geoghegan <pg@bowt.ie> wrote:

There seems to be a kind of "synergy" between the nbtsplitloc.c
handling of pages that have lots of duplicates and posting list
compression. It seems as if the former mechanism "sets up the bowling
pins", while the latter mechanism "knocks them down", which is really
cool. We should try to gain a better understanding of how that works,
because it's possible that it could be even more effective in some
cases.

I found another important way in which this synergy can fail to take
place, which I can fix.

By removing the BT_COMPRESS_THRESHOLD limit entirely, certain indexes
from my test suite become much smaller, while most are not affected.
These indexes were not helped too much by the patch before. For
example, the TPC-E i_t_st_id index is 50% smaller. It is entirely full
of duplicates of a single value (that's how it appears after an
initial TPC-E bulk load), as are a couple of other TPC-E indexes.
TPC-H's idx_partsupp_partkey index becomes ~18% smaller, while its
idx_lineitem_orderkey index becomes ~15% smaller.

I believe that this happened because rightmost page splits were an
inefficient case for compression. But rightmost page split heavy
indexes with lots of duplicates are not that uncommon. Think of any
index with many NULL values, for example.

I don't know for sure if BT_COMPRESS_THRESHOLD should be removed. I'm
not sure what the idea is behind it. My sense is that we're likely to
benefit by delaying page splits, no matter what. Though I am still
looking at it purely from a space utilization point of view, at least
for now.

Minor comment fix, pointes-->pointer, plus, are we really doing the
half, or is it just splitting into two.
/*
+ * Split posting tuple into two halves.
+ *
+ * Left tuple contains all item pointes less than the new one and
+ * right tuple contains new item pointer and all to the right.
+ *
+ * TODO Probably we can come up with more clever algorithm.
+ */

Some remains of 'he'.
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list.
+ * Caller is responsible for checking BTreeTupleIsPosting to ensure that
+ * it will get what he expects
+ */

Everything reads just fine without 'us'.
/*
+ * This field helps us to find beginning of the remaining tuples from
+ * postings which follow array of offset numbers.
+ */
-- 
Regards,
Rafia Sabih

#64

Anastasia Lubennikova

a.lubennikova@postgrespro.ru

over 6 years ago

In reply to: Peter Geoghegan (#60)

1 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

24.07.2019 4:22, Peter Geoghegan wrote:

Attached is a revised version of your v2 that fixes this issue -- I'll
call this v3. In general, my goal for the revision was to make sure
that all of my old tests from the v12 work passed, and to make sure
that amcheck can detect almost any possible problem. I tested the
amcheck changes by corrupting random state in a test index using
pg_hexedit, then making sure that amcheck actually complained in each
case.

I also fixed one or two bugs in passing, including the bug that caused
an assertion failure in _bt_truncate(). That was down to a subtle
off-by-one issue within _bt_insertonpg_in_posting(). Overall, I didn't
make that many changes to your v2. There are probably some things
about the patch that I still don't understand, or things that I have
misunderstood.

Thank you for this review and fixes.

* Changed the custom binary search code within _bt_compare_posting()
to look more like _bt_binsrch() and _bt_binsrch_insert(). Do you know
of any reason not to do it that way?

It's ok to update it. There was no particular reason, just my habit.

* Added quite a few "FIXME"/"XXX" comments at various points, to
indicate where I have general concerns that need more discussion.

+ * FIXME: The calls to BTreeGetNthTupleOfPosting() allocate
memory,

If we only need to check TIDs, we don't need BTreeGetNthTupleOfPosting(),
we can use BTreeTupleGetPostingN() instead and iterate over TIDs, not
tuples.

Fixed in version 4.

* Included my own pageinspect hack to visualize the minimum TIDs in
posting lists. It's broken out into a separate patch file. The code is
very rough, but it might help someone else, so I thought I'd include
it.

Cool, I think we should add it to the final patchset,
probably, as separate function by analogy with tuple_data_split.

I also have some new concerns about the code in the patch that I will
point out now (though only as something to think about a solution on
-- I am unsure myself):

* It's a bad sign that compression involves calls to PageAddItem()
that are allowed to fail (we just give up on compression when that
happens). For one thing, all existing calls to PageAddItem() in
Postgres are never expected to fail -- if they do fail we get a "can't
happen" error that suggests corruption. It was a good idea to take
this approach to get the patch to work, and to prove the general idea,
but we now need to fully work out all the details about the use of
space. This includes complicated new questions around how alignment is
supposed to work.

The main reason to implement this gentle error handling is the fact that
deduplication could cause storage overhead, which leads to running out
of space
on the page.

First of all, it is a legacy of the previous versions where
BTreeFormPostingTuple was not able to form non-posting tuple even in case
where a number of posting items is 1.

Another case that was in my mind is the situation where we have 2 tuples:
t_tid | t_info | key + t_tid | t_info | key

and compressed result is:
t_tid | t_info | key | t_tid | t_tid

If sizeof(t_info) + sizeof(key) < sizeof(t_tid), resulting posting tuple
can be
larger. It may happen if keysize <= 4 byte.
In this situation original tuples must have been aligned to size 16
bytes each,
and resulting tuple is at most 24 bytes (6+2+4+6+6). So this case is
also safe.

I changed DEBUG message to ERROR in v4 and it passes all regression tests.
I doubt that it covers all corner cases, so I'll try to add more special
tests.

Alignment in nbtree is already complicated today -- you're supposed to
MAXALIGN() everything in nbtree, so that the MAXALIGN() within
bufpage.c routines cannot be different to the lp_len/IndexTupleSize()
length (note that heapam can have tuples whose lp_len isn't aligned,
so nbtree could do it differently if it proved useful). Code within
nbtsplitloc.c fully understands the space requirements for the
bufpage.c routines, and is very careful about it. (The bufpage.c
details are supposed to be totally hidden from code like
nbtsplitloc.c, but I guess that that ideal isn't quite possible in
reality. Code comments don't really explain the situation today.)

I'm not sure what it would look like for this patch to be as precise
about free space as nbtsplitloc.c already is, even though that seems
desirable (I just know that it would mean you would trust
PageAddItem() to work in all cases). The patch is different to what we
already have today in that it tries to add *less than* a single
MAXALIGN() quantum at a time in some places (when a posting list needs
to grow by one item). The devil is in the details.

* As you know, the current approach to WAL logging is very
inefficient. It's okay for now, but we'll need a fine-grained approach
for the patch to be commitable. I think that this is subtly related to
the last item (i.e. the one about alignment). I have done basic
performance tests using unlogged tables. The patch seems to either
make big INSERT queries run as fast or faster than before when
inserting into unlogged tables, which is a very good start.

* Since we can now split a posting list in two, we may also have to
reconsider BTMaxItemSize, or some similar mechanism that worries about
extreme cases where it becomes impossible to split because even two
pages are not enough to fit everything. Think of what happens when
there is a tuple with a single large datum, that gets split in two
(the tuple is split, not the page), with each half receiving its own
copy of the datum. I haven't proven to myself that this is broken, but
that may just be because I haven't spent any time on it. OTOH, maybe
you already have it right, in which case it seems like it should be
explained somewhere. Possibly in nbtree.h. This is tricky stuff.

Hmm, I can't get the problem.
In current implementation each posting tuple is smaller than BTMaxItemSize,
so no split can lead to having tuple of larger size.

* I agree with all of your existing TODO items -- most of them seem
very important to me.

* Do we really need to keep BTreeTupleGetHeapTID(), now that we have
BTreeTupleGetMinTID()? Can't we combine the two macros into one, so
that callers don't need to think about the pivot vs posting list thing
themselves? See the new code added to _bt_mkscankey() by v3, for
example. It now handles both cases/macros at once, in order to keep
its amcheck caller happy. amcheck's verify_nbtree.c received similar
ugly code in v3.

No, we don't need them both. I don't mind combining them into one macro.
Actually, we never needed BTreeTupleGetMinTID(),
since its functionality is covered by BTreeTupleGetHeapTID.
On the other hand, in some cases BTreeTupleGetMinTID() looks more readable.
For example here:

Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lefttup),

BTreeTupleGetMinTID(righttup)) < 0);

* We should at least experiment with applying compression when
inserting into unique indexes. Like Alexander, I think that
compression in unique indexes might work well, given how they must
work in Postgres.

The main reason why I decided to avoid applying compression to unique
indexes
is the performance of microvacuum. It is not applied to items inside a
posting
tuple. And I expect it to be important for unique indexes, which ideally
contain only a few live values.

One more thing I want to discuss:
/*
* We do not expect to meet any DEAD items, since this function is
* called right after _bt_vacuum_one_page(). If for some reason we
* found dead item, don't compress it, to allow upcoming microvacuum
* or vacuum clean it up.
*/
if (ItemIdIsDead(itemId))
continue;

In the previous review Rafia asked about "some reason".
Trying to figure out if this situation possible, I changed this line to
Assert(!ItemIdIsDead(itemId)) in our test version. And it failed in a
performance
test. Unfortunately, I was not able to reproduce it.
The explanation I see is that page had DEAD items, but for some reason
BTP_HAS_GARBAGE was not set so _bt_vacuum_one_page() was not called.
I find it difficult to understand what could lead to this situation,
so probably we need to inspect it closer to exclude the possibility of a
bug.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

v4-0001-Compression-deduplication-in-nbtree.patchtext/x-patch; name=v4-0001-Compression-deduplication-in-nbtree.patchDownload

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 55a3a4b..b8c1d03 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -889,6 +889,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer	scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -959,29 +960,73 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
 			char	   *itid,
 					   *htid;
+			ItemPointer tid = BTreeTupleGetMinTID(itup);
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+							ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("could not find tuple using search from root page in index \"%s\"",
 							RelationGetRelationName(state->rel)),
-					 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+					 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
 										itid, htid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is actually a posting list, make sure posting list TIDs
+		 * are in order.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer		current;
+
+			ItemPointerCopy(BTreeTupleGetMinTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+				current = BTreeTupleGetPostingN(itup, i);
+
+				if (ItemPointerCompare(current, &last) <= 0)
+				{
+					char	   *itid,
+							   *htid;
+
+					itid = psprintf("(%u,%u)", state->targetblock, offset);
+					htid = psprintf("(%u,%u)",
+									ItemPointerGetBlockNumberNoCheck(current),
+									ItemPointerGetOffsetNumberNoCheck(current));
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+												itid, htid,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1039,12 +1084,33 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				IndexTuple	onetup;
+
+				/* Fingerprint all elements of posting tuple one by one */
+				for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+					norm = bt_normalize_tuple(state, onetup);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != onetup)
+						pfree(norm);
+					pfree(onetup);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
@@ -1052,7 +1118,8 @@ bt_target_page_check(BtreeCheckState *state)
 		 *
 		 * If there is a high key (if this is not the rightmost page on its
 		 * entire level), check that high key actually is upper bound on all
-		 * page items.
+		 * page items.  If this is a posting list tuple, we'll need to set
+		 * scantid to be highest TID in posting list.
 		 *
 		 * We prefer to check all items against high key rather than checking
 		 * just the last and trusting that the operator class obeys the
@@ -1092,6 +1159,9 @@ bt_target_page_check(BtreeCheckState *state)
 		 * tuple. (See also: "Notes About Data Representation" in the nbtree
 		 * README.)
 		 */
+		scantid = skey->scantid;
+		if (!BTreeTupleIsPivot(itup))
+			skey->scantid = BTreeTupleGetMaxTID(itup);
 		if (!P_RIGHTMOST(topaque) &&
 			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
 			  invariant_l_offset(state, skey, P_HIKEY)))
@@ -1115,6 +1185,7 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
+		skey->scantid = scantid;
 
 		/*
 		 * * Item order check *
@@ -1129,11 +1200,16 @@ bt_target_page_check(BtreeCheckState *state)
 					   *htid,
 					   *nitid,
 					   *nhtid;
+			ItemPointer tid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			if (!BTreeTupleIsPivot(itup))
+				tid = BTreeTupleGetMinTID(itup);
+			else
+				tid = BTreeTupleGetHeapTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 			nitid = psprintf("(%u,%u)", state->targetblock,
 							 OffsetNumberNext(offset));
 
@@ -1142,9 +1218,14 @@ bt_target_page_check(BtreeCheckState *state)
 										  state->target,
 										  OffsetNumberNext(offset));
 			itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+			if (!BTreeTupleIsPivot(itup))
+				tid = BTreeTupleGetMinTID(itup);
+			else
+				tid = BTreeTupleGetHeapTID(itup);
 			nhtid = psprintf("(%u,%u)",
-							 ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							 ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							 ItemPointerGetBlockNumberNoCheck(tid),
+							 ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1154,10 +1235,10 @@ bt_target_page_check(BtreeCheckState *state)
 										"higher index tid=%s (points to %s tid=%s) "
 										"page lsn=%X/%X.",
 										itid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										htid,
 										nitid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										nhtid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
@@ -1918,10 +1999,11 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
  * verification.  In particular, it won't try to normalize opclass-equal
  * datums with potentially distinct representations (e.g., btree/numeric_ops
  * index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though.  For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have their own posting
+ * list, since dummy CREATE INDEX callback code generates new tuples with the
+ * same normalized representation.  Compression is performed
+ * opportunistically, and in general there is no guarantee about how or when
+ * compression will be applied.
  */
 static IndexTuple
 bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -2525,14 +2607,20 @@ static inline ItemPointer
 BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
 							bool nonpivot)
 {
-	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	ItemPointer result;
 	BlockNumber targetblock = state->targetblock;
 
-	if (result == NULL && nonpivot)
+	if (BTreeTupleIsPivot(itup) == nonpivot)
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
 						targetblock, RelationGetRelationName(state->rel))));
 
+	/* XXX: Again, I wonder if we need both of these macros... */
+	if (!BTreeTupleIsPivot(itup))
+		result = BTreeTupleGetMinTID(itup);
+	else
+		result = BTreeTupleGetHeapTID(itup);
+
 	return result;
 }
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 602f884..57b6bb5 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -41,6 +41,17 @@ static OffsetNumber _bt_findinsertloc(Relation rel,
 									  BTStack stack,
 									  Relation heapRel);
 static void _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack);
+static void _bt_delete_and_insert(Relation rel,
+								  Buffer buf,
+								  IndexTuple newitup,
+								  OffsetNumber newitemoff);
+static void _bt_insertonpg_in_posting(Relation rel, BTScanInsert itup_key,
+									  Buffer buf,
+									  Buffer cbuf,
+									  BTStack stack,
+									  IndexTuple itup,
+									  OffsetNumber newitemoff,
+									  bool split_only_page, int in_posting_offset);
 static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   Buffer buf,
 						   Buffer cbuf,
@@ -56,6 +67,8 @@ static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 						 OffsetNumber itup_off);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void insert_itupprev_to_page(Page page, BTCompressState *compressState);
+static void _bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel);
 
 /*
  *	_bt_doinsert() -- Handle insertion of a single index tuple in the tree.
@@ -297,10 +310,17 @@ top:
 		 * search bounds established within _bt_check_unique when insertion is
 		 * checkingunique.
 		 */
+		insertstate.in_posting_offset = 0;
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
-		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
-					   itup, newitemoff, false);
+
+		if (insertstate.in_posting_offset)
+			_bt_insertonpg_in_posting(rel, itup_key, insertstate.buf,
+									  InvalidBuffer, stack, itup, newitemoff,
+									  false, insertstate.in_posting_offset);
+		else
+			_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer,
+						   stack, itup, newitemoff, false);
 	}
 	else
 	{
@@ -759,6 +779,12 @@ _bt_findinsertloc(Relation rel,
 			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
 			insertstate->bounds_valid = false;
 		}
+
+		/*
+		 * If the target page is full, try to compress the page
+		 */
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
+			_bt_compress_one_page(rel, insertstate->buf, heapRel);
 	}
 	else
 	{
@@ -900,6 +926,191 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
 	insertstate->bounds_valid = false;
 }
 
+/*
+ * Delete tuple on newitemoff offset and insert newitup at the same offset.
+ * All checks of free space must have been done before calling this function.
+ *
+ * For use in posting tuple's update.
+ */
+static void
+_bt_delete_and_insert(Relation rel,
+					  Buffer buf,
+					  IndexTuple newitup,
+					  OffsetNumber newitemoff)
+{
+	Page		page = BufferGetPage(buf);
+	Size		newitupsz = IndexTupleSize(newitup);
+
+	newitupsz = MAXALIGN(newitupsz);
+
+	START_CRIT_SECTION();
+
+	PageIndexTupleDelete(page, newitemoff);
+
+	if (!_bt_pgaddtup(page, newitupsz, newitup, newitemoff))
+		elog(ERROR, "failed to insert compressed item in index \"%s\"",
+			 RelationGetRelationName(rel));
+
+	MarkBufferDirty(buf);
+
+	/* Xlog stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_btree_insert xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.offnum = newitemoff;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
+
+		Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+
+		/*
+		 * Force full page write to keep code simple
+		 *
+		 * TODO: think of using XLOG_BTREE_INSERT_LEAF with a new tuple's data
+		 */
+		XLogRegisterBuffer(0, buf, REGBUF_STANDARD | REGBUF_FORCE_IMAGE);
+		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_INSERT_LEAF);
+		PageSetLSN(page, recptr);
+	}
+
+	END_CRIT_SECTION();
+}
+
+/*
+ * _bt_insertonpg_in_posting() --
+ *		Insert a tuple on a particular page in the index
+ *		(compression aware version).
+ *
+ * If new tuple's key is equal to the key of a posting tuple that already
+ * exists on the page and it's TID falls inside the min/max range of
+ * existing posting list, update the posting tuple.
+ *
+ * It only can happen on leaf page.
+ *
+ * newitemoff - offset of the posting tuple we must update
+ * in_posting_offset - position of the new tuple's TID in posting list
+ *
+ * If necessary, split the page.
+ */
+static void
+_bt_insertonpg_in_posting(Relation rel,
+						  BTScanInsert itup_key,
+						  Buffer buf,
+						  Buffer cbuf,
+						  BTStack stack,
+						  IndexTuple itup,
+						  OffsetNumber newitemoff,
+						  bool split_only_page,
+						  int in_posting_offset)
+{
+	IndexTuple	origtup;
+	IndexTuple	lefttup;
+	IndexTuple	righttup;
+	ItemPointerData *ipd;
+	IndexTuple	newitup;
+	Page		page;
+	int			nipd,
+				nipd_right;
+
+	page = BufferGetPage(buf);
+	/* get old posting tuple */
+	origtup = (IndexTuple) PageGetItem(page, PageGetItemId(page, newitemoff));
+	Assert(BTreeTupleIsPosting(origtup));
+	nipd = BTreeTupleGetNPosting(origtup);
+	Assert(in_posting_offset < nipd);
+	Assert(itup_key->scantid != NULL);
+	Assert(itup_key->heapkeyspace);
+
+	elog(DEBUG4, "(%u,%u) is min, (%u,%u) is max, (%u,%u) is new",
+		 ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMinTID(origtup)),
+		 ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMinTID(origtup)),
+		 ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(origtup)),
+		 ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(origtup)),
+		 ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(itup)),
+		 ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(itup)));
+
+	/*
+	 * At first, check if the new itempointer fits into the tuple's posting
+	 * list.
+	 *
+	 * Also check if new itempointer fits into the page.
+	 *
+	 * If not, posting tuple's split is required in both cases.
+	 *
+	 * XXX: Think some more about alignment - pg
+	 */
+	if (BTMaxItemSize(page) < MAXALIGN(IndexTupleSize(origtup)) + MAXALIGN(sizeof(ItemPointerData)) ||
+		PageGetFreeSpace(page) < MAXALIGN(IndexTupleSize(origtup)) + MAXALIGN(sizeof(ItemPointerData)))
+	{
+		/*
+		 * Split posting tuple into two halves.
+		 *
+		 * Left tuple contains all item pointes less than the new one and
+		 * right tuple contains new item pointer and all to the right.
+		 *
+		 * TODO Probably we can come up with more clever algorithm.
+		 */
+		lefttup = BTreeFormPostingTuple(origtup, BTreeTupleGetPosting(origtup),
+										in_posting_offset);
+
+		nipd_right = nipd - in_posting_offset + 1;
+		ipd = palloc0(sizeof(ItemPointerData) * nipd_right);
+		/* insert new item pointer */
+		memcpy(ipd, itup, sizeof(ItemPointerData));
+		/* copy item pointers from original tuple that belong on right */
+		memcpy(ipd + 1,
+			   BTreeTupleGetPostingN(origtup, in_posting_offset),
+			   sizeof(ItemPointerData) * (nipd - in_posting_offset));
+
+		righttup = BTreeFormPostingTuple(origtup, ipd, nipd_right);
+		elog(DEBUG4, "inserting inside posting list with split due to no space orig elements %d new off %d",
+			 nipd, in_posting_offset);
+
+		Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lefttup),
+								  BTreeTupleGetMinTID(righttup)) < 0);
+
+		/*
+		 * Replace old tuple with a left tuple on a page.
+		 *
+		 * And insert righttuple using ordinary _bt_insertonpg() function If
+		 * split is required, _bt_insertonpg will handle it.
+		 */
+		_bt_delete_and_insert(rel, buf, lefttup, newitemoff);
+		_bt_insertonpg(rel, itup_key, buf, InvalidBuffer,
+					   stack, righttup, newitemoff + 1, false);
+
+		pfree(ipd);
+		pfree(lefttup);
+		pfree(righttup);
+	}
+	else
+	{
+		ipd = palloc0(sizeof(ItemPointerData) * (nipd + 1));
+		elog(DEBUG4, "inserting inside posting list due to apparent overlap");
+
+		/* copy item pointers from original tuple into ipd */
+		memcpy(ipd, BTreeTupleGetPosting(origtup),
+			   sizeof(ItemPointerData) * in_posting_offset);
+		/* add item pointer of the new tuple into ipd */
+		memcpy(ipd + in_posting_offset, itup, sizeof(ItemPointerData));
+		/* copy item pointers from old tuple into ipd */
+		memcpy(ipd + in_posting_offset + 1,
+			   BTreeTupleGetPostingN(origtup, in_posting_offset),
+			   sizeof(ItemPointerData) * (nipd - in_posting_offset));
+
+		newitup = BTreeFormPostingTuple(itup, ipd, nipd + 1);
+
+		_bt_delete_and_insert(rel, buf, newitup, newitemoff);
+
+		pfree(ipd);
+		pfree(newitup);
+		_bt_relbuf(rel, buf);
+	}
+}
+
 /*----------
  *	_bt_insertonpg() -- Insert a tuple on a particular page in the index.
  *
@@ -2286,3 +2497,221 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * the page.
 	 */
 }
+
+/*
+ * Add new item (compressed or not) to the page, while compressing it.
+ * If insertion failed, return false.
+ * Caller should consider this as compression failure and
+ * leave page uncompressed.
+ */
+static void
+insert_itupprev_to_page(Page page, BTCompressState *compressState)
+{
+	IndexTuple	to_insert;
+	OffsetNumber offnum = PageGetMaxOffsetNumber(page);
+
+	if (compressState->ntuples == 0)
+		to_insert = compressState->itupprev;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+											 compressState->ipd,
+											 compressState->ntuples);
+		to_insert = postingtuple;
+		pfree(compressState->ipd);
+	}
+
+	/* Add the new item into the page */
+	offnum = OffsetNumberNext(offnum);
+
+	elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d IndexTupleSize %zu free %zu",
+		 compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+
+	if (PageAddItem(page, (Item) to_insert, IndexTupleSize(to_insert),
+					offnum, false, false) == InvalidOffsetNumber)
+	{
+		if (compressState->ntuples > 0)
+			pfree(to_insert);
+		elog(ERROR, "failed to add tuple to page while compresing it");
+	}
+
+	if (compressState->ntuples > 0)
+		pfree(to_insert);
+	compressState->ntuples = 0;
+}
+
+/*
+ * Before splitting the page, try to compress items to free some space.
+ * If compression didn't succeed, buffer will contain old state of the page.
+ * This function should be called after lp_dead items
+ * were removed by _bt_vacuum_one_page().
+ */
+static void
+_bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buffer);
+	Page		newpage;
+	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	bool		use_compression = false;
+	BTCompressState *compressState = NULL;
+	int			n_posting_on_page = 0;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+
+	/*
+	 * Don't use compression for indexes with INCLUDEd columns and unique
+	 * indexes.
+	 */
+	use_compression = (IndexRelationGetNumberOfKeyAttributes(rel) ==
+					   IndexRelationGetNumberOfAttributes(rel) &&
+					   !rel->rd_index->indisunique);
+	if (!use_compression)
+		return;
+
+	/* init compress state needed to build posting tuples */
+	compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+	compressState->ipd = NULL;
+	compressState->ntuples = 0;
+	compressState->itupprev = NULL;
+	compressState->maxitemsize = BTMaxItemSize(page);
+	compressState->maxpostingsize = 0;
+
+	/*
+	 * Scan over all items to see which ones can be compressed
+	 */
+	minoff = P_FIRSTDATAKEY(opaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/*
+	 * Heuristic to avoid trying to compress page that has already contain
+	 * mostly compressed items
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, P_HIKEY);
+		IndexTuple	item = (IndexTuple) PageGetItem(page, itemid);
+
+		if (BTreeTupleIsPosting(item))
+			n_posting_on_page++;
+	}
+
+	/*
+	 * If we have only a few uncompressed items on the full page, it isn't
+	 * worth to compress them
+	 */
+	if (maxoff - n_posting_on_page < BT_COMPRESS_THRESHOLD)
+		return;
+
+	newpage = PageGetTempPageCopySpecial(page);
+	elog(DEBUG4, "_bt_compress_one_page rel: %s,blkno: %u",
+		 RelationGetRelationName(rel), BufferGetBlockNumber(buffer));
+
+	/* Copy High Key if any */
+	if (!P_RIGHTMOST(opaque))
+	{
+		ItemId		itemid = PageGetItemId(page, P_HIKEY);
+		Size		itemsz = ItemIdGetLength(itemid);
+		IndexTuple	item = (IndexTuple) PageGetItem(page, itemid);
+
+		if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+						false, false) == InvalidOffsetNumber)
+		{
+			/*
+			 * Should never happen. Anyway, fallback gently to scenario of
+			 * incompressible page and just return from function.
+			 */
+			elog(DEBUG4, "_bt_compress_one_page. failed to insert highkey to newpage");
+			return;
+		}
+	}
+
+	/*
+	 * Iterate over tuples on the page, try to compress them into posting
+	 * lists and insert into new page.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemId = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemId);
+
+		/*
+		 * We do not expect to meet any DEAD items, since this function is
+		 * called right after _bt_vacuum_one_page(). If for some reason we
+		 * found dead item, don't compress it, to allow upcoming microvacuum
+		 * or vacuum clean it up.
+		 */
+		if (ItemIdIsDead(itemId))
+			continue;
+
+		if (compressState->itupprev != NULL)
+		{
+			int			n_equal_atts =
+			_bt_keep_natts_fast(rel, compressState->itupprev, itup);
+			int			itup_ntuples = BTreeTupleIsPosting(itup) ?
+			BTreeTupleGetNPosting(itup) : 1;
+
+			if (n_equal_atts > natts)
+			{
+				/*
+				 * When tuples are equal, create or update posting.
+				 *
+				 * If posting is too big, insert it on page and continue.
+				 */
+				if (compressState->maxitemsize >
+					MAXALIGN(((IndexTupleSize(compressState->itupprev)
+							   + (compressState->ntuples + itup_ntuples + 1) * sizeof(ItemPointerData)))))
+				{
+					_bt_add_posting_item(compressState, itup);
+				}
+				else
+				{
+					insert_itupprev_to_page(newpage, compressState);
+				}
+			}
+			else
+			{
+				insert_itupprev_to_page(newpage, compressState);
+			}
+		}
+
+		/*
+		 * Copy the tuple into temp variable itupprev to compare it with the
+		 * following tuple and maybe unite them into a posting tuple
+		 */
+		if (compressState->itupprev)
+			pfree(compressState->itupprev);
+		compressState->itupprev = CopyIndexTuple(itup);
+
+		Assert(IndexTupleSize(compressState->itupprev) <= compressState->maxitemsize);
+	}
+
+	/* Handle the last item. */
+	insert_itupprev_to_page(newpage, compressState);
+
+	START_CRIT_SECTION();
+
+	PageRestoreTempPage(newpage, page);
+	MarkBufferDirty(buffer);
+
+	/* Log full page write */
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+
+		recptr = log_newpage_buffer(buffer, true);
+		PageSetLSN(page, recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	elog(DEBUG4, "_bt_compress_one_page. success");
+}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 5962126..707a5d0 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -983,14 +983,52 @@ _bt_page_recyclable(Page page)
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
 					OffsetNumber *itemnos, int nitems,
+					OffsetNumber *remainingoffset,
+					IndexTuple *remaining, int nremaining,
 					BlockNumber lastBlockVacuumed)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	Size		itemsz;
+	Size		remaining_sz = 0;
+	char	   *remaining_buf = NULL;
+
+	/* XLOG stuff, buffer for remainings */
+	if (nremaining && RelationNeedsWAL(rel))
+	{
+		Size		offset = 0;
+
+		for (int i = 0; i < nremaining; i++)
+			remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+		remaining_buf = palloc0(remaining_sz);
+		for (int i = 0; i < nremaining; i++)
+		{
+			itemsz = IndexTupleSize(remaining[i]);
+			memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+			offset += MAXALIGN(itemsz);
+		}
+		Assert(offset == remaining_sz);
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
+	/* Handle posting tuples here */
+	for (int i = 0; i < nremaining; i++)
+	{
+		/* At first, delete the old tuple. */
+		PageIndexTupleDelete(page, remainingoffset[i]);
+
+		itemsz = IndexTupleSize(remaining[i]);
+		itemsz = MAXALIGN(itemsz);
+
+		/* Add tuple with remaining ItemPointers to the page. */
+		if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to rewrite compressed item in index while doing vacuum");
+	}
+
 	/* Fix the page */
 	if (nitems > 0)
 		PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1058,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+		xlrec_vacuum.nremaining = nremaining;
+		xlrec_vacuum.ndeleted = nitems;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1073,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		if (nitems > 0)
 			XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
 
+		/*
+		 * Here we should save offnums and remaining tuples themselves. It's
+		 * important to restore them in correct order. At first, we must
+		 * handle remaining tuples and only after that other deleted items.
+		 */
+		if (nremaining > 0)
+		{
+			Assert(remaining_buf != NULL);
+			XLogRegisterBufData(0, (char *) remainingoffset,
+								nremaining * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, remaining_buf, remaining_sz);
+		}
+
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
 		PageSetLSN(page, recptr);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd528..22fb228 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static ItemPointer btreevacuumPosting(BTVacState *vstate, IndexTuple itup,
+									  int *nremaining);
 
 
 /*
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 								 RBM_NORMAL, info->strategy);
 		LockBufferForCleanup(buf);
 		_bt_checkpage(rel, buf);
-		_bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+		_bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+							vstate.lastBlockVacuumed);
 		_bt_relbuf(rel, buf);
 	}
 
@@ -1193,6 +1196,9 @@ restart:
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
+		IndexTuple	remaining[MaxOffsetNumber];
+		OffsetNumber remainingoffset[MaxOffsetNumber];
+		int			nremaining;
 
 		/*
 		 * Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1235,7 @@ restart:
 		 * callback function.
 		 */
 		ndeletable = 0;
+		nremaining = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
 		if (callback)
@@ -1242,31 +1249,78 @@ restart:
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
-				/*
-				 * During Hot Standby we currently assume that
-				 * XLOG_BTREE_VACUUM records do not produce conflicts. That is
-				 * only true as long as the callback function depends only
-				 * upon whether the index tuple refers to heap tuples removed
-				 * in the initial heap scan. When vacuum starts it derives a
-				 * value of OldestXmin. Backends taking later snapshots could
-				 * have a RecentGlobalXmin with a later xid than the vacuum's
-				 * OldestXmin, so it is possible that row versions deleted
-				 * after OldestXmin could be marked as killed by other
-				 * backends. The callback function *could* look at the index
-				 * tuple state in isolation and decide to delete the index
-				 * tuple, though currently it does not. If it ever did, we
-				 * would need to reconsider whether XLOG_BTREE_VACUUM records
-				 * should cause conflicts. If they did cause conflicts they
-				 * would be fairly harsh conflicts, since we haven't yet
-				 * worked out a way to pass a useful value for
-				 * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
-				 * applies to *any* type of index that marks index tuples as
-				 * killed.
-				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				if (BTreeTupleIsPosting(itup))
+				{
+					int			nnewipd = 0;
+					ItemPointer newipd = NULL;
+
+					newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+					if (nnewipd == 0)
+					{
+						/*
+						 * All TIDs from posting list must be deleted, we can
+						 * delete whole tuple in a regular way.
+						 */
+						deletable[ndeletable++] = offnum;
+					}
+					else if (nnewipd == BTreeTupleGetNPosting(itup))
+					{
+						/*
+						 * All TIDs from posting tuple must remain. Do
+						 * nothing, just cleanup.
+						 */
+						pfree(newipd);
+					}
+					else if (nnewipd < BTreeTupleGetNPosting(itup))
+					{
+						/* Some TIDs from posting tuple must remain. */
+						Assert(nnewipd > 0);
+						Assert(newipd != NULL);
+
+						/*
+						 * Form new tuple that contains only remaining TIDs.
+						 * Remember this tuple and the offset of the old tuple
+						 * to update it in place.
+						 */
+						remainingoffset[nremaining] = offnum;
+						remaining[nremaining] = BTreeFormPostingTuple(itup, newipd, nnewipd);
+						nremaining++;
+						pfree(newipd);
+
+						Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+					}
+				}
+				else
+				{
+					htup = &(itup->t_tid);
+
+					/*
+					 * During Hot Standby we currently assume that
+					 * XLOG_BTREE_VACUUM records do not produce conflicts.
+					 * That is only true as long as the callback function
+					 * depends only upon whether the index tuple refers to
+					 * heap tuples removed in the initial heap scan. When
+					 * vacuum starts it derives a value of OldestXmin.
+					 * Backends taking later snapshots could have a
+					 * RecentGlobalXmin with a later xid than the vacuum's
+					 * OldestXmin, so it is possible that row versions deleted
+					 * after OldestXmin could be marked as killed by other
+					 * backends. The callback function *could* look at the
+					 * index tuple state in isolation and decide to delete the
+					 * index tuple, though currently it does not. If it ever
+					 * did, we would need to reconsider whether
+					 * XLOG_BTREE_VACUUM records should cause conflicts. If
+					 * they did cause conflicts they would be fairly harsh
+					 * conflicts, since we haven't yet worked out a way to
+					 * pass a useful value for latestRemovedXid on the
+					 * XLOG_BTREE_VACUUM records. This applies to *any* type
+					 * of index that marks index tuples as killed.
+					 */
+					if (callback(htup, callback_state))
+						deletable[ndeletable++] = offnum;
+				}
 			}
 		}
 
@@ -1274,7 +1328,7 @@ restart:
 		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
 		 * call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nremaining > 0)
 		{
 			/*
 			 * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1345,7 @@ restart:
 			 * that.
 			 */
 			_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+								remainingoffset, remaining, nremaining,
 								vstate->lastBlockVacuumed);
 
 			/*
@@ -1376,6 +1431,41 @@ restart:
 }
 
 /*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+	int			remaining = 0;
+	int			nitem = BTreeTupleGetNPosting(itup);
+	ItemPointer tmpitems = NULL,
+				items = BTreeTupleGetPosting(itup);
+
+	/*
+	 * Check each tuple in the posting list, save alive tuples into tmpitems
+	 */
+	for (int i = 0; i < nitem; i++)
+	{
+		if (vstate->callback(items + i, vstate->callback_state))
+			continue;
+
+		if (tmpitems == NULL)
+			tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+		tmpitems[remaining++] = items[i];
+	}
+
+	*nremaining = remaining;
+	return tmpitems;
+}
+
+/*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
  * btrees always do, so this is trivial.
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c655dad..3e53675 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -30,6 +30,9 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static void _bt_savePostingitem(BTScanOpaque so, int itemIndex,
+								OffsetNumber offnum, ItemPointer iptr,
+								IndexTuple itup, int i);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -504,7 +507,8 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, key, page, mid);
+		result = _bt_compare_posting(rel, key, page, mid,
+									 &(insertstate->in_posting_offset));
 
 		if (result >= cmpval)
 			low = mid + 1;
@@ -533,6 +537,55 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	return low;
 }
 
+/*
+ * Compare insertion-type scankey to tuple on a page,
+ * taking into account posting tuples.
+ * If the key of the posting tuple is equal to scankey,
+ * find exact position inside the posting list,
+ * using TID as extra attribute.
+ */
+int32
+_bt_compare_posting(Relation rel,
+					BTScanInsert key,
+					Page page,
+					OffsetNumber offnum,
+					int *in_posting_offset)
+{
+	IndexTuple	itup;
+	int			result;
+
+	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	result = _bt_compare(rel, key, page, offnum);
+
+	if (BTreeTupleIsPosting(itup) && result == 0)
+	{
+		int			low,
+					high,
+					mid,
+					res;
+
+		low = 0;
+		/* "high" is past end of posting list for loop invariant */
+		high = BTreeTupleGetNPosting(itup);
+
+		while (high > low)
+		{
+			mid = low + ((high - low) / 2);
+			res = ItemPointerCompare(key->scantid,
+									 BTreeTupleGetPostingN(itup, mid));
+
+			if (res >= 1)
+				low = mid + 1;
+			else
+				high = mid;
+		}
+
+		*in_posting_offset = high;
+	}
+
+	return result;
+}
+
 /*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
@@ -665,61 +718,120 @@ _bt_compare(Relation rel,
 	 * Use the heap TID attribute and scantid to try to break the tie.  The
 	 * rules are the same as any other key attribute -- only the
 	 * representation differs.
+	 *
+	 * When itup is a posting tuple, the check becomes more complex. It is
+	 * possible that the scankey belongs to the tuple's posting list TID
+	 * range.
+	 *
+	 * _bt_compare() is multipurpose, so it just returns 0 for a fact that key
+	 * matches tuple at this offset.
+	 *
+	 * Use special _bt_compare_posting() wrapper function to handle this case
+	 * and perform recheck for posting tuple, finding exact position of the
+	 * scankey.
 	 */
-	heapTid = BTreeTupleGetHeapTID(itup);
-	if (key->scantid == NULL)
+	if (!BTreeTupleIsPosting(itup))
 	{
+		heapTid = BTreeTupleGetHeapTID(itup);
+		if (key->scantid == NULL)
+		{
+			/*
+			 * Most searches have a scankey that is considered greater than a
+			 * truncated pivot tuple if and when the scankey has equal values
+			 * for attributes up to and including the least significant
+			 * untruncated attribute in tuple.
+			 *
+			 * For example, if an index has the minimum two attributes (single
+			 * user key attribute, plus heap TID attribute), and a page's high
+			 * key is ('foo', -inf), and scankey is ('foo', <omitted>), the
+			 * search will not descend to the page to the left.  The search
+			 * will descend right instead.  The truncated attribute in pivot
+			 * tuple means that all non-pivot tuples on the page to the left
+			 * are strictly < 'foo', so it isn't necessary to descend left. In
+			 * other words, search doesn't have to descend left because it
+			 * isn't interested in a match that has a heap TID value of -inf.
+			 *
+			 * However, some searches (pivotsearch searches) actually require
+			 * that we descend left when this happens.  -inf is treated as a
+			 * possible match for omitted scankey attribute(s).  This is
+			 * needed by page deletion, which must re-find leaf pages that are
+			 * targets for deletion using their high keys.
+			 *
+			 * Note: the heap TID part of the test ensures that scankey is
+			 * being compared to a pivot tuple with one or more truncated key
+			 * attributes.
+			 *
+			 * Note: pg_upgrade'd !heapkeyspace indexes must always descend to
+			 * the left here, since they have no heap TID attribute (and
+			 * cannot have any -inf key values in any case, since truncation
+			 * can only remove non-key attributes).  !heapkeyspace searches
+			 * must always be prepared to deal with matches on both sides of
+			 * the pivot once the leaf level is reached.
+			 */
+			if (key->heapkeyspace && !key->pivotsearch &&
+				key->keysz == ntupatts && heapTid == NULL)
+				return 1;
+
+			/* All provided scankey arguments found to be equal */
+			return 0;
+		}
+
 		/*
-		 * Most searches have a scankey that is considered greater than a
-		 * truncated pivot tuple if and when the scankey has equal values for
-		 * attributes up to and including the least significant untruncated
-		 * attribute in tuple.
-		 *
-		 * For example, if an index has the minimum two attributes (single
-		 * user key attribute, plus heap TID attribute), and a page's high key
-		 * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
-		 * will not descend to the page to the left.  The search will descend
-		 * right instead.  The truncated attribute in pivot tuple means that
-		 * all non-pivot tuples on the page to the left are strictly < 'foo',
-		 * so it isn't necessary to descend left.  In other words, search
-		 * doesn't have to descend left because it isn't interested in a match
-		 * that has a heap TID value of -inf.
-		 *
-		 * However, some searches (pivotsearch searches) actually require that
-		 * we descend left when this happens.  -inf is treated as a possible
-		 * match for omitted scankey attribute(s).  This is needed by page
-		 * deletion, which must re-find leaf pages that are targets for
-		 * deletion using their high keys.
-		 *
-		 * Note: the heap TID part of the test ensures that scankey is being
-		 * compared to a pivot tuple with one or more truncated key
-		 * attributes.
-		 *
-		 * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
-		 * left here, since they have no heap TID attribute (and cannot have
-		 * any -inf key values in any case, since truncation can only remove
-		 * non-key attributes).  !heapkeyspace searches must always be
-		 * prepared to deal with matches on both sides of the pivot once the
-		 * leaf level is reached.
+		 * Treat truncated heap TID as minus infinity, since scankey has a key
+		 * attribute value (scantid) that would otherwise be compared directly
 		 */
-		if (key->heapkeyspace && !key->pivotsearch &&
-			key->keysz == ntupatts && heapTid == NULL)
+		Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+		if (heapTid == NULL)
 			return 1;
 
-		/* All provided scankey arguments found to be equal */
-		return 0;
+		Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+		return ItemPointerCompare(key->scantid, heapTid);
 	}
+	else
+	{
+		heapTid = BTreeTupleGetMinTID(itup);
+		if (key->scantid != NULL && heapTid != NULL)
+		{
+			int			cmp = ItemPointerCompare(key->scantid, heapTid);
 
-	/*
-	 * Treat truncated heap TID as minus infinity, since scankey has a key
-	 * attribute value (scantid) that would otherwise be compared directly
-	 */
-	Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
-	if (heapTid == NULL)
-		return 1;
+			if (cmp == -1 || cmp == 0)
+			{
+				elog(DEBUG4, "offnum %d Scankey (%u,%u) is less than or equal to posting tuple (%u,%u)",
+					 offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+					 ItemPointerGetOffsetNumberNoCheck(key->scantid),
+					 ItemPointerGetBlockNumberNoCheck(heapTid),
+					 ItemPointerGetOffsetNumberNoCheck(heapTid));
+				return cmp;
+			}
 
-	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	return ItemPointerCompare(key->scantid, heapTid);
+			heapTid = BTreeTupleGetMaxTID(itup);
+			cmp = ItemPointerCompare(key->scantid, heapTid);
+			if (cmp == 1)
+			{
+				elog(DEBUG4, "offnum %d Scankey (%u,%u) is greater than posting tuple (%u,%u)",
+					 offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+					 ItemPointerGetOffsetNumberNoCheck(key->scantid),
+					 ItemPointerGetBlockNumberNoCheck(heapTid),
+					 ItemPointerGetOffsetNumberNoCheck(heapTid));
+				return cmp;
+			}
+
+			/*
+			 * if we got here, scantid is inbetween of posting items of the
+			 * tuple
+			 */
+			elog(DEBUG4, "offnum %d Scankey (%u,%u) is between posting items (%u,%u) and (%u,%u)",
+				 offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+				 ItemPointerGetOffsetNumberNoCheck(key->scantid),
+				 ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMinTID(itup)),
+				 ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMinTID(itup)),
+				 ItemPointerGetBlockNumberNoCheck(heapTid),
+				 ItemPointerGetOffsetNumberNoCheck(heapTid));
+			return 0;
+		}
+	}
+
+	return 0;
 }
 
 /*
@@ -1456,6 +1568,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	/* initialize tuple workspace to empty */
 	so->currPos.nextTupleOffset = 0;
+	so->currPos.prevTupleOffset = 0;
 
 	/*
 	 * Now that the current page has been made consistent, the macro should be
@@ -1490,8 +1603,22 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				if (!BTreeTupleIsPosting(itup))
+				{
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+				else
+				{
+					/* Return posting list "logical" tuples */
+					for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savePostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											itup, i);
+						itemIndex++;
+					}
+				}
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1524,7 +1651,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1532,7 +1659,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxPostingIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1574,8 +1701,23 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (passes_quals && tuple_alive)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				if (!BTreeTupleIsPosting(itup))
+				{
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+				else
+				{
+					/* Return posting list "logical" tuples */
+					/* XXX: Maybe this loop should be backwards? */
+					for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						itemIndex--;
+						_bt_savePostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											itup, i);
+					}
+				}
 			}
 			if (!continuescan)
 			{
@@ -1589,8 +1731,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1603,6 +1745,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1615,6 +1759,33 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 	}
 }
 
+/* Save an index item into so->currPos.items[itemIndex] for posting tuples. */
+static void
+_bt_savePostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer iptr, IndexTuple itup, int i)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *iptr;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		if (i == 0)
+		{
+			/* save key. the same for all tuples in the posting */
+			Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+			currItem->tupleOffset = so->currPos.nextTupleOffset;
+			memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+			so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+			so->currPos.prevTupleOffset = currItem->tupleOffset;
+		}
+		else
+			currItem->tupleOffset = so->currPos.prevTupleOffset;
+	}
+}
+
 /*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index d0b9013..5545465f9 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -288,6 +288,8 @@ static void _bt_sortaddtup(Page page, Size itemsize,
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 						 IndexTuple itup);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void _bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+								 BTCompressState *compressState);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
 static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -972,6 +974,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * only shift the line pointer array back and forth, and overwrite
 			 * the tuple space previously occupied by oitup.  This is fairly
 			 * cheap.
+			 *
+			 * If lastleft tuple was a posting tuple, we'll truncate its
+			 * posting list in _bt_truncate as well. Note that it is also
+			 * applicable only to leaf pages, since internal pages never
+			 * contain posting tuples.
 			 */
 			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
 			lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1011,6 +1018,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		 * the minimum key for the new page.
 		 */
 		state->btps_minkey = CopyIndexTuple(oitup);
+		Assert(BTreeTupleIsPivot(state->btps_minkey));
 
 		/*
 		 * Set the sibling links for both pages.
@@ -1052,6 +1060,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(state->btps_minkey == NULL);
 		state->btps_minkey = CopyIndexTuple(itup);
 		/* _bt_sortaddtup() will perform full truncation later */
+		BTreeTupleClearBtIsPosting(state->btps_minkey);
 		BTreeTupleSetNAtts(state->btps_minkey, 0);
 	}
 
@@ -1137,6 +1146,91 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 }
 
 /*
+ * Add new tuple (posting or non-posting) to the page while building index.
+ */
+static void
+_bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+					 BTCompressState *compressState)
+{
+	IndexTuple	to_insert;
+
+	/* Return, if there is no tuple to insert */
+	if (state == NULL)
+		return;
+
+	if (compressState->ntuples == 0)
+		to_insert = compressState->itupprev;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+											 compressState->ipd,
+											 compressState->ntuples);
+		to_insert = postingtuple;
+		pfree(compressState->ipd);
+	}
+
+	_bt_buildadd(wstate, state, to_insert);
+
+	if (compressState->ntuples > 0)
+		pfree(to_insert);
+	compressState->ntuples = 0;
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in compressState.
+ *
+ * Helper function for _bt_load() and _bt_compress_one_page().
+ *
+ * Note: caller is responsible for size check to ensure that resulting tuple
+ * won't exceed BTMaxItemSize.
+ */
+void
+_bt_add_posting_item(BTCompressState *compressState, IndexTuple itup)
+{
+	int			nposting = 0;
+
+	if (compressState->ntuples == 0)
+	{
+		compressState->ipd = palloc0(compressState->maxitemsize);
+
+		if (BTreeTupleIsPosting(compressState->itupprev))
+		{
+			/* if itupprev is posting, add all its TIDs to the posting list */
+			nposting = BTreeTupleGetNPosting(compressState->itupprev);
+			memcpy(compressState->ipd,
+				   BTreeTupleGetPosting(compressState->itupprev),
+				   sizeof(ItemPointerData) * nposting);
+			compressState->ntuples += nposting;
+		}
+		else
+		{
+			memcpy(compressState->ipd, compressState->itupprev,
+				   sizeof(ItemPointerData));
+			compressState->ntuples++;
+		}
+	}
+
+	if (BTreeTupleIsPosting(itup))
+	{
+		/* if tuple is posting, add all its TIDs to the posting list */
+		nposting = BTreeTupleGetNPosting(itup);
+		memcpy(compressState->ipd + compressState->ntuples,
+			   BTreeTupleGetPosting(itup),
+			   sizeof(ItemPointerData) * nposting);
+		compressState->ntuples += nposting;
+	}
+	else
+	{
+		memcpy(compressState->ipd + compressState->ntuples, itup,
+			   sizeof(ItemPointerData));
+		compressState->ntuples++;
+	}
+}
+
+/*
  * Read tuples in correct sort order from tuplesort, and load them into
  * btree leaves.
  */
@@ -1150,9 +1244,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	bool		load1;
 	TupleDesc	tupdes = RelationGetDescr(wstate->index);
 	int			i,
-				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+				natts = IndexRelationGetNumberOfAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		use_compression = false;
+	BTCompressState *compressState = NULL;
+
+	/*
+	 * Don't use compression for indexes with INCLUDEd columns and unique
+	 * indexes.
+	 */
+	use_compression = (IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+					   IndexRelationGetNumberOfAttributes(wstate->index) &&
+					   !wstate->index->rd_index->indisunique);
 
 	if (merge)
 	{
@@ -1266,19 +1371,89 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	}
 	else
 	{
-		/* merge is unnecessary */
-		while ((itup = tuplesort_getindextuple(btspool->sortstate,
-											   true)) != NULL)
+		if (!use_compression)
 		{
-			/* When we see first tuple, create first index page */
-			if (state == NULL)
-				state = _bt_pagestate(wstate, 0);
+			/* merge is unnecessary */
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+					state = _bt_pagestate(wstate, 0);
 
-			_bt_buildadd(wstate, state, itup);
+				_bt_buildadd(wstate, state, itup);
 
-			/* Report progress */
-			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
-										 ++tuples_done);
+				/* Report progress */
+				pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+											 ++tuples_done);
+			}
+		}
+		else
+		{
+			/* init compress state needed to build posting tuples */
+			compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+			compressState->ipd = NULL;
+			compressState->ntuples = 0;
+			compressState->itupprev = NULL;
+			compressState->maxitemsize = 0;
+			compressState->maxpostingsize = 0;
+
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+				{
+					state = _bt_pagestate(wstate, 0);
+					compressState->maxitemsize = BTMaxItemSize(state->btps_page);
+				}
+
+				if (compressState->itupprev != NULL)
+				{
+					int			n_equal_atts = _bt_keep_natts_fast(wstate->index,
+																   compressState->itupprev, itup);
+
+					if (n_equal_atts > natts)
+					{
+						/*
+						 * Tuples are equal. Create or update posting.
+						 *
+						 * Else If posting is too big, insert it on page and
+						 * continue.
+						 */
+						if ((compressState->ntuples + 1) * sizeof(ItemPointerData) <
+							compressState->maxpostingsize)
+							_bt_add_posting_item(compressState, itup);
+						else
+							_bt_buildadd_posting(wstate, state,
+												 compressState);
+					}
+					else
+					{
+						/*
+						 * Tuples are not equal. Insert itupprev into index.
+						 * Save current tuple for the next iteration.
+						 */
+						_bt_buildadd_posting(wstate, state, compressState);
+					}
+				}
+
+				/*
+				 * Save the tuple to compare it with the next one and maybe
+				 * unite them into a posting tuple.
+				 */
+				if (compressState->itupprev)
+					pfree(compressState->itupprev);
+				compressState->itupprev = CopyIndexTuple(itup);
+
+				/* compute max size of posting list */
+				compressState->maxpostingsize = compressState->maxitemsize -
+					IndexInfoFindDataOffset(compressState->itupprev->t_info) -
+					MAXALIGN(IndexTupleSize(compressState->itupprev));
+			}
+
+			/* Handle the last item */
+			_bt_buildadd_posting(wstate, state, compressState);
 		}
 	}
 
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index a7882fd..fbb12db 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -492,6 +492,13 @@ _bt_recsplitloc(FindSplitData *state,
 	 * adding a heap TID to the left half's new high key when splitting at the
 	 * leaf level.  In practice the new high key will often be smaller and
 	 * will rarely be larger, but conservatively assume the worst case.
+	 *
+	 * FIXME: We can make better choices about split points by being clever
+	 * about the BTreeTupleIsPosting() case here.  All we need to do is
+	 * subtract the whole size of the posting list, then add
+	 * MAXALIGN(sizeof(ItemPointerData)), since we know for sure that
+	 * _bt_truncate() won't make a final high key that is larger even in the
+	 * worst case.
 	 */
 	if (state->is_leaf)
 		leftfree -= (int16) (firstrightitemsz +
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 93fab26..a6eee1b 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -111,8 +111,21 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	key->nextkey = false;
 	key->pivotsearch = false;
 	key->keysz = Min(indnkeyatts, tupnatts);
-	key->scantid = key->heapkeyspace && itup ?
-		BTreeTupleGetHeapTID(itup) : NULL;
+
+	/*
+	 * XXX: Do we need to have both BTreeTupleGetHeapTID() and
+	 * BTreeTupleGetMinTID()?
+	 */
+	if (itup && key->heapkeyspace)
+	{
+		if (!BTreeTupleIsPivot(itup))
+			key->scantid = BTreeTupleGetMinTID(itup);
+		else
+			key->scantid = BTreeTupleGetHeapTID(itup);
+	}
+	else
+		key->scantid = NULL;
+
 	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
@@ -1787,7 +1800,9 @@ _bt_killitems(IndexScanDesc scan)
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			/* No microvacuum for posting tuples */
+			if (!BTreeTupleIsPosting(ituple) &&
+				(ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
 			{
 				/* found the item */
 				ItemIdMarkDead(iid);
@@ -2145,6 +2160,16 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(firstright))
+		{
+			BTreeTupleClearBtIsPosting(pivot);
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			pivot->t_info &= ~INDEX_SIZE_MASK;
+			pivot->t_info |= BTreeTupleGetPostingOffset(firstright);
+		}
+
+		Assert(!BTreeTupleIsPosting(pivot));
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2161,6 +2186,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * attribute to the new pivot tuple.
 		 */
 		Assert(natts != nkeyatts);
+		Assert(!BTreeTupleIsPosting(lastleft));
+		Assert(!BTreeTupleIsPosting(firstright));
 		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
 		tidpivot = palloc0(newsize);
 		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2168,6 +2195,27 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		pfree(pivot);
 		pivot = tidpivot;
 	}
+	else if (BTreeTupleIsPosting(firstright))
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal. But
+		 * the tuple is a compressed tuple with a posting list, so we still
+		 * must truncate it.
+		 *
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		newsize = BTreeTupleGetPostingOffset(firstright) +
+			MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+		pivot->t_info &= ~INDEX_SIZE_MASK;
+		pivot->t_info |= newsize;
+		BTreeTupleClearBtIsPosting(pivot);
+		BTreeTupleSetAltHeapTID(pivot);
+
+		Assert(!BTreeTupleIsPosting(pivot));
+	}
 	else
 	{
 		/*
@@ -2205,7 +2253,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2216,9 +2264,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+							  BTreeTupleGetMinTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetMinTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetMinTID(firstright)) < 0);
 #else
 
 	/*
@@ -2231,7 +2282,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMinTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2240,7 +2291,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetMinTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2330,6 +2382,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * leaving excessive amounts of free space on either side of page split.
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts.
+ *
+ * To build a posting tuple we need to ensure that all attributes
+ * of both tuples are equal. Use this function to compare them.
+ * TODO: maybe it's worth to rename the function.
+ *
+ * XXX: Obviously we need infrastructure for making sure it is okay to use
+ * this for posting list stuff.  For example, non-deterministic collations
+ * cannot use compression, and will not work with what we have now.
+ *
+ * XXX: Even then, we probably also need to worry about TOAST as a special
+ * case.  Don't repeat bugs like the amcheck bug that was fixed in commit
+ * eba775345d23d2c999bbb412ae658b6dab36e3e8.  As the test case added in that
+ * commit shows, we need to worry about pg_attribute.attstorage changing in
+ * the underlying table due to an ALTER TABLE (and maybe a few other things
+ * like that).  In general, the "TOAST input state" of a TOASTable datum isn't
+ * something that we make many guarantees about today, so even with C
+ * collation text we could in theory get different answers from
+ * _bt_keep_natts_fast() and _bt_keep_natts().  This needs to be nailed down
+ * in some way.
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2415,7 +2486,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * Non-pivot tuples currently never use alternative heap TID
 			 * representation -- even those within heapkeyspace indexes
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
@@ -2470,7 +2541,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * that to decide if the tuple is a pre-v11 tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
+				(!BTreeTupleIsPivot(itup) &&
 				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
 		}
 		else
@@ -2497,7 +2568,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
 		return false;
 
 	/*
@@ -2549,6 +2620,8 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 	if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
 		return;
 
+	/* TODO correct error messages for posting tuples */
+
 	/*
 	 * Internal page insertions cannot fail here, because that would mean that
 	 * an earlier leaf level insertion that should have failed didn't
@@ -2575,3 +2648,79 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Given a basic tuple that contains key datum and posting list,
+ * build a posting tuple.
+ *
+ * Basic tuple can be a posting tuple, but we only use key part of it,
+ * all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1 fallback to building a non-posting tuple.
+ * It is necessary to avoid storage overhead after posting tuple was vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+	uint32		keysize,
+				newsize = 0;
+	IndexTuple	itup;
+
+	/* We only need key part of the tuple */
+	if (BTreeTupleIsPosting(tuple))
+		keysize = BTreeTupleGetPostingOffset(tuple);
+	else
+		keysize = IndexTupleSize(tuple);
+
+	Assert(nipd > 0);
+
+	/* Add space needed for posting list */
+	if (nipd > 1)
+		newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+	else
+		newsize = keysize;
+
+	newsize = MAXALIGN(newsize);
+	itup = palloc0(newsize);
+	memcpy(itup, tuple, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+
+	if (nipd > 1)
+	{
+		/* Form posting tuple, fill posting fields */
+
+		/* Set meta info about the posting list */
+		itup->t_info |= INDEX_ALT_TID_MASK;
+		BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+		/* sort the list to preserve TID order invariant */
+		qsort((void *) ipd, nipd, sizeof(ItemPointerData),
+			  (int (*) (const void *, const void *)) ItemPointerCompare);
+
+		/* Copy posting list into the posting tuple */
+		memcpy(BTreeTupleGetPosting(itup), ipd,
+			   sizeof(ItemPointerData) * nipd);
+	}
+	else
+	{
+		/* To finish building of a non-posting tuple, copy TID from ipd */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		ItemPointerCopy(ipd, &itup->t_tid);
+	}
+
+	return itup;
+}
+
+/*
+ * Opposite of BTreeFormPostingTuple.
+ * returns regular tuple that contains the key,
+ * the tid of the new tuple is the nth tid of original tuple's posting list
+ * result tuple palloc'd in a caller's context.
+ */
+IndexTuple
+BTreeGetNthTupleOfPosting(IndexTuple tuple, int n)
+{
+	Assert(BTreeTupleIsPosting(tuple));
+	return BTreeFormPostingTuple(tuple, BTreeTupleGetPostingN(tuple, n), 1);
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c..5b30e36 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -386,8 +386,8 @@ btree_xlog_vacuum(XLogReaderState *record)
 	Buffer		buffer;
 	Page		page;
 	BTPageOpaque opaque;
-#ifdef UNUSED
 	xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
 
 	/*
 	 * This section of code is thought to be no longer needed, after analysis
@@ -478,14 +478,35 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		if (len > 0)
 		{
-			OffsetNumber *unused;
-			OffsetNumber *unend;
+			if (xlrec->nremaining)
+			{
+				int			i;
+				OffsetNumber *remainingoffset;
+				IndexTuple	remaining;
+				Size		itemsz;
+
+				remainingoffset = (OffsetNumber *)
+					(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+				remaining = (IndexTuple) ((char *) remainingoffset +
+										  xlrec->nremaining * sizeof(OffsetNumber));
 
-			unused = (OffsetNumber *) ptr;
-			unend = (OffsetNumber *) ((char *) ptr + len);
+				/* Handle posting tuples */
+				for (i = 0; i < xlrec->nremaining; i++)
+				{
+					PageIndexTupleDelete(page, remainingoffset[i]);
+
+					itemsz = MAXALIGN(IndexTupleSize(remaining));
+
+					if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+									false, false) == InvalidOffsetNumber)
+						elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+					remaining = (IndexTuple) ((char *) remaining + itemsz);
+				}
+			}
 
-			if ((unend - unused) > 0)
-				PageIndexMultiDelete(page, unused, unend - unused);
+			if (xlrec->ndeleted)
+				PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 		}
 
 		/*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index a14eb79..e4fa99a 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -46,8 +46,10 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "lastBlockVacuumed %u",
-								 xlrec->lastBlockVacuumed);
+				appendStringInfo(buf, "lastBlockVacuumed %u; nremaining %u; ndeleted %u",
+								 xlrec->lastBlockVacuumed,
+								 xlrec->nremaining,
+								 xlrec->ndeleted);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 744ffb6..85ee040 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -141,6 +141,11 @@ typedef IndexAttributeBitMapData * IndexAttributeBitMap;
  * On such a page, N tuples could take one MAXALIGN quantum less space than
  * estimated here, seemingly allowing one more tuple than estimated here.
  * But such a page always has at least MAXALIGN special space, so we're safe.
+ *
+ * Note: btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so they may contain more tuples.
+ * Use MaxPostingIndexTuplesPerPage instead.
+
  */
 #define MaxIndexTuplesPerPage	\
 	((int) ((BLCKSZ - SizeOfPageHeaderData) / \
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 83e0e6c..3127c41 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
  * All other types of index tuples ("pivot" tuples) only have key columns,
  * since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,39 @@ typedef struct BTMetaPageData
  * omitted rather than truncated, since its representation is different to
  * the non-pivot representation.)
  *
+ * Non-pivot posting tuple format:
+ *  t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicated keys more effectively,
+ * we use special format of tuples - posting tuples.
+ * posting_list is an array of ItemPointerData.
+ *
+ * This type of compression never applies to system indexes, unique indexes
+ * or indexes with INCLUDEd columns.
+ *
+ * To differ posting tuples we use INDEX_ALT_TID_MASK flag in t_info and
+ * BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's tid:
+ * - t_tid.ip_blkid contains offset of the posting list.
+ * - t_tid offset field contains number of posting items this tuple contain
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * tuples, which is constrainted by BTMaxItemSize.
+
+ * If page contains so many duplicates, that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize), page may contain several posting
+ * tuples with the same key.
+ * Also page can contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
  * Pivot tuple format:
  *
  *  t_tid | t_info | key values | [heap TID]
@@ -281,23 +313,157 @@ typedef struct BTMetaPageData
  * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
  * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
  * number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinct posting tuples from pivot tuples.
  *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_N_POSTING_OFFSET_MASK	0x0FFF
 #define BT_HEAP_TID_ATTR			0x1000
+#define BT_IS_POSTING				0x2000
+
+#define BTreeTupleIsPosting(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+	)
+
+#define BTreeTupleIsPivot(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+	)
+
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger then
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+	((int) ((BLCKSZ - SizeOfPageHeaderData - \
+			3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+			(sizeof(ItemPointerData)))
+
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * Iterating over tuples during index build or applying compression to a
+ * single page, we remember a tuple in itupprev, then compare the next one
+ * with it. If tuples are equal, save their TIDs in the posting list.
+ * ntuples contains the size of the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTCompressState
+{
+	Size		maxitemsize;
+	Size		maxpostingsize;
+	IndexTuple	itupprev;
+	int			ntuples;
+	ItemPointerData *ipd;
+} BTCompressState;
+
+/*
+ * For use in _bt_compress_one_page().
+ * If there is only a few uncompressed items on a page,
+ * it isn't worth to apply compression.
+ * Currently it is just a magic number,
+ * proper benchmarking will probably help to choose better value.
+ */
+#define BT_COMPRESS_THRESHOLD 10
+
+/* macros to work with posting tuples *BEGIN* */
+#define BTreeTupleSetBtIsPosting(itup) \
+	do { \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+	} while(0)
 
-/* Get/set downlink block number */
+#define BTreeTupleClearBtIsPosting(itup) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleGetNPosting(itup)	\
+	( \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+	)
+
+#define BTreeTupleSetNPosting(itup, n) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+		BTreeTupleSetBtIsPosting(itup); \
+	} while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list.
+ * Caller is responsible for checking BTreeTupleIsPosting to ensure that
+ * it will get what he expects
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
+#define BTreeTupleSetPostingOffset(itup, offset) \
+	ItemPointerSetBlockNumber(&((itup)->t_tid), (offset))
+
+#define BTreeSetPostingMeta(itup, nposting, off) \
+	do { \
+		BTreeTupleSetNPosting(itup, nposting); \
+		BTreeTupleSetPostingOffset(itup, off); \
+	} while(0)
+
+#define BTreeTupleGetPosting(itup) \
+	(ItemPointerData*) ((char*)(itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+	(ItemPointerData*) (BTreeTupleGetPosting(itup) + (n))
+
+/*
+ * Posting tuples always contain several TIDs.
+ * Some functions that use TID as a tiebreaker,
+ * to ensure correct order of TID keys they can use two macros below:
+ */
+#define BTreeTupleGetMinTID(itup) \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+		( \
+			(ItemPointer) BTreeTupleGetPosting(itup) \
+		) \
+		: \
+		(ItemPointer) &((itup)->t_tid) \
+	)
+#define BTreeTupleGetMaxTID(itup) \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+		( \
+			(ItemPointer) (BTreeTupleGetPosting(itup) + (BTreeTupleGetNPosting(itup)-1)) \
+		) \
+		: \
+		(ItemPointer) &((itup)->t_tid) \
+	)
+/* macros to work with posting tuples *END* */
+
+/* Get/set downlink block number  */
 #define BTreeInnerTupleGetDownLink(itup) \
 	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
 #define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,7 +492,8 @@ typedef struct BTMetaPageData
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
 		( \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
@@ -335,6 +502,7 @@ typedef struct BTMetaPageData
 	)
 #define BTreeTupleSetNAtts(itup, n) \
 	do { \
+		Assert(!BTreeTupleIsPosting(itup)); \
 		(itup)->t_info |= INDEX_ALT_TID_MASK; \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
 	} while(0)
@@ -342,6 +510,8 @@ typedef struct BTMetaPageData
 /*
  * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
  * and non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * For non-pivot posting tuple it returns the first tid from posting list.
  */
 #define BTreeTupleGetHeapTID(itup) \
 	( \
@@ -351,7 +521,10 @@ typedef struct BTMetaPageData
 		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
 					   sizeof(ItemPointerData)) \
 	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+	  : (itup)->t_info & INDEX_ALT_TID_MASK ? \
+		(((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0) ? \
+		(ItemPointer) BTreeTupleGetPosting(itup) : NULL) \
+		: (ItemPointer) &((itup)->t_tid) \
 	)
 /*
  * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
@@ -360,6 +533,7 @@ typedef struct BTMetaPageData
 #define BTreeTupleSetAltHeapTID(itup) \
 	do { \
 		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
 								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
 	} while(0)
@@ -501,6 +675,12 @@ typedef struct BTInsertStateData
 	Buffer		buf;
 
 	/*
+	 * if _bt_binsrch_insert() found the location inside existing posting
+	 * list, save the position inside the list.
+	 */
+	int			in_posting_offset;
+
+	/*
 	 * Cache of bounds within the current buffer.  Only used for insertions
 	 * where _bt_check_unique is called.  See _bt_binsrch_insert and
 	 * _bt_findinsertloc for details.
@@ -567,6 +747,8 @@ typedef struct BTScanPosData
 	 * location in the associated tuple storage workspace.
 	 */
 	int			nextTupleOffset;
+	/* prevTupleOffset is for posting list handling */
+	int			prevTupleOffset;
 
 	/*
 	 * The items array is always ordered in index order (ie, increasing
@@ -579,7 +761,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxPostingIndexTuplesPerPage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -763,6 +945,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems, Relation heapRel);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems,
+								OffsetNumber *remainingoffset,
+								IndexTuple *remaining, int nremaining,
 								BlockNumber lastBlockVacuumed);
 extern int	_bt_pagedel(Relation rel, Buffer buf);
 
@@ -775,6 +959,8 @@ extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
 							bool forupdate, BTStack stack, int access, Snapshot snapshot);
 extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
 extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
+extern int32 _bt_compare_posting(Relation rel, BTScanInsert key, Page page,
+								 OffsetNumber offnum, int *in_posting_offset);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -813,6 +999,9 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd,
+										int nipd);
+extern IndexTuple BTreeGetNthTupleOfPosting(IndexTuple tuple, int n);
 
 /*
  * prototypes for functions in nbtvalidate.c
@@ -825,5 +1014,7 @@ extern bool btvalidate(Oid opclassoid);
 extern IndexBuildResult *btbuild(Relation heap, Relation index,
 								 struct IndexInfo *indexInfo);
 extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+extern void _bt_add_posting_item(BTCompressState *compressState,
+								 IndexTuple itup);
 
 #endif							/* NBTREE_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 9beccc8..6f60ca5 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -173,10 +173,19 @@ typedef struct xl_btree_vacuum
 {
 	BlockNumber lastBlockVacuumed;
 
-	/* TARGET OFFSET NUMBERS FOLLOW */
+	/*
+	 * This field helps us to find beginning of the remaining tuples from
+	 * postings which follow array of offset numbers.
+	 */
+	uint32		nremaining;
+	uint32		ndeleted;
+
+	/* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+	/* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
+	/* TARGET OFFSET NUMBERS FOLLOW (if any) */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(BlockNumber))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.

#65

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Anastasia Lubennikova (#64)

3 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Wed, Jul 31, 2019 at 9:23 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

* Included my own pageinspect hack to visualize the minimum TIDs in
posting lists. It's broken out into a separate patch file. The code is
very rough, but it might help someone else, so I thought I'd include
it.

Cool, I think we should add it to the final patchset,
probably, as separate function by analogy with tuple_data_split.

Good idea.

Attached is v5, which is based on your v4. The three main differences
between this and v4 are:

* Removed BT_COMPRESS_THRESHOLD stuff, for the reasons explained in my
July 24 e-mail. We can always add something like this back during
performance validation of the patch. Right now, having no
BT_COMPRESS_THRESHOLD limit definitely improves space utilization for
certain important cases, which seems more important than the
uncertain/speculative downside.

* We now have experimental support for unique indexes. This is broken
out into its own patch.

* We now handle LP_DEAD items in a special way within
_bt_insertonpg_in_posting().

As you pointed out already, we do need to think about LP_DEAD items
directly, rather than assuming that they cannot be on the page that
_bt_insertonpg_in_posting() must process. More on that later.

If sizeof(t_info) + sizeof(key) < sizeof(t_tid), resulting posting tuple
can be
larger. It may happen if keysize <= 4 byte.
In this situation original tuples must have been aligned to size 16
bytes each,
and resulting tuple is at most 24 bytes (6+2+4+6+6). So this case is
also safe.

I still need to think about the exact details of alignment within
_bt_insertonpg_in_posting(). I'm worried about boundary cases there. I
could be wrong.

I changed DEBUG message to ERROR in v4 and it passes all regression tests.
I doubt that it covers all corner cases, so I'll try to add more special
tests.

It also passes my tests, FWIW.

Hmm, I can't get the problem.
In current implementation each posting tuple is smaller than BTMaxItemSize,
so no split can lead to having tuple of larger size.

That sounds correct, then.

No, we don't need them both. I don't mind combining them into one macro.
Actually, we never needed BTreeTupleGetMinTID(),
since its functionality is covered by BTreeTupleGetHeapTID.

I've removed BTreeTupleGetMinTID() in v5. I think it's fine to just
have a comment next to BTreeTupleGetHeapTID(), and another comment
next to BTreeTupleGetMaxTID().

The main reason why I decided to avoid applying compression to unique
indexes
is the performance of microvacuum. It is not applied to items inside a
posting
tuple. And I expect it to be important for unique indexes, which ideally
contain only a few live values.

I found that the performance of my experimental patch with unique
index was significantly worse. It looks like this is a bad idea, as
you predicted, though we may still want to do
deduplication/compression with NULL values in unique indexes. I did
learn a few things from implementing unique index support, though.

BTW, there is a subtle bug in how my unique index patch does
WAL-logging -- see my comments within
index_compute_xid_horizon_for_tuples(). The bug shouldn't matter if
replication isn't used. I don't think that we're going to use this
experimental patch at all, so I didn't bother fixing the bug.

if (ItemIdIsDead(itemId))
continue;

In the previous review Rafia asked about "some reason".
Trying to figure out if this situation possible, I changed this line to
Assert(!ItemIdIsDead(itemId)) in our test version. And it failed in a
performance
test. Unfortunately, I was not able to reproduce it.

I found it easy enough to see LP_DEAD items within
_bt_insertonpg_in_posting() when running pgbench with the extra unique
index patch. To give you a simple example of how this can happen,
consider the comments about BTP_HAS_GARBAGE within
_bt_delitems_vacuum(). That probably isn't the only way it can happen,
either. ISTM that we need to be prepared for LP_DEAD items during
deduplication, rather than trying to prevent deduplication from ever
having to see an LP_DEAD item.

v5 makes _bt_insertonpg_in_posting() prepared to overwrite an
existing item if it's an LP_DEAD item that falls in the same TID range
(that's _bt_compare()-wise "equal" to an existing tuple, which may or
may not be a posting list tuple already). I haven't made this code do
something like call index_compute_xid_horizon_for_tuples(), even
though that's needed for correctness (i.e. this new code is currently
broken in the same way that I mentioned unique index support is
broken). I also added a nearby FIXME comment to
_bt_insertonpg_in_posting() -- I don't think think that the code for
splitting a posting list in two is currently crash-safe.

How do you feel about officially calling this deduplication, not
compression? I think that it's a more accurate name for the technique.
--
Peter Geoghegan

Attachments:

v5-0001-Compression-deduplication-in-nbtree.patchapplication/octet-stream; name=v5-0001-Compression-deduplication-in-nbtree.patchDownload

From 1df33bd12aaf21179da6d3aedaa7a2084e577d25 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 19 Jul 2019 18:57:31 -0700
Subject: [PATCH v5 1/3] Compression/deduplication in nbtree.

Version with some revisions by me.
---
 contrib/amcheck/verify_nbtree.c         | 124 +++++--
 src/backend/access/nbtree/nbtinsert.c   | 430 +++++++++++++++++++++++-
 src/backend/access/nbtree/nbtpage.c     |  53 +++
 src/backend/access/nbtree/nbtree.c      | 142 ++++++--
 src/backend/access/nbtree/nbtsearch.c   | 283 +++++++++++++---
 src/backend/access/nbtree/nbtsort.c     | 197 ++++++++++-
 src/backend/access/nbtree/nbtsplitloc.c |  30 +-
 src/backend/access/nbtree/nbtutils.c    | 164 ++++++++-
 src/backend/access/nbtree/nbtxlog.c     |  34 +-
 src/backend/access/rmgrdesc/nbtdesc.c   |   6 +-
 src/include/access/itup.h               |   4 +
 src/include/access/nbtree.h             | 202 ++++++++++-
 src/include/access/nbtxlog.h            |  13 +-
 13 files changed, 1528 insertions(+), 154 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 55a3a4bbe0..da79dd1b62 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -889,6 +889,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer	scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -959,29 +960,73 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
 			char	   *itid,
 					   *htid;
+			ItemPointer tid = BTreeTupleGetHeapTID(itup);
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+							ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("could not find tuple using search from root page in index \"%s\"",
 							RelationGetRelationName(state->rel)),
-					 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+					 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
 										itid, htid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is actually a posting list, make sure posting list TIDs
+		 * are in order.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer		current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+				current = BTreeTupleGetPostingN(itup, i);
+
+				if (ItemPointerCompare(current, &last) <= 0)
+				{
+					char	   *itid,
+							   *htid;
+
+					itid = psprintf("(%u,%u)", state->targetblock, offset);
+					htid = psprintf("(%u,%u)",
+									ItemPointerGetBlockNumberNoCheck(current),
+									ItemPointerGetOffsetNumberNoCheck(current));
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+												itid, htid,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1039,12 +1084,33 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				IndexTuple	onetup;
+
+				/* Fingerprint all elements of posting tuple one by one */
+				for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+					norm = bt_normalize_tuple(state, onetup);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != onetup)
+						pfree(norm);
+					pfree(onetup);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
@@ -1052,7 +1118,8 @@ bt_target_page_check(BtreeCheckState *state)
 		 *
 		 * If there is a high key (if this is not the rightmost page on its
 		 * entire level), check that high key actually is upper bound on all
-		 * page items.
+		 * page items.  If this is a posting list tuple, we'll need to set
+		 * scantid to be highest TID in posting list.
 		 *
 		 * We prefer to check all items against high key rather than checking
 		 * just the last and trusting that the operator class obeys the
@@ -1092,6 +1159,9 @@ bt_target_page_check(BtreeCheckState *state)
 		 * tuple. (See also: "Notes About Data Representation" in the nbtree
 		 * README.)
 		 */
+		scantid = skey->scantid;
+		if (!BTreeTupleIsPivot(itup))
+			skey->scantid = BTreeTupleGetMaxTID(itup);
 		if (!P_RIGHTMOST(topaque) &&
 			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
 			  invariant_l_offset(state, skey, P_HIKEY)))
@@ -1115,6 +1185,7 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
+		skey->scantid = scantid;
 
 		/*
 		 * * Item order check *
@@ -1129,11 +1200,13 @@ bt_target_page_check(BtreeCheckState *state)
 					   *htid,
 					   *nitid,
 					   *nhtid;
+			ItemPointer tid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetHeapTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 			nitid = psprintf("(%u,%u)", state->targetblock,
 							 OffsetNumberNext(offset));
 
@@ -1142,9 +1215,11 @@ bt_target_page_check(BtreeCheckState *state)
 										  state->target,
 										  OffsetNumberNext(offset));
 			itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+			tid = BTreeTupleGetHeapTID(itup);
 			nhtid = psprintf("(%u,%u)",
-							 ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							 ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							 ItemPointerGetBlockNumberNoCheck(tid),
+							 ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1154,10 +1229,10 @@ bt_target_page_check(BtreeCheckState *state)
 										"higher index tid=%s (points to %s tid=%s) "
 										"page lsn=%X/%X.",
 										itid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										htid,
 										nitid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										nhtid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
@@ -1918,10 +1993,11 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
  * verification.  In particular, it won't try to normalize opclass-equal
  * datums with potentially distinct representations (e.g., btree/numeric_ops
  * index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though.  For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have their own posting
+ * list, since dummy CREATE INDEX callback code generates new tuples with the
+ * same normalized representation.  Compression is performed
+ * opportunistically, and in general there is no guarantee about how or when
+ * compression will be applied.
  */
 static IndexTuple
 bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -2525,14 +2601,16 @@ static inline ItemPointer
 BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
 							bool nonpivot)
 {
-	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	ItemPointer result;
 	BlockNumber targetblock = state->targetblock;
 
-	if (result == NULL && nonpivot)
+	if (BTreeTupleIsPivot(itup) == nonpivot)
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
 						targetblock, RelationGetRelationName(state->rel))));
 
+	result = BTreeTupleGetHeapTID(itup);
+
 	return result;
 }
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 5890f393f6..f0c1174e2a 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -41,6 +41,17 @@ static OffsetNumber _bt_findinsertloc(Relation rel,
 									  BTStack stack,
 									  Relation heapRel);
 static void _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack);
+static void _bt_delete_and_insert(Relation rel,
+								  Buffer buf,
+								  IndexTuple newitup,
+								  OffsetNumber newitemoff);
+static void _bt_insertonpg_in_posting(Relation rel, BTScanInsert itup_key,
+									  Buffer buf,
+									  Buffer cbuf,
+									  BTStack stack,
+									  IndexTuple itup,
+									  OffsetNumber newitemoff,
+									  bool split_only_page, int in_posting_offset);
 static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   Buffer buf,
 						   Buffer cbuf,
@@ -56,6 +67,8 @@ static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 						 OffsetNumber itup_off);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void insert_itupprev_to_page(Page page, BTCompressState *compressState);
+static void _bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel);
 
 /*
  *	_bt_doinsert() -- Handle insertion of a single index tuple in the tree.
@@ -297,10 +310,17 @@ top:
 		 * search bounds established within _bt_check_unique when insertion is
 		 * checkingunique.
 		 */
+		insertstate.in_posting_offset = 0;
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
-		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
-					   itup, newitemoff, false);
+
+		if (insertstate.in_posting_offset)
+			_bt_insertonpg_in_posting(rel, itup_key, insertstate.buf,
+									  InvalidBuffer, stack, itup, newitemoff,
+									  false, insertstate.in_posting_offset);
+		else
+			_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer,
+						   stack, itup, newitemoff, false);
 	}
 	else
 	{
@@ -412,6 +432,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 			}
 
 			curitemid = PageGetItemId(page, offset);
+			Assert(!BTreeTupleIsPosting(curitup));
 
 			/*
 			 * We can skip items that are marked killed.
@@ -759,6 +780,26 @@ _bt_findinsertloc(Relation rel,
 			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
 			insertstate->bounds_valid = false;
 		}
+
+		/*
+		 * If the target page is full, try to compress the page
+		 */
+		if (PageGetFreeSpace(page) < insertstate->itemsz && !checkingunique)
+		{
+			_bt_compress_one_page(rel, insertstate->buf, heapRel);
+			insertstate->bounds_valid = false;		/* paranoia */
+
+			/*
+			 * FIXME: _bt_vacuum_one_page() won't have cleared the
+			 * BTP_HAS_GARBAGE flag when it didn't kill items.  Maybe we
+			 * should clear the BTP_HAS_GARBAGE flag bit from the page when
+			 * compression avoids a page split -- _bt_vacuum_one_page() is
+			 * expecting a page split that takes care of it.
+			 *
+			 * (On the other hand, maybe it doesn't matter very much.  A
+			 * comment update seems like the bare minimum we should do.)
+			 */
+		}
 	}
 	else
 	{
@@ -900,6 +941,208 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
 	insertstate->bounds_valid = false;
 }
 
+/*
+ * Delete tuple on newitemoff offset and insert newitup at the same offset.
+ * All checks of free space must have been done before calling this function.
+ *
+ * For use in posting tuple's update.
+ */
+static void
+_bt_delete_and_insert(Relation rel,
+					  Buffer buf,
+					  IndexTuple newitup,
+					  OffsetNumber newitemoff)
+{
+	Page		page = BufferGetPage(buf);
+	Size		newitupsz = IndexTupleSize(newitup);
+
+	newitupsz = MAXALIGN(newitupsz);
+
+	START_CRIT_SECTION();
+
+	PageIndexTupleDelete(page, newitemoff);
+
+	if (!_bt_pgaddtup(page, newitupsz, newitup, newitemoff))
+		elog(ERROR, "failed to insert compressed item in index \"%s\"",
+			 RelationGetRelationName(rel));
+
+	MarkBufferDirty(buf);
+
+	/* Xlog stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_btree_insert xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.offnum = newitemoff;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
+
+		Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+
+		/*
+		 * Force full page write to keep code simple
+		 *
+		 * TODO: think of using XLOG_BTREE_INSERT_LEAF with a new tuple's data
+		 */
+		XLogRegisterBuffer(0, buf, REGBUF_STANDARD | REGBUF_FORCE_IMAGE);
+		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_INSERT_LEAF);
+		PageSetLSN(page, recptr);
+	}
+
+	END_CRIT_SECTION();
+}
+
+/*
+ * _bt_insertonpg_in_posting() --
+ *		Insert a tuple on a particular page in the index
+ *		(compression aware version).
+ *
+ * If new tuple's key is equal to the key of a posting tuple that already
+ * exists on the page and it's TID falls inside the min/max range of
+ * existing posting list, update the posting tuple.
+ *
+ * It only can happen on leaf page.
+ *
+ * newitemoff - offset of the posting tuple we must update
+ * in_posting_offset - position of the new tuple's TID in posting list
+ *
+ * If necessary, split the page.
+ */
+static void
+_bt_insertonpg_in_posting(Relation rel,
+						  BTScanInsert itup_key,
+						  Buffer buf,
+						  Buffer cbuf,
+						  BTStack stack,
+						  IndexTuple itup,
+						  OffsetNumber newitemoff,
+						  bool split_only_page,
+						  int in_posting_offset)
+{
+	IndexTuple	origtup;
+	IndexTuple	lefttup;
+	IndexTuple	righttup;
+	ItemPointerData *ipd;
+	IndexTuple	newitup;
+	ItemId		itemid;
+	Page		page;
+	int			nipd,
+				nipd_right;
+
+	page = BufferGetPage(buf);
+	/* get old posting tuple */
+	itemid = PageGetItemId(page, newitemoff);
+	origtup = (IndexTuple) PageGetItem(page, itemid);
+	Assert(BTreeTupleIsPosting(origtup));
+	nipd = BTreeTupleGetNPosting(origtup);
+	Assert(in_posting_offset < nipd);
+	Assert(itup_key->scantid != NULL);
+	Assert(itup_key->heapkeyspace);
+
+	elog(DEBUG4, "(%u,%u) is min, (%u,%u) is max, (%u,%u) is new",
+		 ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(origtup)),
+		 ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(origtup)),
+		 ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(origtup)),
+		 ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(origtup)),
+		 ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(itup)),
+		 ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(itup)));
+
+	/*
+	 * Fist check if existing item is dead.
+	 *
+	 * Then check if the new itempointer fits into the tuple's posting list.
+	 *
+	 * Also check if new itempointer fits into the page.
+	 *
+	 * If not, posting tuple's split is required in both cases.
+	 *
+	 * XXX: Think some more about alignment - pg
+	 */
+	if (ItemIdIsDead(itemid))
+	{
+		/* FIXME: We need to call index_compute_xid_horizon_for_tuples() */
+		elog(DEBUG4, "replacing LP_DEAD posting list item, new off %d",
+			 newitemoff);
+		_bt_delete_and_insert(rel, buf, itup, newitemoff);
+		_bt_relbuf(rel, buf);
+	}
+	else if (BTMaxItemSize(page) < MAXALIGN(IndexTupleSize(origtup)) + MAXALIGN(sizeof(ItemPointerData)) ||
+			 PageGetFreeSpace(page) < MAXALIGN(IndexTupleSize(origtup)) + MAXALIGN(sizeof(ItemPointerData)))
+	{
+		/*
+		 * Split posting tuple into two halves.
+		 *
+		 * Left tuple contains all item pointes less than the new one and
+		 * right tuple contains new item pointer and all to the right.
+		 *
+		 * TODO Probably we can come up with more clever algorithm.
+		 */
+		lefttup = BTreeFormPostingTuple(origtup, BTreeTupleGetPosting(origtup),
+										in_posting_offset);
+
+		nipd_right = nipd - in_posting_offset + 1;
+		ipd = palloc0(sizeof(ItemPointerData) * nipd_right);
+		/* insert new item pointer */
+		memcpy(ipd, itup, sizeof(ItemPointerData));
+		/* copy item pointers from original tuple that belong on right */
+		memcpy(ipd + 1,
+			   BTreeTupleGetPostingN(origtup, in_posting_offset),
+			   sizeof(ItemPointerData) * (nipd - in_posting_offset));
+
+		righttup = BTreeFormPostingTuple(origtup, ipd, nipd_right);
+		elog(DEBUG4, "inserting inside posting list with split due to no space orig elements %d new off %d",
+			 nipd, in_posting_offset);
+
+		Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lefttup),
+								  BTreeTupleGetHeapTID(righttup)) < 0);
+
+		/*
+		 * Replace old tuple with a left tuple on a page.
+		 *
+		 * And insert righttuple using ordinary _bt_insertonpg() function If
+		 * split is required, _bt_insertonpg will handle it.
+		 *
+		 * FIXME: This doesn't seem very crash safe -- what if we fail after
+		 * _bt_delete_and_insert() but before _bt_insertonpg()?  We could
+		 * crash and then lose some of the logical tuples that used to be
+		 * contained within original posting list, but will now go into new
+		 * righttup posting list.
+		 */
+		_bt_delete_and_insert(rel, buf, lefttup, newitemoff);
+		_bt_insertonpg(rel, itup_key, buf, InvalidBuffer,
+					   stack, righttup, newitemoff + 1, false);
+
+		pfree(ipd);
+		pfree(lefttup);
+		pfree(righttup);
+	}
+	else
+	{
+		ipd = palloc0(sizeof(ItemPointerData) * (nipd + 1));
+		elog(DEBUG4, "inserting inside posting list due to apparent overlap");
+
+		/* copy item pointers from original tuple into ipd */
+		memcpy(ipd, BTreeTupleGetPosting(origtup),
+			   sizeof(ItemPointerData) * in_posting_offset);
+		/* add item pointer of the new tuple into ipd */
+		memcpy(ipd + in_posting_offset, itup, sizeof(ItemPointerData));
+		/* copy item pointers from old tuple into ipd */
+		memcpy(ipd + in_posting_offset + 1,
+			   BTreeTupleGetPostingN(origtup, in_posting_offset),
+			   sizeof(ItemPointerData) * (nipd - in_posting_offset));
+
+		newitup = BTreeFormPostingTuple(itup, ipd, nipd + 1);
+
+		_bt_delete_and_insert(rel, buf, newitup, newitemoff);
+
+		pfree(ipd);
+		pfree(newitup);
+		_bt_relbuf(rel, buf);
+	}
+}
+
 /*----------
  *	_bt_insertonpg() -- Insert a tuple on a particular page in the index.
  *
@@ -2290,3 +2533,186 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * the page.
 	 */
 }
+
+/*
+ * Add new item (compressed or not) to the page, while compressing it.
+ * If insertion failed, return false.
+ * Caller should consider this as compression failure and
+ * leave page uncompressed.
+ */
+static void
+insert_itupprev_to_page(Page page, BTCompressState *compressState)
+{
+	IndexTuple	to_insert;
+	OffsetNumber offnum = PageGetMaxOffsetNumber(page);
+
+	if (compressState->ntuples == 0)
+		to_insert = compressState->itupprev;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+											 compressState->ipd,
+											 compressState->ntuples);
+		to_insert = postingtuple;
+		pfree(compressState->ipd);
+	}
+
+	/* Add the new item into the page */
+	offnum = OffsetNumberNext(offnum);
+
+	elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d IndexTupleSize %zu free %zu",
+		 compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+
+	if (PageAddItem(page, (Item) to_insert, IndexTupleSize(to_insert),
+					offnum, false, false) == InvalidOffsetNumber)
+		elog(ERROR, "failed to add tuple to page while compresing it");
+
+	if (compressState->ntuples > 0)
+		pfree(to_insert);
+	compressState->ntuples = 0;
+}
+
+/*
+ * Before splitting the page, try to compress items to free some space.
+ * If compression didn't succeed, buffer will contain old state of the page.
+ * This function should be called after lp_dead items
+ * were removed by _bt_vacuum_one_page().
+ */
+static void
+_bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buffer);
+	Page		newpage;
+	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	bool		use_compression = false;
+	BTCompressState *compressState = NULL;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+
+	/*
+	 * Don't use compression for indexes with INCLUDEd columns and unique
+	 * indexes.
+	 */
+	use_compression = (IndexRelationGetNumberOfKeyAttributes(rel) ==
+					   IndexRelationGetNumberOfAttributes(rel) &&
+					   !rel->rd_index->indisunique);
+	if (!use_compression)
+		return;
+
+	/* init compress state needed to build posting tuples */
+	compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+	compressState->ipd = NULL;
+	compressState->ntuples = 0;
+	compressState->itupprev = NULL;
+	compressState->maxitemsize = BTMaxItemSize(page);
+	compressState->maxpostingsize = 0;
+
+	/*
+	 * Scan over all items to see which ones can be compressed
+	 */
+	minoff = P_FIRSTDATAKEY(opaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+	newpage = PageGetTempPageCopySpecial(page);
+	elog(DEBUG4, "_bt_compress_one_page rel: %s,blkno: %u",
+		 RelationGetRelationName(rel), BufferGetBlockNumber(buffer));
+
+	/* Copy High Key if any */
+	if (!P_RIGHTMOST(opaque))
+	{
+		ItemId		itemid = PageGetItemId(page, P_HIKEY);
+		Size		itemsz = ItemIdGetLength(itemid);
+		IndexTuple	item = (IndexTuple) PageGetItem(page, itemid);
+
+		if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to add highkey during compression");
+	}
+
+	/*
+	 * Iterate over tuples on the page, try to compress them into posting
+	 * lists and insert into new page.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemId = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemId);
+
+		/*
+		 * We do not expect to meet any DEAD items, since this function is
+		 * called right after _bt_vacuum_one_page(). If for some reason we
+		 * found dead item, don't compress it, to allow upcoming microvacuum
+		 * or vacuum clean it up.
+		 */
+		if (ItemIdIsDead(itemId))
+			continue;
+
+		if (compressState->itupprev != NULL)
+		{
+			int			n_equal_atts =
+			_bt_keep_natts_fast(rel, compressState->itupprev, itup);
+			int			itup_ntuples = BTreeTupleIsPosting(itup) ?
+			BTreeTupleGetNPosting(itup) : 1;
+
+			if (n_equal_atts > natts)
+			{
+				/*
+				 * When tuples are equal, create or update posting.
+				 *
+				 * If posting is too big, insert it on page and continue.
+				 */
+				if (compressState->maxitemsize >
+					MAXALIGN(((IndexTupleSize(compressState->itupprev)
+							   + (compressState->ntuples + itup_ntuples + 1) * sizeof(ItemPointerData)))))
+				{
+					_bt_add_posting_item(compressState, itup);
+				}
+				else
+				{
+					insert_itupprev_to_page(newpage, compressState);
+				}
+			}
+			else
+			{
+				insert_itupprev_to_page(newpage, compressState);
+			}
+		}
+
+		/*
+		 * Copy the tuple into temp variable itupprev to compare it with the
+		 * following tuple and maybe unite them into a posting tuple
+		 */
+		if (compressState->itupprev)
+			pfree(compressState->itupprev);
+		compressState->itupprev = CopyIndexTuple(itup);
+
+		Assert(IndexTupleSize(compressState->itupprev) <= compressState->maxitemsize);
+	}
+
+	/* Handle the last item. */
+	insert_itupprev_to_page(newpage, compressState);
+
+	START_CRIT_SECTION();
+
+	PageRestoreTempPage(newpage, page);
+	MarkBufferDirty(buffer);
+
+	/* Log full page write */
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+
+		recptr = log_newpage_buffer(buffer, true);
+		PageSetLSN(page, recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	elog(DEBUG4, "_bt_compress_one_page. success");
+}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 9c1f7de60f..86c662d4e6 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -983,14 +983,52 @@ _bt_page_recyclable(Page page)
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
 					OffsetNumber *itemnos, int nitems,
+					OffsetNumber *remainingoffset,
+					IndexTuple *remaining, int nremaining,
 					BlockNumber lastBlockVacuumed)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	Size		itemsz;
+	Size		remaining_sz = 0;
+	char	   *remaining_buf = NULL;
+
+	/* XLOG stuff, buffer for remainings */
+	if (nremaining && RelationNeedsWAL(rel))
+	{
+		Size		offset = 0;
+
+		for (int i = 0; i < nremaining; i++)
+			remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+		remaining_buf = palloc0(remaining_sz);
+		for (int i = 0; i < nremaining; i++)
+		{
+			itemsz = IndexTupleSize(remaining[i]);
+			memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+			offset += MAXALIGN(itemsz);
+		}
+		Assert(offset == remaining_sz);
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
+	/* Handle posting tuples here */
+	for (int i = 0; i < nremaining; i++)
+	{
+		/* At first, delete the old tuple. */
+		PageIndexTupleDelete(page, remainingoffset[i]);
+
+		itemsz = IndexTupleSize(remaining[i]);
+		itemsz = MAXALIGN(itemsz);
+
+		/* Add tuple with remaining ItemPointers to the page. */
+		if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to rewrite compressed item in index while doing vacuum");
+	}
+
 	/* Fix the page */
 	if (nitems > 0)
 		PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1058,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+		xlrec_vacuum.nremaining = nremaining;
+		xlrec_vacuum.ndeleted = nitems;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1073,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		if (nitems > 0)
 			XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
 
+		/*
+		 * Here we should save offnums and remaining tuples themselves. It's
+		 * important to restore them in correct order. At first, we must
+		 * handle remaining tuples and only after that other deleted items.
+		 */
+		if (nremaining > 0)
+		{
+			Assert(remaining_buf != NULL);
+			XLogRegisterBufData(0, (char *) remainingoffset,
+								nremaining * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, remaining_buf, remaining_sz);
+		}
+
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
 		PageSetLSN(page, recptr);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd5289ad..22fb228b81 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static ItemPointer btreevacuumPosting(BTVacState *vstate, IndexTuple itup,
+									  int *nremaining);
 
 
 /*
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 								 RBM_NORMAL, info->strategy);
 		LockBufferForCleanup(buf);
 		_bt_checkpage(rel, buf);
-		_bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+		_bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+							vstate.lastBlockVacuumed);
 		_bt_relbuf(rel, buf);
 	}
 
@@ -1193,6 +1196,9 @@ restart:
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
+		IndexTuple	remaining[MaxOffsetNumber];
+		OffsetNumber remainingoffset[MaxOffsetNumber];
+		int			nremaining;
 
 		/*
 		 * Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1235,7 @@ restart:
 		 * callback function.
 		 */
 		ndeletable = 0;
+		nremaining = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
 		if (callback)
@@ -1242,31 +1249,78 @@ restart:
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
-				/*
-				 * During Hot Standby we currently assume that
-				 * XLOG_BTREE_VACUUM records do not produce conflicts. That is
-				 * only true as long as the callback function depends only
-				 * upon whether the index tuple refers to heap tuples removed
-				 * in the initial heap scan. When vacuum starts it derives a
-				 * value of OldestXmin. Backends taking later snapshots could
-				 * have a RecentGlobalXmin with a later xid than the vacuum's
-				 * OldestXmin, so it is possible that row versions deleted
-				 * after OldestXmin could be marked as killed by other
-				 * backends. The callback function *could* look at the index
-				 * tuple state in isolation and decide to delete the index
-				 * tuple, though currently it does not. If it ever did, we
-				 * would need to reconsider whether XLOG_BTREE_VACUUM records
-				 * should cause conflicts. If they did cause conflicts they
-				 * would be fairly harsh conflicts, since we haven't yet
-				 * worked out a way to pass a useful value for
-				 * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
-				 * applies to *any* type of index that marks index tuples as
-				 * killed.
-				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				if (BTreeTupleIsPosting(itup))
+				{
+					int			nnewipd = 0;
+					ItemPointer newipd = NULL;
+
+					newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+					if (nnewipd == 0)
+					{
+						/*
+						 * All TIDs from posting list must be deleted, we can
+						 * delete whole tuple in a regular way.
+						 */
+						deletable[ndeletable++] = offnum;
+					}
+					else if (nnewipd == BTreeTupleGetNPosting(itup))
+					{
+						/*
+						 * All TIDs from posting tuple must remain. Do
+						 * nothing, just cleanup.
+						 */
+						pfree(newipd);
+					}
+					else if (nnewipd < BTreeTupleGetNPosting(itup))
+					{
+						/* Some TIDs from posting tuple must remain. */
+						Assert(nnewipd > 0);
+						Assert(newipd != NULL);
+
+						/*
+						 * Form new tuple that contains only remaining TIDs.
+						 * Remember this tuple and the offset of the old tuple
+						 * to update it in place.
+						 */
+						remainingoffset[nremaining] = offnum;
+						remaining[nremaining] = BTreeFormPostingTuple(itup, newipd, nnewipd);
+						nremaining++;
+						pfree(newipd);
+
+						Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+					}
+				}
+				else
+				{
+					htup = &(itup->t_tid);
+
+					/*
+					 * During Hot Standby we currently assume that
+					 * XLOG_BTREE_VACUUM records do not produce conflicts.
+					 * That is only true as long as the callback function
+					 * depends only upon whether the index tuple refers to
+					 * heap tuples removed in the initial heap scan. When
+					 * vacuum starts it derives a value of OldestXmin.
+					 * Backends taking later snapshots could have a
+					 * RecentGlobalXmin with a later xid than the vacuum's
+					 * OldestXmin, so it is possible that row versions deleted
+					 * after OldestXmin could be marked as killed by other
+					 * backends. The callback function *could* look at the
+					 * index tuple state in isolation and decide to delete the
+					 * index tuple, though currently it does not. If it ever
+					 * did, we would need to reconsider whether
+					 * XLOG_BTREE_VACUUM records should cause conflicts. If
+					 * they did cause conflicts they would be fairly harsh
+					 * conflicts, since we haven't yet worked out a way to
+					 * pass a useful value for latestRemovedXid on the
+					 * XLOG_BTREE_VACUUM records. This applies to *any* type
+					 * of index that marks index tuples as killed.
+					 */
+					if (callback(htup, callback_state))
+						deletable[ndeletable++] = offnum;
+				}
 			}
 		}
 
@@ -1274,7 +1328,7 @@ restart:
 		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
 		 * call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nremaining > 0)
 		{
 			/*
 			 * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1345,7 @@ restart:
 			 * that.
 			 */
 			_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+								remainingoffset, remaining, nremaining,
 								vstate->lastBlockVacuumed);
 
 			/*
@@ -1375,6 +1430,41 @@ restart:
 	}
 }
 
+/*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+	int			remaining = 0;
+	int			nitem = BTreeTupleGetNPosting(itup);
+	ItemPointer tmpitems = NULL,
+				items = BTreeTupleGetPosting(itup);
+
+	/*
+	 * Check each tuple in the posting list, save alive tuples into tmpitems
+	 */
+	for (int i = 0; i < nitem; i++)
+	{
+		if (vstate->callback(items + i, vstate->callback_state))
+			continue;
+
+		if (tmpitems == NULL)
+			tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+		tmpitems[remaining++] = items[i];
+	}
+
+	*nremaining = remaining;
+	return tmpitems;
+}
+
 /*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 19735bf733..20975970d6 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -30,6 +30,9 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+								OffsetNumber offnum, ItemPointer iptr,
+								IndexTuple itup, int i);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -504,7 +507,8 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, key, page, mid);
+		result = _bt_compare_posting(rel, key, page, mid,
+									 &(insertstate->in_posting_offset));
 
 		if (result >= cmpval)
 			low = mid + 1;
@@ -533,6 +537,55 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	return low;
 }
 
+/*
+ * Compare insertion-type scankey to tuple on a page,
+ * taking into account posting tuples.
+ * If the key of the posting tuple is equal to scankey,
+ * find exact position inside the posting list,
+ * using TID as extra attribute.
+ */
+int32
+_bt_compare_posting(Relation rel,
+					BTScanInsert key,
+					Page page,
+					OffsetNumber offnum,
+					int *in_posting_offset)
+{
+	IndexTuple	itup;
+	int			result;
+
+	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	result = _bt_compare(rel, key, page, offnum);
+
+	if (BTreeTupleIsPosting(itup) && result == 0)
+	{
+		int			low,
+					high,
+					mid,
+					res;
+
+		low = 0;
+		/* "high" is past end of posting list for loop invariant */
+		high = BTreeTupleGetNPosting(itup);
+
+		while (high > low)
+		{
+			mid = low + ((high - low) / 2);
+			res = ItemPointerCompare(key->scantid,
+									 BTreeTupleGetPostingN(itup, mid));
+
+			if (res >= 1)
+				low = mid + 1;
+			else
+				high = mid;
+		}
+
+		*in_posting_offset = high;
+	}
+
+	return result;
+}
+
 /*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
@@ -665,61 +718,120 @@ _bt_compare(Relation rel,
 	 * Use the heap TID attribute and scantid to try to break the tie.  The
 	 * rules are the same as any other key attribute -- only the
 	 * representation differs.
+	 *
+	 * When itup is a posting tuple, the check becomes more complex. It is
+	 * possible that the scankey belongs to the tuple's posting list TID
+	 * range.
+	 *
+	 * _bt_compare() is multipurpose, so it just returns 0 for a fact that key
+	 * matches tuple at this offset.
+	 *
+	 * Use special _bt_compare_posting() wrapper function to handle this case
+	 * and perform recheck for posting tuple, finding exact position of the
+	 * scankey.
 	 */
-	heapTid = BTreeTupleGetHeapTID(itup);
-	if (key->scantid == NULL)
+	if (!BTreeTupleIsPosting(itup))
 	{
+		heapTid = BTreeTupleGetHeapTID(itup);
+		if (key->scantid == NULL)
+		{
+			/*
+			 * Most searches have a scankey that is considered greater than a
+			 * truncated pivot tuple if and when the scankey has equal values
+			 * for attributes up to and including the least significant
+			 * untruncated attribute in tuple.
+			 *
+			 * For example, if an index has the minimum two attributes (single
+			 * user key attribute, plus heap TID attribute), and a page's high
+			 * key is ('foo', -inf), and scankey is ('foo', <omitted>), the
+			 * search will not descend to the page to the left.  The search
+			 * will descend right instead.  The truncated attribute in pivot
+			 * tuple means that all non-pivot tuples on the page to the left
+			 * are strictly < 'foo', so it isn't necessary to descend left. In
+			 * other words, search doesn't have to descend left because it
+			 * isn't interested in a match that has a heap TID value of -inf.
+			 *
+			 * However, some searches (pivotsearch searches) actually require
+			 * that we descend left when this happens.  -inf is treated as a
+			 * possible match for omitted scankey attribute(s).  This is
+			 * needed by page deletion, which must re-find leaf pages that are
+			 * targets for deletion using their high keys.
+			 *
+			 * Note: the heap TID part of the test ensures that scankey is
+			 * being compared to a pivot tuple with one or more truncated key
+			 * attributes.
+			 *
+			 * Note: pg_upgrade'd !heapkeyspace indexes must always descend to
+			 * the left here, since they have no heap TID attribute (and
+			 * cannot have any -inf key values in any case, since truncation
+			 * can only remove non-key attributes).  !heapkeyspace searches
+			 * must always be prepared to deal with matches on both sides of
+			 * the pivot once the leaf level is reached.
+			 */
+			if (key->heapkeyspace && !key->pivotsearch &&
+				key->keysz == ntupatts && heapTid == NULL)
+				return 1;
+
+			/* All provided scankey arguments found to be equal */
+			return 0;
+		}
+
 		/*
-		 * Most searches have a scankey that is considered greater than a
-		 * truncated pivot tuple if and when the scankey has equal values for
-		 * attributes up to and including the least significant untruncated
-		 * attribute in tuple.
-		 *
-		 * For example, if an index has the minimum two attributes (single
-		 * user key attribute, plus heap TID attribute), and a page's high key
-		 * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
-		 * will not descend to the page to the left.  The search will descend
-		 * right instead.  The truncated attribute in pivot tuple means that
-		 * all non-pivot tuples on the page to the left are strictly < 'foo',
-		 * so it isn't necessary to descend left.  In other words, search
-		 * doesn't have to descend left because it isn't interested in a match
-		 * that has a heap TID value of -inf.
-		 *
-		 * However, some searches (pivotsearch searches) actually require that
-		 * we descend left when this happens.  -inf is treated as a possible
-		 * match for omitted scankey attribute(s).  This is needed by page
-		 * deletion, which must re-find leaf pages that are targets for
-		 * deletion using their high keys.
-		 *
-		 * Note: the heap TID part of the test ensures that scankey is being
-		 * compared to a pivot tuple with one or more truncated key
-		 * attributes.
-		 *
-		 * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
-		 * left here, since they have no heap TID attribute (and cannot have
-		 * any -inf key values in any case, since truncation can only remove
-		 * non-key attributes).  !heapkeyspace searches must always be
-		 * prepared to deal with matches on both sides of the pivot once the
-		 * leaf level is reached.
+		 * Treat truncated heap TID as minus infinity, since scankey has a key
+		 * attribute value (scantid) that would otherwise be compared directly
 		 */
-		if (key->heapkeyspace && !key->pivotsearch &&
-			key->keysz == ntupatts && heapTid == NULL)
+		Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+		if (heapTid == NULL)
 			return 1;
 
-		/* All provided scankey arguments found to be equal */
-		return 0;
+		Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+		return ItemPointerCompare(key->scantid, heapTid);
+	}
+	else
+	{
+		heapTid = BTreeTupleGetHeapTID(itup);
+		if (key->scantid != NULL && heapTid != NULL)
+		{
+			int			cmp = ItemPointerCompare(key->scantid, heapTid);
+
+			if (cmp == -1 || cmp == 0)
+			{
+				elog(DEBUG4, "offnum %d Scankey (%u,%u) is less than or equal to posting tuple (%u,%u)",
+					 offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+					 ItemPointerGetOffsetNumberNoCheck(key->scantid),
+					 ItemPointerGetBlockNumberNoCheck(heapTid),
+					 ItemPointerGetOffsetNumberNoCheck(heapTid));
+				return cmp;
+			}
+
+			heapTid = BTreeTupleGetMaxTID(itup);
+			cmp = ItemPointerCompare(key->scantid, heapTid);
+			if (cmp == 1)
+			{
+				elog(DEBUG4, "offnum %d Scankey (%u,%u) is greater than posting tuple (%u,%u)",
+					 offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+					 ItemPointerGetOffsetNumberNoCheck(key->scantid),
+					 ItemPointerGetBlockNumberNoCheck(heapTid),
+					 ItemPointerGetOffsetNumberNoCheck(heapTid));
+				return cmp;
+			}
+
+			/*
+			 * if we got here, scantid is inbetween of posting items of the
+			 * tuple
+			 */
+			elog(DEBUG4, "offnum %d Scankey (%u,%u) is between posting items (%u,%u) and (%u,%u)",
+				 offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+				 ItemPointerGetOffsetNumberNoCheck(key->scantid),
+				 ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(itup)),
+				 ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(itup)),
+				 ItemPointerGetBlockNumberNoCheck(heapTid),
+				 ItemPointerGetOffsetNumberNoCheck(heapTid));
+			return 0;
+		}
 	}
 
-	/*
-	 * Treat truncated heap TID as minus infinity, since scankey has a key
-	 * attribute value (scantid) that would otherwise be compared directly
-	 */
-	Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
-	if (heapTid == NULL)
-		return 1;
-
-	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	return ItemPointerCompare(key->scantid, heapTid);
+	return 0;
 }
 
 /*
@@ -1456,6 +1568,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	/* initialize tuple workspace to empty */
 	so->currPos.nextTupleOffset = 0;
+	so->currPos.prevTupleOffset = 0;
 
 	/*
 	 * Now that the current page has been made consistent, the macro should be
@@ -1490,8 +1603,22 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				if (!BTreeTupleIsPosting(itup))
+				{
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+				else
+				{
+					/* Return posting list "logical" tuples */
+					for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											itup, i);
+						itemIndex++;
+					}
+				}
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1524,7 +1651,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1532,7 +1659,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxPostingIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1574,8 +1701,23 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (passes_quals && tuple_alive)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				if (!BTreeTupleIsPosting(itup))
+				{
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+				else
+				{
+					/* Return posting list "logical" tuples */
+					/* XXX: Maybe this loop should be backwards? */
+					for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						itemIndex--;
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											itup, i);
+					}
+				}
 			}
 			if (!continuescan)
 			{
@@ -1589,8 +1731,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1603,6 +1745,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1615,6 +1759,33 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 	}
 }
 
+/* Save an index item into so->currPos.items[itemIndex] for posting tuples. */
+static void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer iptr, IndexTuple itup, int i)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *iptr;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		if (i == 0)
+		{
+			/* save key. the same for all tuples in the posting */
+			Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+			currItem->tupleOffset = so->currPos.nextTupleOffset;
+			memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+			so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+			so->currPos.prevTupleOffset = currItem->tupleOffset;
+		}
+		else
+			currItem->tupleOffset = so->currPos.prevTupleOffset;
+	}
+}
+
 /*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index b30cf9e989..b058599aa4 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -288,6 +288,8 @@ static void _bt_sortaddtup(Page page, Size itemsize,
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 						 IndexTuple itup);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void _bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+								 BTCompressState *compressState);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
 static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -972,6 +974,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * only shift the line pointer array back and forth, and overwrite
 			 * the tuple space previously occupied by oitup.  This is fairly
 			 * cheap.
+			 *
+			 * If lastleft tuple was a posting tuple, we'll truncate its
+			 * posting list in _bt_truncate as well. Note that it is also
+			 * applicable only to leaf pages, since internal pages never
+			 * contain posting tuples.
 			 */
 			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
 			lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1011,6 +1018,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		 * the minimum key for the new page.
 		 */
 		state->btps_minkey = CopyIndexTuple(oitup);
+		Assert(BTreeTupleIsPivot(state->btps_minkey));
 
 		/*
 		 * Set the sibling links for both pages.
@@ -1052,6 +1060,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(state->btps_minkey == NULL);
 		state->btps_minkey = CopyIndexTuple(itup);
 		/* _bt_sortaddtup() will perform full truncation later */
+		BTreeTupleClearBtIsPosting(state->btps_minkey);
 		BTreeTupleSetNAtts(state->btps_minkey, 0);
 	}
 
@@ -1136,6 +1145,91 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
 }
 
+/*
+ * Add new tuple (posting or non-posting) to the page while building index.
+ */
+static void
+_bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+					 BTCompressState *compressState)
+{
+	IndexTuple	to_insert;
+
+	/* Return, if there is no tuple to insert */
+	if (state == NULL)
+		return;
+
+	if (compressState->ntuples == 0)
+		to_insert = compressState->itupprev;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+											 compressState->ipd,
+											 compressState->ntuples);
+		to_insert = postingtuple;
+		pfree(compressState->ipd);
+	}
+
+	_bt_buildadd(wstate, state, to_insert);
+
+	if (compressState->ntuples > 0)
+		pfree(to_insert);
+	compressState->ntuples = 0;
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in compressState.
+ *
+ * Helper function for _bt_load() and _bt_compress_one_page().
+ *
+ * Note: caller is responsible for size check to ensure that resulting tuple
+ * won't exceed BTMaxItemSize.
+ */
+void
+_bt_add_posting_item(BTCompressState *compressState, IndexTuple itup)
+{
+	int			nposting = 0;
+
+	if (compressState->ntuples == 0)
+	{
+		compressState->ipd = palloc0(compressState->maxitemsize);
+
+		if (BTreeTupleIsPosting(compressState->itupprev))
+		{
+			/* if itupprev is posting, add all its TIDs to the posting list */
+			nposting = BTreeTupleGetNPosting(compressState->itupprev);
+			memcpy(compressState->ipd,
+				   BTreeTupleGetPosting(compressState->itupprev),
+				   sizeof(ItemPointerData) * nposting);
+			compressState->ntuples += nposting;
+		}
+		else
+		{
+			memcpy(compressState->ipd, compressState->itupprev,
+				   sizeof(ItemPointerData));
+			compressState->ntuples++;
+		}
+	}
+
+	if (BTreeTupleIsPosting(itup))
+	{
+		/* if tuple is posting, add all its TIDs to the posting list */
+		nposting = BTreeTupleGetNPosting(itup);
+		memcpy(compressState->ipd + compressState->ntuples,
+			   BTreeTupleGetPosting(itup),
+			   sizeof(ItemPointerData) * nposting);
+		compressState->ntuples += nposting;
+	}
+	else
+	{
+		memcpy(compressState->ipd + compressState->ntuples, itup,
+			   sizeof(ItemPointerData));
+		compressState->ntuples++;
+	}
+}
+
 /*
  * Read tuples in correct sort order from tuplesort, and load them into
  * btree leaves.
@@ -1150,9 +1244,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	bool		load1;
 	TupleDesc	tupdes = RelationGetDescr(wstate->index);
 	int			i,
-				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+				natts = IndexRelationGetNumberOfAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		use_compression = false;
+	BTCompressState *compressState = NULL;
+
+	/*
+	 * Don't use compression for indexes with INCLUDEd columns and unique
+	 * indexes.
+	 */
+	use_compression = (IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+					   IndexRelationGetNumberOfAttributes(wstate->index) &&
+					   !wstate->index->rd_index->indisunique);
 
 	if (merge)
 	{
@@ -1266,19 +1371,89 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	}
 	else
 	{
-		/* merge is unnecessary */
-		while ((itup = tuplesort_getindextuple(btspool->sortstate,
-											   true)) != NULL)
+		if (!use_compression)
 		{
-			/* When we see first tuple, create first index page */
-			if (state == NULL)
-				state = _bt_pagestate(wstate, 0);
+			/* merge is unnecessary */
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+					state = _bt_pagestate(wstate, 0);
 
-			_bt_buildadd(wstate, state, itup);
+				_bt_buildadd(wstate, state, itup);
 
-			/* Report progress */
-			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
-										 ++tuples_done);
+				/* Report progress */
+				pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+											 ++tuples_done);
+			}
+		}
+		else
+		{
+			/* init compress state needed to build posting tuples */
+			compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+			compressState->ipd = NULL;
+			compressState->ntuples = 0;
+			compressState->itupprev = NULL;
+			compressState->maxitemsize = 0;
+			compressState->maxpostingsize = 0;
+
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+				{
+					state = _bt_pagestate(wstate, 0);
+					compressState->maxitemsize = BTMaxItemSize(state->btps_page);
+				}
+
+				if (compressState->itupprev != NULL)
+				{
+					int			n_equal_atts = _bt_keep_natts_fast(wstate->index,
+																   compressState->itupprev, itup);
+
+					if (n_equal_atts > natts)
+					{
+						/*
+						 * Tuples are equal. Create or update posting.
+						 *
+						 * Else If posting is too big, insert it on page and
+						 * continue.
+						 */
+						if ((compressState->ntuples + 1) * sizeof(ItemPointerData) <
+							compressState->maxpostingsize)
+							_bt_add_posting_item(compressState, itup);
+						else
+							_bt_buildadd_posting(wstate, state,
+												 compressState);
+					}
+					else
+					{
+						/*
+						 * Tuples are not equal. Insert itupprev into index.
+						 * Save current tuple for the next iteration.
+						 */
+						_bt_buildadd_posting(wstate, state, compressState);
+					}
+				}
+
+				/*
+				 * Save the tuple to compare it with the next one and maybe
+				 * unite them into a posting tuple.
+				 */
+				if (compressState->itupprev)
+					pfree(compressState->itupprev);
+				compressState->itupprev = CopyIndexTuple(itup);
+
+				/* compute max size of posting list */
+				compressState->maxpostingsize = compressState->maxitemsize -
+					IndexInfoFindDataOffset(compressState->itupprev->t_info) -
+					MAXALIGN(IndexTupleSize(compressState->itupprev));
+			}
+
+			/* Handle the last item */
+			_bt_buildadd_posting(wstate, state, compressState);
 		}
 	}
 
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index a7882fd874..77e1d46672 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -459,6 +459,7 @@ _bt_recsplitloc(FindSplitData *state,
 	int16		leftfree,
 				rightfree;
 	Size		firstrightitemsz;
+	Size		postingsubhikey = 0;
 	bool		newitemisfirstonright;
 
 	/* Is the new item going to be the first item on the right page? */
@@ -466,10 +467,33 @@ _bt_recsplitloc(FindSplitData *state,
 							 && !newitemonleft);
 
 	if (newitemisfirstonright)
+	{
 		firstrightitemsz = state->newitemsz;
+
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+			postingsubhikey = IndexTupleSize(state->newitem) -
+							  BTreeTupleGetPostingOffset(state->newitem);
+	}
 	else
+	{
 		firstrightitemsz = firstoldonrightsz;
 
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf)
+		{
+			ItemId	 itemid;
+			IndexTuple newhighkey;
+
+			itemid = PageGetItemId(state->page, firstoldonright);
+			newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+			if (BTreeTupleIsPosting(newhighkey))
+				postingsubhikey = IndexTupleSize(newhighkey) -
+								  BTreeTupleGetPostingOffset(newhighkey);
+		}
+	}
+
 	/* Account for all the old tuples */
 	leftfree = state->leftspace - olddataitemstoleft;
 	rightfree = state->rightspace -
@@ -492,9 +516,13 @@ _bt_recsplitloc(FindSplitData *state,
 	 * adding a heap TID to the left half's new high key when splitting at the
 	 * leaf level.  In practice the new high key will often be smaller and
 	 * will rarely be larger, but conservatively assume the worst case.
+	 * Truncation always truncates away any posting list that appears in the
+	 * first right tuple, though, so it's safe to subtract that overhead
+	 * (while still conservatively assuming that truncation might have to add
+	 * back a single heap TID using the pivot tuple heap TID representation).
 	 */
 	if (state->is_leaf)
-		leftfree -= (int16) (firstrightitemsz +
+		leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
 							 MAXALIGN(sizeof(ItemPointerData)));
 	else
 		leftfree -= (int16) firstrightitemsz;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 93fab264ae..75ba61c0c9 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -111,8 +111,12 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	key->nextkey = false;
 	key->pivotsearch = false;
 	key->keysz = Min(indnkeyatts, tupnatts);
-	key->scantid = key->heapkeyspace && itup ?
-		BTreeTupleGetHeapTID(itup) : NULL;
+
+	if (itup && key->heapkeyspace)
+		key->scantid = BTreeTupleGetHeapTID(itup);
+	else
+		key->scantid = NULL;
+
 	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
@@ -1787,7 +1791,9 @@ _bt_killitems(IndexScanDesc scan)
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			/* No microvacuum for posting tuples */
+			if (!BTreeTupleIsPosting(ituple) &&
+				(ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
 			{
 				/* found the item */
 				ItemIdMarkDead(iid);
@@ -2145,6 +2151,16 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(firstright))
+		{
+			BTreeTupleClearBtIsPosting(pivot);
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			pivot->t_info &= ~INDEX_SIZE_MASK;
+			pivot->t_info |= BTreeTupleGetPostingOffset(firstright);
+		}
+
+		Assert(!BTreeTupleIsPosting(pivot));
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2161,6 +2177,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * attribute to the new pivot tuple.
 		 */
 		Assert(natts != nkeyatts);
+		Assert(!BTreeTupleIsPosting(lastleft));
+		Assert(!BTreeTupleIsPosting(firstright));
 		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
 		tidpivot = palloc0(newsize);
 		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2168,6 +2186,27 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		pfree(pivot);
 		pivot = tidpivot;
 	}
+	else if (BTreeTupleIsPosting(firstright))
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal. But
+		 * the tuple is a compressed tuple with a posting list, so we still
+		 * must truncate it.
+		 *
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		newsize = BTreeTupleGetPostingOffset(firstright) +
+			MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+		pivot->t_info &= ~INDEX_SIZE_MASK;
+		pivot->t_info |= newsize;
+		BTreeTupleClearBtIsPosting(pivot);
+		BTreeTupleSetAltHeapTID(pivot);
+
+		Assert(!BTreeTupleIsPosting(pivot));
+	}
 	else
 	{
 		/*
@@ -2205,7 +2244,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2216,9 +2255,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+							  BTreeTupleGetHeapTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #else
 
 	/*
@@ -2231,7 +2273,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2240,7 +2282,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2330,6 +2373,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * leaving excessive amounts of free space on either side of page split.
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts.
+ *
+ * To build a posting tuple we need to ensure that all attributes
+ * of both tuples are equal. Use this function to compare them.
+ * TODO: maybe it's worth to rename the function.
+ *
+ * XXX: Obviously we need infrastructure for making sure it is okay to use
+ * this for posting list stuff.  For example, non-deterministic collations
+ * cannot use compression, and will not work with what we have now.
+ *
+ * XXX: Even then, we probably also need to worry about TOAST as a special
+ * case.  Don't repeat bugs like the amcheck bug that was fixed in commit
+ * eba775345d23d2c999bbb412ae658b6dab36e3e8.  As the test case added in that
+ * commit shows, we need to worry about pg_attribute.attstorage changing in
+ * the underlying table due to an ALTER TABLE (and maybe a few other things
+ * like that).  In general, the "TOAST input state" of a TOASTable datum isn't
+ * something that we make many guarantees about today, so even with C
+ * collation text we could in theory get different answers from
+ * _bt_keep_natts_fast() and _bt_keep_natts().  This needs to be nailed down
+ * in some way.
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2415,7 +2477,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * Non-pivot tuples currently never use alternative heap TID
 			 * representation -- even those within heapkeyspace indexes
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
@@ -2470,7 +2532,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * that to decide if the tuple is a pre-v11 tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
+				(!BTreeTupleIsPivot(itup) &&
 				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
 		}
 		else
@@ -2497,7 +2559,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
 		return false;
 
 	/*
@@ -2549,6 +2611,8 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 	if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
 		return;
 
+	/* TODO correct error messages for posting tuples */
+
 	/*
 	 * Internal page insertions cannot fail here, because that would mean that
 	 * an earlier leaf level insertion that should have failed didn't
@@ -2575,3 +2639,79 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Given a basic tuple that contains key datum and posting list,
+ * build a posting tuple.
+ *
+ * Basic tuple can be a posting tuple, but we only use key part of it,
+ * all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1 fallback to building a non-posting tuple.
+ * It is necessary to avoid storage overhead after posting tuple was vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+	uint32		keysize,
+				newsize = 0;
+	IndexTuple	itup;
+
+	/* We only need key part of the tuple */
+	if (BTreeTupleIsPosting(tuple))
+		keysize = BTreeTupleGetPostingOffset(tuple);
+	else
+		keysize = IndexTupleSize(tuple);
+
+	Assert(nipd > 0);
+
+	/* Add space needed for posting list */
+	if (nipd > 1)
+		newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+	else
+		newsize = keysize;
+
+	newsize = MAXALIGN(newsize);
+	itup = palloc0(newsize);
+	memcpy(itup, tuple, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+
+	if (nipd > 1)
+	{
+		/* Form posting tuple, fill posting fields */
+
+		/* Set meta info about the posting list */
+		itup->t_info |= INDEX_ALT_TID_MASK;
+		BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+		/* sort the list to preserve TID order invariant */
+		qsort((void *) ipd, nipd, sizeof(ItemPointerData),
+			  (int (*) (const void *, const void *)) ItemPointerCompare);
+
+		/* Copy posting list into the posting tuple */
+		memcpy(BTreeTupleGetPosting(itup), ipd,
+			   sizeof(ItemPointerData) * nipd);
+	}
+	else
+	{
+		/* To finish building of a non-posting tuple, copy TID from ipd */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		ItemPointerCopy(ipd, &itup->t_tid);
+	}
+
+	return itup;
+}
+
+/*
+ * Opposite of BTreeFormPostingTuple.
+ * returns regular tuple that contains the key,
+ * the tid of the new tuple is the nth tid of original tuple's posting list
+ * result tuple palloc'd in a caller's context.
+ */
+IndexTuple
+BTreeGetNthTupleOfPosting(IndexTuple tuple, int n)
+{
+	Assert(BTreeTupleIsPosting(tuple));
+	return BTreeFormPostingTuple(tuple, BTreeTupleGetPostingN(tuple, n), 1);
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c1aa..538a6bc8a7 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -386,8 +386,8 @@ btree_xlog_vacuum(XLogReaderState *record)
 	Buffer		buffer;
 	Page		page;
 	BTPageOpaque opaque;
-#ifdef UNUSED
 	xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
 
 	/*
 	 * This section of code is thought to be no longer needed, after analysis
@@ -478,14 +478,34 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		if (len > 0)
 		{
-			OffsetNumber *unused;
-			OffsetNumber *unend;
+			if (xlrec->nremaining)
+			{
+				OffsetNumber *remainingoffset;
+				IndexTuple	remaining;
+				Size		itemsz;
 
-			unused = (OffsetNumber *) ptr;
-			unend = (OffsetNumber *) ((char *) ptr + len);
+				remainingoffset = (OffsetNumber *)
+					(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+				remaining = (IndexTuple) ((char *) remainingoffset +
+										  xlrec->nremaining * sizeof(OffsetNumber));
 
-			if ((unend - unused) > 0)
-				PageIndexMultiDelete(page, unused, unend - unused);
+				/* Handle posting tuples */
+				for (int i = 0; i < xlrec->nremaining; i++)
+				{
+					PageIndexTupleDelete(page, remainingoffset[i]);
+
+					itemsz = MAXALIGN(IndexTupleSize(remaining));
+
+					if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+									false, false) == InvalidOffsetNumber)
+						elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+					remaining = (IndexTuple) ((char *) remaining + itemsz);
+				}
+			}
+
+			if (xlrec->ndeleted)
+				PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 		}
 
 		/*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index a14eb792ec..e4fa99ad27 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -46,8 +46,10 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "lastBlockVacuumed %u",
-								 xlrec->lastBlockVacuumed);
+				appendStringInfo(buf, "lastBlockVacuumed %u; nremaining %u; ndeleted %u",
+								 xlrec->lastBlockVacuumed,
+								 xlrec->nremaining,
+								 xlrec->ndeleted);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 744ffb6c61..b10c0d5255 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -141,6 +141,10 @@ typedef IndexAttributeBitMapData * IndexAttributeBitMap;
  * On such a page, N tuples could take one MAXALIGN quantum less space than
  * estimated here, seemingly allowing one more tuple than estimated here.
  * But such a page always has at least MAXALIGN special space, so we're safe.
+ *
+ * Note: btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so they may contain more tuples.
+ * Use MaxPostingIndexTuplesPerPage instead.
  */
 #define MaxIndexTuplesPerPage	\
 	((int) ((BLCKSZ - SizeOfPageHeaderData) / \
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 83e0e6c28e..bacc77b258 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
  * All other types of index tuples ("pivot" tuples) only have key columns,
  * since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,39 @@ typedef struct BTMetaPageData
  * omitted rather than truncated, since its representation is different to
  * the non-pivot representation.)
  *
+ * Non-pivot posting tuple format:
+ *  t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicated keys more effectively,
+ * we use special format of tuples - posting tuples.
+ * posting_list is an array of ItemPointerData.
+ *
+ * This type of compression never applies to system indexes, unique indexes
+ * or indexes with INCLUDEd columns.
+ *
+ * To differ posting tuples we use INDEX_ALT_TID_MASK flag in t_info and
+ * BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's tid:
+ * - t_tid.ip_blkid contains offset of the posting list.
+ * - t_tid offset field contains number of posting items this tuple contain
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * tuples, which is constrainted by BTMaxItemSize.
+
+ * If page contains so many duplicates, that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), page may contain several posting
+ * tuples with the same key.
+ * Also page can contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
  * Pivot tuple format:
  *
  *  t_tid | t_info | key values | [heap TID]
@@ -281,23 +313,144 @@ typedef struct BTMetaPageData
  * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
  * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
  * number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinct posting tuples from pivot tuples.
  *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_N_POSTING_OFFSET_MASK	0x0FFF
 #define BT_HEAP_TID_ATTR			0x1000
+#define BT_IS_POSTING				0x2000
 
-/* Get/set downlink block number */
+#define BTreeTupleIsPosting(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+	)
+
+#define BTreeTupleIsPivot(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+	)
+
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger then
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+	((int) ((BLCKSZ - SizeOfPageHeaderData - \
+			3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+			(sizeof(ItemPointerData)))
+
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * Iterating over tuples during index build or applying compression to a
+ * single page, we remember a tuple in itupprev, then compare the next one
+ * with it. If tuples are equal, save their TIDs in the posting list.
+ * ntuples contains the size of the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTCompressState
+{
+	Size		maxitemsize;
+	Size		maxpostingsize;
+	IndexTuple	itupprev;
+	int			ntuples;
+	ItemPointerData *ipd;
+} BTCompressState;
+
+/* macros to work with posting tuples *BEGIN* */
+#define BTreeTupleSetBtIsPosting(itup) \
+	do { \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleGetNPosting(itup)	\
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+	)
+
+#define BTreeTupleSetNPosting(itup, n) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+		BTreeTupleSetBtIsPosting(itup); \
+	} while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list.
+ * Caller is responsible for checking BTreeTupleIsPosting to ensure that it
+ * will get what is expected.
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+	)
+#define BTreeTupleSetPostingOffset(itup, offset) \
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerSetBlockNumber(&((itup)->t_tid), (offset)) \
+	)
+#define BTreeSetPostingMeta(itup, nposting, off) \
+	do { \
+		BTreeTupleSetNPosting(itup, nposting); \
+		BTreeTupleSetPostingOffset(itup, off); \
+	} while(0)
+
+#define BTreeTupleGetPosting(itup) \
+	(ItemPointerData*) ((char*)(itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+	(ItemPointerData*) (BTreeTupleGetPosting(itup) + (n))
+
+/*
+ * Posting tuples always contain more than one TID.  The minimum TID can be
+ * accessed using BTreeTupleGetHeapTID().  The maximum is accessed using
+ * BTreeTupleGetMaxTID().
+ */
+#define BTreeTupleGetMaxTID(itup) \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+		( \
+			(ItemPointer) (BTreeTupleGetPosting(itup) + (BTreeTupleGetNPosting(itup)-1)) \
+		) \
+		: \
+		(ItemPointer) &((itup)->t_tid) \
+	)
+/* macros to work with posting tuples *END* */
+
+/* Get/set downlink block number  */
 #define BTreeInnerTupleGetDownLink(itup) \
 	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
 #define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,7 +479,8 @@ typedef struct BTMetaPageData
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
 		( \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
@@ -335,6 +489,7 @@ typedef struct BTMetaPageData
 	)
 #define BTreeTupleSetNAtts(itup, n) \
 	do { \
+		Assert(!BTreeTupleIsPosting(itup)); \
 		(itup)->t_info |= INDEX_ALT_TID_MASK; \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
 	} while(0)
@@ -342,6 +497,8 @@ typedef struct BTMetaPageData
 /*
  * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
  * and non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * For non-pivot posting tuples this returns the first tid from posting list.
  */
 #define BTreeTupleGetHeapTID(itup) \
 	( \
@@ -351,7 +508,10 @@ typedef struct BTMetaPageData
 		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
 					   sizeof(ItemPointerData)) \
 	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+	  : (itup)->t_info & INDEX_ALT_TID_MASK ? \
+		(((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0) ? \
+		(ItemPointer) BTreeTupleGetPosting(itup) : NULL) \
+		: (ItemPointer) &((itup)->t_tid) \
 	)
 /*
  * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
@@ -360,6 +520,7 @@ typedef struct BTMetaPageData
 #define BTreeTupleSetAltHeapTID(itup) \
 	do { \
 		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
 								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
 	} while(0)
@@ -500,6 +661,12 @@ typedef struct BTInsertStateData
 	/* Buffer containing leaf page we're likely to insert itup on */
 	Buffer		buf;
 
+	/*
+	 * if _bt_binsrch_insert() found the location inside existing posting
+	 * list, save the position inside the list.
+	 */
+	int			in_posting_offset;
+
 	/*
 	 * Cache of bounds within the current buffer.  Only used for insertions
 	 * where _bt_check_unique is called.  See _bt_binsrch_insert and
@@ -567,6 +734,8 @@ typedef struct BTScanPosData
 	 * location in the associated tuple storage workspace.
 	 */
 	int			nextTupleOffset;
+	/* prevTupleOffset is for posting list handling */
+	int			prevTupleOffset;
 
 	/*
 	 * The items array is always ordered in index order (ie, increasing
@@ -579,7 +748,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxPostingIndexTuplesPerPage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -763,6 +932,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems, Relation heapRel);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems,
+								OffsetNumber *remainingoffset,
+								IndexTuple *remaining, int nremaining,
 								BlockNumber lastBlockVacuumed);
 extern int	_bt_pagedel(Relation rel, Buffer buf);
 
@@ -775,6 +946,8 @@ extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
 							bool forupdate, BTStack stack, int access, Snapshot snapshot);
 extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
 extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
+extern int32 _bt_compare_posting(Relation rel, BTScanInsert key, Page page,
+								 OffsetNumber offnum, int *in_posting_offset);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -813,6 +986,9 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd,
+										int nipd);
+extern IndexTuple BTreeGetNthTupleOfPosting(IndexTuple tuple, int n);
 
 /*
  * prototypes for functions in nbtvalidate.c
@@ -825,5 +1001,7 @@ extern bool btvalidate(Oid opclassoid);
 extern IndexBuildResult *btbuild(Relation heap, Relation index,
 								 struct IndexInfo *indexInfo);
 extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+extern void _bt_add_posting_item(BTCompressState *compressState,
+								 IndexTuple itup);
 
 #endif							/* NBTREE_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index afa614da25..4b615e0d36 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -173,10 +173,19 @@ typedef struct xl_btree_vacuum
 {
 	BlockNumber lastBlockVacuumed;
 
-	/* TARGET OFFSET NUMBERS FOLLOW */
+	/*
+	 * This field helps us to find beginning of the remaining tuples from
+	 * postings which follow array of offset numbers.
+	 */
+	uint32		nremaining;
+	uint32		ndeleted;
+
+	/* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+	/* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
+	/* TARGET OFFSET NUMBERS FOLLOW (if any) */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(BlockNumber))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.
-- 
2.17.1

v5-0003-DEBUG-Add-pageinspect-instrumentation.patchapplication/octet-stream; name=v5-0003-DEBUG-Add-pageinspect-instrumentation.patchDownload

From 20f251e1c3fb9da636f0844f3db9406a2090d548 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v5 3/3] DEBUG: Add pageinspect instrumentation.

Have pageinspect display user-visible attribute values.

This patch is not proposed for inclusion in PostgreSQL; it's included
for the convenience of reviewers.

The following query can be used with this hacked pageinspect, which
visualizes the internal pages:

"""

with recursive index_details as (
  select
    'my_test_index'::text idx
),
size_in_pages_index as (
  select
    (pg_relation_size(idx::regclass) / (2^13))::int4 size_pages
  from
    index_details
),
page_stats as (
  select
    index_details.*,
    stats.*
  from
    index_details,
    size_in_pages_index,
    lateral (select i from generate_series(1, size_pages - 1) i) series,
    lateral (select * from bt_page_stats(idx, i)) stats),
internal_page_stats as (
  select
    *
  from
    page_stats
  where
    type != 'l'),
meta_stats as (
  select
    *
  from
    index_details s,
    lateral (select * from bt_metap(s.idx)) meta),
internal_items as (
  select
    *
  from
    internal_page_stats
  order by
    btpo desc),
-- XXX: Note ordering dependency within this CTE, on internal_items
ordered_internal_items(item, blk, level) as (
  select
    1,
    blkno,
    btpo
  from
    internal_items
  where
    btpo_prev = 0
    and btpo = (select level from meta_stats)
  union
  select
    case when level = btpo then o.item + 1 else 1 end,
    blkno,
    btpo
  from
    internal_items i,
    ordered_internal_items o
  where
    i.btpo_prev = o.blk or (btpo_prev = 0 and btpo = o.level - 1)
)
select
  --idx,
  btpo as level,
  item as l_item,
  blkno,
  --btpo_prev,
  --btpo_next,
  btpo_flags,
  type,
  live_items,
  dead_items,
  avg_item_size,
  page_size,
  free_size,
  -- Only non-rightmost pages have high key.  Show heap TID for both pivot and non-pivot tuples here.
  case when btpo_next != 0 then (select data || coalesce(', (htid)=(''' || htid || ''')', '')
                                 from bt_page_items(idx, blkno) where itemoffset = 1) end as highkey
from
  ordered_internal_items o
  join internal_items i on o.blk = i.blkno
order by btpo desc, item;
"""
---
 contrib/pageinspect/btreefuncs.c              | 63 +++++++++++++++----
 contrib/pageinspect/expected/btree.out        |  3 +-
 contrib/pageinspect/pageinspect--1.6--1.7.sql | 22 +++++++
 3 files changed, 74 insertions(+), 14 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 8d27c9b0f6..64423283a6 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -29,6 +29,7 @@
 
 #include "pageinspect.h"
 
+#include "access/genam.h"
 #include "access/nbtree.h"
 #include "access/relation.h"
 #include "catalog/namespace.h"
@@ -243,6 +244,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
  */
 struct user_args
 {
+	Relation	rel;
 	Page		page;
 	OffsetNumber offset;
 };
@@ -254,9 +256,9 @@ struct user_args
  * ------------------------------------------------------
  */
 static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset, Relation rel)
 {
-	char	   *values[6];
+	char	   *values[7];
 	HeapTuple	tuple;
 	ItemId		id;
 	IndexTuple	itup;
@@ -265,6 +267,7 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	int			dlen;
 	char	   *dump;
 	char	   *ptr;
+	ItemPointer htid;
 
 	id = PageGetItemId(page, offset);
 
@@ -283,16 +286,49 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
-	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
-	dump = palloc0(dlen * 3 + 1);
-	values[j] = dump;
-	for (off = 0; off < dlen; off++)
+	if (rel)
 	{
-		if (off > 0)
-			*dump++ = ' ';
-		sprintf(dump, "%02x", *(ptr + off) & 0xff);
-		dump += 2;
+		TupleDesc	itupdesc = RelationGetDescr(rel);
+		Datum		datvalues[INDEX_MAX_KEYS];
+		bool		isnull[INDEX_MAX_KEYS];
+		int			natts;
+		int			indnkeyatts = rel->rd_index->indnkeyatts;
+
+		natts = BTreeTupleGetNAtts(itup, rel);
+
+		itupdesc->natts = Min(indnkeyatts, natts);
+		memset(&isnull, 0xFF, sizeof(isnull));
+		index_deform_tuple(itup, itupdesc, datvalues, isnull);
+		rel->rd_index->indnkeyatts = natts;
+		values[j++] = BuildIndexValueDescription(rel, datvalues, isnull);
+		itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+		rel->rd_index->indnkeyatts = indnkeyatts;
 	}
+	else
+	{
+		dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+		dump = palloc0(dlen * 3 + 1);
+		values[j++] = dump;
+		for (off = 0; off < dlen; off++)
+		{
+			if (off > 0)
+				*dump++ = ' ';
+			sprintf(dump, "%02x", *(ptr + off) & 0xff);
+			dump += 2;
+		}
+	}
+
+	if (!rel || !_bt_heapkeyspace(rel))
+		htid = NULL;
+	else
+		htid = BTreeTupleGetHeapTID(itup);
+
+	if (htid)
+		values[j] = psprintf("(%u,%u)",
+							 ItemPointerGetBlockNumberNoCheck(htid),
+							 ItemPointerGetOffsetNumberNoCheck(htid));
+	else
+		values[j] = NULL;
 
 	tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
 
@@ -366,11 +402,11 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = rel;
 		uargs->page = palloc(BLCKSZ);
 		memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
 
 		UnlockReleaseBuffer(buffer);
-		relation_close(rel, AccessShareLock);
 
 		uargs->offset = FirstOffsetNumber;
 
@@ -397,12 +433,13 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, uargs->rel);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
 	else
 	{
+		relation_close(uargs->rel, AccessShareLock);
 		pfree(uargs->page);
 		pfree(uargs);
 		SRF_RETURN_DONE(fctx);
@@ -482,7 +519,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, NULL);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..067e73f21a 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -40,7 +40,8 @@ ctid       | (0,1)
 itemlen    | 16
 nulls      | f
 vars       | f
-data       | 01 00 00 00 00 00 00 01
+data       | (a)=(72057594037927937)
+htid       | (0,1)
 
 SELECT * FROM bt_page_items('test1_a_idx', 2);
 ERROR:  block number out of range
diff --git a/contrib/pageinspect/pageinspect--1.6--1.7.sql b/contrib/pageinspect/pageinspect--1.6--1.7.sql
index 2433a21af2..9acbad1589 100644
--- a/contrib/pageinspect/pageinspect--1.6--1.7.sql
+++ b/contrib/pageinspect/pageinspect--1.6--1.7.sql
@@ -24,3 +24,25 @@ CREATE FUNCTION bt_metap(IN relname text,
     OUT last_cleanup_num_tuples real)
 AS 'MODULE_PATHNAME', 'bt_metap'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items()
+--
+DROP FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT htid tid)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
-- 
2.17.1

v5-0002-Experimental-support-for-unique-indexes.patchapplication/octet-stream; name=v5-0002-Experimental-support-for-unique-indexes.patchDownload

From 7fd5af8b4767516c905d7cbbdc942e8d4643d025 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 27 Jul 2019 16:34:31 -0700
Subject: [PATCH v5 2/3] Experimental support for unique indexes.

I have written a pretty sloppy implementation of unique index support
for posting list compression, just to give us an idea of how it could be
done.  This seems to be a loss for performance, so it's unlikely to go
much further than this.
---
 src/backend/access/gist/gist.c        |  3 +-
 src/backend/access/hash/hashinsert.c  |  4 +-
 src/backend/access/index/genam.c      | 26 +++++++++-
 src/backend/access/nbtree/nbtinsert.c | 71 +++++++++++++++++++++++----
 src/backend/access/nbtree/nbtpage.c   |  2 +-
 src/backend/access/nbtree/nbtsearch.c |  2 +-
 src/backend/access/nbtree/nbtsort.c   |  3 +-
 src/include/access/genam.h            |  3 +-
 8 files changed, 96 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index e9ca4b8252..cfdea23cec 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -1650,7 +1650,8 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
-												 deletable, ndeletable);
+												 deletable, ndeletable,
+												 false);
 
 	if (ndeletable > 0)
 	{
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 89876d2ccd..807e0ecd84 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -362,8 +362,8 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
 		TransactionId latestRemovedXid;
 
 		latestRemovedXid =
-			index_compute_xid_horizon_for_tuples(rel, hrel, buf,
-												 deletable, ndeletable);
+			index_compute_xid_horizon_for_tuples(rel, hrel, buf, deletable,
+												 ndeletable, false);
 
 		/*
 		 * Write-lock the meta page so that we can decrement tuple count.
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 2599b5d342..c075e6c7c7 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -273,6 +273,8 @@ BuildIndexValueDescription(Relation indexRelation,
 	return buf.data;
 }
 
+#include "access/nbtree.h"
+
 /*
  * Get the latestRemovedXid from the table entries pointed at by the index
  * tuples being deleted.
@@ -282,7 +284,8 @@ index_compute_xid_horizon_for_tuples(Relation irel,
 									 Relation hrel,
 									 Buffer ibuf,
 									 OffsetNumber *itemnos,
-									 int nitems)
+									 int nitems,
+									 bool btree)
 {
 	ItemPointerData *ttids =
 	(ItemPointerData *) palloc(sizeof(ItemPointerData) * nitems);
@@ -298,6 +301,27 @@ index_compute_xid_horizon_for_tuples(Relation irel,
 		iitemid = PageGetItemId(ipage, itemnos[i]);
 		itup = (IndexTuple) PageGetItem(ipage, iitemid);
 
+		if (btree)
+		{
+			/*
+			 * FIXME: This is a gross modularity violation.  Clearly B-Tree
+			 * ought to pass us heap TIDs, and not require that we figure it
+			 * out on its behalf.  Also, this is just wrong, since we're
+			 * assuming that the oldest xmin is available from the lowest heap
+			 * TID.
+			 *
+			 * I haven't bothered to fix this because unique index support is
+			 * just a PoC, and will probably stay that way.  Also, since
+			 * WAL-logging is currently very inefficient, it doesn't seem very
+			 * likely that anybody will get an overly-optimistic view of the
+			 * cost of WAL logging just because we were sloppy here.
+			 */
+			if (BTreeTupleIsPosting(itup))
+			{
+				ItemPointerCopy(BTreeTupleGetHeapTID(itup), &ttids[i]);
+				continue;
+			}
+		}
 		ItemPointerCopy(&itup->t_tid, &ttids[i]);
 	}
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index f0c1174e2a..4da28d9518 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -432,7 +432,6 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 			}
 
 			curitemid = PageGetItemId(page, offset);
-			Assert(!BTreeTupleIsPosting(curitup));
 
 			/*
 			 * We can skip items that are marked killed.
@@ -449,14 +448,34 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 			if (!ItemIdIsDead(curitemid))
 			{
 				ItemPointerData htid;
+				bool		posting;
 				bool		all_dead;
+				bool		posting_all_dead;
+				int			npost;
 
 				if (_bt_compare(rel, itup_key, page, offset) != 0)
 					break;		/* we're past all the equal tuples */
 
 				/* okay, we gotta fetch the heap tuple ... */
 				curitup = (IndexTuple) PageGetItem(page, curitemid);
-				htid = curitup->t_tid;
+
+				if (!BTreeTupleIsPosting(curitup))
+				{
+					htid = curitup->t_tid;
+					posting = false;
+					posting_all_dead = true;
+				}
+				else
+				{
+					posting = true;
+					/* Initial assumption */
+					posting_all_dead = true;
+				}
+
+				npost = 0;
+doposttup:
+				if (posting)
+					htid = *BTreeTupleGetPostingN(curitup, npost);
 
 				/*
 				 * If we are doing a recheck, we expect to find the tuple we
@@ -467,6 +486,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					ItemPointerCompare(&htid, &itup->t_tid) == 0)
 				{
 					found = true;
+					posting_all_dead = false;
+					if (posting)
+						goto nextpost;
 				}
 
 				/*
@@ -532,8 +554,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					 * not part of this chain because it had a different index
 					 * entry.
 					 */
-					htid = itup->t_tid;
-					if (table_index_fetch_tuple_check(heapRel, &htid,
+					if (table_index_fetch_tuple_check(heapRel, &itup->t_tid,
 													  SnapshotSelf, NULL))
 					{
 						/* Normal case --- it's still live */
@@ -591,7 +612,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 													RelationGetRelationName(rel))));
 					}
 				}
-				else if (all_dead)
+				else if (all_dead && !posting)
 				{
 					/*
 					 * The conflicting tuple (or whole HOT chain) is dead to
@@ -610,6 +631,35 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					else
 						MarkBufferDirtyHint(insertstate->buf, true);
 				}
+				else if (posting)
+				{
+nextpost:
+					if (!all_dead)
+						posting_all_dead = false;
+
+					/* Iterate over single posting list tuple */
+					npost++;
+					if (npost < BTreeTupleGetNPosting(curitup))
+						goto doposttup;
+
+					/*
+					 * Mark posting tuple dead if all hot chains whose root is
+					 * contained in posting tuple have tuples that are all
+					 * dead
+					 */
+					if (posting_all_dead)
+					{
+						ItemIdMarkDead(curitemid);
+						opaque->btpo_flags |= BTP_HAS_GARBAGE;
+
+						if (nbuf != InvalidBuffer)
+							MarkBufferDirtyHint(nbuf, true);
+						else
+							MarkBufferDirtyHint(insertstate->buf, true);
+					}
+
+					/* Move on to next index tuple */
+				}
 			}
 		}
 
@@ -784,7 +834,7 @@ _bt_findinsertloc(Relation rel,
 		/*
 		 * If the target page is full, try to compress the page
 		 */
-		if (PageGetFreeSpace(page) < insertstate->itemsz && !checkingunique)
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
 		{
 			_bt_compress_one_page(rel, insertstate->buf, heapRel);
 			insertstate->bounds_valid = false;		/* paranoia */
@@ -2595,12 +2645,13 @@ _bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	int			natts = IndexRelationGetNumberOfAttributes(rel);
 
 	/*
-	 * Don't use compression for indexes with INCLUDEd columns and unique
-	 * indexes.
+	 * Don't use compression for indexes with INCLUDEd columns.
+	 *
+	 * Unique indexes can benefit from ad-hoc compression, though we don't do
+	 * this during CREATE INDEX.
 	 */
 	use_compression = (IndexRelationGetNumberOfKeyAttributes(rel) ==
-					   IndexRelationGetNumberOfAttributes(rel) &&
-					   !rel->rd_index->indisunique);
+					   IndexRelationGetNumberOfAttributes(rel));
 	if (!use_compression)
 		return;
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 86c662d4e6..985418065b 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1121,7 +1121,7 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
 		latestRemovedXid =
 			index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
-												 itemnos, nitems);
+												 itemnos, nitems, true);
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 20975970d6..ffcfd21593 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -557,7 +557,7 @@ _bt_compare_posting(Relation rel,
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 	result = _bt_compare(rel, key, page, offnum);
 
-	if (BTreeTupleIsPosting(itup) && result == 0)
+	if (BTreeTupleIsPosting(itup) && result == 0 && key->scantid)
 	{
 		int			low,
 					high,
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index b058599aa4..846e60a452 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1253,7 +1253,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 
 	/*
 	 * Don't use compression for indexes with INCLUDEd columns and unique
-	 * indexes.
+	 * indexes.  Note that unique indexes are supported with retail
+	 * insertions.
 	 */
 	use_compression = (IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
 					   IndexRelationGetNumberOfAttributes(wstate->index) &&
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 8c053be2ca..f9866ce7f9 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -193,7 +193,8 @@ extern TransactionId index_compute_xid_horizon_for_tuples(Relation irel,
 														  Relation hrel,
 														  Buffer ibuf,
 														  OffsetNumber *itemnos,
-														  int nitems);
+														  int nitems,
+														  bool btree);
 
 /*
  * heap-or-index access to system catalogs (in genam.c)
-- 
2.17.1

#66

Anastasia Lubennikova

a.lubennikova@postgrespro.ru

over 6 years ago

In reply to: Peter Geoghegan (#65)

1 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

06.08.2019 4:28, Peter Geoghegan wrote:

Attached is v5, which is based on your v4. The three main differences
between this and v4 are:

* Removed BT_COMPRESS_THRESHOLD stuff, for the reasons explained in my
July 24 e-mail. We can always add something like this back during
performance validation of the patch. Right now, having no
BT_COMPRESS_THRESHOLD limit definitely improves space utilization for
certain important cases, which seems more important than the
uncertain/speculative downside.

Fair enough.
I think we can measure performance and make a decision, when patch will
stabilize.

* We now have experimental support for unique indexes. This is broken
out into its own patch.

* We now handle LP_DEAD items in a special way within
_bt_insertonpg_in_posting().

As you pointed out already, we do need to think about LP_DEAD items
directly, rather than assuming that they cannot be on the page that
_bt_insertonpg_in_posting() must process. More on that later.

If sizeof(t_info) + sizeof(key) < sizeof(t_tid), resulting posting tuple
can be
larger. It may happen if keysize <= 4 byte.
In this situation original tuples must have been aligned to size 16
bytes each,
and resulting tuple is at most 24 bytes (6+2+4+6+6). So this case is
also safe.

I still need to think about the exact details of alignment within
_bt_insertonpg_in_posting(). I'm worried about boundary cases there. I
could be wrong.

Could you explain more about these cases?
Now I don't understand the problem.

The main reason why I decided to avoid applying compression to unique
indexes
is the performance of microvacuum. It is not applied to items inside a
posting
tuple. And I expect it to be important for unique indexes, which ideally
contain only a few live values.

I found that the performance of my experimental patch with unique
index was significantly worse. It looks like this is a bad idea, as
you predicted, though we may still want to do
deduplication/compression with NULL values in unique indexes. I did
learn a few things from implementing unique index support, though.

BTW, there is a subtle bug in how my unique index patch does
WAL-logging -- see my comments within
index_compute_xid_horizon_for_tuples(). The bug shouldn't matter if
replication isn't used. I don't think that we're going to use this
experimental patch at all, so I didn't bother fixing the bug.

Thank you for the patch.
Still, I'd suggest to leave it as a possible future improvement, so that
it doesn't
distract us from the original feature.

if (ItemIdIsDead(itemId))
continue;

In the previous review Rafia asked about "some reason".
Trying to figure out if this situation possible, I changed this line to
Assert(!ItemIdIsDead(itemId)) in our test version. And it failed in a
performance
test. Unfortunately, I was not able to reproduce it.

I found it easy enough to see LP_DEAD items within
_bt_insertonpg_in_posting() when running pgbench with the extra unique
index patch. To give you a simple example of how this can happen,
consider the comments about BTP_HAS_GARBAGE within
_bt_delitems_vacuum(). That probably isn't the only way it can happen,
either. ISTM that we need to be prepared for LP_DEAD items during
deduplication, rather than trying to prevent deduplication from ever
having to see an LP_DEAD item.

I added to v6 another related fix for _bt_compress_one_page().
Previous code was implicitly deleted DEAD items without
calling index_compute_xid_horizon_for_tuples().
New code has a check whether DEAD items on the page exist and remove
them if any.
Another possible solution is to copy dead items as is from old page to
the new one,
but I think it's good to remove dead tuples as fast as possible.

v5 makes _bt_insertonpg_in_posting() prepared to overwrite an
existing item if it's an LP_DEAD item that falls in the same TID range
(that's _bt_compare()-wise "equal" to an existing tuple, which may or
may not be a posting list tuple already). I haven't made this code do
something like call index_compute_xid_horizon_for_tuples(), even
though that's needed for correctness (i.e. this new code is currently
broken in the same way that I mentioned unique index support is
broken).

Is it possible that DEAD tuple to delete was smaller than itup?

I also added a nearby FIXME comment to
_bt_insertonpg_in_posting() -- I don't think think that the code for
splitting a posting list in two is currently crash-safe.

Good catch. It seems, that I need to rearrange the code.
I'll send updated patch this week.

How do you feel about officially calling this deduplication, not
compression? I think that it's a more accurate name for the technique.

I agree.
Should I rename all related names of functions and variables in the patch?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

v6-0001-Compression-deduplication-in-nbtree.patchtext/x-patch; name=v6-0001-Compression-deduplication-in-nbtree.patchDownload

commit 9ac37503c71f7623413a2e406d81f5c9a4b02742
Author: Anastasia <a.lubennikova@postgrespro.ru>
Date:   Tue Aug 13 17:00:41 2019 +0300

    v6-0001-Compression-deduplication-in-nbtree.patch

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d67..504bca2 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -924,6 +924,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer	scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -994,29 +995,73 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
 			char	   *itid,
 					   *htid;
+			ItemPointer tid = BTreeTupleGetHeapTID(itup);
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+							ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("could not find tuple using search from root page in index \"%s\"",
 							RelationGetRelationName(state->rel)),
-					 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+					 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
 										itid, htid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is actually a posting list, make sure posting list TIDs
+		 * are in order.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer		current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+				current = BTreeTupleGetPostingN(itup, i);
+
+				if (ItemPointerCompare(current, &last) <= 0)
+				{
+					char	   *itid,
+							   *htid;
+
+					itid = psprintf("(%u,%u)", state->targetblock, offset);
+					htid = psprintf("(%u,%u)",
+									ItemPointerGetBlockNumberNoCheck(current),
+									ItemPointerGetOffsetNumberNoCheck(current));
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+												itid, htid,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1074,12 +1119,33 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				IndexTuple	onetup;
+
+				/* Fingerprint all elements of posting tuple one by one */
+				for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+					norm = bt_normalize_tuple(state, onetup);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != onetup)
+						pfree(norm);
+					pfree(onetup);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
@@ -1087,7 +1153,8 @@ bt_target_page_check(BtreeCheckState *state)
 		 *
 		 * If there is a high key (if this is not the rightmost page on its
 		 * entire level), check that high key actually is upper bound on all
-		 * page items.
+		 * page items.  If this is a posting list tuple, we'll need to set
+		 * scantid to be highest TID in posting list.
 		 *
 		 * We prefer to check all items against high key rather than checking
 		 * just the last and trusting that the operator class obeys the
@@ -1127,6 +1194,9 @@ bt_target_page_check(BtreeCheckState *state)
 		 * tuple. (See also: "Notes About Data Representation" in the nbtree
 		 * README.)
 		 */
+		scantid = skey->scantid;
+		if (!BTreeTupleIsPivot(itup))
+			skey->scantid = BTreeTupleGetMaxTID(itup);
 		if (!P_RIGHTMOST(topaque) &&
 			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
 			  invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1220,7 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
+		skey->scantid = scantid;
 
 		/*
 		 * * Item order check *
@@ -1164,11 +1235,13 @@ bt_target_page_check(BtreeCheckState *state)
 					   *htid,
 					   *nitid,
 					   *nhtid;
+			ItemPointer tid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetHeapTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 			nitid = psprintf("(%u,%u)", state->targetblock,
 							 OffsetNumberNext(offset));
 
@@ -1177,9 +1250,11 @@ bt_target_page_check(BtreeCheckState *state)
 										  state->target,
 										  OffsetNumberNext(offset));
 			itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+			tid = BTreeTupleGetHeapTID(itup);
 			nhtid = psprintf("(%u,%u)",
-							 ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							 ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							 ItemPointerGetBlockNumberNoCheck(tid),
+							 ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1264,10 @@ bt_target_page_check(BtreeCheckState *state)
 										"higher index tid=%s (points to %s tid=%s) "
 										"page lsn=%X/%X.",
 										itid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										htid,
 										nitid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										nhtid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
@@ -1953,10 +2028,11 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
  * verification.  In particular, it won't try to normalize opclass-equal
  * datums with potentially distinct representations (e.g., btree/numeric_ops
  * index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though.  For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have their own posting
+ * list, since dummy CREATE INDEX callback code generates new tuples with the
+ * same normalized representation.  Compression is performed
+ * opportunistically, and in general there is no guarantee about how or when
+ * compression will be applied.
  */
 static IndexTuple
 bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -2560,14 +2636,16 @@ static inline ItemPointer
 BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
 							bool nonpivot)
 {
-	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	ItemPointer result;
 	BlockNumber targetblock = state->targetblock;
 
-	if (result == NULL && nonpivot)
+	if (BTreeTupleIsPivot(itup) == nonpivot)
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
 						targetblock, RelationGetRelationName(state->rel))));
 
+	result = BTreeTupleGetHeapTID(itup);
+
 	return result;
 }
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 5890f39..e96f5ec 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -41,6 +41,17 @@ static OffsetNumber _bt_findinsertloc(Relation rel,
 									  BTStack stack,
 									  Relation heapRel);
 static void _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack);
+static void _bt_delete_and_insert(Relation rel,
+								  Buffer buf,
+								  IndexTuple newitup,
+								  OffsetNumber newitemoff);
+static void _bt_insertonpg_in_posting(Relation rel, BTScanInsert itup_key,
+									  Buffer buf,
+									  Buffer cbuf,
+									  BTStack stack,
+									  IndexTuple itup,
+									  OffsetNumber newitemoff,
+									  bool split_only_page, int in_posting_offset);
 static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   Buffer buf,
 						   Buffer cbuf,
@@ -56,6 +67,8 @@ static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 						 OffsetNumber itup_off);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void insert_itupprev_to_page(Page page, BTCompressState *compressState);
+static void _bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel);
 
 /*
  *	_bt_doinsert() -- Handle insertion of a single index tuple in the tree.
@@ -297,10 +310,17 @@ top:
 		 * search bounds established within _bt_check_unique when insertion is
 		 * checkingunique.
 		 */
+		insertstate.in_posting_offset = 0;
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
-		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
-					   itup, newitemoff, false);
+
+		if (insertstate.in_posting_offset)
+			_bt_insertonpg_in_posting(rel, itup_key, insertstate.buf,
+									  InvalidBuffer, stack, itup, newitemoff,
+									  false, insertstate.in_posting_offset);
+		else
+			_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer,
+						   stack, itup, newitemoff, false);
 	}
 	else
 	{
@@ -435,6 +455,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 
 				/* okay, we gotta fetch the heap tuple ... */
 				curitup = (IndexTuple) PageGetItem(page, curitemid);
+				Assert(!BTreeTupleIsPosting(curitup));
 				htid = curitup->t_tid;
 
 				/*
@@ -759,6 +780,26 @@ _bt_findinsertloc(Relation rel,
 			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
 			insertstate->bounds_valid = false;
 		}
+
+		/*
+		 * If the target page is full, try to compress the page
+		 */
+		if (PageGetFreeSpace(page) < insertstate->itemsz && !checkingunique)
+		{
+			_bt_compress_one_page(rel, insertstate->buf, heapRel);
+			insertstate->bounds_valid = false;		/* paranoia */
+
+			/*
+			 * FIXME: _bt_vacuum_one_page() won't have cleared the
+			 * BTP_HAS_GARBAGE flag when it didn't kill items.  Maybe we
+			 * should clear the BTP_HAS_GARBAGE flag bit from the page when
+			 * compression avoids a page split -- _bt_vacuum_one_page() is
+			 * expecting a page split that takes care of it.
+			 *
+			 * (On the other hand, maybe it doesn't matter very much.  A
+			 * comment update seems like the bare minimum we should do.)
+			 */
+		}
 	}
 	else
 	{
@@ -900,6 +941,208 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
 	insertstate->bounds_valid = false;
 }
 
+/*
+ * Delete tuple on newitemoff offset and insert newitup at the same offset.
+ * All checks of free space must have been done before calling this function.
+ *
+ * For use in posting tuple's update.
+ */
+static void
+_bt_delete_and_insert(Relation rel,
+					  Buffer buf,
+					  IndexTuple newitup,
+					  OffsetNumber newitemoff)
+{
+	Page		page = BufferGetPage(buf);
+	Size		newitupsz = IndexTupleSize(newitup);
+
+	newitupsz = MAXALIGN(newitupsz);
+
+	START_CRIT_SECTION();
+
+	PageIndexTupleDelete(page, newitemoff);
+
+	if (!_bt_pgaddtup(page, newitupsz, newitup, newitemoff))
+		elog(ERROR, "failed to insert compressed item in index \"%s\"",
+			 RelationGetRelationName(rel));
+
+	MarkBufferDirty(buf);
+
+	/* Xlog stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_btree_insert xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.offnum = newitemoff;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
+
+		Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+
+		/*
+		 * Force full page write to keep code simple
+		 *
+		 * TODO: think of using XLOG_BTREE_INSERT_LEAF with a new tuple's data
+		 */
+		XLogRegisterBuffer(0, buf, REGBUF_STANDARD | REGBUF_FORCE_IMAGE);
+		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_INSERT_LEAF);
+		PageSetLSN(page, recptr);
+	}
+
+	END_CRIT_SECTION();
+}
+
+/*
+ * _bt_insertonpg_in_posting() --
+ *		Insert a tuple on a particular page in the index
+ *		(compression aware version).
+ *
+ * If new tuple's key is equal to the key of a posting tuple that already
+ * exists on the page and it's TID falls inside the min/max range of
+ * existing posting list, update the posting tuple.
+ *
+ * It only can happen on leaf page.
+ *
+ * newitemoff - offset of the posting tuple we must update
+ * in_posting_offset - position of the new tuple's TID in posting list
+ *
+ * If necessary, split the page.
+ */
+static void
+_bt_insertonpg_in_posting(Relation rel,
+						  BTScanInsert itup_key,
+						  Buffer buf,
+						  Buffer cbuf,
+						  BTStack stack,
+						  IndexTuple itup,
+						  OffsetNumber newitemoff,
+						  bool split_only_page,
+						  int in_posting_offset)
+{
+	IndexTuple	origtup;
+	IndexTuple	lefttup;
+	IndexTuple	righttup;
+	ItemPointerData *ipd;
+	IndexTuple	newitup;
+	ItemId		itemid;
+	Page		page;
+	int			nipd,
+				nipd_right;
+
+	page = BufferGetPage(buf);
+	/* get old posting tuple */
+	itemid = PageGetItemId(page, newitemoff);
+	origtup = (IndexTuple) PageGetItem(page, itemid);
+	Assert(BTreeTupleIsPosting(origtup));
+	nipd = BTreeTupleGetNPosting(origtup);
+	Assert(in_posting_offset < nipd);
+	Assert(itup_key->scantid != NULL);
+	Assert(itup_key->heapkeyspace);
+
+	elog(DEBUG4, "(%u,%u) is min, (%u,%u) is max, (%u,%u) is new",
+		 ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(origtup)),
+		 ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(origtup)),
+		 ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(origtup)),
+		 ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(origtup)),
+		 ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(itup)),
+		 ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(itup)));
+
+	/*
+	 * Fist check if existing item is dead.
+	 *
+	 * Then check if the new itempointer fits into the tuple's posting list.
+	 *
+	 * Also check if new itempointer fits into the page.
+	 *
+	 * If not, posting tuple's split is required in both cases.
+	 *
+	 * XXX: Think some more about alignment - pg
+	 */
+	if (ItemIdIsDead(itemid))
+	{
+		/* FIXME: We need to call index_compute_xid_horizon_for_tuples() */
+		elog(DEBUG4, "replacing LP_DEAD posting list item, new off %d",
+			 newitemoff);
+		_bt_delete_and_insert(rel, buf, itup, newitemoff);
+		_bt_relbuf(rel, buf);
+	}
+	else if (BTMaxItemSize(page) < MAXALIGN(IndexTupleSize(origtup)) + MAXALIGN(sizeof(ItemPointerData)) ||
+			 PageGetFreeSpace(page) < MAXALIGN(IndexTupleSize(origtup)) + MAXALIGN(sizeof(ItemPointerData)))
+	{
+		/*
+		 * Split posting tuple into two halves.
+		 *
+		 * Left tuple contains all item pointes less than the new one and
+		 * right tuple contains new item pointer and all to the right.
+		 *
+		 * TODO Probably we can come up with more clever algorithm.
+		 */
+		lefttup = BTreeFormPostingTuple(origtup, BTreeTupleGetPosting(origtup),
+										in_posting_offset);
+
+		nipd_right = nipd - in_posting_offset + 1;
+		ipd = palloc0(sizeof(ItemPointerData) * nipd_right);
+		/* insert new item pointer */
+		memcpy(ipd, itup, sizeof(ItemPointerData));
+		/* copy item pointers from original tuple that belong on right */
+		memcpy(ipd + 1,
+			   BTreeTupleGetPostingN(origtup, in_posting_offset),
+			   sizeof(ItemPointerData) * (nipd - in_posting_offset));
+
+		righttup = BTreeFormPostingTuple(origtup, ipd, nipd_right);
+		elog(DEBUG4, "inserting inside posting list with split due to no space orig elements %d new off %d",
+			 nipd, in_posting_offset);
+
+		Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lefttup),
+								  BTreeTupleGetHeapTID(righttup)) < 0);
+
+		/*
+		 * Replace old tuple with a left tuple on a page.
+		 *
+		 * And insert righttuple using ordinary _bt_insertonpg() function If
+		 * split is required, _bt_insertonpg will handle it.
+		 *
+		 * FIXME: This doesn't seem very crash safe -- what if we fail after
+		 * _bt_delete_and_insert() but before _bt_insertonpg()?  We could
+		 * crash and then lose some of the logical tuples that used to be
+		 * contained within original posting list, but will now go into new
+		 * righttup posting list.
+		 */
+		_bt_delete_and_insert(rel, buf, lefttup, newitemoff);
+		_bt_insertonpg(rel, itup_key, buf, InvalidBuffer,
+					   stack, righttup, newitemoff + 1, false);
+
+		pfree(ipd);
+		pfree(lefttup);
+		pfree(righttup);
+	}
+	else
+	{
+		ipd = palloc0(sizeof(ItemPointerData) * (nipd + 1));
+		elog(DEBUG4, "inserting inside posting list due to apparent overlap");
+
+		/* copy item pointers from original tuple into ipd */
+		memcpy(ipd, BTreeTupleGetPosting(origtup),
+			   sizeof(ItemPointerData) * in_posting_offset);
+		/* add item pointer of the new tuple into ipd */
+		memcpy(ipd + in_posting_offset, itup, sizeof(ItemPointerData));
+		/* copy item pointers from old tuple into ipd */
+		memcpy(ipd + in_posting_offset + 1,
+			   BTreeTupleGetPostingN(origtup, in_posting_offset),
+			   sizeof(ItemPointerData) * (nipd - in_posting_offset));
+
+		newitup = BTreeFormPostingTuple(itup, ipd, nipd + 1);
+
+		_bt_delete_and_insert(rel, buf, newitup, newitemoff);
+
+		pfree(ipd);
+		pfree(newitup);
+		_bt_relbuf(rel, buf);
+	}
+}
+
 /*----------
  *	_bt_insertonpg() -- Insert a tuple on a particular page in the index.
  *
@@ -2290,3 +2533,206 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * the page.
 	 */
 }
+
+/*
+ * Add new item (compressed or not) to the page, while compressing it.
+ * If insertion failed, return false.
+ * Caller should consider this as compression failure and
+ * leave page uncompressed.
+ */
+static void
+insert_itupprev_to_page(Page page, BTCompressState *compressState)
+{
+	IndexTuple	to_insert;
+	OffsetNumber offnum = PageGetMaxOffsetNumber(page);
+
+	if (compressState->ntuples == 0)
+		to_insert = compressState->itupprev;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+											 compressState->ipd,
+											 compressState->ntuples);
+		to_insert = postingtuple;
+		pfree(compressState->ipd);
+	}
+
+	/* Add the new item into the page */
+	offnum = OffsetNumberNext(offnum);
+
+	elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d IndexTupleSize %zu free %zu",
+		 compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+
+	if (PageAddItem(page, (Item) to_insert, IndexTupleSize(to_insert),
+					offnum, false, false) == InvalidOffsetNumber)
+		elog(ERROR, "failed to add tuple to page while compresing it");
+
+	if (compressState->ntuples > 0)
+		pfree(to_insert);
+	compressState->ntuples = 0;
+}
+
+/*
+ * Before splitting the page, try to compress items to free some space.
+ * If compression didn't succeed, buffer will contain old state of the page.
+ * This function should be called after lp_dead items
+ * were removed by _bt_vacuum_one_page().
+ */
+static void
+_bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buffer);
+	Page		newpage;
+	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	bool		use_compression = false;
+	BTCompressState *compressState = NULL;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+	OffsetNumber deletable[MaxOffsetNumber];
+	int			ndeletable = 0;
+
+	/*
+	 * Don't use compression for indexes with INCLUDEd columns and unique
+	 * indexes.
+	 */
+	use_compression = (IndexRelationGetNumberOfKeyAttributes(rel) ==
+					   IndexRelationGetNumberOfAttributes(rel) &&
+					   !rel->rd_index->indisunique);
+	if (!use_compression)
+		return;
+
+	/* init compress state needed to build posting tuples */
+	compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+	compressState->ipd = NULL;
+	compressState->ntuples = 0;
+	compressState->itupprev = NULL;
+	compressState->maxitemsize = BTMaxItemSize(page);
+	compressState->maxpostingsize = 0;
+
+	minoff = P_FIRSTDATAKEY(opaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	
+	/*
+	 * Delete dead tuples if any.
+	 * We cannot simply skip them in the cycle below, because it's neccessary
+	 * to generate special Xlog record containing such tuples to compute
+	 * latestRemovedXid on a standby server later.
+	 *
+	 * This should not affect performance, since it only can happen in a rare
+	 * situation when BTP_HAS_GARBAGE flag was not set and  _bt_vacuum_one_page
+	 * was not called, or _bt_vacuum_one_page didn't remove all dead items.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId	itemid = PageGetItemId(page, P_HIKEY);
+
+		if (ItemIdIsDead(itemid))
+			deletable[ndeletable++] = offnum;
+	}
+
+	if (ndeletable > 0)
+		_bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+
+	/*
+	 * Scan over all items to see which ones can be compressed
+	 */
+	minoff = P_FIRSTDATAKEY(opaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+	newpage = PageGetTempPageCopySpecial(page);
+	elog(DEBUG4, "_bt_compress_one_page rel: %s,blkno: %u",
+		 RelationGetRelationName(rel), BufferGetBlockNumber(buffer));
+
+	/* Copy High Key if any */
+	if (!P_RIGHTMOST(opaque))
+	{
+		ItemId		itemid = PageGetItemId(page, P_HIKEY);
+		Size		itemsz = ItemIdGetLength(itemid);
+		IndexTuple	item = (IndexTuple) PageGetItem(page, itemid);
+
+		if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to add highkey during compression");
+	}
+
+	/*
+	 * Iterate over tuples on the page, try to compress them into posting
+	 * lists and insert into new page.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemId = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemId);
+
+		if (compressState->itupprev != NULL)
+		{
+			int			n_equal_atts =
+			_bt_keep_natts_fast(rel, compressState->itupprev, itup);
+			int			itup_ntuples = BTreeTupleIsPosting(itup) ?
+			BTreeTupleGetNPosting(itup) : 1;
+
+			if (n_equal_atts > natts)
+			{
+				/*
+				 * When tuples are equal, create or update posting.
+				 *
+				 * If posting is too big, insert it on page and continue.
+				 */
+				if (compressState->maxitemsize >
+					MAXALIGN(((IndexTupleSize(compressState->itupprev)
+							   + (compressState->ntuples + itup_ntuples + 1) * sizeof(ItemPointerData)))))
+				{
+					_bt_add_posting_item(compressState, itup);
+				}
+				else
+				{
+					insert_itupprev_to_page(newpage, compressState);
+				}
+			}
+			else
+			{
+				insert_itupprev_to_page(newpage, compressState);
+			}
+		}
+
+		/*
+		 * Copy the tuple into temp variable itupprev to compare it with the
+		 * following tuple and maybe unite them into a posting tuple
+		 */
+		if (compressState->itupprev)
+			pfree(compressState->itupprev);
+		compressState->itupprev = CopyIndexTuple(itup);
+
+		Assert(IndexTupleSize(compressState->itupprev) <= compressState->maxitemsize);
+	}
+
+	/* Handle the last item. */
+	insert_itupprev_to_page(newpage, compressState);
+
+	START_CRIT_SECTION();
+
+	PageRestoreTempPage(newpage, page);
+	MarkBufferDirty(buffer);
+
+	/* Log full page write */
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+
+		recptr = log_newpage_buffer(buffer, true);
+		PageSetLSN(page, recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	elog(DEBUG4, "_bt_compress_one_page. success");
+}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 9c1f7de..86c662d 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -983,14 +983,52 @@ _bt_page_recyclable(Page page)
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
 					OffsetNumber *itemnos, int nitems,
+					OffsetNumber *remainingoffset,
+					IndexTuple *remaining, int nremaining,
 					BlockNumber lastBlockVacuumed)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	Size		itemsz;
+	Size		remaining_sz = 0;
+	char	   *remaining_buf = NULL;
+
+	/* XLOG stuff, buffer for remainings */
+	if (nremaining && RelationNeedsWAL(rel))
+	{
+		Size		offset = 0;
+
+		for (int i = 0; i < nremaining; i++)
+			remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+		remaining_buf = palloc0(remaining_sz);
+		for (int i = 0; i < nremaining; i++)
+		{
+			itemsz = IndexTupleSize(remaining[i]);
+			memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+			offset += MAXALIGN(itemsz);
+		}
+		Assert(offset == remaining_sz);
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
+	/* Handle posting tuples here */
+	for (int i = 0; i < nremaining; i++)
+	{
+		/* At first, delete the old tuple. */
+		PageIndexTupleDelete(page, remainingoffset[i]);
+
+		itemsz = IndexTupleSize(remaining[i]);
+		itemsz = MAXALIGN(itemsz);
+
+		/* Add tuple with remaining ItemPointers to the page. */
+		if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to rewrite compressed item in index while doing vacuum");
+	}
+
 	/* Fix the page */
 	if (nitems > 0)
 		PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1058,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+		xlrec_vacuum.nremaining = nremaining;
+		xlrec_vacuum.ndeleted = nitems;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1073,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		if (nitems > 0)
 			XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
 
+		/*
+		 * Here we should save offnums and remaining tuples themselves. It's
+		 * important to restore them in correct order. At first, we must
+		 * handle remaining tuples and only after that other deleted items.
+		 */
+		if (nremaining > 0)
+		{
+			Assert(remaining_buf != NULL);
+			XLogRegisterBufData(0, (char *) remainingoffset,
+								nremaining * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, remaining_buf, remaining_sz);
+		}
+
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
 		PageSetLSN(page, recptr);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd528..22fb228 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static ItemPointer btreevacuumPosting(BTVacState *vstate, IndexTuple itup,
+									  int *nremaining);
 
 
 /*
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 								 RBM_NORMAL, info->strategy);
 		LockBufferForCleanup(buf);
 		_bt_checkpage(rel, buf);
-		_bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+		_bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+							vstate.lastBlockVacuumed);
 		_bt_relbuf(rel, buf);
 	}
 
@@ -1193,6 +1196,9 @@ restart:
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
+		IndexTuple	remaining[MaxOffsetNumber];
+		OffsetNumber remainingoffset[MaxOffsetNumber];
+		int			nremaining;
 
 		/*
 		 * Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1235,7 @@ restart:
 		 * callback function.
 		 */
 		ndeletable = 0;
+		nremaining = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
 		if (callback)
@@ -1242,31 +1249,78 @@ restart:
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
-				/*
-				 * During Hot Standby we currently assume that
-				 * XLOG_BTREE_VACUUM records do not produce conflicts. That is
-				 * only true as long as the callback function depends only
-				 * upon whether the index tuple refers to heap tuples removed
-				 * in the initial heap scan. When vacuum starts it derives a
-				 * value of OldestXmin. Backends taking later snapshots could
-				 * have a RecentGlobalXmin with a later xid than the vacuum's
-				 * OldestXmin, so it is possible that row versions deleted
-				 * after OldestXmin could be marked as killed by other
-				 * backends. The callback function *could* look at the index
-				 * tuple state in isolation and decide to delete the index
-				 * tuple, though currently it does not. If it ever did, we
-				 * would need to reconsider whether XLOG_BTREE_VACUUM records
-				 * should cause conflicts. If they did cause conflicts they
-				 * would be fairly harsh conflicts, since we haven't yet
-				 * worked out a way to pass a useful value for
-				 * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
-				 * applies to *any* type of index that marks index tuples as
-				 * killed.
-				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				if (BTreeTupleIsPosting(itup))
+				{
+					int			nnewipd = 0;
+					ItemPointer newipd = NULL;
+
+					newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+					if (nnewipd == 0)
+					{
+						/*
+						 * All TIDs from posting list must be deleted, we can
+						 * delete whole tuple in a regular way.
+						 */
+						deletable[ndeletable++] = offnum;
+					}
+					else if (nnewipd == BTreeTupleGetNPosting(itup))
+					{
+						/*
+						 * All TIDs from posting tuple must remain. Do
+						 * nothing, just cleanup.
+						 */
+						pfree(newipd);
+					}
+					else if (nnewipd < BTreeTupleGetNPosting(itup))
+					{
+						/* Some TIDs from posting tuple must remain. */
+						Assert(nnewipd > 0);
+						Assert(newipd != NULL);
+
+						/*
+						 * Form new tuple that contains only remaining TIDs.
+						 * Remember this tuple and the offset of the old tuple
+						 * to update it in place.
+						 */
+						remainingoffset[nremaining] = offnum;
+						remaining[nremaining] = BTreeFormPostingTuple(itup, newipd, nnewipd);
+						nremaining++;
+						pfree(newipd);
+
+						Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+					}
+				}
+				else
+				{
+					htup = &(itup->t_tid);
+
+					/*
+					 * During Hot Standby we currently assume that
+					 * XLOG_BTREE_VACUUM records do not produce conflicts.
+					 * That is only true as long as the callback function
+					 * depends only upon whether the index tuple refers to
+					 * heap tuples removed in the initial heap scan. When
+					 * vacuum starts it derives a value of OldestXmin.
+					 * Backends taking later snapshots could have a
+					 * RecentGlobalXmin with a later xid than the vacuum's
+					 * OldestXmin, so it is possible that row versions deleted
+					 * after OldestXmin could be marked as killed by other
+					 * backends. The callback function *could* look at the
+					 * index tuple state in isolation and decide to delete the
+					 * index tuple, though currently it does not. If it ever
+					 * did, we would need to reconsider whether
+					 * XLOG_BTREE_VACUUM records should cause conflicts. If
+					 * they did cause conflicts they would be fairly harsh
+					 * conflicts, since we haven't yet worked out a way to
+					 * pass a useful value for latestRemovedXid on the
+					 * XLOG_BTREE_VACUUM records. This applies to *any* type
+					 * of index that marks index tuples as killed.
+					 */
+					if (callback(htup, callback_state))
+						deletable[ndeletable++] = offnum;
+				}
 			}
 		}
 
@@ -1274,7 +1328,7 @@ restart:
 		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
 		 * call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nremaining > 0)
 		{
 			/*
 			 * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1345,7 @@ restart:
 			 * that.
 			 */
 			_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+								remainingoffset, remaining, nremaining,
 								vstate->lastBlockVacuumed);
 
 			/*
@@ -1376,6 +1431,41 @@ restart:
 }
 
 /*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+	int			remaining = 0;
+	int			nitem = BTreeTupleGetNPosting(itup);
+	ItemPointer tmpitems = NULL,
+				items = BTreeTupleGetPosting(itup);
+
+	/*
+	 * Check each tuple in the posting list, save alive tuples into tmpitems
+	 */
+	for (int i = 0; i < nitem; i++)
+	{
+		if (vstate->callback(items + i, vstate->callback_state))
+			continue;
+
+		if (tmpitems == NULL)
+			tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+		tmpitems[remaining++] = items[i];
+	}
+
+	*nremaining = remaining;
+	return tmpitems;
+}
+
+/*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
  * btrees always do, so this is trivial.
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 19735bf..2097597 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -30,6 +30,9 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+								OffsetNumber offnum, ItemPointer iptr,
+								IndexTuple itup, int i);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -504,7 +507,8 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, key, page, mid);
+		result = _bt_compare_posting(rel, key, page, mid,
+									 &(insertstate->in_posting_offset));
 
 		if (result >= cmpval)
 			low = mid + 1;
@@ -533,6 +537,55 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	return low;
 }
 
+/*
+ * Compare insertion-type scankey to tuple on a page,
+ * taking into account posting tuples.
+ * If the key of the posting tuple is equal to scankey,
+ * find exact position inside the posting list,
+ * using TID as extra attribute.
+ */
+int32
+_bt_compare_posting(Relation rel,
+					BTScanInsert key,
+					Page page,
+					OffsetNumber offnum,
+					int *in_posting_offset)
+{
+	IndexTuple	itup;
+	int			result;
+
+	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	result = _bt_compare(rel, key, page, offnum);
+
+	if (BTreeTupleIsPosting(itup) && result == 0)
+	{
+		int			low,
+					high,
+					mid,
+					res;
+
+		low = 0;
+		/* "high" is past end of posting list for loop invariant */
+		high = BTreeTupleGetNPosting(itup);
+
+		while (high > low)
+		{
+			mid = low + ((high - low) / 2);
+			res = ItemPointerCompare(key->scantid,
+									 BTreeTupleGetPostingN(itup, mid));
+
+			if (res >= 1)
+				low = mid + 1;
+			else
+				high = mid;
+		}
+
+		*in_posting_offset = high;
+	}
+
+	return result;
+}
+
 /*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
@@ -665,61 +718,120 @@ _bt_compare(Relation rel,
 	 * Use the heap TID attribute and scantid to try to break the tie.  The
 	 * rules are the same as any other key attribute -- only the
 	 * representation differs.
+	 *
+	 * When itup is a posting tuple, the check becomes more complex. It is
+	 * possible that the scankey belongs to the tuple's posting list TID
+	 * range.
+	 *
+	 * _bt_compare() is multipurpose, so it just returns 0 for a fact that key
+	 * matches tuple at this offset.
+	 *
+	 * Use special _bt_compare_posting() wrapper function to handle this case
+	 * and perform recheck for posting tuple, finding exact position of the
+	 * scankey.
 	 */
-	heapTid = BTreeTupleGetHeapTID(itup);
-	if (key->scantid == NULL)
+	if (!BTreeTupleIsPosting(itup))
 	{
+		heapTid = BTreeTupleGetHeapTID(itup);
+		if (key->scantid == NULL)
+		{
+			/*
+			 * Most searches have a scankey that is considered greater than a
+			 * truncated pivot tuple if and when the scankey has equal values
+			 * for attributes up to and including the least significant
+			 * untruncated attribute in tuple.
+			 *
+			 * For example, if an index has the minimum two attributes (single
+			 * user key attribute, plus heap TID attribute), and a page's high
+			 * key is ('foo', -inf), and scankey is ('foo', <omitted>), the
+			 * search will not descend to the page to the left.  The search
+			 * will descend right instead.  The truncated attribute in pivot
+			 * tuple means that all non-pivot tuples on the page to the left
+			 * are strictly < 'foo', so it isn't necessary to descend left. In
+			 * other words, search doesn't have to descend left because it
+			 * isn't interested in a match that has a heap TID value of -inf.
+			 *
+			 * However, some searches (pivotsearch searches) actually require
+			 * that we descend left when this happens.  -inf is treated as a
+			 * possible match for omitted scankey attribute(s).  This is
+			 * needed by page deletion, which must re-find leaf pages that are
+			 * targets for deletion using their high keys.
+			 *
+			 * Note: the heap TID part of the test ensures that scankey is
+			 * being compared to a pivot tuple with one or more truncated key
+			 * attributes.
+			 *
+			 * Note: pg_upgrade'd !heapkeyspace indexes must always descend to
+			 * the left here, since they have no heap TID attribute (and
+			 * cannot have any -inf key values in any case, since truncation
+			 * can only remove non-key attributes).  !heapkeyspace searches
+			 * must always be prepared to deal with matches on both sides of
+			 * the pivot once the leaf level is reached.
+			 */
+			if (key->heapkeyspace && !key->pivotsearch &&
+				key->keysz == ntupatts && heapTid == NULL)
+				return 1;
+
+			/* All provided scankey arguments found to be equal */
+			return 0;
+		}
+
 		/*
-		 * Most searches have a scankey that is considered greater than a
-		 * truncated pivot tuple if and when the scankey has equal values for
-		 * attributes up to and including the least significant untruncated
-		 * attribute in tuple.
-		 *
-		 * For example, if an index has the minimum two attributes (single
-		 * user key attribute, plus heap TID attribute), and a page's high key
-		 * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
-		 * will not descend to the page to the left.  The search will descend
-		 * right instead.  The truncated attribute in pivot tuple means that
-		 * all non-pivot tuples on the page to the left are strictly < 'foo',
-		 * so it isn't necessary to descend left.  In other words, search
-		 * doesn't have to descend left because it isn't interested in a match
-		 * that has a heap TID value of -inf.
-		 *
-		 * However, some searches (pivotsearch searches) actually require that
-		 * we descend left when this happens.  -inf is treated as a possible
-		 * match for omitted scankey attribute(s).  This is needed by page
-		 * deletion, which must re-find leaf pages that are targets for
-		 * deletion using their high keys.
-		 *
-		 * Note: the heap TID part of the test ensures that scankey is being
-		 * compared to a pivot tuple with one or more truncated key
-		 * attributes.
-		 *
-		 * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
-		 * left here, since they have no heap TID attribute (and cannot have
-		 * any -inf key values in any case, since truncation can only remove
-		 * non-key attributes).  !heapkeyspace searches must always be
-		 * prepared to deal with matches on both sides of the pivot once the
-		 * leaf level is reached.
+		 * Treat truncated heap TID as minus infinity, since scankey has a key
+		 * attribute value (scantid) that would otherwise be compared directly
 		 */
-		if (key->heapkeyspace && !key->pivotsearch &&
-			key->keysz == ntupatts && heapTid == NULL)
+		Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+		if (heapTid == NULL)
 			return 1;
 
-		/* All provided scankey arguments found to be equal */
-		return 0;
+		Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+		return ItemPointerCompare(key->scantid, heapTid);
 	}
+	else
+	{
+		heapTid = BTreeTupleGetHeapTID(itup);
+		if (key->scantid != NULL && heapTid != NULL)
+		{
+			int			cmp = ItemPointerCompare(key->scantid, heapTid);
 
-	/*
-	 * Treat truncated heap TID as minus infinity, since scankey has a key
-	 * attribute value (scantid) that would otherwise be compared directly
-	 */
-	Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
-	if (heapTid == NULL)
-		return 1;
+			if (cmp == -1 || cmp == 0)
+			{
+				elog(DEBUG4, "offnum %d Scankey (%u,%u) is less than or equal to posting tuple (%u,%u)",
+					 offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+					 ItemPointerGetOffsetNumberNoCheck(key->scantid),
+					 ItemPointerGetBlockNumberNoCheck(heapTid),
+					 ItemPointerGetOffsetNumberNoCheck(heapTid));
+				return cmp;
+			}
 
-	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	return ItemPointerCompare(key->scantid, heapTid);
+			heapTid = BTreeTupleGetMaxTID(itup);
+			cmp = ItemPointerCompare(key->scantid, heapTid);
+			if (cmp == 1)
+			{
+				elog(DEBUG4, "offnum %d Scankey (%u,%u) is greater than posting tuple (%u,%u)",
+					 offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+					 ItemPointerGetOffsetNumberNoCheck(key->scantid),
+					 ItemPointerGetBlockNumberNoCheck(heapTid),
+					 ItemPointerGetOffsetNumberNoCheck(heapTid));
+				return cmp;
+			}
+
+			/*
+			 * if we got here, scantid is inbetween of posting items of the
+			 * tuple
+			 */
+			elog(DEBUG4, "offnum %d Scankey (%u,%u) is between posting items (%u,%u) and (%u,%u)",
+				 offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+				 ItemPointerGetOffsetNumberNoCheck(key->scantid),
+				 ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(itup)),
+				 ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(itup)),
+				 ItemPointerGetBlockNumberNoCheck(heapTid),
+				 ItemPointerGetOffsetNumberNoCheck(heapTid));
+			return 0;
+		}
+	}
+
+	return 0;
 }
 
 /*
@@ -1456,6 +1568,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	/* initialize tuple workspace to empty */
 	so->currPos.nextTupleOffset = 0;
+	so->currPos.prevTupleOffset = 0;
 
 	/*
 	 * Now that the current page has been made consistent, the macro should be
@@ -1490,8 +1603,22 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				if (!BTreeTupleIsPosting(itup))
+				{
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+				else
+				{
+					/* Return posting list "logical" tuples */
+					for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											itup, i);
+						itemIndex++;
+					}
+				}
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1524,7 +1651,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1532,7 +1659,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxPostingIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1574,8 +1701,23 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (passes_quals && tuple_alive)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				if (!BTreeTupleIsPosting(itup))
+				{
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+				else
+				{
+					/* Return posting list "logical" tuples */
+					/* XXX: Maybe this loop should be backwards? */
+					for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						itemIndex--;
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											itup, i);
+					}
+				}
 			}
 			if (!continuescan)
 			{
@@ -1589,8 +1731,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1603,6 +1745,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1615,6 +1759,33 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 	}
 }
 
+/* Save an index item into so->currPos.items[itemIndex] for posting tuples. */
+static void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer iptr, IndexTuple itup, int i)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *iptr;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		if (i == 0)
+		{
+			/* save key. the same for all tuples in the posting */
+			Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+			currItem->tupleOffset = so->currPos.nextTupleOffset;
+			memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+			so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+			so->currPos.prevTupleOffset = currItem->tupleOffset;
+		}
+		else
+			currItem->tupleOffset = so->currPos.prevTupleOffset;
+	}
+}
+
 /*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index b30cf9e..b058599 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -288,6 +288,8 @@ static void _bt_sortaddtup(Page page, Size itemsize,
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 						 IndexTuple itup);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void _bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+								 BTCompressState *compressState);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
 static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -972,6 +974,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * only shift the line pointer array back and forth, and overwrite
 			 * the tuple space previously occupied by oitup.  This is fairly
 			 * cheap.
+			 *
+			 * If lastleft tuple was a posting tuple, we'll truncate its
+			 * posting list in _bt_truncate as well. Note that it is also
+			 * applicable only to leaf pages, since internal pages never
+			 * contain posting tuples.
 			 */
 			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
 			lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1011,6 +1018,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		 * the minimum key for the new page.
 		 */
 		state->btps_minkey = CopyIndexTuple(oitup);
+		Assert(BTreeTupleIsPivot(state->btps_minkey));
 
 		/*
 		 * Set the sibling links for both pages.
@@ -1052,6 +1060,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(state->btps_minkey == NULL);
 		state->btps_minkey = CopyIndexTuple(itup);
 		/* _bt_sortaddtup() will perform full truncation later */
+		BTreeTupleClearBtIsPosting(state->btps_minkey);
 		BTreeTupleSetNAtts(state->btps_minkey, 0);
 	}
 
@@ -1137,6 +1146,91 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 }
 
 /*
+ * Add new tuple (posting or non-posting) to the page while building index.
+ */
+static void
+_bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+					 BTCompressState *compressState)
+{
+	IndexTuple	to_insert;
+
+	/* Return, if there is no tuple to insert */
+	if (state == NULL)
+		return;
+
+	if (compressState->ntuples == 0)
+		to_insert = compressState->itupprev;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+											 compressState->ipd,
+											 compressState->ntuples);
+		to_insert = postingtuple;
+		pfree(compressState->ipd);
+	}
+
+	_bt_buildadd(wstate, state, to_insert);
+
+	if (compressState->ntuples > 0)
+		pfree(to_insert);
+	compressState->ntuples = 0;
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in compressState.
+ *
+ * Helper function for _bt_load() and _bt_compress_one_page().
+ *
+ * Note: caller is responsible for size check to ensure that resulting tuple
+ * won't exceed BTMaxItemSize.
+ */
+void
+_bt_add_posting_item(BTCompressState *compressState, IndexTuple itup)
+{
+	int			nposting = 0;
+
+	if (compressState->ntuples == 0)
+	{
+		compressState->ipd = palloc0(compressState->maxitemsize);
+
+		if (BTreeTupleIsPosting(compressState->itupprev))
+		{
+			/* if itupprev is posting, add all its TIDs to the posting list */
+			nposting = BTreeTupleGetNPosting(compressState->itupprev);
+			memcpy(compressState->ipd,
+				   BTreeTupleGetPosting(compressState->itupprev),
+				   sizeof(ItemPointerData) * nposting);
+			compressState->ntuples += nposting;
+		}
+		else
+		{
+			memcpy(compressState->ipd, compressState->itupprev,
+				   sizeof(ItemPointerData));
+			compressState->ntuples++;
+		}
+	}
+
+	if (BTreeTupleIsPosting(itup))
+	{
+		/* if tuple is posting, add all its TIDs to the posting list */
+		nposting = BTreeTupleGetNPosting(itup);
+		memcpy(compressState->ipd + compressState->ntuples,
+			   BTreeTupleGetPosting(itup),
+			   sizeof(ItemPointerData) * nposting);
+		compressState->ntuples += nposting;
+	}
+	else
+	{
+		memcpy(compressState->ipd + compressState->ntuples, itup,
+			   sizeof(ItemPointerData));
+		compressState->ntuples++;
+	}
+}
+
+/*
  * Read tuples in correct sort order from tuplesort, and load them into
  * btree leaves.
  */
@@ -1150,9 +1244,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	bool		load1;
 	TupleDesc	tupdes = RelationGetDescr(wstate->index);
 	int			i,
-				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+				natts = IndexRelationGetNumberOfAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		use_compression = false;
+	BTCompressState *compressState = NULL;
+
+	/*
+	 * Don't use compression for indexes with INCLUDEd columns and unique
+	 * indexes.
+	 */
+	use_compression = (IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+					   IndexRelationGetNumberOfAttributes(wstate->index) &&
+					   !wstate->index->rd_index->indisunique);
 
 	if (merge)
 	{
@@ -1266,19 +1371,89 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	}
 	else
 	{
-		/* merge is unnecessary */
-		while ((itup = tuplesort_getindextuple(btspool->sortstate,
-											   true)) != NULL)
+		if (!use_compression)
 		{
-			/* When we see first tuple, create first index page */
-			if (state == NULL)
-				state = _bt_pagestate(wstate, 0);
+			/* merge is unnecessary */
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+					state = _bt_pagestate(wstate, 0);
 
-			_bt_buildadd(wstate, state, itup);
+				_bt_buildadd(wstate, state, itup);
 
-			/* Report progress */
-			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
-										 ++tuples_done);
+				/* Report progress */
+				pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+											 ++tuples_done);
+			}
+		}
+		else
+		{
+			/* init compress state needed to build posting tuples */
+			compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+			compressState->ipd = NULL;
+			compressState->ntuples = 0;
+			compressState->itupprev = NULL;
+			compressState->maxitemsize = 0;
+			compressState->maxpostingsize = 0;
+
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+				{
+					state = _bt_pagestate(wstate, 0);
+					compressState->maxitemsize = BTMaxItemSize(state->btps_page);
+				}
+
+				if (compressState->itupprev != NULL)
+				{
+					int			n_equal_atts = _bt_keep_natts_fast(wstate->index,
+																   compressState->itupprev, itup);
+
+					if (n_equal_atts > natts)
+					{
+						/*
+						 * Tuples are equal. Create or update posting.
+						 *
+						 * Else If posting is too big, insert it on page and
+						 * continue.
+						 */
+						if ((compressState->ntuples + 1) * sizeof(ItemPointerData) <
+							compressState->maxpostingsize)
+							_bt_add_posting_item(compressState, itup);
+						else
+							_bt_buildadd_posting(wstate, state,
+												 compressState);
+					}
+					else
+					{
+						/*
+						 * Tuples are not equal. Insert itupprev into index.
+						 * Save current tuple for the next iteration.
+						 */
+						_bt_buildadd_posting(wstate, state, compressState);
+					}
+				}
+
+				/*
+				 * Save the tuple to compare it with the next one and maybe
+				 * unite them into a posting tuple.
+				 */
+				if (compressState->itupprev)
+					pfree(compressState->itupprev);
+				compressState->itupprev = CopyIndexTuple(itup);
+
+				/* compute max size of posting list */
+				compressState->maxpostingsize = compressState->maxitemsize -
+					IndexInfoFindDataOffset(compressState->itupprev->t_info) -
+					MAXALIGN(IndexTupleSize(compressState->itupprev));
+			}
+
+			/* Handle the last item */
+			_bt_buildadd_posting(wstate, state, compressState);
 		}
 	}
 
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index a7882fd..77e1d46 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -459,6 +459,7 @@ _bt_recsplitloc(FindSplitData *state,
 	int16		leftfree,
 				rightfree;
 	Size		firstrightitemsz;
+	Size		postingsubhikey = 0;
 	bool		newitemisfirstonright;
 
 	/* Is the new item going to be the first item on the right page? */
@@ -466,10 +467,33 @@ _bt_recsplitloc(FindSplitData *state,
 							 && !newitemonleft);
 
 	if (newitemisfirstonright)
+	{
 		firstrightitemsz = state->newitemsz;
+
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+			postingsubhikey = IndexTupleSize(state->newitem) -
+							  BTreeTupleGetPostingOffset(state->newitem);
+	}
 	else
+	{
 		firstrightitemsz = firstoldonrightsz;
 
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf)
+		{
+			ItemId	 itemid;
+			IndexTuple newhighkey;
+
+			itemid = PageGetItemId(state->page, firstoldonright);
+			newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+			if (BTreeTupleIsPosting(newhighkey))
+				postingsubhikey = IndexTupleSize(newhighkey) -
+								  BTreeTupleGetPostingOffset(newhighkey);
+		}
+	}
+
 	/* Account for all the old tuples */
 	leftfree = state->leftspace - olddataitemstoleft;
 	rightfree = state->rightspace -
@@ -492,9 +516,13 @@ _bt_recsplitloc(FindSplitData *state,
 	 * adding a heap TID to the left half's new high key when splitting at the
 	 * leaf level.  In practice the new high key will often be smaller and
 	 * will rarely be larger, but conservatively assume the worst case.
+	 * Truncation always truncates away any posting list that appears in the
+	 * first right tuple, though, so it's safe to subtract that overhead
+	 * (while still conservatively assuming that truncation might have to add
+	 * back a single heap TID using the pivot tuple heap TID representation).
 	 */
 	if (state->is_leaf)
-		leftfree -= (int16) (firstrightitemsz +
+		leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
 							 MAXALIGN(sizeof(ItemPointerData)));
 	else
 		leftfree -= (int16) firstrightitemsz;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 9b172c1..9552acb 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -111,8 +111,12 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	key->nextkey = false;
 	key->pivotsearch = false;
 	key->keysz = Min(indnkeyatts, tupnatts);
-	key->scantid = key->heapkeyspace && itup ?
-		BTreeTupleGetHeapTID(itup) : NULL;
+
+	if (itup && key->heapkeyspace)
+		key->scantid = BTreeTupleGetHeapTID(itup);
+	else
+		key->scantid = NULL;
+
 	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
@@ -1787,7 +1791,9 @@ _bt_killitems(IndexScanDesc scan)
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			/* No microvacuum for posting tuples */
+			if (!BTreeTupleIsPosting(ituple) &&
+				(ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
 			{
 				/* found the item */
 				ItemIdMarkDead(iid);
@@ -2145,6 +2151,16 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(firstright))
+		{
+			BTreeTupleClearBtIsPosting(pivot);
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			pivot->t_info &= ~INDEX_SIZE_MASK;
+			pivot->t_info |= BTreeTupleGetPostingOffset(firstright);
+		}
+
+		Assert(!BTreeTupleIsPosting(pivot));
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2161,6 +2177,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * attribute to the new pivot tuple.
 		 */
 		Assert(natts != nkeyatts);
+		Assert(!BTreeTupleIsPosting(lastleft));
+		Assert(!BTreeTupleIsPosting(firstright));
 		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
 		tidpivot = palloc0(newsize);
 		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2168,6 +2186,27 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		pfree(pivot);
 		pivot = tidpivot;
 	}
+	else if (BTreeTupleIsPosting(firstright))
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal. But
+		 * the tuple is a compressed tuple with a posting list, so we still
+		 * must truncate it.
+		 *
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		newsize = BTreeTupleGetPostingOffset(firstright) +
+			MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+		pivot->t_info &= ~INDEX_SIZE_MASK;
+		pivot->t_info |= newsize;
+		BTreeTupleClearBtIsPosting(pivot);
+		BTreeTupleSetAltHeapTID(pivot);
+
+		Assert(!BTreeTupleIsPosting(pivot));
+	}
 	else
 	{
 		/*
@@ -2205,7 +2244,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2216,9 +2255,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+							  BTreeTupleGetHeapTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #else
 
 	/*
@@ -2231,7 +2273,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2240,7 +2282,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2330,6 +2373,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * leaving excessive amounts of free space on either side of page split.
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts.
+ *
+ * To build a posting tuple we need to ensure that all attributes
+ * of both tuples are equal. Use this function to compare them.
+ * TODO: maybe it's worth to rename the function.
+ *
+ * XXX: Obviously we need infrastructure for making sure it is okay to use
+ * this for posting list stuff.  For example, non-deterministic collations
+ * cannot use compression, and will not work with what we have now.
+ *
+ * XXX: Even then, we probably also need to worry about TOAST as a special
+ * case.  Don't repeat bugs like the amcheck bug that was fixed in commit
+ * eba775345d23d2c999bbb412ae658b6dab36e3e8.  As the test case added in that
+ * commit shows, we need to worry about pg_attribute.attstorage changing in
+ * the underlying table due to an ALTER TABLE (and maybe a few other things
+ * like that).  In general, the "TOAST input state" of a TOASTable datum isn't
+ * something that we make many guarantees about today, so even with C
+ * collation text we could in theory get different answers from
+ * _bt_keep_natts_fast() and _bt_keep_natts().  This needs to be nailed down
+ * in some way.
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2415,7 +2477,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * Non-pivot tuples currently never use alternative heap TID
 			 * representation -- even those within heapkeyspace indexes
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
@@ -2470,7 +2532,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * that to decide if the tuple is a pre-v11 tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
+				(!BTreeTupleIsPivot(itup) &&
 				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
 		}
 		else
@@ -2497,7 +2559,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
 		return false;
 
 	/*
@@ -2549,6 +2611,8 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 	if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
 		return;
 
+	/* TODO correct error messages for posting tuples */
+
 	/*
 	 * Internal page insertions cannot fail here, because that would mean that
 	 * an earlier leaf level insertion that should have failed didn't
@@ -2575,3 +2639,79 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Given a basic tuple that contains key datum and posting list,
+ * build a posting tuple.
+ *
+ * Basic tuple can be a posting tuple, but we only use key part of it,
+ * all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1 fallback to building a non-posting tuple.
+ * It is necessary to avoid storage overhead after posting tuple was vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+	uint32		keysize,
+				newsize = 0;
+	IndexTuple	itup;
+
+	/* We only need key part of the tuple */
+	if (BTreeTupleIsPosting(tuple))
+		keysize = BTreeTupleGetPostingOffset(tuple);
+	else
+		keysize = IndexTupleSize(tuple);
+
+	Assert(nipd > 0);
+
+	/* Add space needed for posting list */
+	if (nipd > 1)
+		newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+	else
+		newsize = keysize;
+
+	newsize = MAXALIGN(newsize);
+	itup = palloc0(newsize);
+	memcpy(itup, tuple, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+
+	if (nipd > 1)
+	{
+		/* Form posting tuple, fill posting fields */
+
+		/* Set meta info about the posting list */
+		itup->t_info |= INDEX_ALT_TID_MASK;
+		BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+		/* sort the list to preserve TID order invariant */
+		qsort((void *) ipd, nipd, sizeof(ItemPointerData),
+			  (int (*) (const void *, const void *)) ItemPointerCompare);
+
+		/* Copy posting list into the posting tuple */
+		memcpy(BTreeTupleGetPosting(itup), ipd,
+			   sizeof(ItemPointerData) * nipd);
+	}
+	else
+	{
+		/* To finish building of a non-posting tuple, copy TID from ipd */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		ItemPointerCopy(ipd, &itup->t_tid);
+	}
+
+	return itup;
+}
+
+/*
+ * Opposite of BTreeFormPostingTuple.
+ * returns regular tuple that contains the key,
+ * the tid of the new tuple is the nth tid of original tuple's posting list
+ * result tuple palloc'd in a caller's context.
+ */
+IndexTuple
+BTreeGetNthTupleOfPosting(IndexTuple tuple, int n)
+{
+	Assert(BTreeTupleIsPosting(tuple));
+	return BTreeFormPostingTuple(tuple, BTreeTupleGetPostingN(tuple, n), 1);
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c..538a6bc 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -386,8 +386,8 @@ btree_xlog_vacuum(XLogReaderState *record)
 	Buffer		buffer;
 	Page		page;
 	BTPageOpaque opaque;
-#ifdef UNUSED
 	xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
 
 	/*
 	 * This section of code is thought to be no longer needed, after analysis
@@ -478,14 +478,34 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		if (len > 0)
 		{
-			OffsetNumber *unused;
-			OffsetNumber *unend;
+			if (xlrec->nremaining)
+			{
+				OffsetNumber *remainingoffset;
+				IndexTuple	remaining;
+				Size		itemsz;
+
+				remainingoffset = (OffsetNumber *)
+					(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+				remaining = (IndexTuple) ((char *) remainingoffset +
+										  xlrec->nremaining * sizeof(OffsetNumber));
 
-			unused = (OffsetNumber *) ptr;
-			unend = (OffsetNumber *) ((char *) ptr + len);
+				/* Handle posting tuples */
+				for (int i = 0; i < xlrec->nremaining; i++)
+				{
+					PageIndexTupleDelete(page, remainingoffset[i]);
+
+					itemsz = MAXALIGN(IndexTupleSize(remaining));
+
+					if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+									false, false) == InvalidOffsetNumber)
+						elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+					remaining = (IndexTuple) ((char *) remaining + itemsz);
+				}
+			}
 
-			if ((unend - unused) > 0)
-				PageIndexMultiDelete(page, unused, unend - unused);
+			if (xlrec->ndeleted)
+				PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 		}
 
 		/*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index a14eb79..e4fa99a 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -46,8 +46,10 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "lastBlockVacuumed %u",
-								 xlrec->lastBlockVacuumed);
+				appendStringInfo(buf, "lastBlockVacuumed %u; nremaining %u; ndeleted %u",
+								 xlrec->lastBlockVacuumed,
+								 xlrec->nremaining,
+								 xlrec->ndeleted);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 744ffb6..b10c0d5 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -141,6 +141,10 @@ typedef IndexAttributeBitMapData * IndexAttributeBitMap;
  * On such a page, N tuples could take one MAXALIGN quantum less space than
  * estimated here, seemingly allowing one more tuple than estimated here.
  * But such a page always has at least MAXALIGN special space, so we're safe.
+ *
+ * Note: btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so they may contain more tuples.
+ * Use MaxPostingIndexTuplesPerPage instead.
  */
 #define MaxIndexTuplesPerPage	\
 	((int) ((BLCKSZ - SizeOfPageHeaderData) / \
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 83e0e6c..bacc77b 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
  * All other types of index tuples ("pivot" tuples) only have key columns,
  * since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,39 @@ typedef struct BTMetaPageData
  * omitted rather than truncated, since its representation is different to
  * the non-pivot representation.)
  *
+ * Non-pivot posting tuple format:
+ *  t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicated keys more effectively,
+ * we use special format of tuples - posting tuples.
+ * posting_list is an array of ItemPointerData.
+ *
+ * This type of compression never applies to system indexes, unique indexes
+ * or indexes with INCLUDEd columns.
+ *
+ * To differ posting tuples we use INDEX_ALT_TID_MASK flag in t_info and
+ * BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's tid:
+ * - t_tid.ip_blkid contains offset of the posting list.
+ * - t_tid offset field contains number of posting items this tuple contain
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * tuples, which is constrainted by BTMaxItemSize.
+
+ * If page contains so many duplicates, that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), page may contain several posting
+ * tuples with the same key.
+ * Also page can contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
  * Pivot tuple format:
  *
  *  t_tid | t_info | key values | [heap TID]
@@ -281,23 +313,144 @@ typedef struct BTMetaPageData
  * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
  * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
  * number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinct posting tuples from pivot tuples.
  *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_N_POSTING_OFFSET_MASK	0x0FFF
 #define BT_HEAP_TID_ATTR			0x1000
+#define BT_IS_POSTING				0x2000
+
+#define BTreeTupleIsPosting(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+	)
+
+#define BTreeTupleIsPivot(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+	)
 
-/* Get/set downlink block number */
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger then
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+	((int) ((BLCKSZ - SizeOfPageHeaderData - \
+			3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+			(sizeof(ItemPointerData)))
+
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * Iterating over tuples during index build or applying compression to a
+ * single page, we remember a tuple in itupprev, then compare the next one
+ * with it. If tuples are equal, save their TIDs in the posting list.
+ * ntuples contains the size of the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTCompressState
+{
+	Size		maxitemsize;
+	Size		maxpostingsize;
+	IndexTuple	itupprev;
+	int			ntuples;
+	ItemPointerData *ipd;
+} BTCompressState;
+
+/* macros to work with posting tuples *BEGIN* */
+#define BTreeTupleSetBtIsPosting(itup) \
+	do { \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleGetNPosting(itup)	\
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+	)
+
+#define BTreeTupleSetNPosting(itup, n) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+		BTreeTupleSetBtIsPosting(itup); \
+	} while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list.
+ * Caller is responsible for checking BTreeTupleIsPosting to ensure that it
+ * will get what is expected.
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+	)
+#define BTreeTupleSetPostingOffset(itup, offset) \
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerSetBlockNumber(&((itup)->t_tid), (offset)) \
+	)
+#define BTreeSetPostingMeta(itup, nposting, off) \
+	do { \
+		BTreeTupleSetNPosting(itup, nposting); \
+		BTreeTupleSetPostingOffset(itup, off); \
+	} while(0)
+
+#define BTreeTupleGetPosting(itup) \
+	(ItemPointerData*) ((char*)(itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+	(ItemPointerData*) (BTreeTupleGetPosting(itup) + (n))
+
+/*
+ * Posting tuples always contain more than one TID.  The minimum TID can be
+ * accessed using BTreeTupleGetHeapTID().  The maximum is accessed using
+ * BTreeTupleGetMaxTID().
+ */
+#define BTreeTupleGetMaxTID(itup) \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+		( \
+			(ItemPointer) (BTreeTupleGetPosting(itup) + (BTreeTupleGetNPosting(itup)-1)) \
+		) \
+		: \
+		(ItemPointer) &((itup)->t_tid) \
+	)
+/* macros to work with posting tuples *END* */
+
+/* Get/set downlink block number  */
 #define BTreeInnerTupleGetDownLink(itup) \
 	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
 #define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,7 +479,8 @@ typedef struct BTMetaPageData
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
 		( \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
@@ -335,6 +489,7 @@ typedef struct BTMetaPageData
 	)
 #define BTreeTupleSetNAtts(itup, n) \
 	do { \
+		Assert(!BTreeTupleIsPosting(itup)); \
 		(itup)->t_info |= INDEX_ALT_TID_MASK; \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
 	} while(0)
@@ -342,6 +497,8 @@ typedef struct BTMetaPageData
 /*
  * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
  * and non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * For non-pivot posting tuples this returns the first tid from posting list.
  */
 #define BTreeTupleGetHeapTID(itup) \
 	( \
@@ -351,7 +508,10 @@ typedef struct BTMetaPageData
 		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
 					   sizeof(ItemPointerData)) \
 	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+	  : (itup)->t_info & INDEX_ALT_TID_MASK ? \
+		(((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0) ? \
+		(ItemPointer) BTreeTupleGetPosting(itup) : NULL) \
+		: (ItemPointer) &((itup)->t_tid) \
 	)
 /*
  * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
@@ -360,6 +520,7 @@ typedef struct BTMetaPageData
 #define BTreeTupleSetAltHeapTID(itup) \
 	do { \
 		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
 								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
 	} while(0)
@@ -501,6 +662,12 @@ typedef struct BTInsertStateData
 	Buffer		buf;
 
 	/*
+	 * if _bt_binsrch_insert() found the location inside existing posting
+	 * list, save the position inside the list.
+	 */
+	int			in_posting_offset;
+
+	/*
 	 * Cache of bounds within the current buffer.  Only used for insertions
 	 * where _bt_check_unique is called.  See _bt_binsrch_insert and
 	 * _bt_findinsertloc for details.
@@ -567,6 +734,8 @@ typedef struct BTScanPosData
 	 * location in the associated tuple storage workspace.
 	 */
 	int			nextTupleOffset;
+	/* prevTupleOffset is for posting list handling */
+	int			prevTupleOffset;
 
 	/*
 	 * The items array is always ordered in index order (ie, increasing
@@ -579,7 +748,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxPostingIndexTuplesPerPage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -763,6 +932,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems, Relation heapRel);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems,
+								OffsetNumber *remainingoffset,
+								IndexTuple *remaining, int nremaining,
 								BlockNumber lastBlockVacuumed);
 extern int	_bt_pagedel(Relation rel, Buffer buf);
 
@@ -775,6 +946,8 @@ extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
 							bool forupdate, BTStack stack, int access, Snapshot snapshot);
 extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
 extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
+extern int32 _bt_compare_posting(Relation rel, BTScanInsert key, Page page,
+								 OffsetNumber offnum, int *in_posting_offset);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -813,6 +986,9 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd,
+										int nipd);
+extern IndexTuple BTreeGetNthTupleOfPosting(IndexTuple tuple, int n);
 
 /*
  * prototypes for functions in nbtvalidate.c
@@ -825,5 +1001,7 @@ extern bool btvalidate(Oid opclassoid);
 extern IndexBuildResult *btbuild(Relation heap, Relation index,
 								 struct IndexInfo *indexInfo);
 extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+extern void _bt_add_posting_item(BTCompressState *compressState,
+								 IndexTuple itup);
 
 #endif							/* NBTREE_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index afa614d..4b615e0 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -173,10 +173,19 @@ typedef struct xl_btree_vacuum
 {
 	BlockNumber lastBlockVacuumed;
 
-	/* TARGET OFFSET NUMBERS FOLLOW */
+	/*
+	 * This field helps us to find beginning of the remaining tuples from
+	 * postings which follow array of offset numbers.
+	 */
+	uint32		nremaining;
+	uint32		ndeleted;
+
+	/* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+	/* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
+	/* TARGET OFFSET NUMBERS FOLLOW (if any) */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(BlockNumber))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.

#67

Anastasia Lubennikova

a.lubennikova@postgrespro.ru

over 6 years ago

In reply to: Anastasia Lubennikova (#66)

1 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

13.08.2019 18:45, Anastasia Lubennikova wrote:

I also added a nearby FIXME comment to
_bt_insertonpg_in_posting() -- I don't think think that the code for
splitting a posting list in two is currently crash-safe.

Good catch. It seems, that I need to rearrange the code.
I'll send updated patch this week.

Attached is v7.

In this version of the patch, I heavily refactored the code of insertion
into
posting tuple. bt_split logic is quite complex, so I omitted a couple of
optimizations. They are mentioned in TODO comments.

Now the algorithm is the following:

- If bt_findinsertloc() found out that tuple belongs to existing posting
tuple's
TID interval, it sets 'in_posting_offset' variable and passes it to
_bt_insertonpg()

- If 'in_posting_offset' is valid and origtup is valid,
merge our itup into origtup.

It can result in one tuple neworigtup, that must replace origtup; or two
tuples:
neworigtup and newrighttup, if the result exceeds BTMaxItemSize,

- If two new tuple(s) fit into the old page, we're lucky.
call _bt_delete_and_insert(..., neworigtup, newrighttup, newitemoff) to
atomically replace oldtup with new tuple(s) and generate xlog record.

- In case page split is needed, pass both tuples to _bt_split().
_bt_findsplitloc() is now aware of upcoming replacement of origtup with
neworigtup, so it uses correct item size where needed.

It seems that now all replace operations are crash-safe. The new patch
passes
all regression tests, so I think it's ready for review again.

In the meantime, I'll run more stress-tests.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

v7-0001-Compression-deduplication-in-nbtree.patchtext/x-patch; name=v7-0001-Compression-deduplication-in-nbtree.patchDownload

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d67..504bca2 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -924,6 +924,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer	scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -994,29 +995,73 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
 			char	   *itid,
 					   *htid;
+			ItemPointer tid = BTreeTupleGetHeapTID(itup);
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+							ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("could not find tuple using search from root page in index \"%s\"",
 							RelationGetRelationName(state->rel)),
-					 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+					 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
 										itid, htid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is actually a posting list, make sure posting list TIDs
+		 * are in order.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer		current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+				current = BTreeTupleGetPostingN(itup, i);
+
+				if (ItemPointerCompare(current, &last) <= 0)
+				{
+					char	   *itid,
+							   *htid;
+
+					itid = psprintf("(%u,%u)", state->targetblock, offset);
+					htid = psprintf("(%u,%u)",
+									ItemPointerGetBlockNumberNoCheck(current),
+									ItemPointerGetOffsetNumberNoCheck(current));
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+												itid, htid,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1074,12 +1119,33 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				IndexTuple	onetup;
+
+				/* Fingerprint all elements of posting tuple one by one */
+				for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+					norm = bt_normalize_tuple(state, onetup);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != onetup)
+						pfree(norm);
+					pfree(onetup);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
@@ -1087,7 +1153,8 @@ bt_target_page_check(BtreeCheckState *state)
 		 *
 		 * If there is a high key (if this is not the rightmost page on its
 		 * entire level), check that high key actually is upper bound on all
-		 * page items.
+		 * page items.  If this is a posting list tuple, we'll need to set
+		 * scantid to be highest TID in posting list.
 		 *
 		 * We prefer to check all items against high key rather than checking
 		 * just the last and trusting that the operator class obeys the
@@ -1127,6 +1194,9 @@ bt_target_page_check(BtreeCheckState *state)
 		 * tuple. (See also: "Notes About Data Representation" in the nbtree
 		 * README.)
 		 */
+		scantid = skey->scantid;
+		if (!BTreeTupleIsPivot(itup))
+			skey->scantid = BTreeTupleGetMaxTID(itup);
 		if (!P_RIGHTMOST(topaque) &&
 			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
 			  invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1220,7 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
+		skey->scantid = scantid;
 
 		/*
 		 * * Item order check *
@@ -1164,11 +1235,13 @@ bt_target_page_check(BtreeCheckState *state)
 					   *htid,
 					   *nitid,
 					   *nhtid;
+			ItemPointer tid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetHeapTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 			nitid = psprintf("(%u,%u)", state->targetblock,
 							 OffsetNumberNext(offset));
 
@@ -1177,9 +1250,11 @@ bt_target_page_check(BtreeCheckState *state)
 										  state->target,
 										  OffsetNumberNext(offset));
 			itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+			tid = BTreeTupleGetHeapTID(itup);
 			nhtid = psprintf("(%u,%u)",
-							 ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							 ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							 ItemPointerGetBlockNumberNoCheck(tid),
+							 ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1264,10 @@ bt_target_page_check(BtreeCheckState *state)
 										"higher index tid=%s (points to %s tid=%s) "
 										"page lsn=%X/%X.",
 										itid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										htid,
 										nitid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										nhtid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
@@ -1953,10 +2028,11 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
  * verification.  In particular, it won't try to normalize opclass-equal
  * datums with potentially distinct representations (e.g., btree/numeric_ops
  * index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though.  For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have their own posting
+ * list, since dummy CREATE INDEX callback code generates new tuples with the
+ * same normalized representation.  Compression is performed
+ * opportunistically, and in general there is no guarantee about how or when
+ * compression will be applied.
  */
 static IndexTuple
 bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -2560,14 +2636,16 @@ static inline ItemPointer
 BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
 							bool nonpivot)
 {
-	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	ItemPointer result;
 	BlockNumber targetblock = state->targetblock;
 
-	if (result == NULL && nonpivot)
+	if (BTreeTupleIsPivot(itup) == nonpivot)
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
 						targetblock, RelationGetRelationName(state->rel))));
 
+	result = BTreeTupleGetHeapTID(itup);
+
 	return result;
 }
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 5890f39..fed1e86 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -41,21 +41,28 @@ static OffsetNumber _bt_findinsertloc(Relation rel,
 									  BTStack stack,
 									  Relation heapRel);
 static void _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack);
+static void _bt_delete_and_insert(Relation rel,
+								  Buffer buf,
+								  Page page,
+								  IndexTuple newitup, IndexTuple newitupright,
+								  OffsetNumber newitemoff);
 static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   Buffer buf,
 						   Buffer cbuf,
 						   BTStack stack,
 						   IndexTuple itup,
 						   OffsetNumber newitemoff,
-						   bool split_only_page);
+						   bool split_only_page, int in_posting_offset);
 static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 						Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
-						IndexTuple newitem);
+						IndexTuple newitem, IndexTuple lefttup, IndexTuple righttup);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 							  BTStack stack, bool is_root, bool is_only);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 						 OffsetNumber itup_off);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void insert_itupprev_to_page(Page page, BTCompressState *compressState);
+static void _bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel);
 
 /*
  *	_bt_doinsert() -- Handle insertion of a single index tuple in the tree.
@@ -297,10 +304,13 @@ top:
 		 * search bounds established within _bt_check_unique when insertion is
 		 * checkingunique.
 		 */
+		insertstate.in_posting_offset = 0;
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
-		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
-					   itup, newitemoff, false);
+
+		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer,
+					   stack, itup, newitemoff, false,
+					   insertstate.in_posting_offset);
 	}
 	else
 	{
@@ -435,6 +445,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 
 				/* okay, we gotta fetch the heap tuple ... */
 				curitup = (IndexTuple) PageGetItem(page, curitemid);
+				Assert(!BTreeTupleIsPosting(curitup));
 				htid = curitup->t_tid;
 
 				/*
@@ -759,6 +770,26 @@ _bt_findinsertloc(Relation rel,
 			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
 			insertstate->bounds_valid = false;
 		}
+
+		/*
+		 * If the target page is full, try to compress the page
+		 */
+		if (PageGetFreeSpace(page) < insertstate->itemsz && !checkingunique)
+		{
+			_bt_compress_one_page(rel, insertstate->buf, heapRel);
+			insertstate->bounds_valid = false;		/* paranoia */
+
+			/*
+			 * FIXME: _bt_vacuum_one_page() won't have cleared the
+			 * BTP_HAS_GARBAGE flag when it didn't kill items.  Maybe we
+			 * should clear the BTP_HAS_GARBAGE flag bit from the page when
+			 * compression avoids a page split -- _bt_vacuum_one_page() is
+			 * expecting a page split that takes care of it.
+			 *
+			 * (On the other hand, maybe it doesn't matter very much.  A
+			 * comment update seems like the bare minimum we should do.)
+			 */
+		}
 	}
 	else
 	{
@@ -900,6 +931,77 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
 	insertstate->bounds_valid = false;
 }
 
+/*
+ * Delete tuple on newitemoff offset and insert newitup at the same offset.
+ *
+ * If original posting tuple was split, 'newitup' represents left part of
+ * original tuple and 'newitupright' is it's right part, that must be inserted
+ * next to newitemoff.
+ * It's essential to do this atomic to be crash safe.
+ *
+ * NOTE All checks of free space must be done before calling this function.
+ *
+ * For use in posting tuple's update.
+ */
+static void
+_bt_delete_and_insert(Relation rel,
+					  Buffer buf,
+					  Page page,
+					  IndexTuple newitup, IndexTuple newitupright,
+					  OffsetNumber newitemoff)
+{
+	Size		newitupsz = IndexTupleSize(newitup);
+
+	newitupsz = MAXALIGN(newitupsz);
+
+	elog(DEBUG4, "_bt_delete_and_insert %s newitemoff %d",
+				  RelationGetRelationName(rel), newitemoff);
+	START_CRIT_SECTION();
+
+	PageIndexTupleDelete(page, newitemoff);
+
+	if (!_bt_pgaddtup(page, newitupsz, newitup, newitemoff))
+		elog(ERROR, "failed to insert compressed item in index \"%s\"",
+			 RelationGetRelationName(rel));
+
+	if (newitupright)
+	{
+		if (!_bt_pgaddtup(page, MAXALIGN(IndexTupleSize(newitupright)),
+						  newitupright, OffsetNumberNext(newitemoff)))
+			elog(ERROR, "failed to insert compressed item in index \"%s\"",
+				 RelationGetRelationName(rel));
+	}
+
+	if (BufferIsValid(buf))
+	{
+		MarkBufferDirty(buf);
+
+		/* Xlog stuff */
+		if (RelationNeedsWAL(rel))
+		{
+			xl_btree_insert xlrec;
+			XLogRecPtr	recptr;
+
+			xlrec.offnum = newitemoff;
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
+
+			Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+
+			/*
+			* Force full page write to keep code simple
+			*
+			* TODO: think of using XLOG_BTREE_INSERT_LEAF with a new tuple's data
+			*/
+			XLogRegisterBuffer(0, buf, REGBUF_STANDARD | REGBUF_FORCE_IMAGE);
+			recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_INSERT_LEAF);
+			PageSetLSN(page, recptr);
+		}
+	}
+	END_CRIT_SECTION();
+}
+
 /*----------
  *	_bt_insertonpg() -- Insert a tuple on a particular page in the index.
  *
@@ -936,11 +1038,17 @@ _bt_insertonpg(Relation rel,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
-			   bool split_only_page)
+			   bool split_only_page,
+			   int in_posting_offset)
 {
 	Page		page;
 	BTPageOpaque lpageop;
 	Size		itemsz;
+	IndexTuple	origtup;
+	int			nipd;
+	IndexTuple	neworigtup = NULL;
+	IndexTuple	newrighttup = NULL;
+	bool		need_split = false;
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -965,13 +1073,184 @@ _bt_insertonpg(Relation rel,
 								 * need to be consistent */
 
 	/*
+	 * If new tuple's key is equal to the key of a posting tuple that already
+	 * exists on the page and it's TID falls inside the min/max range of
+	 * existing posting list, update the posting tuple.
+	 *
+	 * TODO Think of moving this to a separate function.
+	 *
+	 * TODO possible optimization:
+	 *		if original posting tuple is dead,
+	 *		reset in_posting_offset and handle itup as a regular tuple
+	 */
+	if (in_posting_offset)
+	{
+		/* get old posting tuple */
+		ItemId 			itemid = PageGetItemId(page, newitemoff);
+		ItemPointerData *ipd;
+		int				nipd, nipd_right;
+		bool			need_posting_split = false;
+
+		origtup = (IndexTuple) PageGetItem(page, itemid);
+		Assert(BTreeTupleIsPosting(origtup));
+		nipd = BTreeTupleGetNPosting(origtup);
+		Assert(in_posting_offset < nipd);
+		Assert(itup_key->scantid != NULL);
+		Assert(itup_key->heapkeyspace);
+
+		elog(DEBUG4, "(%u,%u) is min, (%u,%u) is max, (%u,%u) is new",
+			ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(origtup)),
+			ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(origtup)),
+			ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(origtup)),
+			ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(origtup)),
+			ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(itup)),
+			ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(itup)));
+
+		/* check if posting tuple must be splitted */
+		if (BTMaxItemSize(page) < MAXALIGN(IndexTupleSize(origtup)) + sizeof(ItemPointerData))
+			need_posting_split = true;
+
+		/*
+		 * If page split is needed, always split posting tuple.
+		 * Probably that is not the most optimal,
+		 * but it allows to simplify _bt_split code.
+		 *
+		 * TODO Does this decision have any significant drawbacks?
+		 */
+		if (PageGetFreeSpace(page) < sizeof(ItemPointerData))
+			need_posting_split = true;
+
+		/*
+		 * Handle corner cases (1)
+		 *		- itup TID is smaller than leftmost orightup TID
+		 */
+		if (ItemPointerCompare(BTreeTupleGetHeapTID(itup),
+								BTreeTupleGetHeapTID(origtup)) < 0)
+		{
+			if (need_posting_split)
+			{
+				/*
+				 * cannot avoid split, so no need in trying to fit itup into posting list.
+				 * handle itup insertion as regular tuple insertion
+				 */
+				elog(DEBUG4, "split posting tuple. itup is to the left of origtup");
+				in_posting_offset = InvalidOffsetNumber;
+				newitemoff = OffsetNumberPrev(newitemoff);
+			}
+			else
+			{
+				ipd = palloc0(nipd + 1);
+				/* insert new item pointer */
+				memcpy(ipd, itup, sizeof(ItemPointerData));
+				/* copy item pointers from original tuple that belong on right */
+				memcpy(ipd + 1, BTreeTupleGetPosting(origtup), sizeof(ItemPointerData) * nipd);
+				neworigtup = BTreeFormPostingTuple(origtup, ipd, nipd+1);
+				pfree(ipd);
+
+				Assert(ItemPointerCompare(BTreeTupleGetHeapTID(neworigtup),
+										  BTreeTupleGetMaxTID(neworigtup)) < 0);
+			}
+		}
+
+		/*
+		 * Handle corner cases (2)
+		 *		- itup TID is larger than rightmost orightup TID
+		 */
+		if (ItemPointerCompare(BTreeTupleGetMaxTID(origtup),
+							   BTreeTupleGetHeapTID(itup)) < 0)
+		{
+			if (need_posting_split)
+			{
+				/*
+				 * cannot avoid split, so no need in trying to fit itup into posting list.
+				 * handle itup insertion as regular tuple insertion
+				 */
+				elog(DEBUG4, "split posting tuple. itup is to the right of origtup");
+				in_posting_offset = InvalidOffsetNumber;
+			}
+			else
+			{
+				ipd = palloc0(nipd + 1);
+				/* insert new item pointer */
+				/* copy item pointers from original tuple that belong on right */
+				memcpy(ipd, BTreeTupleGetPosting(origtup), sizeof(ItemPointerData) * nipd);
+				memcpy(ipd+nipd, itup, sizeof(ItemPointerData));
+
+				neworigtup = BTreeFormPostingTuple(origtup, ipd, nipd+1);
+				pfree(ipd);
+
+				Assert(ItemPointerCompare(BTreeTupleGetHeapTID(neworigtup),
+										  BTreeTupleGetMaxTID(neworigtup)) < 0);
+			}
+		}
+
+		/*
+		 * itup TID belongs to TID range of origtup posting list
+		 *
+		 * Split posting tuple into two halves.
+		 *
+		 * neworigtup (left) tuple contains all item pointers less than the new one and
+		 * newrighttup tuple contains new item pointer and all to the right.
+		 */
+		if (ItemPointerCompare(BTreeTupleGetHeapTID(itup),
+							   BTreeTupleGetHeapTID(origtup)) > 0
+			&&
+			ItemPointerCompare(BTreeTupleGetMaxTID(origtup),
+							   BTreeTupleGetHeapTID(itup)) > 0)
+		{
+			neworigtup = BTreeFormPostingTuple(origtup, BTreeTupleGetPosting(origtup),
+											in_posting_offset);
+
+			nipd_right = nipd - in_posting_offset + 1;
+
+			elog(DEBUG4, "split posting tuple in_posting_offset %d nipd %d nipd_right %d",
+						 in_posting_offset, nipd, nipd_right);
+
+			ipd = palloc0(sizeof(ItemPointerData) * nipd_right);
+			/* insert new item pointer */
+			memcpy(ipd, itup, sizeof(ItemPointerData));
+			/* copy item pointers from original tuple that belong on right */
+			memcpy(ipd + 1,
+				BTreeTupleGetPostingN(origtup, in_posting_offset),
+				sizeof(ItemPointerData) * (nipd - in_posting_offset));
+
+			newrighttup = BTreeFormPostingTuple(origtup, ipd, nipd_right);
+
+			Assert(ItemPointerCompare(BTreeTupleGetMaxTID(neworigtup),
+									BTreeTupleGetHeapTID(newrighttup)) < 0);
+			pfree(ipd);
+
+			elog(DEBUG4, "left N %d (%u,%u) to (%u,%u), right N %d (%u,%u) to (%u,%u) ",
+				BTreeTupleIsPosting(neworigtup)?BTreeTupleGetNPosting(neworigtup):0,
+				ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(neworigtup)),
+				ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(neworigtup)),
+				ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(neworigtup)),
+				ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(neworigtup)),
+				BTreeTupleIsPosting(newrighttup)?BTreeTupleGetNPosting(newrighttup):0,
+				ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(newrighttup)),
+				ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(newrighttup)),
+				ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(newrighttup)),
+				ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(newrighttup)));
+
+			/*
+				* check if splitted tuple still fit into original page
+				* TODO should we add sizeof(ItemIdData) in this check?
+				*/
+			if (PageGetFreeSpace(page) < (MAXALIGN(IndexTupleSize(neworigtup))
+											+ MAXALIGN(IndexTupleSize(newrighttup))
+											- MAXALIGN(IndexTupleSize(origtup))))
+				need_split = true;
+		}
+	}
+
+	/*
 	 * Do we need to split the page to fit the item on it?
 	 *
 	 * Note: PageGetFreeSpace() subtracts sizeof(ItemIdData) from its result,
 	 * so this comparison is correct even though we appear to be accounting
 	 * only for the item and not for its line pointer.
 	 */
-	if (PageGetFreeSpace(page) < itemsz)
+	if (PageGetFreeSpace(page) < itemsz || need_split)
 	{
 		bool		is_root = P_ISROOT(lpageop);
 		bool		is_only = P_LEFTMOST(lpageop) && P_RIGHTMOST(lpageop);
@@ -996,7 +1275,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+						 neworigtup, newrighttup);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1033,142 +1313,159 @@ _bt_insertonpg(Relation rel,
 		itup_off = newitemoff;
 		itup_blkno = BufferGetBlockNumber(buf);
 
-		/*
-		 * If we are doing this insert because we split a page that was the
-		 * only one on its tree level, but was not the root, it may have been
-		 * the "fast root".  We need to ensure that the fast root link points
-		 * at or above the current page.  We can safely acquire a lock on the
-		 * metapage here --- see comments for _bt_newroot().
-		 */
-		if (split_only_page)
+		if (!in_posting_offset)
 		{
-			Assert(!P_ISLEAF(lpageop));
-
-			metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
-			metapg = BufferGetPage(metabuf);
-			metad = BTPageGetMeta(metapg);
-
-			if (metad->btm_fastlevel >= lpageop->btpo.level)
+			/*
+			* If we are doing this insert because we split a page that was the
+			* only one on its tree level, but was not the root, it may have been
+			* the "fast root".  We need to ensure that the fast root link points
+			* at or above the current page.  We can safely acquire a lock on the
+			* metapage here --- see comments for _bt_newroot().
+			*/
+			if (split_only_page)
 			{
-				/* no update wanted */
-				_bt_relbuf(rel, metabuf);
-				metabuf = InvalidBuffer;
-			}
-		}
-
-		/*
-		 * Every internal page should have exactly one negative infinity item
-		 * at all times.  Only _bt_split() and _bt_newroot() should add items
-		 * that become negative infinity items through truncation, since
-		 * they're the only routines that allocate new internal pages.  Do not
-		 * allow a retail insertion of a new item at the negative infinity
-		 * offset.
-		 */
-		if (!P_ISLEAF(lpageop) && newitemoff == P_FIRSTDATAKEY(lpageop))
-			elog(ERROR, "cannot insert second negative infinity item in block %u of index \"%s\"",
-				 itup_blkno, RelationGetRelationName(rel));
+				Assert(!P_ISLEAF(lpageop));
 
-		/* Do the update.  No ereport(ERROR) until changes are logged */
-		START_CRIT_SECTION();
+				metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+				metapg = BufferGetPage(metabuf);
+				metad = BTPageGetMeta(metapg);
 
-		if (!_bt_pgaddtup(page, itemsz, itup, newitemoff))
-			elog(PANIC, "failed to add new item to block %u in index \"%s\"",
-				 itup_blkno, RelationGetRelationName(rel));
+				if (metad->btm_fastlevel >= lpageop->btpo.level)
+				{
+					/* no update wanted */
+					_bt_relbuf(rel, metabuf);
+					metabuf = InvalidBuffer;
+				}
+			}
 
-		MarkBufferDirty(buf);
+			/*
+			* Every internal page should have exactly one negative infinity item
+			* at all times.  Only _bt_split() and _bt_newroot() should add items
+			* that become negative infinity items through truncation, since
+			* they're the only routines that allocate new internal pages.  Do not
+			* allow a retail insertion of a new item at the negative infinity
+			* offset.
+			*/
+			if (!P_ISLEAF(lpageop) && newitemoff == P_FIRSTDATAKEY(lpageop))
+				elog(ERROR, "cannot insert second negative infinity item in block %u of index \"%s\"",
+					itup_blkno, RelationGetRelationName(rel));
+
+			/* Do the update.  No ereport(ERROR) until changes are logged */
+			START_CRIT_SECTION();
+
+			if (!_bt_pgaddtup(page, itemsz, itup, newitemoff))
+				elog(PANIC, "failed to add new item to block %u in index \"%s\"",
+					itup_blkno, RelationGetRelationName(rel));
+
+			MarkBufferDirty(buf);
 
-		if (BufferIsValid(metabuf))
-		{
-			/* upgrade meta-page if needed */
-			if (metad->btm_version < BTREE_NOVAC_VERSION)
-				_bt_upgrademetapage(metapg);
-			metad->btm_fastroot = itup_blkno;
-			metad->btm_fastlevel = lpageop->btpo.level;
-			MarkBufferDirty(metabuf);
-		}
+			if (BufferIsValid(metabuf))
+			{
+				/* upgrade meta-page if needed */
+				if (metad->btm_version < BTREE_NOVAC_VERSION)
+					_bt_upgrademetapage(metapg);
+				metad->btm_fastroot = itup_blkno;
+				metad->btm_fastlevel = lpageop->btpo.level;
+				MarkBufferDirty(metabuf);
+			}
 
-		/* clear INCOMPLETE_SPLIT flag on child if inserting a downlink */
-		if (BufferIsValid(cbuf))
-		{
-			Page		cpage = BufferGetPage(cbuf);
-			BTPageOpaque cpageop = (BTPageOpaque) PageGetSpecialPointer(cpage);
+			/* clear INCOMPLETE_SPLIT flag on child if inserting a downlink */
+			if (BufferIsValid(cbuf))
+			{
+				Page		cpage = BufferGetPage(cbuf);
+				BTPageOpaque cpageop = (BTPageOpaque) PageGetSpecialPointer(cpage);
 
-			Assert(P_INCOMPLETE_SPLIT(cpageop));
-			cpageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
-			MarkBufferDirty(cbuf);
-		}
+				Assert(P_INCOMPLETE_SPLIT(cpageop));
+				cpageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
+				MarkBufferDirty(cbuf);
+			}
 
-		/*
-		 * Cache the block information if we just inserted into the rightmost
-		 * leaf page of the index and it's not the root page.  For very small
-		 * index where root is also the leaf, there is no point trying for any
-		 * optimization.
-		 */
-		if (P_RIGHTMOST(lpageop) && P_ISLEAF(lpageop) && !P_ISROOT(lpageop))
-			cachedBlock = BufferGetBlockNumber(buf);
+			/* XLOG stuff */
+			if (RelationNeedsWAL(rel))
+			{
+				xl_btree_insert xlrec;
+				xl_btree_metadata xlmeta;
+				uint8		xlinfo;
+				XLogRecPtr	recptr;
 
-		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
-		{
-			xl_btree_insert xlrec;
-			xl_btree_metadata xlmeta;
-			uint8		xlinfo;
-			XLogRecPtr	recptr;
+				xlrec.offnum = itup_off;
 
-			xlrec.offnum = itup_off;
+				XLogBeginInsert();
+				XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
 
-			XLogBeginInsert();
-			XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
+				if (P_ISLEAF(lpageop))
+					xlinfo = XLOG_BTREE_INSERT_LEAF;
+				else
+				{
+					/*
+					* Register the left child whose INCOMPLETE_SPLIT flag was
+					* cleared.
+					*/
+					XLogRegisterBuffer(1, cbuf, REGBUF_STANDARD);
 
-			if (P_ISLEAF(lpageop))
-				xlinfo = XLOG_BTREE_INSERT_LEAF;
-			else
-			{
-				/*
-				 * Register the left child whose INCOMPLETE_SPLIT flag was
-				 * cleared.
-				 */
-				XLogRegisterBuffer(1, cbuf, REGBUF_STANDARD);
+					xlinfo = XLOG_BTREE_INSERT_UPPER;
+				}
 
-				xlinfo = XLOG_BTREE_INSERT_UPPER;
-			}
+				if (BufferIsValid(metabuf))
+				{
+					Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+					xlmeta.version = metad->btm_version;
+					xlmeta.root = metad->btm_root;
+					xlmeta.level = metad->btm_level;
+					xlmeta.fastroot = metad->btm_fastroot;
+					xlmeta.fastlevel = metad->btm_fastlevel;
+					xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
+					xlmeta.last_cleanup_num_heap_tuples =
+						metad->btm_last_cleanup_num_heap_tuples;
+
+					XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
+					XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));
+
+					xlinfo = XLOG_BTREE_INSERT_META;
+				}
 
-			if (BufferIsValid(metabuf))
-			{
-				Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
-				xlmeta.version = metad->btm_version;
-				xlmeta.root = metad->btm_root;
-				xlmeta.level = metad->btm_level;
-				xlmeta.fastroot = metad->btm_fastroot;
-				xlmeta.fastlevel = metad->btm_fastlevel;
-				xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
-				xlmeta.last_cleanup_num_heap_tuples =
-					metad->btm_last_cleanup_num_heap_tuples;
-
-				XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
-				XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));
-
-				xlinfo = XLOG_BTREE_INSERT_META;
-			}
+				XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+				XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
 
-			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
-			XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+				recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
-			recptr = XLogInsert(RM_BTREE_ID, xlinfo);
+				if (BufferIsValid(metabuf))
+				{
+					PageSetLSN(metapg, recptr);
+				}
+				if (BufferIsValid(cbuf))
+				{
+					PageSetLSN(BufferGetPage(cbuf), recptr);
+				}
 
-			if (BufferIsValid(metabuf))
-			{
-				PageSetLSN(metapg, recptr);
-			}
-			if (BufferIsValid(cbuf))
-			{
-				PageSetLSN(BufferGetPage(cbuf), recptr);
+				PageSetLSN(page, recptr);
 			}
 
-			PageSetLSN(page, recptr);
+			END_CRIT_SECTION();
+		}
+		else
+		{
+			/*
+			 * Insert new tuple on place of existing posting tuple.
+			 * Delete old posting tuple, and insert updated tuple instead.
+			 *
+			 * If split was needed, both neworigtup and newrighttup are initialized
+			 * and both will be inserted, otherwise newrighttup is NULL.
+			 *
+			 * It only can happen on leaf page.
+			 */
+			elog(DEBUG4, "_bt_insertonpg. _bt_delete_and_insert %s",  RelationGetRelationName(rel));
+			_bt_delete_and_insert(rel, buf, page, neworigtup, newrighttup, newitemoff);
 		}
 
-		END_CRIT_SECTION();
+		/*
+		 * Cache the block information if we just inserted into the rightmost
+		 * leaf page of the index and it's not the root page.  For very small
+		 * index where root is also the leaf, there is no point trying for any
+		 * optimization.
+		 */
+		if (P_RIGHTMOST(lpageop) && P_ISLEAF(lpageop) && !P_ISROOT(lpageop))
+				cachedBlock = BufferGetBlockNumber(buf);
 
 		/* release buffers */
 		if (BufferIsValid(metabuf))
@@ -1214,7 +1511,8 @@ _bt_insertonpg(Relation rel,
  */
 static Buffer
 _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+		  IndexTuple lefttup, IndexTuple righttup)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1236,6 +1534,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	OffsetNumber firstright;
 	OffsetNumber maxoff;
 	OffsetNumber i;
+	OffsetNumber replaceitemoff = InvalidOffsetNumber;
+	Size		replaceitemsz;
 	bool		newitemonleft,
 				isleaf;
 	IndexTuple	lefthikey;
@@ -1243,6 +1543,24 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
 	/*
+	 * If we're working with splitted posting tuple,
+	 * new tuple is actually contained in righttup posting list
+	 */
+	if (righttup)
+	{
+		newitem = righttup;
+		newitemsz = MAXALIGN(IndexTupleSize(righttup));
+
+		/*
+		 * actual insertion is a replacement of origtup with lefttup
+		 * and insertion of righttup (as newitem) next to it.
+		 */
+		replaceitemoff = newitemoff;
+		replaceitemsz = MAXALIGN(IndexTupleSize(lefttup));
+		newitemoff = OffsetNumberNext(newitemoff);
+	}
+
+	/*
 	 * origpage is the original page to be split.  leftpage is a temporary
 	 * buffer that receives the left-sibling data, which will be copied back
 	 * into origpage on success.  rightpage is the new page that will receive
@@ -1275,7 +1593,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	 * (but not always) redundant information.
 	 */
 	firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
-								  newitem, &newitemonleft);
+								  newitem, replaceitemoff, replaceitemsz,
+								  lefttup, &newitemonleft);
 
 	/* Allocate temp buffer for leftpage */
 	leftpage = PageGetTempPage(origpage);
@@ -1364,6 +1683,17 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			/* incoming tuple will become last on left page */
 			lastleft = newitem;
 		}
+		else if (!newitemonleft && newitemoff == firstright && lefttup)
+		{
+			/*
+			 * if newitem is first on the right page
+			 * and split posting tuple handle is reuqired,
+			 * lastleft will be replaced with lefttup,
+			 * so use it here
+			 */
+			elog(DEBUG4, "lastleft = lefttup firstright %d", firstright);
+			lastleft = lefttup;
+		}
 		else
 		{
 			OffsetNumber lastleftoff;
@@ -1480,6 +1810,39 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 
+		if (i == replaceitemoff)
+		{
+			if (replaceitemoff <= firstright)
+			{
+				elog(DEBUG4, "_bt_split left. replaceitem block %u %s replaceitemoff %d leftoff %d", 
+					origpagenumber, RelationGetRelationName(rel), replaceitemoff, leftoff);
+				if (!_bt_pgaddtup(leftpage, MAXALIGN(IndexTupleSize(lefttup)), lefttup, leftoff))
+				{
+					memset(rightpage, 0, BufferGetPageSize(rbuf));
+					elog(ERROR, "failed to add new item to the left sibling"
+						 " while splitting block %u of index \"%s\"",
+						 origpagenumber, RelationGetRelationName(rel));
+				}
+				leftoff = OffsetNumberNext(leftoff);
+			}
+			else
+			{
+				elog(DEBUG4, "_bt_split right. replaceitem block %u %s replaceitemoff %d newitemoff %d", 
+					 origpagenumber, RelationGetRelationName(rel), replaceitemoff, newitemoff);
+				elog(DEBUG4, "_bt_split right. i %d, maxoff %d, rightoff %d", i, maxoff, rightoff);
+
+				if (!_bt_pgaddtup(rightpage, MAXALIGN(IndexTupleSize(lefttup)), lefttup, rightoff))
+				{
+					memset(rightpage, 0, BufferGetPageSize(rbuf));
+					elog(ERROR, "failed to add new item to the right sibling"
+						 " while splitting block %u of index \"%s\", rightoff %d",
+						 origpagenumber, RelationGetRelationName(rel), rightoff);
+				}
+				rightoff = OffsetNumberNext(rightoff);
+			}
+			continue;
+		}
+
 		/* does new item belong before this one? */
 		if (i == newitemoff)
 		{
@@ -1497,13 +1860,14 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			}
 			else
 			{
+				elog(DEBUG4, "insert newitem to the right. i %d, maxoff %d, rightoff %d", i, maxoff, rightoff);
 				Assert(newitemoff >= firstright);
 				if (!_bt_pgaddtup(rightpage, newitemsz, newitem, rightoff))
 				{
 					memset(rightpage, 0, BufferGetPageSize(rbuf));
 					elog(ERROR, "failed to add new item to the right sibling"
-						 " while splitting block %u of index \"%s\"",
-						 origpagenumber, RelationGetRelationName(rel));
+						 " while splitting block %u of index \"%s\", rightoff %d",
+						 origpagenumber, RelationGetRelationName(rel), rightoff);
 				}
 				rightoff = OffsetNumberNext(rightoff);
 			}
@@ -1547,8 +1911,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		{
 			memset(rightpage, 0, BufferGetPageSize(rbuf));
 			elog(ERROR, "failed to add new item to the right sibling"
-				 " while splitting block %u of index \"%s\"",
-				 origpagenumber, RelationGetRelationName(rel));
+				 " while splitting block %u of index \"%s\" rightoff %d",
+				 origpagenumber, RelationGetRelationName(rel), rightoff);
 		}
 		rightoff = OffsetNumberNext(rightoff);
 	}
@@ -1837,7 +2201,7 @@ _bt_insert_parent(Relation rel,
 		/* Recursively update the parent */
 		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
 					   new_item, stack->bts_offset + 1,
-					   is_only);
+					   is_only, 0);
 
 		/* be tidy */
 		pfree(new_item);
@@ -2290,3 +2654,206 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * the page.
 	 */
 }
+
+/*
+ * Add new item (compressed or not) to the page, while compressing it.
+ * If insertion failed, return false.
+ * Caller should consider this as compression failure and
+ * leave page uncompressed.
+ */
+static void
+insert_itupprev_to_page(Page page, BTCompressState *compressState)
+{
+	IndexTuple	to_insert;
+	OffsetNumber offnum = PageGetMaxOffsetNumber(page);
+
+	if (compressState->ntuples == 0)
+		to_insert = compressState->itupprev;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+											 compressState->ipd,
+											 compressState->ntuples);
+		to_insert = postingtuple;
+		pfree(compressState->ipd);
+	}
+
+	/* Add the new item into the page */
+	offnum = OffsetNumberNext(offnum);
+
+	elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d IndexTupleSize %zu free %zu",
+		 compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+
+	if (PageAddItem(page, (Item) to_insert, IndexTupleSize(to_insert),
+					offnum, false, false) == InvalidOffsetNumber)
+		elog(ERROR, "failed to add tuple to page while compresing it");
+
+	if (compressState->ntuples > 0)
+		pfree(to_insert);
+	compressState->ntuples = 0;
+}
+
+/*
+ * Before splitting the page, try to compress items to free some space.
+ * If compression didn't succeed, buffer will contain old state of the page.
+ * This function should be called after lp_dead items
+ * were removed by _bt_vacuum_one_page().
+ */
+static void
+_bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buffer);
+	Page		newpage;
+	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	bool		use_compression = false;
+	BTCompressState *compressState = NULL;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+	OffsetNumber deletable[MaxOffsetNumber];
+	int			ndeletable = 0;
+
+	/*
+	 * Don't use compression for indexes with INCLUDEd columns and unique
+	 * indexes.
+	 */
+	use_compression = (IndexRelationGetNumberOfKeyAttributes(rel) ==
+					   IndexRelationGetNumberOfAttributes(rel) &&
+					   !rel->rd_index->indisunique);
+	if (!use_compression)
+		return;
+
+	/* init compress state needed to build posting tuples */
+	compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+	compressState->ipd = NULL;
+	compressState->ntuples = 0;
+	compressState->itupprev = NULL;
+	compressState->maxitemsize = BTMaxItemSize(page);
+	compressState->maxpostingsize = 0;
+
+	minoff = P_FIRSTDATAKEY(opaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	
+	/*
+	 * Delete dead tuples if any.
+	 * We cannot simply skip them in the cycle below, because it's neccessary
+	 * to generate special Xlog record containing such tuples to compute
+	 * latestRemovedXid on a standby server later.
+	 *
+	 * This should not affect performance, since it only can happen in a rare
+	 * situation when BTP_HAS_GARBAGE flag was not set and  _bt_vacuum_one_page
+	 * was not called, or _bt_vacuum_one_page didn't remove all dead items.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId	itemid = PageGetItemId(page, P_HIKEY);
+
+		if (ItemIdIsDead(itemid))
+			deletable[ndeletable++] = offnum;
+	}
+
+	if (ndeletable > 0)
+		_bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+
+	/*
+	 * Scan over all items to see which ones can be compressed
+	 */
+	minoff = P_FIRSTDATAKEY(opaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+	newpage = PageGetTempPageCopySpecial(page);
+	elog(DEBUG4, "_bt_compress_one_page rel: %s,blkno: %u",
+		 RelationGetRelationName(rel), BufferGetBlockNumber(buffer));
+
+	/* Copy High Key if any */
+	if (!P_RIGHTMOST(opaque))
+	{
+		ItemId		itemid = PageGetItemId(page, P_HIKEY);
+		Size		itemsz = ItemIdGetLength(itemid);
+		IndexTuple	item = (IndexTuple) PageGetItem(page, itemid);
+
+		if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to add highkey during compression");
+	}
+
+	/*
+	 * Iterate over tuples on the page, try to compress them into posting
+	 * lists and insert into new page.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemId = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemId);
+
+		if (compressState->itupprev != NULL)
+		{
+			int			n_equal_atts =
+			_bt_keep_natts_fast(rel, compressState->itupprev, itup);
+			int			itup_ntuples = BTreeTupleIsPosting(itup) ?
+			BTreeTupleGetNPosting(itup) : 1;
+
+			if (n_equal_atts > natts)
+			{
+				/*
+				 * When tuples are equal, create or update posting.
+				 *
+				 * If posting is too big, insert it on page and continue.
+				 */
+				if (compressState->maxitemsize >
+					MAXALIGN(((IndexTupleSize(compressState->itupprev)
+							   + (compressState->ntuples + itup_ntuples + 1) * sizeof(ItemPointerData)))))
+				{
+					_bt_add_posting_item(compressState, itup);
+				}
+				else
+				{
+					insert_itupprev_to_page(newpage, compressState);
+				}
+			}
+			else
+			{
+				insert_itupprev_to_page(newpage, compressState);
+			}
+		}
+
+		/*
+		 * Copy the tuple into temp variable itupprev to compare it with the
+		 * following tuple and maybe unite them into a posting tuple
+		 */
+		if (compressState->itupprev)
+			pfree(compressState->itupprev);
+		compressState->itupprev = CopyIndexTuple(itup);
+
+		Assert(IndexTupleSize(compressState->itupprev) <= compressState->maxitemsize);
+	}
+
+	/* Handle the last item. */
+	insert_itupprev_to_page(newpage, compressState);
+
+	START_CRIT_SECTION();
+
+	PageRestoreTempPage(newpage, page);
+	MarkBufferDirty(buffer);
+
+	/* Log full page write */
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+
+		recptr = log_newpage_buffer(buffer, true);
+		PageSetLSN(page, recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	elog(DEBUG4, "_bt_compress_one_page. success");
+}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 9c1f7de..86c662d 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -983,14 +983,52 @@ _bt_page_recyclable(Page page)
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
 					OffsetNumber *itemnos, int nitems,
+					OffsetNumber *remainingoffset,
+					IndexTuple *remaining, int nremaining,
 					BlockNumber lastBlockVacuumed)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	Size		itemsz;
+	Size		remaining_sz = 0;
+	char	   *remaining_buf = NULL;
+
+	/* XLOG stuff, buffer for remainings */
+	if (nremaining && RelationNeedsWAL(rel))
+	{
+		Size		offset = 0;
+
+		for (int i = 0; i < nremaining; i++)
+			remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+		remaining_buf = palloc0(remaining_sz);
+		for (int i = 0; i < nremaining; i++)
+		{
+			itemsz = IndexTupleSize(remaining[i]);
+			memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+			offset += MAXALIGN(itemsz);
+		}
+		Assert(offset == remaining_sz);
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
+	/* Handle posting tuples here */
+	for (int i = 0; i < nremaining; i++)
+	{
+		/* At first, delete the old tuple. */
+		PageIndexTupleDelete(page, remainingoffset[i]);
+
+		itemsz = IndexTupleSize(remaining[i]);
+		itemsz = MAXALIGN(itemsz);
+
+		/* Add tuple with remaining ItemPointers to the page. */
+		if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to rewrite compressed item in index while doing vacuum");
+	}
+
 	/* Fix the page */
 	if (nitems > 0)
 		PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1058,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+		xlrec_vacuum.nremaining = nremaining;
+		xlrec_vacuum.ndeleted = nitems;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1073,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		if (nitems > 0)
 			XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
 
+		/*
+		 * Here we should save offnums and remaining tuples themselves. It's
+		 * important to restore them in correct order. At first, we must
+		 * handle remaining tuples and only after that other deleted items.
+		 */
+		if (nremaining > 0)
+		{
+			Assert(remaining_buf != NULL);
+			XLogRegisterBufData(0, (char *) remainingoffset,
+								nremaining * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, remaining_buf, remaining_sz);
+		}
+
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
 		PageSetLSN(page, recptr);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd528..22fb228 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static ItemPointer btreevacuumPosting(BTVacState *vstate, IndexTuple itup,
+									  int *nremaining);
 
 
 /*
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 								 RBM_NORMAL, info->strategy);
 		LockBufferForCleanup(buf);
 		_bt_checkpage(rel, buf);
-		_bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+		_bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+							vstate.lastBlockVacuumed);
 		_bt_relbuf(rel, buf);
 	}
 
@@ -1193,6 +1196,9 @@ restart:
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
+		IndexTuple	remaining[MaxOffsetNumber];
+		OffsetNumber remainingoffset[MaxOffsetNumber];
+		int			nremaining;
 
 		/*
 		 * Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1235,7 @@ restart:
 		 * callback function.
 		 */
 		ndeletable = 0;
+		nremaining = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
 		if (callback)
@@ -1242,31 +1249,78 @@ restart:
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
-				/*
-				 * During Hot Standby we currently assume that
-				 * XLOG_BTREE_VACUUM records do not produce conflicts. That is
-				 * only true as long as the callback function depends only
-				 * upon whether the index tuple refers to heap tuples removed
-				 * in the initial heap scan. When vacuum starts it derives a
-				 * value of OldestXmin. Backends taking later snapshots could
-				 * have a RecentGlobalXmin with a later xid than the vacuum's
-				 * OldestXmin, so it is possible that row versions deleted
-				 * after OldestXmin could be marked as killed by other
-				 * backends. The callback function *could* look at the index
-				 * tuple state in isolation and decide to delete the index
-				 * tuple, though currently it does not. If it ever did, we
-				 * would need to reconsider whether XLOG_BTREE_VACUUM records
-				 * should cause conflicts. If they did cause conflicts they
-				 * would be fairly harsh conflicts, since we haven't yet
-				 * worked out a way to pass a useful value for
-				 * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
-				 * applies to *any* type of index that marks index tuples as
-				 * killed.
-				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				if (BTreeTupleIsPosting(itup))
+				{
+					int			nnewipd = 0;
+					ItemPointer newipd = NULL;
+
+					newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+					if (nnewipd == 0)
+					{
+						/*
+						 * All TIDs from posting list must be deleted, we can
+						 * delete whole tuple in a regular way.
+						 */
+						deletable[ndeletable++] = offnum;
+					}
+					else if (nnewipd == BTreeTupleGetNPosting(itup))
+					{
+						/*
+						 * All TIDs from posting tuple must remain. Do
+						 * nothing, just cleanup.
+						 */
+						pfree(newipd);
+					}
+					else if (nnewipd < BTreeTupleGetNPosting(itup))
+					{
+						/* Some TIDs from posting tuple must remain. */
+						Assert(nnewipd > 0);
+						Assert(newipd != NULL);
+
+						/*
+						 * Form new tuple that contains only remaining TIDs.
+						 * Remember this tuple and the offset of the old tuple
+						 * to update it in place.
+						 */
+						remainingoffset[nremaining] = offnum;
+						remaining[nremaining] = BTreeFormPostingTuple(itup, newipd, nnewipd);
+						nremaining++;
+						pfree(newipd);
+
+						Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+					}
+				}
+				else
+				{
+					htup = &(itup->t_tid);
+
+					/*
+					 * During Hot Standby we currently assume that
+					 * XLOG_BTREE_VACUUM records do not produce conflicts.
+					 * That is only true as long as the callback function
+					 * depends only upon whether the index tuple refers to
+					 * heap tuples removed in the initial heap scan. When
+					 * vacuum starts it derives a value of OldestXmin.
+					 * Backends taking later snapshots could have a
+					 * RecentGlobalXmin with a later xid than the vacuum's
+					 * OldestXmin, so it is possible that row versions deleted
+					 * after OldestXmin could be marked as killed by other
+					 * backends. The callback function *could* look at the
+					 * index tuple state in isolation and decide to delete the
+					 * index tuple, though currently it does not. If it ever
+					 * did, we would need to reconsider whether
+					 * XLOG_BTREE_VACUUM records should cause conflicts. If
+					 * they did cause conflicts they would be fairly harsh
+					 * conflicts, since we haven't yet worked out a way to
+					 * pass a useful value for latestRemovedXid on the
+					 * XLOG_BTREE_VACUUM records. This applies to *any* type
+					 * of index that marks index tuples as killed.
+					 */
+					if (callback(htup, callback_state))
+						deletable[ndeletable++] = offnum;
+				}
 			}
 		}
 
@@ -1274,7 +1328,7 @@ restart:
 		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
 		 * call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nremaining > 0)
 		{
 			/*
 			 * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1345,7 @@ restart:
 			 * that.
 			 */
 			_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+								remainingoffset, remaining, nremaining,
 								vstate->lastBlockVacuumed);
 
 			/*
@@ -1376,6 +1431,41 @@ restart:
 }
 
 /*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+	int			remaining = 0;
+	int			nitem = BTreeTupleGetNPosting(itup);
+	ItemPointer tmpitems = NULL,
+				items = BTreeTupleGetPosting(itup);
+
+	/*
+	 * Check each tuple in the posting list, save alive tuples into tmpitems
+	 */
+	for (int i = 0; i < nitem; i++)
+	{
+		if (vstate->callback(items + i, vstate->callback_state))
+			continue;
+
+		if (tmpitems == NULL)
+			tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+		tmpitems[remaining++] = items[i];
+	}
+
+	*nremaining = remaining;
+	return tmpitems;
+}
+
+/*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
  * btrees always do, so this is trivial.
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 19735bf..de0af9e 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -30,6 +30,9 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+								OffsetNumber offnum, ItemPointer iptr,
+								IndexTuple itup, int i);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -504,7 +507,8 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, key, page, mid);
+		result = _bt_compare_posting(rel, key, page, mid,
+									 &(insertstate->in_posting_offset));
 
 		if (result >= cmpval)
 			low = mid + 1;
@@ -533,6 +537,60 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	return low;
 }
 
+/*
+ * Compare insertion-type scankey to tuple on a page,
+ * taking into account posting tuples.
+ * If the key of the posting tuple is equal to scankey,
+ * find exact position inside the posting list,
+ * using TID as extra attribute.
+ */
+int32
+_bt_compare_posting(Relation rel,
+					BTScanInsert key,
+					Page page,
+					OffsetNumber offnum,
+					int *in_posting_offset)
+{
+	IndexTuple	itup;
+	int			result;
+
+	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	result = _bt_compare(rel, key, page, offnum);
+
+	if (BTreeTupleIsPosting(itup) && result == 0)
+	{
+		int			low,
+					high,
+					mid,
+					res;
+
+		low = 0;
+		/* "high" is past end of posting list for loop invariant */
+		high = BTreeTupleGetNPosting(itup);
+
+		while (high > low)
+		{
+			mid = low + ((high - low) / 2);
+			res = ItemPointerCompare(key->scantid,
+									 BTreeTupleGetPostingN(itup, mid));
+
+			if (res >= 1)
+				low = mid + 1;
+			else
+				high = mid;
+		}
+
+		*in_posting_offset = high;
+		elog(DEBUG4, "_bt_compare_posting in_posting_offset %d", *in_posting_offset);
+		Assert(ItemPointerCompare(BTreeTupleGetHeapTID(itup),
+							  key->scantid) < 0);
+		Assert(ItemPointerCompare(key->scantid,
+							  BTreeTupleGetMaxTID(itup)) < 0);
+	}
+
+	return result;
+}
+
 /*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
@@ -665,61 +723,120 @@ _bt_compare(Relation rel,
 	 * Use the heap TID attribute and scantid to try to break the tie.  The
 	 * rules are the same as any other key attribute -- only the
 	 * representation differs.
+	 *
+	 * When itup is a posting tuple, the check becomes more complex. It is
+	 * possible that the scankey belongs to the tuple's posting list TID
+	 * range.
+	 *
+	 * _bt_compare() is multipurpose, so it just returns 0 for a fact that key
+	 * matches tuple at this offset.
+	 *
+	 * Use special _bt_compare_posting() wrapper function to handle this case
+	 * and perform recheck for posting tuple, finding exact position of the
+	 * scankey.
 	 */
-	heapTid = BTreeTupleGetHeapTID(itup);
-	if (key->scantid == NULL)
+	if (!BTreeTupleIsPosting(itup))
 	{
+		heapTid = BTreeTupleGetHeapTID(itup);
+		if (key->scantid == NULL)
+		{
+			/*
+			 * Most searches have a scankey that is considered greater than a
+			 * truncated pivot tuple if and when the scankey has equal values
+			 * for attributes up to and including the least significant
+			 * untruncated attribute in tuple.
+			 *
+			 * For example, if an index has the minimum two attributes (single
+			 * user key attribute, plus heap TID attribute), and a page's high
+			 * key is ('foo', -inf), and scankey is ('foo', <omitted>), the
+			 * search will not descend to the page to the left.  The search
+			 * will descend right instead.  The truncated attribute in pivot
+			 * tuple means that all non-pivot tuples on the page to the left
+			 * are strictly < 'foo', so it isn't necessary to descend left. In
+			 * other words, search doesn't have to descend left because it
+			 * isn't interested in a match that has a heap TID value of -inf.
+			 *
+			 * However, some searches (pivotsearch searches) actually require
+			 * that we descend left when this happens.  -inf is treated as a
+			 * possible match for omitted scankey attribute(s).  This is
+			 * needed by page deletion, which must re-find leaf pages that are
+			 * targets for deletion using their high keys.
+			 *
+			 * Note: the heap TID part of the test ensures that scankey is
+			 * being compared to a pivot tuple with one or more truncated key
+			 * attributes.
+			 *
+			 * Note: pg_upgrade'd !heapkeyspace indexes must always descend to
+			 * the left here, since they have no heap TID attribute (and
+			 * cannot have any -inf key values in any case, since truncation
+			 * can only remove non-key attributes).  !heapkeyspace searches
+			 * must always be prepared to deal with matches on both sides of
+			 * the pivot once the leaf level is reached.
+			 */
+			if (key->heapkeyspace && !key->pivotsearch &&
+				key->keysz == ntupatts && heapTid == NULL)
+				return 1;
+
+			/* All provided scankey arguments found to be equal */
+			return 0;
+		}
+
 		/*
-		 * Most searches have a scankey that is considered greater than a
-		 * truncated pivot tuple if and when the scankey has equal values for
-		 * attributes up to and including the least significant untruncated
-		 * attribute in tuple.
-		 *
-		 * For example, if an index has the minimum two attributes (single
-		 * user key attribute, plus heap TID attribute), and a page's high key
-		 * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
-		 * will not descend to the page to the left.  The search will descend
-		 * right instead.  The truncated attribute in pivot tuple means that
-		 * all non-pivot tuples on the page to the left are strictly < 'foo',
-		 * so it isn't necessary to descend left.  In other words, search
-		 * doesn't have to descend left because it isn't interested in a match
-		 * that has a heap TID value of -inf.
-		 *
-		 * However, some searches (pivotsearch searches) actually require that
-		 * we descend left when this happens.  -inf is treated as a possible
-		 * match for omitted scankey attribute(s).  This is needed by page
-		 * deletion, which must re-find leaf pages that are targets for
-		 * deletion using their high keys.
-		 *
-		 * Note: the heap TID part of the test ensures that scankey is being
-		 * compared to a pivot tuple with one or more truncated key
-		 * attributes.
-		 *
-		 * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
-		 * left here, since they have no heap TID attribute (and cannot have
-		 * any -inf key values in any case, since truncation can only remove
-		 * non-key attributes).  !heapkeyspace searches must always be
-		 * prepared to deal with matches on both sides of the pivot once the
-		 * leaf level is reached.
+		 * Treat truncated heap TID as minus infinity, since scankey has a key
+		 * attribute value (scantid) that would otherwise be compared directly
 		 */
-		if (key->heapkeyspace && !key->pivotsearch &&
-			key->keysz == ntupatts && heapTid == NULL)
+		Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+		if (heapTid == NULL)
 			return 1;
 
-		/* All provided scankey arguments found to be equal */
-		return 0;
+		Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+		return ItemPointerCompare(key->scantid, heapTid);
 	}
+	else
+	{
+		heapTid = BTreeTupleGetHeapTID(itup);
+		if (key->scantid != NULL && heapTid != NULL)
+		{
+			int			cmp = ItemPointerCompare(key->scantid, heapTid);
 
-	/*
-	 * Treat truncated heap TID as minus infinity, since scankey has a key
-	 * attribute value (scantid) that would otherwise be compared directly
-	 */
-	Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
-	if (heapTid == NULL)
-		return 1;
+			if (cmp == -1 || cmp == 0)
+			{
+				elog(DEBUG4, "offnum %d Scankey (%u,%u) is less than or equal to posting tuple (%u,%u)",
+					 offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+					 ItemPointerGetOffsetNumberNoCheck(key->scantid),
+					 ItemPointerGetBlockNumberNoCheck(heapTid),
+					 ItemPointerGetOffsetNumberNoCheck(heapTid));
+				return cmp;
+			}
 
-	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	return ItemPointerCompare(key->scantid, heapTid);
+			heapTid = BTreeTupleGetMaxTID(itup);
+			cmp = ItemPointerCompare(key->scantid, heapTid);
+			if (cmp == 1)
+			{
+				elog(DEBUG4, "offnum %d Scankey (%u,%u) is greater than posting tuple (%u,%u)",
+					 offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+					 ItemPointerGetOffsetNumberNoCheck(key->scantid),
+					 ItemPointerGetBlockNumberNoCheck(heapTid),
+					 ItemPointerGetOffsetNumberNoCheck(heapTid));
+				return cmp;
+			}
+
+			/*
+			 * if we got here, scantid is inbetween of posting items of the
+			 * tuple
+			 */
+			elog(DEBUG4, "offnum %d Scankey (%u,%u) is between posting items (%u,%u) and (%u,%u)",
+				 offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+				 ItemPointerGetOffsetNumberNoCheck(key->scantid),
+				 ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(itup)),
+				 ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(itup)),
+				 ItemPointerGetBlockNumberNoCheck(heapTid),
+				 ItemPointerGetOffsetNumberNoCheck(heapTid));
+			return 0;
+		}
+	}
+
+	return 0;
 }
 
 /*
@@ -1456,6 +1573,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	/* initialize tuple workspace to empty */
 	so->currPos.nextTupleOffset = 0;
+	so->currPos.prevTupleOffset = 0;
 
 	/*
 	 * Now that the current page has been made consistent, the macro should be
@@ -1490,8 +1608,22 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				if (!BTreeTupleIsPosting(itup))
+				{
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+				else
+				{
+					/* Return posting list "logical" tuples */
+					for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											itup, i);
+						itemIndex++;
+					}
+				}
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1524,7 +1656,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1532,7 +1664,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxPostingIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1574,8 +1706,23 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (passes_quals && tuple_alive)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				if (!BTreeTupleIsPosting(itup))
+				{
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+				else
+				{
+					/* Return posting list "logical" tuples */
+					/* XXX: Maybe this loop should be backwards? */
+					for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						itemIndex--;
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											itup, i);
+					}
+				}
 			}
 			if (!continuescan)
 			{
@@ -1589,8 +1736,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1603,6 +1750,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1615,6 +1764,33 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 	}
 }
 
+/* Save an index item into so->currPos.items[itemIndex] for posting tuples. */
+static void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer iptr, IndexTuple itup, int i)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *iptr;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		if (i == 0)
+		{
+			/* save key. the same for all tuples in the posting */
+			Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+			currItem->tupleOffset = so->currPos.nextTupleOffset;
+			memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+			so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+			so->currPos.prevTupleOffset = currItem->tupleOffset;
+		}
+		else
+			currItem->tupleOffset = so->currPos.prevTupleOffset;
+	}
+}
+
 /*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index b30cf9e..b058599 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -288,6 +288,8 @@ static void _bt_sortaddtup(Page page, Size itemsize,
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 						 IndexTuple itup);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void _bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+								 BTCompressState *compressState);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
 static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -972,6 +974,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * only shift the line pointer array back and forth, and overwrite
 			 * the tuple space previously occupied by oitup.  This is fairly
 			 * cheap.
+			 *
+			 * If lastleft tuple was a posting tuple, we'll truncate its
+			 * posting list in _bt_truncate as well. Note that it is also
+			 * applicable only to leaf pages, since internal pages never
+			 * contain posting tuples.
 			 */
 			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
 			lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1011,6 +1018,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		 * the minimum key for the new page.
 		 */
 		state->btps_minkey = CopyIndexTuple(oitup);
+		Assert(BTreeTupleIsPivot(state->btps_minkey));
 
 		/*
 		 * Set the sibling links for both pages.
@@ -1052,6 +1060,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(state->btps_minkey == NULL);
 		state->btps_minkey = CopyIndexTuple(itup);
 		/* _bt_sortaddtup() will perform full truncation later */
+		BTreeTupleClearBtIsPosting(state->btps_minkey);
 		BTreeTupleSetNAtts(state->btps_minkey, 0);
 	}
 
@@ -1137,6 +1146,91 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 }
 
 /*
+ * Add new tuple (posting or non-posting) to the page while building index.
+ */
+static void
+_bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+					 BTCompressState *compressState)
+{
+	IndexTuple	to_insert;
+
+	/* Return, if there is no tuple to insert */
+	if (state == NULL)
+		return;
+
+	if (compressState->ntuples == 0)
+		to_insert = compressState->itupprev;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+											 compressState->ipd,
+											 compressState->ntuples);
+		to_insert = postingtuple;
+		pfree(compressState->ipd);
+	}
+
+	_bt_buildadd(wstate, state, to_insert);
+
+	if (compressState->ntuples > 0)
+		pfree(to_insert);
+	compressState->ntuples = 0;
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in compressState.
+ *
+ * Helper function for _bt_load() and _bt_compress_one_page().
+ *
+ * Note: caller is responsible for size check to ensure that resulting tuple
+ * won't exceed BTMaxItemSize.
+ */
+void
+_bt_add_posting_item(BTCompressState *compressState, IndexTuple itup)
+{
+	int			nposting = 0;
+
+	if (compressState->ntuples == 0)
+	{
+		compressState->ipd = palloc0(compressState->maxitemsize);
+
+		if (BTreeTupleIsPosting(compressState->itupprev))
+		{
+			/* if itupprev is posting, add all its TIDs to the posting list */
+			nposting = BTreeTupleGetNPosting(compressState->itupprev);
+			memcpy(compressState->ipd,
+				   BTreeTupleGetPosting(compressState->itupprev),
+				   sizeof(ItemPointerData) * nposting);
+			compressState->ntuples += nposting;
+		}
+		else
+		{
+			memcpy(compressState->ipd, compressState->itupprev,
+				   sizeof(ItemPointerData));
+			compressState->ntuples++;
+		}
+	}
+
+	if (BTreeTupleIsPosting(itup))
+	{
+		/* if tuple is posting, add all its TIDs to the posting list */
+		nposting = BTreeTupleGetNPosting(itup);
+		memcpy(compressState->ipd + compressState->ntuples,
+			   BTreeTupleGetPosting(itup),
+			   sizeof(ItemPointerData) * nposting);
+		compressState->ntuples += nposting;
+	}
+	else
+	{
+		memcpy(compressState->ipd + compressState->ntuples, itup,
+			   sizeof(ItemPointerData));
+		compressState->ntuples++;
+	}
+}
+
+/*
  * Read tuples in correct sort order from tuplesort, and load them into
  * btree leaves.
  */
@@ -1150,9 +1244,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	bool		load1;
 	TupleDesc	tupdes = RelationGetDescr(wstate->index);
 	int			i,
-				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+				natts = IndexRelationGetNumberOfAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		use_compression = false;
+	BTCompressState *compressState = NULL;
+
+	/*
+	 * Don't use compression for indexes with INCLUDEd columns and unique
+	 * indexes.
+	 */
+	use_compression = (IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+					   IndexRelationGetNumberOfAttributes(wstate->index) &&
+					   !wstate->index->rd_index->indisunique);
 
 	if (merge)
 	{
@@ -1266,19 +1371,89 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	}
 	else
 	{
-		/* merge is unnecessary */
-		while ((itup = tuplesort_getindextuple(btspool->sortstate,
-											   true)) != NULL)
+		if (!use_compression)
 		{
-			/* When we see first tuple, create first index page */
-			if (state == NULL)
-				state = _bt_pagestate(wstate, 0);
+			/* merge is unnecessary */
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+					state = _bt_pagestate(wstate, 0);
 
-			_bt_buildadd(wstate, state, itup);
+				_bt_buildadd(wstate, state, itup);
 
-			/* Report progress */
-			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
-										 ++tuples_done);
+				/* Report progress */
+				pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+											 ++tuples_done);
+			}
+		}
+		else
+		{
+			/* init compress state needed to build posting tuples */
+			compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+			compressState->ipd = NULL;
+			compressState->ntuples = 0;
+			compressState->itupprev = NULL;
+			compressState->maxitemsize = 0;
+			compressState->maxpostingsize = 0;
+
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+				{
+					state = _bt_pagestate(wstate, 0);
+					compressState->maxitemsize = BTMaxItemSize(state->btps_page);
+				}
+
+				if (compressState->itupprev != NULL)
+				{
+					int			n_equal_atts = _bt_keep_natts_fast(wstate->index,
+																   compressState->itupprev, itup);
+
+					if (n_equal_atts > natts)
+					{
+						/*
+						 * Tuples are equal. Create or update posting.
+						 *
+						 * Else If posting is too big, insert it on page and
+						 * continue.
+						 */
+						if ((compressState->ntuples + 1) * sizeof(ItemPointerData) <
+							compressState->maxpostingsize)
+							_bt_add_posting_item(compressState, itup);
+						else
+							_bt_buildadd_posting(wstate, state,
+												 compressState);
+					}
+					else
+					{
+						/*
+						 * Tuples are not equal. Insert itupprev into index.
+						 * Save current tuple for the next iteration.
+						 */
+						_bt_buildadd_posting(wstate, state, compressState);
+					}
+				}
+
+				/*
+				 * Save the tuple to compare it with the next one and maybe
+				 * unite them into a posting tuple.
+				 */
+				if (compressState->itupprev)
+					pfree(compressState->itupprev);
+				compressState->itupprev = CopyIndexTuple(itup);
+
+				/* compute max size of posting list */
+				compressState->maxpostingsize = compressState->maxitemsize -
+					IndexInfoFindDataOffset(compressState->itupprev->t_info) -
+					MAXALIGN(IndexTupleSize(compressState->itupprev));
+			}
+
+			/* Handle the last item */
+			_bt_buildadd_posting(wstate, state, compressState);
 		}
 	}
 
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index a7882fd..c492b04 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -62,6 +62,11 @@ typedef struct
 	int			nsplits;		/* current number of splits */
 	SplitPoint *splits;			/* all candidate split points for page */
 	int			interval;		/* current range of acceptable split points */
+
+	/* fields only valid when insert splitted posting tuple */
+	OffsetNumber replaceitemoff;
+	IndexTuple	 replaceitem;
+	Size		 replaceitemsz;
 } FindSplitData;
 
 static void _bt_recsplitloc(FindSplitData *state,
@@ -129,6 +134,9 @@ _bt_findsplitloc(Relation rel,
 				 OffsetNumber newitemoff,
 				 Size newitemsz,
 				 IndexTuple newitem,
+				 OffsetNumber replaceitemoff,
+				 Size replaceitemsz,
+				 IndexTuple replaceitem,
 				 bool *newitemonleft)
 {
 	BTPageOpaque opaque;
@@ -183,6 +191,10 @@ _bt_findsplitloc(Relation rel,
 	state.minfirstrightsz = SIZE_MAX;
 	state.newitemoff = newitemoff;
 
+	state.replaceitemoff = replaceitemoff;
+	state.replaceitemsz = replaceitemsz;
+	state.replaceitem = replaceitem;
+
 	/*
 	 * maxsplits should never exceed maxoff because there will be at most as
 	 * many candidate split points as there are points _between_ tuples, once
@@ -207,7 +219,17 @@ _bt_findsplitloc(Relation rel,
 		Size		itemsz;
 
 		itemid = PageGetItemId(page, offnum);
-		itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);
+
+		/* use size of replacing item for calculations */
+		if (offnum == replaceitemoff)
+		{
+			itemsz = replaceitemsz + sizeof(ItemIdData);
+			olddataitemstotal = state.olddataitemstotal = state.olddataitemstotal
+									  - MAXALIGN(ItemIdGetLength(itemid))
+									  + replaceitemsz;
+		}
+		else
+			itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);
 
 		/*
 		 * When item offset number is not newitemoff, neither side of the
@@ -466,9 +488,13 @@ _bt_recsplitloc(FindSplitData *state,
 							 && !newitemonleft);
 
 	if (newitemisfirstonright)
+	{
 		firstrightitemsz = state->newitemsz;
+	}
 	else
+	{
 		firstrightitemsz = firstoldonrightsz;
+	}
 
 	/* Account for all the old tuples */
 	leftfree = state->leftspace - olddataitemstoleft;
@@ -492,12 +518,12 @@ _bt_recsplitloc(FindSplitData *state,
 	 * adding a heap TID to the left half's new high key when splitting at the
 	 * leaf level.  In practice the new high key will often be smaller and
 	 * will rarely be larger, but conservatively assume the worst case.
+	 * Truncation always truncates away any posting list that appears in the
+	 * first right tuple, though, so it's safe to subtract that overhead
+	 * (while still conservatively assuming that truncation might have to add
+	 * back a single heap TID using the pivot tuple heap TID representation).
 	 */
-	if (state->is_leaf)
-		leftfree -= (int16) (firstrightitemsz +
-							 MAXALIGN(sizeof(ItemPointerData)));
-	else
-		leftfree -= (int16) firstrightitemsz;
+	leftfree -= (int16) firstrightitemsz;
 
 	/* account for the new item */
 	if (newitemonleft)
@@ -1066,13 +1092,20 @@ static inline IndexTuple
 _bt_split_lastleft(FindSplitData *state, SplitPoint *split)
 {
 	ItemId		itemid;
+	OffsetNumber offset;
 
 	if (split->newitemonleft && split->firstoldonright == state->newitemoff)
 		return state->newitem;
 
-	itemid = PageGetItemId(state->page,
-						   OffsetNumberPrev(split->firstoldonright));
-	return (IndexTuple) PageGetItem(state->page, itemid);
+	offset = OffsetNumberPrev(split->firstoldonright);
+	if (offset == state->replaceitemoff)
+		return state->replaceitem;
+	else
+	{
+		itemid = PageGetItemId(state->page,
+							OffsetNumberPrev(split->firstoldonright));
+		return (IndexTuple) PageGetItem(state->page, itemid);
+	}
 }
 
 /*
@@ -1086,6 +1119,11 @@ _bt_split_firstright(FindSplitData *state, SplitPoint *split)
 	if (!split->newitemonleft && split->firstoldonright == state->newitemoff)
 		return state->newitem;
 
-	itemid = PageGetItemId(state->page, split->firstoldonright);
-	return (IndexTuple) PageGetItem(state->page, itemid);
+	if (split->firstoldonright == state->replaceitemoff)
+		return state->replaceitem;
+	else
+	{
+		itemid = PageGetItemId(state->page, split->firstoldonright);
+		return (IndexTuple) PageGetItem(state->page, itemid);
+	}
 }
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 9b172c1..c56f5ab 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -111,8 +111,12 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	key->nextkey = false;
 	key->pivotsearch = false;
 	key->keysz = Min(indnkeyatts, tupnatts);
-	key->scantid = key->heapkeyspace && itup ?
-		BTreeTupleGetHeapTID(itup) : NULL;
+
+	if (itup && key->heapkeyspace)
+		key->scantid = BTreeTupleGetHeapTID(itup);
+	else
+		key->scantid = NULL;
+
 	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
@@ -1787,7 +1791,9 @@ _bt_killitems(IndexScanDesc scan)
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			/* No microvacuum for posting tuples */
+			if (!BTreeTupleIsPosting(ituple) &&
+				(ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
 			{
 				/* found the item */
 				ItemIdMarkDead(iid);
@@ -2112,6 +2118,7 @@ btbuildphasename(int64 phasenum)
  * returning an enlarged tuple to caller when truncation + TOAST compression
  * ends up enlarging the final datum.
  */
+
 IndexTuple
 _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 			 BTScanInsert itup_key)
@@ -2124,6 +2131,17 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	ItemPointer pivotheaptid;
 	Size		newsize;
 
+	elog(DEBUG4, "_bt_truncate left N %d (%u,%u) to (%u,%u), right N %d (%u,%u) to (%u,%u) ",
+				BTreeTupleIsPosting(lastleft)?BTreeTupleGetNPosting(lastleft):0,
+				ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(lastleft)),
+				ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(lastleft)),
+				ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(lastleft)),
+				ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(lastleft)),
+				BTreeTupleIsPosting(firstright)?BTreeTupleGetNPosting(firstright):0,
+				ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(firstright)),
+				ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(firstright)),
+				ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(firstright)),
+				ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(firstright)));
 	/*
 	 * We should only ever truncate leaf index tuples.  It's never okay to
 	 * truncate a second time.
@@ -2145,6 +2163,16 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(firstright))
+		{
+			BTreeTupleClearBtIsPosting(pivot);
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			pivot->t_info &= ~INDEX_SIZE_MASK;
+			pivot->t_info |= BTreeTupleGetPostingOffset(firstright);
+		}
+
+		Assert(!BTreeTupleIsPosting(pivot));
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2161,6 +2189,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * attribute to the new pivot tuple.
 		 */
 		Assert(natts != nkeyatts);
+		Assert(!BTreeTupleIsPosting(lastleft));
+		Assert(!BTreeTupleIsPosting(firstright));
 		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
 		tidpivot = palloc0(newsize);
 		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2168,6 +2198,27 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		pfree(pivot);
 		pivot = tidpivot;
 	}
+	else if (BTreeTupleIsPosting(firstright))
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal. But
+		 * the tuple is a compressed tuple with a posting list, so we still
+		 * must truncate it.
+		 *
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		newsize = BTreeTupleGetPostingOffset(firstright) +
+			MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+		pivot->t_info &= ~INDEX_SIZE_MASK;
+		pivot->t_info |= newsize;
+		BTreeTupleClearBtIsPosting(pivot);
+		BTreeTupleSetAltHeapTID(pivot);
+
+		Assert(!BTreeTupleIsPosting(pivot));
+	}
 	else
 	{
 		/*
@@ -2205,7 +2256,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2216,9 +2267,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+							  BTreeTupleGetHeapTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #else
 
 	/*
@@ -2231,7 +2285,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2240,7 +2294,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2330,6 +2385,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * leaving excessive amounts of free space on either side of page split.
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts.
+ *
+ * To build a posting tuple we need to ensure that all attributes
+ * of both tuples are equal. Use this function to compare them.
+ * TODO: maybe it's worth to rename the function.
+ *
+ * XXX: Obviously we need infrastructure for making sure it is okay to use
+ * this for posting list stuff.  For example, non-deterministic collations
+ * cannot use compression, and will not work with what we have now.
+ *
+ * XXX: Even then, we probably also need to worry about TOAST as a special
+ * case.  Don't repeat bugs like the amcheck bug that was fixed in commit
+ * eba775345d23d2c999bbb412ae658b6dab36e3e8.  As the test case added in that
+ * commit shows, we need to worry about pg_attribute.attstorage changing in
+ * the underlying table due to an ALTER TABLE (and maybe a few other things
+ * like that).  In general, the "TOAST input state" of a TOASTable datum isn't
+ * something that we make many guarantees about today, so even with C
+ * collation text we could in theory get different answers from
+ * _bt_keep_natts_fast() and _bt_keep_natts().  This needs to be nailed down
+ * in some way.
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2415,7 +2489,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * Non-pivot tuples currently never use alternative heap TID
 			 * representation -- even those within heapkeyspace indexes
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
@@ -2470,7 +2544,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * that to decide if the tuple is a pre-v11 tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
+				(!BTreeTupleIsPivot(itup) &&
 				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
 		}
 		else
@@ -2497,7 +2571,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
 		return false;
 
 	/*
@@ -2549,6 +2623,8 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 	if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
 		return;
 
+	/* TODO correct error messages for posting tuples */
+
 	/*
 	 * Internal page insertions cannot fail here, because that would mean that
 	 * an earlier leaf level insertion that should have failed didn't
@@ -2575,3 +2651,79 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Given a basic tuple that contains key datum and posting list,
+ * build a posting tuple.
+ *
+ * Basic tuple can be a posting tuple, but we only use key part of it,
+ * all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1 fallback to building a non-posting tuple.
+ * It is necessary to avoid storage overhead after posting tuple was vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+	uint32		keysize,
+				newsize = 0;
+	IndexTuple	itup;
+
+	/* We only need key part of the tuple */
+	if (BTreeTupleIsPosting(tuple))
+		keysize = BTreeTupleGetPostingOffset(tuple);
+	else
+		keysize = IndexTupleSize(tuple);
+
+	Assert(nipd > 0);
+
+	/* Add space needed for posting list */
+	if (nipd > 1)
+		newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+	else
+		newsize = keysize;
+
+	newsize = MAXALIGN(newsize);
+	itup = palloc0(newsize);
+	memcpy(itup, tuple, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+
+	if (nipd > 1)
+	{
+		/* Form posting tuple, fill posting fields */
+
+		/* Set meta info about the posting list */
+		itup->t_info |= INDEX_ALT_TID_MASK;
+		BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+		/* sort the list to preserve TID order invariant */
+		qsort((void *) ipd, nipd, sizeof(ItemPointerData),
+			  (int (*) (const void *, const void *)) ItemPointerCompare);
+
+		/* Copy posting list into the posting tuple */
+		memcpy(BTreeTupleGetPosting(itup), ipd,
+			   sizeof(ItemPointerData) * nipd);
+	}
+	else
+	{
+		/* To finish building of a non-posting tuple, copy TID from ipd */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		ItemPointerCopy(ipd, &itup->t_tid);
+	}
+
+	return itup;
+}
+
+/*
+ * Opposite of BTreeFormPostingTuple.
+ * returns regular tuple that contains the key,
+ * the tid of the new tuple is the nth tid of original tuple's posting list
+ * result tuple palloc'd in a caller's context.
+ */
+IndexTuple
+BTreeGetNthTupleOfPosting(IndexTuple tuple, int n)
+{
+	Assert(BTreeTupleIsPosting(tuple));
+	return BTreeFormPostingTuple(tuple, BTreeTupleGetPostingN(tuple, n), 1);
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c..538a6bc 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -386,8 +386,8 @@ btree_xlog_vacuum(XLogReaderState *record)
 	Buffer		buffer;
 	Page		page;
 	BTPageOpaque opaque;
-#ifdef UNUSED
 	xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
 
 	/*
 	 * This section of code is thought to be no longer needed, after analysis
@@ -478,14 +478,34 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		if (len > 0)
 		{
-			OffsetNumber *unused;
-			OffsetNumber *unend;
+			if (xlrec->nremaining)
+			{
+				OffsetNumber *remainingoffset;
+				IndexTuple	remaining;
+				Size		itemsz;
+
+				remainingoffset = (OffsetNumber *)
+					(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+				remaining = (IndexTuple) ((char *) remainingoffset +
+										  xlrec->nremaining * sizeof(OffsetNumber));
 
-			unused = (OffsetNumber *) ptr;
-			unend = (OffsetNumber *) ((char *) ptr + len);
+				/* Handle posting tuples */
+				for (int i = 0; i < xlrec->nremaining; i++)
+				{
+					PageIndexTupleDelete(page, remainingoffset[i]);
+
+					itemsz = MAXALIGN(IndexTupleSize(remaining));
+
+					if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+									false, false) == InvalidOffsetNumber)
+						elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+					remaining = (IndexTuple) ((char *) remaining + itemsz);
+				}
+			}
 
-			if ((unend - unused) > 0)
-				PageIndexMultiDelete(page, unused, unend - unused);
+			if (xlrec->ndeleted)
+				PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 		}
 
 		/*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index a14eb79..e4fa99a 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -46,8 +46,10 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "lastBlockVacuumed %u",
-								 xlrec->lastBlockVacuumed);
+				appendStringInfo(buf, "lastBlockVacuumed %u; nremaining %u; ndeleted %u",
+								 xlrec->lastBlockVacuumed,
+								 xlrec->nremaining,
+								 xlrec->ndeleted);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 744ffb6..b10c0d5 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -141,6 +141,10 @@ typedef IndexAttributeBitMapData * IndexAttributeBitMap;
  * On such a page, N tuples could take one MAXALIGN quantum less space than
  * estimated here, seemingly allowing one more tuple than estimated here.
  * But such a page always has at least MAXALIGN special space, so we're safe.
+ *
+ * Note: btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so they may contain more tuples.
+ * Use MaxPostingIndexTuplesPerPage instead.
  */
 #define MaxIndexTuplesPerPage	\
 	((int) ((BLCKSZ - SizeOfPageHeaderData) / \
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 83e0e6c..3064afb 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
  * All other types of index tuples ("pivot" tuples) only have key columns,
  * since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,39 @@ typedef struct BTMetaPageData
  * omitted rather than truncated, since its representation is different to
  * the non-pivot representation.)
  *
+ * Non-pivot posting tuple format:
+ *  t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicated keys more effectively,
+ * we use special format of tuples - posting tuples.
+ * posting_list is an array of ItemPointerData.
+ *
+ * This type of compression never applies to system indexes, unique indexes
+ * or indexes with INCLUDEd columns.
+ *
+ * To differ posting tuples we use INDEX_ALT_TID_MASK flag in t_info and
+ * BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's tid:
+ * - t_tid.ip_blkid contains offset of the posting list.
+ * - t_tid offset field contains number of posting items this tuple contain
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * tuples, which is constrainted by BTMaxItemSize.
+
+ * If page contains so many duplicates, that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), page may contain several posting
+ * tuples with the same key.
+ * Also page can contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
  * Pivot tuple format:
  *
  *  t_tid | t_info | key values | [heap TID]
@@ -281,23 +313,144 @@ typedef struct BTMetaPageData
  * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
  * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
  * number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinct posting tuples from pivot tuples.
  *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_N_POSTING_OFFSET_MASK	0x0FFF
 #define BT_HEAP_TID_ATTR			0x1000
+#define BT_IS_POSTING				0x2000
+
+#define BTreeTupleIsPosting(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+	)
+
+#define BTreeTupleIsPivot(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+	)
 
-/* Get/set downlink block number */
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger then
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+	((int) ((BLCKSZ - SizeOfPageHeaderData - \
+			3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+			(sizeof(ItemPointerData)))
+
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * Iterating over tuples during index build or applying compression to a
+ * single page, we remember a tuple in itupprev, then compare the next one
+ * with it. If tuples are equal, save their TIDs in the posting list.
+ * ntuples contains the size of the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTCompressState
+{
+	Size		maxitemsize;
+	Size		maxpostingsize;
+	IndexTuple	itupprev;
+	int			ntuples;
+	ItemPointerData *ipd;
+} BTCompressState;
+
+/* macros to work with posting tuples *BEGIN* */
+#define BTreeTupleSetBtIsPosting(itup) \
+	do { \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleGetNPosting(itup)	\
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+	)
+
+#define BTreeTupleSetNPosting(itup, n) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+		BTreeTupleSetBtIsPosting(itup); \
+	} while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list.
+ * Caller is responsible for checking BTreeTupleIsPosting to ensure that it
+ * will get what is expected.
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+	)
+#define BTreeTupleSetPostingOffset(itup, offset) \
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerSetBlockNumber(&((itup)->t_tid), (offset)) \
+	)
+#define BTreeSetPostingMeta(itup, nposting, off) \
+	do { \
+		BTreeTupleSetNPosting(itup, nposting); \
+		BTreeTupleSetPostingOffset(itup, off); \
+	} while(0)
+
+#define BTreeTupleGetPosting(itup) \
+	(ItemPointerData*) ((char*)(itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+	(ItemPointerData*) (BTreeTupleGetPosting(itup) + (n))
+
+/*
+ * Posting tuples always contain more than one TID.  The minimum TID can be
+ * accessed using BTreeTupleGetHeapTID().  The maximum is accessed using
+ * BTreeTupleGetMaxTID().
+ */
+#define BTreeTupleGetMaxTID(itup) \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+		( \
+			(ItemPointer) (BTreeTupleGetPosting(itup) + (BTreeTupleGetNPosting(itup)-1)) \
+		) \
+		: \
+		(ItemPointer) &((itup)->t_tid) \
+	)
+/* macros to work with posting tuples *END* */
+
+/* Get/set downlink block number  */
 #define BTreeInnerTupleGetDownLink(itup) \
 	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
 #define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,7 +479,8 @@ typedef struct BTMetaPageData
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
 		( \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
@@ -335,6 +489,7 @@ typedef struct BTMetaPageData
 	)
 #define BTreeTupleSetNAtts(itup, n) \
 	do { \
+		Assert(!BTreeTupleIsPosting(itup)); \
 		(itup)->t_info |= INDEX_ALT_TID_MASK; \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
 	} while(0)
@@ -342,6 +497,8 @@ typedef struct BTMetaPageData
 /*
  * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
  * and non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * For non-pivot posting tuples this returns the first tid from posting list.
  */
 #define BTreeTupleGetHeapTID(itup) \
 	( \
@@ -351,7 +508,10 @@ typedef struct BTMetaPageData
 		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
 					   sizeof(ItemPointerData)) \
 	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+	  : (itup)->t_info & INDEX_ALT_TID_MASK ? \
+		(((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0) ? \
+		(ItemPointer) BTreeTupleGetPosting(itup) : NULL) \
+		: (ItemPointer) &((itup)->t_tid) \
 	)
 /*
  * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
@@ -360,6 +520,7 @@ typedef struct BTMetaPageData
 #define BTreeTupleSetAltHeapTID(itup) \
 	do { \
 		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
 								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
 	} while(0)
@@ -501,6 +662,12 @@ typedef struct BTInsertStateData
 	Buffer		buf;
 
 	/*
+	 * if _bt_binsrch_insert() found the location inside existing posting
+	 * list, save the position inside the list.
+	 */
+	int			in_posting_offset;
+
+	/*
 	 * Cache of bounds within the current buffer.  Only used for insertions
 	 * where _bt_check_unique is called.  See _bt_binsrch_insert and
 	 * _bt_findinsertloc for details.
@@ -567,6 +734,8 @@ typedef struct BTScanPosData
 	 * location in the associated tuple storage workspace.
 	 */
 	int			nextTupleOffset;
+	/* prevTupleOffset is for posting list handling */
+	int			prevTupleOffset;
 
 	/*
 	 * The items array is always ordered in index order (ie, increasing
@@ -579,7 +748,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxPostingIndexTuplesPerPage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -739,6 +908,8 @@ extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
  */
 extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
 									 OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+									 OffsetNumber replaceitemoff, Size replaceitemsz,
+									 IndexTuple replaceitem,
 									 bool *newitemonleft);
 
 /*
@@ -763,6 +934,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems, Relation heapRel);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems,
+								OffsetNumber *remainingoffset,
+								IndexTuple *remaining, int nremaining,
 								BlockNumber lastBlockVacuumed);
 extern int	_bt_pagedel(Relation rel, Buffer buf);
 
@@ -775,6 +948,8 @@ extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
 							bool forupdate, BTStack stack, int access, Snapshot snapshot);
 extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
 extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
+extern int32 _bt_compare_posting(Relation rel, BTScanInsert key, Page page,
+								 OffsetNumber offnum, int *in_posting_offset);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -813,6 +988,9 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd,
+										int nipd);
+extern IndexTuple BTreeGetNthTupleOfPosting(IndexTuple tuple, int n);
 
 /*
  * prototypes for functions in nbtvalidate.c
@@ -825,5 +1003,7 @@ extern bool btvalidate(Oid opclassoid);
 extern IndexBuildResult *btbuild(Relation heap, Relation index,
 								 struct IndexInfo *indexInfo);
 extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+extern void _bt_add_posting_item(BTCompressState *compressState,
+								 IndexTuple itup);
 
 #endif							/* NBTREE_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index afa614d..4b615e0 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -173,10 +173,19 @@ typedef struct xl_btree_vacuum
 {
 	BlockNumber lastBlockVacuumed;
 
-	/* TARGET OFFSET NUMBERS FOLLOW */
+	/*
+	 * This field helps us to find beginning of the remaining tuples from
+	 * postings which follow array of offset numbers.
+	 */
+	uint32		nremaining;
+	uint32		ndeleted;
+
+	/* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+	/* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
+	/* TARGET OFFSET NUMBERS FOLLOW (if any) */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(BlockNumber))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.

#68

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Anastasia Lubennikova (#66)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Tue, Aug 13, 2019 at 8:45 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

I still need to think about the exact details of alignment within
_bt_insertonpg_in_posting(). I'm worried about boundary cases there. I
could be wrong.

Could you explain more about these cases?
Now I don't understand the problem.

Maybe there is no problem.

Thank you for the patch.
Still, I'd suggest to leave it as a possible future improvement, so that
it doesn't
distract us from the original feature.

I don't even think that it's useful work for the future. It's just
nice to be sure that we could support unique index deduplication if it
made sense. Which it doesn't. If I didn't write the patch that
implements deduplication for unique indexes, I might still not realize
that we need the index_compute_xid_horizon_for_tuples() stuff in
certain other places. I'm not serious about it at all, except as a
learning exercise/experiment.

I added to v6 another related fix for _bt_compress_one_page().
Previous code was implicitly deleted DEAD items without
calling index_compute_xid_horizon_for_tuples().
New code has a check whether DEAD items on the page exist and remove
them if any.
Another possible solution is to copy dead items as is from old page to
the new one,
but I think it's good to remove dead tuples as fast as possible.

I think that what you've done in v7 is probably the best way to do it.
It's certainly simple, which is appropriate given that we're not
really expecting to see LP_DEAD items within _bt_compress_one_page()
(we just need to be prepared for them).

v5 makes _bt_insertonpg_in_posting() prepared to overwrite an
existing item if it's an LP_DEAD item that falls in the same TID range
(that's _bt_compare()-wise "equal" to an existing tuple, which may or
may not be a posting list tuple already). I haven't made this code do
something like call index_compute_xid_horizon_for_tuples(), even
though that's needed for correctness (i.e. this new code is currently
broken in the same way that I mentioned unique index support is
broken).

Is it possible that DEAD tuple to delete was smaller than itup?

I'm not sure what you mean by this. I suppose that it doesn't matter,
since we both prefer the alternative that you came up with anyway.

How do you feel about officially calling this deduplication, not
compression? I think that it's a more accurate name for the technique.

I agree.
Should I rename all related names of functions and variables in the patch?

Please rename them when convenient.

--
Peter Geoghegan

#69

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Anastasia Lubennikova (#67)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Fri, Aug 16, 2019 at 8:56 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

Now the algorithm is the following:

- If bt_findinsertloc() found out that tuple belongs to existing posting tuple's
TID interval, it sets 'in_posting_offset' variable and passes it to
_bt_insertonpg()

- If 'in_posting_offset' is valid and origtup is valid,
merge our itup into origtup.

It can result in one tuple neworigtup, that must replace origtup; or two tuples:
neworigtup and newrighttup, if the result exceeds BTMaxItemSize,

That sounds like the right way to do it.

- If two new tuple(s) fit into the old page, we're lucky.
call _bt_delete_and_insert(..., neworigtup, newrighttup, newitemoff) to
atomically replace oldtup with new tuple(s) and generate xlog record.

- In case page split is needed, pass both tuples to _bt_split().
_bt_findsplitloc() is now aware of upcoming replacement of origtup with
neworigtup, so it uses correct item size where needed.

That makes sense, since _bt_split() is responsible for both splitting
the page, and inserting the new item on either the left or right page,
as part of the first phase of a page split. In other words, if you're
adding something new to _bt_insertonpg(), you probably also need to
add something new to _bt_split(). So that's what you did.

It seems that now all replace operations are crash-safe. The new patch passes
all regression tests, so I think it's ready for review again.

I'm looking at it now. I'm going to spend a significant amount of time
on this tomorrow.

I think that we should start to think about efficient WAL-logging now.

In the meantime, I'll run more stress-tests.

As you probably realize, wal_consistency_checking is a good thing to
use with your tests here.

--
Peter Geoghegan

#70

Anastasia Lubennikova

a.lubennikova@postgrespro.ru

over 6 years ago

In reply to: Peter Geoghegan (#69)

1 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

20.08.2019 4:04, Peter Geoghegan wrote:

On Fri, Aug 16, 2019 at 8:56 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

It seems that now all replace operations are crash-safe. The new patch passes
all regression tests, so I think it's ready for review again.

I'm looking at it now. I'm going to spend a significant amount of time
on this tomorrow.

I think that we should start to think about efficient WAL-logging now.

Thank you for the review.

The new version v8 is attached. Compared to previous version, this patch
includes
updated btree_xlog_insert() and btree_xlog_split() so that WAL records
now only contain data
about updated posting tuple and don't require full page writes.
I haven't updated pg_waldump yet. It is postponed until we agree on
nbtxlog changes.

Also in this patch I renamed all 'compress' keywords to 'deduplicate'
and did minor cleanup
of outdated comments.

I'm going to look through the patch once more to update nbtxlog
comments, where needed and
answer to your remarks that are still left in the comments.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

v8-0001-Deduplication-in-nbtree.patchtext/x-patch; name=v8-0001-Deduplication-in-nbtree.patchDownload

commit d73c1b8e10177dfb55ff1b1bac999f85d2a0298d
Author: Anastasia <a.lubennikova@postgrespro.ru>
Date:   Wed Aug 21 20:00:54 2019 +0300

    v8-0001-Deduplication-in-nbtree.patch

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d67..ddc511a 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -924,6 +924,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer	scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -994,29 +995,73 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
 			char	   *itid,
 					   *htid;
+			ItemPointer tid = BTreeTupleGetHeapTID(itup);
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+							ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("could not find tuple using search from root page in index \"%s\"",
 							RelationGetRelationName(state->rel)),
-					 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+					 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
 										itid, htid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is actually a posting list, make sure posting list TIDs
+		 * are in order.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer		current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+				current = BTreeTupleGetPostingN(itup, i);
+
+				if (ItemPointerCompare(current, &last) <= 0)
+				{
+					char	   *itid,
+							   *htid;
+
+					itid = psprintf("(%u,%u)", state->targetblock, offset);
+					htid = psprintf("(%u,%u)",
+									ItemPointerGetBlockNumberNoCheck(current),
+									ItemPointerGetOffsetNumberNoCheck(current));
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+												itid, htid,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1074,12 +1119,33 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				IndexTuple	onetup;
+
+				/* Fingerprint all elements of posting tuple one by one */
+				for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+					norm = bt_normalize_tuple(state, onetup);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != onetup)
+						pfree(norm);
+					pfree(onetup);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
@@ -1087,7 +1153,8 @@ bt_target_page_check(BtreeCheckState *state)
 		 *
 		 * If there is a high key (if this is not the rightmost page on its
 		 * entire level), check that high key actually is upper bound on all
-		 * page items.
+		 * page items.  If this is a posting list tuple, we'll need to set
+		 * scantid to be highest TID in posting list.
 		 *
 		 * We prefer to check all items against high key rather than checking
 		 * just the last and trusting that the operator class obeys the
@@ -1127,6 +1194,9 @@ bt_target_page_check(BtreeCheckState *state)
 		 * tuple. (See also: "Notes About Data Representation" in the nbtree
 		 * README.)
 		 */
+		scantid = skey->scantid;
+		if (!BTreeTupleIsPivot(itup))
+			skey->scantid = BTreeTupleGetMaxTID(itup);
 		if (!P_RIGHTMOST(topaque) &&
 			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
 			  invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1220,7 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
+		skey->scantid = scantid;
 
 		/*
 		 * * Item order check *
@@ -1164,11 +1235,13 @@ bt_target_page_check(BtreeCheckState *state)
 					   *htid,
 					   *nitid,
 					   *nhtid;
+			ItemPointer tid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetHeapTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 			nitid = psprintf("(%u,%u)", state->targetblock,
 							 OffsetNumberNext(offset));
 
@@ -1177,9 +1250,11 @@ bt_target_page_check(BtreeCheckState *state)
 										  state->target,
 										  OffsetNumberNext(offset));
 			itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+			tid = BTreeTupleGetHeapTID(itup);
 			nhtid = psprintf("(%u,%u)",
-							 ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							 ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							 ItemPointerGetBlockNumberNoCheck(tid),
+							 ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1264,10 @@ bt_target_page_check(BtreeCheckState *state)
 										"higher index tid=%s (points to %s tid=%s) "
 										"page lsn=%X/%X.",
 										itid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										htid,
 										nitid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										nhtid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
@@ -1953,10 +2028,11 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
  * verification.  In particular, it won't try to normalize opclass-equal
  * datums with potentially distinct representations (e.g., btree/numeric_ops
  * index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though.  For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have their own posting
+ * list, since dummy CREATE INDEX callback code generates new tuples with the
+ * same normalized representation.  Compression is performed
+ * opportunistically, and in general there is no guarantee about how or when
+ * deduplication will be applied.
  */
 static IndexTuple
 bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -2560,14 +2636,16 @@ static inline ItemPointer
 BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
 							bool nonpivot)
 {
-	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	ItemPointer result;
 	BlockNumber targetblock = state->targetblock;
 
-	if (result == NULL && nonpivot)
+	if (BTreeTupleIsPivot(itup) == nonpivot)
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
 						targetblock, RelationGetRelationName(state->rel))));
 
+	result = BTreeTupleGetHeapTID(itup);
+
 	return result;
 }
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 48d19be..9af59c1 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,15 +47,17 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   BTStack stack,
 						   IndexTuple itup,
 						   OffsetNumber newitemoff,
-						   bool split_only_page);
+						   bool split_only_page, int in_posting_offset);
 static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 						Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
-						IndexTuple newitem);
+						IndexTuple newitem, IndexTuple lefttup, IndexTuple righttup);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 							  BTStack stack, bool is_root, bool is_only);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 						 OffsetNumber itup_off);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void insert_itupprev_to_page(Page page, BTDeduplicateState *deduplicateState);
+static void _bt_deduplicate_one_page(Relation rel, Buffer buffer, Relation heapRel);
 
 /*
  *	_bt_doinsert() -- Handle insertion of a single index tuple in the tree.
@@ -297,10 +299,13 @@ top:
 		 * search bounds established within _bt_check_unique when insertion is
 		 * checkingunique.
 		 */
+		insertstate.in_posting_offset = 0;
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
-		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
-					   itup, newitemoff, false);
+
+		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer,
+					   stack, itup, newitemoff, false,
+					   insertstate.in_posting_offset);
 	}
 	else
 	{
@@ -435,6 +440,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 
 				/* okay, we gotta fetch the heap tuple ... */
 				curitup = (IndexTuple) PageGetItem(page, curitemid);
+				Assert(!BTreeTupleIsPosting(curitup));
 				htid = curitup->t_tid;
 
 				/*
@@ -759,6 +765,26 @@ _bt_findinsertloc(Relation rel,
 			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
 			insertstate->bounds_valid = false;
 		}
+
+		/*
+		 * If the target page is full, try to deduplicate the page
+		 */
+		if (PageGetFreeSpace(page) < insertstate->itemsz && !checkingunique)
+		{
+			_bt_deduplicate_one_page(rel, insertstate->buf, heapRel);
+			insertstate->bounds_valid = false;		/* paranoia */
+
+			/*
+			 * FIXME: _bt_vacuum_one_page() won't have cleared the
+			 * BTP_HAS_GARBAGE flag when it didn't kill items.  Maybe we
+			 * should clear the BTP_HAS_GARBAGE flag bit from the page when
+			 * deduplication avoids a page split -- _bt_vacuum_one_page() is
+			 * expecting a page split that takes care of it.
+			 *
+			 * (On the other hand, maybe it doesn't matter very much.  A
+			 * comment update seems like the bare minimum we should do.)
+			 */
+		}
 	}
 	else
 	{
@@ -900,6 +926,75 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
 	insertstate->bounds_valid = false;
 }
 
+/*
+ * Delete tuple on newitemoff offset and insert newitup at the same offset.
+ *
+ * If original posting tuple was split, 'newitup' represents left part of
+ * original tuple and 'newitupright' is it's right part, that must be inserted
+ * next to newitemoff.
+ * It's essential to do this atomic to be crash safe.
+ *
+ * NOTE All checks of free space must be done before calling this function.
+ *
+ * For use in posting tuple's update.
+ */
+void
+_bt_delete_and_insert(Buffer buf,
+					  Page page,
+					  IndexTuple newitup, IndexTuple newitupright,
+					  OffsetNumber newitemoff, bool need_xlog)
+{
+	Size		newitupsz = IndexTupleSize(newitup);
+
+	newitupsz = MAXALIGN(newitupsz);
+
+	START_CRIT_SECTION();
+
+	PageIndexTupleDelete(page, newitemoff);
+
+	if (!_bt_pgaddtup(page, newitupsz, newitup, newitemoff))
+		elog(ERROR, "failed to insert posting item in index");
+
+	if (newitupright)
+	{
+		if (!_bt_pgaddtup(page, MAXALIGN(IndexTupleSize(newitupright)),
+						  newitupright, OffsetNumberNext(newitemoff)))
+			elog(ERROR, "failed to insert posting item in index");
+	}
+
+	if (BufferIsValid(buf))
+	{
+		MarkBufferDirty(buf);
+
+		/* Xlog stuff */
+		if (need_xlog)
+		{
+			xl_btree_insert xlrec;
+			XLogRecPtr	recptr;
+
+			xlrec.offnum = newitemoff;
+			xlrec.righttupoffset = 1;
+			if (newitupright)
+				xlrec.righttupoffset = IndexTupleSize(newitup);
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
+
+			Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+
+			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+			XLogRegisterBufData(0, (char *) newitup, IndexTupleSize(newitup));
+			if (newitupright)
+				XLogRegisterBufData(0, (char *) newitupright, IndexTupleSize(newitupright));
+
+			recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_INSERT_LEAF);
+
+			PageSetLSN(page, recptr);
+		}
+	}
+	END_CRIT_SECTION();
+}
+
 /*----------
  *	_bt_insertonpg() -- Insert a tuple on a particular page in the index.
  *
@@ -936,11 +1031,16 @@ _bt_insertonpg(Relation rel,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
-			   bool split_only_page)
+			   bool split_only_page,
+			   int in_posting_offset)
 {
 	Page		page;
 	BTPageOpaque lpageop;
 	Size		itemsz;
+	IndexTuple	origtup;
+	IndexTuple	neworigtup = NULL;
+	IndexTuple	newrighttup = NULL;
+	bool		need_split = false;
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -965,13 +1065,184 @@ _bt_insertonpg(Relation rel,
 								 * need to be consistent */
 
 	/*
+	 * If new tuple's key is equal to the key of a posting tuple that already
+	 * exists on the page and it's TID falls inside the min/max range of
+	 * existing posting list, update the posting tuple.
+	 *
+	 * TODO Think of moving this to a separate function.
+	 *
+	 * TODO possible optimization:
+	 *		if original posting tuple is dead,
+	 *		reset in_posting_offset and handle itup as a regular tuple
+	 */
+	if (in_posting_offset)
+	{
+		/* get old posting tuple */
+		ItemId 			itemid = PageGetItemId(page, newitemoff);
+		ItemPointerData *ipd;
+		int				nipd, nipd_right;
+		bool			need_posting_split = false;
+
+		origtup = (IndexTuple) PageGetItem(page, itemid);
+		Assert(BTreeTupleIsPosting(origtup));
+		nipd = BTreeTupleGetNPosting(origtup);
+		Assert(in_posting_offset < nipd);
+		Assert(itup_key->scantid != NULL);
+		Assert(itup_key->heapkeyspace);
+
+		elog(DEBUG4, "(%u,%u) is min, (%u,%u) is max, (%u,%u) is new",
+			ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(origtup)),
+			ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(origtup)),
+			ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(origtup)),
+			ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(origtup)),
+			ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(itup)),
+			ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(itup)));
+
+		/* check if posting tuple must be splitted */
+		if (BTMaxItemSize(page) < MAXALIGN(IndexTupleSize(origtup)) + sizeof(ItemPointerData))
+			need_posting_split = true;
+
+		/*
+		 * If page split is needed, always split posting tuple.
+		 * Probably that is not the most optimal,
+		 * but it allows to simplify _bt_split code.
+		 *
+		 * TODO Does this decision have any significant drawbacks?
+		 */
+		if (PageGetFreeSpace(page) < sizeof(ItemPointerData))
+			need_posting_split = true;
+
+		/*
+		 * Handle corner cases (1)
+		 *		- itup TID is smaller than leftmost orightup TID
+		 */
+		if (ItemPointerCompare(BTreeTupleGetHeapTID(itup),
+								BTreeTupleGetHeapTID(origtup)) < 0)
+		{
+			if (need_posting_split)
+			{
+				/*
+				 * cannot avoid split, so no need in trying to fit itup into posting list.
+				 * handle itup insertion as regular tuple insertion
+				 */
+				elog(DEBUG4, "split posting tuple. itup is to the left of origtup");
+				in_posting_offset = InvalidOffsetNumber;
+				newitemoff = OffsetNumberPrev(newitemoff);
+			}
+			else
+			{
+				ipd = palloc0(nipd + 1);
+				/* insert new item pointer */
+				memcpy(ipd, itup, sizeof(ItemPointerData));
+				/* copy item pointers from original tuple that belong on right */
+				memcpy(ipd + 1, BTreeTupleGetPosting(origtup), sizeof(ItemPointerData) * nipd);
+				neworigtup = BTreeFormPostingTuple(origtup, ipd, nipd+1);
+				pfree(ipd);
+
+				Assert(ItemPointerCompare(BTreeTupleGetHeapTID(neworigtup),
+										  BTreeTupleGetMaxTID(neworigtup)) < 0);
+			}
+		}
+
+		/*
+		 * Handle corner cases (2)
+		 *		- itup TID is larger than rightmost orightup TID
+		 */
+		if (ItemPointerCompare(BTreeTupleGetMaxTID(origtup),
+							   BTreeTupleGetHeapTID(itup)) < 0)
+		{
+			if (need_posting_split)
+			{
+				/*
+				 * cannot avoid split, so no need in trying to fit itup into posting list.
+				 * handle itup insertion as regular tuple insertion
+				 */
+				elog(DEBUG4, "split posting tuple. itup is to the right of origtup");
+				in_posting_offset = InvalidOffsetNumber;
+			}
+			else
+			{
+				ipd = palloc0(nipd + 1);
+				/* insert new item pointer */
+				/* copy item pointers from original tuple that belong on right */
+				memcpy(ipd, BTreeTupleGetPosting(origtup), sizeof(ItemPointerData) * nipd);
+				memcpy(ipd+nipd, itup, sizeof(ItemPointerData));
+
+				neworigtup = BTreeFormPostingTuple(origtup, ipd, nipd+1);
+				pfree(ipd);
+
+				Assert(ItemPointerCompare(BTreeTupleGetHeapTID(neworigtup),
+										  BTreeTupleGetMaxTID(neworigtup)) < 0);
+			}
+		}
+
+		/*
+		 * itup TID belongs to TID range of origtup posting list
+		 *
+		 * Split posting tuple into two halves.
+		 *
+		 * neworigtup (left) tuple contains all item pointers less than the new one and
+		 * newrighttup tuple contains new item pointer and all to the right.
+		 */
+		if (ItemPointerCompare(BTreeTupleGetHeapTID(itup),
+							   BTreeTupleGetHeapTID(origtup)) > 0
+			&&
+			ItemPointerCompare(BTreeTupleGetMaxTID(origtup),
+							   BTreeTupleGetHeapTID(itup)) > 0)
+		{
+			neworigtup = BTreeFormPostingTuple(origtup, BTreeTupleGetPosting(origtup),
+											in_posting_offset);
+
+			nipd_right = nipd - in_posting_offset + 1;
+
+			elog(DEBUG4, "split posting tuple in_posting_offset %d nipd %d nipd_right %d",
+						 in_posting_offset, nipd, nipd_right);
+
+			ipd = palloc0(sizeof(ItemPointerData) * nipd_right);
+			/* insert new item pointer */
+			memcpy(ipd, itup, sizeof(ItemPointerData));
+			/* copy item pointers from original tuple that belong on right */
+			memcpy(ipd + 1,
+				BTreeTupleGetPostingN(origtup, in_posting_offset),
+				sizeof(ItemPointerData) * (nipd - in_posting_offset));
+
+			newrighttup = BTreeFormPostingTuple(origtup, ipd, nipd_right);
+
+			Assert(ItemPointerCompare(BTreeTupleGetMaxTID(neworigtup),
+									BTreeTupleGetHeapTID(newrighttup)) < 0);
+			pfree(ipd);
+
+			elog(DEBUG4, "left N %d (%u,%u) to (%u,%u), right N %d (%u,%u) to (%u,%u) ",
+				BTreeTupleIsPosting(neworigtup)?BTreeTupleGetNPosting(neworigtup):0,
+				ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(neworigtup)),
+				ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(neworigtup)),
+				ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(neworigtup)),
+				ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(neworigtup)),
+				BTreeTupleIsPosting(newrighttup)?BTreeTupleGetNPosting(newrighttup):0,
+				ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(newrighttup)),
+				ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(newrighttup)),
+				ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(newrighttup)),
+				ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(newrighttup)));
+
+			/*
+				* check if splitted tuple still fit into original page
+				* TODO should we add sizeof(ItemIdData) in this check?
+				*/
+			if (PageGetFreeSpace(page) < (MAXALIGN(IndexTupleSize(neworigtup))
+											+ MAXALIGN(IndexTupleSize(newrighttup))
+											- MAXALIGN(IndexTupleSize(origtup))))
+				need_split = true;
+		}
+	}
+
+	/*
 	 * Do we need to split the page to fit the item on it?
 	 *
 	 * Note: PageGetFreeSpace() subtracts sizeof(ItemIdData) from its result,
 	 * so this comparison is correct even though we appear to be accounting
 	 * only for the item and not for its line pointer.
 	 */
-	if (PageGetFreeSpace(page) < itemsz)
+	if (PageGetFreeSpace(page) < itemsz || need_split)
 	{
 		bool		is_root = P_ISROOT(lpageop);
 		bool		is_only = P_LEFTMOST(lpageop) && P_RIGHTMOST(lpageop);
@@ -996,7 +1267,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+						 neworigtup, newrighttup);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1033,142 +1305,161 @@ _bt_insertonpg(Relation rel,
 		itup_off = newitemoff;
 		itup_blkno = BufferGetBlockNumber(buf);
 
-		/*
-		 * If we are doing this insert because we split a page that was the
-		 * only one on its tree level, but was not the root, it may have been
-		 * the "fast root".  We need to ensure that the fast root link points
-		 * at or above the current page.  We can safely acquire a lock on the
-		 * metapage here --- see comments for _bt_newroot().
-		 */
-		if (split_only_page)
+		if (!in_posting_offset)
 		{
-			Assert(!P_ISLEAF(lpageop));
-
-			metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
-			metapg = BufferGetPage(metabuf);
-			metad = BTPageGetMeta(metapg);
-
-			if (metad->btm_fastlevel >= lpageop->btpo.level)
+			/*
+			* If we are doing this insert because we split a page that was the
+			* only one on its tree level, but was not the root, it may have been
+			* the "fast root".  We need to ensure that the fast root link points
+			* at or above the current page.  We can safely acquire a lock on the
+			* metapage here --- see comments for _bt_newroot().
+			*/
+			if (split_only_page)
 			{
-				/* no update wanted */
-				_bt_relbuf(rel, metabuf);
-				metabuf = InvalidBuffer;
-			}
-		}
-
-		/*
-		 * Every internal page should have exactly one negative infinity item
-		 * at all times.  Only _bt_split() and _bt_newroot() should add items
-		 * that become negative infinity items through truncation, since
-		 * they're the only routines that allocate new internal pages.  Do not
-		 * allow a retail insertion of a new item at the negative infinity
-		 * offset.
-		 */
-		if (!P_ISLEAF(lpageop) && newitemoff == P_FIRSTDATAKEY(lpageop))
-			elog(ERROR, "cannot insert second negative infinity item in block %u of index \"%s\"",
-				 itup_blkno, RelationGetRelationName(rel));
+				Assert(!P_ISLEAF(lpageop));
 
-		/* Do the update.  No ereport(ERROR) until changes are logged */
-		START_CRIT_SECTION();
+				metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+				metapg = BufferGetPage(metabuf);
+				metad = BTPageGetMeta(metapg);
 
-		if (!_bt_pgaddtup(page, itemsz, itup, newitemoff))
-			elog(PANIC, "failed to add new item to block %u in index \"%s\"",
-				 itup_blkno, RelationGetRelationName(rel));
+				if (metad->btm_fastlevel >= lpageop->btpo.level)
+				{
+					/* no update wanted */
+					_bt_relbuf(rel, metabuf);
+					metabuf = InvalidBuffer;
+				}
+			}
 
-		MarkBufferDirty(buf);
+			/*
+			* Every internal page should have exactly one negative infinity item
+			* at all times.  Only _bt_split() and _bt_newroot() should add items
+			* that become negative infinity items through truncation, since
+			* they're the only routines that allocate new internal pages.  Do not
+			* allow a retail insertion of a new item at the negative infinity
+			* offset.
+			*/
+			if (!P_ISLEAF(lpageop) && newitemoff == P_FIRSTDATAKEY(lpageop))
+				elog(ERROR, "cannot insert second negative infinity item in block %u of index \"%s\"",
+					itup_blkno, RelationGetRelationName(rel));
+
+			/* Do the update.  No ereport(ERROR) until changes are logged */
+			START_CRIT_SECTION();
+
+			if (!_bt_pgaddtup(page, itemsz, itup, newitemoff))
+				elog(PANIC, "failed to add new item to block %u in index \"%s\"",
+					itup_blkno, RelationGetRelationName(rel));
+
+			MarkBufferDirty(buf);
 
-		if (BufferIsValid(metabuf))
-		{
-			/* upgrade meta-page if needed */
-			if (metad->btm_version < BTREE_NOVAC_VERSION)
-				_bt_upgrademetapage(metapg);
-			metad->btm_fastroot = itup_blkno;
-			metad->btm_fastlevel = lpageop->btpo.level;
-			MarkBufferDirty(metabuf);
-		}
+			if (BufferIsValid(metabuf))
+			{
+				/* upgrade meta-page if needed */
+				if (metad->btm_version < BTREE_NOVAC_VERSION)
+					_bt_upgrademetapage(metapg);
+				metad->btm_fastroot = itup_blkno;
+				metad->btm_fastlevel = lpageop->btpo.level;
+				MarkBufferDirty(metabuf);
+			}
 
-		/* clear INCOMPLETE_SPLIT flag on child if inserting a downlink */
-		if (BufferIsValid(cbuf))
-		{
-			Page		cpage = BufferGetPage(cbuf);
-			BTPageOpaque cpageop = (BTPageOpaque) PageGetSpecialPointer(cpage);
+			/* clear INCOMPLETE_SPLIT flag on child if inserting a downlink */
+			if (BufferIsValid(cbuf))
+			{
+				Page		cpage = BufferGetPage(cbuf);
+				BTPageOpaque cpageop = (BTPageOpaque) PageGetSpecialPointer(cpage);
 
-			Assert(P_INCOMPLETE_SPLIT(cpageop));
-			cpageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
-			MarkBufferDirty(cbuf);
-		}
+				Assert(P_INCOMPLETE_SPLIT(cpageop));
+				cpageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
+				MarkBufferDirty(cbuf);
+			}
 
-		/*
-		 * Cache the block information if we just inserted into the rightmost
-		 * leaf page of the index and it's not the root page.  For very small
-		 * index where root is also the leaf, there is no point trying for any
-		 * optimization.
-		 */
-		if (P_RIGHTMOST(lpageop) && P_ISLEAF(lpageop) && !P_ISROOT(lpageop))
-			cachedBlock = BufferGetBlockNumber(buf);
+			/* XLOG stuff */
+			if (RelationNeedsWAL(rel))
+			{
+				xl_btree_insert xlrec;
+				xl_btree_metadata xlmeta;
+				uint8		xlinfo;
+				XLogRecPtr	recptr;
 
-		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
-		{
-			xl_btree_insert xlrec;
-			xl_btree_metadata xlmeta;
-			uint8		xlinfo;
-			XLogRecPtr	recptr;
+				xlrec.offnum = itup_off;
+				xlrec.righttupoffset = 0;
 
-			xlrec.offnum = itup_off;
+				XLogBeginInsert();
+				XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
 
-			XLogBeginInsert();
-			XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
+				if (P_ISLEAF(lpageop))
+					xlinfo = XLOG_BTREE_INSERT_LEAF;
+				else
+				{
+					/*
+					* Register the left child whose INCOMPLETE_SPLIT flag was
+					* cleared.
+					*/
+					XLogRegisterBuffer(1, cbuf, REGBUF_STANDARD);
 
-			if (P_ISLEAF(lpageop))
-				xlinfo = XLOG_BTREE_INSERT_LEAF;
-			else
-			{
-				/*
-				 * Register the left child whose INCOMPLETE_SPLIT flag was
-				 * cleared.
-				 */
-				XLogRegisterBuffer(1, cbuf, REGBUF_STANDARD);
+					xlinfo = XLOG_BTREE_INSERT_UPPER;
+				}
 
-				xlinfo = XLOG_BTREE_INSERT_UPPER;
-			}
+				if (BufferIsValid(metabuf))
+				{
+					Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+					xlmeta.version = metad->btm_version;
+					xlmeta.root = metad->btm_root;
+					xlmeta.level = metad->btm_level;
+					xlmeta.fastroot = metad->btm_fastroot;
+					xlmeta.fastlevel = metad->btm_fastlevel;
+					xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
+					xlmeta.last_cleanup_num_heap_tuples =
+						metad->btm_last_cleanup_num_heap_tuples;
+
+					XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
+					XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));
+
+					xlinfo = XLOG_BTREE_INSERT_META;
+				}
 
-			if (BufferIsValid(metabuf))
-			{
-				Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
-				xlmeta.version = metad->btm_version;
-				xlmeta.root = metad->btm_root;
-				xlmeta.level = metad->btm_level;
-				xlmeta.fastroot = metad->btm_fastroot;
-				xlmeta.fastlevel = metad->btm_fastlevel;
-				xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
-				xlmeta.last_cleanup_num_heap_tuples =
-					metad->btm_last_cleanup_num_heap_tuples;
-
-				XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
-				XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));
-
-				xlinfo = XLOG_BTREE_INSERT_META;
-			}
+				XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+				XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
 
-			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
-			XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+				recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
-			recptr = XLogInsert(RM_BTREE_ID, xlinfo);
+				if (BufferIsValid(metabuf))
+				{
+					PageSetLSN(metapg, recptr);
+				}
+				if (BufferIsValid(cbuf))
+				{
+					PageSetLSN(BufferGetPage(cbuf), recptr);
+				}
 
-			if (BufferIsValid(metabuf))
-			{
-				PageSetLSN(metapg, recptr);
-			}
-			if (BufferIsValid(cbuf))
-			{
-				PageSetLSN(BufferGetPage(cbuf), recptr);
+				PageSetLSN(page, recptr);
 			}
 
-			PageSetLSN(page, recptr);
+			END_CRIT_SECTION();
+		}
+		else
+		{
+			/*
+			 * Insert new tuple on place of existing posting tuple.
+			 * Delete old posting tuple, and insert updated tuple instead.
+			 *
+			 * If split was needed, both neworigtup and newrighttup are initialized
+			 * and both will be inserted, otherwise newrighttup is NULL.
+			 *
+			 * It only can happen on leaf page.
+			 */
+			elog(DEBUG4, "_bt_insertonpg. _bt_delete_and_insert %s",  RelationGetRelationName(rel));
+			_bt_delete_and_insert(buf, page, neworigtup,
+								  newrighttup, newitemoff, RelationNeedsWAL(rel));
 		}
 
-		END_CRIT_SECTION();
+		/*
+		 * Cache the block information if we just inserted into the rightmost
+		 * leaf page of the index and it's not the root page.  For very small
+		 * index where root is also the leaf, there is no point trying for any
+		 * optimization.
+		 */
+		if (P_RIGHTMOST(lpageop) && P_ISLEAF(lpageop) && !P_ISROOT(lpageop))
+				cachedBlock = BufferGetBlockNumber(buf);
 
 		/* release buffers */
 		if (BufferIsValid(metabuf))
@@ -1214,7 +1505,8 @@ _bt_insertonpg(Relation rel,
  */
 static Buffer
 _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+		  IndexTuple lefttup, IndexTuple righttup)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1236,6 +1528,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	OffsetNumber firstright;
 	OffsetNumber maxoff;
 	OffsetNumber i;
+	OffsetNumber replaceitemoff = InvalidOffsetNumber;
+	Size		replaceitemsz;
 	bool		newitemonleft,
 				isleaf;
 	IndexTuple	lefthikey;
@@ -1243,6 +1537,24 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
 	/*
+	 * If we're working with splitted posting tuple,
+	 * new tuple is actually contained in righttup posting list
+	 */
+	if (righttup)
+	{
+		newitem = righttup;
+		newitemsz = MAXALIGN(IndexTupleSize(righttup));
+
+		/*
+		 * actual insertion is a replacement of origtup with lefttup
+		 * and insertion of righttup (as newitem) next to it.
+		 */
+		replaceitemoff = newitemoff;
+		replaceitemsz = MAXALIGN(IndexTupleSize(lefttup));
+		newitemoff = OffsetNumberNext(newitemoff);
+	}
+
+	/*
 	 * origpage is the original page to be split.  leftpage is a temporary
 	 * buffer that receives the left-sibling data, which will be copied back
 	 * into origpage on success.  rightpage is the new page that will receive
@@ -1275,7 +1587,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	 * (but not always) redundant information.
 	 */
 	firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
-								  newitem, &newitemonleft);
+								  newitem, replaceitemoff, replaceitemsz,
+								  lefttup, &newitemonleft);
 
 	/* Allocate temp buffer for leftpage */
 	leftpage = PageGetTempPage(origpage);
@@ -1364,6 +1677,17 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			/* incoming tuple will become last on left page */
 			lastleft = newitem;
 		}
+		else if (!newitemonleft && newitemoff == firstright && lefttup)
+		{
+			/*
+			 * if newitem is first on the right page
+			 * and split posting tuple handle is reuqired,
+			 * lastleft will be replaced with lefttup,
+			 * so use it here
+			 */
+			elog(DEBUG4, "lastleft = lefttup firstright %d", firstright);
+			lastleft = lefttup;
+		}
 		else
 		{
 			OffsetNumber lastleftoff;
@@ -1480,6 +1804,39 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 
+		if (i == replaceitemoff)
+		{
+			if (replaceitemoff <= firstright)
+			{
+				elog(DEBUG4, "_bt_split left. replaceitem block %u %s replaceitemoff %d leftoff %d", 
+					origpagenumber, RelationGetRelationName(rel), replaceitemoff, leftoff);
+				if (!_bt_pgaddtup(leftpage, MAXALIGN(IndexTupleSize(lefttup)), lefttup, leftoff))
+				{
+					memset(rightpage, 0, BufferGetPageSize(rbuf));
+					elog(ERROR, "failed to add new item to the left sibling"
+						 " while splitting block %u of index \"%s\"",
+						 origpagenumber, RelationGetRelationName(rel));
+				}
+				leftoff = OffsetNumberNext(leftoff);
+			}
+			else
+			{
+				elog(DEBUG4, "_bt_split right. replaceitem block %u %s replaceitemoff %d newitemoff %d", 
+					 origpagenumber, RelationGetRelationName(rel), replaceitemoff, newitemoff);
+				elog(DEBUG4, "_bt_split right. i %d, maxoff %d, rightoff %d", i, maxoff, rightoff);
+
+				if (!_bt_pgaddtup(rightpage, MAXALIGN(IndexTupleSize(lefttup)), lefttup, rightoff))
+				{
+					memset(rightpage, 0, BufferGetPageSize(rbuf));
+					elog(ERROR, "failed to add new item to the right sibling"
+						 " while splitting block %u of index \"%s\", rightoff %d",
+						 origpagenumber, RelationGetRelationName(rel), rightoff);
+				}
+				rightoff = OffsetNumberNext(rightoff);
+			}
+			continue;
+		}
+
 		/* does new item belong before this one? */
 		if (i == newitemoff)
 		{
@@ -1497,13 +1854,14 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			}
 			else
 			{
+				elog(DEBUG4, "insert newitem to the right. i %d, maxoff %d, rightoff %d", i, maxoff, rightoff);
 				Assert(newitemoff >= firstright);
 				if (!_bt_pgaddtup(rightpage, newitemsz, newitem, rightoff))
 				{
 					memset(rightpage, 0, BufferGetPageSize(rbuf));
 					elog(ERROR, "failed to add new item to the right sibling"
-						 " while splitting block %u of index \"%s\"",
-						 origpagenumber, RelationGetRelationName(rel));
+						 " while splitting block %u of index \"%s\", rightoff %d",
+						 origpagenumber, RelationGetRelationName(rel), rightoff);
 				}
 				rightoff = OffsetNumberNext(rightoff);
 			}
@@ -1547,8 +1905,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		{
 			memset(rightpage, 0, BufferGetPageSize(rbuf));
 			elog(ERROR, "failed to add new item to the right sibling"
-				 " while splitting block %u of index \"%s\"",
-				 origpagenumber, RelationGetRelationName(rel));
+				 " while splitting block %u of index \"%s\" rightoff %d",
+				 origpagenumber, RelationGetRelationName(rel), rightoff);
 		}
 		rightoff = OffsetNumberNext(rightoff);
 	}
@@ -1652,6 +2010,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		xlrec.level = ropaque->btpo.level;
 		xlrec.firstright = firstright;
 		xlrec.newitemoff = newitemoff;
+		xlrec.replaceitemoff = replaceitemoff;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1681,6 +2040,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 		XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
 
+		if (replaceitemoff)
+			XLogRegisterBufData(0, (char *) lefttup, MAXALIGN(IndexTupleSize(lefttup)));
+
 		/*
 		 * Log the contents of the right page in the format understood by
 		 * _bt_restore_page().  The whole right page will be recreated.
@@ -1835,7 +2197,7 @@ _bt_insert_parent(Relation rel,
 		/* Recursively insert into the parent */
 		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
 					   new_item, stack->bts_offset + 1,
-					   is_only);
+					   is_only, 0);
 
 		/* be tidy */
 		pfree(new_item);
@@ -2304,3 +2666,206 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * the page.
 	 */
 }
+
+/*
+ * Add new item (posting or not) to the page, while applying deduplication
+ * to it.
+ */
+static void
+insert_itupprev_to_page(Page page, BTDeduplicateState *deduplicateState)
+{
+	IndexTuple	to_insert;
+	OffsetNumber offnum = PageGetMaxOffsetNumber(page);
+
+	if (deduplicateState->ntuples == 0)
+		to_insert = deduplicateState->itupprev;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(deduplicateState->itupprev,
+											 deduplicateState->ipd,
+											 deduplicateState->ntuples);
+		to_insert = postingtuple;
+		pfree(deduplicateState->ipd);
+	}
+
+	/* Add the new item into the page */
+	offnum = OffsetNumberNext(offnum);
+
+	elog(DEBUG4, "insert_itupprev_to_page. deduplicateState->ntuples %d IndexTupleSize %zu free %zu",
+		 deduplicateState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+
+	if (PageAddItem(page, (Item) to_insert, IndexTupleSize(to_insert),
+					offnum, false, false) == InvalidOffsetNumber)
+		elog(ERROR, "failed to add tuple to page while compresing it");
+
+	if (deduplicateState->ntuples > 0)
+		pfree(to_insert);
+	deduplicateState->ntuples = 0;
+}
+
+/*
+ * Before splitting the page, try to deduplicate items to free some space.
+ *
+ * If deduplication was not applied, buffer contains old state of the page.
+ *
+ * It's expected that this function is called after lp_dead items were
+ * removed by _bt_vacuum_one_page(). In case some dead items are still left,
+ * this function cleans them up before applying deduplication.
+ */
+static void
+_bt_deduplicate_one_page(Relation rel, Buffer buffer, Relation heapRel)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buffer);
+	Page		newpage;
+	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	bool		use_deduplication = false;
+	BTDeduplicateState *deduplicateState = NULL;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+	OffsetNumber deletable[MaxOffsetNumber];
+	int			ndeletable = 0;
+
+	/*
+	 * Don't use deduplication for indexes with INCLUDEd columns and unique
+	 * indexes.
+	 */
+	use_deduplication = (IndexRelationGetNumberOfKeyAttributes(rel) ==
+					   IndexRelationGetNumberOfAttributes(rel) &&
+					   !rel->rd_index->indisunique);
+	if (!use_deduplication)
+		return;
+
+	/* init state needed to build posting tuples */
+	deduplicateState = (BTDeduplicateState *) palloc0(sizeof(BTDeduplicateState));
+	deduplicateState->ipd = NULL;
+	deduplicateState->ntuples = 0;
+	deduplicateState->itupprev = NULL;
+	deduplicateState->maxitemsize = BTMaxItemSize(page);
+	deduplicateState->maxpostingsize = 0;
+
+	minoff = P_FIRSTDATAKEY(opaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/*
+	 * Delete dead tuples if any.
+	 * We cannot simply skip them in the cycle below, because it's neccessary
+	 * to generate special Xlog record containing such tuples to compute
+	 * latestRemovedXid on a standby server later.
+	 *
+	 * This should not affect performance, since it only can happen in a rare
+	 * situation when BTP_HAS_GARBAGE flag was not set and  _bt_vacuum_one_page
+	 * was not called, or _bt_vacuum_one_page didn't remove all dead items.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId	itemid = PageGetItemId(page, P_HIKEY);
+
+		if (ItemIdIsDead(itemid))
+			deletable[ndeletable++] = offnum;
+	}
+
+	if (ndeletable > 0)
+		_bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+
+	/*
+	 * Scan over all items to see which ones can be deduplicated
+	 */
+	minoff = P_FIRSTDATAKEY(opaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+	newpage = PageGetTempPageCopySpecial(page);
+	elog(DEBUG4, "_bt_deduplicate_one_page rel: %s,blkno: %u",
+		 RelationGetRelationName(rel), BufferGetBlockNumber(buffer));
+
+	/* Copy High Key if any */
+	if (!P_RIGHTMOST(opaque))
+	{
+		ItemId		itemid = PageGetItemId(page, P_HIKEY);
+		Size		itemsz = ItemIdGetLength(itemid);
+		IndexTuple	item = (IndexTuple) PageGetItem(page, itemid);
+
+		if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to add highkey during deduplication");
+	}
+
+	/*
+	 * Iterate over tuples on the page, try to collect them into posting
+	 * lists and insert into new page.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemId = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemId);
+
+		if (deduplicateState->itupprev != NULL)
+		{
+			int			n_equal_atts =
+			_bt_keep_natts_fast(rel, deduplicateState->itupprev, itup);
+			int			itup_ntuples = BTreeTupleIsPosting(itup) ?
+			BTreeTupleGetNPosting(itup) : 1;
+
+			if (n_equal_atts > natts)
+			{
+				/*
+				 * When tuples are equal, create or update posting.
+				 *
+				 * If posting is too big, insert it on page and continue.
+				 */
+				if (deduplicateState->maxitemsize >
+					MAXALIGN(((IndexTupleSize(deduplicateState->itupprev)
+							   + (deduplicateState->ntuples + itup_ntuples + 1) * sizeof(ItemPointerData)))))
+				{
+					_bt_add_posting_item(deduplicateState, itup);
+				}
+				else
+				{
+					insert_itupprev_to_page(newpage, deduplicateState);
+				}
+			}
+			else
+			{
+				insert_itupprev_to_page(newpage, deduplicateState);
+			}
+		}
+
+		/*
+		 * Copy the tuple into temp variable itupprev to compare it with the
+		 * following tuple and maybe unite them into a posting tuple
+		 */
+		if (deduplicateState->itupprev)
+			pfree(deduplicateState->itupprev);
+		deduplicateState->itupprev = CopyIndexTuple(itup);
+
+		Assert(IndexTupleSize(deduplicateState->itupprev) <= deduplicateState->maxitemsize);
+	}
+
+	/* Handle the last item. */
+	insert_itupprev_to_page(newpage, deduplicateState);
+
+	START_CRIT_SECTION();
+
+	PageRestoreTempPage(newpage, page);
+	MarkBufferDirty(buffer);
+
+	/* Log full page write */
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+
+		recptr = log_newpage_buffer(buffer, true);
+		PageSetLSN(page, recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	elog(DEBUG4, "_bt_deduplicate_one_page. success");
+}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 18c6de2..bd41592 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -983,14 +983,52 @@ _bt_page_recyclable(Page page)
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
 					OffsetNumber *itemnos, int nitems,
+					OffsetNumber *remainingoffset,
+					IndexTuple *remaining, int nremaining,
 					BlockNumber lastBlockVacuumed)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	Size		itemsz;
+	Size		remaining_sz = 0;
+	char	   *remaining_buf = NULL;
+
+	/* XLOG stuff, buffer for remainings */
+	if (nremaining && RelationNeedsWAL(rel))
+	{
+		Size		offset = 0;
+
+		for (int i = 0; i < nremaining; i++)
+			remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+		remaining_buf = palloc0(remaining_sz);
+		for (int i = 0; i < nremaining; i++)
+		{
+			itemsz = IndexTupleSize(remaining[i]);
+			memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+			offset += MAXALIGN(itemsz);
+		}
+		Assert(offset == remaining_sz);
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
+	/* Handle posting tuples here */
+	for (int i = 0; i < nremaining; i++)
+	{
+		/* At first, delete the old tuple. */
+		PageIndexTupleDelete(page, remainingoffset[i]);
+
+		itemsz = IndexTupleSize(remaining[i]);
+		itemsz = MAXALIGN(itemsz);
+
+		/* Add tuple with remaining ItemPointers to the page. */
+		if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to rewrite posting item in index while doing vacuum");
+	}
+
 	/* Fix the page */
 	if (nitems > 0)
 		PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1058,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+		xlrec_vacuum.nremaining = nremaining;
+		xlrec_vacuum.ndeleted = nitems;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1073,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		if (nitems > 0)
 			XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
 
+		/*
+		 * Here we should save offnums and remaining tuples themselves. It's
+		 * important to restore them in correct order. At first, we must
+		 * handle remaining tuples and only after that other deleted items.
+		 */
+		if (nremaining > 0)
+		{
+			Assert(remaining_buf != NULL);
+			XLogRegisterBufData(0, (char *) remainingoffset,
+								nremaining * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, remaining_buf, remaining_sz);
+		}
+
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
 		PageSetLSN(page, recptr);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd528..22fb228 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static ItemPointer btreevacuumPosting(BTVacState *vstate, IndexTuple itup,
+									  int *nremaining);
 
 
 /*
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 								 RBM_NORMAL, info->strategy);
 		LockBufferForCleanup(buf);
 		_bt_checkpage(rel, buf);
-		_bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+		_bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+							vstate.lastBlockVacuumed);
 		_bt_relbuf(rel, buf);
 	}
 
@@ -1193,6 +1196,9 @@ restart:
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
+		IndexTuple	remaining[MaxOffsetNumber];
+		OffsetNumber remainingoffset[MaxOffsetNumber];
+		int			nremaining;
 
 		/*
 		 * Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1235,7 @@ restart:
 		 * callback function.
 		 */
 		ndeletable = 0;
+		nremaining = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
 		if (callback)
@@ -1242,31 +1249,78 @@ restart:
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
-				/*
-				 * During Hot Standby we currently assume that
-				 * XLOG_BTREE_VACUUM records do not produce conflicts. That is
-				 * only true as long as the callback function depends only
-				 * upon whether the index tuple refers to heap tuples removed
-				 * in the initial heap scan. When vacuum starts it derives a
-				 * value of OldestXmin. Backends taking later snapshots could
-				 * have a RecentGlobalXmin with a later xid than the vacuum's
-				 * OldestXmin, so it is possible that row versions deleted
-				 * after OldestXmin could be marked as killed by other
-				 * backends. The callback function *could* look at the index
-				 * tuple state in isolation and decide to delete the index
-				 * tuple, though currently it does not. If it ever did, we
-				 * would need to reconsider whether XLOG_BTREE_VACUUM records
-				 * should cause conflicts. If they did cause conflicts they
-				 * would be fairly harsh conflicts, since we haven't yet
-				 * worked out a way to pass a useful value for
-				 * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
-				 * applies to *any* type of index that marks index tuples as
-				 * killed.
-				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				if (BTreeTupleIsPosting(itup))
+				{
+					int			nnewipd = 0;
+					ItemPointer newipd = NULL;
+
+					newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+					if (nnewipd == 0)
+					{
+						/*
+						 * All TIDs from posting list must be deleted, we can
+						 * delete whole tuple in a regular way.
+						 */
+						deletable[ndeletable++] = offnum;
+					}
+					else if (nnewipd == BTreeTupleGetNPosting(itup))
+					{
+						/*
+						 * All TIDs from posting tuple must remain. Do
+						 * nothing, just cleanup.
+						 */
+						pfree(newipd);
+					}
+					else if (nnewipd < BTreeTupleGetNPosting(itup))
+					{
+						/* Some TIDs from posting tuple must remain. */
+						Assert(nnewipd > 0);
+						Assert(newipd != NULL);
+
+						/*
+						 * Form new tuple that contains only remaining TIDs.
+						 * Remember this tuple and the offset of the old tuple
+						 * to update it in place.
+						 */
+						remainingoffset[nremaining] = offnum;
+						remaining[nremaining] = BTreeFormPostingTuple(itup, newipd, nnewipd);
+						nremaining++;
+						pfree(newipd);
+
+						Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+					}
+				}
+				else
+				{
+					htup = &(itup->t_tid);
+
+					/*
+					 * During Hot Standby we currently assume that
+					 * XLOG_BTREE_VACUUM records do not produce conflicts.
+					 * That is only true as long as the callback function
+					 * depends only upon whether the index tuple refers to
+					 * heap tuples removed in the initial heap scan. When
+					 * vacuum starts it derives a value of OldestXmin.
+					 * Backends taking later snapshots could have a
+					 * RecentGlobalXmin with a later xid than the vacuum's
+					 * OldestXmin, so it is possible that row versions deleted
+					 * after OldestXmin could be marked as killed by other
+					 * backends. The callback function *could* look at the
+					 * index tuple state in isolation and decide to delete the
+					 * index tuple, though currently it does not. If it ever
+					 * did, we would need to reconsider whether
+					 * XLOG_BTREE_VACUUM records should cause conflicts. If
+					 * they did cause conflicts they would be fairly harsh
+					 * conflicts, since we haven't yet worked out a way to
+					 * pass a useful value for latestRemovedXid on the
+					 * XLOG_BTREE_VACUUM records. This applies to *any* type
+					 * of index that marks index tuples as killed.
+					 */
+					if (callback(htup, callback_state))
+						deletable[ndeletable++] = offnum;
+				}
 			}
 		}
 
@@ -1274,7 +1328,7 @@ restart:
 		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
 		 * call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nremaining > 0)
 		{
 			/*
 			 * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1345,7 @@ restart:
 			 * that.
 			 */
 			_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+								remainingoffset, remaining, nremaining,
 								vstate->lastBlockVacuumed);
 
 			/*
@@ -1376,6 +1431,41 @@ restart:
 }
 
 /*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+	int			remaining = 0;
+	int			nitem = BTreeTupleGetNPosting(itup);
+	ItemPointer tmpitems = NULL,
+				items = BTreeTupleGetPosting(itup);
+
+	/*
+	 * Check each tuple in the posting list, save alive tuples into tmpitems
+	 */
+	for (int i = 0; i < nitem; i++)
+	{
+		if (vstate->callback(items + i, vstate->callback_state))
+			continue;
+
+		if (tmpitems == NULL)
+			tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+		tmpitems[remaining++] = items[i];
+	}
+
+	*nremaining = remaining;
+	return tmpitems;
+}
+
+/*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
  * btrees always do, so this is trivial.
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 7f77ed2..6282c6b 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -30,6 +30,9 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+								OffsetNumber offnum, ItemPointer iptr,
+								IndexTuple itup, int i);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -497,7 +500,8 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, key, page, mid);
+		result = _bt_compare_posting(rel, key, page, mid,
+									 &(insertstate->in_posting_offset));
 
 		if (result >= cmpval)
 			low = mid + 1;
@@ -526,6 +530,60 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	return low;
 }
 
+/*
+ * Compare insertion-type scankey to tuple on a page,
+ * taking into account posting tuples.
+ * If the key of the posting tuple is equal to scankey,
+ * find exact position inside the posting list,
+ * using TID as extra attribute.
+ */
+int32
+_bt_compare_posting(Relation rel,
+					BTScanInsert key,
+					Page page,
+					OffsetNumber offnum,
+					int *in_posting_offset)
+{
+	IndexTuple	itup;
+	int			result;
+
+	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	result = _bt_compare(rel, key, page, offnum);
+
+	if (BTreeTupleIsPosting(itup) && result == 0)
+	{
+		int			low,
+					high,
+					mid,
+					res;
+
+		low = 0;
+		/* "high" is past end of posting list for loop invariant */
+		high = BTreeTupleGetNPosting(itup);
+
+		while (high > low)
+		{
+			mid = low + ((high - low) / 2);
+			res = ItemPointerCompare(key->scantid,
+									 BTreeTupleGetPostingN(itup, mid));
+
+			if (res >= 1)
+				low = mid + 1;
+			else
+				high = mid;
+		}
+
+		*in_posting_offset = high;
+		elog(DEBUG4, "_bt_compare_posting in_posting_offset %d", *in_posting_offset);
+		Assert(ItemPointerCompare(BTreeTupleGetHeapTID(itup),
+							  key->scantid) < 0);
+		Assert(ItemPointerCompare(key->scantid,
+							  BTreeTupleGetMaxTID(itup)) < 0);
+	}
+
+	return result;
+}
+
 /*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
@@ -658,61 +716,120 @@ _bt_compare(Relation rel,
 	 * Use the heap TID attribute and scantid to try to break the tie.  The
 	 * rules are the same as any other key attribute -- only the
 	 * representation differs.
+	 *
+	 * When itup is a posting tuple, the check becomes more complex. It is
+	 * possible that the scankey belongs to the tuple's posting list TID
+	 * range.
+	 *
+	 * _bt_compare() is multipurpose, so it just returns 0 for a fact that key
+	 * matches tuple at this offset.
+	 *
+	 * Use special _bt_compare_posting() wrapper function to handle this case
+	 * and perform recheck for posting tuple, finding exact position of the
+	 * scankey.
 	 */
-	heapTid = BTreeTupleGetHeapTID(itup);
-	if (key->scantid == NULL)
+	if (!BTreeTupleIsPosting(itup))
 	{
+		heapTid = BTreeTupleGetHeapTID(itup);
+		if (key->scantid == NULL)
+		{
+			/*
+			 * Most searches have a scankey that is considered greater than a
+			 * truncated pivot tuple if and when the scankey has equal values
+			 * for attributes up to and including the least significant
+			 * untruncated attribute in tuple.
+			 *
+			 * For example, if an index has the minimum two attributes (single
+			 * user key attribute, plus heap TID attribute), and a page's high
+			 * key is ('foo', -inf), and scankey is ('foo', <omitted>), the
+			 * search will not descend to the page to the left.  The search
+			 * will descend right instead.  The truncated attribute in pivot
+			 * tuple means that all non-pivot tuples on the page to the left
+			 * are strictly < 'foo', so it isn't necessary to descend left. In
+			 * other words, search doesn't have to descend left because it
+			 * isn't interested in a match that has a heap TID value of -inf.
+			 *
+			 * However, some searches (pivotsearch searches) actually require
+			 * that we descend left when this happens.  -inf is treated as a
+			 * possible match for omitted scankey attribute(s).  This is
+			 * needed by page deletion, which must re-find leaf pages that are
+			 * targets for deletion using their high keys.
+			 *
+			 * Note: the heap TID part of the test ensures that scankey is
+			 * being compared to a pivot tuple with one or more truncated key
+			 * attributes.
+			 *
+			 * Note: pg_upgrade'd !heapkeyspace indexes must always descend to
+			 * the left here, since they have no heap TID attribute (and
+			 * cannot have any -inf key values in any case, since truncation
+			 * can only remove non-key attributes).  !heapkeyspace searches
+			 * must always be prepared to deal with matches on both sides of
+			 * the pivot once the leaf level is reached.
+			 */
+			if (key->heapkeyspace && !key->pivotsearch &&
+				key->keysz == ntupatts && heapTid == NULL)
+				return 1;
+
+			/* All provided scankey arguments found to be equal */
+			return 0;
+		}
+
 		/*
-		 * Most searches have a scankey that is considered greater than a
-		 * truncated pivot tuple if and when the scankey has equal values for
-		 * attributes up to and including the least significant untruncated
-		 * attribute in tuple.
-		 *
-		 * For example, if an index has the minimum two attributes (single
-		 * user key attribute, plus heap TID attribute), and a page's high key
-		 * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
-		 * will not descend to the page to the left.  The search will descend
-		 * right instead.  The truncated attribute in pivot tuple means that
-		 * all non-pivot tuples on the page to the left are strictly < 'foo',
-		 * so it isn't necessary to descend left.  In other words, search
-		 * doesn't have to descend left because it isn't interested in a match
-		 * that has a heap TID value of -inf.
-		 *
-		 * However, some searches (pivotsearch searches) actually require that
-		 * we descend left when this happens.  -inf is treated as a possible
-		 * match for omitted scankey attribute(s).  This is needed by page
-		 * deletion, which must re-find leaf pages that are targets for
-		 * deletion using their high keys.
-		 *
-		 * Note: the heap TID part of the test ensures that scankey is being
-		 * compared to a pivot tuple with one or more truncated key
-		 * attributes.
-		 *
-		 * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
-		 * left here, since they have no heap TID attribute (and cannot have
-		 * any -inf key values in any case, since truncation can only remove
-		 * non-key attributes).  !heapkeyspace searches must always be
-		 * prepared to deal with matches on both sides of the pivot once the
-		 * leaf level is reached.
+		 * Treat truncated heap TID as minus infinity, since scankey has a key
+		 * attribute value (scantid) that would otherwise be compared directly
 		 */
-		if (key->heapkeyspace && !key->pivotsearch &&
-			key->keysz == ntupatts && heapTid == NULL)
+		Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+		if (heapTid == NULL)
 			return 1;
 
-		/* All provided scankey arguments found to be equal */
-		return 0;
+		Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+		return ItemPointerCompare(key->scantid, heapTid);
 	}
+	else
+	{
+		heapTid = BTreeTupleGetHeapTID(itup);
+		if (key->scantid != NULL && heapTid != NULL)
+		{
+			int			cmp = ItemPointerCompare(key->scantid, heapTid);
 
-	/*
-	 * Treat truncated heap TID as minus infinity, since scankey has a key
-	 * attribute value (scantid) that would otherwise be compared directly
-	 */
-	Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
-	if (heapTid == NULL)
-		return 1;
+			if (cmp == -1 || cmp == 0)
+			{
+				elog(DEBUG4, "offnum %d Scankey (%u,%u) is less than or equal to posting tuple (%u,%u)",
+					 offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+					 ItemPointerGetOffsetNumberNoCheck(key->scantid),
+					 ItemPointerGetBlockNumberNoCheck(heapTid),
+					 ItemPointerGetOffsetNumberNoCheck(heapTid));
+				return cmp;
+			}
 
-	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	return ItemPointerCompare(key->scantid, heapTid);
+			heapTid = BTreeTupleGetMaxTID(itup);
+			cmp = ItemPointerCompare(key->scantid, heapTid);
+			if (cmp == 1)
+			{
+				elog(DEBUG4, "offnum %d Scankey (%u,%u) is greater than posting tuple (%u,%u)",
+					 offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+					 ItemPointerGetOffsetNumberNoCheck(key->scantid),
+					 ItemPointerGetBlockNumberNoCheck(heapTid),
+					 ItemPointerGetOffsetNumberNoCheck(heapTid));
+				return cmp;
+			}
+
+			/*
+			 * if we got here, scantid is inbetween of posting items of the
+			 * tuple
+			 */
+			elog(DEBUG4, "offnum %d Scankey (%u,%u) is between posting items (%u,%u) and (%u,%u)",
+				 offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+				 ItemPointerGetOffsetNumberNoCheck(key->scantid),
+				 ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(itup)),
+				 ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(itup)),
+				 ItemPointerGetBlockNumberNoCheck(heapTid),
+				 ItemPointerGetOffsetNumberNoCheck(heapTid));
+			return 0;
+		}
+	}
+
+	return 0;
 }
 
 /*
@@ -1449,6 +1566,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	/* initialize tuple workspace to empty */
 	so->currPos.nextTupleOffset = 0;
+	so->currPos.prevTupleOffset = 0;
 
 	/*
 	 * Now that the current page has been made consistent, the macro should be
@@ -1483,8 +1601,22 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				if (!BTreeTupleIsPosting(itup))
+				{
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+				else
+				{
+					/* Return posting list "logical" tuples */
+					for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											itup, i);
+						itemIndex++;
+					}
+				}
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1517,7 +1649,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1525,7 +1657,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxPostingIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1567,8 +1699,23 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (passes_quals && tuple_alive)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				if (!BTreeTupleIsPosting(itup))
+				{
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+				else
+				{
+					/* Return posting list "logical" tuples */
+					/* XXX: Maybe this loop should be backwards? */
+					for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						itemIndex--;
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											itup, i);
+					}
+				}
 			}
 			if (!continuescan)
 			{
@@ -1582,8 +1729,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1596,6 +1743,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1608,6 +1757,33 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 	}
 }
 
+/* Save an index item into so->currPos.items[itemIndex] for posting tuples. */
+static void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer iptr, IndexTuple itup, int i)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *iptr;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		if (i == 0)
+		{
+			/* save key. the same for all tuples in the posting */
+			Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+			currItem->tupleOffset = so->currPos.nextTupleOffset;
+			memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+			so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+			so->currPos.prevTupleOffset = currItem->tupleOffset;
+		}
+		else
+			currItem->tupleOffset = so->currPos.prevTupleOffset;
+	}
+}
+
 /*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index e678690..7c3a42b 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -288,6 +288,8 @@ static void _bt_sortaddtup(Page page, Size itemsize,
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 						 IndexTuple itup);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void _bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+								 BTDeduplicateState *deduplicateState);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
 static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -963,6 +965,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * Overwrite the old item with new truncated high key directly.
 			 * oitup is already located at the physical beginning of tuple
 			 * space, so this should directly reuse the existing tuple space.
+			 *
+			 * If lastleft tuple was a posting tuple, we'll truncate its
+			 * posting list in _bt_truncate as well. Note that it is also
+			 * applicable only to leaf pages, since internal pages never
+			 * contain posting tuples.
 			 */
 			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
 			lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1002,6 +1009,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		 * the minimum key for the new page.
 		 */
 		state->btps_minkey = CopyIndexTuple(oitup);
+		Assert(BTreeTupleIsPivot(state->btps_minkey));
 
 		/*
 		 * Set the sibling links for both pages.
@@ -1043,6 +1051,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(state->btps_minkey == NULL);
 		state->btps_minkey = CopyIndexTuple(itup);
 		/* _bt_sortaddtup() will perform full truncation later */
+		BTreeTupleClearBtIsPosting(state->btps_minkey);
 		BTreeTupleSetNAtts(state->btps_minkey, 0);
 	}
 
@@ -1128,6 +1137,91 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 }
 
 /*
+ * Add new tuple (posting or non-posting) to the page while building index.
+ */
+static void
+_bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+					 BTDeduplicateState *deduplicateState)
+{
+	IndexTuple	to_insert;
+
+	/* Return, if there is no tuple to insert */
+	if (state == NULL)
+		return;
+
+	if (deduplicateState->ntuples == 0)
+		to_insert = deduplicateState->itupprev;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(deduplicateState->itupprev,
+											 deduplicateState->ipd,
+											 deduplicateState->ntuples);
+		to_insert = postingtuple;
+		pfree(deduplicateState->ipd);
+	}
+
+	_bt_buildadd(wstate, state, to_insert);
+
+	if (deduplicateState->ntuples > 0)
+		pfree(to_insert);
+	deduplicateState->ntuples = 0;
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in deduplicateState.
+ *
+ * Helper function for _bt_load() and _bt_deduplicate_one_page().
+ *
+ * Note: caller is responsible for size check to ensure that resulting tuple
+ * won't exceed BTMaxItemSize.
+ */
+void
+_bt_add_posting_item(BTDeduplicateState *deduplicateState, IndexTuple itup)
+{
+	int			nposting = 0;
+
+	if (deduplicateState->ntuples == 0)
+	{
+		deduplicateState->ipd = palloc0(deduplicateState->maxitemsize);
+
+		if (BTreeTupleIsPosting(deduplicateState->itupprev))
+		{
+			/* if itupprev is posting, add all its TIDs to the posting list */
+			nposting = BTreeTupleGetNPosting(deduplicateState->itupprev);
+			memcpy(deduplicateState->ipd,
+				   BTreeTupleGetPosting(deduplicateState->itupprev),
+				   sizeof(ItemPointerData) * nposting);
+			deduplicateState->ntuples += nposting;
+		}
+		else
+		{
+			memcpy(deduplicateState->ipd, deduplicateState->itupprev,
+				   sizeof(ItemPointerData));
+			deduplicateState->ntuples++;
+		}
+	}
+
+	if (BTreeTupleIsPosting(itup))
+	{
+		/* if tuple is posting, add all its TIDs to the posting list */
+		nposting = BTreeTupleGetNPosting(itup);
+		memcpy(deduplicateState->ipd + deduplicateState->ntuples,
+			   BTreeTupleGetPosting(itup),
+			   sizeof(ItemPointerData) * nposting);
+		deduplicateState->ntuples += nposting;
+	}
+	else
+	{
+		memcpy(deduplicateState->ipd + deduplicateState->ntuples, itup,
+			   sizeof(ItemPointerData));
+		deduplicateState->ntuples++;
+	}
+}
+
+/*
  * Read tuples in correct sort order from tuplesort, and load them into
  * btree leaves.
  */
@@ -1141,9 +1235,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	bool		load1;
 	TupleDesc	tupdes = RelationGetDescr(wstate->index);
 	int			i,
-				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+				natts = IndexRelationGetNumberOfAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		use_deduplication = false;
+	BTDeduplicateState *deduplicateState = NULL;
+
+	/*
+	 * Don't use deduplication for indexes with INCLUDEd columns and unique
+	 * indexes.
+	 */
+	use_deduplication = (IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+					   IndexRelationGetNumberOfAttributes(wstate->index) &&
+					   !wstate->index->rd_index->indisunique);
 
 	if (merge)
 	{
@@ -1257,19 +1362,89 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	}
 	else
 	{
-		/* merge is unnecessary */
-		while ((itup = tuplesort_getindextuple(btspool->sortstate,
-											   true)) != NULL)
+		if (!use_deduplication)
 		{
-			/* When we see first tuple, create first index page */
-			if (state == NULL)
-				state = _bt_pagestate(wstate, 0);
+			/* merge is unnecessary */
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+					state = _bt_pagestate(wstate, 0);
 
-			_bt_buildadd(wstate, state, itup);
+				_bt_buildadd(wstate, state, itup);
 
-			/* Report progress */
-			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
-										 ++tuples_done);
+				/* Report progress */
+				pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+											 ++tuples_done);
+			}
+		}
+		else
+		{
+			/* init state needed to build posting tuples */
+			deduplicateState = (BTDeduplicateState *) palloc0(sizeof(BTDeduplicateState));
+			deduplicateState->ipd = NULL;
+			deduplicateState->ntuples = 0;
+			deduplicateState->itupprev = NULL;
+			deduplicateState->maxitemsize = 0;
+			deduplicateState->maxpostingsize = 0;
+
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+				{
+					state = _bt_pagestate(wstate, 0);
+					deduplicateState->maxitemsize = BTMaxItemSize(state->btps_page);
+				}
+
+				if (deduplicateState->itupprev != NULL)
+				{
+					int			n_equal_atts = _bt_keep_natts_fast(wstate->index,
+																   deduplicateState->itupprev, itup);
+
+					if (n_equal_atts > natts)
+					{
+						/*
+						 * Tuples are equal. Create or update posting.
+						 *
+						 * Else If posting is too big, insert it on page and
+						 * continue.
+						 */
+						if ((deduplicateState->ntuples + 1) * sizeof(ItemPointerData) <
+							deduplicateState->maxpostingsize)
+							_bt_add_posting_item(deduplicateState, itup);
+						else
+							_bt_buildadd_posting(wstate, state,
+												 deduplicateState);
+					}
+					else
+					{
+						/*
+						 * Tuples are not equal. Insert itupprev into index.
+						 * Save current tuple for the next iteration.
+						 */
+						_bt_buildadd_posting(wstate, state, deduplicateState);
+					}
+				}
+
+				/*
+				 * Save the tuple to compare it with the next one and maybe
+				 * unite them into a posting tuple.
+				 */
+				if (deduplicateState->itupprev)
+					pfree(deduplicateState->itupprev);
+				deduplicateState->itupprev = CopyIndexTuple(itup);
+
+				/* compute max size of posting list */
+				deduplicateState->maxpostingsize = deduplicateState->maxitemsize -
+					IndexInfoFindDataOffset(deduplicateState->itupprev->t_info) -
+					MAXALIGN(IndexTupleSize(deduplicateState->itupprev));
+			}
+
+			/* Handle the last item */
+			_bt_buildadd_posting(wstate, state, deduplicateState);
 		}
 	}
 
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index a7882fd..c492b04 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -62,6 +62,11 @@ typedef struct
 	int			nsplits;		/* current number of splits */
 	SplitPoint *splits;			/* all candidate split points for page */
 	int			interval;		/* current range of acceptable split points */
+
+	/* fields only valid when insert splitted posting tuple */
+	OffsetNumber replaceitemoff;
+	IndexTuple	 replaceitem;
+	Size		 replaceitemsz;
 } FindSplitData;
 
 static void _bt_recsplitloc(FindSplitData *state,
@@ -129,6 +134,9 @@ _bt_findsplitloc(Relation rel,
 				 OffsetNumber newitemoff,
 				 Size newitemsz,
 				 IndexTuple newitem,
+				 OffsetNumber replaceitemoff,
+				 Size replaceitemsz,
+				 IndexTuple replaceitem,
 				 bool *newitemonleft)
 {
 	BTPageOpaque opaque;
@@ -183,6 +191,10 @@ _bt_findsplitloc(Relation rel,
 	state.minfirstrightsz = SIZE_MAX;
 	state.newitemoff = newitemoff;
 
+	state.replaceitemoff = replaceitemoff;
+	state.replaceitemsz = replaceitemsz;
+	state.replaceitem = replaceitem;
+
 	/*
 	 * maxsplits should never exceed maxoff because there will be at most as
 	 * many candidate split points as there are points _between_ tuples, once
@@ -207,7 +219,17 @@ _bt_findsplitloc(Relation rel,
 		Size		itemsz;
 
 		itemid = PageGetItemId(page, offnum);
-		itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);
+
+		/* use size of replacing item for calculations */
+		if (offnum == replaceitemoff)
+		{
+			itemsz = replaceitemsz + sizeof(ItemIdData);
+			olddataitemstotal = state.olddataitemstotal = state.olddataitemstotal
+									  - MAXALIGN(ItemIdGetLength(itemid))
+									  + replaceitemsz;
+		}
+		else
+			itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);
 
 		/*
 		 * When item offset number is not newitemoff, neither side of the
@@ -466,9 +488,13 @@ _bt_recsplitloc(FindSplitData *state,
 							 && !newitemonleft);
 
 	if (newitemisfirstonright)
+	{
 		firstrightitemsz = state->newitemsz;
+	}
 	else
+	{
 		firstrightitemsz = firstoldonrightsz;
+	}
 
 	/* Account for all the old tuples */
 	leftfree = state->leftspace - olddataitemstoleft;
@@ -492,12 +518,12 @@ _bt_recsplitloc(FindSplitData *state,
 	 * adding a heap TID to the left half's new high key when splitting at the
 	 * leaf level.  In practice the new high key will often be smaller and
 	 * will rarely be larger, but conservatively assume the worst case.
+	 * Truncation always truncates away any posting list that appears in the
+	 * first right tuple, though, so it's safe to subtract that overhead
+	 * (while still conservatively assuming that truncation might have to add
+	 * back a single heap TID using the pivot tuple heap TID representation).
 	 */
-	if (state->is_leaf)
-		leftfree -= (int16) (firstrightitemsz +
-							 MAXALIGN(sizeof(ItemPointerData)));
-	else
-		leftfree -= (int16) firstrightitemsz;
+	leftfree -= (int16) firstrightitemsz;
 
 	/* account for the new item */
 	if (newitemonleft)
@@ -1066,13 +1092,20 @@ static inline IndexTuple
 _bt_split_lastleft(FindSplitData *state, SplitPoint *split)
 {
 	ItemId		itemid;
+	OffsetNumber offset;
 
 	if (split->newitemonleft && split->firstoldonright == state->newitemoff)
 		return state->newitem;
 
-	itemid = PageGetItemId(state->page,
-						   OffsetNumberPrev(split->firstoldonright));
-	return (IndexTuple) PageGetItem(state->page, itemid);
+	offset = OffsetNumberPrev(split->firstoldonright);
+	if (offset == state->replaceitemoff)
+		return state->replaceitem;
+	else
+	{
+		itemid = PageGetItemId(state->page,
+							OffsetNumberPrev(split->firstoldonright));
+		return (IndexTuple) PageGetItem(state->page, itemid);
+	}
 }
 
 /*
@@ -1086,6 +1119,11 @@ _bt_split_firstright(FindSplitData *state, SplitPoint *split)
 	if (!split->newitemonleft && split->firstoldonright == state->newitemoff)
 		return state->newitem;
 
-	itemid = PageGetItemId(state->page, split->firstoldonright);
-	return (IndexTuple) PageGetItem(state->page, itemid);
+	if (split->firstoldonright == state->replaceitemoff)
+		return state->replaceitem;
+	else
+	{
+		itemid = PageGetItemId(state->page, split->firstoldonright);
+		return (IndexTuple) PageGetItem(state->page, itemid);
+	}
 }
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 9b172c1..c506cca 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -111,8 +111,12 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	key->nextkey = false;
 	key->pivotsearch = false;
 	key->keysz = Min(indnkeyatts, tupnatts);
-	key->scantid = key->heapkeyspace && itup ?
-		BTreeTupleGetHeapTID(itup) : NULL;
+
+	if (itup && key->heapkeyspace)
+		key->scantid = BTreeTupleGetHeapTID(itup);
+	else
+		key->scantid = NULL;
+
 	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
@@ -1787,7 +1791,9 @@ _bt_killitems(IndexScanDesc scan)
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			/* No microvacuum for posting tuples */
+			if (!BTreeTupleIsPosting(ituple) &&
+				(ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
 			{
 				/* found the item */
 				ItemIdMarkDead(iid);
@@ -2112,6 +2118,7 @@ btbuildphasename(int64 phasenum)
  * returning an enlarged tuple to caller when truncation + TOAST compression
  * ends up enlarging the final datum.
  */
+
 IndexTuple
 _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 			 BTScanInsert itup_key)
@@ -2124,6 +2131,17 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	ItemPointer pivotheaptid;
 	Size		newsize;
 
+	elog(DEBUG4, "_bt_truncate left N %d (%u,%u) to (%u,%u), right N %d (%u,%u) to (%u,%u) ",
+				BTreeTupleIsPosting(lastleft)?BTreeTupleGetNPosting(lastleft):0,
+				ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(lastleft)),
+				ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(lastleft)),
+				ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(lastleft)),
+				ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(lastleft)),
+				BTreeTupleIsPosting(firstright)?BTreeTupleGetNPosting(firstright):0,
+				ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(firstright)),
+				ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(firstright)),
+				ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(firstright)),
+				ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(firstright)));
 	/*
 	 * We should only ever truncate leaf index tuples.  It's never okay to
 	 * truncate a second time.
@@ -2145,6 +2163,16 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(firstright))
+		{
+			BTreeTupleClearBtIsPosting(pivot);
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			pivot->t_info &= ~INDEX_SIZE_MASK;
+			pivot->t_info |= BTreeTupleGetPostingOffset(firstright);
+		}
+
+		Assert(!BTreeTupleIsPosting(pivot));
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2161,6 +2189,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * attribute to the new pivot tuple.
 		 */
 		Assert(natts != nkeyatts);
+		Assert(!BTreeTupleIsPosting(lastleft));
+		Assert(!BTreeTupleIsPosting(firstright));
 		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
 		tidpivot = palloc0(newsize);
 		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2168,6 +2198,27 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		pfree(pivot);
 		pivot = tidpivot;
 	}
+	else if (BTreeTupleIsPosting(firstright))
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal. But
+		 * the tuple is a posting tuple with a posting list, so we still
+		 * must truncate it.
+		 *
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		newsize = BTreeTupleGetPostingOffset(firstright) +
+			MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+		pivot->t_info &= ~INDEX_SIZE_MASK;
+		pivot->t_info |= newsize;
+		BTreeTupleClearBtIsPosting(pivot);
+		BTreeTupleSetAltHeapTID(pivot);
+
+		Assert(!BTreeTupleIsPosting(pivot));
+	}
 	else
 	{
 		/*
@@ -2205,7 +2256,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2216,9 +2267,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+							  BTreeTupleGetHeapTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #else
 
 	/*
@@ -2231,7 +2285,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2240,7 +2294,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2330,6 +2385,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * leaving excessive amounts of free space on either side of page split.
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts.
+ *
+ * To build a posting tuple we need to ensure that all attributes
+ * of both tuples are equal. Use this function to compare them.
+ * TODO: maybe it's worth to rename the function.
+ *
+ * XXX: Obviously we need infrastructure for making sure it is okay to use
+ * this for posting list stuff.  For example, non-deterministic collations
+ * cannot use deduplication, and will not work with what we have now.
+ *
+ * XXX: Even then, we probably also need to worry about TOAST as a special
+ * case.  Don't repeat bugs like the amcheck bug that was fixed in commit
+ * eba775345d23d2c999bbb412ae658b6dab36e3e8.  As the test case added in that
+ * commit shows, we need to worry about pg_attribute.attstorage changing in
+ * the underlying table due to an ALTER TABLE (and maybe a few other things
+ * like that).  In general, the "TOAST input state" of a TOASTable datum isn't
+ * something that we make many guarantees about today, so even with C
+ * collation text we could in theory get different answers from
+ * _bt_keep_natts_fast() and _bt_keep_natts().  This needs to be nailed down
+ * in some way.
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2415,7 +2489,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * Non-pivot tuples currently never use alternative heap TID
 			 * representation -- even those within heapkeyspace indexes
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
@@ -2470,7 +2544,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * that to decide if the tuple is a pre-v11 tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
+				(!BTreeTupleIsPivot(itup) &&
 				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
 		}
 		else
@@ -2497,7 +2571,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
 		return false;
 
 	/*
@@ -2549,6 +2623,8 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 	if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
 		return;
 
+	/* TODO correct error messages for posting tuples */
+
 	/*
 	 * Internal page insertions cannot fail here, because that would mean that
 	 * an earlier leaf level insertion that should have failed didn't
@@ -2575,3 +2651,79 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Given a basic tuple that contains key datum and posting list,
+ * build a posting tuple.
+ *
+ * Basic tuple can be a posting tuple, but we only use key part of it,
+ * all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1 fallback to building a non-posting tuple.
+ * It is necessary to avoid storage overhead after posting tuple was vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+	uint32		keysize,
+				newsize = 0;
+	IndexTuple	itup;
+
+	/* We only need key part of the tuple */
+	if (BTreeTupleIsPosting(tuple))
+		keysize = BTreeTupleGetPostingOffset(tuple);
+	else
+		keysize = IndexTupleSize(tuple);
+
+	Assert(nipd > 0);
+
+	/* Add space needed for posting list */
+	if (nipd > 1)
+		newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+	else
+		newsize = keysize;
+
+	newsize = MAXALIGN(newsize);
+	itup = palloc0(newsize);
+	memcpy(itup, tuple, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+
+	if (nipd > 1)
+	{
+		/* Form posting tuple, fill posting fields */
+
+		/* Set meta info about the posting list */
+		itup->t_info |= INDEX_ALT_TID_MASK;
+		BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+		/* sort the list to preserve TID order invariant */
+		qsort((void *) ipd, nipd, sizeof(ItemPointerData),
+			  (int (*) (const void *, const void *)) ItemPointerCompare);
+
+		/* Copy posting list into the posting tuple */
+		memcpy(BTreeTupleGetPosting(itup), ipd,
+			   sizeof(ItemPointerData) * nipd);
+	}
+	else
+	{
+		/* To finish building of a non-posting tuple, copy TID from ipd */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		ItemPointerCopy(ipd, &itup->t_tid);
+	}
+
+	return itup;
+}
+
+/*
+ * Opposite of BTreeFormPostingTuple.
+ * returns regular tuple that contains the key,
+ * the tid of the new tuple is the nth tid of original tuple's posting list
+ * result tuple palloc'd in a caller's context.
+ */
+IndexTuple
+BTreeGetNthTupleOfPosting(IndexTuple tuple, int n)
+{
+	Assert(BTreeTupleIsPosting(tuple));
+	return BTreeFormPostingTuple(tuple, BTreeTupleGetPostingN(tuple, n), 1);
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c..06ac688 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -174,16 +174,39 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 	 */
 	if (!isleaf)
 		_bt_clear_incomplete_split(record, 1);
+
 	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
 	{
-		Size		datalen;
-		char	   *datapos = XLogRecGetBlockData(record, 0, &datalen);
-
 		page = BufferGetPage(buffer);
+		if (isleaf && xlrec->righttupoffset)
+		{
+			Size		datalen, lefttuplen;
+			char	   *datapos = XLogRecGetBlockData(record, 0, &datalen);
+			IndexTuple lefttup = NULL;
+			IndexTuple righttup = NULL;
 
-		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
-						false, false) == InvalidOffsetNumber)
-			elog(PANIC, "btree_xlog_insert: failed to add item");
+			lefttup = (IndexTuple) datapos;
+
+			if (xlrec->righttupoffset > 1)
+			{
+				lefttuplen = xlrec->righttupoffset;
+				righttup = (IndexTuple) (datapos + lefttuplen);
+			}
+			else
+				lefttuplen = datalen;
+
+			_bt_delete_and_insert(InvalidBuffer, page,
+								  lefttup, righttup, xlrec->offnum, false);
+		}
+		else
+		{
+			Size		datalen;
+			char	   *datapos = XLogRecGetBlockData(record, 0, &datalen);
+
+			if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+		}
 
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
@@ -265,9 +288,11 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 		BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
 		OffsetNumber off;
 		IndexTuple	newitem = NULL,
-					left_hikey = NULL;
+					left_hikey = NULL,
+					replaceitem = NULL;
 		Size		newitemsz = 0,
-					left_hikeysz = 0;
+					left_hikeysz = 0,
+					replaceitemsz = 0;
 		Page		newlpage;
 		OffsetNumber leftoff;
 
@@ -287,6 +312,13 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 		datapos += left_hikeysz;
 		datalen -= left_hikeysz;
 
+		if (xlrec->replaceitemoff)
+		{
+			replaceitem = (IndexTuple) datapos;
+			replaceitemsz = MAXALIGN(IndexTupleSize(replaceitem));
+			datapos += replaceitemsz;
+			datalen -= replaceitemsz;
+		}
 		Assert(datalen == 0);
 
 		newlpage = PageGetTempPageCopySpecial(lpage);
@@ -304,6 +336,15 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 			Size		itemsz;
 			IndexTuple	item;
 
+			if (off == xlrec->replaceitemoff)
+			{
+				if (PageAddItem(newlpage, (Item) replaceitem, replaceitemsz, leftoff,
+								false, false) == InvalidOffsetNumber)
+					elog(ERROR, "failed to add new item to left page after split");
+				leftoff = OffsetNumberNext(leftoff);
+				continue;
+			}
+
 			/* add the new item if it was inserted on left page */
 			if (onleft && off == xlrec->newitemoff)
 			{
@@ -386,8 +427,8 @@ btree_xlog_vacuum(XLogReaderState *record)
 	Buffer		buffer;
 	Page		page;
 	BTPageOpaque opaque;
-#ifdef UNUSED
 	xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
 
 	/*
 	 * This section of code is thought to be no longer needed, after analysis
@@ -478,14 +519,34 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		if (len > 0)
 		{
-			OffsetNumber *unused;
-			OffsetNumber *unend;
+			if (xlrec->nremaining)
+			{
+				OffsetNumber *remainingoffset;
+				IndexTuple	remaining;
+				Size		itemsz;
+
+				remainingoffset = (OffsetNumber *)
+					(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+				remaining = (IndexTuple) ((char *) remainingoffset +
+										  xlrec->nremaining * sizeof(OffsetNumber));
+
+				/* Handle posting tuples */
+				for (int i = 0; i < xlrec->nremaining; i++)
+				{
+					PageIndexTupleDelete(page, remainingoffset[i]);
 
-			unused = (OffsetNumber *) ptr;
-			unend = (OffsetNumber *) ((char *) ptr + len);
+					itemsz = MAXALIGN(IndexTupleSize(remaining));
+
+					if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+									false, false) == InvalidOffsetNumber)
+						elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+					remaining = (IndexTuple) ((char *) remaining + itemsz);
+				}
+			}
 
-			if ((unend - unused) > 0)
-				PageIndexMultiDelete(page, unused, unend - unused);
+			if (xlrec->ndeleted)
+				PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 		}
 
 		/*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index a14eb79..e4fa99a 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -46,8 +46,10 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "lastBlockVacuumed %u",
-								 xlrec->lastBlockVacuumed);
+				appendStringInfo(buf, "lastBlockVacuumed %u; nremaining %u; ndeleted %u",
+								 xlrec->lastBlockVacuumed,
+								 xlrec->nremaining,
+								 xlrec->ndeleted);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 744ffb6..b10c0d5 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -141,6 +141,10 @@ typedef IndexAttributeBitMapData * IndexAttributeBitMap;
  * On such a page, N tuples could take one MAXALIGN quantum less space than
  * estimated here, seemingly allowing one more tuple than estimated here.
  * But such a page always has at least MAXALIGN special space, so we're safe.
+ *
+ * Note: btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so they may contain more tuples.
+ * Use MaxPostingIndexTuplesPerPage instead.
  */
 #define MaxIndexTuplesPerPage	\
 	((int) ((BLCKSZ - SizeOfPageHeaderData) / \
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 7e54c45..d76fbe9 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
  * All other types of index tuples ("pivot" tuples) only have key columns,
  * since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,39 @@ typedef struct BTMetaPageData
  * omitted rather than truncated, since its representation is different to
  * the non-pivot representation.)
  *
+ * Non-pivot posting tuple format:
+ *  t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicated keys more effectively,
+ * we use special format of tuples - posting tuples.
+ * posting_list is an array of ItemPointerData.
+ *
+ * This type of deduplication never applies to unique indexes or indexes
+ * with INCLUDEd columns.
+ *
+ * To differ posting tuples we use INDEX_ALT_TID_MASK flag in t_info and
+ * BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's tid:
+ * - t_tid.ip_blkid contains offset of the posting list.
+ * - t_tid offset field contains number of posting items this tuple contain
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * tuples, which is constrainted by BTMaxItemSize.
+
+ * If page contains so many duplicates, that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), page may contain several posting
+ * tuples with the same key.
+ * Also page can contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
  * Pivot tuple format:
  *
  *  t_tid | t_info | key values | [heap TID]
@@ -281,23 +313,144 @@ typedef struct BTMetaPageData
  * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
  * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
  * number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinct posting tuples from pivot tuples.
  *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_N_POSTING_OFFSET_MASK	0x0FFF
 #define BT_HEAP_TID_ATTR			0x1000
+#define BT_IS_POSTING				0x2000
+
+#define BTreeTupleIsPosting(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+	)
+
+#define BTreeTupleIsPivot(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+	)
 
-/* Get/set downlink block number */
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger then
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+	((int) ((BLCKSZ - SizeOfPageHeaderData - \
+			3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+			(sizeof(ItemPointerData)))
+
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * Iterating over tuples during index build or applying deduplication to a
+ * single page, we remember a tuple in itupprev, then compare the next one
+ * with it. If tuples are equal, save their TIDs in the posting list.
+ * ntuples contains the size of the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTDeduplicateState
+{
+	Size		maxitemsize;
+	Size		maxpostingsize;
+	IndexTuple	itupprev;
+	int			ntuples;
+	ItemPointerData *ipd;
+} BTDeduplicateState;
+
+/* macros to work with posting tuples *BEGIN* */
+#define BTreeTupleSetBtIsPosting(itup) \
+	do { \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleGetNPosting(itup)	\
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+	)
+
+#define BTreeTupleSetNPosting(itup, n) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+		BTreeTupleSetBtIsPosting(itup); \
+	} while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list.
+ * Caller is responsible for checking BTreeTupleIsPosting to ensure that it
+ * will get what is expected.
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+	)
+#define BTreeTupleSetPostingOffset(itup, offset) \
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerSetBlockNumber(&((itup)->t_tid), (offset)) \
+	)
+#define BTreeSetPostingMeta(itup, nposting, off) \
+	do { \
+		BTreeTupleSetNPosting(itup, nposting); \
+		BTreeTupleSetPostingOffset(itup, off); \
+	} while(0)
+
+#define BTreeTupleGetPosting(itup) \
+	(ItemPointerData*) ((char*)(itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+	(ItemPointerData*) (BTreeTupleGetPosting(itup) + (n))
+
+/*
+ * Posting tuples always contain more than one TID.  The minimum TID can be
+ * accessed using BTreeTupleGetHeapTID().  The maximum is accessed using
+ * BTreeTupleGetMaxTID().
+ */
+#define BTreeTupleGetMaxTID(itup) \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+		( \
+			(ItemPointer) (BTreeTupleGetPosting(itup) + (BTreeTupleGetNPosting(itup)-1)) \
+		) \
+		: \
+		(ItemPointer) &((itup)->t_tid) \
+	)
+/* macros to work with posting tuples *END* */
+
+/* Get/set downlink block number  */
 #define BTreeInnerTupleGetDownLink(itup) \
 	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
 #define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,7 +479,8 @@ typedef struct BTMetaPageData
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
 		( \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
@@ -335,6 +489,7 @@ typedef struct BTMetaPageData
 	)
 #define BTreeTupleSetNAtts(itup, n) \
 	do { \
+		Assert(!BTreeTupleIsPosting(itup)); \
 		(itup)->t_info |= INDEX_ALT_TID_MASK; \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
 	} while(0)
@@ -342,6 +497,8 @@ typedef struct BTMetaPageData
 /*
  * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
  * and non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * For non-pivot posting tuples this returns the first tid from posting list.
  */
 #define BTreeTupleGetHeapTID(itup) \
 	( \
@@ -351,7 +508,10 @@ typedef struct BTMetaPageData
 		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
 					   sizeof(ItemPointerData)) \
 	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+	  : (itup)->t_info & INDEX_ALT_TID_MASK ? \
+		(((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0) ? \
+		(ItemPointer) BTreeTupleGetPosting(itup) : NULL) \
+		: (ItemPointer) &((itup)->t_tid) \
 	)
 /*
  * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
@@ -360,6 +520,7 @@ typedef struct BTMetaPageData
 #define BTreeTupleSetAltHeapTID(itup) \
 	do { \
 		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
 								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
 	} while(0)
@@ -497,6 +658,12 @@ typedef struct BTInsertStateData
 	Buffer		buf;
 
 	/*
+	 * if _bt_binsrch_insert() found the location inside existing posting
+	 * list, save the position inside the list.
+	 */
+	int			in_posting_offset;
+
+	/*
 	 * Cache of bounds within the current buffer.  Only used for insertions
 	 * where _bt_check_unique is called.  See _bt_binsrch_insert and
 	 * _bt_findinsertloc for details.
@@ -563,6 +730,8 @@ typedef struct BTScanPosData
 	 * location in the associated tuple storage workspace.
 	 */
 	int			nextTupleOffset;
+	/* prevTupleOffset is for posting list handling */
+	int			prevTupleOffset;
 
 	/*
 	 * The items array is always ordered in index order (ie, increasing
@@ -575,7 +744,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxPostingIndexTuplesPerPage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -729,12 +898,17 @@ extern bool _bt_doinsert(Relation rel, IndexTuple itup,
 						 IndexUniqueCheck checkUnique, Relation heapRel);
 extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
 extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
+extern void _bt_delete_and_insert(Buffer buf, Page page,
+					  IndexTuple newitup, IndexTuple newitupright,
+					  OffsetNumber newitemoff, bool need_xlog);
 
 /*
  * prototypes for functions in nbtsplitloc.c
  */
 extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
 									 OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+									 OffsetNumber replaceitemoff, Size replaceitemsz,
+									 IndexTuple replaceitem,
 									 bool *newitemonleft);
 
 /*
@@ -759,6 +933,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems, Relation heapRel);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems,
+								OffsetNumber *remainingoffset,
+								IndexTuple *remaining, int nremaining,
 								BlockNumber lastBlockVacuumed);
 extern int	_bt_pagedel(Relation rel, Buffer buf);
 
@@ -771,6 +947,8 @@ extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
 							bool forupdate, BTStack stack, int access, Snapshot snapshot);
 extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
 extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
+extern int32 _bt_compare_posting(Relation rel, BTScanInsert key, Page page,
+								 OffsetNumber offnum, int *in_posting_offset);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -809,6 +987,9 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd,
+										int nipd);
+extern IndexTuple BTreeGetNthTupleOfPosting(IndexTuple tuple, int n);
 
 /*
  * prototypes for functions in nbtvalidate.c
@@ -821,5 +1002,7 @@ extern bool btvalidate(Oid opclassoid);
 extern IndexBuildResult *btbuild(Relation heap, Relation index,
 								 struct IndexInfo *indexInfo);
 extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+extern void _bt_add_posting_item(BTDeduplicateState *deduplicateState,
+								 IndexTuple itup);
 
 #endif							/* NBTREE_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index afa614d..312e780 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -64,13 +64,16 @@ typedef struct xl_btree_metadata
  * Backup Blk 0: original page (data contains the inserted tuple)
  * Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
  * Backup Blk 2: xl_btree_metadata, if INSERT_META
+ *
+ * INSERT_LEAF case
  */
 typedef struct xl_btree_insert
 {
 	OffsetNumber offnum;
+	Size	righttupoffset;
 } xl_btree_insert;
 
-#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, righttupoffset) + sizeof(Size))
 
 /*
  * On insert with split, we save all the items going into the right sibling
@@ -113,9 +116,10 @@ typedef struct xl_btree_split
 	uint32		level;			/* tree level of page being split */
 	OffsetNumber firstright;	/* first item moved to right page */
 	OffsetNumber newitemoff;	/* new item's offset (if placed on left page) */
+	OffsetNumber replaceitemoff; /* offset of the posting item to replace with (replaceitem) */
 } xl_btree_split;
 
-#define SizeOfBtreeSplit	(offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit	(offsetof(xl_btree_split, replaceitemoff) + sizeof(OffsetNumber))
 
 /*
  * This is what we need to know about delete of individual leaf index tuples.
@@ -173,10 +177,19 @@ typedef struct xl_btree_vacuum
 {
 	BlockNumber lastBlockVacuumed;
 
-	/* TARGET OFFSET NUMBERS FOLLOW */
+	/*
+	 * This field helps us to find beginning of the remaining tuples from
+	 * postings which follow array of offset numbers.
+	 */
+	uint32		nremaining;
+	uint32		ndeleted;
+
+	/* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+	/* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
+	/* TARGET OFFSET NUMBERS FOLLOW (if any) */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(BlockNumber))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.

#71

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Anastasia Lubennikova (#70)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Wed, Aug 21, 2019 at 10:19 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

I'm going to look through the patch once more to update nbtxlog
comments, where needed and
answer to your remarks that are still left in the comments.

Have you been using amcheck's rootdescend verification? I see this
problem with v8, with the TPC-H test data:

DEBUG: finished verifying presence of 1500000 tuples from table
"customer" with bitset 51.09% set
ERROR: could not find tuple using search from root page in index
"idx_customer_nationkey2"

I've been running my standard amcheck query with these databases, which is:

SELECT bt_index_parent_check(index => c.oid, heapallindexed => true,
rootdescend => true),
c.relname,
c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree'
AND c.relpersistence != 't'
AND c.relkind = 'i' AND i.indisready AND i.indisvalid
ORDER BY c.relpages DESC;

There were many large indexes that amcheck didn't detect a problem
with. I don't yet understand what the problem is, or why we only see
the problem for a small number of indexes. Note that all of these
indexes passed verification with v5, so this is some kind of
regression.

I also noticed that there were some regressions in the size of indexes
-- indexes were not nearly as small as they were in v5 in some cases.
The overall picture was a clear regression in how effective
deduplication is.

I think that it would save time if you had direct access to my test
data, even though it's a bit cumbersome. You'll have to download about
10GB of dumps, which require plenty of disk space when restored:

regression=# \l+
                                                                List
of databases
    Name    | Owner | Encoding |  Collate   |   Ctype    | Access
privileges |  Size   | Tablespace |                Description
------------+-------+----------+------------+------------+-------------------+---------+------------+--------------------------------------------
 land       | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |
      | 6425 MB | pg_default |
 mgd        | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |
      | 61 GB   | pg_default |
 postgres   | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |
      | 7753 kB | pg_default | default administrative connection
database
 regression | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |
      | 886 MB  | pg_default |
 template0  | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 | =c/pg
     +| 7609 kB | pg_default | unmodifiable empty database
            |       |          |            |            | pg=CTc/pg
      |         |            |
 template1  | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 | =c/pg
     +| 7609 kB | pg_default | default template for new databases
            |       |          |            |            | pg=CTc/pg
      |         |            |
 tpcc       | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |
      | 10 GB   | pg_default |
 tpce       | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |
      | 26 GB   | pg_default |
 tpch       | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |
      | 32 GB   | pg_default |
(9 rows)

I have found it very valuable to use this test data when changing
nbtsplitloc.c, or anything that could affect where page splits make
free space available. If this is too much data to handle conveniently,
then you could skip "mgd" and almost have as much test coverage. There
really does seem to be a benefit to using diverse test cases like
this, because sometimes regressions only affect a small number of
specific indexes for specific reasons. For example, only TPC-H has a
small number of indexes that have tuples that are inserted in order,
but also have many duplicates. Removing the BT_COMPRESS_THRESHOLD
stuff really helped with those indexes.

Want me to send this data and the associated tests script over to you?

--
Peter Geoghegan

#72

Anastasia Lubennikova

a.lubennikova@postgrespro.ru

over 6 years ago

In reply to: Peter Geoghegan (#71)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

23.08.2019 7:33, Peter Geoghegan wrote:

On Wed, Aug 21, 2019 at 10:19 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

I'm going to look through the patch once more to update nbtxlog
comments, where needed and
answer to your remarks that are still left in the comments.

Have you been using amcheck's rootdescend verification?

No, I haven't checked it with the latest version yet.

There were many large indexes that amcheck didn't detect a problem
with. I don't yet understand what the problem is, or why we only see
the problem for a small number of indexes. Note that all of these
indexes passed verification with v5, so this is some kind of
regression.

I also noticed that there were some regressions in the size of indexes
-- indexes were not nearly as small as they were in v5 in some cases.
The overall picture was a clear regression in how effective
deduplication is.

Do these indexes have something in common? Maybe some specific workload?
Are there any error messages in log?

I'd like to specify what caused the problem.
There were several major changes between v5 and v8:
- dead tuples handling added in v6;
- _bt_split changes for posting tuples in v7;
- WAL logging of posting tuple changes in v8.

I don't think the last one could break regular indexes on master.
Do you see the same regression in v6, v7?

I think that it would save time if you had direct access to my test
data, even though it's a bit cumbersome. You'll have to download about
10GB of dumps, which require plenty of disk space when restored:

Want me to send this data and the associated tests script over to you?

Yes, I think it will help me to debug the patch faster.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#73

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Anastasia Lubennikova (#67)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Fri, Aug 16, 2019 at 8:56 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

Now the algorithm is the following:

- In case page split is needed, pass both tuples to _bt_split().
_bt_findsplitloc() is now aware of upcoming replacement of origtup with
neworigtup, so it uses correct item size where needed.

It seems that now all replace operations are crash-safe. The new patch passes
all regression tests, so I think it's ready for review again.

I think that the way this works within nbtsplitloc.c is too
complicated. In v5, the only thing that nbtsplitloc.c knew about
deduplication was that it could be sure that suffix truncation would
at least make a posting list into a single heap TID in the worst case.
This consideration was mostly about suffix truncation, not
deduplication, which seemed like a good thing to me. _bt_split() and
_bt_findsplitloc() should know as little as possible about posting
lists.

Obviously it will sometimes be necessary to deal with the case where a
posting list is about to become too big (i.e. it's about to go over
BTMaxItemSize()), and so must be split. Less often, a page split will
be needed because of one of these posting list splits. These are two
complicated areas (posting list splits and page splits), and it would
be a good idea to find a way to separate them as much as possible.
Remember, nbtsplitloc.c works by pretending that the new item that
cannot fit on the page is already on its own imaginary version of the
page that *can* fit the new item, along with everything else from the
original/actual page. That gets *way* too complicated when it has to
deal with the fact that the new item is being merged with an existing
item. Perhaps nbtsplitloc.c could also "pretend" that the new item is
always a plain tuple, without knowing anything about posting lists.
Almost like how it worked in v5.

We always want posting lists to be as close to the BTMaxItemSize()
size as possible, because that helps with space utilization. In v5 of
the patch, this was what happened, because, in effect, we didn't try
to do anything complicated with the new item. This worked well, apart
from the crash safety issue. Maybe we can simulate the v5 approach,
giving us the best of all worlds (good space utilization, simplicity,
and crash safety). Something like this:

* Posting list splits should always result in one posting list that is
at or just under BTMaxItemSize() in size, plus one plain tuple to its
immediate right on the page. This is similar to the more common case
where we cannot add additional tuples to a posting list due to the
BTMaxItemSize() restriction, and so end up with a single tuple (or a
smaller posting list with the same value) to the right of a
BTMaxItemSize()-sized posting list tuple. I don't see a reason to
split a posting list in the middle -- we should always split to the
right, leaving the posting list as large as possible.

* When there is a simple posting list split, with no page split, the
logic required is fairly straightforward: We rewrite the posting list
in-place so that our new item goes wherever it belongs in the existing
posting list on the page (we memmove() the posting list to make space
for the new TID, basically). The old last/rightmost TID in the
original posting list becomes a new, plain tuple. We may need a new
WAL record for this, but it's not that different to a regular leaf
page insert.

* When this happens to result in a page split, we then have a "fake"
new item -- the right half of the posting list that we split, which is
always a plain item. Obviously we need to be a bit careful with the
WAL logging, but the space accounting within _bt_split() and
_bt_findsplitloc() can work just the same as now. nbtsplitloc.c can
work like it did in v5, when the only thing it knew about posting
lists was that _bt_truncate() always removes them, maybe leaving a
single TID behind in the new high key. (Note also that it's not okay
to remove the conservative assumption about at least having space for
one heap TID within _bt_recsplitloc() -- that needs to be restored to
its v5 state in the next version of the patch.)

Because deduplication is lazy, there is little value in doing
deduplication of the new item (which may or may not be the fake new
item). The nbtsplitloc.c logic will "trap" duplicates on the same page
today, so we can just let deduplication of the new item happen at a
later time. _bt_split() can almost pretend that posting lists don't
exist, and nbtsplitloc.c needs to know nothing about posting lists
(apart from the way that _bt_truncate() behaves with posting lists).
We "lie" to _bt_findsplitloc(), and tell it that the new item is our
fake new item -- it doesn't do anything that will be broken by that
lie, because it doesn't care about the actual content of posting
lists. And, we can fix the "fake new item is not actually real new
item" issue at one point within _bt_split(), just as we're about to
WAL log.

What do you think of that approach?

--
Peter Geoghegan

#74

Anastasia Lubennikova

a.lubennikova@postgrespro.ru

over 6 years ago

In reply to: Peter Geoghegan (#73)

1 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

28.08.2019 6:19, Peter Geoghegan wrote:

On Fri, Aug 16, 2019 at 8:56 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

Now the algorithm is the following:
- In case page split is needed, pass both tuples to _bt_split().
_bt_findsplitloc() is now aware of upcoming replacement of origtup with
neworigtup, so it uses correct item size where needed.

It seems that now all replace operations are crash-safe. The new patch passes
all regression tests, so I think it's ready for review again.

I think that the way this works within nbtsplitloc.c is too
complicated. In v5, the only thing that nbtsplitloc.c knew about
deduplication was that it could be sure that suffix truncation would
at least make a posting list into a single heap TID in the worst case.
This consideration was mostly about suffix truncation, not
deduplication, which seemed like a good thing to me. _bt_split() and
_bt_findsplitloc() should know as little as possible about posting
lists.

Obviously it will sometimes be necessary to deal with the case where a
posting list is about to become too big (i.e. it's about to go over
BTMaxItemSize()), and so must be split. Less often, a page split will
be needed because of one of these posting list splits. These are two
complicated areas (posting list splits and page splits), and it would
be a good idea to find a way to separate them as much as possible.
Remember, nbtsplitloc.c works by pretending that the new item that
cannot fit on the page is already on its own imaginary version of the
page that *can* fit the new item, along with everything else from the
original/actual page. That gets *way* too complicated when it has to
deal with the fact that the new item is being merged with an existing
item. Perhaps nbtsplitloc.c could also "pretend" that the new item is
always a plain tuple, without knowing anything about posting lists.
Almost like how it worked in v5.

We always want posting lists to be as close to the BTMaxItemSize()
size as possible, because that helps with space utilization. In v5 of
the patch, this was what happened, because, in effect, we didn't try
to do anything complicated with the new item. This worked well, apart
from the crash safety issue. Maybe we can simulate the v5 approach,
giving us the best of all worlds (good space utilization, simplicity,
and crash safety). Something like this:

* Posting list splits should always result in one posting list that is
at or just under BTMaxItemSize() in size, plus one plain tuple to its
immediate right on the page. This is similar to the more common case
where we cannot add additional tuples to a posting list due to the
BTMaxItemSize() restriction, and so end up with a single tuple (or a
smaller posting list with the same value) to the right of a
BTMaxItemSize()-sized posting list tuple. I don't see a reason to
split a posting list in the middle -- we should always split to the
right, leaving the posting list as large as possible.

* When there is a simple posting list split, with no page split, the
logic required is fairly straightforward: We rewrite the posting list
in-place so that our new item goes wherever it belongs in the existing
posting list on the page (we memmove() the posting list to make space
for the new TID, basically). The old last/rightmost TID in the
original posting list becomes a new, plain tuple. We may need a new
WAL record for this, but it's not that different to a regular leaf
page insert.

* When this happens to result in a page split, we then have a "fake"
new item -- the right half of the posting list that we split, which is
always a plain item. Obviously we need to be a bit careful with the
WAL logging, but the space accounting within _bt_split() and
_bt_findsplitloc() can work just the same as now. nbtsplitloc.c can
work like it did in v5, when the only thing it knew about posting
lists was that _bt_truncate() always removes them, maybe leaving a
single TID behind in the new high key. (Note also that it's not okay
to remove the conservative assumption about at least having space for
one heap TID within _bt_recsplitloc() -- that needs to be restored to
its v5 state in the next version of the patch.)

Because deduplication is lazy, there is little value in doing
deduplication of the new item (which may or may not be the fake new
item). The nbtsplitloc.c logic will "trap" duplicates on the same page
today, so we can just let deduplication of the new item happen at a
later time. _bt_split() can almost pretend that posting lists don't
exist, and nbtsplitloc.c needs to know nothing about posting lists
(apart from the way that _bt_truncate() behaves with posting lists).
We "lie" to _bt_findsplitloc(), and tell it that the new item is our
fake new item -- it doesn't do anything that will be broken by that
lie, because it doesn't care about the actual content of posting
lists. And, we can fix the "fake new item is not actually real new
item" issue at one point within _bt_split(), just as we're about to
WAL log.

What do you think of that approach?

I think it's a good idea. Thank you for such a detailed description of
various
cases. I already started to simplify this code, while debugging amcheck
error
in v8. At first, I rewrote it to split posting tuple into a posting and a
regular tuple instead of two posting tuples.

Your explanation helped me to understand that this approach can be
extended to
the case of insertion into posting list, that doesn't trigger posting
split,
and that nbtsplitloc indeed doesn't need to know about posting tuples
specific.
The code is much cleaner now.

Some individual indexes are larger, some are smaller compared to the
expected output.

This patch is based on v6, so it again contains "compression" instead of
"deduplication"
in variable names and comments. I will rename them when code becomes
more stable.

--
Anastasia Lubennikova
Postgres Professional:http://www.postgrespro.com
The Russian Postgres Company

Attachments:

v9-0001-Compression-deduplication-in-nbtree.patchtext/x-patch; name=v9-0001-Compression-deduplication-in-nbtree.patchDownload

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d67..504bca2 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -924,6 +924,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer	scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -994,29 +995,73 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
 			char	   *itid,
 					   *htid;
+			ItemPointer tid = BTreeTupleGetHeapTID(itup);
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+							ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("could not find tuple using search from root page in index \"%s\"",
 							RelationGetRelationName(state->rel)),
-					 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+					 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
 										itid, htid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is actually a posting list, make sure posting list TIDs
+		 * are in order.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer		current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+				current = BTreeTupleGetPostingN(itup, i);
+
+				if (ItemPointerCompare(current, &last) <= 0)
+				{
+					char	   *itid,
+							   *htid;
+
+					itid = psprintf("(%u,%u)", state->targetblock, offset);
+					htid = psprintf("(%u,%u)",
+									ItemPointerGetBlockNumberNoCheck(current),
+									ItemPointerGetOffsetNumberNoCheck(current));
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+												itid, htid,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1074,12 +1119,33 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				IndexTuple	onetup;
+
+				/* Fingerprint all elements of posting tuple one by one */
+				for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+					norm = bt_normalize_tuple(state, onetup);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != onetup)
+						pfree(norm);
+					pfree(onetup);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
@@ -1087,7 +1153,8 @@ bt_target_page_check(BtreeCheckState *state)
 		 *
 		 * If there is a high key (if this is not the rightmost page on its
 		 * entire level), check that high key actually is upper bound on all
-		 * page items.
+		 * page items.  If this is a posting list tuple, we'll need to set
+		 * scantid to be highest TID in posting list.
 		 *
 		 * We prefer to check all items against high key rather than checking
 		 * just the last and trusting that the operator class obeys the
@@ -1127,6 +1194,9 @@ bt_target_page_check(BtreeCheckState *state)
 		 * tuple. (See also: "Notes About Data Representation" in the nbtree
 		 * README.)
 		 */
+		scantid = skey->scantid;
+		if (!BTreeTupleIsPivot(itup))
+			skey->scantid = BTreeTupleGetMaxTID(itup);
 		if (!P_RIGHTMOST(topaque) &&
 			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
 			  invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1220,7 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
+		skey->scantid = scantid;
 
 		/*
 		 * * Item order check *
@@ -1164,11 +1235,13 @@ bt_target_page_check(BtreeCheckState *state)
 					   *htid,
 					   *nitid,
 					   *nhtid;
+			ItemPointer tid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetHeapTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 			nitid = psprintf("(%u,%u)", state->targetblock,
 							 OffsetNumberNext(offset));
 
@@ -1177,9 +1250,11 @@ bt_target_page_check(BtreeCheckState *state)
 										  state->target,
 										  OffsetNumberNext(offset));
 			itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+			tid = BTreeTupleGetHeapTID(itup);
 			nhtid = psprintf("(%u,%u)",
-							 ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							 ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							 ItemPointerGetBlockNumberNoCheck(tid),
+							 ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1264,10 @@ bt_target_page_check(BtreeCheckState *state)
 										"higher index tid=%s (points to %s tid=%s) "
 										"page lsn=%X/%X.",
 										itid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										htid,
 										nitid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										nhtid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
@@ -1953,10 +2028,11 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
  * verification.  In particular, it won't try to normalize opclass-equal
  * datums with potentially distinct representations (e.g., btree/numeric_ops
  * index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though.  For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have their own posting
+ * list, since dummy CREATE INDEX callback code generates new tuples with the
+ * same normalized representation.  Compression is performed
+ * opportunistically, and in general there is no guarantee about how or when
+ * compression will be applied.
  */
 static IndexTuple
 bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -2560,14 +2636,16 @@ static inline ItemPointer
 BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
 							bool nonpivot)
 {
-	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	ItemPointer result;
 	BlockNumber targetblock = state->targetblock;
 
-	if (result == NULL && nonpivot)
+	if (BTreeTupleIsPivot(itup) == nonpivot)
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
 						targetblock, RelationGetRelationName(state->rel))));
 
+	result = BTreeTupleGetHeapTID(itup);
+
 	return result;
 }
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b84bf1c..1751133 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,15 +47,17 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   BTStack stack,
 						   IndexTuple itup,
 						   OffsetNumber newitemoff,
-						   bool split_only_page);
+						   bool split_only_page, int in_posting_offset);
 static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 						Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
-						IndexTuple newitem);
+						IndexTuple newitem, IndexTuple neworigtup);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 							  BTStack stack, bool is_root, bool is_only);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 						 OffsetNumber itup_off);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void insert_itupprev_to_page(Page page, BTCompressState *compressState);
+static void _bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel);
 
 /*
  *	_bt_doinsert() -- Handle insertion of a single index tuple in the tree.
@@ -297,10 +299,12 @@ top:
 		 * search bounds established within _bt_check_unique when insertion is
 		 * checkingunique.
 		 */
+		insertstate.in_posting_offset = 0;
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
-		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
-					   itup, newitemoff, false);
+
+		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer,
+					   stack, itup, newitemoff, false, insertstate.in_posting_offset);
 	}
 	else
 	{
@@ -435,6 +439,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 
 				/* okay, we gotta fetch the heap tuple ... */
 				curitup = (IndexTuple) PageGetItem(page, curitemid);
+				Assert(!BTreeTupleIsPosting(curitup));
 				htid = curitup->t_tid;
 
 				/*
@@ -759,6 +764,26 @@ _bt_findinsertloc(Relation rel,
 			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
 			insertstate->bounds_valid = false;
 		}
+
+		/*
+		 * If the target page is full, try to compress the page
+		 */
+		if (PageGetFreeSpace(page) < insertstate->itemsz && !checkingunique)
+		{
+			_bt_compress_one_page(rel, insertstate->buf, heapRel);
+			insertstate->bounds_valid = false;		/* paranoia */
+
+			/*
+			 * FIXME: _bt_vacuum_one_page() won't have cleared the
+			 * BTP_HAS_GARBAGE flag when it didn't kill items.  Maybe we
+			 * should clear the BTP_HAS_GARBAGE flag bit from the page when
+			 * compression avoids a page split -- _bt_vacuum_one_page() is
+			 * expecting a page split that takes care of it.
+			 *
+			 * (On the other hand, maybe it doesn't matter very much.  A
+			 * comment update seems like the bare minimum we should do.)
+			 */
+		}
 	}
 	else
 	{
@@ -900,6 +925,75 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
 	insertstate->bounds_valid = false;
 }
 
+
+/*
+ * Replace tuple on newitemoff offset with neworigtup,
+ * and insert newitup right after it.
+ *
+ * It's essential to do this atomic to be crash safe.
+ *
+ * NOTE All checks of free space must be done before calling this function.
+ *
+ * For use in posting tuple's update.
+ */
+void
+_bt_replace_and_insert(Buffer buf,
+					  Page page,
+					  IndexTuple neworigtup, IndexTuple newitup,
+					  OffsetNumber newitemoff, bool need_xlog)
+{
+	Size		newitupsz = IndexTupleSize(newitup);
+	IndexTuple	origtup = (IndexTuple) PageGetItem(page,
+												   PageGetItemId(page, newitemoff));
+
+	Assert(BTreeTupleIsPosting(origtup));
+	Assert(BTreeTupleIsPosting(neworigtup));
+	Assert(!BTreeTupleIsPosting(newitup));
+	Assert(MAXALIGN(IndexTupleSize(origtup)) == MAXALIGN(IndexTupleSize(neworigtup)));
+
+	newitupsz = MAXALIGN(newitupsz);
+
+	START_CRIT_SECTION();
+
+	/*
+	 * Since we always replace posting tuple with tuple of same size
+	 * (only posting list may changes), can do simple inplace update.
+	 */
+	memcpy(origtup, neworigtup, MAXALIGN(IndexTupleSize(neworigtup)));
+
+	if (!_bt_pgaddtup(page, newitupsz, newitup, OffsetNumberNext(newitemoff)))
+		elog(ERROR, "failed to insert compressed item in index");
+
+	if (BufferIsValid(buf))
+	{
+		MarkBufferDirty(buf);
+
+		/* Xlog stuff */
+		if (need_xlog)
+		{
+			xl_btree_insert xlrec;
+			XLogRecPtr	recptr;
+
+			xlrec.offnum = newitemoff;
+			xlrec.origtup_off = MAXALIGN(IndexTupleSize(newitup));
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
+
+			Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+
+			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+			XLogRegisterBufData(0, (char *) newitup, MAXALIGN(IndexTupleSize(newitup)));
+			XLogRegisterBufData(0, (char *) neworigtup, MAXALIGN(IndexTupleSize(neworigtup)));
+
+			recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_INSERT_LEAF);
+
+			PageSetLSN(page, recptr);
+		}
+	}
+	END_CRIT_SECTION();
+}
+
 /*----------
  *	_bt_insertonpg() -- Insert a tuple on a particular page in the index.
  *
@@ -936,11 +1030,13 @@ _bt_insertonpg(Relation rel,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
-			   bool split_only_page)
+			   bool split_only_page,
+			   int in_posting_offset)
 {
 	Page		page;
 	BTPageOpaque lpageop;
 	Size		itemsz;
+	IndexTuple	neworigtup = NULL;
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -964,6 +1060,120 @@ _bt_insertonpg(Relation rel,
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
+	if (in_posting_offset)
+	{
+		/* get old posting tuple */
+		ItemId 			itemid = PageGetItemId(page, newitemoff);
+		int				nipd;
+		IndexTuple		origtup;
+		char			*src;
+		char			*dest;
+		size_t			ntocopy;
+
+		origtup = (IndexTuple) PageGetItem(page, itemid);
+		Assert(BTreeTupleIsPosting(origtup));
+		nipd = BTreeTupleGetNPosting(origtup);
+		Assert(in_posting_offset < nipd);
+		Assert(itup_key->scantid != NULL);
+		Assert(itup_key->heapkeyspace);
+
+		elog(DEBUG4, "origtup (%u,%u) is min, (%u,%u) is max, (%u,%u) is new",
+			ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(origtup)),
+			ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(origtup)),
+			ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(origtup)),
+			ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(origtup)),
+			ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(itup)),
+			ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(itup)));
+
+		/* generate neworigtup */
+
+		/*
+		 * Handle corner cases (1)
+		 *		- itup TID is smaller than leftmost orightup TID
+		 */
+		if (ItemPointerCompare(BTreeTupleGetHeapTID(itup),
+								BTreeTupleGetHeapTID(origtup)) < 0)
+		{
+			in_posting_offset = InvalidOffsetNumber;
+			newitemoff = OffsetNumberPrev(newitemoff); //TODO Is it needed?
+			elog(DEBUG4, "itup is to the left of origtup newitemoff %u", newitemoff);
+		}
+		/*
+		 * Handle corner cases (2)
+		 *		- itup TID is larger than rightmost orightup TID
+		 */
+		else if (ItemPointerCompare(BTreeTupleGetMaxTID(origtup),
+							   BTreeTupleGetHeapTID(itup)) < 0)
+		{
+			/* do nothing */
+			in_posting_offset = InvalidOffsetNumber;
+			//newitemoff = OffsetNumberNext(newitemoff); //TODO Is it needed?
+			elog(DEBUG4, "itup is to the right of origtup newitemoff %u", newitemoff);
+		}
+		/* Handle insertion into the middle of the posting list */
+		else
+		{
+			neworigtup = CopyIndexTuple(origtup);
+			src = (char *)  BTreeTupleGetPostingN(neworigtup, in_posting_offset);
+			dest = (char *) src + sizeof(ItemPointerData);
+			ntocopy = (nipd - in_posting_offset - 1)*sizeof(ItemPointerData);
+
+			elog(DEBUG4, "itup is inside origtup"
+						 " nipd %d in_posting_offset %d ntocopy %lu newitemoff %u",
+						nipd, in_posting_offset, ntocopy, newitemoff);
+			elog(DEBUG4, "neworigtup before N %d (%u,%u) to (%u,%u)",
+				BTreeTupleIsPosting(neworigtup)?BTreeTupleGetNPosting(neworigtup):0,
+				ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(neworigtup)),
+				ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(neworigtup)),
+				ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(neworigtup)),
+				ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(neworigtup)));
+
+			elog(DEBUG4, "itup before (%u,%u)",
+				ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(itup)),
+				ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(itup)));
+			elog(DEBUG4, "src before (%u,%u)",
+				ItemPointerGetBlockNumberNoCheck((ItemPointer) src),
+				ItemPointerGetOffsetNumberNoCheck((ItemPointer) src));
+			elog(DEBUG4, "dest before (%u,%u)",
+				ItemPointerGetBlockNumberNoCheck((ItemPointer) dest),
+				ItemPointerGetOffsetNumberNoCheck((ItemPointer) dest));
+			/* move itemp pointers in posting list to free space for incoming one */
+			memmove(dest, src, ntocopy);
+
+			/* copy new item pointer to posting list */
+			ItemPointerCopy(&itup->t_tid, (ItemPointer) src);
+
+			/* copy old rightmost item pointer to new tuple, that we're going to insert */
+			ItemPointerCopy(BTreeTupleGetPostingN(origtup, nipd-1), &itup->t_tid);
+
+			elog(DEBUG4, "neworigtup N %d (%u,%u) to (%u,%u)",
+				BTreeTupleIsPosting(neworigtup)?BTreeTupleGetNPosting(neworigtup):0,
+				ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(neworigtup)),
+				ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(neworigtup)),
+				ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(neworigtup)),
+				ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(neworigtup)));
+
+// 			for (int i = 0; i < BTreeTupleGetNPosting(neworigtup); i++)
+// 			{
+// 				elog(WARNING, "neworigtup item n %d (%u,%u)",
+// 				i,
+// 				ItemPointerGetBlockNumberNoCheck(BTreeTupleGetPostingN(neworigtup, i)),
+// 				ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetPostingN(neworigtup, i)));
+// 			}
+
+			elog(DEBUG4, "itup (%u,%u)",
+				ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(itup)),
+				ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(itup)));
+
+			Assert(!BTreeTupleIsPosting(itup));
+			Assert(ItemPointerCompare(BTreeTupleGetHeapTID(neworigtup),
+										BTreeTupleGetMaxTID(neworigtup)) < 0);
+
+			Assert(ItemPointerCompare(BTreeTupleGetMaxTID(neworigtup),
+									BTreeTupleGetHeapTID(itup)) < 0);
+		}
+	}
+
 	/*
 	 * Do we need to split the page to fit the item on it?
 	 *
@@ -996,7 +1206,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf,
+						 newitemoff, itemsz, itup, neworigtup);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1033,70 +1244,152 @@ _bt_insertonpg(Relation rel,
 		itup_off = newitemoff;
 		itup_blkno = BufferGetBlockNumber(buf);
 
-		/*
-		 * If we are doing this insert because we split a page that was the
-		 * only one on its tree level, but was not the root, it may have been
-		 * the "fast root".  We need to ensure that the fast root link points
-		 * at or above the current page.  We can safely acquire a lock on the
-		 * metapage here --- see comments for _bt_newroot().
-		 */
-		if (split_only_page)
+		if (neworigtup == NULL)
 		{
-			Assert(!P_ISLEAF(lpageop));
+			/*
+			* If we are doing this insert because we split a page that was the
+			* only one on its tree level, but was not the root, it may have been
+			* the "fast root".  We need to ensure that the fast root link points
+			* at or above the current page.  We can safely acquire a lock on the
+			* metapage here --- see comments for _bt_newroot().
+			*/
+			if (split_only_page)
+			{
+				Assert(!P_ISLEAF(lpageop));
+
+				metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+				metapg = BufferGetPage(metabuf);
+				metad = BTPageGetMeta(metapg);
+
+				if (metad->btm_fastlevel >= lpageop->btpo.level)
+				{
+					/* no update wanted */
+					_bt_relbuf(rel, metabuf);
+					metabuf = InvalidBuffer;
+				}
+			}
 
-			metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
-			metapg = BufferGetPage(metabuf);
-			metad = BTPageGetMeta(metapg);
+			/*
+			* Every internal page should have exactly one negative infinity item
+			* at all times.  Only _bt_split() and _bt_newroot() should add items
+			* that become negative infinity items through truncation, since
+			* they're the only routines that allocate new internal pages.  Do not
+			* allow a retail insertion of a new item at the negative infinity
+			* offset.
+			*/
+			if (!P_ISLEAF(lpageop) && newitemoff == P_FIRSTDATAKEY(lpageop))
+				elog(ERROR, "cannot insert second negative infinity item in block %u of index \"%s\"",
+					itup_blkno, RelationGetRelationName(rel));
+
+			/* Do the update.  No ereport(ERROR) until changes are logged */
+			START_CRIT_SECTION();
+
+			if (!_bt_pgaddtup(page, itemsz, itup, newitemoff))
+				elog(PANIC, "failed to add new item to block %u in index \"%s\"",
+					itup_blkno, RelationGetRelationName(rel));
+
+			MarkBufferDirty(buf);
 
-			if (metad->btm_fastlevel >= lpageop->btpo.level)
+			if (BufferIsValid(metabuf))
 			{
-				/* no update wanted */
-				_bt_relbuf(rel, metabuf);
-				metabuf = InvalidBuffer;
+				/* upgrade meta-page if needed */
+				if (metad->btm_version < BTREE_NOVAC_VERSION)
+					_bt_upgrademetapage(metapg);
+				metad->btm_fastroot = itup_blkno;
+				metad->btm_fastlevel = lpageop->btpo.level;
+				MarkBufferDirty(metabuf);
 			}
-		}
 
-		/*
-		 * Every internal page should have exactly one negative infinity item
-		 * at all times.  Only _bt_split() and _bt_newroot() should add items
-		 * that become negative infinity items through truncation, since
-		 * they're the only routines that allocate new internal pages.  Do not
-		 * allow a retail insertion of a new item at the negative infinity
-		 * offset.
-		 */
-		if (!P_ISLEAF(lpageop) && newitemoff == P_FIRSTDATAKEY(lpageop))
-			elog(ERROR, "cannot insert second negative infinity item in block %u of index \"%s\"",
-				 itup_blkno, RelationGetRelationName(rel));
+			/* clear INCOMPLETE_SPLIT flag on child if inserting a downlink */
+			if (BufferIsValid(cbuf))
+			{
+				Page		cpage = BufferGetPage(cbuf);
+				BTPageOpaque cpageop = (BTPageOpaque) PageGetSpecialPointer(cpage);
 
-		/* Do the update.  No ereport(ERROR) until changes are logged */
-		START_CRIT_SECTION();
+				Assert(P_INCOMPLETE_SPLIT(cpageop));
+				cpageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
+				MarkBufferDirty(cbuf);
+			}
 
-		if (!_bt_pgaddtup(page, itemsz, itup, newitemoff))
-			elog(PANIC, "failed to add new item to block %u in index \"%s\"",
-				 itup_blkno, RelationGetRelationName(rel));
+			/* XLOG stuff */
+			if (RelationNeedsWAL(rel))
+			{
+				xl_btree_insert xlrec;
+				xl_btree_metadata xlmeta;
+				uint8		xlinfo;
+				XLogRecPtr	recptr;
 
-		MarkBufferDirty(buf);
+				xlrec.offnum = itup_off;
+				xlrec.origtup_off = 0;
 
-		if (BufferIsValid(metabuf))
-		{
-			/* upgrade meta-page if needed */
-			if (metad->btm_version < BTREE_NOVAC_VERSION)
-				_bt_upgrademetapage(metapg);
-			metad->btm_fastroot = itup_blkno;
-			metad->btm_fastlevel = lpageop->btpo.level;
-			MarkBufferDirty(metabuf);
-		}
+				XLogBeginInsert();
+				XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
 
-		/* clear INCOMPLETE_SPLIT flag on child if inserting a downlink */
-		if (BufferIsValid(cbuf))
-		{
-			Page		cpage = BufferGetPage(cbuf);
-			BTPageOpaque cpageop = (BTPageOpaque) PageGetSpecialPointer(cpage);
+				if (P_ISLEAF(lpageop))
+					xlinfo = XLOG_BTREE_INSERT_LEAF;
+				else
+				{
+					/*
+					* Register the left child whose INCOMPLETE_SPLIT flag was
+					* cleared.
+					*/
+					XLogRegisterBuffer(1, cbuf, REGBUF_STANDARD);
+
+					xlinfo = XLOG_BTREE_INSERT_UPPER;
+				}
+
+				if (BufferIsValid(metabuf))
+				{
+					Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+					xlmeta.version = metad->btm_version;
+					xlmeta.root = metad->btm_root;
+					xlmeta.level = metad->btm_level;
+					xlmeta.fastroot = metad->btm_fastroot;
+					xlmeta.fastlevel = metad->btm_fastlevel;
+					xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
+					xlmeta.last_cleanup_num_heap_tuples =
+						metad->btm_last_cleanup_num_heap_tuples;
+
+					XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
+					XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));
+
+					xlinfo = XLOG_BTREE_INSERT_META;
+				}
+
+				XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+				XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+
+				recptr = XLogInsert(RM_BTREE_ID, xlinfo);
+
+				if (BufferIsValid(metabuf))
+				{
+					PageSetLSN(metapg, recptr);
+				}
+				if (BufferIsValid(cbuf))
+				{
+					PageSetLSN(BufferGetPage(cbuf), recptr);
+				}
 
-			Assert(P_INCOMPLETE_SPLIT(cpageop));
-			cpageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
-			MarkBufferDirty(cbuf);
+				PageSetLSN(page, recptr);
+			}
+			END_CRIT_SECTION();
 		}
+		else
+		{
+			/*
+			 * Insert new tuple on place of existing posting tuple.
+			 * Delete old posting tuple, and insert updated tuple instead.
+			 *
+			 * If split was needed, both neworigtup and newrighttup are initialized
+			 * and both will be inserted, otherwise newrighttup is NULL.
+			 *
+			 * It only can happen on leaf page.
+			 */
+			elog(DEBUG4, "_bt_insertonpg. _bt_replace_and_insert %s newitemoff %u",
+			  RelationGetRelationName(rel), newitemoff);
+			_bt_replace_and_insert(buf, page, neworigtup,
+								  itup, newitemoff, RelationNeedsWAL(rel));
+ 		}
 
 		/*
 		 * Cache the block information if we just inserted into the rightmost
@@ -1107,69 +1400,6 @@ _bt_insertonpg(Relation rel,
 		if (P_RIGHTMOST(lpageop) && P_ISLEAF(lpageop) && !P_ISROOT(lpageop))
 			cachedBlock = BufferGetBlockNumber(buf);
 
-		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
-		{
-			xl_btree_insert xlrec;
-			xl_btree_metadata xlmeta;
-			uint8		xlinfo;
-			XLogRecPtr	recptr;
-
-			xlrec.offnum = itup_off;
-
-			XLogBeginInsert();
-			XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
-
-			if (P_ISLEAF(lpageop))
-				xlinfo = XLOG_BTREE_INSERT_LEAF;
-			else
-			{
-				/*
-				 * Register the left child whose INCOMPLETE_SPLIT flag was
-				 * cleared.
-				 */
-				XLogRegisterBuffer(1, cbuf, REGBUF_STANDARD);
-
-				xlinfo = XLOG_BTREE_INSERT_UPPER;
-			}
-
-			if (BufferIsValid(metabuf))
-			{
-				Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
-				xlmeta.version = metad->btm_version;
-				xlmeta.root = metad->btm_root;
-				xlmeta.level = metad->btm_level;
-				xlmeta.fastroot = metad->btm_fastroot;
-				xlmeta.fastlevel = metad->btm_fastlevel;
-				xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
-				xlmeta.last_cleanup_num_heap_tuples =
-					metad->btm_last_cleanup_num_heap_tuples;
-
-				XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
-				XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));
-
-				xlinfo = XLOG_BTREE_INSERT_META;
-			}
-
-			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
-			XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
-
-			recptr = XLogInsert(RM_BTREE_ID, xlinfo);
-
-			if (BufferIsValid(metabuf))
-			{
-				PageSetLSN(metapg, recptr);
-			}
-			if (BufferIsValid(cbuf))
-			{
-				PageSetLSN(BufferGetPage(cbuf), recptr);
-			}
-
-			PageSetLSN(page, recptr);
-		}
-
-		END_CRIT_SECTION();
-
 		/* release buffers */
 		if (BufferIsValid(metabuf))
 			_bt_relbuf(rel, metabuf);
@@ -1211,10 +1441,20 @@ _bt_insertonpg(Relation rel,
  *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
+ *
+ *		TODO improve comment
+ *		The real *new* item is already inside neorigtup in the correct place according to TID order
+ *		And "newitem" contains rightmost ItemPointerData trimmed from posting list.
+ *		Insertion consists of two steps
+ * 			- replace original item at newitemoff with neworigtup
+ * 			 This operation doesn't change origtup size, so all calculations
+ * 			 of splitloc remain the same.
+ * 			- insert newitem right after that as if we inserted a regular tuple
  */
 static Buffer
 _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+		  IndexTuple neworigtup)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1236,12 +1476,19 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	OffsetNumber firstright;
 	OffsetNumber maxoff;
 	OffsetNumber i;
+	OffsetNumber replaceitemoff = InvalidOffsetNumber;
 	bool		newitemonleft,
 				isleaf;
 	IndexTuple	lefthikey;
 	int			indnatts = IndexRelationGetNumberOfAttributes(rel);
 	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
+	if (neworigtup != NULL)
+	{
+		replaceitemoff = newitemoff;
+		newitemoff = OffsetNumberNext(newitemoff);
+	}
+
 	/*
 	 * origpage is the original page to be split.  leftpage is a temporary
 	 * buffer that receives the left-sibling data, which will be copied back
@@ -1340,6 +1587,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, firstright);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		if (firstright == replaceitemoff)
+			item = neworigtup;
 	}
 
 	/*
@@ -1373,6 +1622,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
 			itemid = PageGetItemId(origpage, lastleftoff);
 			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+			if (lastleftoff == replaceitemoff)
+				lastleft = neworigtup;
 		}
 
 		Assert(lastleft != item);
@@ -1480,6 +1731,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 
+		/* TODO add comment */
+		if (i == replaceitemoff)
+		{
+			item = neworigtup;
+			Assert(neworigtup != NULL);
+		}
+
 		/* does new item belong before this one? */
 		if (i == newitemoff)
 		{
@@ -1652,6 +1910,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		xlrec.level = ropaque->btpo.level;
 		xlrec.firstright = firstright;
 		xlrec.newitemoff = newitemoff;
+ 		xlrec.replaceitemoff = replaceitemoff;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1681,6 +1940,10 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 		XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
 
+		if (replaceitemoff)
+			XLogRegisterBufData(0, (char *) neworigtup,
+								MAXALIGN(IndexTupleSize(neworigtup)));
+
 		/*
 		 * Log the contents of the right page in the format understood by
 		 * _bt_restore_page().  The whole right page will be recreated.
@@ -1835,7 +2098,7 @@ _bt_insert_parent(Relation rel,
 		/* Recursively insert into the parent */
 		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
 					   new_item, stack->bts_offset + 1,
-					   is_only);
+					   is_only, InvalidOffsetNumber);
 
 		/* be tidy */
 		pfree(new_item);
@@ -2307,3 +2570,206 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * the page.
 	 */
 }
+
+/*
+ * Add new item (compressed or not) to the page, while compressing it.
+ * If insertion failed, return false.
+ * Caller should consider this as compression failure and
+ * leave page uncompressed.
+ */
+static void
+insert_itupprev_to_page(Page page, BTCompressState *compressState)
+{
+	IndexTuple	to_insert;
+	OffsetNumber offnum = PageGetMaxOffsetNumber(page);
+
+	if (compressState->ntuples == 0)
+		to_insert = compressState->itupprev;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+											 compressState->ipd,
+											 compressState->ntuples);
+		to_insert = postingtuple;
+		pfree(compressState->ipd);
+	}
+
+	/* Add the new item into the page */
+	offnum = OffsetNumberNext(offnum);
+
+	elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d IndexTupleSize %zu free %zu",
+		 compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+
+	if (PageAddItem(page, (Item) to_insert, IndexTupleSize(to_insert),
+					offnum, false, false) == InvalidOffsetNumber)
+		elog(ERROR, "failed to add tuple to page while compresing it");
+
+	if (compressState->ntuples > 0)
+		pfree(to_insert);
+	compressState->ntuples = 0;
+}
+
+/*
+ * Before splitting the page, try to compress items to free some space.
+ * If compression didn't succeed, buffer will contain old state of the page.
+ * This function should be called after lp_dead items
+ * were removed by _bt_vacuum_one_page().
+ */
+static void
+_bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buffer);
+	Page		newpage;
+	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	bool		use_compression = false;
+	BTCompressState *compressState = NULL;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+	OffsetNumber deletable[MaxOffsetNumber];
+	int			ndeletable = 0;
+
+	/*
+	 * Don't use compression for indexes with INCLUDEd columns and unique
+	 * indexes.
+	 */
+	use_compression = (IndexRelationGetNumberOfKeyAttributes(rel) ==
+					   IndexRelationGetNumberOfAttributes(rel) &&
+					   !rel->rd_index->indisunique);
+	if (!use_compression)
+		return;
+
+	/* init compress state needed to build posting tuples */
+	compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+	compressState->ipd = NULL;
+	compressState->ntuples = 0;
+	compressState->itupprev = NULL;
+	compressState->maxitemsize = BTMaxItemSize(page);
+	compressState->maxpostingsize = 0;
+
+	minoff = P_FIRSTDATAKEY(opaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	
+	/*
+	 * Delete dead tuples if any.
+	 * We cannot simply skip them in the cycle below, because it's neccessary
+	 * to generate special Xlog record containing such tuples to compute
+	 * latestRemovedXid on a standby server later.
+	 *
+	 * This should not affect performance, since it only can happen in a rare
+	 * situation when BTP_HAS_GARBAGE flag was not set and  _bt_vacuum_one_page
+	 * was not called, or _bt_vacuum_one_page didn't remove all dead items.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId	itemid = PageGetItemId(page, P_HIKEY);
+
+		if (ItemIdIsDead(itemid))
+			deletable[ndeletable++] = offnum;
+	}
+
+	if (ndeletable > 0)
+		_bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+
+	/*
+	 * Scan over all items to see which ones can be compressed
+	 */
+	minoff = P_FIRSTDATAKEY(opaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+	newpage = PageGetTempPageCopySpecial(page);
+	elog(DEBUG4, "_bt_compress_one_page rel: %s,blkno: %u",
+		 RelationGetRelationName(rel), BufferGetBlockNumber(buffer));
+
+	/* Copy High Key if any */
+	if (!P_RIGHTMOST(opaque))
+	{
+		ItemId		itemid = PageGetItemId(page, P_HIKEY);
+		Size		itemsz = ItemIdGetLength(itemid);
+		IndexTuple	item = (IndexTuple) PageGetItem(page, itemid);
+
+		if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to add highkey during compression");
+	}
+
+	/*
+	 * Iterate over tuples on the page, try to compress them into posting
+	 * lists and insert into new page.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemId = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemId);
+
+		if (compressState->itupprev != NULL)
+		{
+			int			n_equal_atts =
+			_bt_keep_natts_fast(rel, compressState->itupprev, itup);
+			int			itup_ntuples = BTreeTupleIsPosting(itup) ?
+			BTreeTupleGetNPosting(itup) : 1;
+
+			if (n_equal_atts > natts)
+			{
+				/*
+				 * When tuples are equal, create or update posting.
+				 *
+				 * If posting is too big, insert it on page and continue.
+				 */
+				if (compressState->maxitemsize >
+					MAXALIGN(((IndexTupleSize(compressState->itupprev)
+							   + (compressState->ntuples + itup_ntuples + 1) * sizeof(ItemPointerData)))))
+				{
+					_bt_add_posting_item(compressState, itup);
+				}
+				else
+				{
+					insert_itupprev_to_page(newpage, compressState);
+				}
+			}
+			else
+			{
+				insert_itupprev_to_page(newpage, compressState);
+			}
+		}
+
+		/*
+		 * Copy the tuple into temp variable itupprev to compare it with the
+		 * following tuple and maybe unite them into a posting tuple
+		 */
+		if (compressState->itupprev)
+			pfree(compressState->itupprev);
+		compressState->itupprev = CopyIndexTuple(itup);
+
+		Assert(IndexTupleSize(compressState->itupprev) <= compressState->maxitemsize);
+	}
+
+	/* Handle the last item. */
+	insert_itupprev_to_page(newpage, compressState);
+
+	START_CRIT_SECTION();
+
+	PageRestoreTempPage(newpage, page);
+	MarkBufferDirty(buffer);
+
+	/* Log full page write */
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+
+		recptr = log_newpage_buffer(buffer, true);
+		PageSetLSN(page, recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	elog(DEBUG4, "_bt_compress_one_page. success");
+}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 268f869..fca35a4 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -983,14 +983,52 @@ _bt_page_recyclable(Page page)
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
 					OffsetNumber *itemnos, int nitems,
+					OffsetNumber *remainingoffset,
+					IndexTuple *remaining, int nremaining,
 					BlockNumber lastBlockVacuumed)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	Size		itemsz;
+	Size		remaining_sz = 0;
+	char	   *remaining_buf = NULL;
+
+	/* XLOG stuff, buffer for remainings */
+	if (nremaining && RelationNeedsWAL(rel))
+	{
+		Size		offset = 0;
+
+		for (int i = 0; i < nremaining; i++)
+			remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+		remaining_buf = palloc0(remaining_sz);
+		for (int i = 0; i < nremaining; i++)
+		{
+			itemsz = IndexTupleSize(remaining[i]);
+			memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+			offset += MAXALIGN(itemsz);
+		}
+		Assert(offset == remaining_sz);
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
+	/* Handle posting tuples here */
+	for (int i = 0; i < nremaining; i++)
+	{
+		/* At first, delete the old tuple. */
+		PageIndexTupleDelete(page, remainingoffset[i]);
+
+		itemsz = IndexTupleSize(remaining[i]);
+		itemsz = MAXALIGN(itemsz);
+
+		/* Add tuple with remaining ItemPointers to the page. */
+		if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to rewrite compressed item in index while doing vacuum");
+	}
+
 	/* Fix the page */
 	if (nitems > 0)
 		PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1058,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+		xlrec_vacuum.nremaining = nremaining;
+		xlrec_vacuum.ndeleted = nitems;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1073,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		if (nitems > 0)
 			XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
 
+		/*
+		 * Here we should save offnums and remaining tuples themselves. It's
+		 * important to restore them in correct order. At first, we must
+		 * handle remaining tuples and only after that other deleted items.
+		 */
+		if (nremaining > 0)
+		{
+			Assert(remaining_buf != NULL);
+			XLogRegisterBufData(0, (char *) remainingoffset,
+								nremaining * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, remaining_buf, remaining_sz);
+		}
+
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
 		PageSetLSN(page, recptr);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd528..a85c67b 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static ItemPointer btreevacuumPosting(BTVacState *vstate, IndexTuple itup,
+									  int *nremaining);
 
 
 /*
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 								 RBM_NORMAL, info->strategy);
 		LockBufferForCleanup(buf);
 		_bt_checkpage(rel, buf);
-		_bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+		_bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+							vstate.lastBlockVacuumed);
 		_bt_relbuf(rel, buf);
 	}
 
@@ -1193,6 +1196,9 @@ restart:
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
+		IndexTuple	remaining[MaxOffsetNumber];
+		OffsetNumber remainingoffset[MaxOffsetNumber];
+		int			nremaining;
 
 		/*
 		 * Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1235,7 @@ restart:
 		 * callback function.
 		 */
 		ndeletable = 0;
+		nremaining = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
 		if (callback)
@@ -1242,31 +1249,81 @@ restart:
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
-				/*
-				 * During Hot Standby we currently assume that
-				 * XLOG_BTREE_VACUUM records do not produce conflicts. That is
-				 * only true as long as the callback function depends only
-				 * upon whether the index tuple refers to heap tuples removed
-				 * in the initial heap scan. When vacuum starts it derives a
-				 * value of OldestXmin. Backends taking later snapshots could
-				 * have a RecentGlobalXmin with a later xid than the vacuum's
-				 * OldestXmin, so it is possible that row versions deleted
-				 * after OldestXmin could be marked as killed by other
-				 * backends. The callback function *could* look at the index
-				 * tuple state in isolation and decide to delete the index
-				 * tuple, though currently it does not. If it ever did, we
-				 * would need to reconsider whether XLOG_BTREE_VACUUM records
-				 * should cause conflicts. If they did cause conflicts they
-				 * would be fairly harsh conflicts, since we haven't yet
-				 * worked out a way to pass a useful value for
-				 * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
-				 * applies to *any* type of index that marks index tuples as
-				 * killed.
-				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				if (BTreeTupleIsPosting(itup))
+				{
+					int			nnewipd = 0;
+					ItemPointer newipd = NULL;
+
+					elog(DEBUG4, "rel %s btreevacuumPosting offnum %u",
+						 RelationGetRelationName(vstate->info->index), offnum);
+
+					newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+					if (nnewipd == 0)
+					{
+						/*
+						 * All TIDs from posting list must be deleted, we can
+						 * delete whole tuple in a regular way.
+						 */
+						deletable[ndeletable++] = offnum;
+					}
+					else if (nnewipd == BTreeTupleGetNPosting(itup))
+					{
+						/*
+						 * All TIDs from posting tuple must remain. Do
+						 * nothing, just cleanup.
+						 */
+						pfree(newipd);
+					}
+					else if (nnewipd < BTreeTupleGetNPosting(itup))
+					{
+						/* Some TIDs from posting tuple must remain. */
+						Assert(nnewipd > 0);
+						Assert(newipd != NULL);
+
+						/*
+						 * Form new tuple that contains only remaining TIDs.
+						 * Remember this tuple and the offset of the old tuple
+						 * to update it in place.
+						 */
+						remainingoffset[nremaining] = offnum;
+						remaining[nremaining] = BTreeFormPostingTuple(itup, newipd, nnewipd);
+						nremaining++;
+						pfree(newipd);
+
+						Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+					}
+				}
+				else
+				{
+					htup = &(itup->t_tid);
+
+					/*
+					 * During Hot Standby we currently assume that
+					 * XLOG_BTREE_VACUUM records do not produce conflicts.
+					 * That is only true as long as the callback function
+					 * depends only upon whether the index tuple refers to
+					 * heap tuples removed in the initial heap scan. When
+					 * vacuum starts it derives a value of OldestXmin.
+					 * Backends taking later snapshots could have a
+					 * RecentGlobalXmin with a later xid than the vacuum's
+					 * OldestXmin, so it is possible that row versions deleted
+					 * after OldestXmin could be marked as killed by other
+					 * backends. The callback function *could* look at the
+					 * index tuple state in isolation and decide to delete the
+					 * index tuple, though currently it does not. If it ever
+					 * did, we would need to reconsider whether
+					 * XLOG_BTREE_VACUUM records should cause conflicts. If
+					 * they did cause conflicts they would be fairly harsh
+					 * conflicts, since we haven't yet worked out a way to
+					 * pass a useful value for latestRemovedXid on the
+					 * XLOG_BTREE_VACUUM records. This applies to *any* type
+					 * of index that marks index tuples as killed.
+					 */
+					if (callback(htup, callback_state))
+						deletable[ndeletable++] = offnum;
+				}
 			}
 		}
 
@@ -1274,7 +1331,7 @@ restart:
 		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
 		 * call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nremaining > 0)
 		{
 			/*
 			 * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1348,7 @@ restart:
 			 * that.
 			 */
 			_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+								remainingoffset, remaining, nremaining,
 								vstate->lastBlockVacuumed);
 
 			/*
@@ -1376,6 +1434,47 @@ restart:
 }
 
 /*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+	int			remaining = 0;
+	int			nitem = BTreeTupleGetNPosting(itup);
+	ItemPointer tmpitems = NULL,
+				items = BTreeTupleGetPosting(itup);
+
+	/*
+	 * Check each tuple in the posting list, save alive tuples into tmpitems
+	 */
+	for (int i = 0; i < nitem; i++)
+	{
+		elog(DEBUG4, "rel %s btreevacuumPosting i %d,  (%u,%u)",
+			RelationGetRelationName(vstate->info->index),
+			 i,
+			ItemPointerGetBlockNumberNoCheck( (items + i)),
+			ItemPointerGetOffsetNumberNoCheck((items + i)));
+
+		if (vstate->callback(items + i, vstate->callback_state))
+			continue;
+
+		if (tmpitems == NULL)
+			tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+		tmpitems[remaining++] = items[i];
+	}
+
+	*nremaining = remaining;
+	return tmpitems;
+}
+
+/*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
  * btrees always do, so this is trivial.
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 7f77ed2..72e52bc 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -30,6 +30,9 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+								OffsetNumber offnum, ItemPointer iptr,
+								IndexTuple itup, int i);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -497,7 +500,8 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, key, page, mid);
+		result = _bt_compare_posting(rel, key, page, mid,
+									 &(insertstate->in_posting_offset));
 
 		if (result >= cmpval)
 			low = mid + 1;
@@ -526,6 +530,55 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	return low;
 }
 
+/*
+ * Compare insertion-type scankey to tuple on a page,
+ * taking into account posting tuples.
+ * If the key of the posting tuple is equal to scankey,
+ * find exact position inside the posting list,
+ * using TID as extra attribute.
+ */
+int32
+_bt_compare_posting(Relation rel,
+					BTScanInsert key,
+					Page page,
+					OffsetNumber offnum,
+					int *in_posting_offset)
+{
+	IndexTuple	itup;
+	int			result;
+
+	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	result = _bt_compare(rel, key, page, offnum);
+
+	if (BTreeTupleIsPosting(itup) && result == 0)
+	{
+		int			low,
+					high,
+					mid,
+					res;
+
+		low = 0;
+		/* "high" is past end of posting list for loop invariant */
+		high = BTreeTupleGetNPosting(itup);
+
+		while (high > low)
+		{
+			mid = low + ((high - low) / 2);
+			res = ItemPointerCompare(key->scantid,
+									 BTreeTupleGetPostingN(itup, mid));
+
+			if (res >= 1)
+				low = mid + 1;
+			else
+				high = mid;
+		}
+
+		*in_posting_offset = high;
+	}
+
+	return result;
+}
+
 /*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
@@ -658,61 +711,120 @@ _bt_compare(Relation rel,
 	 * Use the heap TID attribute and scantid to try to break the tie.  The
 	 * rules are the same as any other key attribute -- only the
 	 * representation differs.
+	 *
+	 * When itup is a posting tuple, the check becomes more complex. It is
+	 * possible that the scankey belongs to the tuple's posting list TID
+	 * range.
+	 *
+	 * _bt_compare() is multipurpose, so it just returns 0 for a fact that key
+	 * matches tuple at this offset.
+	 *
+	 * Use special _bt_compare_posting() wrapper function to handle this case
+	 * and perform recheck for posting tuple, finding exact position of the
+	 * scankey.
 	 */
-	heapTid = BTreeTupleGetHeapTID(itup);
-	if (key->scantid == NULL)
+	if (!BTreeTupleIsPosting(itup))
 	{
+		heapTid = BTreeTupleGetHeapTID(itup);
+		if (key->scantid == NULL)
+		{
+			/*
+			 * Most searches have a scankey that is considered greater than a
+			 * truncated pivot tuple if and when the scankey has equal values
+			 * for attributes up to and including the least significant
+			 * untruncated attribute in tuple.
+			 *
+			 * For example, if an index has the minimum two attributes (single
+			 * user key attribute, plus heap TID attribute), and a page's high
+			 * key is ('foo', -inf), and scankey is ('foo', <omitted>), the
+			 * search will not descend to the page to the left.  The search
+			 * will descend right instead.  The truncated attribute in pivot
+			 * tuple means that all non-pivot tuples on the page to the left
+			 * are strictly < 'foo', so it isn't necessary to descend left. In
+			 * other words, search doesn't have to descend left because it
+			 * isn't interested in a match that has a heap TID value of -inf.
+			 *
+			 * However, some searches (pivotsearch searches) actually require
+			 * that we descend left when this happens.  -inf is treated as a
+			 * possible match for omitted scankey attribute(s).  This is
+			 * needed by page deletion, which must re-find leaf pages that are
+			 * targets for deletion using their high keys.
+			 *
+			 * Note: the heap TID part of the test ensures that scankey is
+			 * being compared to a pivot tuple with one or more truncated key
+			 * attributes.
+			 *
+			 * Note: pg_upgrade'd !heapkeyspace indexes must always descend to
+			 * the left here, since they have no heap TID attribute (and
+			 * cannot have any -inf key values in any case, since truncation
+			 * can only remove non-key attributes).  !heapkeyspace searches
+			 * must always be prepared to deal with matches on both sides of
+			 * the pivot once the leaf level is reached.
+			 */
+			if (key->heapkeyspace && !key->pivotsearch &&
+				key->keysz == ntupatts && heapTid == NULL)
+				return 1;
+
+			/* All provided scankey arguments found to be equal */
+			return 0;
+		}
+
 		/*
-		 * Most searches have a scankey that is considered greater than a
-		 * truncated pivot tuple if and when the scankey has equal values for
-		 * attributes up to and including the least significant untruncated
-		 * attribute in tuple.
-		 *
-		 * For example, if an index has the minimum two attributes (single
-		 * user key attribute, plus heap TID attribute), and a page's high key
-		 * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
-		 * will not descend to the page to the left.  The search will descend
-		 * right instead.  The truncated attribute in pivot tuple means that
-		 * all non-pivot tuples on the page to the left are strictly < 'foo',
-		 * so it isn't necessary to descend left.  In other words, search
-		 * doesn't have to descend left because it isn't interested in a match
-		 * that has a heap TID value of -inf.
-		 *
-		 * However, some searches (pivotsearch searches) actually require that
-		 * we descend left when this happens.  -inf is treated as a possible
-		 * match for omitted scankey attribute(s).  This is needed by page
-		 * deletion, which must re-find leaf pages that are targets for
-		 * deletion using their high keys.
-		 *
-		 * Note: the heap TID part of the test ensures that scankey is being
-		 * compared to a pivot tuple with one or more truncated key
-		 * attributes.
-		 *
-		 * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
-		 * left here, since they have no heap TID attribute (and cannot have
-		 * any -inf key values in any case, since truncation can only remove
-		 * non-key attributes).  !heapkeyspace searches must always be
-		 * prepared to deal with matches on both sides of the pivot once the
-		 * leaf level is reached.
+		 * Treat truncated heap TID as minus infinity, since scankey has a key
+		 * attribute value (scantid) that would otherwise be compared directly
 		 */
-		if (key->heapkeyspace && !key->pivotsearch &&
-			key->keysz == ntupatts && heapTid == NULL)
+		Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+		if (heapTid == NULL)
 			return 1;
 
-		/* All provided scankey arguments found to be equal */
-		return 0;
+		Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+		return ItemPointerCompare(key->scantid, heapTid);
 	}
+	else
+	{
+		heapTid = BTreeTupleGetHeapTID(itup);
+		if (key->scantid != NULL && heapTid != NULL)
+		{
+			int			cmp = ItemPointerCompare(key->scantid, heapTid);
 
-	/*
-	 * Treat truncated heap TID as minus infinity, since scankey has a key
-	 * attribute value (scantid) that would otherwise be compared directly
-	 */
-	Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
-	if (heapTid == NULL)
-		return 1;
+			if (cmp == -1 || cmp == 0)
+			{
+				elog(DEBUG4, "offnum %d Scankey (%u,%u) is less than or equal to posting tuple (%u,%u)",
+					 offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+					 ItemPointerGetOffsetNumberNoCheck(key->scantid),
+					 ItemPointerGetBlockNumberNoCheck(heapTid),
+					 ItemPointerGetOffsetNumberNoCheck(heapTid));
+				return cmp;
+			}
 
-	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	return ItemPointerCompare(key->scantid, heapTid);
+			heapTid = BTreeTupleGetMaxTID(itup);
+			cmp = ItemPointerCompare(key->scantid, heapTid);
+			if (cmp == 1)
+			{
+				elog(DEBUG4, "offnum %d Scankey (%u,%u) is greater than posting tuple (%u,%u)",
+					 offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+					 ItemPointerGetOffsetNumberNoCheck(key->scantid),
+					 ItemPointerGetBlockNumberNoCheck(heapTid),
+					 ItemPointerGetOffsetNumberNoCheck(heapTid));
+				return cmp;
+			}
+
+			/*
+			 * if we got here, scantid is inbetween of posting items of the
+			 * tuple
+			 */
+			elog(DEBUG4, "offnum %d Scankey (%u,%u) is between posting items (%u,%u) and (%u,%u)",
+				 offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+				 ItemPointerGetOffsetNumberNoCheck(key->scantid),
+				 ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(itup)),
+				 ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(itup)),
+				 ItemPointerGetBlockNumberNoCheck(heapTid),
+				 ItemPointerGetOffsetNumberNoCheck(heapTid));
+			return 0;
+		}
+	}
+
+	return 0;
 }
 
 /*
@@ -1449,6 +1561,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	/* initialize tuple workspace to empty */
 	so->currPos.nextTupleOffset = 0;
+	so->currPos.prevTupleOffset = 0;
 
 	/*
 	 * Now that the current page has been made consistent, the macro should be
@@ -1483,8 +1596,22 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				if (!BTreeTupleIsPosting(itup))
+				{
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+				else
+				{
+					/* Return posting list "logical" tuples */
+					for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											itup, i);
+						itemIndex++;
+					}
+				}
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1517,7 +1644,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1525,7 +1652,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxPostingIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1567,8 +1694,23 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (passes_quals && tuple_alive)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				if (!BTreeTupleIsPosting(itup))
+				{
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+				else
+				{
+					/* Return posting list "logical" tuples */
+					/* XXX: Maybe this loop should be backwards? */
+					for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						itemIndex--;
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											itup, i);
+					}
+				}
 			}
 			if (!continuescan)
 			{
@@ -1582,8 +1724,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1596,6 +1738,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1608,6 +1752,33 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 	}
 }
 
+/* Save an index item into so->currPos.items[itemIndex] for posting tuples. */
+static void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer iptr, IndexTuple itup, int i)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *iptr;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		if (i == 0)
+		{
+			/* save key. the same for all tuples in the posting */
+			Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+			currItem->tupleOffset = so->currPos.nextTupleOffset;
+			memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+			so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+			so->currPos.prevTupleOffset = currItem->tupleOffset;
+		}
+		else
+			currItem->tupleOffset = so->currPos.prevTupleOffset;
+	}
+}
+
 /*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index ab19692..d7207e0 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -288,6 +288,8 @@ static void _bt_sortaddtup(Page page, Size itemsize,
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 						 IndexTuple itup);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void _bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+								 BTCompressState *compressState);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
 static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -963,6 +965,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * Overwrite the old item with new truncated high key directly.
 			 * oitup is already located at the physical beginning of tuple
 			 * space, so this should directly reuse the existing tuple space.
+			 *
+			 * If lastleft tuple was a posting tuple, we'll truncate its
+			 * posting list in _bt_truncate as well. Note that it is also
+			 * applicable only to leaf pages, since internal pages never
+			 * contain posting tuples.
 			 */
 			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
 			lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1002,6 +1009,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		 * the minimum key for the new page.
 		 */
 		state->btps_minkey = CopyIndexTuple(oitup);
+		Assert(BTreeTupleIsPivot(state->btps_minkey));
 
 		/*
 		 * Set the sibling links for both pages.
@@ -1043,6 +1051,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(state->btps_minkey == NULL);
 		state->btps_minkey = CopyIndexTuple(itup);
 		/* _bt_sortaddtup() will perform full truncation later */
+		BTreeTupleClearBtIsPosting(state->btps_minkey);
 		BTreeTupleSetNAtts(state->btps_minkey, 0);
 	}
 
@@ -1128,6 +1137,91 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 }
 
 /*
+ * Add new tuple (posting or non-posting) to the page while building index.
+ */
+static void
+_bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+					 BTCompressState *compressState)
+{
+	IndexTuple	to_insert;
+
+	/* Return, if there is no tuple to insert */
+	if (state == NULL)
+		return;
+
+	if (compressState->ntuples == 0)
+		to_insert = compressState->itupprev;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+											 compressState->ipd,
+											 compressState->ntuples);
+		to_insert = postingtuple;
+		pfree(compressState->ipd);
+	}
+
+	_bt_buildadd(wstate, state, to_insert);
+
+	if (compressState->ntuples > 0)
+		pfree(to_insert);
+	compressState->ntuples = 0;
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in compressState.
+ *
+ * Helper function for _bt_load() and _bt_compress_one_page().
+ *
+ * Note: caller is responsible for size check to ensure that resulting tuple
+ * won't exceed BTMaxItemSize.
+ */
+void
+_bt_add_posting_item(BTCompressState *compressState, IndexTuple itup)
+{
+	int			nposting = 0;
+
+	if (compressState->ntuples == 0)
+	{
+		compressState->ipd = palloc0(compressState->maxitemsize);
+
+		if (BTreeTupleIsPosting(compressState->itupprev))
+		{
+			/* if itupprev is posting, add all its TIDs to the posting list */
+			nposting = BTreeTupleGetNPosting(compressState->itupprev);
+			memcpy(compressState->ipd,
+				   BTreeTupleGetPosting(compressState->itupprev),
+				   sizeof(ItemPointerData) * nposting);
+			compressState->ntuples += nposting;
+		}
+		else
+		{
+			memcpy(compressState->ipd, compressState->itupprev,
+				   sizeof(ItemPointerData));
+			compressState->ntuples++;
+		}
+	}
+
+	if (BTreeTupleIsPosting(itup))
+	{
+		/* if tuple is posting, add all its TIDs to the posting list */
+		nposting = BTreeTupleGetNPosting(itup);
+		memcpy(compressState->ipd + compressState->ntuples,
+			   BTreeTupleGetPosting(itup),
+			   sizeof(ItemPointerData) * nposting);
+		compressState->ntuples += nposting;
+	}
+	else
+	{
+		memcpy(compressState->ipd + compressState->ntuples, itup,
+			   sizeof(ItemPointerData));
+		compressState->ntuples++;
+	}
+}
+
+/*
  * Read tuples in correct sort order from tuplesort, and load them into
  * btree leaves.
  */
@@ -1141,9 +1235,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	bool		load1;
 	TupleDesc	tupdes = RelationGetDescr(wstate->index);
 	int			i,
-				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+				natts = IndexRelationGetNumberOfAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		use_compression = false;
+	BTCompressState *compressState = NULL;
+
+	/*
+	 * Don't use compression for indexes with INCLUDEd columns and unique
+	 * indexes.
+	 */
+	use_compression = (IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+					   IndexRelationGetNumberOfAttributes(wstate->index) &&
+					   !wstate->index->rd_index->indisunique);
 
 	if (merge)
 	{
@@ -1257,19 +1362,89 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	}
 	else
 	{
-		/* merge is unnecessary */
-		while ((itup = tuplesort_getindextuple(btspool->sortstate,
-											   true)) != NULL)
+		if (!use_compression)
 		{
-			/* When we see first tuple, create first index page */
-			if (state == NULL)
-				state = _bt_pagestate(wstate, 0);
+			/* merge is unnecessary */
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+					state = _bt_pagestate(wstate, 0);
 
-			_bt_buildadd(wstate, state, itup);
+				_bt_buildadd(wstate, state, itup);
 
-			/* Report progress */
-			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
-										 ++tuples_done);
+				/* Report progress */
+				pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+											 ++tuples_done);
+			}
+		}
+		else
+		{
+			/* init compress state needed to build posting tuples */
+			compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+			compressState->ipd = NULL;
+			compressState->ntuples = 0;
+			compressState->itupprev = NULL;
+			compressState->maxitemsize = 0;
+			compressState->maxpostingsize = 0;
+
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+				{
+					state = _bt_pagestate(wstate, 0);
+					compressState->maxitemsize = BTMaxItemSize(state->btps_page);
+				}
+
+				if (compressState->itupprev != NULL)
+				{
+					int			n_equal_atts = _bt_keep_natts_fast(wstate->index,
+																   compressState->itupprev, itup);
+
+					if (n_equal_atts > natts)
+					{
+						/*
+						 * Tuples are equal. Create or update posting.
+						 *
+						 * Else If posting is too big, insert it on page and
+						 * continue.
+						 */
+						if ((compressState->ntuples + 1) * sizeof(ItemPointerData) <
+							compressState->maxpostingsize)
+							_bt_add_posting_item(compressState, itup);
+						else
+							_bt_buildadd_posting(wstate, state,
+												 compressState);
+					}
+					else
+					{
+						/*
+						 * Tuples are not equal. Insert itupprev into index.
+						 * Save current tuple for the next iteration.
+						 */
+						_bt_buildadd_posting(wstate, state, compressState);
+					}
+				}
+
+				/*
+				 * Save the tuple to compare it with the next one and maybe
+				 * unite them into a posting tuple.
+				 */
+				if (compressState->itupprev)
+					pfree(compressState->itupprev);
+				compressState->itupprev = CopyIndexTuple(itup);
+
+				/* compute max size of posting list */
+				compressState->maxpostingsize = compressState->maxitemsize -
+					IndexInfoFindDataOffset(compressState->itupprev->t_info) -
+					MAXALIGN(IndexTupleSize(compressState->itupprev));
+			}
+
+			/* Handle the last item */
+			_bt_buildadd_posting(wstate, state, compressState);
 		}
 	}
 
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1c1029b..0ead2ea 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -459,6 +459,7 @@ _bt_recsplitloc(FindSplitData *state,
 	int16		leftfree,
 				rightfree;
 	Size		firstrightitemsz;
+	Size		postingsubhikey = 0;
 	bool		newitemisfirstonright;
 
 	/* Is the new item going to be the first item on the right page? */
@@ -466,10 +467,33 @@ _bt_recsplitloc(FindSplitData *state,
 							 && !newitemonleft);
 
 	if (newitemisfirstonright)
+	{
 		firstrightitemsz = state->newitemsz;
+
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+			postingsubhikey = IndexTupleSize(state->newitem) -
+							  BTreeTupleGetPostingOffset(state->newitem);
+	}
 	else
+	{
 		firstrightitemsz = firstoldonrightsz;
 
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf)
+		{
+			ItemId	 itemid;
+			IndexTuple newhighkey;
+
+			itemid = PageGetItemId(state->page, firstoldonright);
+			newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+			if (BTreeTupleIsPosting(newhighkey))
+				postingsubhikey = IndexTupleSize(newhighkey) -
+								  BTreeTupleGetPostingOffset(newhighkey);
+		}
+	}
+
 	/* Account for all the old tuples */
 	leftfree = state->leftspace - olddataitemstoleft;
 	rightfree = state->rightspace -
@@ -492,9 +516,13 @@ _bt_recsplitloc(FindSplitData *state,
 	 * adding a heap TID to the left half's new high key when splitting at the
 	 * leaf level.  In practice the new high key will often be smaller and
 	 * will rarely be larger, but conservatively assume the worst case.
+	 * Truncation always truncates away any posting list that appears in the
+	 * first right tuple, though, so it's safe to subtract that overhead
+	 * (while still conservatively assuming that truncation might have to add
+	 * back a single heap TID using the pivot tuple heap TID representation).
 	 */
 	if (state->is_leaf)
-		leftfree -= (int16) (firstrightitemsz +
+		leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
 							 MAXALIGN(sizeof(ItemPointerData)));
 	else
 		leftfree -= (int16) firstrightitemsz;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 4c7b2d0..7be2542 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -111,8 +111,12 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	key->nextkey = false;
 	key->pivotsearch = false;
 	key->keysz = Min(indnkeyatts, tupnatts);
-	key->scantid = key->heapkeyspace && itup ?
-		BTreeTupleGetHeapTID(itup) : NULL;
+
+	if (itup && key->heapkeyspace)
+		key->scantid = BTreeTupleGetHeapTID(itup);
+	else
+		key->scantid = NULL;
+
 	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
@@ -1787,7 +1791,9 @@ _bt_killitems(IndexScanDesc scan)
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			/* No microvacuum for posting tuples */
+			if (!BTreeTupleIsPosting(ituple) &&
+				(ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
 			{
 				/* found the item */
 				ItemIdMarkDead(iid);
@@ -2145,6 +2151,16 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(firstright))
+		{
+			BTreeTupleClearBtIsPosting(pivot);
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			pivot->t_info &= ~INDEX_SIZE_MASK;
+			pivot->t_info |= BTreeTupleGetPostingOffset(firstright);
+		}
+
+		Assert(!BTreeTupleIsPosting(pivot));
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2161,6 +2177,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * attribute to the new pivot tuple.
 		 */
 		Assert(natts != nkeyatts);
+		Assert(!BTreeTupleIsPosting(lastleft));
+		Assert(!BTreeTupleIsPosting(firstright));
 		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
 		tidpivot = palloc0(newsize);
 		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2168,6 +2186,27 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		pfree(pivot);
 		pivot = tidpivot;
 	}
+	else if (BTreeTupleIsPosting(firstright))
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal. But
+		 * the tuple is a compressed tuple with a posting list, so we still
+		 * must truncate it.
+		 *
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		newsize = BTreeTupleGetPostingOffset(firstright) +
+			MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+		pivot->t_info &= ~INDEX_SIZE_MASK;
+		pivot->t_info |= newsize;
+		BTreeTupleClearBtIsPosting(pivot);
+		BTreeTupleSetAltHeapTID(pivot);
+
+		Assert(!BTreeTupleIsPosting(pivot));
+	}
 	else
 	{
 		/*
@@ -2205,7 +2244,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2216,9 +2255,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+							  BTreeTupleGetHeapTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #else
 
 	/*
@@ -2231,7 +2273,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2240,7 +2282,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2330,6 +2373,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * leaving excessive amounts of free space on either side of page split.
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts.
+ *
+ * To build a posting tuple we need to ensure that all attributes
+ * of both tuples are equal. Use this function to compare them.
+ * TODO: maybe it's worth to rename the function.
+ *
+ * XXX: Obviously we need infrastructure for making sure it is okay to use
+ * this for posting list stuff.  For example, non-deterministic collations
+ * cannot use compression, and will not work with what we have now.
+ *
+ * XXX: Even then, we probably also need to worry about TOAST as a special
+ * case.  Don't repeat bugs like the amcheck bug that was fixed in commit
+ * eba775345d23d2c999bbb412ae658b6dab36e3e8.  As the test case added in that
+ * commit shows, we need to worry about pg_attribute.attstorage changing in
+ * the underlying table due to an ALTER TABLE (and maybe a few other things
+ * like that).  In general, the "TOAST input state" of a TOASTable datum isn't
+ * something that we make many guarantees about today, so even with C
+ * collation text we could in theory get different answers from
+ * _bt_keep_natts_fast() and _bt_keep_natts().  This needs to be nailed down
+ * in some way.
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2415,7 +2477,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * Non-pivot tuples currently never use alternative heap TID
 			 * representation -- even those within heapkeyspace indexes
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
@@ -2470,7 +2532,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * that to decide if the tuple is a pre-v11 tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
+				(!BTreeTupleIsPivot(itup) &&
 				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
 		}
 		else
@@ -2497,7 +2559,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
 		return false;
 
 	/*
@@ -2549,6 +2611,8 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 	if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
 		return;
 
+	/* TODO correct error messages for posting tuples */
+
 	/*
 	 * Internal page insertions cannot fail here, because that would mean that
 	 * an earlier leaf level insertion that should have failed didn't
@@ -2575,3 +2639,79 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Given a basic tuple that contains key datum and posting list,
+ * build a posting tuple.
+ *
+ * Basic tuple can be a posting tuple, but we only use key part of it,
+ * all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1 fallback to building a non-posting tuple.
+ * It is necessary to avoid storage overhead after posting tuple was vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+	uint32		keysize,
+				newsize = 0;
+	IndexTuple	itup;
+
+	/* We only need key part of the tuple */
+	if (BTreeTupleIsPosting(tuple))
+		keysize = BTreeTupleGetPostingOffset(tuple);
+	else
+		keysize = IndexTupleSize(tuple);
+
+	Assert(nipd > 0);
+
+	/* Add space needed for posting list */
+	if (nipd > 1)
+		newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+	else
+		newsize = keysize;
+
+	newsize = MAXALIGN(newsize);
+	itup = palloc0(newsize);
+	memcpy(itup, tuple, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+
+	if (nipd > 1)
+	{
+		/* Form posting tuple, fill posting fields */
+
+		/* Set meta info about the posting list */
+		itup->t_info |= INDEX_ALT_TID_MASK;
+		BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+		/* sort the list to preserve TID order invariant */
+		qsort((void *) ipd, nipd, sizeof(ItemPointerData),
+			  (int (*) (const void *, const void *)) ItemPointerCompare);
+
+		/* Copy posting list into the posting tuple */
+		memcpy(BTreeTupleGetPosting(itup), ipd,
+			   sizeof(ItemPointerData) * nipd);
+	}
+	else
+	{
+		/* To finish building of a non-posting tuple, copy TID from ipd */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		ItemPointerCopy(ipd, &itup->t_tid);
+	}
+
+	return itup;
+}
+
+/*
+ * Opposite of BTreeFormPostingTuple.
+ * returns regular tuple that contains the key,
+ * the tid of the new tuple is the nth tid of original tuple's posting list
+ * result tuple palloc'd in a caller's context.
+ */
+IndexTuple
+BTreeGetNthTupleOfPosting(IndexTuple tuple, int n)
+{
+	Assert(BTreeTupleIsPosting(tuple));
+	return BTreeFormPostingTuple(tuple, BTreeTupleGetPostingN(tuple, n), 1);
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c..2015a5b 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -163,6 +163,7 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 	Buffer		buffer;
 	Page		page;
 
+
 	/*
 	 * Insertion to an internal page finishes an incomplete split at the child
 	 * level.  Clear the incomplete-split flag in the child.  Note: during
@@ -178,9 +179,23 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 	{
 		Size		datalen;
 		char	   *datapos = XLogRecGetBlockData(record, 0, &datalen);
+		IndexTuple neworigtup = NULL;
+
 
 		page = BufferGetPage(buffer);
 
+		if (xlrec->origtup_off > 0)
+		{
+			IndexTuple origtup = (IndexTuple) PageGetItem(page,
+														  PageGetItemId(page, xlrec->offnum));
+			neworigtup = (IndexTuple) (datapos + xlrec->origtup_off);
+
+			Assert(MAXALIGN(IndexTupleSize(origtup)) == MAXALIGN(IndexTupleSize(neworigtup)));
+
+			memcpy(origtup, neworigtup, MAXALIGN(IndexTupleSize(neworigtup)));
+			xlrec->offnum = OffsetNumberNext(xlrec->offnum);
+		}
+
 		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
 						false, false) == InvalidOffsetNumber)
 			elog(PANIC, "btree_xlog_insert: failed to add item");
@@ -265,9 +280,11 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 		BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
 		OffsetNumber off;
 		IndexTuple	newitem = NULL,
-					left_hikey = NULL;
+					left_hikey = NULL,
+					replaceitem = NULL;
 		Size		newitemsz = 0,
-					left_hikeysz = 0;
+					left_hikeysz = 0,
+					replaceitemsz = 0;
 		Page		newlpage;
 		OffsetNumber leftoff;
 
@@ -287,6 +304,14 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 		datapos += left_hikeysz;
 		datalen -= left_hikeysz;
 
+		if (xlrec->replaceitemoff)
+		{
+			replaceitem = (IndexTuple) datapos;
+			replaceitemsz = MAXALIGN(IndexTupleSize(replaceitem));
+			datapos += replaceitemsz;
+			datalen -= replaceitemsz;
+		}
+
 		Assert(datalen == 0);
 
 		newlpage = PageGetTempPageCopySpecial(lpage);
@@ -304,6 +329,15 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 			Size		itemsz;
 			IndexTuple	item;
 
+			if (off == xlrec->replaceitemoff)
+			{
+				if (PageAddItem(newlpage, (Item) replaceitem, replaceitemsz, leftoff,
+								false, false) == InvalidOffsetNumber)
+					elog(ERROR, "failed to add new item to left page after split");
+				leftoff = OffsetNumberNext(leftoff);
+				continue;
+			}
+
 			/* add the new item if it was inserted on left page */
 			if (onleft && off == xlrec->newitemoff)
 			{
@@ -386,8 +420,8 @@ btree_xlog_vacuum(XLogReaderState *record)
 	Buffer		buffer;
 	Page		page;
 	BTPageOpaque opaque;
-#ifdef UNUSED
 	xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
 
 	/*
 	 * This section of code is thought to be no longer needed, after analysis
@@ -478,14 +512,34 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		if (len > 0)
 		{
-			OffsetNumber *unused;
-			OffsetNumber *unend;
+			if (xlrec->nremaining)
+			{
+				OffsetNumber *remainingoffset;
+				IndexTuple	remaining;
+				Size		itemsz;
+
+				remainingoffset = (OffsetNumber *)
+					(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+				remaining = (IndexTuple) ((char *) remainingoffset +
+										  xlrec->nremaining * sizeof(OffsetNumber));
+
+				/* Handle posting tuples */
+				for (int i = 0; i < xlrec->nremaining; i++)
+				{
+					PageIndexTupleDelete(page, remainingoffset[i]);
 
-			unused = (OffsetNumber *) ptr;
-			unend = (OffsetNumber *) ((char *) ptr + len);
+					itemsz = MAXALIGN(IndexTupleSize(remaining));
+
+					if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+									false, false) == InvalidOffsetNumber)
+						elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+					remaining = (IndexTuple) ((char *) remaining + itemsz);
+				}
+			}
 
-			if ((unend - unused) > 0)
-				PageIndexMultiDelete(page, unused, unend - unused);
+			if (xlrec->ndeleted)
+				PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 		}
 
 		/*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index a14eb79..243e464 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -31,6 +31,7 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 				xl_btree_insert *xlrec = (xl_btree_insert *) rec;
 
 				appendStringInfo(buf, "off %u", xlrec->offnum);
+				appendStringInfo(buf, "origtup_off %lu", xlrec->origtup_off);
 				break;
 			}
 		case XLOG_BTREE_SPLIT_L:
@@ -46,8 +47,10 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "lastBlockVacuumed %u",
-								 xlrec->lastBlockVacuumed);
+				appendStringInfo(buf, "lastBlockVacuumed %u; nremaining %u; ndeleted %u",
+								 xlrec->lastBlockVacuumed,
+								 xlrec->nremaining,
+								 xlrec->ndeleted);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 744ffb6..b10c0d5 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -141,6 +141,10 @@ typedef IndexAttributeBitMapData * IndexAttributeBitMap;
  * On such a page, N tuples could take one MAXALIGN quantum less space than
  * estimated here, seemingly allowing one more tuple than estimated here.
  * But such a page always has at least MAXALIGN special space, so we're safe.
+ *
+ * Note: btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so they may contain more tuples.
+ * Use MaxPostingIndexTuplesPerPage instead.
  */
 #define MaxIndexTuplesPerPage	\
 	((int) ((BLCKSZ - SizeOfPageHeaderData) / \
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 52eafe6..d2700fc 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
  * All other types of index tuples ("pivot" tuples) only have key columns,
  * since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,39 @@ typedef struct BTMetaPageData
  * omitted rather than truncated, since its representation is different to
  * the non-pivot representation.)
  *
+ * Non-pivot posting tuple format:
+ *  t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicated keys more effectively,
+ * we use special format of tuples - posting tuples.
+ * posting_list is an array of ItemPointerData.
+ *
+ * This type of compression never applies to system indexes, unique indexes
+ * or indexes with INCLUDEd columns.
+ *
+ * To differ posting tuples we use INDEX_ALT_TID_MASK flag in t_info and
+ * BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's tid:
+ * - t_tid.ip_blkid contains offset of the posting list.
+ * - t_tid offset field contains number of posting items this tuple contain
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * tuples, which is constrainted by BTMaxItemSize.
+
+ * If page contains so many duplicates, that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), page may contain several posting
+ * tuples with the same key.
+ * Also page can contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
  * Pivot tuple format:
  *
  *  t_tid | t_info | key values | [heap TID]
@@ -281,23 +313,144 @@ typedef struct BTMetaPageData
  * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
  * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
  * number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinct posting tuples from pivot tuples.
  *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_N_POSTING_OFFSET_MASK	0x0FFF
 #define BT_HEAP_TID_ATTR			0x1000
+#define BT_IS_POSTING				0x2000
+
+#define BTreeTupleIsPosting(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+	)
+
+#define BTreeTupleIsPivot(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+	)
+
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger then
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+	((int) ((BLCKSZ - SizeOfPageHeaderData - \
+			3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+			(sizeof(ItemPointerData)))
+
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * Iterating over tuples during index build or applying compression to a
+ * single page, we remember a tuple in itupprev, then compare the next one
+ * with it. If tuples are equal, save their TIDs in the posting list.
+ * ntuples contains the size of the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTCompressState
+{
+	Size		maxitemsize;
+	Size		maxpostingsize;
+	IndexTuple	itupprev;
+	int			ntuples;
+	ItemPointerData *ipd;
+} BTCompressState;
+
+/* macros to work with posting tuples *BEGIN* */
+#define BTreeTupleSetBtIsPosting(itup) \
+	do { \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleGetNPosting(itup)	\
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+	)
+
+#define BTreeTupleSetNPosting(itup, n) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+		BTreeTupleSetBtIsPosting(itup); \
+	} while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list.
+ * Caller is responsible for checking BTreeTupleIsPosting to ensure that it
+ * will get what is expected.
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+	)
+#define BTreeTupleSetPostingOffset(itup, offset) \
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerSetBlockNumber(&((itup)->t_tid), (offset)) \
+	)
+#define BTreeSetPostingMeta(itup, nposting, off) \
+	do { \
+		BTreeTupleSetNPosting(itup, nposting); \
+		BTreeTupleSetPostingOffset(itup, off); \
+	} while(0)
+
+#define BTreeTupleGetPosting(itup) \
+	(ItemPointerData*) ((char*)(itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+	(ItemPointerData*) (BTreeTupleGetPosting(itup) + (n))
+
+/*
+ * Posting tuples always contain more than one TID.  The minimum TID can be
+ * accessed using BTreeTupleGetHeapTID().  The maximum is accessed using
+ * BTreeTupleGetMaxTID().
+ */
+#define BTreeTupleGetMaxTID(itup) \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+		( \
+			(ItemPointer) (BTreeTupleGetPosting(itup) + (BTreeTupleGetNPosting(itup)-1)) \
+		) \
+		: \
+		(ItemPointer) &((itup)->t_tid) \
+	)
+/* macros to work with posting tuples *END* */
 
-/* Get/set downlink block number */
+/* Get/set downlink block number  */
 #define BTreeInnerTupleGetDownLink(itup) \
 	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
 #define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,7 +479,8 @@ typedef struct BTMetaPageData
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
 		( \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
@@ -335,6 +489,7 @@ typedef struct BTMetaPageData
 	)
 #define BTreeTupleSetNAtts(itup, n) \
 	do { \
+		Assert(!BTreeTupleIsPosting(itup)); \
 		(itup)->t_info |= INDEX_ALT_TID_MASK; \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
 	} while(0)
@@ -342,6 +497,8 @@ typedef struct BTMetaPageData
 /*
  * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
  * and non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * For non-pivot posting tuples this returns the first tid from posting list.
  */
 #define BTreeTupleGetHeapTID(itup) \
 	( \
@@ -351,7 +508,10 @@ typedef struct BTMetaPageData
 		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
 					   sizeof(ItemPointerData)) \
 	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+	  : (itup)->t_info & INDEX_ALT_TID_MASK ? \
+		(((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0) ? \
+		(ItemPointer) BTreeTupleGetPosting(itup) : NULL) \
+		: (ItemPointer) &((itup)->t_tid) \
 	)
 /*
  * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
@@ -360,6 +520,7 @@ typedef struct BTMetaPageData
 #define BTreeTupleSetAltHeapTID(itup) \
 	do { \
 		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
 								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
 	} while(0)
@@ -500,6 +661,12 @@ typedef struct BTInsertStateData
 	Buffer		buf;
 
 	/*
+	 * if _bt_binsrch_insert() found the location inside existing posting
+	 * list, save the position inside the list.
+	 */
+	int			in_posting_offset;
+
+	/*
 	 * Cache of bounds within the current buffer.  Only used for insertions
 	 * where _bt_check_unique is called.  See _bt_binsrch_insert and
 	 * _bt_findinsertloc for details.
@@ -566,6 +733,8 @@ typedef struct BTScanPosData
 	 * location in the associated tuple storage workspace.
 	 */
 	int			nextTupleOffset;
+	/* prevTupleOffset is for posting list handling */
+	int			prevTupleOffset;
 
 	/*
 	 * The items array is always ordered in index order (ie, increasing
@@ -578,7 +747,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxPostingIndexTuplesPerPage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -732,7 +901,9 @@ extern bool _bt_doinsert(Relation rel, IndexTuple itup,
 						 IndexUniqueCheck checkUnique, Relation heapRel);
 extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
 extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
-
+extern void _bt_replace_and_insert(Buffer buf, Page page,
+								   IndexTuple neworigtup, IndexTuple newitup,
+								   OffsetNumber newitemoff, bool need_xlog);
 /*
  * prototypes for functions in nbtsplitloc.c
  */
@@ -762,6 +933,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems, Relation heapRel);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems,
+								OffsetNumber *remainingoffset,
+								IndexTuple *remaining, int nremaining,
 								BlockNumber lastBlockVacuumed);
 extern int	_bt_pagedel(Relation rel, Buffer buf);
 
@@ -774,6 +947,8 @@ extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
 							bool forupdate, BTStack stack, int access, Snapshot snapshot);
 extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
 extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
+extern int32 _bt_compare_posting(Relation rel, BTScanInsert key, Page page,
+								 OffsetNumber offnum, int *in_posting_offset);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -812,6 +987,9 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd,
+										int nipd);
+extern IndexTuple BTreeGetNthTupleOfPosting(IndexTuple tuple, int n);
 
 /*
  * prototypes for functions in nbtvalidate.c
@@ -824,5 +1002,7 @@ extern bool btvalidate(Oid opclassoid);
 extern IndexBuildResult *btbuild(Relation heap, Relation index,
 								 struct IndexInfo *indexInfo);
 extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+extern void _bt_add_posting_item(BTCompressState *compressState,
+								 IndexTuple itup);
 
 #endif							/* NBTREE_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index afa614d..f1ef584 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -61,16 +61,24 @@ typedef struct xl_btree_metadata
  * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
  * Note that INSERT_META implies it's not a leaf page.
  *
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page (data contains the inserted tuple);
+ *				 if origtup_off is not 0, data also contains 'neworigtup' -
+ *				 tuple to replace original (see comments in bt_replace_and_insert()).
+ *				 TODO probably it would be enough to keep just a flag to point out that
+ *				 data contains 'neworigtup' and compute its offset
+ *				 as we know it follows the tuple, but I am afraid that
+ *				 it will break alignment, will it?
  * Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
  * Backup Blk 2: xl_btree_metadata, if INSERT_META
+ *
  */
 typedef struct xl_btree_insert
 {
 	OffsetNumber offnum;
+	Size		 origtup_off;
 } xl_btree_insert;
 
-#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, origtup_off) + sizeof(Size))
 
 /*
  * On insert with split, we save all the items going into the right sibling
@@ -96,6 +104,12 @@ typedef struct xl_btree_insert
  * An IndexTuple representing the high key of the left page must follow with
  * either variant.
  *
+ * In case, split included insertion into the middle of the posting tuple, and
+ * thus required posting tuple replacement, it also contains 'neworigtup',
+ * which must replace original posting tuple at replaceitemoff offset.
+ * TODO further optimization is to add it to xlog only if it remains on the
+ * left page.
+ *
  * Backup Blk 1: new right page
  *
  * The right page's data portion contains the right page's tuples in the form
@@ -113,9 +127,10 @@ typedef struct xl_btree_split
 	uint32		level;			/* tree level of page being split */
 	OffsetNumber firstright;	/* first item moved to right page */
 	OffsetNumber newitemoff;	/* new item's offset (if placed on left page) */
+	OffsetNumber replaceitemoff; /* offset of the posting item to replace with (neworigtup) */
 } xl_btree_split;
 
-#define SizeOfBtreeSplit	(offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit	(offsetof(xl_btree_split, replaceitemoff) + sizeof(OffsetNumber))
 
 /*
  * This is what we need to know about delete of individual leaf index tuples.
@@ -173,10 +188,19 @@ typedef struct xl_btree_vacuum
 {
 	BlockNumber lastBlockVacuumed;
 
-	/* TARGET OFFSET NUMBERS FOLLOW */
+	/*
+	 * This field helps us to find beginning of the remaining tuples from
+	 * postings which follow array of offset numbers.
+	 */
+	uint32		nremaining;
+	uint32		ndeleted;
+
+	/* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+	/* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
+	/* TARGET OFFSET NUMBERS FOLLOW (if any) */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(BlockNumber))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.

#75

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Anastasia Lubennikova (#74)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Thu, Aug 29, 2019 at 5:13 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

Your explanation helped me to understand that this approach can be
extended to
the case of insertion into posting list, that doesn't trigger posting
split,
and that nbtsplitloc indeed doesn't need to know about posting tuples
specific.
The code is much cleaner now.

Fantastic!

Some individual indexes are larger, some are smaller compared to the
expected output.

I agree that v9 might be ever so slightly more space efficient than v5
was, on balance. In any case v9 completely fixes the regression that I
saw in the last version. I have pushed the changes to the test output
for the serial tests that I privately maintain, that I gave you access
to. The MGD test output also looks perfect.

We may find that deduplication is a little too effective, in the sense
that it packs so many tuples on to leaf pages that *concurrent*
inserters will tend to get excessive page splits. We may find that it
makes sense to aim for posting lists that are maybe 96% of
BTMaxItemSize() -- note that BTREE_SINGLEVAL_FILLFACTOR is 96 for this
reason. Concurrent inserters will tend to have heap TIDs that are
slightly out of order, so we want to at least have enough space
remaining on the left half of a "single value mode" split. We may end
up with a design where deduplication anticipates what will be useful
for nbtsplitloc.c.

I still think that it's too early to start worrying about problems
like this one -- I feel it will be useful to continue to focus on the
code and the space utilization of the serial test cases for now. We
can look at it at the same time that we think about adding back
something like BT_COMPRESS_THRESHOLD. I am mentioning it now because
it's probably a good time for you to start thinking about it, if you
haven't already (actually, maybe I'm just describing what
BT_COMPRESS_THRESHOLD was supposed to do in the first place). We'll
need to have a good benchmark to assess these questions, and it's not
obvious what that will be. Two possible candidates are TPC-H and
TPC-E. (Of course, I mean running them for real -- not using their
indexes to make sure that the nbtsplitloc.c stuff works well in
isolation.)

Any thoughts on a conventional benchmark that allows us to understand
the patch's impact on both throughput and latency?

BTW, I notice that we often have indexes that are quite a lot smaller
when they were created with retail insertions rather than with CREATE
INDEX/REINDEX. This is not new, but the difference is much larger than
it typically is without the patch. For example, the TPC-E index on
trade.t_ca_id (which is named "i_t_ca_id" or "i_t_ca_id2" in my test)
is 162 MB with CREATE INDEX/REINDEX, and 121 MB with retail insertions
(assuming the insertions use the actual order from the test). I'm not
sure what to do about this, if anything. I mean, the reason that the
retail insertions do better is that they have the nbtsplitloc.c stuff,
and because we don't split the page until it's 100% full and until
deduplication stops helping -- we could apply several rounds of
deduplication before we actually have to split the cage. So the
difference that we see here is both logical and surprising.

How do you feel about this CREATE INDEX index-size-is-larger business?

--
Peter Geoghegan

#76

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Peter Geoghegan (#75)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Thu, Aug 29, 2019 at 5:07 PM Peter Geoghegan <pg@bowt.ie> wrote:

I agree that v9 might be ever so slightly more space efficient than v5
was, on balance.

I see some Valgrind errors on v9, all of which look like the following
two sample errors I go into below.

First one:

==11193== VALGRINDERROR-BEGIN
==11193== Unaddressable byte(s) found during client check request
==11193== at 0x4C0E03: PageAddItemExtended (bufpage.c:332)
==11193== by 0x20F6C3: _bt_split (nbtinsert.c:1643)
==11193== by 0x20F6C3: _bt_insertonpg (nbtinsert.c:1206)
==11193== by 0x21239B: _bt_doinsert (nbtinsert.c:306)
==11193== by 0x2150EE: btinsert (nbtree.c:207)
==11193== by 0x20D63A: index_insert (indexam.c:186)
==11193== by 0x36B7F2: ExecInsertIndexTuples (execIndexing.c:393)
==11193== by 0x391793: ExecInsert (nodeModifyTable.c:593)
==11193== by 0x3924DC: ExecModifyTable (nodeModifyTable.c:2219)
==11193== by 0x37306D: ExecProcNodeFirst (execProcnode.c:445)
==11193== by 0x36C738: ExecProcNode (executor.h:240)
==11193== by 0x36C738: ExecutePlan (execMain.c:1648)
==11193== by 0x36C738: standard_ExecutorRun (execMain.c:365)
==11193== by 0x36C7DD: ExecutorRun (execMain.c:309)
==11193== by 0x4CC41A: ProcessQuery (pquery.c:161)
==11193== by 0x4CC5EB: PortalRunMulti (pquery.c:1283)
==11193== by 0x4CD31C: PortalRun (pquery.c:796)
==11193== by 0x4C8EFC: exec_simple_query (postgres.c:1231)
==11193== by 0x4C9EE0: PostgresMain (postgres.c:4256)
==11193== by 0x453650: BackendRun (postmaster.c:4446)
==11193== by 0x453650: BackendStartup (postmaster.c:4137)
==11193== by 0x453650: ServerLoop (postmaster.c:1704)
==11193== by 0x454CAC: PostmasterMain (postmaster.c:1377)
==11193== by 0x3B85A1: main (main.c:210)
==11193== Address 0x9c11350 is 0 bytes after a recently re-allocated
block of size 8,192 alloc'd
==11193== at 0x4C2FB0F: malloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==11193== by 0x61085A: AllocSetAlloc (aset.c:914)
==11193== by 0x617AD8: palloc (mcxt.c:938)
==11193== by 0x21A829: _bt_mkscankey (nbtutils.c:107)
==11193== by 0x2118F3: _bt_doinsert (nbtinsert.c:93)
==11193== by 0x2150EE: btinsert (nbtree.c:207)
==11193== by 0x20D63A: index_insert (indexam.c:186)
==11193== by 0x36B7F2: ExecInsertIndexTuples (execIndexing.c:393)
==11193== by 0x391793: ExecInsert (nodeModifyTable.c:593)
==11193== by 0x3924DC: ExecModifyTable (nodeModifyTable.c:2219)
==11193== by 0x37306D: ExecProcNodeFirst (execProcnode.c:445)
==11193== by 0x36C738: ExecProcNode (executor.h:240)
==11193== by 0x36C738: ExecutePlan (execMain.c:1648)
==11193== by 0x36C738: standard_ExecutorRun (execMain.c:365)
==11193== by 0x36C7DD: ExecutorRun (execMain.c:309)
==11193== by 0x4CC41A: ProcessQuery (pquery.c:161)
==11193== by 0x4CC5EB: PortalRunMulti (pquery.c:1283)
==11193== by 0x4CD31C: PortalRun (pquery.c:796)
==11193== by 0x4C8EFC: exec_simple_query (postgres.c:1231)
==11193== by 0x4C9EE0: PostgresMain (postgres.c:4256)
==11193== by 0x453650: BackendRun (postmaster.c:4446)
==11193== by 0x453650: BackendStartup (postmaster.c:4137)
==11193== by 0x453650: ServerLoop (postmaster.c:1704)
==11193== by 0x454CAC: PostmasterMain (postmaster.c:1377)
==11193==
==11193== VALGRINDERROR-END
{
<insert_a_suppression_name_here>
Memcheck:User
fun:PageAddItemExtended
fun:_bt_split
fun:_bt_insertonpg
fun:_bt_doinsert
fun:btinsert
fun:index_insert
fun:ExecInsertIndexTuples
fun:ExecInsert
fun:ExecModifyTable
fun:ExecProcNodeFirst
fun:ExecProcNode
fun:ExecutePlan
fun:standard_ExecutorRun
fun:ExecutorRun
fun:ProcessQuery
fun:PortalRunMulti
fun:PortalRun
fun:exec_simple_query
fun:PostgresMain
fun:BackendRun
fun:BackendStartup
fun:ServerLoop
fun:PostmasterMain
fun:main
}

nbtinsert.c:1643 is the first PageAddItem() in _bt_split() -- the
lefthikey call.

Second one:

==11193== VALGRINDERROR-BEGIN
==11193== Invalid read of size 2
==11193== at 0x20FDF5: _bt_insertonpg (nbtinsert.c:1126)
==11193== by 0x21239B: _bt_doinsert (nbtinsert.c:306)
==11193== by 0x2150EE: btinsert (nbtree.c:207)
==11193== by 0x20D63A: index_insert (indexam.c:186)
==11193== by 0x36B7F2: ExecInsertIndexTuples (execIndexing.c:393)
==11193== by 0x391793: ExecInsert (nodeModifyTable.c:593)
==11193== by 0x3924DC: ExecModifyTable (nodeModifyTable.c:2219)
==11193== by 0x37306D: ExecProcNodeFirst (execProcnode.c:445)
==11193== by 0x36C738: ExecProcNode (executor.h:240)
==11193== by 0x36C738: ExecutePlan (execMain.c:1648)
==11193== by 0x36C738: standard_ExecutorRun (execMain.c:365)
==11193== by 0x36C7DD: ExecutorRun (execMain.c:309)
==11193== by 0x4CC41A: ProcessQuery (pquery.c:161)
==11193== by 0x4CC5EB: PortalRunMulti (pquery.c:1283)
==11193== by 0x4CD31C: PortalRun (pquery.c:796)
==11193== by 0x4C8EFC: exec_simple_query (postgres.c:1231)
==11193== by 0x4C9EE0: PostgresMain (postgres.c:4256)
==11193== by 0x453650: BackendRun (postmaster.c:4446)
==11193== by 0x453650: BackendStartup (postmaster.c:4137)
==11193== by 0x453650: ServerLoop (postmaster.c:1704)
==11193== by 0x454CAC: PostmasterMain (postmaster.c:1377)
==11193== by 0x3B85A1: main (main.c:210)
==11193== Address 0x9905b90 is 11,088 bytes inside a recently
re-allocated block of size 524,288 alloc'd
==11193== at 0x4C2FB0F: malloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==11193== by 0x61085A: AllocSetAlloc (aset.c:914)
==11193== by 0x617AD8: palloc (mcxt.c:938)
==11193== by 0x1C5677: CopyIndexTuple (indextuple.c:508)
==11193== by 0x20E887: _bt_compress_one_page (nbtinsert.c:2751)
==11193== by 0x21241E: _bt_findinsertloc (nbtinsert.c:773)
==11193== by 0x21241E: _bt_doinsert (nbtinsert.c:303)
==11193== by 0x2150EE: btinsert (nbtree.c:207)
==11193== by 0x20D63A: index_insert (indexam.c:186)
==11193== by 0x36B7F2: ExecInsertIndexTuples (execIndexing.c:393)
==11193== by 0x391793: ExecInsert (nodeModifyTable.c:593)
==11193== by 0x3924DC: ExecModifyTable (nodeModifyTable.c:2219)
==11193== by 0x37306D: ExecProcNodeFirst (execProcnode.c:445)
==11193== by 0x36C738: ExecProcNode (executor.h:240)
==11193== by 0x36C738: ExecutePlan (execMain.c:1648)
==11193== by 0x36C738: standard_ExecutorRun (execMain.c:365)
==11193== by 0x36C7DD: ExecutorRun (execMain.c:309)
==11193== by 0x4CC41A: ProcessQuery (pquery.c:161)
==11193== by 0x4CC5EB: PortalRunMulti (pquery.c:1283)
==11193== by 0x4CD31C: PortalRun (pquery.c:796)
==11193== by 0x4C8EFC: exec_simple_query (postgres.c:1231)
==11193== by 0x4C9EE0: PostgresMain (postgres.c:4256)
==11193== by 0x453650: BackendRun (postmaster.c:4446)
==11193== by 0x453650: BackendStartup (postmaster.c:4137)
==11193== by 0x453650: ServerLoop (postmaster.c:1704)
==11193==
==11193== VALGRINDERROR-END
{
<insert_a_suppression_name_here>
Memcheck:Addr2
fun:_bt_insertonpg
fun:_bt_doinsert
fun:btinsert
fun:index_insert
fun:ExecInsertIndexTuples
fun:ExecInsert
fun:ExecModifyTable
fun:ExecProcNodeFirst
fun:ExecProcNode
fun:ExecutePlan
fun:standard_ExecutorRun
fun:ExecutorRun
fun:ProcessQuery
fun:PortalRunMulti
fun:PortalRun
fun:exec_simple_query
fun:PostgresMain
fun:BackendRun
fun:BackendStartup
fun:ServerLoop
fun:PostmasterMain
fun:main
}

nbtinsert.c:1126 is this code from _bt_insertonpg():

elog(DEBUG4, "dest before (%u,%u)",
ItemPointerGetBlockNumberNoCheck((ItemPointer) dest),
ItemPointerGetOffsetNumberNoCheck((ItemPointer) dest));

This is probably harmless, but it needs to be fixed.

--
Peter Geoghegan

#77

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Peter Geoghegan (#76)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Thu, Aug 29, 2019 at 10:10 PM Peter Geoghegan <pg@bowt.ie> wrote:

I see some Valgrind errors on v9, all of which look like the following
two sample errors I go into below.

I've found a fix for these Valgrind issues. It's a matter of making
sure that _bt_truncate() sizes new pivot tuples properly, which is
quite subtle:

--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -2155,8 +2155,11 @@ _bt_truncate(Relation rel, IndexTuple lastleft,
IndexTuple firstright,
         {
             BTreeTupleClearBtIsPosting(pivot);
             BTreeTupleSetNAtts(pivot, keepnatts);
-            pivot->t_info &= ~INDEX_SIZE_MASK;
-            pivot->t_info |= BTreeTupleGetPostingOffset(firstright);
+            if (keepnatts == natts)
+            {
+                pivot->t_info &= ~INDEX_SIZE_MASK;
+                pivot->t_info |=
MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+            }
         }

I'm varying how the new pivot tuple is sized here according to whether
or not index_truncate_tuple() just does a CopyIndexTuple(). This very
slightly changes the behavior of the nbtsplitloc.c stuff, but that's
not a concern for me.

I will post a patch with this and other tweaks next week.

--
Peter Geoghegan

#78

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Peter Geoghegan (#77)

2 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Sat, Aug 31, 2019 at 1:04 AM Peter Geoghegan <pg@bowt.ie> wrote:

I've found a fix for these Valgrind issues.

Attach is v10, which fixes the Valgrind issue.

Other changes:

* The code now fully embraces the idea that posting list splits
involve "changing the incoming item" in a way that "avoids" having the
new/incoming item overlap with an existing posting list tuple. This
allowed me to cut down on the changes required within nbtinsert.c
considerably.

* Streamlined a lot of the code in nbtsearch.c. I was able to
significantly simplify _bt_compare() and _bt_binsrch_insert().

* Removed the DEBUG4 traces. A lot of these had to go when I
refactored nbtsearch.c code, so I thought I might as well removed the
remaining ones. I hope that you don't mind (go ahead and add them back
where that makes sense).

* A backwards scan will return "logical tuples" in descending order
now. We should do this on general principle, and also because of the
possibility of future external code that expects and takes advantage
of consistent heap TID order.

This change might even have a small performance benefit today, though:
Index scans that visit multiple heap pages but only match on a single
key will only pin each heap page visited once. Visiting the heap pages
in descending order within a B-Tree page full of duplicates, but
ascending order within individual posting lists could result in
unnecessary extra pinning.

* Standardized terminology. We consistently call what the patch adds
"deduplication" rather than "compression".

* Added a new section on the design to the nbtree README. This is
fairly high level, and talks about dynamics that we can't really talk
about anywhere else, such as how nbtsplitloc.c "cooperates" with
deduplication, producing an effect that is greater than the sum of its
parts.

* I also made some changes to the WAL logging for leaf page insertions
and page splits.

I didn't add the optimization that you anticipated in your nbtxlog.h
comments (i.e. only WAL-log a rewritten posting list when it will go
on the left half of the split, just like the new/incoming item thing
we have already). I agree that that's a good idea, and should be added
soon. Actually, I think the whole "new item vs. rewritten posting list
item" thing makes the WAL logging confusing, so this is not really
about performance.

Maybe the easiest way to do this is also the way that performs best.
I'm thinking of this: maybe we could completely avoid WAL-logging the
entire rewritten/split posting list. After all, the contents of the
rewritten posting list are derived from the existing/original posting
list, as well as the new/incoming item. We can make the WAL record
much smaller on average by making standbys repeat a little bit of the
work performed on the primary. Maybe we could WAL-log
"in_posting_offset" itself, and an ItemPointerData (obviously the new
item offset number tells us the offset number of the posting list that
must be replaced/memmoved()'d). Then have the standby repeat some of
the work performed on the primary -- at least the work of swapping a
heap TID could be repeated on standbys, since it's very little extra
work for standbys, but could really reduce the WAL volume. This might
actually be simpler.

The WAL logging that I didn't touch in v10 is the most important thing
to improve. I am talking about the WAL-logging that is performed as
part of deduplicating all items on a page, to avoid a page split (i.e.
the WAL-logging within _bt_dedup_one_page()). That still just does a
log_newpage_buffer() in v10, which is pretty inefficient. Much like
the posting list split WAL logging stuff, WAL logging in
_bt_dedup_one_page() can probably be made more efficient by describing
deduplication in terms of logical changes. For example, the WAL
records should consist of metadata that could be read by a human as
"merge the tuples from offset number 15 until offset number 27".
Perhaps this could also share code with the posting list split stuff.
What do you think?

Once we make the WAL-logging within _bt_dedup_one_page() more
efficient, that also makes it fairly easy to make the deduplication
that it performs occur incrementally, maybe even very incrementally. I
can imagine the _bt_dedup_one_page() caller specifying "my new tuple
is 32 bytes, and I'd really like to not have to split the page, so
please at least do enough deduplication to make it fit". Delaying
deduplication increases the amount of time that we have to set the
LP_DEAD bit for remaining items on the page, which might be important.
Also, spreading out the volume of WAL produced by deduplication over
time might be important with certain workloads. We would still
probably do somewhat more work than strictly necessary to avoid a page
split if we were to make _bt_dedup_one_page() incremental like this,
though not by a huge amount.

OTOH, maybe I am completely wrong about "incremental deduplication"
being a good idea. It seems worth experimenting with, though. It's not
that much more work on top of making the _bt_dedup_one_page()
WAL-logging efficient, which seems like the thing we should focus on
now.

Thoughts?
--
Peter Geoghegan

Attachments:

v10-0002-DEBUG-Add-pageinspect-instrumentation.patchapplication/octet-stream; name=v10-0002-DEBUG-Add-pageinspect-instrumentation.patchDownload

From 92d9c62d9c92da8e876d07d4335572c8eded0ae8 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v10 2/2] DEBUG: Add pageinspect instrumentation.

Have pageinspect display user-visible attribute values.

This patch is not proposed for inclusion in PostgreSQL; it's included
for the convenience of reviewers.

The following query can be used with this hacked pageinspect, which
visualizes the internal pages:

"""

with recursive index_details as (
  select
    'my_test_index'::text idx
),
size_in_pages_index as (
  select
    (pg_relation_size(idx::regclass) / (2^13))::int4 size_pages
  from
    index_details
),
page_stats as (
  select
    index_details.*,
    stats.*
  from
    index_details,
    size_in_pages_index,
    lateral (select i from generate_series(1, size_pages - 1) i) series,
    lateral (select * from bt_page_stats(idx, i)) stats),
internal_page_stats as (
  select
    *
  from
    page_stats
  where
    type != 'l'),
meta_stats as (
  select
    *
  from
    index_details s,
    lateral (select * from bt_metap(s.idx)) meta),
internal_items as (
  select
    *
  from
    internal_page_stats
  order by
    btpo desc),
-- XXX: Note ordering dependency within this CTE, on internal_items
ordered_internal_items(item, blk, level) as (
  select
    1,
    blkno,
    btpo
  from
    internal_items
  where
    btpo_prev = 0
    and btpo = (select level from meta_stats)
  union
  select
    case when level = btpo then o.item + 1 else 1 end,
    blkno,
    btpo
  from
    internal_items i,
    ordered_internal_items o
  where
    i.btpo_prev = o.blk or (btpo_prev = 0 and btpo = o.level - 1)
)
select
  --idx,
  btpo as level,
  item as l_item,
  blkno,
  --btpo_prev,
  --btpo_next,
  btpo_flags,
  type,
  live_items,
  dead_items,
  avg_item_size,
  page_size,
  free_size,
  -- Only non-rightmost pages have high key.  Show heap TID for both pivot and non-pivot tuples here.
  case when btpo_next != 0 then (select data || coalesce(', (htid)=(''' || htid || ''')', '')
                                 from bt_page_items(idx, blkno) where itemoffset = 1) end as highkey
from
  ordered_internal_items o
  join internal_items i on o.blk = i.blkno
order by btpo desc, item;
"""
---
 contrib/pageinspect/btreefuncs.c              | 67 +++++++++++++++----
 contrib/pageinspect/expected/btree.out        |  3 +-
 contrib/pageinspect/pageinspect--1.6--1.7.sql | 22 ++++++
 3 files changed, 78 insertions(+), 14 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 8d27c9b0f6..f95f3ad892 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -29,6 +29,7 @@
 
 #include "pageinspect.h"
 
+#include "access/genam.h"
 #include "access/nbtree.h"
 #include "access/relation.h"
 #include "catalog/namespace.h"
@@ -243,6 +244,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
  */
 struct user_args
 {
+	Relation	rel;
 	Page		page;
 	OffsetNumber offset;
 };
@@ -254,9 +256,9 @@ struct user_args
  * ------------------------------------------------------
  */
 static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset, Relation rel)
 {
-	char	   *values[6];
+	char	   *values[7];
 	HeapTuple	tuple;
 	ItemId		id;
 	IndexTuple	itup;
@@ -265,6 +267,8 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	int			dlen;
 	char	   *dump;
 	char	   *ptr;
+	ItemPointer htid;
+	BTPageOpaque opaque;
 
 	id = PageGetItemId(page, offset);
 
@@ -283,16 +287,52 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
-	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
-	dump = palloc0(dlen * 3 + 1);
-	values[j] = dump;
-	for (off = 0; off < dlen; off++)
+	if (rel)
 	{
-		if (off > 0)
-			*dump++ = ' ';
-		sprintf(dump, "%02x", *(ptr + off) & 0xff);
-		dump += 2;
+		TupleDesc	itupdesc = RelationGetDescr(rel);
+		Datum		datvalues[INDEX_MAX_KEYS];
+		bool		isnull[INDEX_MAX_KEYS];
+		int			natts;
+		int			indnkeyatts = rel->rd_index->indnkeyatts;
+
+		natts = BTreeTupleGetNAtts(itup, rel);
+
+		itupdesc->natts = Min(indnkeyatts, natts);
+		memset(&isnull, 0xFF, sizeof(isnull));
+		index_deform_tuple(itup, itupdesc, datvalues, isnull);
+		rel->rd_index->indnkeyatts = natts;
+		values[j++] = BuildIndexValueDescription(rel, datvalues, isnull);
+		itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+		rel->rd_index->indnkeyatts = indnkeyatts;
 	}
+	else
+	{
+		dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+		dump = palloc0(dlen * 3 + 1);
+		values[j++] = dump;
+		for (off = 0; off < dlen; off++)
+		{
+			if (off > 0)
+				*dump++ = ' ';
+			sprintf(dump, "%02x", *(ptr + off) & 0xff);
+			dump += 2;
+		}
+	}
+
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	if (P_ISLEAF(opaque) && offset >= P_FIRSTDATAKEY(opaque))
+		htid = &itup->t_tid;
+	else if (_bt_heapkeyspace(rel))
+		htid = BTreeTupleGetHeapTID(itup);
+	else
+		htid = NULL;
+
+	if (htid)
+		values[j] = psprintf("(%u,%u)",
+							 ItemPointerGetBlockNumberNoCheck(htid),
+							 ItemPointerGetOffsetNumberNoCheck(htid));
+	else
+		values[j] = NULL;
 
 	tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
 
@@ -366,11 +406,11 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = rel;
 		uargs->page = palloc(BLCKSZ);
 		memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
 
 		UnlockReleaseBuffer(buffer);
-		relation_close(rel, AccessShareLock);
 
 		uargs->offset = FirstOffsetNumber;
 
@@ -397,12 +437,13 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, uargs->rel);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
 	else
 	{
+		relation_close(uargs->rel, AccessShareLock);
 		pfree(uargs->page);
 		pfree(uargs);
 		SRF_RETURN_DONE(fctx);
@@ -482,7 +523,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, NULL);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..067e73f21a 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -40,7 +40,8 @@ ctid       | (0,1)
 itemlen    | 16
 nulls      | f
 vars       | f
-data       | 01 00 00 00 00 00 00 01
+data       | (a)=(72057594037927937)
+htid       | (0,1)
 
 SELECT * FROM bt_page_items('test1_a_idx', 2);
 ERROR:  block number out of range
diff --git a/contrib/pageinspect/pageinspect--1.6--1.7.sql b/contrib/pageinspect/pageinspect--1.6--1.7.sql
index 2433a21af2..9acbad1589 100644
--- a/contrib/pageinspect/pageinspect--1.6--1.7.sql
+++ b/contrib/pageinspect/pageinspect--1.6--1.7.sql
@@ -24,3 +24,25 @@ CREATE FUNCTION bt_metap(IN relname text,
     OUT last_cleanup_num_tuples real)
 AS 'MODULE_PATHNAME', 'bt_metap'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items()
+--
+DROP FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT htid tid)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
-- 
2.17.1

v10-0001-Add-deduplication-to-nbtree.patchapplication/octet-stream; name=v10-0001-Add-deduplication-to-nbtree.patchDownload

From 6c1bb94b2f9c39af784f2d7ebe461251a63a71ba Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 29 Aug 2019 14:35:35 -0700
Subject: [PATCH v10 1/2] Add deduplication to nbtree.

---
 contrib/amcheck/verify_nbtree.c         | 126 ++++++--
 src/backend/access/nbtree/README        |  70 ++++-
 src/backend/access/nbtree/nbtinsert.c   | 379 +++++++++++++++++++++++-
 src/backend/access/nbtree/nbtpage.c     |  53 ++++
 src/backend/access/nbtree/nbtree.c      | 143 +++++++--
 src/backend/access/nbtree/nbtsearch.c   | 245 +++++++++++++--
 src/backend/access/nbtree/nbtsort.c     | 196 +++++++++++-
 src/backend/access/nbtree/nbtsplitloc.c |  47 ++-
 src/backend/access/nbtree/nbtutils.c    | 210 +++++++++++--
 src/backend/access/nbtree/nbtxlog.c     |  88 +++++-
 src/backend/access/rmgrdesc/nbtdesc.c   |  10 +-
 src/include/access/nbtree.h             | 206 ++++++++++++-
 src/include/access/nbtxlog.h            |  36 ++-
 src/tools/valgrind.supp                 |  21 ++
 14 files changed, 1688 insertions(+), 142 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d678ed..f2ebd215b2 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -924,6 +924,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer	scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -994,29 +995,73 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
 			char	   *itid,
 					   *htid;
+			ItemPointer tid = BTreeTupleGetHeapTID(itup);
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+							ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("could not find tuple using search from root page in index \"%s\"",
 							RelationGetRelationName(state->rel)),
-					 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+					 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
 										itid, htid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is actually a posting list, make sure posting list TIDs
+		 * are in order.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer		current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+				current = BTreeTupleGetPostingN(itup, i);
+
+				if (ItemPointerCompare(current, &last) <= 0)
+				{
+					char	   *itid,
+							   *htid;
+
+					itid = psprintf("(%u,%u)", state->targetblock, offset);
+					htid = psprintf("(%u,%u)",
+									ItemPointerGetBlockNumberNoCheck(current),
+									ItemPointerGetOffsetNumberNoCheck(current));
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+												itid, htid,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1074,12 +1119,33 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				IndexTuple	onetup;
+
+				/* Fingerprint all elements of posting tuple one by one */
+				for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+					norm = bt_normalize_tuple(state, onetup);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != onetup)
+						pfree(norm);
+					pfree(onetup);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
@@ -1087,7 +1153,8 @@ bt_target_page_check(BtreeCheckState *state)
 		 *
 		 * If there is a high key (if this is not the rightmost page on its
 		 * entire level), check that high key actually is upper bound on all
-		 * page items.
+		 * page items.  If this is a posting list tuple, we'll need to set
+		 * scantid to be highest TID in posting list.
 		 *
 		 * We prefer to check all items against high key rather than checking
 		 * just the last and trusting that the operator class obeys the
@@ -1127,6 +1194,9 @@ bt_target_page_check(BtreeCheckState *state)
 		 * tuple. (See also: "Notes About Data Representation" in the nbtree
 		 * README.)
 		 */
+		scantid = skey->scantid;
+		if (!BTreeTupleIsPivot(itup))
+			skey->scantid = BTreeTupleGetMaxTID(itup);
 		if (!P_RIGHTMOST(topaque) &&
 			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
 			  invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1220,7 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
+		skey->scantid = scantid;
 
 		/*
 		 * * Item order check *
@@ -1164,11 +1235,13 @@ bt_target_page_check(BtreeCheckState *state)
 					   *htid,
 					   *nitid,
 					   *nhtid;
+			ItemPointer tid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetHeapTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 			nitid = psprintf("(%u,%u)", state->targetblock,
 							 OffsetNumberNext(offset));
 
@@ -1177,9 +1250,11 @@ bt_target_page_check(BtreeCheckState *state)
 										  state->target,
 										  OffsetNumberNext(offset));
 			itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+			tid = BTreeTupleGetHeapTID(itup);
 			nhtid = psprintf("(%u,%u)",
-							 ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							 ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							 ItemPointerGetBlockNumberNoCheck(tid),
+							 ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1264,10 @@ bt_target_page_check(BtreeCheckState *state)
 										"higher index tid=%s (points to %s tid=%s) "
 										"page lsn=%X/%X.",
 										itid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										htid,
 										nitid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										nhtid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
@@ -1953,10 +2028,10 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
  * verification.  In particular, it won't try to normalize opclass-equal
  * datums with potentially distinct representations (e.g., btree/numeric_ops
  * index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though.  For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation.  Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
  */
 static IndexTuple
 bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -2087,6 +2162,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		insertstate.itup = itup;
 		insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 		insertstate.itup_key = key;
+		insertstate.in_posting_offset = 0;
 		insertstate.bounds_valid = false;
 		insertstate.buf = lbuf;
 
@@ -2094,7 +2170,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		offnum = _bt_binsrch_insert(state->rel, &insertstate);
 		/* Compare first >= matching item on leaf page, if any */
 		page = BufferGetPage(lbuf);
+		/* Should match on first heap TID when tuple has a posting list */
 		if (offnum <= PageGetMaxOffsetNumber(page) &&
+			insertstate.in_posting_offset == 0 &&
 			_bt_compare(state->rel, key, page, offnum) == 0)
 			exists = true;
 		_bt_relbuf(state->rel, lbuf);
@@ -2560,14 +2638,16 @@ static inline ItemPointer
 BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
 							bool nonpivot)
 {
-	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	ItemPointer result;
 	BlockNumber targetblock = state->targetblock;
 
-	if (result == NULL && nonpivot)
+	if (BTreeTupleIsPivot(itup) == nonpivot)
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
 						targetblock, RelationGetRelationName(state->rel))));
 
+	result = BTreeTupleGetHeapTID(itup);
+
 	return result;
 }
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6db203e75c..2be064153d 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
 like a hint bit for a heap tuple), but physically removing tuples requires
 exclusive lock.  In the current code we try to remove LP_DEAD tuples when
 we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already).  Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples never have the LP_DEAD bit set, since each
+"logical" tuple may or may not be "known dead".)
 
 This leaves the index in a state where it has no entry for a dead tuple
 that still exists in the heap.  This is not a problem for the current
@@ -710,6 +713,71 @@ the fallback strategy assumes that duplicates are mostly inserted in
 ascending heap TID order.  The page is split in a way that leaves the left
 half of the page mostly full, and the right half of the page mostly empty.
 
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits.  Deduplication alters
+the physical representation of tuples without changing the logical contents
+of the index, and without adding overhead to read queries.  Non-pivot
+tuples are folded together into a single physical tuple with a posting list
+(a simple array of heap TIDs with the standard item pointer format).
+Deduplication is always applied lazily, at the point where it would
+otherwise be necessary to perform a page split.  It occurs only when
+LP_DEAD items have been removed, as our last line of defense against
+splitting a leaf page.  We cannot set the LP_DEAD bit with posting list
+tuples. (Bitmap scans cannot perform LP_DEAD bit setting, and are the
+common case with indexes that contain lots of duplicates, so this downside
+is considered acceptable.)
+
+Large groups of logical duplicates tend to appear together on the same leaf
+page due to the special duplicate logic used when choosing a split point.
+This facilitates lazy/dynamic deduplication.  Deduplication can reliably
+deduplicate a large localized group of duplicates before it can span
+multiple leaf pages.  Posting list tuples are subject to the same 1/3 of a
+page restriction as any other tuple.
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists.  A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple.  An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication.  Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple (lazy deduplication
+avoids rewriting posting lists repeatedly when heap TIDs are inserted
+slightly out of order by concurrent inserters).  When the incoming tuple
+really does overlap with an existing posting list, a posting list split is
+performed.  Posting list splits work in a way that more or less preserves
+the illusion that all incoming tuples do not need to be merged with any
+existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list.  The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list.  The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list.  We make
+space in the posting list by removing the heap TID that became the new
+item.  The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists.  Also, space
+utilization is improved and page fragmentation is avoided by keeping
+existing posting lists large.
+
+Currently, posting lists are not compressed.  It would be straightforward
+to add GIN-style posting list compression based on varbyte encoding.  That
+would probably need to be configurable and not enabled by default, because
+the overhead of decompression would be an obvious downside, especially with
+backwards scans.
+
+TODO: Review whether or not basic deduplication should be enabled by
+default.
+
 Notes About Data Representation
 -------------------------------
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b84bf1c3df..f2fe3f77ce 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,21 +47,25 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   BTStack stack,
 						   IndexTuple itup,
 						   OffsetNumber newitemoff,
+						   int in_posting_offset,
 						   bool split_only_page);
 static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 						Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
-						IndexTuple newitem);
+						IndexTuple newitem, IndexTuple nposting);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 							  BTStack stack, bool is_root, bool is_only);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 						 OffsetNumber itup_off);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void insert_itupprev_to_page(Page page, BTDedupState *dedupState);
 
 /*
  *	_bt_doinsert() -- Handle insertion of a single index tuple in the tree.
  *
  *		This routine is called by the public interface routine, btinsert.
- *		By here, itup is filled in, including the TID.
+ *		By here, itup is filled in, including the TID.  Caller should be
+ *		prepared for us to scribble on 'itup'.
  *
  *		If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
  *		will allow duplicates.  Otherwise (UNIQUE_CHECK_YES or
@@ -123,6 +127,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	/* PageAddItem will MAXALIGN(), but be consistent */
 	insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 	insertstate.itup_key = itup_key;
+	insertstate.in_posting_offset = 0;
 	insertstate.bounds_valid = false;
 	insertstate.buf = InvalidBuffer;
 
@@ -300,7 +305,7 @@ top:
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
 		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
-					   itup, newitemoff, false);
+					   itup, newitemoff, insertstate.in_posting_offset, false);
 	}
 	else
 	{
@@ -435,6 +440,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 
 				/* okay, we gotta fetch the heap tuple ... */
 				curitup = (IndexTuple) PageGetItem(page, curitemid);
+				Assert(!BTreeTupleIsPosting(curitup));
 				htid = curitup->t_tid;
 
 				/*
@@ -759,6 +765,15 @@ _bt_findinsertloc(Relation rel,
 			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
 			insertstate->bounds_valid = false;
 		}
+
+		/*
+		 * If the target page is full, try to deduplicate items on page
+		 */
+		if (PageGetFreeSpace(page) < insertstate->itemsz && !checkingunique)
+		{
+			_bt_dedup_one_page(rel, insertstate->buf, heapRel);
+			insertstate->bounds_valid = false;	/* paranoia */
+		}
 	}
 	else
 	{
@@ -905,10 +920,11 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		This recursive procedure does the following things:
  *
+ *			+  if necessary, splits an existing posting list on page.
  *			+  if necessary, splits the target page, using 'itup_key' for
  *			   suffix truncation on leaf pages (caller passes NULL for
  *			   non-leaf pages).
- *			+  inserts the tuple.
+ *			+  inserts the new tuple (could be from split posting list).
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
  *			   right using information stored in the parent stack).
@@ -918,7 +934,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		On entry, we must have the correct buffer in which to do the
  *		insertion, and the buffer must be pinned and write-locked.  On return,
- *		we will have dropped both the pin and the lock on the buffer.
+ *		we will have dropped both the pin and the lock on the buffer.  Caller
+ *		should be prepared for us to scribble on 'itup'.
  *
  *		This routine only performs retail tuple insertions.  'itup' should
  *		always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +953,14 @@ _bt_insertonpg(Relation rel,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
+			   int in_posting_offset,
 			   bool split_only_page)
 {
 	Page		page;
 	BTPageOpaque lpageop;
 	Size		itemsz;
+	IndexTuple	nposting = NULL;
+	IndexTuple	oposting;
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +974,8 @@ _bt_insertonpg(Relation rel,
 	Assert(P_ISLEAF(lpageop) ||
 		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
+	/* retail insertions of posting list tuples are disallowed */
+	Assert(!BTreeTupleIsPosting(itup));
 
 	/* The caller should've finished any incomplete splits already. */
 	if (P_INCOMPLETE_SPLIT(lpageop))
@@ -964,6 +986,70 @@ _bt_insertonpg(Relation rel,
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
+	/*
+	 * Do we need to split an existing posting list item?
+	 */
+	if (in_posting_offset != 0)
+	{
+		ItemId		itemid = PageGetItemId(page, newitemoff);
+		int			nipd;
+		char	   *replacepos;
+		char	   *rightpos;
+		Size		nbytes;
+
+		/*
+		 * The new tuple is a duplicate with a heap TID that falls inside the
+		 * range of an existing posting list tuple, so split posting list.
+		 *
+		 * Posting list splits always replace some existing TID in the posting
+		 * list with the new item's heap TID (based on a posting list offset
+		 * from caller) by removing rightmost heap TID from posting list.  The
+		 * new item's heap TID is swapped with that rightmost heap TID, almost
+		 * as if the tuple inserted never overlapped with a posting list in
+		 * the first place.  This allows the insertion and page split code to
+		 * have minimal special case handling of posting lists.
+		 *
+		 * The only extra handling required is to overwrite the original
+		 * posting list with nposting, which is guaranteed to be the same size
+		 * as the original, keeping the page space accounting simple.  This
+		 * takes place in either the page insert or page split critical
+		 * section.
+		 */
+		Assert(P_ISLEAF(lpageop));
+		oposting = (IndexTuple) PageGetItem(page, itemid);
+		Assert(BTreeTupleIsPosting(oposting));
+		nipd = BTreeTupleGetNPosting(oposting);
+		Assert(in_posting_offset < nipd);
+
+		nposting = CopyIndexTuple(oposting);
+		replacepos = (char *) BTreeTupleGetPostingN(nposting, in_posting_offset);
+		rightpos = replacepos + sizeof(ItemPointerData);
+		nbytes = (nipd - in_posting_offset - 1) * sizeof(ItemPointerData);
+
+		/*
+		 * Move item pointers in posting list to make a gap for the new item's
+		 * heap TID (shift TIDs one place to the right, losing original
+		 * rightmost TID).
+		 */
+		memmove(rightpos, replacepos, nbytes);
+
+		/*
+		 * Replace newitem's heap TID with rightmost heap TID from original
+		 * posting list
+		 */
+		ItemPointerCopy(&itup->t_tid, (ItemPointer) replacepos);
+
+		/*
+		 * Copy original (not new original) posting list's last TID into new
+		 * item
+		 */
+		ItemPointerCopy(BTreeTupleGetPostingN(oposting, nipd - 1), &itup->t_tid);
+		Assert(ItemPointerCompare(BTreeTupleGetMaxTID(nposting),
+								  BTreeTupleGetHeapTID(itup)) < 0);
+		/* Alter new item offset, since effective new item changed */
+		newitemoff = OffsetNumberNext(newitemoff);
+	}
+
 	/*
 	 * Do we need to split the page to fit the item on it?
 	 *
@@ -996,7 +1082,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+						 nposting);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1075,6 +1162,18 @@ _bt_insertonpg(Relation rel,
 			elog(PANIC, "failed to add new item to block %u in index \"%s\"",
 				 itup_blkno, RelationGetRelationName(rel));
 
+		if (nposting)
+		{
+			/*
+			 * Handle a posting list split by performing an in-place
+			 * update of the existing posting list
+			 */
+			Assert(P_ISLEAF(lpageop));
+			Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+				   MAXALIGN(IndexTupleSize(nposting)));
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+		}
+
 		MarkBufferDirty(buf);
 
 		if (BufferIsValid(metabuf))
@@ -1116,6 +1215,9 @@ _bt_insertonpg(Relation rel,
 			XLogRecPtr	recptr;
 
 			xlrec.offnum = itup_off;
+			xlrec.postingsz = 0;
+			if (nposting)
+				xlrec.postingsz = MAXALIGN(IndexTupleSize(itup));
 
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
@@ -1153,6 +1255,9 @@ _bt_insertonpg(Relation rel,
 
 			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
 			XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+			if (nposting)
+				XLogRegisterBufData(0, (char *) nposting,
+									IndexTupleSize(nposting));
 
 			recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
@@ -1194,6 +1299,10 @@ _bt_insertonpg(Relation rel,
 			_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
 			RelationSetTargetBlock(rel, cachedBlock);
 	}
+
+	/* be tidy */
+	if (nposting)
+		pfree(nposting);
 }
 
 /*
@@ -1211,10 +1320,16 @@ _bt_insertonpg(Relation rel,
  *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
+ *
+ *		nposting is a replacement posting for the posting list at the
+ *		offset immediately before the new item's offset.  This is needed
+ *		when caller performed "posting list split", and corresponds to the
+ *		same step for retail insertions that don't split the page.
  */
 static Buffer
 _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+		  IndexTuple nposting)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1236,12 +1351,20 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	OffsetNumber firstright;
 	OffsetNumber maxoff;
 	OffsetNumber i;
+	OffsetNumber replacepostingoff = InvalidOffsetNumber;
 	bool		newitemonleft,
 				isleaf;
 	IndexTuple	lefthikey;
 	int			indnatts = IndexRelationGetNumberOfAttributes(rel);
 	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
+	/*
+	 * Determine offset number of posting list that will be updated in place
+	 * as part of split that follows a posting list split
+	 */
+	if (nposting != NULL)
+		replacepostingoff = OffsetNumberPrev(newitemoff);
+
 	/*
 	 * origpage is the original page to be split.  leftpage is a temporary
 	 * buffer that receives the left-sibling data, which will be copied back
@@ -1273,6 +1396,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	 * newitemoff == firstright.  In all other cases it's clear which side of
 	 * the split every tuple goes on from context.  newitemonleft is usually
 	 * (but not always) redundant information.
+	 *
+	 * Note: In theory, the split point choice logic should operate against a
+	 * version of the page that already replaced the posting list at offset
+	 * replacepostingoff with nposting where applicable.  We don't bother with
+	 * that, though.  Both versions of the posting list must be the same size
+	 * and have the same key values, so this omission can't affect the split
+	 * point chosen in practice.
 	 */
 	firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
 								  newitem, &newitemonleft);
@@ -1340,6 +1470,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, firstright);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		/* Behave as if origpage posting list has already been swapped */
+		if (firstright == replacepostingoff)
+			item = nposting;
 	}
 
 	/*
@@ -1373,6 +1506,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
 			itemid = PageGetItemId(origpage, lastleftoff);
 			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+			/* Behave as if origpage posting list has already been swapped */
+			if (lastleftoff == replacepostingoff)
+				lastleft = nposting;
 		}
 
 		Assert(lastleft != item);
@@ -1480,8 +1616,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 
+		/*
+		 * did caller pass new replacement posting list tuple due to posting
+		 * list split?
+		 */
+		if (i == replacepostingoff)
+		{
+			/*
+			 * swap origpage posting list with post-posting-list-split version
+			 * from caller
+			 */
+			Assert(isleaf);
+			Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+			item = nposting;
+		}
+
 		/* does new item belong before this one? */
-		if (i == newitemoff)
+		else if (i == newitemoff)
 		{
 			if (newitemonleft)
 			{
@@ -1652,6 +1803,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		xlrec.level = ropaque->btpo.level;
 		xlrec.firstright = firstright;
 		xlrec.newitemoff = newitemoff;
+		xlrec.replacepostingoff = replacepostingoff;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1676,6 +1828,10 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		if (newitemonleft)
 			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
 
+		if (replacepostingoff)
+			XLogRegisterBufData(0, (char *) nposting,
+								MAXALIGN(IndexTupleSize(nposting)));
+
 		/* Log the left page's new high key */
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
@@ -1834,7 +1990,7 @@ _bt_insert_parent(Relation rel,
 
 		/* Recursively insert into the parent */
 		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
-					   new_item, stack->bts_offset + 1,
+					   new_item, stack->bts_offset + 1, 0,
 					   is_only);
 
 		/* be tidy */
@@ -2304,6 +2460,209 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * Note: if we didn't find any LP_DEAD items, then the page's
 	 * BTP_HAS_GARBAGE hint bit is falsely set.  We do not bother expending a
 	 * separate write to clear it, however.  We will clear it when we split
-	 * the page.
+	 * the page (or when deduplication runs).
 	 */
 }
+
+/*
+ * Try to deduplicate items to free some space.  If we don't proceed with
+ * deduplication, buffer will contain old state of the page.
+ *
+ * This function should be called after LP_DEAD items were removed by
+ * _bt_vacuum_one_page() to prevent a page split.  (It's possible that we'll
+ * have to kill additional LP_DEAD items, but that should be rare.)
+ */
+static void
+_bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buffer);
+	Page		newpage;
+	BTPageOpaque oopaque,
+				nopaque;
+	bool		deduplicate = false;
+	BTDedupState *dedupState = NULL;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+	OffsetNumber deletable[MaxOffsetNumber];
+	int			ndeletable = 0;
+
+	/*
+	 * Don't use deduplication for indexes with INCLUDEd columns and unique
+	 * indexes
+	 */
+	deduplicate = (IndexRelationGetNumberOfKeyAttributes(rel) ==
+				   IndexRelationGetNumberOfAttributes(rel) &&
+				   !rel->rd_index->indisunique);
+	if (!deduplicate)
+		return;
+
+	oopaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	/* init deduplication state needed to build posting tuples */
+	dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+	dedupState->ipd = NULL;
+	dedupState->ntuples = 0;
+	dedupState->itupprev = NULL;
+	dedupState->maxitemsize = BTMaxItemSize(page);
+	dedupState->maxpostingsize = 0;
+
+	minoff = P_FIRSTDATAKEY(oopaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/*
+	 * Delete dead tuples if any. We cannot simply skip them in the cycle
+	 * below, because it's neccessary to generate special Xlog record
+	 * containing such tuples to compute latestRemovedXid on a standby server
+	 * later.
+	 *
+	 * This should not affect performance, since it only can happen in a rare
+	 * situation when BTP_HAS_GARBAGE flag was not set and _bt_vacuum_one_page
+	 * was not called, or _bt_vacuum_one_page didn't remove all dead items.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, P_HIKEY);
+
+		if (ItemIdIsDead(itemid))
+			deletable[ndeletable++] = offnum;
+	}
+
+	if (ndeletable > 0)
+		_bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+
+	/*
+	 * Scan over all items to see which ones can be deduplicated
+	 */
+	minoff = P_FIRSTDATAKEY(oopaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+	newpage = PageGetTempPageCopySpecial(page);
+	nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+	/* Make sure that new page won't have garbage flag set */
+	nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+	/* Copy High Key if any */
+	if (!P_RIGHTMOST(oopaque))
+	{
+		ItemId		itemid = PageGetItemId(page, P_HIKEY);
+		Size		itemsz = ItemIdGetLength(itemid);
+		IndexTuple	item = (IndexTuple) PageGetItem(page, itemid);
+
+		if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to add highkey during deduplication");
+	}
+
+	/*
+	 * Iterate over tuples on the page, try to deduplicate them into posting
+	 * lists and insert into new page.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemId = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemId);
+
+		if (dedupState->itupprev != NULL)
+		{
+			if (_bt_keep_natts_fast(rel, dedupState->itupprev, itup) > natts)
+			{
+				int			itup_ntuples;
+
+				/*
+				 * Tuples are equal.
+				 *
+				 * If posting list is too big, insert it on page and continue
+				 * with this tuple as new pending posting list.  Otherwise,
+				 * append the tuple to the pending posting list.
+				 */
+				itup_ntuples = BTreeTupleIsPosting(itup) ?
+					BTreeTupleGetNPosting(itup) : 1;
+
+				if (dedupState->maxitemsize >
+					MAXALIGN(((IndexTupleSize(dedupState->itupprev)
+							   + (dedupState->ntuples + itup_ntuples + 1) * sizeof(ItemPointerData)))))
+				{
+					_bt_add_posting_item(dedupState, itup);
+				}
+				else
+				{
+					insert_itupprev_to_page(newpage, dedupState);
+				}
+			}
+			else
+			{
+				/* Insert pending posting list on page */
+				insert_itupprev_to_page(newpage, dedupState);
+			}
+		}
+
+		/*
+		 * Copy the tuple into temp variable itupprev to compare it with the
+		 * following tuple and maybe unite them into a posting tuple
+		 */
+		if (dedupState->itupprev)
+			pfree(dedupState->itupprev);
+		dedupState->itupprev = CopyIndexTuple(itup);
+
+		Assert(IndexTupleSize(dedupState->itupprev) <= dedupState->maxitemsize);
+	}
+
+	/* Handle the last item. */
+	insert_itupprev_to_page(newpage, dedupState);
+
+	START_CRIT_SECTION();
+
+	PageRestoreTempPage(newpage, page);
+	MarkBufferDirty(buffer);
+
+	/* Log full page write */
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+
+		recptr = log_newpage_buffer(buffer, true);
+		PageSetLSN(page, recptr);
+	}
+
+	END_CRIT_SECTION();
+}
+
+/*
+ * Add new item to the page, while deduplicating
+ */
+static void
+insert_itupprev_to_page(Page page, BTDedupState *dedupState)
+{
+	IndexTuple	to_insert;
+	OffsetNumber offnum = PageGetMaxOffsetNumber(page);
+
+	if (dedupState->ntuples == 0)
+		to_insert = dedupState->itupprev;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(dedupState->itupprev,
+											 dedupState->ipd,
+											 dedupState->ntuples);
+		to_insert = postingtuple;
+		pfree(dedupState->ipd);
+	}
+
+	/* Add the new item into the page */
+	offnum = OffsetNumberNext(offnum);
+
+	if (PageAddItem(page, (Item) to_insert, IndexTupleSize(to_insert),
+					offnum, false, false) == InvalidOffsetNumber)
+		elog(ERROR, "deduplication failed to add tuple to page");
+
+	if (dedupState->ntuples > 0)
+		pfree(to_insert);
+	dedupState->ntuples = 0;
+}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 18c6de21c1..55344a7d78 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -983,14 +983,52 @@ _bt_page_recyclable(Page page)
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
 					OffsetNumber *itemnos, int nitems,
+					OffsetNumber *remainingoffset,
+					IndexTuple *remaining, int nremaining,
 					BlockNumber lastBlockVacuumed)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	Size		itemsz;
+	Size		remaining_sz = 0;
+	char	   *remaining_buf = NULL;
+
+	/* XLOG stuff, buffer for remainings */
+	if (nremaining && RelationNeedsWAL(rel))
+	{
+		Size		offset = 0;
+
+		for (int i = 0; i < nremaining; i++)
+			remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+		remaining_buf = palloc0(remaining_sz);
+		for (int i = 0; i < nremaining; i++)
+		{
+			itemsz = IndexTupleSize(remaining[i]);
+			memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+			offset += MAXALIGN(itemsz);
+		}
+		Assert(offset == remaining_sz);
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
+	/* Handle posting tuples here */
+	for (int i = 0; i < nremaining; i++)
+	{
+		/* At first, delete the old tuple. */
+		PageIndexTupleDelete(page, remainingoffset[i]);
+
+		itemsz = IndexTupleSize(remaining[i]);
+		itemsz = MAXALIGN(itemsz);
+
+		/* Add tuple with remaining ItemPointers to the page. */
+		if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to rewrite posting list item in index while doing vacuum");
+	}
+
 	/* Fix the page */
 	if (nitems > 0)
 		PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1058,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+		xlrec_vacuum.nremaining = nremaining;
+		xlrec_vacuum.ndeleted = nitems;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1073,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		if (nitems > 0)
 			XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
 
+		/*
+		 * Here we should save offnums and remaining tuples themselves. It's
+		 * important to restore them in correct order. At first, we must
+		 * handle remaining tuples and only after that other deleted items.
+		 */
+		if (nremaining > 0)
+		{
+			Assert(remaining_buf != NULL);
+			XLogRegisterBufData(0, (char *) remainingoffset,
+								nremaining * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, remaining_buf, remaining_sz);
+		}
+
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
 		PageSetLSN(page, recptr);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd5289ad..ea7ff6a5f9 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static ItemPointer btreevacuumPosting(BTVacState *vstate, IndexTuple itup,
+									  int *nremaining);
 
 
 /*
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 								 RBM_NORMAL, info->strategy);
 		LockBufferForCleanup(buf);
 		_bt_checkpage(rel, buf);
-		_bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+		_bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+							vstate.lastBlockVacuumed);
 		_bt_relbuf(rel, buf);
 	}
 
@@ -1193,6 +1196,9 @@ restart:
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
+		IndexTuple	remaining[MaxOffsetNumber];
+		OffsetNumber remainingoffset[MaxOffsetNumber];
+		int			nremaining;
 
 		/*
 		 * Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1235,7 @@ restart:
 		 * callback function.
 		 */
 		ndeletable = 0;
+		nremaining = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
 		if (callback)
@@ -1242,31 +1249,79 @@ restart:
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
-				/*
-				 * During Hot Standby we currently assume that
-				 * XLOG_BTREE_VACUUM records do not produce conflicts. That is
-				 * only true as long as the callback function depends only
-				 * upon whether the index tuple refers to heap tuples removed
-				 * in the initial heap scan. When vacuum starts it derives a
-				 * value of OldestXmin. Backends taking later snapshots could
-				 * have a RecentGlobalXmin with a later xid than the vacuum's
-				 * OldestXmin, so it is possible that row versions deleted
-				 * after OldestXmin could be marked as killed by other
-				 * backends. The callback function *could* look at the index
-				 * tuple state in isolation and decide to delete the index
-				 * tuple, though currently it does not. If it ever did, we
-				 * would need to reconsider whether XLOG_BTREE_VACUUM records
-				 * should cause conflicts. If they did cause conflicts they
-				 * would be fairly harsh conflicts, since we haven't yet
-				 * worked out a way to pass a useful value for
-				 * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
-				 * applies to *any* type of index that marks index tuples as
-				 * killed.
-				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				if (BTreeTupleIsPosting(itup))
+				{
+					int			nnewipd = 0;
+					ItemPointer newipd = NULL;
+
+					newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+					if (nnewipd == 0)
+					{
+						/*
+						 * All TIDs from posting list must be deleted, we can
+						 * delete whole tuple in a regular way.
+						 */
+						deletable[ndeletable++] = offnum;
+					}
+					else if (nnewipd == BTreeTupleGetNPosting(itup))
+					{
+						/*
+						 * All TIDs from posting tuple must remain. Do
+						 * nothing, just cleanup.
+						 */
+						pfree(newipd);
+					}
+					else if (nnewipd < BTreeTupleGetNPosting(itup))
+					{
+						/* Some TIDs from posting tuple must remain. */
+						Assert(nnewipd > 0);
+						Assert(newipd != NULL);
+
+						/*
+						 * Form new tuple that contains only remaining TIDs.
+						 * Remember this tuple and the offset of the old tuple
+						 * to update it in place.
+						 */
+						remainingoffset[nremaining] = offnum;
+						remaining[nremaining] =
+							BTreeFormPostingTuple(itup, newipd, nnewipd);
+						nremaining++;
+						pfree(newipd);
+
+						Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+					}
+				}
+				else
+				{
+					htup = &(itup->t_tid);
+
+					/*
+					 * During Hot Standby we currently assume that
+					 * XLOG_BTREE_VACUUM records do not produce conflicts.
+					 * That is only true as long as the callback function
+					 * depends only upon whether the index tuple refers to
+					 * heap tuples removed in the initial heap scan. When
+					 * vacuum starts it derives a value of OldestXmin.
+					 * Backends taking later snapshots could have a
+					 * RecentGlobalXmin with a later xid than the vacuum's
+					 * OldestXmin, so it is possible that row versions deleted
+					 * after OldestXmin could be marked as killed by other
+					 * backends. The callback function *could* look at the
+					 * index tuple state in isolation and decide to delete the
+					 * index tuple, though currently it does not. If it ever
+					 * did, we would need to reconsider whether
+					 * XLOG_BTREE_VACUUM records should cause conflicts. If
+					 * they did cause conflicts they would be fairly harsh
+					 * conflicts, since we haven't yet worked out a way to
+					 * pass a useful value for latestRemovedXid on the
+					 * XLOG_BTREE_VACUUM records. This applies to *any* type
+					 * of index that marks index tuples as killed.
+					 */
+					if (callback(htup, callback_state))
+						deletable[ndeletable++] = offnum;
+				}
 			}
 		}
 
@@ -1274,7 +1329,7 @@ restart:
 		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
 		 * call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nremaining > 0)
 		{
 			/*
 			 * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1346,7 @@ restart:
 			 * that.
 			 */
 			_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+								remainingoffset, remaining, nremaining,
 								vstate->lastBlockVacuumed);
 
 			/*
@@ -1375,6 +1431,41 @@ restart:
 	}
 }
 
+/*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+	int			remaining = 0;
+	int			nitem = BTreeTupleGetNPosting(itup);
+	ItemPointer tmpitems = NULL,
+				items = BTreeTupleGetPosting(itup);
+
+	/*
+	 * Check each tuple in the posting list, save alive tuples into tmpitems
+	 */
+	for (int i = 0; i < nitem; i++)
+	{
+		if (vstate->callback(items + i, vstate->callback_state))
+			continue;
+
+		if (tmpitems == NULL)
+			tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+		tmpitems[remaining++] = items[i];
+	}
+
+	*nremaining = remaining;
+	return tmpitems;
+}
+
 /*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 7f77ed24c5..fb976cad92 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int	_bt_binsrch_posting(BTScanInsert key, Page page,
+								OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static void _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+								  OffsetNumber offnum, ItemPointer iptr,
+								  IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+									   OffsetNumber offnum, ItemPointer iptr,
+									   IndexTuple itup);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -347,12 +355,13 @@ _bt_binsrch(Relation rel,
 	int32		result,
 				cmpval;
 
-	/* Requesting nextkey semantics while using scantid seems nonsensical */
-	Assert(!key->nextkey || key->scantid == NULL);
-
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 
+	/* Requesting nextkey semantics while using scantid seems nonsensical */
+	Assert(!key->nextkey || key->scantid == NULL);
+	/* scantid-set callers must use _bt_binsrch_insert() on leaf pages */
+	Assert(!P_ISLEAF(opaque) || key->scantid == NULL);
 	low = P_FIRSTDATAKEY(opaque);
 	high = PageGetMaxOffsetNumber(page);
 
@@ -432,7 +441,10 @@ _bt_binsrch(Relation rel,
  * low) makes bounds invalid.
  *
  * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's in_posting_offset field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
  */
 OffsetNumber
 _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -507,6 +519,17 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 			if (result != 0)
 				stricthigh = high;
 		}
+
+		/*
+		 * If tuple at offset located by binary search is a posting list whose
+		 * TID range overlaps with caller's scantid, perform posting list
+		 * binary search to set in_posting_offset for caller.  Caller must
+		 * split the posting list when in_posting_offset is set.  This should
+		 * happen infrequently.
+		 */
+		if (unlikely(result == 0 && key->scantid != NULL))
+			insertstate->in_posting_offset =
+				_bt_binsrch_posting(key, page, mid);
 	}
 
 	/*
@@ -526,6 +549,60 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	return low;
 }
 
+/*----------
+ *	_bt_binsrch_posting() -- posting list binary search.
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key,
+					Page page,
+					OffsetNumber offnum)
+{
+	IndexTuple	itup;
+	int			low,
+				high,
+				mid,
+				res;
+
+	/*
+	 * If this isn't a posting tuple, then the index must be corrupt (if it is
+	 * an ordinary non-pivot tuple then there must be an existing tuple with a
+	 * heap TID that equals inserter's new heap TID/scantid).  Defensively
+	 * check that tuple is a posting list tuple whose posting list range
+	 * includes caller's scantid.
+	 *
+	 * (This is also needed because contrib/amcheck's rootdescend option needs
+	 * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+	 */
+	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+	Assert(!key->nextkey);
+	Assert(key->scantid != NULL);
+	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	if (!BTreeTupleIsPosting(itup))
+		return 0;
+
+	/* "high" is past end of posting list for loop invariant */
+	low = 0;
+	high = BTreeTupleGetNPosting(itup);
+	Assert(high >= 2);
+
+	while (high > low)
+	{
+		mid = low + ((high - low) / 2);
+		res = ItemPointerCompare(key->scantid,
+								 BTreeTupleGetPostingN(itup, mid));
+
+		if (res >= 1)
+			low = mid + 1;
+		else
+			high = mid;
+	}
+
+	return low;
+}
+
 /*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
@@ -535,9 +612,18 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
  *			<0 if scankey < tuple at offnum;
  *			 0 if scankey == tuple at offnum;
  *			>0 if scankey > tuple at offnum.
- *		NULLs in the keys are treated as sortable values.  Therefore
- *		"equality" does not necessarily mean that the item should be
- *		returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values.  Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key.  Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid.  There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
+ *
+ * It is generally guaranteed that any possible scankey with scantid set
+ * will have zero or one tuples in the index that are considered equal
+ * here.
  *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
@@ -561,6 +647,7 @@ _bt_compare(Relation rel,
 	ScanKey		scankey;
 	int			ncmpkey;
 	int			ntupatts;
+	int32		result;
 
 	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -595,7 +682,6 @@ _bt_compare(Relation rel,
 	{
 		Datum		datum;
 		bool		isNull;
-		int32		result;
 
 		datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
 
@@ -711,8 +797,24 @@ _bt_compare(Relation rel,
 	if (heapTid == NULL)
 		return 1;
 
+	/*
+	 * scankey must be treated as equal to a posting list tuple if its scantid
+	 * value falls within the range of the posting list.  In all other cases
+	 * there can only be a single heap TID value, which is compared directly
+	 * as a simple scalar value.
+	 */
 	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	return ItemPointerCompare(key->scantid, heapTid);
+	result = ItemPointerCompare(key->scantid, heapTid);
+	if (!BTreeTupleIsPosting(itup) || result <= 0)
+		return result;
+	else
+	{
+		result = ItemPointerCompare(key->scantid, BTreeTupleGetMaxTID(itup));
+		if (result > 0)
+			return 1;
+	}
+
+	return 0;
 }
 
 /*
@@ -1449,6 +1551,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	/* initialize tuple workspace to empty */
 	so->currPos.nextTupleOffset = 0;
+	so->currPos.postingTupleOffset = 0;
 
 	/*
 	 * Now that the current page has been made consistent, the macro should be
@@ -1483,8 +1586,30 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				if (!BTreeTupleIsPosting(itup))
+				{
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+				else
+				{
+					/*
+					 * Setup state to return posting list, and save first
+					 * "logical" tuple
+					 */
+					_bt_setuppostingitems(so, itemIndex, offnum,
+										  BTreeTupleGetPostingN(itup, 0),
+										  itup);
+					itemIndex++;
+					/* Save additional posting list "logical" tuples */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											itup);
+						itemIndex++;
+					}
+				}
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1517,7 +1642,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1525,7 +1650,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxPostingIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1567,8 +1692,37 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (passes_quals && tuple_alive)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				if (!BTreeTupleIsPosting(itup))
+				{
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+				else
+				{
+					int			i = BTreeTupleGetNPosting(itup) - 1;
+
+					/*
+					 * Setup state to return posting list, and save last
+					 * "logical" tuple from posting list (since it's the first
+					 * that will be returned to scan).
+					 */
+					itemIndex--;
+					_bt_setuppostingitems(so, itemIndex, offnum,
+										  BTreeTupleGetPostingN(itup, i--),
+										  itup);
+
+					/*
+					 * Return posting list "logical" tuples -- do this in
+					 * descending order, to match overall scan order
+					 */
+					for (; i >= 0; i--)
+					{
+						itemIndex--;
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											itup);
+					}
+				}
 			}
 			if (!continuescan)
 			{
@@ -1582,8 +1736,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1596,6 +1750,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1608,6 +1764,61 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 	}
 }
 
+/*
+ * Setup state to save posting items from a single posting list tuple.  Saves
+ * the logical tuple that will be returned to scan first in passing.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for logical tuple
+ * that is returned to scan first.  Second or subsequent heap TID for posting
+ * list should be saved by calling _bt_savepostingitem().
+ */
+static void
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					  ItemPointer iptr, IndexTuple itup)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *iptr;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		/* Save a truncated version of the IndexTuple */
+		Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+		itupsz = MAXALIGN(itupsz);
+		currItem->tupleOffset = so->currPos.nextTupleOffset;
+		memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+		so->currPos.nextTupleOffset += itupsz;
+		so->currPos.postingTupleOffset = currItem->tupleOffset;
+	}
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer iptr, IndexTuple itup)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *iptr;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		/*
+		 * Have index-only scans return the same truncated IndexTuple for
+		 * every logical tuple that originates from the same posting list
+		 */
+		currItem->tupleOffset = so->currPos.postingTupleOffset;
+	}
+}
+
 /*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index ab19692006..b2a2039a3d 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -288,6 +288,8 @@ static void _bt_sortaddtup(Page page, Size itemsize,
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 						 IndexTuple itup);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void _bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+								 BTDedupState *dedupState);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
 static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -963,6 +965,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * Overwrite the old item with new truncated high key directly.
 			 * oitup is already located at the physical beginning of tuple
 			 * space, so this should directly reuse the existing tuple space.
+			 *
+			 * If lastleft tuple was a posting tuple, we'll truncate its
+			 * posting list in _bt_truncate as well. Note that it is also
+			 * applicable only to leaf pages, since internal pages never
+			 * contain posting tuples.
 			 */
 			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
 			lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1002,6 +1009,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		 * the minimum key for the new page.
 		 */
 		state->btps_minkey = CopyIndexTuple(oitup);
+		Assert(BTreeTupleIsPivot(state->btps_minkey));
 
 		/*
 		 * Set the sibling links for both pages.
@@ -1043,6 +1051,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(state->btps_minkey == NULL);
 		state->btps_minkey = CopyIndexTuple(itup);
 		/* _bt_sortaddtup() will perform full truncation later */
+		BTreeTupleClearBtIsPosting(state->btps_minkey);
 		BTreeTupleSetNAtts(state->btps_minkey, 0);
 	}
 
@@ -1127,6 +1136,91 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
 }
 
+/*
+ * Add new tuple (posting or non-posting) to the page while building index.
+ */
+static void
+_bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+					 BTDedupState *dedupState)
+{
+	IndexTuple	to_insert;
+
+	/* Return, if there is no tuple to insert */
+	if (state == NULL)
+		return;
+
+	if (dedupState->ntuples == 0)
+		to_insert = dedupState->itupprev;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(dedupState->itupprev,
+											 dedupState->ipd,
+											 dedupState->ntuples);
+		to_insert = postingtuple;
+		pfree(dedupState->ipd);
+	}
+
+	_bt_buildadd(wstate, state, to_insert);
+
+	if (dedupState->ntuples > 0)
+		pfree(to_insert);
+	dedupState->ntuples = 0;
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in dedupState.
+ *
+ * Helper function for _bt_load() and _bt_dedup_one_page().
+ *
+ * Note: caller is responsible for size check to ensure that resulting tuple
+ * won't exceed BTMaxItemSize.
+ */
+void
+_bt_add_posting_item(BTDedupState *dedupState, IndexTuple itup)
+{
+	int			nposting = 0;
+
+	if (dedupState->ntuples == 0)
+	{
+		dedupState->ipd = palloc0(dedupState->maxitemsize);
+
+		if (BTreeTupleIsPosting(dedupState->itupprev))
+		{
+			/* if itupprev is posting, add all its TIDs to the posting list */
+			nposting = BTreeTupleGetNPosting(dedupState->itupprev);
+			memcpy(dedupState->ipd,
+				   BTreeTupleGetPosting(dedupState->itupprev),
+				   sizeof(ItemPointerData) * nposting);
+			dedupState->ntuples += nposting;
+		}
+		else
+		{
+			memcpy(dedupState->ipd, dedupState->itupprev,
+				   sizeof(ItemPointerData));
+			dedupState->ntuples++;
+		}
+	}
+
+	if (BTreeTupleIsPosting(itup))
+	{
+		/* if tuple is posting, add all its TIDs to the posting list */
+		nposting = BTreeTupleGetNPosting(itup);
+		memcpy(dedupState->ipd + dedupState->ntuples,
+			   BTreeTupleGetPosting(itup),
+			   sizeof(ItemPointerData) * nposting);
+		dedupState->ntuples += nposting;
+	}
+	else
+	{
+		memcpy(dedupState->ipd + dedupState->ntuples, itup,
+			   sizeof(ItemPointerData));
+		dedupState->ntuples++;
+	}
+}
+
 /*
  * Read tuples in correct sort order from tuplesort, and load them into
  * btree leaves.
@@ -1141,9 +1235,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	bool		load1;
 	TupleDesc	tupdes = RelationGetDescr(wstate->index);
 	int			i,
-				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+				natts = IndexRelationGetNumberOfAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		deduplicate = false;
+	BTDedupState *dedupState = NULL;
+
+	/*
+	 * Don't use deduplication for indexes with INCLUDEd columns and unique
+	 * indexes
+	 */
+	deduplicate = (IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+				   IndexRelationGetNumberOfAttributes(wstate->index) &&
+				   !wstate->index->rd_index->indisunique);
 
 	if (merge)
 	{
@@ -1257,19 +1362,88 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	}
 	else
 	{
-		/* merge is unnecessary */
-		while ((itup = tuplesort_getindextuple(btspool->sortstate,
-											   true)) != NULL)
+		if (!deduplicate)
 		{
-			/* When we see first tuple, create first index page */
-			if (state == NULL)
-				state = _bt_pagestate(wstate, 0);
+			/* merge is unnecessary */
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+					state = _bt_pagestate(wstate, 0);
 
-			_bt_buildadd(wstate, state, itup);
+				_bt_buildadd(wstate, state, itup);
 
-			/* Report progress */
-			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
-										 ++tuples_done);
+				/* Report progress */
+				pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+											 ++tuples_done);
+			}
+		}
+		else
+		{
+			/* init deduplication state needed to build posting tuples */
+			dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+			dedupState->ipd = NULL;
+			dedupState->ntuples = 0;
+			dedupState->itupprev = NULL;
+			dedupState->maxitemsize = 0;
+			dedupState->maxpostingsize = 0;
+
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+				{
+					state = _bt_pagestate(wstate, 0);
+					dedupState->maxitemsize = BTMaxItemSize(state->btps_page);
+				}
+
+				if (dedupState->itupprev != NULL)
+				{
+					int			n_equal_atts = _bt_keep_natts_fast(wstate->index,
+																   dedupState->itupprev, itup);
+
+					if (n_equal_atts > natts)
+					{
+						/*
+						 * Tuples are equal. Create or update posting.
+						 *
+						 * Else If posting is too big, insert it on page and
+						 * continue.
+						 */
+						if ((dedupState->ntuples + 1) * sizeof(ItemPointerData) <
+							dedupState->maxpostingsize)
+							_bt_add_posting_item(dedupState, itup);
+						else
+							_bt_buildadd_posting(wstate, state, dedupState);
+					}
+					else
+					{
+						/*
+						 * Tuples are not equal. Insert itupprev into index.
+						 * Save current tuple for the next iteration.
+						 */
+						_bt_buildadd_posting(wstate, state, dedupState);
+					}
+				}
+
+				/*
+				 * Save the tuple to compare it with the next one and maybe
+				 * unite them into a posting tuple.
+				 */
+				if (dedupState->itupprev)
+					pfree(dedupState->itupprev);
+				dedupState->itupprev = CopyIndexTuple(itup);
+
+				/* compute max size of posting list */
+				dedupState->maxpostingsize = dedupState->maxitemsize -
+					IndexInfoFindDataOffset(dedupState->itupprev->t_info) -
+					MAXALIGN(IndexTupleSize(dedupState->itupprev));
+			}
+
+			/* Handle the last item */
+			_bt_buildadd_posting(wstate, state, dedupState);
 		}
 	}
 
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1c1029b6c4..54cecc85c5 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
 	state.minfirstrightsz = SIZE_MAX;
 	state.newitemoff = newitemoff;
 
+	/* newitem cannot be a posting list item */
+	Assert(!BTreeTupleIsPosting(newitem));
+
 	/*
 	 * maxsplits should never exceed maxoff because there will be at most as
 	 * many candidate split points as there are points _between_ tuples, once
@@ -459,17 +462,52 @@ _bt_recsplitloc(FindSplitData *state,
 	int16		leftfree,
 				rightfree;
 	Size		firstrightitemsz;
+	Size		postingsubhikey = 0;
 	bool		newitemisfirstonright;
 
 	/* Is the new item going to be the first item on the right page? */
 	newitemisfirstonright = (firstoldonright == state->newitemoff
 							 && !newitemonleft);
 
+	/*
+	 * FIXME: Accessing every single tuple like this adds cycles to cases that
+	 * cannot possibly benefit (i.e. cases where we know that there cannot be
+	 * posting lists).  Maybe we should add a way to not bother when we are
+	 * certain that this is the case.
+	 *
+	 * We could either have _bt_split() pass us a flag, or invent a page flag
+	 * that indicates that the page might have posting lists, as an
+	 * optimization.  There is no shortage of btpo_flags bits for stuff like
+	 * this.
+	 */
 	if (newitemisfirstonright)
+	{
 		firstrightitemsz = state->newitemsz;
+
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+			postingsubhikey = IndexTupleSize(state->newitem) -
+				BTreeTupleGetPostingOffset(state->newitem);
+	}
 	else
+	{
 		firstrightitemsz = firstoldonrightsz;
 
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf)
+		{
+			ItemId		itemid;
+			IndexTuple	newhighkey;
+
+			itemid = PageGetItemId(state->page, firstoldonright);
+			newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+			if (BTreeTupleIsPosting(newhighkey))
+				postingsubhikey = IndexTupleSize(newhighkey) -
+					BTreeTupleGetPostingOffset(newhighkey);
+		}
+	}
+
 	/* Account for all the old tuples */
 	leftfree = state->leftspace - olddataitemstoleft;
 	rightfree = state->rightspace -
@@ -492,9 +530,13 @@ _bt_recsplitloc(FindSplitData *state,
 	 * adding a heap TID to the left half's new high key when splitting at the
 	 * leaf level.  In practice the new high key will often be smaller and
 	 * will rarely be larger, but conservatively assume the worst case.
+	 * Truncation always truncates away any posting list that appears in the
+	 * first right tuple, though, so it's safe to subtract that overhead
+	 * (while still conservatively assuming that truncation might have to add
+	 * back a single heap TID using the pivot tuple heap TID representation).
 	 */
 	if (state->is_leaf)
-		leftfree -= (int16) (firstrightitemsz +
+		leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
 							 MAXALIGN(sizeof(ItemPointerData)));
 	else
 		leftfree -= (int16) firstrightitemsz;
@@ -691,7 +733,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
 	tup = (IndexTuple) PageGetItem(state->page, itemid);
 	/* Do cheaper test first */
-	if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+	if (BTreeTupleIsPosting(tup) ||
+		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
 	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 9b172c1a19..13c767164d 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -97,8 +97,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	indoption = rel->rd_indoption;
 	tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
 
-	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
 	/*
 	 * We'll execute search using scan key constructed on key columns.
 	 * Truncated attributes and non-key attributes are omitted from the final
@@ -110,9 +108,20 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	key->anynullkeys = false;	/* initial assumption */
 	key->nextkey = false;
 	key->pivotsearch = false;
+	key->scantid = NULL;
 	key->keysz = Min(indnkeyatts, tupnatts);
-	key->scantid = key->heapkeyspace && itup ?
-		BTreeTupleGetHeapTID(itup) : NULL;
+
+	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+	Assert(!itup || !BTreeTupleIsPosting(itup) || key->heapkeyspace);
+
+	/*
+	 * When caller passes a tuple with a heap TID, use it to set scantid.
+	 * Note that this handles posting list tuples by setting scantid to the
+	 * lowest heap TID in the posting list.
+	 */
+	if (itup && key->heapkeyspace)
+		key->scantid = BTreeTupleGetHeapTID(itup);
+
 	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
@@ -1787,7 +1796,9 @@ _bt_killitems(IndexScanDesc scan)
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			/* Never mark line pointers for posting list tuples */
+			if (!BTreeTupleIsPosting(ituple) &&
+				(ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
 			{
 				/* found the item */
 				ItemIdMarkDead(iid);
@@ -2145,6 +2156,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(firstright))
+		{
+			BTreeTupleClearBtIsPosting(pivot);
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			if (keepnatts == natts)
+			{
+				/*
+				 * index_truncate_tuple() just returned a copy of the
+				 * original, so make sure that the size of the new pivot tuple
+				 * doesn't have posting list overhead
+				 */
+				pivot->t_info &= ~INDEX_SIZE_MASK;
+				pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+			}
+		}
+
+		Assert(!BTreeTupleIsPosting(pivot));
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2161,6 +2190,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * attribute to the new pivot tuple.
 		 */
 		Assert(natts != nkeyatts);
+		Assert(!BTreeTupleIsPosting(lastleft));
+		Assert(!BTreeTupleIsPosting(firstright));
 		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
 		tidpivot = palloc0(newsize);
 		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2168,6 +2199,26 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		pfree(pivot);
 		pivot = tidpivot;
 	}
+	else if (BTreeTupleIsPosting(firstright))
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal.  We
+		 * can always truncate away a posting list, though.
+		 *
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+			MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+		pivot->t_info &= ~INDEX_SIZE_MASK;
+		pivot->t_info |= newsize;
+		BTreeTupleClearBtIsPosting(pivot);
+		BTreeTupleSetAltHeapTID(pivot);
+
+		Assert(!BTreeTupleIsPosting(pivot));
+	}
 	else
 	{
 		/*
@@ -2175,7 +2226,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * It's necessary to add a heap TID attribute to the new pivot tuple.
 		 */
 		Assert(natts == nkeyatts);
-		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+		newsize = MAXALIGN(IndexTupleSize(firstright)) +
+			MAXALIGN(sizeof(ItemPointerData));
 		pivot = palloc0(newsize);
 		memcpy(pivot, firstright, IndexTupleSize(firstright));
 	}
@@ -2205,7 +2257,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2216,9 +2268,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+							  BTreeTupleGetHeapTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #else
 
 	/*
@@ -2231,7 +2286,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2240,7 +2295,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2330,6 +2386,18 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * leaving excessive amounts of free space on either side of page split.
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts.
+ *
+ * When an index only uses opclasses where equality is "precise", this
+ * function is guaranteed to give the same result as _bt_keep_natts().  This
+ * makes it safe to use this function to determine whether or not two tuples
+ * can be folded together into a single posting tuple.  Posting list
+ * deduplication cannot be used with nondeterministic collations for this
+ * reason.
+ *
+ * FIXME: Actually invent the needed "equality-is-precise" opclass
+ * infrastructure.  See dedicated -hackers thread:
+ *
+ * https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2354,8 +2422,38 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 		if (isNull1 != isNull2)
 			break;
 
+		/*
+		 * XXX: The ideal outcome from the point of view of the posting list
+		 * patch is that the definition of an opclass with "precise equality"
+		 * becomes: "equality operator function must give exactly the same
+		 * answer as datum_image_eq() would, provided that we aren't using a
+		 * nondeterministic collation". (Nondeterministic collations are
+		 * clearly not compatible with deduplication.)
+		 *
+		 * This will be a lot faster than actually using the authoritative
+		 * insertion scankey in some cases.  This approach also seems more
+		 * elegant, since suffix truncation gets to follow exactly the same
+		 * definition of "equal" as posting list deduplication -- there is a
+		 * subtle interplay between deduplication and suffix truncation, and
+		 * it would be nice to know for sure that they have exactly the same
+		 * idea about what equality is.
+		 *
+		 * This ideal outcome still avoids problems with TOAST.  We cannot
+		 * repeat bugs like the amcheck bug that was fixed in bugfix commit
+		 * eba775345d23d2c999bbb412ae658b6dab36e3e8.  datum_image_eq()
+		 * considers binary equality, though only _after_ each datum is
+		 * decompressed.
+		 *
+		 * If this ideal solution isn't possible, then we can fall back on
+		 * defining "precise equality" as: "type's output function must
+		 * produce identical textual output for any two datums that compare
+		 * equal when using a safe/equality-is-precise operator class (unless
+		 * using a nondeterministic collation)".  That would mean that we'd
+		 * have to make deduplication call _bt_keep_natts() instead (or some
+		 * other function that uses authoritative insertion scankey).
+		 */
 		if (!isNull1 &&
-			!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+			!datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
 			break;
 
 		keepnatts++;
@@ -2415,7 +2513,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * Non-pivot tuples currently never use alternative heap TID
 			 * representation -- even those within heapkeyspace indexes
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
@@ -2470,7 +2568,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * that to decide if the tuple is a pre-v11 tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
+				(!BTreeTupleIsPivot(itup) &&
 				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
 		}
 		else
@@ -2497,7 +2595,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
 		return false;
 
 	/*
@@ -2567,11 +2665,87 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					BTMaxItemSizeNoHeapTid(page),
 					RelationGetRelationName(rel)),
 			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
-					   ItemPointerGetBlockNumber(&newtup->t_tid),
-					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+					   ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
 					   RelationGetRelationName(heap)),
 			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
 					 "Consider a function index of an MD5 hash of the value, "
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Given a basic tuple that contains key datum and posting list,
+ * build a posting tuple.
+ *
+ * Basic tuple can be a posting tuple, but we only use key part of it,
+ * all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1 fallback to building a non-posting tuple.
+ * It is necessary to avoid storage overhead after posting tuple was vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+	uint32		keysize,
+				newsize = 0;
+	IndexTuple	itup;
+
+	/* We only need key part of the tuple */
+	if (BTreeTupleIsPosting(tuple))
+		keysize = BTreeTupleGetPostingOffset(tuple);
+	else
+		keysize = IndexTupleSize(tuple);
+
+	Assert(nipd > 0);
+
+	/* Add space needed for posting list */
+	if (nipd > 1)
+		newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+	else
+		newsize = keysize;
+
+	newsize = MAXALIGN(newsize);
+	itup = palloc0(newsize);
+	memcpy(itup, tuple, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+
+	if (nipd > 1)
+	{
+		/* Form posting tuple, fill posting fields */
+
+		/* Set meta info about the posting list */
+		itup->t_info |= INDEX_ALT_TID_MASK;
+		BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+		/* sort the list to preserve TID order invariant */
+		qsort((void *) ipd, nipd, sizeof(ItemPointerData),
+			  (int (*) (const void *, const void *)) ItemPointerCompare);
+
+		/* Copy posting list into the posting tuple */
+		memcpy(BTreeTupleGetPosting(itup), ipd,
+			   sizeof(ItemPointerData) * nipd);
+	}
+	else
+	{
+		/* To finish building of a non-posting tuple, copy TID from ipd */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		ItemPointerCopy(ipd, &itup->t_tid);
+	}
+
+	return itup;
+}
+
+/*
+ * Opposite of BTreeFormPostingTuple.
+ * returns regular tuple that contains the key,
+ * the tid of the new tuple is the nth tid of original tuple's posting list
+ * result tuple palloc'd in a caller's context.
+ */
+IndexTuple
+BTreeGetNthTupleOfPosting(IndexTuple tuple, int n)
+{
+	Assert(BTreeTupleIsPosting(tuple));
+	return BTreeFormPostingTuple(tuple, BTreeTupleGetPostingN(tuple, n), 1);
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c1aa..d4d7c09ff0 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -178,12 +178,34 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 	{
 		Size		datalen;
 		char	   *datapos = XLogRecGetBlockData(record, 0, &datalen);
+		IndexTuple	nposting = NULL;
 
 		page = BufferGetPage(buffer);
 
-		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
-						false, false) == InvalidOffsetNumber)
-			elog(PANIC, "btree_xlog_insert: failed to add item");
+		if (xlrec->postingsz > 0)
+		{
+			IndexTuple	oposting;
+
+			Assert(isleaf);
+
+			/* oposting must be at offset before new item */
+			oposting = (IndexTuple) PageGetItem(page,
+												PageGetItemId(page, OffsetNumberPrev(xlrec->offnum)));
+			if (PageAddItem(page, (Item) datapos, xlrec->postingsz,
+							xlrec->offnum, false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+			nposting = (IndexTuple) (datapos + xlrec->postingsz);
+
+			Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+				   MAXALIGN(IndexTupleSize(nposting)));
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+		}
+		else
+		{
+			if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+		}
 
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
@@ -265,9 +287,11 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 		BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
 		OffsetNumber off;
 		IndexTuple	newitem = NULL,
-					left_hikey = NULL;
+					left_hikey = NULL,
+					nposting = NULL;
 		Size		newitemsz = 0,
-					left_hikeysz = 0;
+					left_hikeysz = 0,
+					npostingsz = 0;
 		Page		newlpage;
 		OffsetNumber leftoff;
 
@@ -281,6 +305,17 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 			datalen -= newitemsz;
 		}
 
+		if (xlrec->replacepostingoff)
+		{
+			Assert(xlrec->replacepostingoff ==
+				   OffsetNumberPrev(xlrec->newitemoff));
+
+			nposting = (IndexTuple) datapos;
+			npostingsz = MAXALIGN(IndexTupleSize(nposting));
+			datapos += npostingsz;
+			datalen -= npostingsz;
+		}
+
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
 		left_hikey = (IndexTuple) datapos;
 		left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
@@ -304,6 +339,15 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 			Size		itemsz;
 			IndexTuple	item;
 
+			if (off == xlrec->replacepostingoff)
+			{
+				if (PageAddItem(newlpage, (Item) nposting, npostingsz,
+								leftoff, false, false) == InvalidOffsetNumber)
+					elog(ERROR, "failed to add new item to left page after split");
+				leftoff = OffsetNumberNext(leftoff);
+				continue;
+			}
+
 			/* add the new item if it was inserted on left page */
 			if (onleft && off == xlrec->newitemoff)
 			{
@@ -386,8 +430,8 @@ btree_xlog_vacuum(XLogReaderState *record)
 	Buffer		buffer;
 	Page		page;
 	BTPageOpaque opaque;
-#ifdef UNUSED
 	xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
 
 	/*
 	 * This section of code is thought to be no longer needed, after analysis
@@ -478,14 +522,34 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		if (len > 0)
 		{
-			OffsetNumber *unused;
-			OffsetNumber *unend;
+			if (xlrec->nremaining)
+			{
+				OffsetNumber *remainingoffset;
+				IndexTuple	remaining;
+				Size		itemsz;
 
-			unused = (OffsetNumber *) ptr;
-			unend = (OffsetNumber *) ((char *) ptr + len);
+				remainingoffset = (OffsetNumber *)
+					(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+				remaining = (IndexTuple) ((char *) remainingoffset +
+										  xlrec->nremaining * sizeof(OffsetNumber));
 
-			if ((unend - unused) > 0)
-				PageIndexMultiDelete(page, unused, unend - unused);
+				/* Handle posting tuples */
+				for (int i = 0; i < xlrec->nremaining; i++)
+				{
+					PageIndexTupleDelete(page, remainingoffset[i]);
+
+					itemsz = MAXALIGN(IndexTupleSize(remaining));
+
+					if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+									false, false) == InvalidOffsetNumber)
+						elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+					remaining = (IndexTuple) ((char *) remaining + itemsz);
+				}
+			}
+
+			if (xlrec->ndeleted)
+				PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 		}
 
 		/*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index a14eb792ec..6f71b13199 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -30,7 +30,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_insert *xlrec = (xl_btree_insert *) rec;
 
-				appendStringInfo(buf, "off %u", xlrec->offnum);
+				appendStringInfo(buf, "off %u; postingsz %u",
+								 xlrec->offnum, xlrec->postingsz);
 				break;
 			}
 		case XLOG_BTREE_SPLIT_L:
@@ -38,6 +39,7 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
+				/* FIXME: even master doesn't have newitemoff */
 				appendStringInfo(buf, "level %u, firstright %d",
 								 xlrec->level, xlrec->firstright);
 				break;
@@ -46,8 +48,10 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "lastBlockVacuumed %u",
-								 xlrec->lastBlockVacuumed);
+				appendStringInfo(buf, "lastBlockVacuumed %u; nremaining %u; ndeleted %u",
+								 xlrec->lastBlockVacuumed,
+								 xlrec->nremaining,
+								 xlrec->ndeleted);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 52eafe6b00..a3dec41f0a 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
  * All other types of index tuples ("pivot" tuples) only have key columns,
  * since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,38 @@ typedef struct BTMetaPageData
  * omitted rather than truncated, since its representation is different to
  * the non-pivot representation.)
  *
+ * Non-pivot posting tuple format:
+ *  t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicated keys more effectively, we use special format
+ * of tuples - posting tuples.  posting_list is an array of ItemPointerData.
+ *
+ * Deduplication never applies to unique indexes or indexes with INCLUDEd
+ * columns.
+ *
+ * To differ posting tuples we use INDEX_ALT_TID_MASK flag in t_info and
+ * BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's tid:
+ * - t_tid.ip_blkid contains offset of the posting list.
+ * - t_tid offset field contains number of posting items this tuple contain
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * tuples, which is constrainted by BTMaxItemSize.
+
+ * If page contains so many duplicates, that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), page may contain several posting
+ * tuples with the same key.
+ * Also page can contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
  * Pivot tuple format:
  *
  *  t_tid | t_info | key values | [heap TID]
@@ -281,23 +312,144 @@ typedef struct BTMetaPageData
  * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
  * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
  * number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinct posting tuples from pivot tuples.
  *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_N_POSTING_OFFSET_MASK	0x0FFF
 #define BT_HEAP_TID_ATTR			0x1000
+#define BT_IS_POSTING				0x2000
 
-/* Get/set downlink block number */
+#define BTreeTupleIsPosting(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+	)
+
+#define BTreeTupleIsPivot(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+	)
+
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger then
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+	((int) ((BLCKSZ - SizeOfPageHeaderData - \
+			3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+			(sizeof(ItemPointerData)))
+
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * Iterating over tuples during index build or applying deduplication to a
+ * single page, we remember a tuple in itupprev, then compare the next one
+ * with it.  If tuples are equal, save their TIDs in the posting list.
+ * ntuples contains the size of the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTDedupState
+{
+	Size		maxitemsize;
+	Size		maxpostingsize;
+	IndexTuple	itupprev;
+	int			ntuples;
+	ItemPointerData *ipd;
+} BTDedupState;
+
+/* macros to work with posting tuples *BEGIN* */
+#define BTreeTupleSetBtIsPosting(itup) \
+	do { \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleGetNPosting(itup)	\
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+	)
+
+#define BTreeTupleSetNPosting(itup, n) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+		BTreeTupleSetBtIsPosting(itup); \
+	} while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list.
+ * Caller is responsible for checking BTreeTupleIsPosting to ensure that it
+ * will get what is expected.
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+	)
+#define BTreeTupleSetPostingOffset(itup, offset) \
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerSetBlockNumber(&((itup)->t_tid), (offset)) \
+	)
+#define BTreeSetPostingMeta(itup, nposting, off) \
+	do { \
+		BTreeTupleSetNPosting(itup, nposting); \
+		BTreeTupleSetPostingOffset(itup, off); \
+	} while(0)
+
+#define BTreeTupleGetPosting(itup) \
+	(ItemPointer) ((char*) (itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+	(BTreeTupleGetPosting(itup) + (n))
+
+/*
+ * Posting tuples always contain more than one TID.  The minimum TID can be
+ * accessed using BTreeTupleGetHeapTID().  The maximum is accessed using
+ * BTreeTupleGetMaxTID().
+ */
+#define BTreeTupleGetMaxTID(itup) \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+		( \
+			(ItemPointer) (BTreeTupleGetPosting(itup) + (BTreeTupleGetNPosting(itup)-1)) \
+		) \
+		: \
+		(ItemPointer) &((itup)->t_tid) \
+	)
+/* macros to work with posting tuples *END* */
+
+/* Get/set downlink block number  */
 #define BTreeInnerTupleGetDownLink(itup) \
 	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
 #define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,7 +478,8 @@ typedef struct BTMetaPageData
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
 		( \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
@@ -335,6 +488,7 @@ typedef struct BTMetaPageData
 	)
 #define BTreeTupleSetNAtts(itup, n) \
 	do { \
+		Assert(!BTreeTupleIsPosting(itup)); \
 		(itup)->t_info |= INDEX_ALT_TID_MASK; \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
 	} while(0)
@@ -342,6 +496,8 @@ typedef struct BTMetaPageData
 /*
  * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
  * and non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * For non-pivot posting tuples this returns the first tid from posting list.
  */
 #define BTreeTupleGetHeapTID(itup) \
 	( \
@@ -351,7 +507,10 @@ typedef struct BTMetaPageData
 		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
 					   sizeof(ItemPointerData)) \
 	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+	  : (itup)->t_info & INDEX_ALT_TID_MASK ? \
+		(((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0) ? \
+		(ItemPointer) BTreeTupleGetPosting(itup) : NULL) \
+		: (ItemPointer) &((itup)->t_tid) \
 	)
 /*
  * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
@@ -360,6 +519,7 @@ typedef struct BTMetaPageData
 #define BTreeTupleSetAltHeapTID(itup) \
 	do { \
 		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
 								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
 	} while(0)
@@ -499,6 +659,12 @@ typedef struct BTInsertStateData
 	/* Buffer containing leaf page we're likely to insert itup on */
 	Buffer		buf;
 
+	/*
+	 * if _bt_binsrch_insert() found the location inside existing posting
+	 * list, save the position inside the list.
+	 */
+	int			in_posting_offset;
+
 	/*
 	 * Cache of bounds within the current buffer.  Only used for insertions
 	 * where _bt_check_unique is called.  See _bt_binsrch_insert and
@@ -534,7 +700,9 @@ typedef BTInsertStateData *BTInsertState;
  * If we are doing an index-only scan, we save the entire IndexTuple for each
  * matched item, otherwise only its heap TID and offset.  The IndexTuples go
  * into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array.  Posting list tuples store a version of the
+ * tuple that does not include the posting list, allowing the same key to be
+ * returned for each logical tuple associated with the posting list.
  */
 
 typedef struct BTScanPosItem	/* what we remember about each match */
@@ -563,9 +731,13 @@ typedef struct BTScanPosData
 
 	/*
 	 * If we are doing an index-only scan, nextTupleOffset is the first free
-	 * location in the associated tuple storage workspace.
+	 * location in the associated tuple storage workspace.  Posting list
+	 * tuples need postingTupleOffset to store the current location of the
+	 * tuple that is returned multiple times (once per heap TID in posting
+	 * list).
 	 */
 	int			nextTupleOffset;
+	int			postingTupleOffset;
 
 	/*
 	 * The items array is always ordered in index order (ie, increasing
@@ -578,7 +750,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxPostingIndexTuplesPerPage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -762,6 +934,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems, Relation heapRel);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems,
+								OffsetNumber *remainingoffset,
+								IndexTuple *remaining, int nremaining,
 								BlockNumber lastBlockVacuumed);
 extern int	_bt_pagedel(Relation rel, Buffer buf);
 
@@ -812,6 +986,9 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd,
+										int nipd);
+extern IndexTuple BTreeGetNthTupleOfPosting(IndexTuple tuple, int n);
 
 /*
  * prototypes for functions in nbtvalidate.c
@@ -824,5 +1001,6 @@ extern bool btvalidate(Oid opclassoid);
 extern IndexBuildResult *btbuild(Relation heap, Relation index,
 								 struct IndexInfo *indexInfo);
 extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+extern void _bt_add_posting_item(BTDedupState *dedupState, IndexTuple itup);
 
 #endif							/* NBTREE_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index afa614da25..daa931377f 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -61,16 +61,26 @@ typedef struct xl_btree_metadata
  * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
  * Note that INSERT_META implies it's not a leaf page.
  *
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page (data contains the inserted tuple);
+ *				 if postingsz is not 0, data also contains 'nposting' -
+ *				 tuple to replace original.
+ *
+ *				 TODO probably it would be enough to keep just a flag to point
+ *				 out that data contains 'nposting' and compute its offset as
+ *				 we know it follows the tuple, but I am afraid that it will
+ *				 break alignment, will it?
+ *
  * Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
  * Backup Blk 2: xl_btree_metadata, if INSERT_META
+ *
  */
 typedef struct xl_btree_insert
 {
 	OffsetNumber offnum;
+	uint32		postingsz;
 } xl_btree_insert;
 
-#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, postingsz) + sizeof(uint32))
 
 /*
  * On insert with split, we save all the items going into the right sibling
@@ -96,6 +106,12 @@ typedef struct xl_btree_insert
  * An IndexTuple representing the high key of the left page must follow with
  * either variant.
  *
+ * In case, split included insertion into the middle of the posting tuple, and
+ * thus required posting tuple replacement, it also contains 'nposting',
+ * which must replace original posting tuple at replaceitemoff offset.
+ * TODO further optimization is to add it to xlog only if it remains on the
+ * left page.
+ *
  * Backup Blk 1: new right page
  *
  * The right page's data portion contains the right page's tuples in the form
@@ -113,9 +129,10 @@ typedef struct xl_btree_split
 	uint32		level;			/* tree level of page being split */
 	OffsetNumber firstright;	/* first item moved to right page */
 	OffsetNumber newitemoff;	/* new item's offset (if placed on left page) */
+	OffsetNumber replacepostingoff; /* offset of the posting item to replace */
 } xl_btree_split;
 
-#define SizeOfBtreeSplit	(offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit	(offsetof(xl_btree_split, replacepostingoff) + sizeof(OffsetNumber))
 
 /*
  * This is what we need to know about delete of individual leaf index tuples.
@@ -173,10 +190,19 @@ typedef struct xl_btree_vacuum
 {
 	BlockNumber lastBlockVacuumed;
 
-	/* TARGET OFFSET NUMBERS FOLLOW */
+	/*
+	 * This field helps us to find beginning of the remaining tuples from
+	 * postings which follow array of offset numbers.
+	 */
+	uint32		nremaining;
+	uint32		ndeleted;
+
+	/* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+	/* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
+	/* TARGET OFFSET NUMBERS FOLLOW (if any) */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(BlockNumber))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.
diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index ec47a228ae..71a03e39d3 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -212,3 +212,24 @@
    Memcheck:Cond
    fun:PyObject_Realloc
 }
+
+# Temporarily work around bug in datum_image_eq's handling of the cstring
+# (typLen == -2) case.  datumIsEqual() is not affected, but also doesn't handle
+# TOAST'ed values correctly.
+#
+# FIXME: Remove both suppressions when bug is fixed on master branch
+{
+   temporary_workaround_1
+   Memcheck:Addr1
+   fun:bcmp
+   fun:datum_image_eq
+   fun:_bt_keep_natts_fast
+}
+
+{
+   temporary_workaround_8
+   Memcheck:Addr8
+   fun:bcmp
+   fun:datum_image_eq
+   fun:_bt_keep_natts_fast
+}
-- 
2.17.1

#79

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Peter Geoghegan (#78)

2 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Mon, Sep 2, 2019 at 6:53 PM Peter Geoghegan <pg@bowt.ie> wrote:

Attach is v10, which fixes the Valgrind issue.

Attached is v11, which makes the kill_prior_tuple optimization work
with posting list tuples. The only catch is that it can only work when
all "logical tuples" within a posting list are known-dead, since of
course there is only one LP_DEAD bit available for each posting list.

The hardest part of this kill_prior_tuple work was writing the new
_bt_killitems() code, which I'm still not 100% happy with. Still, it
seems to work well -- new pageinspect LP_DEAD status info was added to
the second patch to verify that we're setting LP_DEAD bits as needed
for posting list tuples. I also had to add a new nbtree-specific,
posting-list-aware version of index_compute_xid_horizon_for_tuples()
-- _bt_compute_xid_horizon_for_tuples(). Finally, it was necessary to
avoid splitting a posting list with the LP_DEAD bit set. I took a
naive approach to avoiding that problem, adding code to
_bt_findinsertloc() to prevent it. Posting list splits are generally
assumed to be rare, so the fact that this is slightly inefficient
should be fine IMV.

I also refactored deduplication itself in anticipation of making the
WAL logging more efficient, and incremental. So, the structure of the
code within _bt_dedup_one_page() was simplified, without really
changing it very much (I think). I also fixed a bug in
_bt_dedup_one_page(). The check for dead items was broken in previous
versions, because the loop examined the high key tuple in every
iteration.

Making _bt_dedup_one_page() more efficient and incremental is still
the most important open item for the patch.

--
Peter Geoghegan

Attachments:

v11-0001-Add-deduplication-to-nbtree.patchapplication/octet-stream; name=v11-0001-Add-deduplication-to-nbtree.patchDownload

From c07e06ff1ee2a0c595cdf773546c69940db73dd6 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 29 Aug 2019 14:35:35 -0700
Subject: [PATCH v11 1/2] Add deduplication to nbtree.

---
 contrib/amcheck/verify_nbtree.c         | 128 +++++--
 src/backend/access/nbtree/README        |  76 +++-
 src/backend/access/nbtree/nbtinsert.c   | 462 +++++++++++++++++++++++-
 src/backend/access/nbtree/nbtpage.c     | 148 +++++++-
 src/backend/access/nbtree/nbtree.c      | 147 ++++++--
 src/backend/access/nbtree/nbtsearch.c   | 247 ++++++++++++-
 src/backend/access/nbtree/nbtsort.c     | 219 ++++++++++-
 src/backend/access/nbtree/nbtsplitloc.c |  47 ++-
 src/backend/access/nbtree/nbtutils.c    | 264 ++++++++++++--
 src/backend/access/nbtree/nbtxlog.c     |  88 ++++-
 src/backend/access/rmgrdesc/nbtdesc.c   |  10 +-
 src/include/access/nbtree.h             | 242 +++++++++++--
 src/include/access/nbtxlog.h            |  36 +-
 src/tools/valgrind.supp                 |  21 ++
 14 files changed, 1957 insertions(+), 178 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d678ed..399743d4d6 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -924,6 +924,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer	scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -994,29 +995,73 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
 			char	   *itid,
 					   *htid;
+			ItemPointer tid = BTreeTupleGetHeapTID(itup);
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+							ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("could not find tuple using search from root page in index \"%s\"",
 							RelationGetRelationName(state->rel)),
-					 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+					 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
 										itid, htid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is actually a posting list, make sure posting list TIDs
+		 * are in order.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer		current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+				current = BTreeTupleGetPostingN(itup, i);
+
+				if (ItemPointerCompare(current, &last) <= 0)
+				{
+					char	   *itid,
+							   *htid;
+
+					itid = psprintf("(%u,%u)", state->targetblock, offset);
+					htid = psprintf("(%u,%u)",
+									ItemPointerGetBlockNumberNoCheck(current),
+									ItemPointerGetOffsetNumberNoCheck(current));
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+												itid, htid,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1074,12 +1119,33 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				IndexTuple	onetup;
+
+				/* Fingerprint all elements of posting tuple one by one */
+				for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+					norm = bt_normalize_tuple(state, onetup);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != onetup)
+						pfree(norm);
+					pfree(onetup);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
@@ -1087,7 +1153,8 @@ bt_target_page_check(BtreeCheckState *state)
 		 *
 		 * If there is a high key (if this is not the rightmost page on its
 		 * entire level), check that high key actually is upper bound on all
-		 * page items.
+		 * page items.  If this is a posting list tuple, we'll need to set
+		 * scantid to be highest TID in posting list.
 		 *
 		 * We prefer to check all items against high key rather than checking
 		 * just the last and trusting that the operator class obeys the
@@ -1127,6 +1194,9 @@ bt_target_page_check(BtreeCheckState *state)
 		 * tuple. (See also: "Notes About Data Representation" in the nbtree
 		 * README.)
 		 */
+		scantid = skey->scantid;
+		if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+			skey->scantid = BTreeTupleGetMaxTID(itup);
 		if (!P_RIGHTMOST(topaque) &&
 			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
 			  invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1220,7 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
+		skey->scantid = scantid;
 
 		/*
 		 * * Item order check *
@@ -1164,11 +1235,13 @@ bt_target_page_check(BtreeCheckState *state)
 					   *htid,
 					   *nitid,
 					   *nhtid;
+			ItemPointer tid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetHeapTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 			nitid = psprintf("(%u,%u)", state->targetblock,
 							 OffsetNumberNext(offset));
 
@@ -1177,9 +1250,11 @@ bt_target_page_check(BtreeCheckState *state)
 										  state->target,
 										  OffsetNumberNext(offset));
 			itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+			tid = BTreeTupleGetHeapTID(itup);
 			nhtid = psprintf("(%u,%u)",
-							 ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							 ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							 ItemPointerGetBlockNumberNoCheck(tid),
+							 ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1264,10 @@ bt_target_page_check(BtreeCheckState *state)
 										"higher index tid=%s (points to %s tid=%s) "
 										"page lsn=%X/%X.",
 										itid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										htid,
 										nitid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										nhtid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
@@ -1953,10 +2028,10 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
  * verification.  In particular, it won't try to normalize opclass-equal
  * datums with potentially distinct representations (e.g., btree/numeric_ops
  * index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though.  For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation.  Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
  */
 static IndexTuple
 bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -2087,6 +2162,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		insertstate.itup = itup;
 		insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 		insertstate.itup_key = key;
+		insertstate.in_posting_offset = 0;
 		insertstate.bounds_valid = false;
 		insertstate.buf = lbuf;
 
@@ -2094,7 +2170,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		offnum = _bt_binsrch_insert(state->rel, &insertstate);
 		/* Compare first >= matching item on leaf page, if any */
 		page = BufferGetPage(lbuf);
+		/* Should match on first heap TID when tuple has a posting list */
 		if (offnum <= PageGetMaxOffsetNumber(page) &&
+			insertstate.in_posting_offset <= 0 &&
 			_bt_compare(state->rel, key, page, offnum) == 0)
 			exists = true;
 		_bt_relbuf(state->rel, lbuf);
@@ -2560,14 +2638,18 @@ static inline ItemPointer
 BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
 							bool nonpivot)
 {
-	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	ItemPointer result;
 	BlockNumber targetblock = state->targetblock;
 
-	if (result == NULL && nonpivot)
+	/* Shouldn't be called with heapkeyspace index */
+	Assert(state->heapkeyspace);
+	if (BTreeTupleIsPivot(itup) == nonpivot)
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
 						targetblock, RelationGetRelationName(state->rel))));
 
+	result = BTreeTupleGetHeapTID(itup);
+
 	return result;
 }
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6db203e75c..50ec9ef48c 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
 like a hint bit for a heap tuple), but physically removing tuples requires
 exclusive lock.  In the current code we try to remove LP_DEAD tuples when
 we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already).  Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)
 
 This leaves the index in a state where it has no entry for a dead tuple
 that still exists in the heap.  This is not a problem for the current
@@ -710,6 +713,77 @@ the fallback strategy assumes that duplicates are mostly inserted in
 ascending heap TID order.  The page is split in a way that leaves the left
 half of the page mostly full, and the right half of the page mostly empty.
 
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits.  Deduplication alters
+the physical representation of tuples without changing the logical contents
+of the index, and without adding overhead to read queries.  Non-pivot
+tuples are folded together into a single physical tuple with a posting list
+(a simple array of heap TIDs with the standard item pointer format).
+Deduplication is always applied lazily, at the point where it would
+otherwise be necessary to perform a page split.  It occurs only when
+LP_DEAD items have been removed, as our last line of defense against
+splitting a leaf page.  We can set the LP_DEAD bit with posting list
+tuples, though only when all table tuples are known dead. (Bitmap scans
+cannot perform LP_DEAD bit setting, and are the common case with indexes
+that contain lots of duplicates, so this downside is considered
+acceptable.)
+
+Large groups of logical duplicates tend to appear together on the same leaf
+page due to the special duplicate logic used when choosing a split point.
+This facilitates lazy/dynamic deduplication.  Deduplication can reliably
+deduplicate a large localized group of duplicates before it can span
+multiple leaf pages.  Posting list tuples are subject to the same 1/3 of a
+page restriction as any other tuple.
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists.  A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple.  An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication.  Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple (lazy deduplication
+avoids rewriting posting lists repeatedly when heap TIDs are inserted
+slightly out of order by concurrent inserters).  When the incoming tuple
+really does overlap with an existing posting list, a posting list split is
+performed.  Posting list splits work in a way that more or less preserves
+the illusion that all incoming tuples do not need to be merged with any
+existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list.  The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list.  The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list.  We make
+space in the posting list by removing the heap TID that became the new
+item.  The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists.  Also, overall
+space utilization is improved by keeping existing posting lists large.
+
+The representation of posting lists is identical to the posting lists used
+by GIN, so it would be straightforward to apply GIN's varbyte encoding
+compression scheme to individual posting lists.  Posting list compression
+would break the assumptions made by posting list splits about page space
+accounting, though, so it's not clear how compression could be integrated
+with nbtree.  Besides, posting list compression does not offer a compelling
+trade-off for nbtree, since in general nbtree is optimized for consistent
+performance with many concurrent readers and writers.  A major goal of
+nbtree's lazy approach to deduplication is to limit the performance impact
+of deduplication with random updates.  Even concurrent append-only inserts
+of the same key value will tend to have inserts of individual index tuples
+in an order that doesn't quite match heap TID order.  In general, delaying
+deduplication avoids many unnecessary posting list splits, and minimizes
+page level fragmentation.
+
 Notes About Data Representation
 -------------------------------
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b84bf1c3df..bef5958465 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,21 +47,26 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   BTStack stack,
 						   IndexTuple itup,
 						   OffsetNumber newitemoff,
+						   int in_posting_offset,
 						   bool split_only_page);
 static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 						Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
-						IndexTuple newitem);
+						IndexTuple newitem, IndexTuple nposting);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 							  BTStack stack, bool is_root, bool is_only);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 						 OffsetNumber itup_off);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+							   Size itemsz);
+static void _bt_dedup_insert(Page page, BTDedupState *dedupState);
 
 /*
  *	_bt_doinsert() -- Handle insertion of a single index tuple in the tree.
  *
  *		This routine is called by the public interface routine, btinsert.
- *		By here, itup is filled in, including the TID.
+ *		By here, itup is filled in, including the TID.  Caller should be
+ *		prepared for us to scribble on 'itup'.
  *
  *		If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
  *		will allow duplicates.  Otherwise (UNIQUE_CHECK_YES or
@@ -123,6 +128,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	/* PageAddItem will MAXALIGN(), but be consistent */
 	insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 	insertstate.itup_key = itup_key;
+	insertstate.in_posting_offset = 0;
 	insertstate.bounds_valid = false;
 	insertstate.buf = InvalidBuffer;
 
@@ -300,7 +306,7 @@ top:
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
 		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
-					   itup, newitemoff, false);
+					   itup, newitemoff, insertstate.in_posting_offset, false);
 	}
 	else
 	{
@@ -435,6 +441,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 
 				/* okay, we gotta fetch the heap tuple ... */
 				curitup = (IndexTuple) PageGetItem(page, curitemid);
+				Assert(!BTreeTupleIsPosting(curitup));
 				htid = curitup->t_tid;
 
 				/*
@@ -689,6 +696,7 @@ _bt_findinsertloc(Relation rel,
 	BTScanInsert itup_key = insertstate->itup_key;
 	Page		page = BufferGetPage(insertstate->buf);
 	BTPageOpaque lpageop;
+	OffsetNumber location;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -751,13 +759,23 @@ _bt_findinsertloc(Relation rel,
 
 		/*
 		 * If the target page is full, see if we can obtain enough space by
-		 * erasing LP_DEAD items
+		 * erasing LP_DEAD items.  If that doesn't work out, and if the index
+		 * isn't a unique index, try deduplication.
 		 */
-		if (PageGetFreeSpace(page) < insertstate->itemsz &&
-			P_HAS_GARBAGE(lpageop))
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
 		{
-			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
-			insertstate->bounds_valid = false;
+			if (P_HAS_GARBAGE(lpageop))
+			{
+				_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+				insertstate->bounds_valid = false;
+			}
+
+			if (!checkingunique && PageGetFreeSpace(page) < insertstate->itemsz)
+			{
+				_bt_dedup_one_page(rel, insertstate->buf, heapRel,
+								   insertstate->itemsz);
+				insertstate->bounds_valid = false;	/* paranoia */
+			}
 		}
 	}
 	else
@@ -839,7 +857,31 @@ _bt_findinsertloc(Relation rel,
 	Assert(P_RIGHTMOST(lpageop) ||
 		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
 
-	return _bt_binsrch_insert(rel, insertstate);
+	location = _bt_binsrch_insert(rel, insertstate);
+
+	/*
+	 * Insertion is not prepared for the case where an LP_DEAD posting list
+	 * tuple must be split.  In the unlikely event that this happens, call
+	 * _bt_dedup_one_page() to force it to kill all LP_DEAD items.
+	 */
+	if (unlikely(insertstate->in_posting_offset == -1))
+	{
+		_bt_dedup_one_page(rel, insertstate->buf, heapRel, 0);
+		Assert(!P_HAS_GARBAGE(lpageop));
+
+		/* Must reset insertstate ahead of new _bt_binsrch_insert() call */
+		insertstate->bounds_valid = false;
+		insertstate->in_posting_offset = 0;
+		location = _bt_binsrch_insert(rel, insertstate);
+
+		/*
+		 * Might still have to split some other posting list now, but that
+		 * should never be LP_DEAD
+		 */
+		Assert(insertstate->in_posting_offset >= 0);
+	}
+
+	return location;
 }
 
 /*
@@ -905,10 +947,12 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		This recursive procedure does the following things:
  *
+ *			+  if necessary, splits an existing posting list on page.
+ *			   This is only needed when 'in_posting_offset' is non-zero.
  *			+  if necessary, splits the target page, using 'itup_key' for
  *			   suffix truncation on leaf pages (caller passes NULL for
  *			   non-leaf pages).
- *			+  inserts the tuple.
+ *			+  inserts the new tuple (could be from split posting list).
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
  *			   right using information stored in the parent stack).
@@ -918,7 +962,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		On entry, we must have the correct buffer in which to do the
  *		insertion, and the buffer must be pinned and write-locked.  On return,
- *		we will have dropped both the pin and the lock on the buffer.
+ *		we will have dropped both the pin and the lock on the buffer.  Caller
+ *		should be prepared for us to scribble on 'itup'.
  *
  *		This routine only performs retail tuple insertions.  'itup' should
  *		always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +981,14 @@ _bt_insertonpg(Relation rel,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
+			   int in_posting_offset,
 			   bool split_only_page)
 {
 	Page		page;
 	BTPageOpaque lpageop;
 	Size		itemsz;
+	IndexTuple	nposting = NULL;
+	IndexTuple	oposting;
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1002,8 @@ _bt_insertonpg(Relation rel,
 	Assert(P_ISLEAF(lpageop) ||
 		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
+	/* retail insertions of posting list tuples are disallowed */
+	Assert(!BTreeTupleIsPosting(itup));
 
 	/* The caller should've finished any incomplete splits already. */
 	if (P_INCOMPLETE_SPLIT(lpageop))
@@ -964,6 +1014,72 @@ _bt_insertonpg(Relation rel,
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
+	/*
+	 * Do we need to split an existing posting list item?
+	 */
+	if (in_posting_offset != 0)
+	{
+		ItemId		itemid = PageGetItemId(page, newitemoff);
+		int			nipd;
+		char	   *replacepos;
+		char	   *rightpos;
+		Size		nbytes;
+
+		/*
+		 * The new tuple is a duplicate with a heap TID that falls inside the
+		 * range of an existing posting list tuple, so split posting list.
+		 *
+		 * Posting list splits always replace some existing TID in the posting
+		 * list with the new item's heap TID (based on a posting list offset
+		 * from caller) by removing rightmost heap TID from posting list.  The
+		 * new item's heap TID is swapped with that rightmost heap TID, almost
+		 * as if the tuple inserted never overlapped with a posting list in
+		 * the first place.  This allows the insertion and page split code to
+		 * have minimal special case handling of posting lists.
+		 *
+		 * The only extra handling required is to overwrite the original
+		 * posting list with nposting, which is guaranteed to be the same size
+		 * as the original, keeping the page space accounting simple.  This
+		 * takes place in either the page insert or page split critical
+		 * section.
+		 */
+		Assert(P_ISLEAF(lpageop));
+		Assert(!ItemIdIsDead(itemid));
+		Assert(in_posting_offset > 0);
+		oposting = (IndexTuple) PageGetItem(page, itemid);
+		Assert(BTreeTupleIsPosting(oposting));
+		nipd = BTreeTupleGetNPosting(oposting);
+		Assert(in_posting_offset < nipd);
+
+		nposting = CopyIndexTuple(oposting);
+		replacepos = (char *) BTreeTupleGetPostingN(nposting, in_posting_offset);
+		rightpos = replacepos + sizeof(ItemPointerData);
+		nbytes = (nipd - in_posting_offset - 1) * sizeof(ItemPointerData);
+
+		/*
+		 * Move item pointers in posting list to make a gap for the new item's
+		 * heap TID (shift TIDs one place to the right, losing original
+		 * rightmost TID).
+		 */
+		memmove(rightpos, replacepos, nbytes);
+
+		/*
+		 * Replace newitem's heap TID with rightmost heap TID from original
+		 * posting list
+		 */
+		ItemPointerCopy(&itup->t_tid, (ItemPointer) replacepos);
+
+		/*
+		 * Copy original (not new original) posting list's last TID into new
+		 * item
+		 */
+		ItemPointerCopy(BTreeTupleGetPostingN(oposting, nipd - 1), &itup->t_tid);
+		Assert(ItemPointerCompare(BTreeTupleGetMaxTID(nposting),
+								  BTreeTupleGetHeapTID(itup)) < 0);
+		/* Alter new item offset, since effective new item changed */
+		newitemoff = OffsetNumberNext(newitemoff);
+	}
+
 	/*
 	 * Do we need to split the page to fit the item on it?
 	 *
@@ -996,7 +1112,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+						 nposting);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1075,6 +1192,18 @@ _bt_insertonpg(Relation rel,
 			elog(PANIC, "failed to add new item to block %u in index \"%s\"",
 				 itup_blkno, RelationGetRelationName(rel));
 
+		if (nposting)
+		{
+			/*
+			 * Handle a posting list split by performing an in-place update of
+			 * the existing posting list
+			 */
+			Assert(P_ISLEAF(lpageop));
+			Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+				   MAXALIGN(IndexTupleSize(nposting)));
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+		}
+
 		MarkBufferDirty(buf);
 
 		if (BufferIsValid(metabuf))
@@ -1116,6 +1245,9 @@ _bt_insertonpg(Relation rel,
 			XLogRecPtr	recptr;
 
 			xlrec.offnum = itup_off;
+			xlrec.postingsz = 0;
+			if (nposting)
+				xlrec.postingsz = MAXALIGN(IndexTupleSize(itup));
 
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
@@ -1153,6 +1285,9 @@ _bt_insertonpg(Relation rel,
 
 			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
 			XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+			if (nposting)
+				XLogRegisterBufData(0, (char *) nposting,
+									IndexTupleSize(nposting));
 
 			recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
@@ -1194,6 +1329,10 @@ _bt_insertonpg(Relation rel,
 			_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
 			RelationSetTargetBlock(rel, cachedBlock);
 	}
+
+	/* be tidy */
+	if (nposting)
+		pfree(nposting);
 }
 
 /*
@@ -1211,10 +1350,16 @@ _bt_insertonpg(Relation rel,
  *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
+ *
+ *		nposting is a replacement posting for the posting list at the
+ *		offset immediately before the new item's offset.  This is needed
+ *		when caller performed "posting list split", and corresponds to the
+ *		same step for retail insertions that don't split the page.
  */
 static Buffer
 _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+		  IndexTuple nposting)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1236,12 +1381,20 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	OffsetNumber firstright;
 	OffsetNumber maxoff;
 	OffsetNumber i;
+	OffsetNumber replacepostingoff = InvalidOffsetNumber;
 	bool		newitemonleft,
 				isleaf;
 	IndexTuple	lefthikey;
 	int			indnatts = IndexRelationGetNumberOfAttributes(rel);
 	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
+	/*
+	 * Determine offset number of posting list that will be updated in place
+	 * as part of split that follows a posting list split
+	 */
+	if (nposting != NULL)
+		replacepostingoff = OffsetNumberPrev(newitemoff);
+
 	/*
 	 * origpage is the original page to be split.  leftpage is a temporary
 	 * buffer that receives the left-sibling data, which will be copied back
@@ -1273,6 +1426,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	 * newitemoff == firstright.  In all other cases it's clear which side of
 	 * the split every tuple goes on from context.  newitemonleft is usually
 	 * (but not always) redundant information.
+	 *
+	 * Note: In theory, the split point choice logic should operate against a
+	 * version of the page that already replaced the posting list at offset
+	 * replacepostingoff with nposting where applicable.  We don't bother with
+	 * that, though.  Both versions of the posting list must be the same size
+	 * and have the same key values, so this omission can't affect the split
+	 * point chosen in practice.
 	 */
 	firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
 								  newitem, &newitemonleft);
@@ -1340,6 +1500,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, firstright);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		/* Behave as if origpage posting list has already been swapped */
+		if (firstright == replacepostingoff)
+			item = nposting;
 	}
 
 	/*
@@ -1373,6 +1536,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
 			itemid = PageGetItemId(origpage, lastleftoff);
 			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+			/* Behave as if origpage posting list has already been swapped */
+			if (lastleftoff == replacepostingoff)
+				lastleft = nposting;
 		}
 
 		Assert(lastleft != item);
@@ -1480,8 +1646,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 
+		/*
+		 * did caller pass new replacement posting list tuple due to posting
+		 * list split?
+		 */
+		if (i == replacepostingoff)
+		{
+			/*
+			 * swap origpage posting list with post-posting-list-split version
+			 * from caller
+			 */
+			Assert(isleaf);
+			Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+			item = nposting;
+		}
+
 		/* does new item belong before this one? */
-		if (i == newitemoff)
+		else if (i == newitemoff)
 		{
 			if (newitemonleft)
 			{
@@ -1652,6 +1833,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		xlrec.level = ropaque->btpo.level;
 		xlrec.firstright = firstright;
 		xlrec.newitemoff = newitemoff;
+		xlrec.replacepostingoff = replacepostingoff;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1676,6 +1858,10 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		if (newitemonleft)
 			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
 
+		if (replacepostingoff)
+			XLogRegisterBufData(0, (char *) nposting,
+								MAXALIGN(IndexTupleSize(nposting)));
+
 		/* Log the left page's new high key */
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
@@ -1834,7 +2020,7 @@ _bt_insert_parent(Relation rel,
 
 		/* Recursively insert into the parent */
 		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
-					   new_item, stack->bts_offset + 1,
+					   new_item, stack->bts_offset + 1, 0,
 					   is_only);
 
 		/* be tidy */
@@ -2304,6 +2490,250 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * Note: if we didn't find any LP_DEAD items, then the page's
 	 * BTP_HAS_GARBAGE hint bit is falsely set.  We do not bother expending a
 	 * separate write to clear it, however.  We will clear it when we split
-	 * the page.
+	 * the page (or when deduplication runs).
 	 */
 }
+
+/*
+ * Try to deduplicate items to free some space.  If we don't proceed with
+ * deduplication, buffer will contain old state of the page.
+ *
+ * 'itemsz' is the size of the inserter caller's incoming/new tuple, not
+ * including line pointer overhead.  This is the amount of space we'll need to
+ * free in order to let caller avoid splitting the page.
+ *
+ * This function should be called after LP_DEAD items were removed by
+ * _bt_vacuum_one_page() to prevent a page split.  (It's possible that we'll
+ * have to kill additional LP_DEAD items, but that should be rare.)
+ */
+static void
+_bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel, Size itemsz)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buffer);
+	Page		newpage;
+	BTPageOpaque oopaque,
+				nopaque;
+	bool		deduplicate = false;
+	BTDedupState *dedupState = NULL;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+	OffsetNumber deletable[MaxOffsetNumber];
+	int			ndeletable = 0;
+
+	/*
+	 * Don't use deduplication for indexes with INCLUDEd columns and unique
+	 * indexes
+	 */
+	deduplicate = (IndexRelationGetNumberOfKeyAttributes(rel) ==
+				   IndexRelationGetNumberOfAttributes(rel) &&
+				   !rel->rd_index->indisunique);
+	if (!deduplicate)
+		return;
+
+	oopaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	/* init deduplication state needed to build posting tuples */
+	dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+	dedupState->ipd = NULL;
+	dedupState->ntuples = 0;
+	dedupState->itupprev = NULL;
+	dedupState->maxitemsize = BTMaxItemSize(page);
+	dedupState->maxpostingsize = 0;
+
+	minoff = P_FIRSTDATAKEY(oopaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/*
+	 * Delete dead tuples if any. We cannot simply skip them in the cycle
+	 * below, because it's necessary to generate special Xlog record
+	 * containing such tuples to compute latestRemovedXid on a standby server
+	 * later.
+	 *
+	 * This should not affect performance, since it only can happen in a rare
+	 * situation when BTP_HAS_GARBAGE flag was not set and _bt_vacuum_one_page
+	 * was not called, or _bt_vacuum_one_page didn't remove all dead items.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+
+		if (ItemIdIsDead(itemid))
+			deletable[ndeletable++] = offnum;
+	}
+
+	if (ndeletable > 0)
+	{
+		/*
+		 * Skip duplication in rare cases where there were LP_DEAD items
+		 * encountered here when that frees sufficient space for caller to
+		 * avoid a page split
+		 */
+		_bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+		if (PageGetFreeSpace(page) >= itemsz)
+		{
+			pfree(dedupState);
+			return;
+		}
+
+		/* Continue with deduplication */
+		minoff = P_FIRSTDATAKEY(oopaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+	}
+
+	/*
+	 * Scan over all items to see which ones can be deduplicated
+	 */
+	newpage = PageGetTempPageCopySpecial(page);
+	nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+	/* Make sure that new page won't have garbage flag set */
+	nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+	/* Copy High Key if any */
+	if (!P_RIGHTMOST(oopaque))
+	{
+		ItemId		hitemid = PageGetItemId(page, P_HIKEY);
+		Size		hitemsz = ItemIdGetLength(hitemid);
+		IndexTuple	hitem = (IndexTuple) PageGetItem(page, hitemid);
+
+		if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to add highkey during deduplication");
+	}
+
+	/*
+	 * Iterate over tuples on the page, try to deduplicate them into posting
+	 * lists and insert into new page.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(!ItemIdIsDead(itemid));
+
+		if (dedupState->itupprev == NULL)
+		{
+			/* Just set up base/first item in first iteration */
+			Assert(offnum == minoff);
+			dedupState->itupprev = CopyIndexTuple(itup);
+			continue;
+		}
+
+		if (deduplicate &&
+			_bt_keep_natts_fast(rel, dedupState->itupprev, itup) > natts)
+		{
+			int			itup_ntuples;
+			Size		projpostingsz;
+
+			/*
+			 * Tuples are equal.
+			 *
+			 * If posting list does not exceed tuple size limit then append
+			 * the tuple to the pending posting list.  Otherwise, insert it on
+			 * page and continue with this tuple as new pending posting list.
+			 */
+			itup_ntuples = BTreeTupleIsPosting(itup) ?
+				BTreeTupleGetNPosting(itup) : 1;
+
+			/*
+			 * Project size of new posting list that would result from merging
+			 * current tup with pending posting list (could just be prev item
+			 * that's "pending").
+			 *
+			 * This accounting looks odd, but it's correct because ...
+			 */
+			projpostingsz = MAXALIGN(IndexTupleSize(dedupState->itupprev) +
+									 (dedupState->ntuples + itup_ntuples + 1) *
+									 sizeof(ItemPointerData));
+
+			if (projpostingsz <= dedupState->maxitemsize)
+				_bt_stash_item_tid(dedupState, itup);
+			else
+				_bt_dedup_insert(newpage, dedupState);
+		}
+		else
+		{
+			/*
+			 * Tuples are not equal, or we're done deduplicating this page.
+			 *
+			 * Insert pending posting list on page.  This could just be a
+			 * regular tuple.
+			 */
+			_bt_dedup_insert(newpage, dedupState);
+		}
+
+		pfree(dedupState->itupprev);
+		dedupState->itupprev = CopyIndexTuple(itup);
+	}
+
+	/* Handle the last item */
+	_bt_dedup_insert(newpage, dedupState);
+
+	START_CRIT_SECTION();
+
+	PageRestoreTempPage(newpage, page);
+	MarkBufferDirty(buffer);
+
+	/* Log full page write */
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+
+		recptr = log_newpage_buffer(buffer, true);
+		PageSetLSN(page, recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	/* be tidy */
+	pfree(dedupState);
+}
+
+/*
+ * Add new posting tuple item to the page based on itupprev and saved list of
+ * heap TIDs.
+ */
+static void
+_bt_dedup_insert(Page page, BTDedupState *dedupState)
+{
+	IndexTuple	to_insert;
+	OffsetNumber offnum = PageGetMaxOffsetNumber(page);
+
+	if (dedupState->ntuples == 0)
+	{
+		/*
+		 * Use original itupprev, which may or may not be a posting list
+		 * already from some earlier dedup attempt
+		 */
+		to_insert = dedupState->itupprev;
+	}
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(dedupState->itupprev,
+											 dedupState->ipd,
+											 dedupState->ntuples);
+		to_insert = postingtuple;
+		pfree(dedupState->ipd);
+	}
+
+	Assert(IndexTupleSize(dedupState->itupprev) <= dedupState->maxitemsize);
+	/* Add the new item into the page */
+	offnum = OffsetNumberNext(offnum);
+
+	if (PageAddItem(page, (Item) to_insert, IndexTupleSize(to_insert),
+					offnum, false, false) == InvalidOffsetNumber)
+		elog(ERROR, "deduplication failed to add tuple to page");
+
+	if (dedupState->ntuples > 0)
+		pfree(to_insert);
+	dedupState->ntuples = 0;
+}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 268f869a36..5314bbe2a9 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -42,6 +43,11 @@ static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
 								   BlockNumber *target, BlockNumber *rightsib);
 static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
 							   TransactionId latestRemovedXid);
+static TransactionId _bt_compute_xid_horizon_for_tuples(Relation rel,
+														Relation heapRel,
+														Buffer buf,
+														OffsetNumber *itemnos,
+														int nitems);
 
 /*
  *	_bt_initmetapage() -- Fill a page buffer with a correct metapage image
@@ -983,14 +989,52 @@ _bt_page_recyclable(Page page)
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
 					OffsetNumber *itemnos, int nitems,
+					OffsetNumber *remainingoffset,
+					IndexTuple *remaining, int nremaining,
 					BlockNumber lastBlockVacuumed)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	Size		itemsz;
+	Size		remaining_sz = 0;
+	char	   *remaining_buf = NULL;
+
+	/* XLOG stuff, buffer for remainings */
+	if (nremaining && RelationNeedsWAL(rel))
+	{
+		Size		offset = 0;
+
+		for (int i = 0; i < nremaining; i++)
+			remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+		remaining_buf = palloc0(remaining_sz);
+		for (int i = 0; i < nremaining; i++)
+		{
+			itemsz = IndexTupleSize(remaining[i]);
+			memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+			offset += MAXALIGN(itemsz);
+		}
+		Assert(offset == remaining_sz);
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
+	/* Handle posting tuples here */
+	for (int i = 0; i < nremaining; i++)
+	{
+		/* At first, delete the old tuple. */
+		PageIndexTupleDelete(page, remainingoffset[i]);
+
+		itemsz = IndexTupleSize(remaining[i]);
+		itemsz = MAXALIGN(itemsz);
+
+		/* Add tuple with remaining ItemPointers to the page. */
+		if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to rewrite posting list item in index while doing vacuum");
+	}
+
 	/* Fix the page */
 	if (nitems > 0)
 		PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1064,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+		xlrec_vacuum.nremaining = nremaining;
+		xlrec_vacuum.ndeleted = nitems;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1079,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		if (nitems > 0)
 			XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
 
+		/*
+		 * Here we should save offnums and remaining tuples themselves. It's
+		 * important to restore them in correct order. At first, we must
+		 * handle remaining tuples and only after that other deleted items.
+		 */
+		if (nremaining > 0)
+		{
+			Assert(remaining_buf != NULL);
+			XLogRegisterBufData(0, (char *) remainingoffset,
+								nremaining * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, remaining_buf, remaining_sz);
+		}
+
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
 		PageSetLSN(page, recptr);
@@ -1041,6 +1100,91 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	END_CRIT_SECTION();
 }
 
+/*
+ * Get the latestRemovedXid from the table entries pointed at by the index
+ * tuples being deleted.
+ *
+ * This is a version of index_compute_xid_horizon_for_tuples() specialized to
+ * nbtree, which can handle posting lists.
+ */
+static TransactionId
+_bt_compute_xid_horizon_for_tuples(Relation rel, Relation heapRel,
+								   Buffer buf, OffsetNumber *itemnos,
+								   int nitems)
+{
+	ItemPointerData *ttids;
+	TransactionId latestRemovedXid = InvalidTransactionId;
+	Page		page = BufferGetPage(buf);
+	int			arraynitems;
+	int			finalnitems;
+
+	/*
+	 * Initial size of array can fit everything when it turns out that are no
+	 * posting lists
+	 */
+	arraynitems = nitems;
+	ttids = (ItemPointerData *) palloc(sizeof(ItemPointerData) * arraynitems);
+
+	finalnitems = 0;
+	/* identify what the index tuples about to be deleted point to */
+	for (int i = 0; i < nitems; i++)
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+
+		itemid = PageGetItemId(page, itemnos[i]);
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(ItemIdIsDead(itemid));
+
+		if (!BTreeTupleIsPosting(itup))
+		{
+			/* Make sure that we have space for additional heap TID */
+			if (finalnitems + 1 > arraynitems)
+			{
+				arraynitems = arraynitems * 2;
+				ttids = (ItemPointerData *)
+					repalloc(ttids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			Assert(ItemPointerIsValid(&itup->t_tid));
+			ItemPointerCopy(&itup->t_tid, &ttids[finalnitems]);
+			finalnitems++;
+		}
+		else
+		{
+			int			nposting = BTreeTupleGetNPosting(itup);
+
+			/* Make sure that we have space for additional heap TIDs */
+			if (finalnitems + nposting > arraynitems)
+			{
+				arraynitems = Max(arraynitems * 2, finalnitems + nposting);
+				ttids = (ItemPointerData *)
+					repalloc(ttids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			for (int j = 0; j < nposting; j++)
+			{
+				ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+				Assert(ItemPointerIsValid(htid));
+				ItemPointerCopy(htid, &ttids[finalnitems]);
+				finalnitems++;
+			}
+		}
+	}
+
+	Assert(finalnitems >= nitems);
+
+	/* determine the actual xid horizon */
+	latestRemovedXid =
+		table_compute_xid_horizon_for_tuples(heapRel, ttids, finalnitems);
+
+	pfree(ttids);
+
+	return latestRemovedXid;
+}
+
 /*
  * Delete item(s) from a btree page during single-page cleanup.
  *
@@ -1067,8 +1211,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 
 	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
 		latestRemovedXid =
-			index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
-												 itemnos, nitems);
+			_bt_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+											   itemnos, nitems);
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd5289ad..67595319d7 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static ItemPointer btreevacuumPosting(BTVacState *vstate, IndexTuple itup,
+									  int *nremaining);
 
 
 /*
@@ -263,8 +265,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 				 */
 				if (so->killedItems == NULL)
 					so->killedItems = (int *)
-						palloc(MaxIndexTuplesPerPage * sizeof(int));
-				if (so->numKilled < MaxIndexTuplesPerPage)
+						palloc(MaxPostingIndexTuplesPerPage * sizeof(int));
+				if (so->numKilled < MaxPostingIndexTuplesPerPage)
 					so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 			}
 
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 								 RBM_NORMAL, info->strategy);
 		LockBufferForCleanup(buf);
 		_bt_checkpage(rel, buf);
-		_bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+		_bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+							vstate.lastBlockVacuumed);
 		_bt_relbuf(rel, buf);
 	}
 
@@ -1193,6 +1196,9 @@ restart:
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
+		IndexTuple	remaining[MaxOffsetNumber];
+		OffsetNumber remainingoffset[MaxOffsetNumber];
+		int			nremaining;
 
 		/*
 		 * Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1235,7 @@ restart:
 		 * callback function.
 		 */
 		ndeletable = 0;
+		nremaining = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
 		if (callback)
@@ -1242,31 +1249,79 @@ restart:
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
-				/*
-				 * During Hot Standby we currently assume that
-				 * XLOG_BTREE_VACUUM records do not produce conflicts. That is
-				 * only true as long as the callback function depends only
-				 * upon whether the index tuple refers to heap tuples removed
-				 * in the initial heap scan. When vacuum starts it derives a
-				 * value of OldestXmin. Backends taking later snapshots could
-				 * have a RecentGlobalXmin with a later xid than the vacuum's
-				 * OldestXmin, so it is possible that row versions deleted
-				 * after OldestXmin could be marked as killed by other
-				 * backends. The callback function *could* look at the index
-				 * tuple state in isolation and decide to delete the index
-				 * tuple, though currently it does not. If it ever did, we
-				 * would need to reconsider whether XLOG_BTREE_VACUUM records
-				 * should cause conflicts. If they did cause conflicts they
-				 * would be fairly harsh conflicts, since we haven't yet
-				 * worked out a way to pass a useful value for
-				 * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
-				 * applies to *any* type of index that marks index tuples as
-				 * killed.
-				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				if (BTreeTupleIsPosting(itup))
+				{
+					int			nnewipd = 0;
+					ItemPointer newipd = NULL;
+
+					newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+					if (nnewipd == 0)
+					{
+						/*
+						 * All TIDs from posting list must be deleted, we can
+						 * delete whole tuple in a regular way.
+						 */
+						deletable[ndeletable++] = offnum;
+					}
+					else if (nnewipd == BTreeTupleGetNPosting(itup))
+					{
+						/*
+						 * All TIDs from posting tuple must remain. Do
+						 * nothing, just cleanup.
+						 */
+						pfree(newipd);
+					}
+					else if (nnewipd < BTreeTupleGetNPosting(itup))
+					{
+						/* Some TIDs from posting tuple must remain. */
+						Assert(nnewipd > 0);
+						Assert(newipd != NULL);
+
+						/*
+						 * Form new tuple that contains only remaining TIDs.
+						 * Remember this tuple and the offset of the old tuple
+						 * to update it in place.
+						 */
+						remainingoffset[nremaining] = offnum;
+						remaining[nremaining] =
+							BTreeFormPostingTuple(itup, newipd, nnewipd);
+						nremaining++;
+						pfree(newipd);
+
+						Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+					}
+				}
+				else
+				{
+					htup = &(itup->t_tid);
+
+					/*
+					 * During Hot Standby we currently assume that
+					 * XLOG_BTREE_VACUUM records do not produce conflicts.
+					 * That is only true as long as the callback function
+					 * depends only upon whether the index tuple refers to
+					 * heap tuples removed in the initial heap scan. When
+					 * vacuum starts it derives a value of OldestXmin.
+					 * Backends taking later snapshots could have a
+					 * RecentGlobalXmin with a later xid than the vacuum's
+					 * OldestXmin, so it is possible that row versions deleted
+					 * after OldestXmin could be marked as killed by other
+					 * backends. The callback function *could* look at the
+					 * index tuple state in isolation and decide to delete the
+					 * index tuple, though currently it does not. If it ever
+					 * did, we would need to reconsider whether
+					 * XLOG_BTREE_VACUUM records should cause conflicts. If
+					 * they did cause conflicts they would be fairly harsh
+					 * conflicts, since we haven't yet worked out a way to
+					 * pass a useful value for latestRemovedXid on the
+					 * XLOG_BTREE_VACUUM records. This applies to *any* type
+					 * of index that marks index tuples as killed.
+					 */
+					if (callback(htup, callback_state))
+						deletable[ndeletable++] = offnum;
+				}
 			}
 		}
 
@@ -1274,7 +1329,7 @@ restart:
 		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
 		 * call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nremaining > 0)
 		{
 			/*
 			 * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1346,7 @@ restart:
 			 * that.
 			 */
 			_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+								remainingoffset, remaining, nremaining,
 								vstate->lastBlockVacuumed);
 
 			/*
@@ -1375,6 +1431,41 @@ restart:
 	}
 }
 
+/*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+	int			remaining = 0;
+	int			nitem = BTreeTupleGetNPosting(itup);
+	ItemPointer tmpitems = NULL,
+				items = BTreeTupleGetPosting(itup);
+
+	/*
+	 * Check each tuple in the posting list, save alive tuples into tmpitems
+	 */
+	for (int i = 0; i < nitem; i++)
+	{
+		if (vstate->callback(items + i, vstate->callback_state))
+			continue;
+
+		if (tmpitems == NULL)
+			tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+		tmpitems[remaining++] = items[i];
+	}
+
+	*nremaining = remaining;
+	return tmpitems;
+}
+
 /*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8e512461a0..c78c8e67b5 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int	_bt_binsrch_posting(BTScanInsert key, Page page,
+								OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static void _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+								  OffsetNumber offnum, ItemPointer iptr,
+								  IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+									   OffsetNumber offnum, ItemPointer iptr,
+									   IndexTuple itup);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -434,7 +442,10 @@ _bt_binsrch(Relation rel,
  * low) makes bounds invalid.
  *
  * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's in_posting_offset field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
  */
 OffsetNumber
 _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +464,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 	Assert(P_ISLEAF(opaque));
 	Assert(!key->nextkey);
+	Assert(insertstate->in_posting_offset == 0);
 
 	if (!insertstate->bounds_valid)
 	{
@@ -509,6 +521,17 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 			if (result != 0)
 				stricthigh = high;
 		}
+
+		/*
+		 * If tuple at offset located by binary search is a posting list whose
+		 * TID range overlaps with caller's scantid, perform posting list
+		 * binary search to set in_posting_offset for caller.  Caller must
+		 * split the posting list when in_posting_offset is set.  This should
+		 * happen infrequently.
+		 */
+		if (unlikely(result == 0 && key->scantid != NULL))
+			insertstate->in_posting_offset =
+				_bt_binsrch_posting(key, page, mid);
 	}
 
 	/*
@@ -528,6 +551,68 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	return low;
 }
 
+/*----------
+ *	_bt_binsrch_posting() -- posting list binary search.
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+	IndexTuple	itup;
+	ItemId		itemid;
+	int			low,
+				high,
+				mid,
+				res;
+
+	/*
+	 * If this isn't a posting tuple, then the index must be corrupt (if it is
+	 * an ordinary non-pivot tuple then there must be an existing tuple with a
+	 * heap TID that equals inserter's new heap TID/scantid).  Defensively
+	 * check that tuple is a posting list tuple whose posting list range
+	 * includes caller's scantid.
+	 *
+	 * (This is also needed because contrib/amcheck's rootdescend option needs
+	 * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+	 */
+	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+	Assert(!key->nextkey);
+	Assert(key->scantid != NULL);
+	itemid = PageGetItemId(page, offnum);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+	if (!BTreeTupleIsPosting(itup))
+		return 0;
+
+	/*
+	 * In the unlikely event that posting list tuple has LP_DEAD bit set,
+	 * signal to caller that it should kill the item and restart its binary
+	 * search.
+	 */
+	if (ItemIdIsDead(itemid))
+		return -1;
+
+	/* "high" is past end of posting list for loop invariant */
+	low = 0;
+	high = BTreeTupleGetNPosting(itup);
+	Assert(high >= 2);
+
+	while (high > low)
+	{
+		mid = low + ((high - low) / 2);
+		res = ItemPointerCompare(key->scantid,
+								 BTreeTupleGetPostingN(itup, mid));
+
+		if (res >= 1)
+			low = mid + 1;
+		else
+			high = mid;
+	}
+
+	return low;
+}
+
 /*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
@@ -537,9 +622,18 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
  *			<0 if scankey < tuple at offnum;
  *			 0 if scankey == tuple at offnum;
  *			>0 if scankey > tuple at offnum.
- *		NULLs in the keys are treated as sortable values.  Therefore
- *		"equality" does not necessarily mean that the item should be
- *		returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values.  Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key.  Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid.  There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
+ *
+ * It is generally guaranteed that any possible scankey with scantid set
+ * will have zero or one tuples in the index that are considered equal
+ * here.
  *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
@@ -563,6 +657,7 @@ _bt_compare(Relation rel,
 	ScanKey		scankey;
 	int			ncmpkey;
 	int			ntupatts;
+	int32		result;
 
 	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +692,6 @@ _bt_compare(Relation rel,
 	{
 		Datum		datum;
 		bool		isNull;
-		int32		result;
 
 		datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
 
@@ -713,8 +807,24 @@ _bt_compare(Relation rel,
 	if (heapTid == NULL)
 		return 1;
 
+	/*
+	 * scankey must be treated as equal to a posting list tuple if its scantid
+	 * value falls within the range of the posting list.  In all other cases
+	 * there can only be a single heap TID value, which is compared directly
+	 * as a simple scalar value.
+	 */
 	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	return ItemPointerCompare(key->scantid, heapTid);
+	result = ItemPointerCompare(key->scantid, heapTid);
+	if (!BTreeTupleIsPosting(itup) || result <= 0)
+		return result;
+	else
+	{
+		result = ItemPointerCompare(key->scantid, BTreeTupleGetMaxTID(itup));
+		if (result > 0)
+			return 1;
+	}
+
+	return 0;
 }
 
 /*
@@ -1451,6 +1561,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	/* initialize tuple workspace to empty */
 	so->currPos.nextTupleOffset = 0;
+	so->currPos.postingTupleOffset = 0;
 
 	/*
 	 * Now that the current page has been made consistent, the macro should be
@@ -1485,8 +1596,30 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				if (!BTreeTupleIsPosting(itup))
+				{
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+				else
+				{
+					/*
+					 * Setup state to return posting list, and save first
+					 * "logical" tuple
+					 */
+					_bt_setuppostingitems(so, itemIndex, offnum,
+										  BTreeTupleGetPostingN(itup, 0),
+										  itup);
+					itemIndex++;
+					/* Save additional posting list "logical" tuples */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											itup);
+						itemIndex++;
+					}
+				}
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1519,7 +1652,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1527,7 +1660,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxPostingIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1569,8 +1702,37 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (passes_quals && tuple_alive)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				if (!BTreeTupleIsPosting(itup))
+				{
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+				else
+				{
+					int			i = BTreeTupleGetNPosting(itup) - 1;
+
+					/*
+					 * Setup state to return posting list, and save last
+					 * "logical" tuple from posting list (since it's the first
+					 * that will be returned to scan).
+					 */
+					itemIndex--;
+					_bt_setuppostingitems(so, itemIndex, offnum,
+										  BTreeTupleGetPostingN(itup, i--),
+										  itup);
+
+					/*
+					 * Return posting list "logical" tuples -- do this in
+					 * descending order, to match overall scan order
+					 */
+					for (; i >= 0; i--)
+					{
+						itemIndex--;
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											itup);
+					}
+				}
 			}
 			if (!continuescan)
 			{
@@ -1584,8 +1746,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1760,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1610,6 +1774,61 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 	}
 }
 
+/*
+ * Setup state to save posting items from a single posting list tuple.  Saves
+ * the logical tuple that will be returned to scan first in passing.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for logical tuple
+ * that is returned to scan first.  Second or subsequent heap TID for posting
+ * list should be saved by calling _bt_savepostingitem().
+ */
+static void
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					  ItemPointer iptr, IndexTuple itup)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *iptr;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		/* Save a truncated version of the IndexTuple */
+		Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+		itupsz = MAXALIGN(itupsz);
+		currItem->tupleOffset = so->currPos.nextTupleOffset;
+		memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+		so->currPos.nextTupleOffset += itupsz;
+		so->currPos.postingTupleOffset = currItem->tupleOffset;
+	}
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer iptr, IndexTuple itup)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *iptr;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		/*
+		 * Have index-only scans return the same truncated IndexTuple for
+		 * every logical tuple that originates from the same posting list
+		 */
+		currItem->tupleOffset = so->currPos.postingTupleOffset;
+	}
+}
+
 /*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index ab19692006..a2484f3e3b 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -288,6 +288,8 @@ static void _bt_sortaddtup(Page page, Size itemsize,
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 						 IndexTuple itup);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void _bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+								 BTDedupState *dedupState);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
 static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -830,6 +832,8 @@ _bt_sortaddtup(Page page,
  * the high key is to be truncated, offset 1 is deleted, and we insert
  * the truncated high key at offset 1.
  *
+ * Note that itup may be a posting list tuple.
+ *
  * 'last' pointer indicates the last offset added to the page.
  *----------
  */
@@ -963,6 +967,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * Overwrite the old item with new truncated high key directly.
 			 * oitup is already located at the physical beginning of tuple
 			 * space, so this should directly reuse the existing tuple space.
+			 *
+			 * If lastleft tuple was a posting tuple, we'll truncate its
+			 * posting list in _bt_truncate as well. Note that it is also
+			 * applicable only to leaf pages, since internal pages never
+			 * contain posting tuples.
 			 */
 			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
 			lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1002,6 +1011,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		 * the minimum key for the new page.
 		 */
 		state->btps_minkey = CopyIndexTuple(oitup);
+		Assert(BTreeTupleIsPivot(state->btps_minkey));
 
 		/*
 		 * Set the sibling links for both pages.
@@ -1043,6 +1053,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(state->btps_minkey == NULL);
 		state->btps_minkey = CopyIndexTuple(itup);
 		/* _bt_sortaddtup() will perform full truncation later */
+		BTreeTupleClearBtIsPosting(state->btps_minkey);
 		BTreeTupleSetNAtts(state->btps_minkey, 0);
 	}
 
@@ -1127,6 +1138,112 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
 }
 
+/*
+ * Add new tuple (posting or non-posting) to the page while building index.
+ */
+static void
+_bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+					 BTDedupState *dedupState)
+{
+	IndexTuple	to_insert;
+
+	/* Return, if there is no tuple to insert */
+	if (state == NULL)
+		return;
+
+	if (dedupState->ntuples == 0)
+		to_insert = dedupState->itupprev;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(dedupState->itupprev,
+											 dedupState->ipd,
+											 dedupState->ntuples);
+		to_insert = postingtuple;
+		pfree(dedupState->ipd);
+	}
+
+	_bt_buildadd(wstate, state, to_insert);
+
+	if (dedupState->ntuples > 0)
+		pfree(to_insert);
+	dedupState->ntuples = 0;
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in dedupState.
+ *
+ * 'itup' is current tuple on page, which comes immediately after equal
+ * 'itupprev' tuple stashed in dedup state at the point we're called.
+ *
+ * Helper function for _bt_load() and _bt_dedup_one_page(), called when it
+ * becomes clear that pending itupprev item will be part of a new/pending
+ * posting list, or when a pending/new posting list will contain a new heap
+ * TID from itup.
+ *
+ * Note: caller is responsible for the BTMaxItemSize() check.
+ */
+void
+_bt_stash_item_tid(BTDedupState *dedupState, IndexTuple itup)
+{
+	int			nposting = 0;
+
+	if (dedupState->ntuples == 0)
+	{
+		dedupState->ipd = palloc0(dedupState->maxitemsize);
+
+		/*
+		 * itupprev hasn't had its posting list TIDs copied into ipd yet (must
+		 * have been first on page and/or in new posting list?).  Do so now.
+		 *
+		 * This is delayed because it wasn't initially clear whether or not
+		 * itupprev would be merged with the next tuple, or stay as-is.  By
+		 * now caller compared it against itup and found that it was equal, so
+		 * we can go ahead and add its TIDs.
+		 */
+		if (!BTreeTupleIsPosting(dedupState->itupprev))
+		{
+			memcpy(dedupState->ipd, dedupState->itupprev,
+				   sizeof(ItemPointerData));
+			dedupState->ntuples++;
+		}
+		else
+		{
+			/* if itupprev is posting, add all its TIDs to the posting list */
+			nposting = BTreeTupleGetNPosting(dedupState->itupprev);
+			memcpy(dedupState->ipd,
+				   BTreeTupleGetPosting(dedupState->itupprev),
+				   sizeof(ItemPointerData) * nposting);
+			dedupState->ntuples += nposting;
+		}
+	}
+
+	/*
+	 * Add current tup to ipd for pending posting list for new version of
+	 * page.
+	 */
+	if (!BTreeTupleIsPosting(itup))
+	{
+		memcpy(dedupState->ipd + dedupState->ntuples, itup,
+			   sizeof(ItemPointerData));
+		dedupState->ntuples++;
+	}
+	else
+	{
+		/*
+		 * if tuple is posting, add all its TIDs to the pending list that will
+		 * become new posting list later on
+		 */
+		nposting = BTreeTupleGetNPosting(itup);
+		memcpy(dedupState->ipd + dedupState->ntuples,
+			   BTreeTupleGetPosting(itup),
+			   sizeof(ItemPointerData) * nposting);
+		dedupState->ntuples += nposting;
+	}
+}
+
 /*
  * Read tuples in correct sort order from tuplesort, and load them into
  * btree leaves.
@@ -1141,9 +1258,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	bool		load1;
 	TupleDesc	tupdes = RelationGetDescr(wstate->index);
 	int			i,
-				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+				natts = IndexRelationGetNumberOfAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		deduplicate = false;
+	BTDedupState *dedupState = NULL;
+
+	/*
+	 * Don't use deduplication for indexes with INCLUDEd columns and unique
+	 * indexes
+	 */
+	deduplicate = (IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+				   IndexRelationGetNumberOfAttributes(wstate->index) &&
+				   !wstate->index->rd_index->indisunique);
 
 	if (merge)
 	{
@@ -1257,19 +1385,88 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	}
 	else
 	{
-		/* merge is unnecessary */
-		while ((itup = tuplesort_getindextuple(btspool->sortstate,
-											   true)) != NULL)
+		if (!deduplicate)
 		{
-			/* When we see first tuple, create first index page */
-			if (state == NULL)
-				state = _bt_pagestate(wstate, 0);
+			/* merge is unnecessary */
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+					state = _bt_pagestate(wstate, 0);
 
-			_bt_buildadd(wstate, state, itup);
+				_bt_buildadd(wstate, state, itup);
 
-			/* Report progress */
-			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
-										 ++tuples_done);
+				/* Report progress */
+				pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+											 ++tuples_done);
+			}
+		}
+		else
+		{
+			/* init deduplication state needed to build posting tuples */
+			dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+			dedupState->ipd = NULL;
+			dedupState->ntuples = 0;
+			dedupState->itupprev = NULL;
+			dedupState->maxitemsize = 0;
+			dedupState->maxpostingsize = 0;
+
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+				{
+					state = _bt_pagestate(wstate, 0);
+					dedupState->maxitemsize = BTMaxItemSize(state->btps_page);
+				}
+
+				if (dedupState->itupprev != NULL)
+				{
+					int			n_equal_atts = _bt_keep_natts_fast(wstate->index,
+																   dedupState->itupprev, itup);
+
+					if (n_equal_atts > natts)
+					{
+						/*
+						 * Tuples are equal. Create or update posting.
+						 *
+						 * Else If posting is too big, insert it on page and
+						 * continue.
+						 */
+						if ((dedupState->ntuples + 1) * sizeof(ItemPointerData) <
+							dedupState->maxpostingsize)
+							_bt_stash_item_tid(dedupState, itup);
+						else
+							_bt_buildadd_posting(wstate, state, dedupState);
+					}
+					else
+					{
+						/*
+						 * Tuples are not equal. Insert itupprev into index.
+						 * Save current tuple for the next iteration.
+						 */
+						_bt_buildadd_posting(wstate, state, dedupState);
+					}
+				}
+
+				/*
+				 * Save the tuple to compare it with the next one and maybe
+				 * unite them into a posting tuple.
+				 */
+				if (dedupState->itupprev)
+					pfree(dedupState->itupprev);
+				dedupState->itupprev = CopyIndexTuple(itup);
+
+				/* compute max size of posting list */
+				dedupState->maxpostingsize = dedupState->maxitemsize -
+					IndexInfoFindDataOffset(dedupState->itupprev->t_info) -
+					MAXALIGN(IndexTupleSize(dedupState->itupprev));
+			}
+
+			/* Handle the last item */
+			_bt_buildadd_posting(wstate, state, dedupState);
 		}
 	}
 
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1c1029b6c4..54cecc85c5 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
 	state.minfirstrightsz = SIZE_MAX;
 	state.newitemoff = newitemoff;
 
+	/* newitem cannot be a posting list item */
+	Assert(!BTreeTupleIsPosting(newitem));
+
 	/*
 	 * maxsplits should never exceed maxoff because there will be at most as
 	 * many candidate split points as there are points _between_ tuples, once
@@ -459,17 +462,52 @@ _bt_recsplitloc(FindSplitData *state,
 	int16		leftfree,
 				rightfree;
 	Size		firstrightitemsz;
+	Size		postingsubhikey = 0;
 	bool		newitemisfirstonright;
 
 	/* Is the new item going to be the first item on the right page? */
 	newitemisfirstonright = (firstoldonright == state->newitemoff
 							 && !newitemonleft);
 
+	/*
+	 * FIXME: Accessing every single tuple like this adds cycles to cases that
+	 * cannot possibly benefit (i.e. cases where we know that there cannot be
+	 * posting lists).  Maybe we should add a way to not bother when we are
+	 * certain that this is the case.
+	 *
+	 * We could either have _bt_split() pass us a flag, or invent a page flag
+	 * that indicates that the page might have posting lists, as an
+	 * optimization.  There is no shortage of btpo_flags bits for stuff like
+	 * this.
+	 */
 	if (newitemisfirstonright)
+	{
 		firstrightitemsz = state->newitemsz;
+
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+			postingsubhikey = IndexTupleSize(state->newitem) -
+				BTreeTupleGetPostingOffset(state->newitem);
+	}
 	else
+	{
 		firstrightitemsz = firstoldonrightsz;
 
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf)
+		{
+			ItemId		itemid;
+			IndexTuple	newhighkey;
+
+			itemid = PageGetItemId(state->page, firstoldonright);
+			newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+			if (BTreeTupleIsPosting(newhighkey))
+				postingsubhikey = IndexTupleSize(newhighkey) -
+					BTreeTupleGetPostingOffset(newhighkey);
+		}
+	}
+
 	/* Account for all the old tuples */
 	leftfree = state->leftspace - olddataitemstoleft;
 	rightfree = state->rightspace -
@@ -492,9 +530,13 @@ _bt_recsplitloc(FindSplitData *state,
 	 * adding a heap TID to the left half's new high key when splitting at the
 	 * leaf level.  In practice the new high key will often be smaller and
 	 * will rarely be larger, but conservatively assume the worst case.
+	 * Truncation always truncates away any posting list that appears in the
+	 * first right tuple, though, so it's safe to subtract that overhead
+	 * (while still conservatively assuming that truncation might have to add
+	 * back a single heap TID using the pivot tuple heap TID representation).
 	 */
 	if (state->is_leaf)
-		leftfree -= (int16) (firstrightitemsz +
+		leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
 							 MAXALIGN(sizeof(ItemPointerData)));
 	else
 		leftfree -= (int16) firstrightitemsz;
@@ -691,7 +733,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
 	tup = (IndexTuple) PageGetItem(state->page, itemid);
 	/* Do cheaper test first */
-	if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+	if (BTreeTupleIsPosting(tup) ||
+		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
 	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 4c7b2d0966..e3d7f4ff0e 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -97,8 +97,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	indoption = rel->rd_indoption;
 	tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
 
-	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
 	/*
 	 * We'll execute search using scan key constructed on key columns.
 	 * Truncated attributes and non-key attributes are omitted from the final
@@ -110,9 +108,20 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	key->anynullkeys = false;	/* initial assumption */
 	key->nextkey = false;
 	key->pivotsearch = false;
+	key->scantid = NULL;
 	key->keysz = Min(indnkeyatts, tupnatts);
-	key->scantid = key->heapkeyspace && itup ?
-		BTreeTupleGetHeapTID(itup) : NULL;
+
+	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+	Assert(!itup || !BTreeTupleIsPosting(itup) || key->heapkeyspace);
+
+	/*
+	 * When caller passes a tuple with a heap TID, use it to set scantid. Note
+	 * that this handles posting list tuples by setting scantid to the lowest
+	 * heap TID in the posting list.
+	 */
+	if (itup && key->heapkeyspace)
+		key->scantid = BTreeTupleGetHeapTID(itup);
+
 	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
@@ -1786,10 +1795,35 @@ _bt_killitems(IndexScanDesc scan)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
+			bool		killtuple = false;
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			if (BTreeTupleIsPosting(ituple))
 			{
-				/* found the item */
+				int			pi = i + 1;
+				int			nposting = BTreeTupleGetNPosting(ituple);
+				int			j;
+
+				for (j = 0; j < nposting; j++)
+				{
+					ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+					if (!ItemPointerEquals(item, &kitem->heapTid))
+						break;	/* out of posting list loop */
+
+					/* Read-ahead to later kitems */
+					if (pi < numKilled)
+						kitem = &so->currPos.items[so->killedItems[pi++]];
+				}
+
+				if (j == nposting)
+					killtuple = true;
+			}
+			else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+				killtuple = true;
+
+			if (killtuple)
+			{
+				/* found the item/all posting list items */
 				ItemIdMarkDead(iid);
 				killedsomething = true;
 				break;			/* out of inner search loop */
@@ -2145,6 +2179,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(firstright))
+		{
+			BTreeTupleClearBtIsPosting(pivot);
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			if (keepnatts == natts)
+			{
+				/*
+				 * index_truncate_tuple() just returned a copy of the
+				 * original, so make sure that the size of the new pivot tuple
+				 * doesn't have posting list overhead
+				 */
+				pivot->t_info &= ~INDEX_SIZE_MASK;
+				pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+			}
+		}
+
+		Assert(!BTreeTupleIsPosting(pivot));
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2161,6 +2213,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * attribute to the new pivot tuple.
 		 */
 		Assert(natts != nkeyatts);
+		Assert(!BTreeTupleIsPosting(lastleft) &&
+			   !BTreeTupleIsPosting(firstright));
 		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
 		tidpivot = palloc0(newsize);
 		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2168,6 +2222,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		pfree(pivot);
 		pivot = tidpivot;
 	}
+	else if (BTreeTupleIsPosting(firstright))
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal.  We
+		 * can always truncate away a posting list, though.
+		 *
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+			MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+		pivot->t_info &= ~INDEX_SIZE_MASK;
+		pivot->t_info |= newsize;
+		BTreeTupleClearBtIsPosting(pivot);
+		BTreeTupleSetAltHeapTID(pivot);
+	}
 	else
 	{
 		/*
@@ -2175,7 +2247,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * It's necessary to add a heap TID attribute to the new pivot tuple.
 		 */
 		Assert(natts == nkeyatts);
-		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+		newsize = MAXALIGN(IndexTupleSize(firstright)) +
+			MAXALIGN(sizeof(ItemPointerData));
 		pivot = palloc0(newsize);
 		memcpy(pivot, firstright, IndexTupleSize(firstright));
 	}
@@ -2193,6 +2266,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * nbtree (e.g., there is no pg_attribute entry).
 	 */
 	Assert(itup_key->heapkeyspace);
+	Assert(!BTreeTupleIsPosting(pivot));
 	pivot->t_info &= ~INDEX_SIZE_MASK;
 	pivot->t_info |= newsize;
 
@@ -2205,7 +2279,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2216,9 +2290,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+							  BTreeTupleGetHeapTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #else
 
 	/*
@@ -2231,7 +2308,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2240,7 +2317,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2321,15 +2399,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
  * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal (once detoasted).  Similarly, result may
- * differ from the _bt_keep_natts result when either tuple has TOASTed datums,
- * though this is barely possible in practice.
+ * unless they're bitwise equal after detoasting.
  *
  * These issues must be acceptable to callers, typically because they're only
  * concerned about making suffix truncation as effective as possible without
  * leaving excessive amounts of free space on either side of page split.
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts.
+ *
+ * When an index only uses opclasses where equality is "precise", this
+ * function is guaranteed to give the same result as _bt_keep_natts().  This
+ * makes it safe to use this function to determine whether or not two tuples
+ * can be folded together into a single posting tuple.  Posting list
+ * deduplication cannot be used with nondeterministic collations for this
+ * reason.
+ *
+ * FIXME: Actually invent the needed "equality-is-precise" opclass
+ * infrastructure.  See dedicated -hackers thread:
+ *
+ * https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2354,8 +2442,38 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 		if (isNull1 != isNull2)
 			break;
 
+		/*
+		 * XXX: The ideal outcome from the point of view of the posting list
+		 * patch is that the definition of an opclass with "precise equality"
+		 * becomes: "equality operator function must give exactly the same
+		 * answer as datum_image_eq() would, provided that we aren't using a
+		 * nondeterministic collation". (Nondeterministic collations are
+		 * clearly not compatible with deduplication.)
+		 *
+		 * This will be a lot faster than actually using the authoritative
+		 * insertion scankey in some cases.  This approach also seems more
+		 * elegant, since suffix truncation gets to follow exactly the same
+		 * definition of "equal" as posting list deduplication -- there is a
+		 * subtle interplay between deduplication and suffix truncation, and
+		 * it would be nice to know for sure that they have exactly the same
+		 * idea about what equality is.
+		 *
+		 * This ideal outcome still avoids problems with TOAST.  We cannot
+		 * repeat bugs like the amcheck bug that was fixed in bugfix commit
+		 * eba775345d23d2c999bbb412ae658b6dab36e3e8.  datum_image_eq()
+		 * considers binary equality, though only _after_ each datum is
+		 * decompressed.
+		 *
+		 * If this ideal solution isn't possible, then we can fall back on
+		 * defining "precise equality" as: "type's output function must
+		 * produce identical textual output for any two datums that compare
+		 * equal when using a safe/equality-is-precise operator class (unless
+		 * using a nondeterministic collation)".  That would mean that we'd
+		 * have to make deduplication call _bt_keep_natts() instead (or some
+		 * other function that uses authoritative insertion scankey).
+		 */
 		if (!isNull1 &&
-			!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+			!datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
 			break;
 
 		keepnatts++;
@@ -2407,22 +2525,30 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
+	/* !heapkeyspace indexes do not support deduplication */
+	if (!heapkeyspace && BTreeTupleIsPosting(itup))
+		return false;
+
+	/* INCLUDE indexes do not support deduplication */
+	if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+		return false;
+
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
 			/*
-			 * Non-pivot tuples currently never use alternative heap TID
-			 * representation -- even those within heapkeyspace indexes
+			 * Non-pivot tuple should never be explicitly marked as a pivot
+			 * tuple
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated.  (Note that tupnatts must have been
-			 * inferred, rather than coming from an explicit on-disk
-			 * representation.)
+			 * inferred, even with a posting list tuple, because only pivot
+			 * tuples store tupnatts directly.)
 			 */
 			return tupnatts == natts;
 		}
@@ -2466,12 +2592,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * non-zero, or when there is no explicit representation and the
 			 * tuple is evidently not a pre-pg_upgrade tuple.
 			 *
-			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
-			 * that to decide if the tuple is a pre-v11 tuple.
+			 * Prior to v11, downlinks always had P_HIKEY as their offset.
+			 * Accept that as an alternative indication of a valid
+			 * !heapkeyspace negative infinity tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
-				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+				ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
 		}
 		else
 		{
@@ -2497,7 +2623,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
+		return false;
+
+	/* Pivot tuple should not use posting list representation (redundant) */
+	if (BTreeTupleIsPosting(itup))
 		return false;
 
 	/*
@@ -2567,11 +2697,87 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					BTMaxItemSizeNoHeapTid(page),
 					RelationGetRelationName(rel)),
 			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
-					   ItemPointerGetBlockNumber(&newtup->t_tid),
-					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+					   ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
 					   RelationGetRelationName(heap)),
 			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
 					 "Consider a function index of an MD5 hash of the value, "
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Given a basic tuple that contains key datum and posting list,
+ * build a posting tuple.
+ *
+ * Basic tuple can be a posting tuple, but we only use key part of it,
+ * all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1 fallback to building a non-posting tuple.
+ * It is necessary to avoid storage overhead after posting tuple was vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+	uint32		keysize,
+				newsize = 0;
+	IndexTuple	itup;
+
+	/* We only need key part of the tuple */
+	if (BTreeTupleIsPosting(tuple))
+		keysize = BTreeTupleGetPostingOffset(tuple);
+	else
+		keysize = IndexTupleSize(tuple);
+
+	Assert(nipd > 0);
+
+	/* Add space needed for posting list */
+	if (nipd > 1)
+		newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+	else
+		newsize = keysize;
+
+	newsize = MAXALIGN(newsize);
+	itup = palloc0(newsize);
+	memcpy(itup, tuple, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+
+	if (nipd > 1)
+	{
+		/* Form posting tuple, fill posting fields */
+
+		/* Set meta info about the posting list */
+		itup->t_info |= INDEX_ALT_TID_MASK;
+		BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+		/* sort the list to preserve TID order invariant */
+		qsort((void *) ipd, nipd, sizeof(ItemPointerData),
+			  (int (*) (const void *, const void *)) ItemPointerCompare);
+
+		/* Copy posting list into the posting tuple */
+		memcpy(BTreeTupleGetPosting(itup), ipd,
+			   sizeof(ItemPointerData) * nipd);
+	}
+	else
+	{
+		/* To finish building of a non-posting tuple, copy TID from ipd */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		ItemPointerCopy(ipd, &itup->t_tid);
+	}
+
+	return itup;
+}
+
+/*
+ * Opposite of BTreeFormPostingTuple.
+ * returns regular tuple that contains the key,
+ * the tid of the new tuple is the nth tid of original tuple's posting list
+ * result tuple palloc'd in a caller's context.
+ */
+IndexTuple
+BTreeGetNthTupleOfPosting(IndexTuple tuple, int n)
+{
+	Assert(BTreeTupleIsPosting(tuple));
+	return BTreeFormPostingTuple(tuple, BTreeTupleGetPostingN(tuple, n), 1);
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c1aa..d4d7c09ff0 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -178,12 +178,34 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 	{
 		Size		datalen;
 		char	   *datapos = XLogRecGetBlockData(record, 0, &datalen);
+		IndexTuple	nposting = NULL;
 
 		page = BufferGetPage(buffer);
 
-		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
-						false, false) == InvalidOffsetNumber)
-			elog(PANIC, "btree_xlog_insert: failed to add item");
+		if (xlrec->postingsz > 0)
+		{
+			IndexTuple	oposting;
+
+			Assert(isleaf);
+
+			/* oposting must be at offset before new item */
+			oposting = (IndexTuple) PageGetItem(page,
+												PageGetItemId(page, OffsetNumberPrev(xlrec->offnum)));
+			if (PageAddItem(page, (Item) datapos, xlrec->postingsz,
+							xlrec->offnum, false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+			nposting = (IndexTuple) (datapos + xlrec->postingsz);
+
+			Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+				   MAXALIGN(IndexTupleSize(nposting)));
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+		}
+		else
+		{
+			if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+		}
 
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
@@ -265,9 +287,11 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 		BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
 		OffsetNumber off;
 		IndexTuple	newitem = NULL,
-					left_hikey = NULL;
+					left_hikey = NULL,
+					nposting = NULL;
 		Size		newitemsz = 0,
-					left_hikeysz = 0;
+					left_hikeysz = 0,
+					npostingsz = 0;
 		Page		newlpage;
 		OffsetNumber leftoff;
 
@@ -281,6 +305,17 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 			datalen -= newitemsz;
 		}
 
+		if (xlrec->replacepostingoff)
+		{
+			Assert(xlrec->replacepostingoff ==
+				   OffsetNumberPrev(xlrec->newitemoff));
+
+			nposting = (IndexTuple) datapos;
+			npostingsz = MAXALIGN(IndexTupleSize(nposting));
+			datapos += npostingsz;
+			datalen -= npostingsz;
+		}
+
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
 		left_hikey = (IndexTuple) datapos;
 		left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
@@ -304,6 +339,15 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 			Size		itemsz;
 			IndexTuple	item;
 
+			if (off == xlrec->replacepostingoff)
+			{
+				if (PageAddItem(newlpage, (Item) nposting, npostingsz,
+								leftoff, false, false) == InvalidOffsetNumber)
+					elog(ERROR, "failed to add new item to left page after split");
+				leftoff = OffsetNumberNext(leftoff);
+				continue;
+			}
+
 			/* add the new item if it was inserted on left page */
 			if (onleft && off == xlrec->newitemoff)
 			{
@@ -386,8 +430,8 @@ btree_xlog_vacuum(XLogReaderState *record)
 	Buffer		buffer;
 	Page		page;
 	BTPageOpaque opaque;
-#ifdef UNUSED
 	xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
 
 	/*
 	 * This section of code is thought to be no longer needed, after analysis
@@ -478,14 +522,34 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		if (len > 0)
 		{
-			OffsetNumber *unused;
-			OffsetNumber *unend;
+			if (xlrec->nremaining)
+			{
+				OffsetNumber *remainingoffset;
+				IndexTuple	remaining;
+				Size		itemsz;
 
-			unused = (OffsetNumber *) ptr;
-			unend = (OffsetNumber *) ((char *) ptr + len);
+				remainingoffset = (OffsetNumber *)
+					(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+				remaining = (IndexTuple) ((char *) remainingoffset +
+										  xlrec->nremaining * sizeof(OffsetNumber));
 
-			if ((unend - unused) > 0)
-				PageIndexMultiDelete(page, unused, unend - unused);
+				/* Handle posting tuples */
+				for (int i = 0; i < xlrec->nremaining; i++)
+				{
+					PageIndexTupleDelete(page, remainingoffset[i]);
+
+					itemsz = MAXALIGN(IndexTupleSize(remaining));
+
+					if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+									false, false) == InvalidOffsetNumber)
+						elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+					remaining = (IndexTuple) ((char *) remaining + itemsz);
+				}
+			}
+
+			if (xlrec->ndeleted)
+				PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 		}
 
 		/*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index a14eb792ec..6f71b13199 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -30,7 +30,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_insert *xlrec = (xl_btree_insert *) rec;
 
-				appendStringInfo(buf, "off %u", xlrec->offnum);
+				appendStringInfo(buf, "off %u; postingsz %u",
+								 xlrec->offnum, xlrec->postingsz);
 				break;
 			}
 		case XLOG_BTREE_SPLIT_L:
@@ -38,6 +39,7 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
+				/* FIXME: even master doesn't have newitemoff */
 				appendStringInfo(buf, "level %u, firstright %d",
 								 xlrec->level, xlrec->firstright);
 				break;
@@ -46,8 +48,10 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "lastBlockVacuumed %u",
-								 xlrec->lastBlockVacuumed);
+				appendStringInfo(buf, "lastBlockVacuumed %u; nremaining %u; ndeleted %u",
+								 xlrec->lastBlockVacuumed,
+								 xlrec->nremaining,
+								 xlrec->ndeleted);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 52eafe6b00..3aa09744e0 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
  * All other types of index tuples ("pivot" tuples) only have key columns,
  * since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,38 @@ typedef struct BTMetaPageData
  * omitted rather than truncated, since its representation is different to
  * the non-pivot representation.)
  *
+ * Non-pivot posting tuple format:
+ *  t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicated keys more effectively, we use special format
+ * of tuples - posting tuples.  posting_list is an array of ItemPointerData.
+ *
+ * Deduplication never applies to unique indexes or indexes with INCLUDEd
+ * columns.
+ *
+ * To differ posting tuples we use INDEX_ALT_TID_MASK flag in t_info and
+ * BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's tid:
+ * - t_tid.ip_blkid contains offset of the posting list.
+ * - t_tid offset field contains number of posting items this tuple contain
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * tuples, which is constrainted by BTMaxItemSize.
+
+ * If page contains so many duplicates, that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), page may contain several posting
+ * tuples with the same key.
+ * Also page can contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
  * Pivot tuple format:
  *
  *  t_tid | t_info | key values | [heap TID]
@@ -281,23 +312,118 @@ typedef struct BTMetaPageData
  * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
  * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
  * number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinct posting tuples from pivot tuples.
  *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_N_POSTING_OFFSET_MASK	0x0FFF
 #define BT_HEAP_TID_ATTR			0x1000
+#define BT_IS_POSTING				0x2000
 
-/* Get/set downlink block number */
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger then
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+	((int) ((BLCKSZ - SizeOfPageHeaderData - \
+			3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+			(sizeof(ItemPointerData)))
+
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * Iterating over tuples during index build or applying deduplication to a
+ * single page, we remember a tuple in itupprev, then compare the next one
+ * with it.  If tuples are equal, save their TIDs in the posting list.
+ * ntuples contains the size of the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTDedupState
+{
+	Size		maxitemsize;
+	Size		maxpostingsize;
+	IndexTuple	itupprev;
+	int			ntuples;
+	ItemPointerData *ipd;
+} BTDedupState;
+
+/*
+ * N.B.: BTreeTupleIsPivot() should only be used in code that deals with
+ * heapkeyspace indexes specifically.  BTreeTupleIsPosting() works with all
+ * nbtree indexes, though.
+ */
+#define BTreeTupleIsPivot(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+	)
+#define BTreeTupleIsPosting(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+	)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleGetNPosting(itup)	\
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+	)
+#define BTreeTupleSetNPosting(itup, n) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+	} while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+	)
+#define BTreeSetPostingMeta(itup, nposting, off) \
+	do { \
+		BTreeTupleSetNPosting(itup, nposting); \
+		Assert(BTreeTupleIsPosting(itup)); \
+		ItemPointerSetBlockNumber(&((itup)->t_tid), (off)); \
+	} while(0)
+
+#define BTreeTupleGetPosting(itup) \
+	(ItemPointer) ((char*) (itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+	(BTreeTupleGetPosting(itup) + (n))
+
+/* Get/set downlink block number  */
 #define BTreeInnerTupleGetDownLink(itup) \
 	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
 #define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,40 +452,73 @@ typedef struct BTMetaPageData
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
 		( \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
 		IndexRelationGetNumberOfAttributes(rel) \
 	)
-#define BTreeTupleSetNAtts(itup, n) \
-	do { \
-		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
-	} while(0)
+
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int n)
+{
+	Assert(!BTreeTupleIsPosting(itup));
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	ItemPointerSetOffsetNumber(&itup->t_tid, n & BT_N_KEYS_OFFSET_MASK);
+}
 
 /*
- * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Get tiebreaker heap TID attribute, if any.  Works with both pivot and
+ * non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
  */
-#define BTreeTupleGetHeapTID(itup) \
-	( \
-	  (itup)->t_info & INDEX_ALT_TID_MASK && \
-	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
-	  ( \
-		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
-					   sizeof(ItemPointerData)) \
-	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
-	)
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+	if (BTreeTupleIsPivot(itup))
+	{
+		/* Pivot tuple heap TID representation? */
+		if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+			 BT_HEAP_TID_ATTR) != 0)
+			return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+								  sizeof(ItemPointerData));
+
+		/* Heap TID attribute was truncated */
+		return NULL;
+	}
+	else if (BTreeTupleIsPosting(itup))
+		return BTreeTupleGetPosting(itup);
+
+	return &(itup->t_tid);
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list tuple.  Works with
+ * non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxTID(IndexTuple itup)
+{
+	Assert(!BTreeTupleIsPivot(itup));
+
+	if (BTreeTupleIsPosting(itup))
+		return (ItemPointer) (BTreeTupleGetPosting(itup) +
+							  (BTreeTupleGetNPosting(itup) - 1));
+
+	return &(itup->t_tid);
+}
+
 /*
  * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * representation
  */
 #define BTreeTupleSetAltHeapTID(itup) \
 	do { \
-		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(BTreeTupleIsPivot(itup)); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
 								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
 	} while(0)
@@ -499,6 +658,13 @@ typedef struct BTInsertStateData
 	/* Buffer containing leaf page we're likely to insert itup on */
 	Buffer		buf;
 
+	/*
+	 * if _bt_binsrch_insert() found the location inside existing posting
+	 * list, save the position inside the list.  This will be -1 in rare cases
+	 * where the overlapping posting list is LP_DEAD.
+	 */
+	int			in_posting_offset;
+
 	/*
 	 * Cache of bounds within the current buffer.  Only used for insertions
 	 * where _bt_check_unique is called.  See _bt_binsrch_insert and
@@ -534,7 +700,9 @@ typedef BTInsertStateData *BTInsertState;
  * If we are doing an index-only scan, we save the entire IndexTuple for each
  * matched item, otherwise only its heap TID and offset.  The IndexTuples go
  * into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array.  Posting list tuples store a version of the
+ * tuple that does not include the posting list, allowing the same key to be
+ * returned for each logical tuple associated with the posting list.
  */
 
 typedef struct BTScanPosItem	/* what we remember about each match */
@@ -563,9 +731,13 @@ typedef struct BTScanPosData
 
 	/*
 	 * If we are doing an index-only scan, nextTupleOffset is the first free
-	 * location in the associated tuple storage workspace.
+	 * location in the associated tuple storage workspace.  Posting list
+	 * tuples need postingTupleOffset to store the current location of the
+	 * tuple that is returned multiple times (once per heap TID in posting
+	 * list).
 	 */
 	int			nextTupleOffset;
+	int			postingTupleOffset;
 
 	/*
 	 * The items array is always ordered in index order (ie, increasing
@@ -578,7 +750,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxPostingIndexTuplesPerPage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -762,6 +934,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems, Relation heapRel);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems,
+								OffsetNumber *remainingoffset,
+								IndexTuple *remaining, int nremaining,
 								BlockNumber lastBlockVacuumed);
 extern int	_bt_pagedel(Relation rel, Buffer buf);
 
@@ -812,6 +986,9 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd,
+										int nipd);
+extern IndexTuple BTreeGetNthTupleOfPosting(IndexTuple tuple, int n);
 
 /*
  * prototypes for functions in nbtvalidate.c
@@ -824,5 +1001,6 @@ extern bool btvalidate(Oid opclassoid);
 extern IndexBuildResult *btbuild(Relation heap, Relation index,
 								 struct IndexInfo *indexInfo);
 extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+extern void _bt_stash_item_tid(BTDedupState *dedupState, IndexTuple itup);
 
 #endif							/* NBTREE_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index afa614da25..daa931377f 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -61,16 +61,26 @@ typedef struct xl_btree_metadata
  * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
  * Note that INSERT_META implies it's not a leaf page.
  *
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page (data contains the inserted tuple);
+ *				 if postingsz is not 0, data also contains 'nposting' -
+ *				 tuple to replace original.
+ *
+ *				 TODO probably it would be enough to keep just a flag to point
+ *				 out that data contains 'nposting' and compute its offset as
+ *				 we know it follows the tuple, but I am afraid that it will
+ *				 break alignment, will it?
+ *
  * Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
  * Backup Blk 2: xl_btree_metadata, if INSERT_META
+ *
  */
 typedef struct xl_btree_insert
 {
 	OffsetNumber offnum;
+	uint32		postingsz;
 } xl_btree_insert;
 
-#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, postingsz) + sizeof(uint32))
 
 /*
  * On insert with split, we save all the items going into the right sibling
@@ -96,6 +106,12 @@ typedef struct xl_btree_insert
  * An IndexTuple representing the high key of the left page must follow with
  * either variant.
  *
+ * In case, split included insertion into the middle of the posting tuple, and
+ * thus required posting tuple replacement, it also contains 'nposting',
+ * which must replace original posting tuple at replaceitemoff offset.
+ * TODO further optimization is to add it to xlog only if it remains on the
+ * left page.
+ *
  * Backup Blk 1: new right page
  *
  * The right page's data portion contains the right page's tuples in the form
@@ -113,9 +129,10 @@ typedef struct xl_btree_split
 	uint32		level;			/* tree level of page being split */
 	OffsetNumber firstright;	/* first item moved to right page */
 	OffsetNumber newitemoff;	/* new item's offset (if placed on left page) */
+	OffsetNumber replacepostingoff; /* offset of the posting item to replace */
 } xl_btree_split;
 
-#define SizeOfBtreeSplit	(offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit	(offsetof(xl_btree_split, replacepostingoff) + sizeof(OffsetNumber))
 
 /*
  * This is what we need to know about delete of individual leaf index tuples.
@@ -173,10 +190,19 @@ typedef struct xl_btree_vacuum
 {
 	BlockNumber lastBlockVacuumed;
 
-	/* TARGET OFFSET NUMBERS FOLLOW */
+	/*
+	 * This field helps us to find beginning of the remaining tuples from
+	 * postings which follow array of offset numbers.
+	 */
+	uint32		nremaining;
+	uint32		ndeleted;
+
+	/* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+	/* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
+	/* TARGET OFFSET NUMBERS FOLLOW (if any) */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(BlockNumber))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.
diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index ec47a228ae..71a03e39d3 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -212,3 +212,24 @@
    Memcheck:Cond
    fun:PyObject_Realloc
 }
+
+# Temporarily work around bug in datum_image_eq's handling of the cstring
+# (typLen == -2) case.  datumIsEqual() is not affected, but also doesn't handle
+# TOAST'ed values correctly.
+#
+# FIXME: Remove both suppressions when bug is fixed on master branch
+{
+   temporary_workaround_1
+   Memcheck:Addr1
+   fun:bcmp
+   fun:datum_image_eq
+   fun:_bt_keep_natts_fast
+}
+
+{
+   temporary_workaround_8
+   Memcheck:Addr8
+   fun:bcmp
+   fun:datum_image_eq
+   fun:_bt_keep_natts_fast
+}
-- 
2.17.1

v11-0002-DEBUG-Add-pageinspect-instrumentation.patchapplication/octet-stream; name=v11-0002-DEBUG-Add-pageinspect-instrumentation.patchDownload

From 3e6bd467c0a784962af6c1b00ac5563765901a6d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v11 2/2] DEBUG: Add pageinspect instrumentation.

Have pageinspect display user-visible attribute values, heap TID, max
heap TID, and the number of TIDs in a tuple (can be > 1 in the case of
posting list tuples).  Also adds a column that shows whether or not the
LP_DEAD bit has been set.

This patch is not proposed for inclusion in PostgreSQL; it's included
for the convenience of reviewers.

The following query can be used with this hacked pageinspect, which
visualizes the internal pages:

"""

with recursive index_details as (
  select
    'my_test_index'::text idx
),
size_in_pages_index as (
  select
    (pg_relation_size(idx::regclass) / (2^13))::int4 size_pages
  from
    index_details
),
page_stats as (
  select
    index_details.*,
    stats.*
  from
    index_details,
    size_in_pages_index,
    lateral (select i from generate_series(1, size_pages - 1) i) series,
    lateral (select * from bt_page_stats(idx, i)) stats),
internal_page_stats as (
  select
    *
  from
    page_stats
  where
    type != 'l'),
meta_stats as (
  select
    *
  from
    index_details s,
    lateral (select * from bt_metap(s.idx)) meta),
internal_items as (
  select
    *
  from
    internal_page_stats
  order by
    btpo desc),
-- XXX: Note ordering dependency within this CTE, on internal_items
ordered_internal_items(item, blk, level) as (
  select
    1,
    blkno,
    btpo
  from
    internal_items
  where
    btpo_prev = 0
    and btpo = (select level from meta_stats)
  union
  select
    case when level = btpo then o.item + 1 else 1 end,
    blkno,
    btpo
  from
    internal_items i,
    ordered_internal_items o
  where
    i.btpo_prev = o.blk or (btpo_prev = 0 and btpo = o.level - 1)
)
select
  --idx,
  btpo as level,
  item as l_item,
  blkno,
  --btpo_prev,
  --btpo_next,
  btpo_flags,
  type,
  live_items,
  dead_items,
  avg_item_size,
  page_size,
  free_size,
  -- Only non-rightmost pages have high key.  Show heap TID for both pivot and non-pivot tuples here.
  case when btpo_next != 0 then (select data || coalesce(', (htid)=(''' || htid || ''')', '')
                                 from bt_page_items(idx, blkno) where itemoffset = 1) end as highkey
from
  ordered_internal_items o
  join internal_items i on o.blk = i.blkno
order by btpo desc, item;
"""
---
 contrib/pageinspect/btreefuncs.c              | 91 ++++++++++++++++---
 contrib/pageinspect/expected/btree.out        |  6 +-
 contrib/pageinspect/pageinspect--1.6--1.7.sql | 25 +++++
 3 files changed, 108 insertions(+), 14 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 8d27c9b0f6..b3ea978117 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -29,6 +29,7 @@
 
 #include "pageinspect.h"
 
+#include "access/genam.h"
 #include "access/nbtree.h"
 #include "access/relation.h"
 #include "catalog/namespace.h"
@@ -243,6 +244,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
  */
 struct user_args
 {
+	Relation	rel;
 	Page		page;
 	OffsetNumber offset;
 };
@@ -254,9 +256,9 @@ struct user_args
  * ------------------------------------------------------
  */
 static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset, Relation rel)
 {
-	char	   *values[6];
+	char	   *values[10];
 	HeapTuple	tuple;
 	ItemId		id;
 	IndexTuple	itup;
@@ -265,6 +267,7 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	int			dlen;
 	char	   *dump;
 	char	   *ptr;
+	ItemPointer min_htid, max_htid;
 
 	id = PageGetItemId(page, offset);
 
@@ -283,16 +286,77 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
-	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
-	dump = palloc0(dlen * 3 + 1);
-	values[j] = dump;
-	for (off = 0; off < dlen; off++)
+	if (rel)
 	{
-		if (off > 0)
-			*dump++ = ' ';
-		sprintf(dump, "%02x", *(ptr + off) & 0xff);
-		dump += 2;
+		TupleDesc	itupdesc = RelationGetDescr(rel);
+		Datum		datvalues[INDEX_MAX_KEYS];
+		bool		isnull[INDEX_MAX_KEYS];
+		int			natts;
+		int			indnkeyatts = rel->rd_index->indnkeyatts;
+
+		natts = BTreeTupleGetNAtts(itup, rel);
+
+		itupdesc->natts = Min(indnkeyatts, natts);
+		memset(&isnull, 0xFF, sizeof(isnull));
+		index_deform_tuple(itup, itupdesc, datvalues, isnull);
+		rel->rd_index->indnkeyatts = natts;
+		values[j++] = BuildIndexValueDescription(rel, datvalues, isnull);
+		itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+		rel->rd_index->indnkeyatts = indnkeyatts;
 	}
+	else
+	{
+		dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+		dump = palloc0(dlen * 3 + 1);
+		values[j++] = dump;
+		for (off = 0; off < dlen; off++)
+		{
+			if (off > 0)
+				*dump++ = ' ';
+			sprintf(dump, "%02x", *(ptr + off) & 0xff);
+			dump += 2;
+		}
+	}
+
+	if (rel && !_bt_heapkeyspace(rel))
+	{
+		min_htid = NULL;
+		max_htid = NULL;
+	}
+	else
+	{
+		min_htid = BTreeTupleGetHeapTID(itup);
+		if (BTreeTupleIsPosting(itup))
+			max_htid = BTreeTupleGetMaxTID(itup);
+		else
+			max_htid = NULL;
+	}
+
+	if (min_htid)
+		values[j++] = psprintf("(%u,%u)",
+							 ItemPointerGetBlockNumberNoCheck(min_htid),
+							 ItemPointerGetOffsetNumberNoCheck(min_htid));
+	else
+		values[j++] = NULL;
+
+	if (max_htid)
+		values[j++] = psprintf("(%u,%u)",
+							 ItemPointerGetBlockNumberNoCheck(max_htid),
+							 ItemPointerGetOffsetNumberNoCheck(max_htid));
+	else
+		values[j++] = NULL;
+
+	if (min_htid == NULL)
+		values[j++] = psprintf("0");
+	else if (!BTreeTupleIsPosting(itup))
+		values[j++] = psprintf("1");
+	else
+		values[j++] = psprintf("%d", (int) BTreeTupleGetNPosting(itup));
+
+	if (!ItemIdIsDead(id))
+		values[j++] = psprintf("f");
+	else
+		values[j++] = psprintf("t");
 
 	tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
 
@@ -366,11 +430,11 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = rel;
 		uargs->page = palloc(BLCKSZ);
 		memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
 
 		UnlockReleaseBuffer(buffer);
-		relation_close(rel, AccessShareLock);
 
 		uargs->offset = FirstOffsetNumber;
 
@@ -397,12 +461,13 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, uargs->rel);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
 	else
 	{
+		relation_close(uargs->rel, AccessShareLock);
 		pfree(uargs->page);
 		pfree(uargs);
 		SRF_RETURN_DONE(fctx);
@@ -482,7 +547,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, NULL);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..0f6dccaadc 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -40,7 +40,11 @@ ctid       | (0,1)
 itemlen    | 16
 nulls      | f
 vars       | f
-data       | 01 00 00 00 00 00 00 01
+data       | (a)=(72057594037927937)
+htid       | (0,1)
+max_htid   | 
+nheap_tids | 1
+isdead     | f
 
 SELECT * FROM bt_page_items('test1_a_idx', 2);
 ERROR:  block number out of range
diff --git a/contrib/pageinspect/pageinspect--1.6--1.7.sql b/contrib/pageinspect/pageinspect--1.6--1.7.sql
index 2433a21af2..00473da938 100644
--- a/contrib/pageinspect/pageinspect--1.6--1.7.sql
+++ b/contrib/pageinspect/pageinspect--1.6--1.7.sql
@@ -24,3 +24,28 @@ CREATE FUNCTION bt_metap(IN relname text,
     OUT last_cleanup_num_tuples real)
 AS 'MODULE_PATHNAME', 'bt_metap'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items()
+--
+DROP FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT htid tid,
+    OUT max_htid tid,
+    OUT nheap_tids int4,
+    OUT isdead boolean)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
-- 
2.17.1

#80

Anastasia Lubennikova

a.lubennikova@postgrespro.ru

over 6 years ago

In reply to: Peter Geoghegan (#79)

1 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

09.09.2019 22:54, Peter Geoghegan wrote:

Attached is v11, which makes the kill_prior_tuple optimization work
with posting list tuples. The only catch is that it can only work when
all "logical tuples" within a posting list are known-dead, since of
course there is only one LP_DEAD bit available for each posting list.

The hardest part of this kill_prior_tuple work was writing the new
_bt_killitems() code, which I'm still not 100% happy with. Still, it
seems to work well -- new pageinspect LP_DEAD status info was added to
the second patch to verify that we're setting LP_DEAD bits as needed
for posting list tuples. I also had to add a new nbtree-specific,
posting-list-aware version of index_compute_xid_horizon_for_tuples()
-- _bt_compute_xid_horizon_for_tuples(). Finally, it was necessary to
avoid splitting a posting list with the LP_DEAD bit set. I took a
naive approach to avoiding that problem, adding code to
_bt_findinsertloc() to prevent it. Posting list splits are generally
assumed to be rare, so the fact that this is slightly inefficient
should be fine IMV.

I also refactored deduplication itself in anticipation of making the
WAL logging more efficient, and incremental. So, the structure of the
code within _bt_dedup_one_page() was simplified, without really
changing it very much (I think). I also fixed a bug in
_bt_dedup_one_page(). The check for dead items was broken in previous
versions, because the loop examined the high key tuple in every
iteration.

Making _bt_dedup_one_page() more efficient and incremental is still
the most important open item for the patch.

Hi, thank you for the fixes and improvements.
I reviewed them and everything looks good except the idea of not
splitting dead posting tuples.
According to comments to scan->ignore_killed_tuples in genam.c:107,
it may lead to incorrect tuple order on a replica.
I don't sure, if it leads to any real problem, though, or it will be
resolved
by subsequent visibility checks. Anyway, it's worth to add more comments in
_bt_killitems() explaining why it's safe.

Attached is v12, which contains WAL optimizations for posting split and
page
deduplication. Changes to prior version:

* xl_btree_split record doesn't contain posting tuple anymore, instead
it keeps
'in_posting offset' and repeats the logic of _bt_insertonpg() as you
proposed
upthread.

* I introduced new xlog record XLOG_BTREE_DEDUP_PAGE, which contains
info about
groups of tuples deduplicated into posting tuples. In principle, it is
possible
to fit it into some existing record, but I preferred to keep things clear.

I haven't measured how these changes affect WAL size yet.
Do you have any suggestions on how to automate testing of new WAL records?
Is there any suitable place in regression tests?

* I also noticed that _bt_dedup_one_page() can be optimized to return early
when none tuples were deduplicated. I wonder if we can introduce inner
statistic to tune deduplication? That is returning to the idea of
BT_COMPRESS_THRESHOLD, which can help to avoid extra work for pages that
have
very few duplicates or pages that are already full of posting lists.

To be honest, I don't believe that incremental deduplication can really
improve
something, because no matter how many items were compressed we still
rewrite
all items from the original page to the new one, so, why not do our best.
What do we save by this incremental approach?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

v12-0001-Add-deduplication-to-nbtree.patchtext/x-patch; name=v12-0001-Add-deduplication-to-nbtree.patchDownload

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d67..399743d 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -924,6 +924,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer	scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -994,29 +995,73 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
 			char	   *itid,
 					   *htid;
+			ItemPointer tid = BTreeTupleGetHeapTID(itup);
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+							ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("could not find tuple using search from root page in index \"%s\"",
 							RelationGetRelationName(state->rel)),
-					 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+					 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
 										itid, htid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is actually a posting list, make sure posting list TIDs
+		 * are in order.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer		current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+				current = BTreeTupleGetPostingN(itup, i);
+
+				if (ItemPointerCompare(current, &last) <= 0)
+				{
+					char	   *itid,
+							   *htid;
+
+					itid = psprintf("(%u,%u)", state->targetblock, offset);
+					htid = psprintf("(%u,%u)",
+									ItemPointerGetBlockNumberNoCheck(current),
+									ItemPointerGetOffsetNumberNoCheck(current));
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+												itid, htid,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1074,12 +1119,33 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				IndexTuple	onetup;
+
+				/* Fingerprint all elements of posting tuple one by one */
+				for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+					norm = bt_normalize_tuple(state, onetup);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != onetup)
+						pfree(norm);
+					pfree(onetup);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
@@ -1087,7 +1153,8 @@ bt_target_page_check(BtreeCheckState *state)
 		 *
 		 * If there is a high key (if this is not the rightmost page on its
 		 * entire level), check that high key actually is upper bound on all
-		 * page items.
+		 * page items.  If this is a posting list tuple, we'll need to set
+		 * scantid to be highest TID in posting list.
 		 *
 		 * We prefer to check all items against high key rather than checking
 		 * just the last and trusting that the operator class obeys the
@@ -1127,6 +1194,9 @@ bt_target_page_check(BtreeCheckState *state)
 		 * tuple. (See also: "Notes About Data Representation" in the nbtree
 		 * README.)
 		 */
+		scantid = skey->scantid;
+		if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+			skey->scantid = BTreeTupleGetMaxTID(itup);
 		if (!P_RIGHTMOST(topaque) &&
 			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
 			  invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1220,7 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
+		skey->scantid = scantid;
 
 		/*
 		 * * Item order check *
@@ -1164,11 +1235,13 @@ bt_target_page_check(BtreeCheckState *state)
 					   *htid,
 					   *nitid,
 					   *nhtid;
+			ItemPointer tid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetHeapTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 			nitid = psprintf("(%u,%u)", state->targetblock,
 							 OffsetNumberNext(offset));
 
@@ -1177,9 +1250,11 @@ bt_target_page_check(BtreeCheckState *state)
 										  state->target,
 										  OffsetNumberNext(offset));
 			itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+			tid = BTreeTupleGetHeapTID(itup);
 			nhtid = psprintf("(%u,%u)",
-							 ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							 ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							 ItemPointerGetBlockNumberNoCheck(tid),
+							 ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1264,10 @@ bt_target_page_check(BtreeCheckState *state)
 										"higher index tid=%s (points to %s tid=%s) "
 										"page lsn=%X/%X.",
 										itid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										htid,
 										nitid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										nhtid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
@@ -1953,10 +2028,10 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
  * verification.  In particular, it won't try to normalize opclass-equal
  * datums with potentially distinct representations (e.g., btree/numeric_ops
  * index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though.  For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation.  Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
  */
 static IndexTuple
 bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -2087,6 +2162,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		insertstate.itup = itup;
 		insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 		insertstate.itup_key = key;
+		insertstate.in_posting_offset = 0;
 		insertstate.bounds_valid = false;
 		insertstate.buf = lbuf;
 
@@ -2094,7 +2170,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		offnum = _bt_binsrch_insert(state->rel, &insertstate);
 		/* Compare first >= matching item on leaf page, if any */
 		page = BufferGetPage(lbuf);
+		/* Should match on first heap TID when tuple has a posting list */
 		if (offnum <= PageGetMaxOffsetNumber(page) &&
+			insertstate.in_posting_offset <= 0 &&
 			_bt_compare(state->rel, key, page, offnum) == 0)
 			exists = true;
 		_bt_relbuf(state->rel, lbuf);
@@ -2560,14 +2638,18 @@ static inline ItemPointer
 BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
 							bool nonpivot)
 {
-	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	ItemPointer result;
 	BlockNumber targetblock = state->targetblock;
 
-	if (result == NULL && nonpivot)
+	/* Shouldn't be called with heapkeyspace index */
+	Assert(state->heapkeyspace);
+	if (BTreeTupleIsPivot(itup) == nonpivot)
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
 						targetblock, RelationGetRelationName(state->rel))));
 
+	result = BTreeTupleGetHeapTID(itup);
+
 	return result;
 }
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6db203e..50ec9ef 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
 like a hint bit for a heap tuple), but physically removing tuples requires
 exclusive lock.  In the current code we try to remove LP_DEAD tuples when
 we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already).  Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)
 
 This leaves the index in a state where it has no entry for a dead tuple
 that still exists in the heap.  This is not a problem for the current
@@ -710,6 +713,77 @@ the fallback strategy assumes that duplicates are mostly inserted in
 ascending heap TID order.  The page is split in a way that leaves the left
 half of the page mostly full, and the right half of the page mostly empty.
 
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits.  Deduplication alters
+the physical representation of tuples without changing the logical contents
+of the index, and without adding overhead to read queries.  Non-pivot
+tuples are folded together into a single physical tuple with a posting list
+(a simple array of heap TIDs with the standard item pointer format).
+Deduplication is always applied lazily, at the point where it would
+otherwise be necessary to perform a page split.  It occurs only when
+LP_DEAD items have been removed, as our last line of defense against
+splitting a leaf page.  We can set the LP_DEAD bit with posting list
+tuples, though only when all table tuples are known dead. (Bitmap scans
+cannot perform LP_DEAD bit setting, and are the common case with indexes
+that contain lots of duplicates, so this downside is considered
+acceptable.)
+
+Large groups of logical duplicates tend to appear together on the same leaf
+page due to the special duplicate logic used when choosing a split point.
+This facilitates lazy/dynamic deduplication.  Deduplication can reliably
+deduplicate a large localized group of duplicates before it can span
+multiple leaf pages.  Posting list tuples are subject to the same 1/3 of a
+page restriction as any other tuple.
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists.  A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple.  An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication.  Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple (lazy deduplication
+avoids rewriting posting lists repeatedly when heap TIDs are inserted
+slightly out of order by concurrent inserters).  When the incoming tuple
+really does overlap with an existing posting list, a posting list split is
+performed.  Posting list splits work in a way that more or less preserves
+the illusion that all incoming tuples do not need to be merged with any
+existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list.  The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list.  The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list.  We make
+space in the posting list by removing the heap TID that became the new
+item.  The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists.  Also, overall
+space utilization is improved by keeping existing posting lists large.
+
+The representation of posting lists is identical to the posting lists used
+by GIN, so it would be straightforward to apply GIN's varbyte encoding
+compression scheme to individual posting lists.  Posting list compression
+would break the assumptions made by posting list splits about page space
+accounting, though, so it's not clear how compression could be integrated
+with nbtree.  Besides, posting list compression does not offer a compelling
+trade-off for nbtree, since in general nbtree is optimized for consistent
+performance with many concurrent readers and writers.  A major goal of
+nbtree's lazy approach to deduplication is to limit the performance impact
+of deduplication with random updates.  Even concurrent append-only inserts
+of the same key value will tend to have inserts of individual index tuples
+in an order that doesn't quite match heap TID order.  In general, delaying
+deduplication avoids many unnecessary posting list splits, and minimizes
+page level fragmentation.
+
 Notes About Data Representation
 -------------------------------
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b84bf1c..8fb17d6 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,21 +47,26 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   BTStack stack,
 						   IndexTuple itup,
 						   OffsetNumber newitemoff,
+						   int in_posting_offset,
 						   bool split_only_page);
 static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 						Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
-						IndexTuple newitem);
+						IndexTuple newitem, IndexTuple nposting,
+						OffsetNumber in_posting_offset);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 							  BTStack stack, bool is_root, bool is_only);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 						 OffsetNumber itup_off);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+							   Size itemsz);
 
 /*
  *	_bt_doinsert() -- Handle insertion of a single index tuple in the tree.
  *
  *		This routine is called by the public interface routine, btinsert.
- *		By here, itup is filled in, including the TID.
+ *		By here, itup is filled in, including the TID.  Caller should be
+ *		prepared for us to scribble on 'itup'.
  *
  *		If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
  *		will allow duplicates.  Otherwise (UNIQUE_CHECK_YES or
@@ -123,6 +128,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	/* PageAddItem will MAXALIGN(), but be consistent */
 	insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 	insertstate.itup_key = itup_key;
+	insertstate.in_posting_offset = 0;
 	insertstate.bounds_valid = false;
 	insertstate.buf = InvalidBuffer;
 
@@ -300,7 +306,7 @@ top:
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
 		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
-					   itup, newitemoff, false);
+					   itup, newitemoff, insertstate.in_posting_offset, false);
 	}
 	else
 	{
@@ -435,6 +441,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 
 				/* okay, we gotta fetch the heap tuple ... */
 				curitup = (IndexTuple) PageGetItem(page, curitemid);
+				Assert(!BTreeTupleIsPosting(curitup));
 				htid = curitup->t_tid;
 
 				/*
@@ -689,6 +696,7 @@ _bt_findinsertloc(Relation rel,
 	BTScanInsert itup_key = insertstate->itup_key;
 	Page		page = BufferGetPage(insertstate->buf);
 	BTPageOpaque lpageop;
+	OffsetNumber location;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -751,13 +759,23 @@ _bt_findinsertloc(Relation rel,
 
 		/*
 		 * If the target page is full, see if we can obtain enough space by
-		 * erasing LP_DEAD items
+		 * erasing LP_DEAD items.  If that doesn't work out, and if the index
+		 * isn't a unique index, try deduplication.
 		 */
-		if (PageGetFreeSpace(page) < insertstate->itemsz &&
-			P_HAS_GARBAGE(lpageop))
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
 		{
-			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
-			insertstate->bounds_valid = false;
+			if (P_HAS_GARBAGE(lpageop))
+			{
+				_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+				insertstate->bounds_valid = false;
+			}
+
+			if (!checkingunique && PageGetFreeSpace(page) < insertstate->itemsz)
+			{
+				_bt_dedup_one_page(rel, insertstate->buf, heapRel,
+								   insertstate->itemsz);
+				insertstate->bounds_valid = false;	/* paranoia */
+			}
 		}
 	}
 	else
@@ -839,7 +857,31 @@ _bt_findinsertloc(Relation rel,
 	Assert(P_RIGHTMOST(lpageop) ||
 		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
 
-	return _bt_binsrch_insert(rel, insertstate);
+	location = _bt_binsrch_insert(rel, insertstate);
+
+	/*
+	 * Insertion is not prepared for the case where an LP_DEAD posting list
+	 * tuple must be split.  In the unlikely event that this happens, call
+	 * _bt_dedup_one_page() to force it to kill all LP_DEAD items.
+	 */
+	if (unlikely(insertstate->in_posting_offset == -1))
+	{
+		_bt_dedup_one_page(rel, insertstate->buf, heapRel, 0);
+		Assert(!P_HAS_GARBAGE(lpageop));
+
+		/* Must reset insertstate ahead of new _bt_binsrch_insert() call */
+		insertstate->bounds_valid = false;
+		insertstate->in_posting_offset = 0;
+		location = _bt_binsrch_insert(rel, insertstate);
+
+		/*
+		 * Might still have to split some other posting list now, but that
+		 * should never be LP_DEAD
+		 */
+		Assert(insertstate->in_posting_offset >= 0);
+	}
+
+	return location;
 }
 
 /*
@@ -900,15 +942,65 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
 	insertstate->bounds_valid = false;
 }
 
+/*
+ * If the new tuple 'itup' is a duplicate with a heap TID that falls inside
+ * the range of an existing posting list tuple 'oposting', generate new
+ * posting tuple to replace original one and update new tuple so that
+ * it's heap TID contains the rightmost heap TID of original posting tuple.
+ */
+IndexTuple
+_bt_form_newposting(IndexTuple itup, IndexTuple oposting,
+				   OffsetNumber in_posting_offset)
+{
+	int			nipd;
+	char	   *replacepos;
+	char	   *rightpos;
+	Size		nbytes;
+	IndexTuple nposting;
+
+	Assert(BTreeTupleIsPosting(oposting));
+	nipd = BTreeTupleGetNPosting(oposting);
+	Assert(in_posting_offset < nipd);
+
+	nposting = CopyIndexTuple(oposting);
+	replacepos = (char *) BTreeTupleGetPostingN(nposting, in_posting_offset);
+	rightpos = replacepos + sizeof(ItemPointerData);
+	nbytes = (nipd - in_posting_offset - 1) * sizeof(ItemPointerData);
+
+	/*
+	 * Move item pointers in posting list to make a gap for the new item's
+	 * heap TID (shift TIDs one place to the right, losing original
+	 * rightmost TID).
+	 */
+	memmove(rightpos, replacepos, nbytes);
+
+	/*
+	 * Fill the gap with the TID of the new item.
+	 */
+	ItemPointerCopy(&itup->t_tid, (ItemPointer) replacepos);
+
+	/*
+	 * Copy original (not new original) posting list's last TID into new
+	 * item
+	 */
+	ItemPointerCopy(BTreeTupleGetPostingN(oposting, nipd - 1), &itup->t_tid);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxTID(nposting),
+								BTreeTupleGetHeapTID(itup)) < 0);
+
+	return nposting;
+}
+
 /*----------
  *	_bt_insertonpg() -- Insert a tuple on a particular page in the index.
  *
  *		This recursive procedure does the following things:
  *
+ *			+  if necessary, splits an existing posting list on page.
+ *			   This is only needed when 'in_posting_offset' is non-zero.
  *			+  if necessary, splits the target page, using 'itup_key' for
  *			   suffix truncation on leaf pages (caller passes NULL for
  *			   non-leaf pages).
- *			+  inserts the tuple.
+ *			+  inserts the new tuple (could be from split posting list).
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
  *			   right using information stored in the parent stack).
@@ -918,7 +1010,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		On entry, we must have the correct buffer in which to do the
  *		insertion, and the buffer must be pinned and write-locked.  On return,
- *		we will have dropped both the pin and the lock on the buffer.
+ *		we will have dropped both the pin and the lock on the buffer.  Caller
+ *		should be prepared for us to scribble on 'itup'.
  *
  *		This routine only performs retail tuple insertions.  'itup' should
  *		always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +1029,14 @@ _bt_insertonpg(Relation rel,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
+			   int in_posting_offset,
 			   bool split_only_page)
 {
 	Page		page;
 	BTPageOpaque lpageop;
 	Size		itemsz;
+	IndexTuple	nposting = NULL;
+	IndexTuple	oposting;
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1050,8 @@ _bt_insertonpg(Relation rel,
 	Assert(P_ISLEAF(lpageop) ||
 		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
+	/* retail insertions of posting list tuples are disallowed */
+	Assert(!BTreeTupleIsPosting(itup));
 
 	/* The caller should've finished any incomplete splits already. */
 	if (P_INCOMPLETE_SPLIT(lpageop))
@@ -965,6 +1063,42 @@ _bt_insertonpg(Relation rel,
 								 * need to be consistent */
 
 	/*
+	 * Do we need to split an existing posting list item?
+	 */
+	if (in_posting_offset != 0)
+	{
+		ItemId		itemid = PageGetItemId(page, newitemoff);
+
+		/*
+		 * The new tuple is a duplicate with a heap TID that falls inside the
+		 * range of an existing posting list tuple, so split posting list.
+		 *
+		 * Posting list splits always replace some existing TID in the posting
+		 * list with the new item's heap TID (based on a posting list offset
+		 * from caller) by removing rightmost heap TID from posting list.  The
+		 * new item's heap TID is swapped with that rightmost heap TID, almost
+		 * as if the tuple inserted never overlapped with a posting list in
+		 * the first place.  This allows the insertion and page split code to
+		 * have minimal special case handling of posting lists.
+		 *
+		 * The only extra handling required is to overwrite the original
+		 * posting list with nposting, which is guaranteed to be the same size
+		 * as the original, keeping the page space accounting simple.  This
+		 * takes place in either the page insert or page split critical
+		 * section.
+		 */
+		Assert(P_ISLEAF(lpageop));
+		Assert(!ItemIdIsDead(itemid));
+		Assert(in_posting_offset > 0);
+		oposting = (IndexTuple) PageGetItem(page, itemid);
+
+		nposting = _bt_form_newposting(itup, oposting, in_posting_offset);
+
+		/* Alter new item offset, since effective new item changed */
+		newitemoff = OffsetNumberNext(newitemoff);
+	}
+
+	/*
 	 * Do we need to split the page to fit the item on it?
 	 *
 	 * Note: PageGetFreeSpace() subtracts sizeof(ItemIdData) from its result,
@@ -996,7 +1130,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+						 nposting, in_posting_offset);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1075,6 +1210,18 @@ _bt_insertonpg(Relation rel,
 			elog(PANIC, "failed to add new item to block %u in index \"%s\"",
 				 itup_blkno, RelationGetRelationName(rel));
 
+		if (nposting)
+		{
+			/*
+			 * Handle a posting list split by performing an in-place update of
+			 * the existing posting list
+			 */
+			Assert(P_ISLEAF(lpageop));
+			Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+				   MAXALIGN(IndexTupleSize(nposting)));
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+		}
+
 		MarkBufferDirty(buf);
 
 		if (BufferIsValid(metabuf))
@@ -1116,6 +1263,7 @@ _bt_insertonpg(Relation rel,
 			XLogRecPtr	recptr;
 
 			xlrec.offnum = itup_off;
+			xlrec.in_posting_offset = in_posting_offset;
 
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
@@ -1153,6 +1301,9 @@ _bt_insertonpg(Relation rel,
 
 			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
 			XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+			if (nposting)
+				XLogRegisterBufData(0, (char *) nposting,
+									IndexTupleSize(nposting));
 
 			recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
@@ -1194,6 +1345,10 @@ _bt_insertonpg(Relation rel,
 			_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
 			RelationSetTargetBlock(rel, cachedBlock);
 	}
+
+	/* be tidy */
+	if (nposting)
+		pfree(nposting);
 }
 
 /*
@@ -1211,10 +1366,16 @@ _bt_insertonpg(Relation rel,
  *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
+ *
+ *		nposting is a replacement posting for the posting list at the
+ *		offset immediately before the new item's offset.  This is needed
+ *		when caller performed "posting list split", and corresponds to the
+ *		same step for retail insertions that don't split the page.
  */
 static Buffer
 _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+		  IndexTuple nposting, OffsetNumber in_posting_offset)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1236,6 +1397,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	OffsetNumber firstright;
 	OffsetNumber maxoff;
 	OffsetNumber i;
+	OffsetNumber replacepostingoff = InvalidOffsetNumber;
 	bool		newitemonleft,
 				isleaf;
 	IndexTuple	lefthikey;
@@ -1243,6 +1405,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
 	/*
+	 * Determine offset number of posting list that will be updated in place
+	 * as part of split that follows a posting list split
+	 */
+	if (nposting != NULL)
+		replacepostingoff = OffsetNumberPrev(newitemoff);
+
+	/*
 	 * origpage is the original page to be split.  leftpage is a temporary
 	 * buffer that receives the left-sibling data, which will be copied back
 	 * into origpage on success.  rightpage is the new page that will receive
@@ -1273,6 +1442,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	 * newitemoff == firstright.  In all other cases it's clear which side of
 	 * the split every tuple goes on from context.  newitemonleft is usually
 	 * (but not always) redundant information.
+	 *
+	 * Note: In theory, the split point choice logic should operate against a
+	 * version of the page that already replaced the posting list at offset
+	 * replacepostingoff with nposting where applicable.  We don't bother with
+	 * that, though.  Both versions of the posting list must be the same size
+	 * and have the same key values, so this omission can't affect the split
+	 * point chosen in practice.
 	 */
 	firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
 								  newitem, &newitemonleft);
@@ -1340,6 +1516,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, firstright);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		/* Behave as if origpage posting list has already been swapped */
+		if (firstright == replacepostingoff)
+			item = nposting;
 	}
 
 	/*
@@ -1373,6 +1552,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
 			itemid = PageGetItemId(origpage, lastleftoff);
 			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+			/* Behave as if origpage posting list has already been swapped */
+			if (lastleftoff == replacepostingoff)
+				lastleft = nposting;
 		}
 
 		Assert(lastleft != item);
@@ -1480,8 +1662,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 
+		/*
+		 * did caller pass new replacement posting list tuple due to posting
+		 * list split?
+		 */
+		if (i == replacepostingoff)
+		{
+			/*
+			 * swap origpage posting list with post-posting-list-split version
+			 * from caller
+			 */
+			Assert(isleaf);
+			Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+			item = nposting;
+		}
+
 		/* does new item belong before this one? */
-		if (i == newitemoff)
+		else if (i == newitemoff)
 		{
 			if (newitemonleft)
 			{
@@ -1653,6 +1850,17 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		xlrec.firstright = firstright;
 		xlrec.newitemoff = newitemoff;
 
+		/*
+		 * If replacing posting item was put on the right page,
+		 * we don't need to explicitly WAL log it because it's included
+		 * with all the other items on the right page.
+		 * Otherwise, save in_posting_offset and newitem to construct
+		 * replacing tuple.
+		 */
+		xlrec.in_posting_offset = InvalidOffsetNumber;
+		if (replacepostingoff < firstright)
+			xlrec.in_posting_offset = in_posting_offset;
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
 
@@ -1672,8 +1880,11 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		 * is not stored if XLogInsert decides it needs a full-page image of
 		 * the left page.  We store the offset anyway, though, to support
 		 * archive compression of these records.
+		 *
+		 * Also save newitem in case posting split was required
+		 * to construct new posting.
 		 */
-		if (newitemonleft)
+		if (newitemonleft || xlrec.in_posting_offset)
 			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
 
 		/* Log the left page's new high key */
@@ -1834,7 +2045,7 @@ _bt_insert_parent(Relation rel,
 
 		/* Recursively insert into the parent */
 		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
-					   new_item, stack->bts_offset + 1,
+					   new_item, stack->bts_offset + 1, 0,
 					   is_only);
 
 		/* be tidy */
@@ -2304,6 +2515,277 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * Note: if we didn't find any LP_DEAD items, then the page's
 	 * BTP_HAS_GARBAGE hint bit is falsely set.  We do not bother expending a
 	 * separate write to clear it, however.  We will clear it when we split
-	 * the page.
+	 * the page (or when deduplication runs).
+	 */
+}
+
+/*
+ * Try to deduplicate items to free some space.  If we don't proceed with
+ * deduplication, buffer will contain old state of the page.
+ *
+ * 'itemsz' is the size of the inserter caller's incoming/new tuple, not
+ * including line pointer overhead.  This is the amount of space we'll need to
+ * free in order to let caller avoid splitting the page.
+ *
+ * This function should be called after LP_DEAD items were removed by
+ * _bt_vacuum_one_page() to prevent a page split.  (It's possible that we'll
+ * have to kill additional LP_DEAD items, but that should be rare.)
+ */
+static void
+_bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel, Size itemsz)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buffer);
+	Page		newpage;
+	BTPageOpaque oopaque,
+				nopaque;
+	bool		deduplicate = false;
+	BTDedupState *dedupState = NULL;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+	OffsetNumber deletable[MaxOffsetNumber];
+	int			ndeletable = 0;
+
+	/*
+	 * Don't use deduplication for indexes with INCLUDEd columns and unique
+	 * indexes
+	 */
+	deduplicate = (IndexRelationGetNumberOfKeyAttributes(rel) ==
+				   IndexRelationGetNumberOfAttributes(rel) &&
+				   !rel->rd_index->indisunique);
+	if (!deduplicate)
+		return;
+
+	oopaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	/* init deduplication state needed to build posting tuples */
+	dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+	dedupState->ipd = NULL;
+	dedupState->ntuples = 0;
+	dedupState->itupprev = NULL;
+	dedupState->maxitemsize = BTMaxItemSize(page);
+	dedupState->maxpostingsize = 0;
+
+	minoff = P_FIRSTDATAKEY(oopaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/*
+	 * Delete dead tuples if any. We cannot simply skip them in the cycle
+	 * below, because it's necessary to generate special Xlog record
+	 * containing such tuples to compute latestRemovedXid on a standby server
+	 * later.
+	 *
+	 * This should not affect performance, since it only can happen in a rare
+	 * situation when BTP_HAS_GARBAGE flag was not set and _bt_vacuum_one_page
+	 * was not called, or _bt_vacuum_one_page didn't remove all dead items.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+
+		if (ItemIdIsDead(itemid))
+			deletable[ndeletable++] = offnum;
+	}
+
+	if (ndeletable > 0)
+	{
+		/*
+		 * Skip duplication in rare cases where there were LP_DEAD items
+		 * encountered here when that frees sufficient space for caller to
+		 * avoid a page split
+		 */
+		_bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+		if (PageGetFreeSpace(page) >= itemsz)
+		{
+			pfree(dedupState);
+			return;
+		}
+
+		/* Continue with deduplication */
+		minoff = P_FIRSTDATAKEY(oopaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+	}
+
+	/*
+	 * Scan over all items to see which ones can be deduplicated
+	 */
+	newpage = PageGetTempPageCopySpecial(page);
+	nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+	/* Make sure that new page won't have garbage flag set */
+	nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+	/* Copy High Key if any */
+	if (!P_RIGHTMOST(oopaque))
+	{
+		ItemId		hitemid = PageGetItemId(page, P_HIKEY);
+		Size		hitemsz = ItemIdGetLength(hitemid);
+		IndexTuple	hitem = (IndexTuple) PageGetItem(page, hitemid);
+
+		if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to add highkey during deduplication");
+	}
+
+	/*
+	 * Iterate over tuples on the page, try to deduplicate them into posting
+	 * lists and insert into new page.
 	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(!ItemIdIsDead(itemid));
+
+		if (dedupState->itupprev == NULL)
+		{
+			/* Just set up base/first item in first iteration */
+			Assert(offnum == minoff);
+			dedupState->itupprev = CopyIndexTuple(itup);
+			dedupState->itupprev_off = offnum;
+			continue;
+		}
+
+		if (deduplicate &&
+			_bt_keep_natts_fast(rel, dedupState->itupprev, itup) > natts)
+		{
+			int			itup_ntuples;
+			Size		projpostingsz;
+
+			/*
+			 * Tuples are equal.
+			 *
+			 * If posting list does not exceed tuple size limit then append
+			 * the tuple to the pending posting list.  Otherwise, insert it on
+			 * page and continue with this tuple as new pending posting list.
+			 */
+			itup_ntuples = BTreeTupleIsPosting(itup) ?
+				BTreeTupleGetNPosting(itup) : 1;
+
+			/*
+			 * Project size of new posting list that would result from merging
+			 * current tup with pending posting list (could just be prev item
+			 * that's "pending").
+			 *
+			 * This accounting looks odd, but it's correct because ...
+			 */
+			projpostingsz = MAXALIGN(IndexTupleSize(dedupState->itupprev) +
+									 (dedupState->ntuples + itup_ntuples + 1) *
+									 sizeof(ItemPointerData));
+
+			if (projpostingsz <= dedupState->maxitemsize)
+				_bt_stash_item_tid(dedupState, itup, offnum);
+			else
+				_bt_dedup_insert(newpage, dedupState);
+		}
+		else
+		{
+			/*
+			 * Tuples are not equal, or we're done deduplicating this page.
+			 *
+			 * Insert pending posting list on page.  This could just be a
+			 * regular tuple.
+			 */
+			_bt_dedup_insert(newpage, dedupState);
+		}
+
+		pfree(dedupState->itupprev);
+		dedupState->itupprev = CopyIndexTuple(itup);
+		dedupState->itupprev_off = offnum;
+
+		Assert(IndexTupleSize(dedupState->itupprev) <= dedupState->maxitemsize);
+	}
+
+	/* Handle the last item */
+	_bt_dedup_insert(newpage, dedupState);
+
+	/*
+	 * If no items suitable for deduplication were found, newpage must be
+	 * exactly the same as the original page, so just return from function.
+	 */
+	if (dedupState->n_intervals == 0)
+	{
+		pfree(dedupState);
+		return;
+	}
+
+	START_CRIT_SECTION();
+
+	PageRestoreTempPage(newpage, page);
+	MarkBufferDirty(buffer);
+
+	/* Log full page write */
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+		xl_btree_dedup xlrec_dedup;
+
+		xlrec_dedup.n_intervals =  dedupState->n_intervals;
+
+		XLogBeginInsert();
+		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+		XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+		/* only save non-empthy part of the array */
+		if (dedupState->n_intervals > 0)
+			XLogRegisterData((char *) dedupState->dedup_intervals,
+							 dedupState->n_intervals * sizeof(dedupInterval));
+
+		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP_PAGE);
+
+		PageSetLSN(page, recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	/* be tidy */
+	pfree(dedupState);
+}
+
+/*
+ * Add new posting tuple item to the page based on itupprev and saved list of
+ * heap TIDs.
+ */
+void
+_bt_dedup_insert(Page page, BTDedupState *dedupState)
+{
+	IndexTuple	to_insert;
+	OffsetNumber offnum = PageGetMaxOffsetNumber(page);
+
+	if (dedupState->ntuples == 0)
+	{
+		/*
+		 * Use original itupprev, which may or may not be a posting list
+		 * already from some earlier dedup attempt
+		 */
+		to_insert = dedupState->itupprev;
+	}
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(dedupState->itupprev,
+											 dedupState->ipd,
+											 dedupState->ntuples);
+		to_insert = postingtuple;
+		pfree(dedupState->ipd);
+	}
+
+	Assert(IndexTupleSize(dedupState->itupprev) <= dedupState->maxitemsize);
+	/* Add the new item into the page */
+	offnum = OffsetNumberNext(offnum);
+
+	if (PageAddItem(page, (Item) to_insert, IndexTupleSize(to_insert),
+					offnum, false, false) == InvalidOffsetNumber)
+		elog(ERROR, "deduplication failed to add tuple to page");
+
+	if (dedupState->ntuples > 0)
+		pfree(to_insert);
+	dedupState->ntuples = 0;
 }
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 268f869..5314bbe 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -42,6 +43,11 @@ static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
 								   BlockNumber *target, BlockNumber *rightsib);
 static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
 							   TransactionId latestRemovedXid);
+static TransactionId _bt_compute_xid_horizon_for_tuples(Relation rel,
+														Relation heapRel,
+														Buffer buf,
+														OffsetNumber *itemnos,
+														int nitems);
 
 /*
  *	_bt_initmetapage() -- Fill a page buffer with a correct metapage image
@@ -983,14 +989,52 @@ _bt_page_recyclable(Page page)
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
 					OffsetNumber *itemnos, int nitems,
+					OffsetNumber *remainingoffset,
+					IndexTuple *remaining, int nremaining,
 					BlockNumber lastBlockVacuumed)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	Size		itemsz;
+	Size		remaining_sz = 0;
+	char	   *remaining_buf = NULL;
+
+	/* XLOG stuff, buffer for remainings */
+	if (nremaining && RelationNeedsWAL(rel))
+	{
+		Size		offset = 0;
+
+		for (int i = 0; i < nremaining; i++)
+			remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+		remaining_buf = palloc0(remaining_sz);
+		for (int i = 0; i < nremaining; i++)
+		{
+			itemsz = IndexTupleSize(remaining[i]);
+			memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+			offset += MAXALIGN(itemsz);
+		}
+		Assert(offset == remaining_sz);
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
+	/* Handle posting tuples here */
+	for (int i = 0; i < nremaining; i++)
+	{
+		/* At first, delete the old tuple. */
+		PageIndexTupleDelete(page, remainingoffset[i]);
+
+		itemsz = IndexTupleSize(remaining[i]);
+		itemsz = MAXALIGN(itemsz);
+
+		/* Add tuple with remaining ItemPointers to the page. */
+		if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to rewrite posting list item in index while doing vacuum");
+	}
+
 	/* Fix the page */
 	if (nitems > 0)
 		PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1064,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+		xlrec_vacuum.nremaining = nremaining;
+		xlrec_vacuum.ndeleted = nitems;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1079,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		if (nitems > 0)
 			XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
 
+		/*
+		 * Here we should save offnums and remaining tuples themselves. It's
+		 * important to restore them in correct order. At first, we must
+		 * handle remaining tuples and only after that other deleted items.
+		 */
+		if (nremaining > 0)
+		{
+			Assert(remaining_buf != NULL);
+			XLogRegisterBufData(0, (char *) remainingoffset,
+								nremaining * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, remaining_buf, remaining_sz);
+		}
+
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
 		PageSetLSN(page, recptr);
@@ -1042,6 +1101,91 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 }
 
 /*
+ * Get the latestRemovedXid from the table entries pointed at by the index
+ * tuples being deleted.
+ *
+ * This is a version of index_compute_xid_horizon_for_tuples() specialized to
+ * nbtree, which can handle posting lists.
+ */
+static TransactionId
+_bt_compute_xid_horizon_for_tuples(Relation rel, Relation heapRel,
+								   Buffer buf, OffsetNumber *itemnos,
+								   int nitems)
+{
+	ItemPointerData *ttids;
+	TransactionId latestRemovedXid = InvalidTransactionId;
+	Page		page = BufferGetPage(buf);
+	int			arraynitems;
+	int			finalnitems;
+
+	/*
+	 * Initial size of array can fit everything when it turns out that are no
+	 * posting lists
+	 */
+	arraynitems = nitems;
+	ttids = (ItemPointerData *) palloc(sizeof(ItemPointerData) * arraynitems);
+
+	finalnitems = 0;
+	/* identify what the index tuples about to be deleted point to */
+	for (int i = 0; i < nitems; i++)
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+
+		itemid = PageGetItemId(page, itemnos[i]);
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(ItemIdIsDead(itemid));
+
+		if (!BTreeTupleIsPosting(itup))
+		{
+			/* Make sure that we have space for additional heap TID */
+			if (finalnitems + 1 > arraynitems)
+			{
+				arraynitems = arraynitems * 2;
+				ttids = (ItemPointerData *)
+					repalloc(ttids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			Assert(ItemPointerIsValid(&itup->t_tid));
+			ItemPointerCopy(&itup->t_tid, &ttids[finalnitems]);
+			finalnitems++;
+		}
+		else
+		{
+			int			nposting = BTreeTupleGetNPosting(itup);
+
+			/* Make sure that we have space for additional heap TIDs */
+			if (finalnitems + nposting > arraynitems)
+			{
+				arraynitems = Max(arraynitems * 2, finalnitems + nposting);
+				ttids = (ItemPointerData *)
+					repalloc(ttids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			for (int j = 0; j < nposting; j++)
+			{
+				ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+				Assert(ItemPointerIsValid(htid));
+				ItemPointerCopy(htid, &ttids[finalnitems]);
+				finalnitems++;
+			}
+		}
+	}
+
+	Assert(finalnitems >= nitems);
+
+	/* determine the actual xid horizon */
+	latestRemovedXid =
+		table_compute_xid_horizon_for_tuples(heapRel, ttids, finalnitems);
+
+	pfree(ttids);
+
+	return latestRemovedXid;
+}
+
+/*
  * Delete item(s) from a btree page during single-page cleanup.
  *
  * As above, must only be used on leaf pages.
@@ -1067,8 +1211,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 
 	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
 		latestRemovedXid =
-			index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
-												 itemnos, nitems);
+			_bt_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+											   itemnos, nitems);
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd528..6759531 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static ItemPointer btreevacuumPosting(BTVacState *vstate, IndexTuple itup,
+									  int *nremaining);
 
 
 /*
@@ -263,8 +265,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 				 */
 				if (so->killedItems == NULL)
 					so->killedItems = (int *)
-						palloc(MaxIndexTuplesPerPage * sizeof(int));
-				if (so->numKilled < MaxIndexTuplesPerPage)
+						palloc(MaxPostingIndexTuplesPerPage * sizeof(int));
+				if (so->numKilled < MaxPostingIndexTuplesPerPage)
 					so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 			}
 
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 								 RBM_NORMAL, info->strategy);
 		LockBufferForCleanup(buf);
 		_bt_checkpage(rel, buf);
-		_bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+		_bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+							vstate.lastBlockVacuumed);
 		_bt_relbuf(rel, buf);
 	}
 
@@ -1193,6 +1196,9 @@ restart:
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
+		IndexTuple	remaining[MaxOffsetNumber];
+		OffsetNumber remainingoffset[MaxOffsetNumber];
+		int			nremaining;
 
 		/*
 		 * Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1235,7 @@ restart:
 		 * callback function.
 		 */
 		ndeletable = 0;
+		nremaining = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
 		if (callback)
@@ -1242,31 +1249,79 @@ restart:
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
-				/*
-				 * During Hot Standby we currently assume that
-				 * XLOG_BTREE_VACUUM records do not produce conflicts. That is
-				 * only true as long as the callback function depends only
-				 * upon whether the index tuple refers to heap tuples removed
-				 * in the initial heap scan. When vacuum starts it derives a
-				 * value of OldestXmin. Backends taking later snapshots could
-				 * have a RecentGlobalXmin with a later xid than the vacuum's
-				 * OldestXmin, so it is possible that row versions deleted
-				 * after OldestXmin could be marked as killed by other
-				 * backends. The callback function *could* look at the index
-				 * tuple state in isolation and decide to delete the index
-				 * tuple, though currently it does not. If it ever did, we
-				 * would need to reconsider whether XLOG_BTREE_VACUUM records
-				 * should cause conflicts. If they did cause conflicts they
-				 * would be fairly harsh conflicts, since we haven't yet
-				 * worked out a way to pass a useful value for
-				 * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
-				 * applies to *any* type of index that marks index tuples as
-				 * killed.
-				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				if (BTreeTupleIsPosting(itup))
+				{
+					int			nnewipd = 0;
+					ItemPointer newipd = NULL;
+
+					newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+					if (nnewipd == 0)
+					{
+						/*
+						 * All TIDs from posting list must be deleted, we can
+						 * delete whole tuple in a regular way.
+						 */
+						deletable[ndeletable++] = offnum;
+					}
+					else if (nnewipd == BTreeTupleGetNPosting(itup))
+					{
+						/*
+						 * All TIDs from posting tuple must remain. Do
+						 * nothing, just cleanup.
+						 */
+						pfree(newipd);
+					}
+					else if (nnewipd < BTreeTupleGetNPosting(itup))
+					{
+						/* Some TIDs from posting tuple must remain. */
+						Assert(nnewipd > 0);
+						Assert(newipd != NULL);
+
+						/*
+						 * Form new tuple that contains only remaining TIDs.
+						 * Remember this tuple and the offset of the old tuple
+						 * to update it in place.
+						 */
+						remainingoffset[nremaining] = offnum;
+						remaining[nremaining] =
+							BTreeFormPostingTuple(itup, newipd, nnewipd);
+						nremaining++;
+						pfree(newipd);
+
+						Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+					}
+				}
+				else
+				{
+					htup = &(itup->t_tid);
+
+					/*
+					 * During Hot Standby we currently assume that
+					 * XLOG_BTREE_VACUUM records do not produce conflicts.
+					 * That is only true as long as the callback function
+					 * depends only upon whether the index tuple refers to
+					 * heap tuples removed in the initial heap scan. When
+					 * vacuum starts it derives a value of OldestXmin.
+					 * Backends taking later snapshots could have a
+					 * RecentGlobalXmin with a later xid than the vacuum's
+					 * OldestXmin, so it is possible that row versions deleted
+					 * after OldestXmin could be marked as killed by other
+					 * backends. The callback function *could* look at the
+					 * index tuple state in isolation and decide to delete the
+					 * index tuple, though currently it does not. If it ever
+					 * did, we would need to reconsider whether
+					 * XLOG_BTREE_VACUUM records should cause conflicts. If
+					 * they did cause conflicts they would be fairly harsh
+					 * conflicts, since we haven't yet worked out a way to
+					 * pass a useful value for latestRemovedXid on the
+					 * XLOG_BTREE_VACUUM records. This applies to *any* type
+					 * of index that marks index tuples as killed.
+					 */
+					if (callback(htup, callback_state))
+						deletable[ndeletable++] = offnum;
+				}
 			}
 		}
 
@@ -1274,7 +1329,7 @@ restart:
 		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
 		 * call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nremaining > 0)
 		{
 			/*
 			 * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1346,7 @@ restart:
 			 * that.
 			 */
 			_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+								remainingoffset, remaining, nremaining,
 								vstate->lastBlockVacuumed);
 
 			/*
@@ -1376,6 +1432,41 @@ restart:
 }
 
 /*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+	int			remaining = 0;
+	int			nitem = BTreeTupleGetNPosting(itup);
+	ItemPointer tmpitems = NULL,
+				items = BTreeTupleGetPosting(itup);
+
+	/*
+	 * Check each tuple in the posting list, save alive tuples into tmpitems
+	 */
+	for (int i = 0; i < nitem; i++)
+	{
+		if (vstate->callback(items + i, vstate->callback_state))
+			continue;
+
+		if (tmpitems == NULL)
+			tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+		tmpitems[remaining++] = items[i];
+	}
+
+	*nremaining = remaining;
+	return tmpitems;
+}
+
+/*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
  * btrees always do, so this is trivial.
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8e51246..c78c8e6 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int	_bt_binsrch_posting(BTScanInsert key, Page page,
+								OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static void _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+								  OffsetNumber offnum, ItemPointer iptr,
+								  IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+									   OffsetNumber offnum, ItemPointer iptr,
+									   IndexTuple itup);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -434,7 +442,10 @@ _bt_binsrch(Relation rel,
  * low) makes bounds invalid.
  *
  * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's in_posting_offset field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
  */
 OffsetNumber
 _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +464,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 	Assert(P_ISLEAF(opaque));
 	Assert(!key->nextkey);
+	Assert(insertstate->in_posting_offset == 0);
 
 	if (!insertstate->bounds_valid)
 	{
@@ -509,6 +521,17 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 			if (result != 0)
 				stricthigh = high;
 		}
+
+		/*
+		 * If tuple at offset located by binary search is a posting list whose
+		 * TID range overlaps with caller's scantid, perform posting list
+		 * binary search to set in_posting_offset for caller.  Caller must
+		 * split the posting list when in_posting_offset is set.  This should
+		 * happen infrequently.
+		 */
+		if (unlikely(result == 0 && key->scantid != NULL))
+			insertstate->in_posting_offset =
+				_bt_binsrch_posting(key, page, mid);
 	}
 
 	/*
@@ -529,6 +552,68 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 }
 
 /*----------
+ *	_bt_binsrch_posting() -- posting list binary search.
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+	IndexTuple	itup;
+	ItemId		itemid;
+	int			low,
+				high,
+				mid,
+				res;
+
+	/*
+	 * If this isn't a posting tuple, then the index must be corrupt (if it is
+	 * an ordinary non-pivot tuple then there must be an existing tuple with a
+	 * heap TID that equals inserter's new heap TID/scantid).  Defensively
+	 * check that tuple is a posting list tuple whose posting list range
+	 * includes caller's scantid.
+	 *
+	 * (This is also needed because contrib/amcheck's rootdescend option needs
+	 * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+	 */
+	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+	Assert(!key->nextkey);
+	Assert(key->scantid != NULL);
+	itemid = PageGetItemId(page, offnum);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+	if (!BTreeTupleIsPosting(itup))
+		return 0;
+
+	/*
+	 * In the unlikely event that posting list tuple has LP_DEAD bit set,
+	 * signal to caller that it should kill the item and restart its binary
+	 * search.
+	 */
+	if (ItemIdIsDead(itemid))
+		return -1;
+
+	/* "high" is past end of posting list for loop invariant */
+	low = 0;
+	high = BTreeTupleGetNPosting(itup);
+	Assert(high >= 2);
+
+	while (high > low)
+	{
+		mid = low + ((high - low) / 2);
+		res = ItemPointerCompare(key->scantid,
+								 BTreeTupleGetPostingN(itup, mid));
+
+		if (res >= 1)
+			low = mid + 1;
+		else
+			high = mid;
+	}
+
+	return low;
+}
+
+/*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
  *	page/offnum: location of btree item to be compared to.
@@ -537,9 +622,18 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
  *			<0 if scankey < tuple at offnum;
  *			 0 if scankey == tuple at offnum;
  *			>0 if scankey > tuple at offnum.
- *		NULLs in the keys are treated as sortable values.  Therefore
- *		"equality" does not necessarily mean that the item should be
- *		returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values.  Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key.  Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid.  There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
+ *
+ * It is generally guaranteed that any possible scankey with scantid set
+ * will have zero or one tuples in the index that are considered equal
+ * here.
  *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
@@ -563,6 +657,7 @@ _bt_compare(Relation rel,
 	ScanKey		scankey;
 	int			ncmpkey;
 	int			ntupatts;
+	int32		result;
 
 	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +692,6 @@ _bt_compare(Relation rel,
 	{
 		Datum		datum;
 		bool		isNull;
-		int32		result;
 
 		datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
 
@@ -713,8 +807,24 @@ _bt_compare(Relation rel,
 	if (heapTid == NULL)
 		return 1;
 
+	/*
+	 * scankey must be treated as equal to a posting list tuple if its scantid
+	 * value falls within the range of the posting list.  In all other cases
+	 * there can only be a single heap TID value, which is compared directly
+	 * as a simple scalar value.
+	 */
 	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	return ItemPointerCompare(key->scantid, heapTid);
+	result = ItemPointerCompare(key->scantid, heapTid);
+	if (!BTreeTupleIsPosting(itup) || result <= 0)
+		return result;
+	else
+	{
+		result = ItemPointerCompare(key->scantid, BTreeTupleGetMaxTID(itup));
+		if (result > 0)
+			return 1;
+	}
+
+	return 0;
 }
 
 /*
@@ -1451,6 +1561,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	/* initialize tuple workspace to empty */
 	so->currPos.nextTupleOffset = 0;
+	so->currPos.postingTupleOffset = 0;
 
 	/*
 	 * Now that the current page has been made consistent, the macro should be
@@ -1485,8 +1596,30 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				if (!BTreeTupleIsPosting(itup))
+				{
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+				else
+				{
+					/*
+					 * Setup state to return posting list, and save first
+					 * "logical" tuple
+					 */
+					_bt_setuppostingitems(so, itemIndex, offnum,
+										  BTreeTupleGetPostingN(itup, 0),
+										  itup);
+					itemIndex++;
+					/* Save additional posting list "logical" tuples */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											itup);
+						itemIndex++;
+					}
+				}
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1519,7 +1652,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1527,7 +1660,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxPostingIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1569,8 +1702,37 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (passes_quals && tuple_alive)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				if (!BTreeTupleIsPosting(itup))
+				{
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+				else
+				{
+					int			i = BTreeTupleGetNPosting(itup) - 1;
+
+					/*
+					 * Setup state to return posting list, and save last
+					 * "logical" tuple from posting list (since it's the first
+					 * that will be returned to scan).
+					 */
+					itemIndex--;
+					_bt_setuppostingitems(so, itemIndex, offnum,
+										  BTreeTupleGetPostingN(itup, i--),
+										  itup);
+
+					/*
+					 * Return posting list "logical" tuples -- do this in
+					 * descending order, to match overall scan order
+					 */
+					for (; i >= 0; i--)
+					{
+						itemIndex--;
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											itup);
+					}
+				}
 			}
 			if (!continuescan)
 			{
@@ -1584,8 +1746,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1760,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1611,6 +1775,61 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 }
 
 /*
+ * Setup state to save posting items from a single posting list tuple.  Saves
+ * the logical tuple that will be returned to scan first in passing.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for logical tuple
+ * that is returned to scan first.  Second or subsequent heap TID for posting
+ * list should be saved by calling _bt_savepostingitem().
+ */
+static void
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					  ItemPointer iptr, IndexTuple itup)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *iptr;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		/* Save a truncated version of the IndexTuple */
+		Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+		itupsz = MAXALIGN(itupsz);
+		currItem->tupleOffset = so->currPos.nextTupleOffset;
+		memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+		so->currPos.nextTupleOffset += itupsz;
+		so->currPos.postingTupleOffset = currItem->tupleOffset;
+	}
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer iptr, IndexTuple itup)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *iptr;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		/*
+		 * Have index-only scans return the same truncated IndexTuple for
+		 * every logical tuple that originates from the same posting list
+		 */
+		currItem->tupleOffset = so->currPos.postingTupleOffset;
+	}
+}
+
+/*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
  * On entry, if so->currPos.buf is valid the buffer is pinned but not locked;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index ab19692..4198770 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -288,6 +288,8 @@ static void _bt_sortaddtup(Page page, Size itemsize,
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 						 IndexTuple itup);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void _bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+								 BTDedupState *dedupState);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
 static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -830,6 +832,8 @@ _bt_sortaddtup(Page page,
  * the high key is to be truncated, offset 1 is deleted, and we insert
  * the truncated high key at offset 1.
  *
+ * Note that itup may be a posting list tuple.
+ *
  * 'last' pointer indicates the last offset added to the page.
  *----------
  */
@@ -963,6 +967,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * Overwrite the old item with new truncated high key directly.
 			 * oitup is already located at the physical beginning of tuple
 			 * space, so this should directly reuse the existing tuple space.
+			 *
+			 * If lastleft tuple was a posting tuple, we'll truncate its
+			 * posting list in _bt_truncate as well. Note that it is also
+			 * applicable only to leaf pages, since internal pages never
+			 * contain posting tuples.
 			 */
 			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
 			lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1002,6 +1011,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		 * the minimum key for the new page.
 		 */
 		state->btps_minkey = CopyIndexTuple(oitup);
+		Assert(BTreeTupleIsPivot(state->btps_minkey));
 
 		/*
 		 * Set the sibling links for both pages.
@@ -1043,6 +1053,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(state->btps_minkey == NULL);
 		state->btps_minkey = CopyIndexTuple(itup);
 		/* _bt_sortaddtup() will perform full truncation later */
+		BTreeTupleClearBtIsPosting(state->btps_minkey);
 		BTreeTupleSetNAtts(state->btps_minkey, 0);
 	}
 
@@ -1128,6 +1139,136 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 }
 
 /*
+ * Add new tuple (posting or non-posting) to the page while building index.
+ */
+static void
+_bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+					 BTDedupState *dedupState)
+{
+	IndexTuple	to_insert;
+
+	/* Return, if there is no tuple to insert */
+	if (state == NULL)
+		return;
+
+	if (dedupState->ntuples == 0)
+		to_insert = dedupState->itupprev;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(dedupState->itupprev,
+											 dedupState->ipd,
+											 dedupState->ntuples);
+		to_insert = postingtuple;
+		pfree(dedupState->ipd);
+	}
+
+	_bt_buildadd(wstate, state, to_insert);
+
+	if (dedupState->ntuples > 0)
+		pfree(to_insert);
+	dedupState->ntuples = 0;
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in dedupState.
+ *
+ * 'itup' is current tuple on page, which comes immediately after equal
+ * 'itupprev' tuple stashed in dedup state at the point we're called.
+ *
+ * Helper function for _bt_load() and _bt_dedup_one_page(), called when it
+ * becomes clear that pending itupprev item will be part of a new/pending
+ * posting list, or when a pending/new posting list will contain a new heap
+ * TID from itup.
+ *
+ * Note: caller is responsible for the BTMaxItemSize() check.
+ */
+void
+_bt_stash_item_tid(BTDedupState *dedupState, IndexTuple itup,
+				   OffsetNumber itup_offnum)
+{
+	int			nposting = 0;
+
+	if (dedupState->ntuples == 0)
+	{
+		dedupState->ipd = palloc0(dedupState->maxitemsize);
+
+		/*
+		 * itupprev hasn't had its posting list TIDs copied into ipd yet (must
+		 * have been first on page and/or in new posting list?).  Do so now.
+		 *
+		 * This is delayed because it wasn't initially clear whether or not
+		 * itupprev would be merged with the next tuple, or stay as-is.  By
+		 * now caller compared it against itup and found that it was equal, so
+		 * we can go ahead and add its TIDs.
+		 */
+		if (!BTreeTupleIsPosting(dedupState->itupprev))
+		{
+			memcpy(dedupState->ipd, dedupState->itupprev,
+				   sizeof(ItemPointerData));
+			dedupState->ntuples++;
+		}
+		else
+		{
+			/* if itupprev is posting, add all its TIDs to the posting list */
+			nposting = BTreeTupleGetNPosting(dedupState->itupprev);
+			memcpy(dedupState->ipd,
+				   BTreeTupleGetPosting(dedupState->itupprev),
+				   sizeof(ItemPointerData) * nposting);
+			dedupState->ntuples += nposting;
+		}
+
+		/* Save info about deduplicated items for future xlog record */
+		dedupState->n_intervals++;
+		/* Save offnum of the first item belongin to the group */
+		dedupState->dedup_intervals[dedupState->n_intervals - 1].from = dedupState->itupprev_off;
+		/*
+		 * Update the number of deduplicated items, belonging to this group.
+		 * Count each item just once, no matter if it was posting tuple or not
+		 */
+		dedupState->dedup_intervals[dedupState->n_intervals - 1].ntups++;
+	}
+
+	/*
+	 * Add current tup to ipd for pending posting list for new version of
+	 * page.
+	 */
+	if (!BTreeTupleIsPosting(itup))
+	{
+		memcpy(dedupState->ipd + dedupState->ntuples, itup,
+			   sizeof(ItemPointerData));
+		dedupState->ntuples++;
+	}
+	else
+	{
+		/*
+		 * if tuple is posting, add all its TIDs to the pending list that will
+		 * become new posting list later on
+		 */
+		nposting = BTreeTupleGetNPosting(itup);
+		memcpy(dedupState->ipd + dedupState->ntuples,
+			   BTreeTupleGetPosting(itup),
+			   sizeof(ItemPointerData) * nposting);
+		dedupState->ntuples += nposting;
+	}
+
+	/*
+	 * Update the number of deduplicated items, belonging to this group.
+	 * Count each item just once, no matter if it was posting tuple or not
+	 */
+	dedupState->dedup_intervals[dedupState->n_intervals - 1].ntups++;
+
+	/* TODO just a debug message. delete it in final version of the patch */
+	if (itup_offnum != InvalidOffsetNumber)
+		elog(DEBUG4, "_bt_stash_item_tid. N %d : from %u ntups %u",
+				dedupState->n_intervals,
+				dedupState->dedup_intervals[dedupState->n_intervals - 1].from,
+				dedupState->dedup_intervals[dedupState->n_intervals - 1].ntups);
+}
+
+/*
  * Read tuples in correct sort order from tuplesort, and load them into
  * btree leaves.
  */
@@ -1141,9 +1282,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	bool		load1;
 	TupleDesc	tupdes = RelationGetDescr(wstate->index);
 	int			i,
-				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+				natts = IndexRelationGetNumberOfAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		deduplicate = false;
+	BTDedupState *dedupState = NULL;
+
+	/*
+	 * Don't use deduplication for indexes with INCLUDEd columns and unique
+	 * indexes
+	 */
+	deduplicate = (IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+				   IndexRelationGetNumberOfAttributes(wstate->index) &&
+				   !wstate->index->rd_index->indisunique);
 
 	if (merge)
 	{
@@ -1257,19 +1409,88 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	}
 	else
 	{
-		/* merge is unnecessary */
-		while ((itup = tuplesort_getindextuple(btspool->sortstate,
-											   true)) != NULL)
+		if (!deduplicate)
 		{
-			/* When we see first tuple, create first index page */
-			if (state == NULL)
-				state = _bt_pagestate(wstate, 0);
+			/* merge is unnecessary */
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+					state = _bt_pagestate(wstate, 0);
 
-			_bt_buildadd(wstate, state, itup);
+				_bt_buildadd(wstate, state, itup);
 
-			/* Report progress */
-			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
-										 ++tuples_done);
+				/* Report progress */
+				pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+											 ++tuples_done);
+			}
+		}
+		else
+		{
+			/* init deduplication state needed to build posting tuples */
+			dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+			dedupState->ipd = NULL;
+			dedupState->ntuples = 0;
+			dedupState->itupprev = NULL;
+			dedupState->maxitemsize = 0;
+			dedupState->maxpostingsize = 0;
+
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+				{
+					state = _bt_pagestate(wstate, 0);
+					dedupState->maxitemsize = BTMaxItemSize(state->btps_page);
+				}
+
+				if (dedupState->itupprev != NULL)
+				{
+					int			n_equal_atts = _bt_keep_natts_fast(wstate->index,
+																   dedupState->itupprev, itup);
+
+					if (n_equal_atts > natts)
+					{
+						/*
+						 * Tuples are equal. Create or update posting.
+						 *
+						 * Else If posting is too big, insert it on page and
+						 * continue.
+						 */
+						if ((dedupState->ntuples + 1) * sizeof(ItemPointerData) <
+							dedupState->maxpostingsize)
+							_bt_stash_item_tid(dedupState, itup, InvalidOffsetNumber);
+						else
+							_bt_buildadd_posting(wstate, state, dedupState);
+					}
+					else
+					{
+						/*
+						 * Tuples are not equal. Insert itupprev into index.
+						 * Save current tuple for the next iteration.
+						 */
+						_bt_buildadd_posting(wstate, state, dedupState);
+					}
+				}
+
+				/*
+				 * Save the tuple to compare it with the next one and maybe
+				 * unite them into a posting tuple.
+				 */
+				if (dedupState->itupprev)
+					pfree(dedupState->itupprev);
+				dedupState->itupprev = CopyIndexTuple(itup);
+
+				/* compute max size of posting list */
+				dedupState->maxpostingsize = dedupState->maxitemsize -
+					IndexInfoFindDataOffset(dedupState->itupprev->t_info) -
+					MAXALIGN(IndexTupleSize(dedupState->itupprev));
+			}
+
+			/* Handle the last item */
+			_bt_buildadd_posting(wstate, state, dedupState);
 		}
 	}
 
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1c1029b..54cecc8 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
 	state.minfirstrightsz = SIZE_MAX;
 	state.newitemoff = newitemoff;
 
+	/* newitem cannot be a posting list item */
+	Assert(!BTreeTupleIsPosting(newitem));
+
 	/*
 	 * maxsplits should never exceed maxoff because there will be at most as
 	 * many candidate split points as there are points _between_ tuples, once
@@ -459,17 +462,52 @@ _bt_recsplitloc(FindSplitData *state,
 	int16		leftfree,
 				rightfree;
 	Size		firstrightitemsz;
+	Size		postingsubhikey = 0;
 	bool		newitemisfirstonright;
 
 	/* Is the new item going to be the first item on the right page? */
 	newitemisfirstonright = (firstoldonright == state->newitemoff
 							 && !newitemonleft);
 
+	/*
+	 * FIXME: Accessing every single tuple like this adds cycles to cases that
+	 * cannot possibly benefit (i.e. cases where we know that there cannot be
+	 * posting lists).  Maybe we should add a way to not bother when we are
+	 * certain that this is the case.
+	 *
+	 * We could either have _bt_split() pass us a flag, or invent a page flag
+	 * that indicates that the page might have posting lists, as an
+	 * optimization.  There is no shortage of btpo_flags bits for stuff like
+	 * this.
+	 */
 	if (newitemisfirstonright)
+	{
 		firstrightitemsz = state->newitemsz;
+
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+			postingsubhikey = IndexTupleSize(state->newitem) -
+				BTreeTupleGetPostingOffset(state->newitem);
+	}
 	else
+	{
 		firstrightitemsz = firstoldonrightsz;
 
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf)
+		{
+			ItemId		itemid;
+			IndexTuple	newhighkey;
+
+			itemid = PageGetItemId(state->page, firstoldonright);
+			newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+			if (BTreeTupleIsPosting(newhighkey))
+				postingsubhikey = IndexTupleSize(newhighkey) -
+					BTreeTupleGetPostingOffset(newhighkey);
+		}
+	}
+
 	/* Account for all the old tuples */
 	leftfree = state->leftspace - olddataitemstoleft;
 	rightfree = state->rightspace -
@@ -492,9 +530,13 @@ _bt_recsplitloc(FindSplitData *state,
 	 * adding a heap TID to the left half's new high key when splitting at the
 	 * leaf level.  In practice the new high key will often be smaller and
 	 * will rarely be larger, but conservatively assume the worst case.
+	 * Truncation always truncates away any posting list that appears in the
+	 * first right tuple, though, so it's safe to subtract that overhead
+	 * (while still conservatively assuming that truncation might have to add
+	 * back a single heap TID using the pivot tuple heap TID representation).
 	 */
 	if (state->is_leaf)
-		leftfree -= (int16) (firstrightitemsz +
+		leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
 							 MAXALIGN(sizeof(ItemPointerData)));
 	else
 		leftfree -= (int16) firstrightitemsz;
@@ -691,7 +733,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
 	tup = (IndexTuple) PageGetItem(state->page, itemid);
 	/* Do cheaper test first */
-	if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+	if (BTreeTupleIsPosting(tup) ||
+		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
 	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 4c7b2d0..e3d7f4f 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -97,8 +97,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	indoption = rel->rd_indoption;
 	tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
 
-	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
 	/*
 	 * We'll execute search using scan key constructed on key columns.
 	 * Truncated attributes and non-key attributes are omitted from the final
@@ -110,9 +108,20 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	key->anynullkeys = false;	/* initial assumption */
 	key->nextkey = false;
 	key->pivotsearch = false;
+	key->scantid = NULL;
 	key->keysz = Min(indnkeyatts, tupnatts);
-	key->scantid = key->heapkeyspace && itup ?
-		BTreeTupleGetHeapTID(itup) : NULL;
+
+	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+	Assert(!itup || !BTreeTupleIsPosting(itup) || key->heapkeyspace);
+
+	/*
+	 * When caller passes a tuple with a heap TID, use it to set scantid. Note
+	 * that this handles posting list tuples by setting scantid to the lowest
+	 * heap TID in the posting list.
+	 */
+	if (itup && key->heapkeyspace)
+		key->scantid = BTreeTupleGetHeapTID(itup);
+
 	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
@@ -1786,10 +1795,35 @@ _bt_killitems(IndexScanDesc scan)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
+			bool		killtuple = false;
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			if (BTreeTupleIsPosting(ituple))
 			{
-				/* found the item */
+				int			pi = i + 1;
+				int			nposting = BTreeTupleGetNPosting(ituple);
+				int			j;
+
+				for (j = 0; j < nposting; j++)
+				{
+					ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+					if (!ItemPointerEquals(item, &kitem->heapTid))
+						break;	/* out of posting list loop */
+
+					/* Read-ahead to later kitems */
+					if (pi < numKilled)
+						kitem = &so->currPos.items[so->killedItems[pi++]];
+				}
+
+				if (j == nposting)
+					killtuple = true;
+			}
+			else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+				killtuple = true;
+
+			if (killtuple)
+			{
+				/* found the item/all posting list items */
 				ItemIdMarkDead(iid);
 				killedsomething = true;
 				break;			/* out of inner search loop */
@@ -2145,6 +2179,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(firstright))
+		{
+			BTreeTupleClearBtIsPosting(pivot);
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			if (keepnatts == natts)
+			{
+				/*
+				 * index_truncate_tuple() just returned a copy of the
+				 * original, so make sure that the size of the new pivot tuple
+				 * doesn't have posting list overhead
+				 */
+				pivot->t_info &= ~INDEX_SIZE_MASK;
+				pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+			}
+		}
+
+		Assert(!BTreeTupleIsPosting(pivot));
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2161,6 +2213,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * attribute to the new pivot tuple.
 		 */
 		Assert(natts != nkeyatts);
+		Assert(!BTreeTupleIsPosting(lastleft) &&
+			   !BTreeTupleIsPosting(firstright));
 		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
 		tidpivot = palloc0(newsize);
 		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2168,6 +2222,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		pfree(pivot);
 		pivot = tidpivot;
 	}
+	else if (BTreeTupleIsPosting(firstright))
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal.  We
+		 * can always truncate away a posting list, though.
+		 *
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+			MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+		pivot->t_info &= ~INDEX_SIZE_MASK;
+		pivot->t_info |= newsize;
+		BTreeTupleClearBtIsPosting(pivot);
+		BTreeTupleSetAltHeapTID(pivot);
+	}
 	else
 	{
 		/*
@@ -2175,7 +2247,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * It's necessary to add a heap TID attribute to the new pivot tuple.
 		 */
 		Assert(natts == nkeyatts);
-		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+		newsize = MAXALIGN(IndexTupleSize(firstright)) +
+			MAXALIGN(sizeof(ItemPointerData));
 		pivot = palloc0(newsize);
 		memcpy(pivot, firstright, IndexTupleSize(firstright));
 	}
@@ -2193,6 +2266,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * nbtree (e.g., there is no pg_attribute entry).
 	 */
 	Assert(itup_key->heapkeyspace);
+	Assert(!BTreeTupleIsPosting(pivot));
 	pivot->t_info &= ~INDEX_SIZE_MASK;
 	pivot->t_info |= newsize;
 
@@ -2205,7 +2279,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2216,9 +2290,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+							  BTreeTupleGetHeapTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #else
 
 	/*
@@ -2231,7 +2308,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2240,7 +2317,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2321,15 +2399,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
  * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal (once detoasted).  Similarly, result may
- * differ from the _bt_keep_natts result when either tuple has TOASTed datums,
- * though this is barely possible in practice.
+ * unless they're bitwise equal after detoasting.
  *
  * These issues must be acceptable to callers, typically because they're only
  * concerned about making suffix truncation as effective as possible without
  * leaving excessive amounts of free space on either side of page split.
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts.
+ *
+ * When an index only uses opclasses where equality is "precise", this
+ * function is guaranteed to give the same result as _bt_keep_natts().  This
+ * makes it safe to use this function to determine whether or not two tuples
+ * can be folded together into a single posting tuple.  Posting list
+ * deduplication cannot be used with nondeterministic collations for this
+ * reason.
+ *
+ * FIXME: Actually invent the needed "equality-is-precise" opclass
+ * infrastructure.  See dedicated -hackers thread:
+ *
+ * https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2354,8 +2442,38 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 		if (isNull1 != isNull2)
 			break;
 
+		/*
+		 * XXX: The ideal outcome from the point of view of the posting list
+		 * patch is that the definition of an opclass with "precise equality"
+		 * becomes: "equality operator function must give exactly the same
+		 * answer as datum_image_eq() would, provided that we aren't using a
+		 * nondeterministic collation". (Nondeterministic collations are
+		 * clearly not compatible with deduplication.)
+		 *
+		 * This will be a lot faster than actually using the authoritative
+		 * insertion scankey in some cases.  This approach also seems more
+		 * elegant, since suffix truncation gets to follow exactly the same
+		 * definition of "equal" as posting list deduplication -- there is a
+		 * subtle interplay between deduplication and suffix truncation, and
+		 * it would be nice to know for sure that they have exactly the same
+		 * idea about what equality is.
+		 *
+		 * This ideal outcome still avoids problems with TOAST.  We cannot
+		 * repeat bugs like the amcheck bug that was fixed in bugfix commit
+		 * eba775345d23d2c999bbb412ae658b6dab36e3e8.  datum_image_eq()
+		 * considers binary equality, though only _after_ each datum is
+		 * decompressed.
+		 *
+		 * If this ideal solution isn't possible, then we can fall back on
+		 * defining "precise equality" as: "type's output function must
+		 * produce identical textual output for any two datums that compare
+		 * equal when using a safe/equality-is-precise operator class (unless
+		 * using a nondeterministic collation)".  That would mean that we'd
+		 * have to make deduplication call _bt_keep_natts() instead (or some
+		 * other function that uses authoritative insertion scankey).
+		 */
 		if (!isNull1 &&
-			!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+			!datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
 			break;
 
 		keepnatts++;
@@ -2407,22 +2525,30 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
+	/* !heapkeyspace indexes do not support deduplication */
+	if (!heapkeyspace && BTreeTupleIsPosting(itup))
+		return false;
+
+	/* INCLUDE indexes do not support deduplication */
+	if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+		return false;
+
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
 			/*
-			 * Non-pivot tuples currently never use alternative heap TID
-			 * representation -- even those within heapkeyspace indexes
+			 * Non-pivot tuple should never be explicitly marked as a pivot
+			 * tuple
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated.  (Note that tupnatts must have been
-			 * inferred, rather than coming from an explicit on-disk
-			 * representation.)
+			 * inferred, even with a posting list tuple, because only pivot
+			 * tuples store tupnatts directly.)
 			 */
 			return tupnatts == natts;
 		}
@@ -2466,12 +2592,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * non-zero, or when there is no explicit representation and the
 			 * tuple is evidently not a pre-pg_upgrade tuple.
 			 *
-			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
-			 * that to decide if the tuple is a pre-v11 tuple.
+			 * Prior to v11, downlinks always had P_HIKEY as their offset.
+			 * Accept that as an alternative indication of a valid
+			 * !heapkeyspace negative infinity tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
-				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+				ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
 		}
 		else
 		{
@@ -2497,7 +2623,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
+		return false;
+
+	/* Pivot tuple should not use posting list representation (redundant) */
+	if (BTreeTupleIsPosting(itup))
 		return false;
 
 	/*
@@ -2567,11 +2697,87 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					BTMaxItemSizeNoHeapTid(page),
 					RelationGetRelationName(rel)),
 			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
-					   ItemPointerGetBlockNumber(&newtup->t_tid),
-					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+					   ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
 					   RelationGetRelationName(heap)),
 			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
 					 "Consider a function index of an MD5 hash of the value, "
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Given a basic tuple that contains key datum and posting list,
+ * build a posting tuple.
+ *
+ * Basic tuple can be a posting tuple, but we only use key part of it,
+ * all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1 fallback to building a non-posting tuple.
+ * It is necessary to avoid storage overhead after posting tuple was vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+	uint32		keysize,
+				newsize = 0;
+	IndexTuple	itup;
+
+	/* We only need key part of the tuple */
+	if (BTreeTupleIsPosting(tuple))
+		keysize = BTreeTupleGetPostingOffset(tuple);
+	else
+		keysize = IndexTupleSize(tuple);
+
+	Assert(nipd > 0);
+
+	/* Add space needed for posting list */
+	if (nipd > 1)
+		newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+	else
+		newsize = keysize;
+
+	newsize = MAXALIGN(newsize);
+	itup = palloc0(newsize);
+	memcpy(itup, tuple, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+
+	if (nipd > 1)
+	{
+		/* Form posting tuple, fill posting fields */
+
+		/* Set meta info about the posting list */
+		itup->t_info |= INDEX_ALT_TID_MASK;
+		BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+		/* sort the list to preserve TID order invariant */
+		qsort((void *) ipd, nipd, sizeof(ItemPointerData),
+			  (int (*) (const void *, const void *)) ItemPointerCompare);
+
+		/* Copy posting list into the posting tuple */
+		memcpy(BTreeTupleGetPosting(itup), ipd,
+			   sizeof(ItemPointerData) * nipd);
+	}
+	else
+	{
+		/* To finish building of a non-posting tuple, copy TID from ipd */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		ItemPointerCopy(ipd, &itup->t_tid);
+	}
+
+	return itup;
+}
+
+/*
+ * Opposite of BTreeFormPostingTuple.
+ * returns regular tuple that contains the key,
+ * the tid of the new tuple is the nth tid of original tuple's posting list
+ * result tuple palloc'd in a caller's context.
+ */
+IndexTuple
+BTreeGetNthTupleOfPosting(IndexTuple tuple, int n)
+{
+	Assert(BTreeTupleIsPosting(tuple));
+	return BTreeFormPostingTuple(tuple, BTreeTupleGetPostingN(tuple, n), 1);
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c..de9bc3b 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -181,9 +181,35 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 
 		page = BufferGetPage(buffer);
 
-		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
-						false, false) == InvalidOffsetNumber)
-			elog(PANIC, "btree_xlog_insert: failed to add item");
+		if (xlrec->in_posting_offset != InvalidOffsetNumber)
+		{
+			/* oposting must be at offset before new item */
+			ItemId		itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+			IndexTuple oposting = (IndexTuple) PageGetItem(page, itemid);
+			IndexTuple newitem = (IndexTuple) datapos;
+			IndexTuple nposting;
+
+			nposting = _bt_form_newposting(newitem, oposting,
+										   xlrec->in_posting_offset);
+			Assert(isleaf);
+
+			Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+				   MAXALIGN(IndexTupleSize(nposting)));
+
+			/* replace existing posting */
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+			/* insert new item */
+			if (PageAddItem(page, (Item) newitem, MAXALIGN(IndexTupleSize(newitem)),
+							xlrec->offnum, false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+		}
+		else
+		{
+			if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+		}
 
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
@@ -265,20 +291,43 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 		BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
 		OffsetNumber off;
 		IndexTuple	newitem = NULL,
-					left_hikey = NULL;
+					left_hikey = NULL,
+					nposting = NULL;
 		Size		newitemsz = 0,
 					left_hikeysz = 0;
 		Page		newlpage;
-		OffsetNumber leftoff;
+		OffsetNumber leftoff,
+					 replacepostingoff = InvalidOffsetNumber;
 
 		datapos = XLogRecGetBlockData(record, 0, &datalen);
 
-		if (onleft)
+		if (onleft || xlrec->in_posting_offset)
 		{
 			newitem = (IndexTuple) datapos;
 			newitemsz = MAXALIGN(IndexTupleSize(newitem));
 			datapos += newitemsz;
 			datalen -= newitemsz;
+
+			/*
+			 * Repeat logic implemented in _bt_insertonpg():
+			 *
+			 * If the new tuple is a duplicate with a heap TID that falls
+			 * inside the range of an existing posting list tuple,
+			 * generate new posting tuple to replace original one
+			 * and update new tuple so that it's heap TID contains
+			 * the rightmost heap TID of original posting tuple.
+			 */
+			if (xlrec->in_posting_offset)
+			{
+				ItemId		itemid = PageGetItemId(lpage, xlrec->newitemoff);
+				IndexTuple oposting = (IndexTuple) PageGetItem(lpage, itemid);
+
+				nposting = _bt_form_newposting(newitem, oposting,
+											   xlrec->in_posting_offset);
+				/* Alter new item offset, since effective new item changed */
+				xlrec->newitemoff = OffsetNumberNext(xlrec->newitemoff);
+				replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+			}
 		}
 
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
@@ -304,6 +353,15 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 			Size		itemsz;
 			IndexTuple	item;
 
+			if (off == replacepostingoff)
+			{
+				if (PageAddItem(newlpage, (Item) nposting, MAXALIGN(IndexTupleSize(nposting)),
+								leftoff, false, false) == InvalidOffsetNumber)
+					elog(ERROR, "failed to add new item to left page after split");
+				leftoff = OffsetNumberNext(leftoff);
+				continue;
+			}
+
 			/* add the new item if it was inserted on left page */
 			if (onleft && off == xlrec->newitemoff)
 			{
@@ -380,14 +438,146 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 }
 
 static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		buf;
+	Page		newpage;
+	xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+
+	if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Initialize a temporary empty page and copy all the items
+		 * to that in item number order.
+		 */
+		Page		page = (Page) BufferGetPage(buf);
+		BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		BTPageOpaque nopaque;
+		OffsetNumber offnum, minoff, maxoff;
+		BTDedupState *dedupState = NULL;
+		char *data = ((char *) xlrec + SizeOfBtreeDedup);
+		dedupInterval dedup_intervals[MaxOffsetNumber];
+		int			 nth_interval = 0;
+		OffsetNumber n_dedup_tups = 0;
+
+		dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+		dedupState->ipd = NULL;
+		dedupState->ntuples = 0;
+		dedupState->itupprev = NULL;
+		dedupState->maxitemsize = BTMaxItemSize(page);
+		dedupState->maxpostingsize = 0;
+
+		memcpy(dedup_intervals, data,
+			   xlrec->n_intervals*sizeof(dedupInterval));
+
+		/* Scan over all items to see which ones can be deduplicated */
+		minoff = P_FIRSTDATAKEY(opaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+		newpage = PageGetTempPageCopySpecial(page);
+		nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+		/* Make sure that new page won't have garbage flag set */
+		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+		/* Copy High Key if any */
+		if (!P_RIGHTMOST(opaque))
+		{
+			ItemId		itemid = PageGetItemId(page, P_HIKEY);
+			Size		itemsz = ItemIdGetLength(itemid);
+			IndexTuple	item = (IndexTuple) PageGetItem(page, itemid);
+
+			if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+							false, false) == InvalidOffsetNumber)
+				elog(ERROR, "failed to add highkey during deduplication");
+		}
+
+		/*
+		* Iterate over tuples on the page to deduplicate them into posting
+		* lists and insert into new page
+		*/
+		for (offnum = minoff;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemId = PageGetItemId(page, offnum);
+			IndexTuple	itup = (IndexTuple) PageGetItem(page, itemId);
+
+			elog(DEBUG4, "btree_xlog_dedup. offnum %u, n_intervals %u, from %u ntups %u",
+						offnum,
+						nth_interval,
+						dedup_intervals[nth_interval].from,
+						dedup_intervals[nth_interval].ntups);
+
+			if (dedupState->itupprev == NULL)
+			{
+				/* Just set up base/first item in first iteration */
+				Assert(offnum == minoff);
+				dedupState->itupprev = CopyIndexTuple(itup);
+				dedupState->itupprev_off = offnum;
+				continue;
+			}
+
+			/*
+			 * Instead of comparing tuple's keys, which may be costly, use
+			 * information from xlog record. If current tuple belongs to the
+			 * group of deduplicated items, repeat logic of _bt_dedup_one_page
+			 * and stash it to form a posting list afterwards.
+			 */
+			if (dedupState->itupprev_off >= dedup_intervals[nth_interval].from
+				&& n_dedup_tups < dedup_intervals[nth_interval].ntups)
+			{
+				_bt_stash_item_tid(dedupState, itup, InvalidOffsetNumber);
+
+				elog(DEBUG4, "btree_xlog_dedup. stash offnum %u, nth_interval %u, from %u ntups %u",
+						offnum,
+						nth_interval,
+						dedup_intervals[nth_interval].from,
+						dedup_intervals[nth_interval].ntups);
+
+				/* count first tuple in the group */
+				if (dedupState->itupprev_off == dedup_intervals[nth_interval].from)
+					n_dedup_tups++;
+
+				/* count added tuple */
+				n_dedup_tups++;
+			}
+			else
+			{
+				_bt_dedup_insert(newpage, dedupState);
+
+				/* reset state */
+				if (n_dedup_tups > 0)
+					nth_interval++;
+				n_dedup_tups = 0;
+			}
+
+			pfree(dedupState->itupprev);
+			dedupState->itupprev = CopyIndexTuple(itup);
+			dedupState->itupprev_off = offnum;
+		}
+
+		/* Handle the last item */
+		_bt_dedup_insert(newpage, dedupState);
+
+		PageRestoreTempPage(newpage, page);
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buf);
+	}
+
+	if (BufferIsValid(buf))
+		UnlockReleaseBuffer(buf);
+}
+
+static void
 btree_xlog_vacuum(XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	Buffer		buffer;
 	Page		page;
 	BTPageOpaque opaque;
-#ifdef UNUSED
 	xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
 
 	/*
 	 * This section of code is thought to be no longer needed, after analysis
@@ -478,14 +668,34 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		if (len > 0)
 		{
-			OffsetNumber *unused;
-			OffsetNumber *unend;
+			if (xlrec->nremaining)
+			{
+				OffsetNumber *remainingoffset;
+				IndexTuple	remaining;
+				Size		itemsz;
+
+				remainingoffset = (OffsetNumber *)
+					(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+				remaining = (IndexTuple) ((char *) remainingoffset +
+										  xlrec->nremaining * sizeof(OffsetNumber));
+
+				/* Handle posting tuples */
+				for (int i = 0; i < xlrec->nremaining; i++)
+				{
+					PageIndexTupleDelete(page, remainingoffset[i]);
 
-			unused = (OffsetNumber *) ptr;
-			unend = (OffsetNumber *) ((char *) ptr + len);
+					itemsz = MAXALIGN(IndexTupleSize(remaining));
 
-			if ((unend - unused) > 0)
-				PageIndexMultiDelete(page, unused, unend - unused);
+					if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+									false, false) == InvalidOffsetNumber)
+						elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+					remaining = (IndexTuple) ((char *) remaining + itemsz);
+				}
+			}
+
+			if (xlrec->ndeleted)
+				PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 		}
 
 		/*
@@ -838,6 +1048,9 @@ btree_redo(XLogReaderState *record)
 		case XLOG_BTREE_SPLIT_R:
 			btree_xlog_split(false, record);
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			btree_xlog_dedup(record);
+			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
 			break;
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index a14eb79..802e27b 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -30,7 +30,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_insert *xlrec = (xl_btree_insert *) rec;
 
-				appendStringInfo(buf, "off %u", xlrec->offnum);
+				appendStringInfo(buf, "off %u; in_posting_offset %u",
+								 xlrec->offnum, xlrec->in_posting_offset);
 				break;
 			}
 		case XLOG_BTREE_SPLIT_L:
@@ -38,16 +39,27 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
+				/* FIXME: even master doesn't have newitemoff */
 				appendStringInfo(buf, "level %u, firstright %d",
 								 xlrec->level, xlrec->firstright);
 				break;
 			}
+		case XLOG_BTREE_DEDUP_PAGE:
+			{
+				xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+				appendStringInfo(buf, "items were deduplicated to %d items",
+								 xlrec->n_intervals);
+				break;
+			}
 		case XLOG_BTREE_VACUUM:
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "lastBlockVacuumed %u",
-								 xlrec->lastBlockVacuumed);
+				appendStringInfo(buf, "lastBlockVacuumed %u; nremaining %u; ndeleted %u",
+								 xlrec->lastBlockVacuumed,
+								 xlrec->nremaining,
+								 xlrec->ndeleted);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
@@ -131,6 +143,9 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			id = "DEDUPLICATE";
+			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 52eafe6..d1af18f 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
  * All other types of index tuples ("pivot" tuples) only have key columns,
  * since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,38 @@ typedef struct BTMetaPageData
  * omitted rather than truncated, since its representation is different to
  * the non-pivot representation.)
  *
+ * Non-pivot posting tuple format:
+ *  t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicated keys more effectively, we use special format
+ * of tuples - posting tuples.  posting_list is an array of ItemPointerData.
+ *
+ * Deduplication never applies to unique indexes or indexes with INCLUDEd
+ * columns.
+ *
+ * To differ posting tuples we use INDEX_ALT_TID_MASK flag in t_info and
+ * BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's tid:
+ * - t_tid.ip_blkid contains offset of the posting list.
+ * - t_tid offset field contains number of posting items this tuple contain
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * tuples, which is constrainted by BTMaxItemSize.
+
+ * If page contains so many duplicates, that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), page may contain several posting
+ * tuples with the same key.
+ * Also page can contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
  * Pivot tuple format:
  *
  *  t_tid | t_info | key values | [heap TID]
@@ -281,23 +312,145 @@ typedef struct BTMetaPageData
  * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
  * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
  * number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinct posting tuples from pivot tuples.
  *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_N_POSTING_OFFSET_MASK	0x0FFF
 #define BT_HEAP_TID_ATTR			0x1000
+#define BT_IS_POSTING				0x2000
+
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger then
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+	((int) ((BLCKSZ - SizeOfPageHeaderData - \
+			3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+			(sizeof(ItemPointerData)))
+
+/*
+ * Helper for BTDedupState.
+ * Each entry represents a group of 'ntups' consecutive items starting on
+ * 'from' offset that were deduplicated into a single posting tuple.
+ */
+typedef struct dedupInterval
+{
+	OffsetNumber from;
+	OffsetNumber ntups;
+} dedupInterval;
+
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * Iterating over tuples during index build or applying deduplication to a
+ * single page, we remember a tuple in itupprev, then compare the next one
+ * with it.  If tuples are equal, save their TIDs in the posting list.
+ * ntuples contains the size of the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTDedupState
+{
+	Size		maxitemsize;
+	Size		maxpostingsize;
+	IndexTuple	itupprev;
+
+	/*
+	 * array with info about deduplicated items on the page.
+	 *
+	 * It contains one entry for each group of consecutive items that
+	 * were deduplicated into a single posting tuple.
+	 *
+	 * This array is saved to xlog entry, which allows to replay
+	 * deduplication faster without actually comparing tuple's keys.
+	 */
+	dedupInterval dedup_intervals[MaxOffsetNumber];
+	/* current number of items in dedup_intervals array */
+	int			n_intervals;
+	/* temp state variable to keep a 'possible' start of dedup interval */
+	OffsetNumber itupprev_off;
+
+	int			ntuples;
+	ItemPointerData *ipd;
+} BTDedupState;
+
+/*
+ * N.B.: BTreeTupleIsPivot() should only be used in code that deals with
+ * heapkeyspace indexes specifically.  BTreeTupleIsPosting() works with all
+ * nbtree indexes, though.
+ */
+#define BTreeTupleIsPivot(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+	)
+#define BTreeTupleIsPosting(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+	)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleGetNPosting(itup)	\
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+	)
+#define BTreeTupleSetNPosting(itup, n) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+	} while(0)
 
-/* Get/set downlink block number */
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+	)
+#define BTreeSetPostingMeta(itup, nposting, off) \
+	do { \
+		BTreeTupleSetNPosting(itup, nposting); \
+		Assert(BTreeTupleIsPosting(itup)); \
+		ItemPointerSetBlockNumber(&((itup)->t_tid), (off)); \
+	} while(0)
+
+#define BTreeTupleGetPosting(itup) \
+	(ItemPointer) ((char*) (itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+	(BTreeTupleGetPosting(itup) + (n))
+
+/* Get/set downlink block number  */
 #define BTreeInnerTupleGetDownLink(itup) \
 	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
 #define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,40 +479,73 @@ typedef struct BTMetaPageData
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
 		( \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
 		IndexRelationGetNumberOfAttributes(rel) \
 	)
-#define BTreeTupleSetNAtts(itup, n) \
-	do { \
-		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
-	} while(0)
+
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int n)
+{
+	Assert(!BTreeTupleIsPosting(itup));
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	ItemPointerSetOffsetNumber(&itup->t_tid, n & BT_N_KEYS_OFFSET_MASK);
+}
 
 /*
- * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Get tiebreaker heap TID attribute, if any.  Works with both pivot and
+ * non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
  */
-#define BTreeTupleGetHeapTID(itup) \
-	( \
-	  (itup)->t_info & INDEX_ALT_TID_MASK && \
-	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
-	  ( \
-		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
-					   sizeof(ItemPointerData)) \
-	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
-	)
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+	if (BTreeTupleIsPivot(itup))
+	{
+		/* Pivot tuple heap TID representation? */
+		if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+			 BT_HEAP_TID_ATTR) != 0)
+			return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+								  sizeof(ItemPointerData));
+
+		/* Heap TID attribute was truncated */
+		return NULL;
+	}
+	else if (BTreeTupleIsPosting(itup))
+		return BTreeTupleGetPosting(itup);
+
+	return &(itup->t_tid);
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list tuple.  Works with
+ * non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxTID(IndexTuple itup)
+{
+	Assert(!BTreeTupleIsPivot(itup));
+
+	if (BTreeTupleIsPosting(itup))
+		return (ItemPointer) (BTreeTupleGetPosting(itup) +
+							  (BTreeTupleGetNPosting(itup) - 1));
+
+	return &(itup->t_tid);
+}
+
 /*
  * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * representation
  */
 #define BTreeTupleSetAltHeapTID(itup) \
 	do { \
-		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(BTreeTupleIsPivot(itup)); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
 								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
 	} while(0)
@@ -500,6 +686,13 @@ typedef struct BTInsertStateData
 	Buffer		buf;
 
 	/*
+	 * if _bt_binsrch_insert() found the location inside existing posting
+	 * list, save the position inside the list.  This will be -1 in rare cases
+	 * where the overlapping posting list is LP_DEAD.
+	 */
+	int			in_posting_offset;
+
+	/*
 	 * Cache of bounds within the current buffer.  Only used for insertions
 	 * where _bt_check_unique is called.  See _bt_binsrch_insert and
 	 * _bt_findinsertloc for details.
@@ -534,7 +727,9 @@ typedef BTInsertStateData *BTInsertState;
  * If we are doing an index-only scan, we save the entire IndexTuple for each
  * matched item, otherwise only its heap TID and offset.  The IndexTuples go
  * into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array.  Posting list tuples store a version of the
+ * tuple that does not include the posting list, allowing the same key to be
+ * returned for each logical tuple associated with the posting list.
  */
 
 typedef struct BTScanPosItem	/* what we remember about each match */
@@ -563,9 +758,13 @@ typedef struct BTScanPosData
 
 	/*
 	 * If we are doing an index-only scan, nextTupleOffset is the first free
-	 * location in the associated tuple storage workspace.
+	 * location in the associated tuple storage workspace.  Posting list
+	 * tuples need postingTupleOffset to store the current location of the
+	 * tuple that is returned multiple times (once per heap TID in posting
+	 * list).
 	 */
 	int			nextTupleOffset;
+	int			postingTupleOffset;
 
 	/*
 	 * The items array is always ordered in index order (ie, increasing
@@ -578,7 +777,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxPostingIndexTuplesPerPage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -732,6 +931,9 @@ extern bool _bt_doinsert(Relation rel, IndexTuple itup,
 						 IndexUniqueCheck checkUnique, Relation heapRel);
 extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
 extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
+extern IndexTuple _bt_form_newposting(IndexTuple itup, IndexTuple oposting,
+				   OffsetNumber in_posting_offset);
+extern void _bt_dedup_insert(Page page, BTDedupState *dedupState);
 
 /*
  * prototypes for functions in nbtsplitloc.c
@@ -762,6 +964,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems, Relation heapRel);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems,
+								OffsetNumber *remainingoffset,
+								IndexTuple *remaining, int nremaining,
 								BlockNumber lastBlockVacuumed);
 extern int	_bt_pagedel(Relation rel, Buffer buf);
 
@@ -812,6 +1016,9 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd,
+										int nipd);
+extern IndexTuple BTreeGetNthTupleOfPosting(IndexTuple tuple, int n);
 
 /*
  * prototypes for functions in nbtvalidate.c
@@ -824,5 +1031,7 @@ extern bool btvalidate(Oid opclassoid);
 extern IndexBuildResult *btbuild(Relation heap, Relation index,
 								 struct IndexInfo *indexInfo);
 extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+extern void _bt_stash_item_tid(BTDedupState *dedupState, IndexTuple itup,
+							   OffsetNumber itup_offnum);
 
 #endif							/* NBTREE_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index afa614d..075baaf 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP_PAGE	0x50	/* compactify tuples on the page */
+/* 0x60 is unused */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -61,16 +62,21 @@ typedef struct xl_btree_metadata
  * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
  * Note that INSERT_META implies it's not a leaf page.
  *
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page (data contains the inserted tuple);
+ *				 if in_posting_offset is valid, this is an insertion
+ *				 into existing posting tuple at offnum.
+ *				 redo must repeat logic of bt_insertonpg().
  * Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
  * Backup Blk 2: xl_btree_metadata, if INSERT_META
+ *
  */
 typedef struct xl_btree_insert
 {
 	OffsetNumber offnum;
+	OffsetNumber in_posting_offset;
 } xl_btree_insert;
 
-#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, in_posting_offset) + sizeof(OffsetNumber))
 
 /*
  * On insert with split, we save all the items going into the right sibling
@@ -96,6 +102,11 @@ typedef struct xl_btree_insert
  * An IndexTuple representing the high key of the left page must follow with
  * either variant.
  *
+ * In case, split included insertion into the middle of the posting tuple, and
+ * thus required posting tuple replacement, it also contains 'in_posting_offset',
+ * that is used to form replacing tuple and repean bt_insertonpg() logic.
+ * It is added to xlog only if replacing item remains on the left page.
+ *
  * Backup Blk 1: new right page
  *
  * The right page's data portion contains the right page's tuples in the form
@@ -113,9 +124,26 @@ typedef struct xl_btree_split
 	uint32		level;			/* tree level of page being split */
 	OffsetNumber firstright;	/* first item moved to right page */
 	OffsetNumber newitemoff;	/* new item's offset (if placed on left page) */
+	OffsetNumber in_posting_offset; /* offset inside posting tuple  */
 } xl_btree_split;
 
-#define SizeOfBtreeSplit	(offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit	(offsetof(xl_btree_split, in_posting_offset) + sizeof(OffsetNumber))
+
+/*
+ * When page is deduplicated, consecutive groups of tuples with equal keys
+ * are compactified into posting tuples.
+ * The WAL record keeps number of resulting posting tuples - n_intervals
+ * followed by array of dedupInterval structures, that hold information
+ * needed to replay page deduplication without extra comparisons of tuples keys.
+ */
+typedef struct xl_btree_dedup
+{
+	int			n_intervals;
+
+	/* TARGET DEDUP INTERVALS FOLLOW AT THE END */
+} xl_btree_dedup;
+#define SizeOfBtreeDedup (sizeof(int))
+
 
 /*
  * This is what we need to know about delete of individual leaf index tuples.
@@ -173,10 +201,19 @@ typedef struct xl_btree_vacuum
 {
 	BlockNumber lastBlockVacuumed;
 
-	/* TARGET OFFSET NUMBERS FOLLOW */
+	/*
+	 * This field helps us to find beginning of the remaining tuples from
+	 * postings which follow array of offset numbers.
+	 */
+	uint32		nremaining;
+	uint32		ndeleted;
+
+	/* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+	/* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
+	/* TARGET OFFSET NUMBERS FOLLOW (if any) */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(BlockNumber))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.
diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index ec47a22..71a03e3 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -212,3 +212,24 @@
    Memcheck:Cond
    fun:PyObject_Realloc
 }
+
+# Temporarily work around bug in datum_image_eq's handling of the cstring
+# (typLen == -2) case.  datumIsEqual() is not affected, but also doesn't handle
+# TOAST'ed values correctly.
+#
+# FIXME: Remove both suppressions when bug is fixed on master branch
+{
+   temporary_workaround_1
+   Memcheck:Addr1
+   fun:bcmp
+   fun:datum_image_eq
+   fun:_bt_keep_natts_fast
+}
+
+{
+   temporary_workaround_8
+   Memcheck:Addr8
+   fun:bcmp
+   fun:datum_image_eq
+   fun:_bt_keep_natts_fast
+}

#81

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Anastasia Lubennikova (#80)

1 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Wed, Sep 11, 2019 at 5:38 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

I reviewed them and everything looks good except the idea of not
splitting dead posting tuples.
According to comments to scan->ignore_killed_tuples in genam.c:107,
it may lead to incorrect tuple order on a replica.
I don't sure, if it leads to any real problem, though, or it will be
resolved
by subsequent visibility checks.

Fair enough, but I didn't do that because it's compelling on its own
-- it isn't. I did it because it seemed like the best way to handle
posting list splits in a version of the patch where LP_DEAD bits can
be set on posting list tuples. I think that we have 3 high level
options here:

1. We don't support kill_prior_tuple/LP_DEAD bit setting with posting
lists at all. This is clearly the easiest approach.

2. We do what I did in v11 of the patch -- we make it so that
_bt_insertonpg() and _bt_split() never have to deal with LP_DEAD
posting lists that they must split in passing.

3. We add additional code to _bt_insertonpg() and _bt_split() to deal
with the rare case where they must split an LP_DEAD posting list,
probably by unsetting the bit or something like that. Obviously it
would be wrong to leave the LP_DEAD bit set for the newly inserted
heap tuples TID that must go in a posting list that had its LP_DEAD
bit set -- that would make it dead to index scans even after its xact
successfully committed.

I think that you already agree that we want to have the
kill_prior_tuple optimizations with posting lists, so #1 isn't really
an option. That just leaves #2 and #3. Since posting list splits are
already assumed to be quite rare, it seemed far simpler to take the
conservative approach of forcing clean-up that removes LP_DEAD bits so
that _bt_insertonpg() and _bt_split() don't have to think about it.
Obviously I think it's important that we make as few changes as
possible to _bt_insertonpg() and _bt_split(), in general.

I don't understand what you mean about visibility checks. There is
nothing truly special about the way in which _bt_findinsertloc() will
sometimes have to kill LP_DEAD items so that _bt_insertonpg() and
_bt_split() don't have to think about LP_DEAD posting lists. As far as
recovery is concerned, it is just another XLOG_BTREE_DELETE record,
like any other. Note that there is a second call to
_bt_binsrch_insert() within _bt_findinsertloc() when it has to
generate a new XLOG_BTREE_DELETE record (by calling
_bt_dedup_one_page(), which calls _bt_delitems_delete() in a way that
isn't dependent on the BTP_HAS_GARBAGE status bit being set).

Anyway, it's worth to add more comments in
_bt_killitems() explaining why it's safe.

There is no question that the little snippet of code I added to
_bt_killitems() in v11 is still too complicated. We also have to
consider cases where the array overflows because the scan direction
was changed (see the kill_prior_tuple comment block in btgetuple()).
Yeah, it's messy.

Attached is v12, which contains WAL optimizations for posting split and
page
deduplication.

Cool.

* xl_btree_split record doesn't contain posting tuple anymore, instead
it keeps
'in_posting offset' and repeats the logic of _bt_insertonpg() as you
proposed
upthread.

That looks good.

* I introduced new xlog record XLOG_BTREE_DEDUP_PAGE, which contains
info about
groups of tuples deduplicated into posting tuples. In principle, it is
possible
to fit it into some existing record, but I preferred to keep things clear.

I definitely think that inventing a new WAL record was the right thing to do.

I haven't measured how these changes affect WAL size yet.
Do you have any suggestions on how to automate testing of new WAL records?
Is there any suitable place in regression tests?

I don't know about the regression tests (I doubt that there is a
natural place for such a test), but I came up with a rough test case.
I more or less copied the approach that you took with the index build
WAL reduction patches, though I also figured out a way of subtracting
heapam WAL overhead to get a real figure. I attach the test case --
note that you'll need to use the "land" database with this. (This test
case might need to be improved, but it's a good start.)

* I also noticed that _bt_dedup_one_page() can be optimized to return early
when none tuples were deduplicated. I wonder if we can introduce inner
statistic to tune deduplication? That is returning to the idea of
BT_COMPRESS_THRESHOLD, which can help to avoid extra work for pages that
have
very few duplicates or pages that are already full of posting lists.

I think that the BT_COMPRESS_THRESHOLD idea is closely related to
making _bt_dedup_one_page() behave incrementally.

On my machine, v12 of the patch actually uses slightly more WAL than
v11 did with the nbtree_wal_test.sql test case -- it's 6510 MB of
nbtree WAL in v12 vs. 6502 MB in v11 (note that v11 benefits from WAL
compression, so if I turned that off v12 would probably win by a small
amount). Both numbers are wildly excessive, though. The master branch
figure is only 2011 MB, which is only about 1.8x the size of the index
on the master branch. And this is for a test case that makes the index
6.5x smaller, so the gap between total index size and total WAL volume
is huge here -- the volume of WAL is nearly 40x greater than the index
size!

You are right to wonder what the result would be if we put
BT_COMPRESS_THRESHOLD back in. It would probably significantly reduce
the volume of WAL, because _bt_dedup_one_page() would no longer
"thrash". However, I strongly suspect that that wouldn't be good
enough at reducing the WAL volume down to something acceptable. That
will require an approach to WAL-logging that is much more logical than
physical. The nbtree_wal_test.sql test case involves a case where page
splits mostly don't WAL-log things that were previously WAL-logged by
simple inserts, because nbtsplitloc.c has us split in a right-heavy
fashion when there are lots of duplicates. In other words, the
_bt_split() optimization to WAL volume naturally works very well with
the test case, or really any case with lots of duplicates, so the
"write amplification" to the total volume of WAL is relatively small
on the master branch.

I think that the new WAL record has to be created once per posting
list that is generated, not once per page that is deduplicated --
that's the only way that I can see that avoids a huge increase in
total WAL volume. Even if we assume that I am wrong about there being
value in making deduplication incremental, it is still necessary to
make the WAL-logging behave incrementally. Otherwise you end up
needlessly rewriting things that didn't actually change way too often.
That's definitely not okay. Why worry about bringing 40x down to 20x,
or even 10x? It needs to be comparable to the master branch.

To be honest, I don't believe that incremental deduplication can really
improve
something, because no matter how many items were compressed we still
rewrite
all items from the original page to the new one, so, why not do our best.
What do we save by this incremental approach?

The point of being incremental is not to save work in cases where a
page split is inevitable anyway. Rather, the idea is that we can be
even more lazy, and avoid doing work that will never be needed --
maybe delaying page splits actually means preventing them entirely.
Or, we can spread out the work over time, so that the amount of WAL
per checkpoint is smoother than what we would get with a batch
approach. My mental model of page splits is that there are sometimes
many of them on the same page again and again in a very short time
period, but more often the chances of any individual page being split
is low. Even the rightmost page of a serial PK index isn't truly an
exception, because a new rightmost page isn't "the same page" as the
original rightmost page -- it is its new right sibling.

Since we're going to have to optimize the WAL logging anyway, it will
be relatively easy to experiment with incremental deduplication within
_bt_dedup_one_page(). The WAL logging is the the hard part, so let's
focus on that rather than worrying too much about whether or not
incrementally doing all the work (not just the WAL logging) makes
sense. It's still too early to be sure about whether or not that's a
good idea.

--
Peter Geoghegan

#82

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Anastasia Lubennikova (#80)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Wed, Sep 11, 2019 at 5:38 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

Attached is v12, which contains WAL optimizations for posting split and
page
deduplication.

Hmm. So v12 seems to have some problems with the WAL logging for
posting list splits. With wal_debug = on and
wal_consistency_checking='all', I can get a replica to fail
consistency checking very quickly when "make installcheck" is run on
the primary:

4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/30423A0; LSN 0/30425A0:
prev 0/3041C78; xid 506; len 3; blkref #0: rel 1663/16385/2608, blk 56
FPW - Heap/INSERT: off 20 flags 0x00
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/30425A0; LSN 0/3042F78:
prev 0/30423A0; xid 506; len 4; blkref #0: rel 1663/16385/2673, blk 13
FPW - Btree/INSERT_LEAF: off 138; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3042F78; LSN 0/3043788:
prev 0/30425A0; xid 506; len 4; blkref #0: rel 1663/16385/2674, blk 37
FPW - Btree/INSERT_LEAF: off 68; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3043788; LSN 0/30437C0:
prev 0/3042F78; xid 506; len 28 - Transaction/ABORT: 2019-09-11
15:01:06.291717-07; rels: pg_tblspc/16388/PG_13_201909071/16385/16399
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/30437C0; LSN 0/3043A30:
prev 0/3043788; xid 507; len 3; blkref #0: rel 1663/16385/1247, blk 9
FPW - Heap/INSERT: off 9 flags 0x00
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3043A30; LSN 0/3043D08:
prev 0/30437C0; xid 507; len 4; blkref #0: rel 1663/16385/2703, blk 2
FPW - Btree/INSERT_LEAF: off 51; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3043D08; LSN 0/3044948:
prev 0/3043A30; xid 507; len 4; blkref #0: rel 1663/16385/2704, blk 1
FPW - Btree/INSERT_LEAF: off 169; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3044948; LSN 0/3044B58:
prev 0/3043D08; xid 507; len 3; blkref #0: rel 1663/16385/2608, blk 56
FPW - Heap/INSERT: off 21 flags 0x00
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3044B58; LSN 0/30454A0:
prev 0/3044948; xid 507; len 4; blkref #0: rel 1663/16385/2673, blk 8
FPW - Btree/INSERT_LEAF: off 156; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/30454A0; LSN 0/3045CC0:
prev 0/3044B58; xid 507; len 4; blkref #0: rel 1663/16385/2674, blk 37
FPW - Btree/INSERT_LEAF: off 71; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3045CC0; LSN 0/3045F48:
prev 0/30454A0; xid 507; len 3; blkref #0: rel 1663/16385/1247, blk 9
FPW - Heap/INSERT: off 10 flags 0x00
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3045F48; LSN 0/3046240:
prev 0/3045CC0; xid 507; len 4; blkref #0: rel 1663/16385/2703, blk 2
FPW - Btree/INSERT_LEAF: off 51; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3046240; LSN 0/3046E70:
prev 0/3045F48; xid 507; len 4; blkref #0: rel 1663/16385/2704, blk 1
FPW - Btree/INSERT_LEAF: off 44; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3046E70; LSN 0/3047090:
prev 0/3046240; xid 507; len 3; blkref #0: rel 1663/16385/2608, blk 56
FPW - Heap/INSERT: off 22 flags 0x00
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3047090; LSN 0/30479E0:
prev 0/3046E70; xid 507; len 4; blkref #0: rel 1663/16385/2673, blk 8
FPW - Btree/INSERT_LEAF: off 156; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/30479E0; LSN 0/3048420:
prev 0/3047090; xid 507; len 4; blkref #0: rel 1663/16385/2674, blk 38
FPW - Btree/INSERT_LEAF: off 10; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3048420; LSN 0/30486B0:
prev 0/30479E0; xid 507; len 3; blkref #0: rel 1663/16385/1259, blk 0
FPW - Heap/INSERT: off 6 flags 0x00
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/30486B0; LSN 0/3048C30:
prev 0/3048420; xid 507; len 4; blkref #0: rel 1663/16385/2662, blk 2
FPW - Btree/INSERT_LEAF: off 119; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3048C30; LSN 0/3049668:
prev 0/30486B0; xid 507; len 4; blkref #0: rel 1663/16385/2663, blk 1
FPW - Btree/INSERT_LEAF: off 42; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3049668; LSN 0/304A550:
prev 0/3048C30; xid 507; len 4; blkref #0: rel 1663/16385/3455, blk 1
FPW - Btree/INSERT_LEAF: off 2; in_posting_offset 1
4448/2019-09-11 15:01:06 PDT FATAL: inconsistent page found, rel
1663/16385/3455, forknum 0, blkno 1
4448/2019-09-11 15:01:06 PDT CONTEXT: WAL redo at 0/3049668 for
Btree/INSERT_LEAF: off 2; in_posting_offset 1
4447/2019-09-11 15:01:06 PDT LOG: startup process (PID 4448) exited
with exit code 1
4447/2019-09-11 15:01:06 PDT LOG: terminating any other active server processes
4447/2019-09-11 15:01:06 PDT LOG: database system is shut down

I regularly use this test case for the patch -- I think that I fixed a
similar problem in v11, when I changed the same WAL logging, but I
didn't mention it until now. I will debug this myself in a few days,
though you may prefer to do it before then.

--
Peter Geoghegan

#83

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Peter Geoghegan (#82)

1 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Wed, Sep 11, 2019 at 3:09 PM Peter Geoghegan <pg@bowt.ie> wrote:

Hmm. So v12 seems to have some problems with the WAL logging for
posting list splits. With wal_debug = on and
wal_consistency_checking='all', I can get a replica to fail
consistency checking very quickly when "make installcheck" is run on
the primary

I see the bug here. The problem is that we WAL-log a version of the
new item that already has its heap TID changed. On the primary, the
call to _bt_form_newposting() has a new item with the original heap
TID, which is then rewritten before being inserted -- that's correct.
But during recovery, we *start out with* a version of the new item
that *already* had its heap TID swapped. So we have nowhere to get the
original heap TID from during recovery.

Attached patch fixes the problem in a hacky way -- it WAL-logs the
original heap TID, just in case. Obviously this fix isn't usable, but
it should make the problem clearer.

Can you come up with a proper fix, please? I can think of one way of
doing it, but I'll leave the details to you.

The same issue exists in _bt_split(), so the tests will still fail
with wal_consistency_checking -- it just takes a lot longer to reach a
point where an inconsistent page is found, because posting list splits
that occur at the same point that we need to split a page are much
rarer than posting list splits that occur when we simply need to
insert, without splitting the page. I suggest using
wal_consistency_checking to test the fix that you come up with. As I
mentioned, I regularly use it. Also note that there are further
subtleties to doing this within _bt_split() -- see the FIXME comments
there.

Thanks
--
Peter Geoghegan

Attachments:

0001-Save-original-new-heap-TID-in-insert-WAL-record.patchapplication/octet-stream; name=0001-Save-original-new-heap-TID-in-insert-WAL-record.patchDownload

From 8efe8f8f94d8f3195ba65b964799ca2c75f971fd Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 11 Sep 2019 17:46:11 -0700
Subject: [PATCH] Save original new heap TID in insert WAL record.

---
 src/backend/access/nbtree/nbtinsert.c | 14 ++++++++++++++
 src/backend/access/nbtree/nbtxlog.c   |  3 +++
 src/include/access/nbtxlog.h          |  4 +++-
 3 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 8fb17d6784..119e3fe5a6 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -1037,6 +1037,7 @@ _bt_insertonpg(Relation rel,
 	Size		itemsz;
 	IndexTuple	nposting = NULL;
 	IndexTuple	oposting;
+	ItemPointerData orig;
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -1061,6 +1062,7 @@ _bt_insertonpg(Relation rel,
 	itemsz = IndexTupleSize(itup);
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
+	memset(&orig, 0, sizeof(ItemPointerData));
 
 	/*
 	 * Do we need to split an existing posting list item?
@@ -1092,6 +1094,8 @@ _bt_insertonpg(Relation rel,
 		Assert(in_posting_offset > 0);
 		oposting = (IndexTuple) PageGetItem(page, itemid);
 
+		/* HACK Save orig heap TID for WAL logging */
+		ItemPointerCopy(&itup->t_tid, &orig);
 		nposting = _bt_form_newposting(itup, oposting, in_posting_offset);
 
 		/* Alter new item offset, since effective new item changed */
@@ -1264,6 +1268,7 @@ _bt_insertonpg(Relation rel,
 
 			xlrec.offnum = itup_off;
 			xlrec.in_posting_offset = in_posting_offset;
+			xlrec.orig = orig;
 
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
@@ -1856,6 +1861,15 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		 * with all the other items on the right page.
 		 * Otherwise, save in_posting_offset and newitem to construct
 		 * replacing tuple.
+		 *
+		 * FIXME: The same "original new item TID vs. rewritten new item TID"
+		 * issue exists here, but I haven't done anything with that.
+		 *
+		 * FIXME: Be careful about splits where the new item is also the first
+		 * item on the right half -- that would make the posting list that we
+		 * have to update in-place the last item on the left.  This is hard to
+		 * test because nbtsplitloc.c will avoid choosing a split point
+		 * between these two.
 		 */
 		xlrec.in_posting_offset = InvalidOffsetNumber;
 		if (replacepostingoff < firstright)
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index de9bc3b101..5bb38beda1 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -189,6 +189,9 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 			IndexTuple newitem = (IndexTuple) datapos;
 			IndexTuple nposting;
 
+			/* Restore newitem to actual original state in _bt_insertonpg() */
+			newitem = CopyIndexTuple(newitem);
+			ItemPointerCopy(&xlrec->orig, &newitem->t_tid);
 			nposting = _bt_form_newposting(newitem, oposting,
 										   xlrec->in_posting_offset);
 			Assert(isleaf);
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 075baaf6eb..2813e569dc 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -15,6 +15,7 @@
 
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
+#include "storage/itemptr.h"
 #include "storage/off.h"
 
 /*
@@ -74,9 +75,10 @@ typedef struct xl_btree_insert
 {
 	OffsetNumber offnum;
 	OffsetNumber in_posting_offset;
+	ItemPointerData orig;
 } xl_btree_insert;
 
-#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, in_posting_offset) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, orig) + sizeof(ItemPointerData))
 
 /*
  * On insert with split, we save all the items going into the right sibling
-- 
2.17.1

#84

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Peter Geoghegan (#81)

3 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Wed, Sep 11, 2019 at 2:04 PM Peter Geoghegan <pg@bowt.ie> wrote:

I think that the new WAL record has to be created once per posting
list that is generated, not once per page that is deduplicated --
that's the only way that I can see that avoids a huge increase in
total WAL volume. Even if we assume that I am wrong about there being
value in making deduplication incremental, it is still necessary to
make the WAL-logging behave incrementally.

Attached is v13 of the patch, which shows what I mean. You could say
that v13 makes _bt_dedup_one_page() do a few extra things that are
kind of similar to the things that nbtsplitloc.c does for _bt_split().

More specifically, the v13-0001-* patch includes code that makes
_bt_dedup_one_page() "goal orientated" -- it calculates how much space
will be freed when _bt_dedup_one_page() goes on to deduplicate those
items on the page that it has already "decided to deduplicate". The
v13-0002-* patch makes _bt_dedup_one_page() actually use this ability
-- it makes _bt_dedup_one_page() give up on deduplication when it is
clear that the items that are already "pending deduplication" will
free enough space for its caller to at least avoid a page split. This
revision of the patch doesn't truly make deduplication incremental. It
is only a proof of concept that shows how _bt_dedup_one_page() can
*decide* that it will free "enough" space, whatever that may mean, so
that it can finish early. The task of making _bt_dedup_one_page()
actually avoid lots of work when it finishes early remains.

As I said yesterday, I'm not asking you to accept that v13-0002-* is
an improvement. At least not yet. In fact, "finishes early" due to the
v13-0002-* logic clearly makes everything a lot slower, since
_bt_dedup_one_page() will "thrash" even more than earlier versions of
the patch. This is especially problematic with WAL-logged relations --
the test case that I shared yesterday goes from about 6GB to 10GB with
v13-0002-* applied. But we need to fundamentally rethink the approach
to the rewriting + WAL-logging by _bt_dedup_one_page() anyway. (Note
that total index space utilization is barely affected by the
v13-0002-* patch, so clearly that much works well.)

Other changes:

* Small tweaks to amcheck (nothing interesting, really).

* Small tweaks to the _bt_killitems() stuff.

* Moved all of the deduplication helper functions to nbtinsert.c. This
is where deduplication gets complicated, so I think that it should all
live there. (i.e. nbtsort.c will call nbtinsert.c code, never the
other way around.)

Note that I haven't merged any of the changes from v12 of the patch
from yesterday. I didn't merge the posting list WAL logging changes
because of the bug I reported, but I would have were it not for that.
The WAL logging for _bt_dedup_one_page() added to v12 didn't appear to
be more efficient than your original approach (i.e. calling
log_newpage_buffer()), so I have stuck with your original approach.

It would be good to hear your thoughts on this _bt_dedup_one_page()
WAL volume/"write amplification" issue.

--
Peter Geoghegan

Attachments:

v13-0002-Stop-deduplicating-when-a-page-split-is-avoided.patchapplication/octet-stream; name=v13-0002-Stop-deduplicating-when-a-page-split-is-avoided.patchDownload

From a7d4cafc92358e6095a48c0b42ccbe06b7b8bd5f Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 12 Sep 2019 16:19:54 -0700
Subject: [PATCH v13 2/3] Stop deduplicating when a page split is avoided.

Currently this is a big loss for performance, especially with WAL-logged
relations, though it barely affects total space utilization compared to
recent versions of the patch.  With incremental rewriting of the page
and incremental WAL logging, this could actually be a win for
performance.

In any case it seems like a good thing for deduplication to be able to
operate in a "goal-orientated" way.  The exact details will need to be
validated by extensive benchmarking.
---
 src/backend/access/nbtree/nbtinsert.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 52651fcbe4..f3b945edf9 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -2675,6 +2675,21 @@ _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
 			pagesaving += _bt_dedup_insert(newpage, dedupState);
 		}
 
+		/*
+		 * When we have deduplicated enough to avoid page split, don't bother
+		 * deduplicating any more items.
+		 *
+		 * FIXME: If rewriting the page and doing the WAL logging were
+		 * incremental, we could actually break out of the loop and save real
+		 * work.  As things stand this is a loss for performance, but it
+		 * barely affects space utilization. (The number of blocks are the
+		 * same as before, except for rounding effects.  The minimum number of
+		 * items on each page for each index "increases" when this is enabled,
+		 * however.)
+		 */
+		if (pagesaving >= newitemsz)
+			deduplicate = false;
+
 		pfree(dedupState->itupprev);
 		dedupState->itupprev = CopyIndexTuple(itup);
 	}
-- 
2.17.1

v13-0003-DEBUG-Add-pageinspect-instrumentation.patchapplication/octet-stream; name=v13-0003-DEBUG-Add-pageinspect-instrumentation.patchDownload

From 711db4cd083528bb9c39cd66ed9faee0141e108a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v13 3/3] DEBUG: Add pageinspect instrumentation.

Have pageinspect display user-visible attribute values, heap TID, max
heap TID, and the number of TIDs in a tuple (can be > 1 in the case of
posting list tuples).  Also adds a column that shows whether or not the
LP_DEAD bit has been set.

This patch is not proposed for inclusion in PostgreSQL; it's included
for the convenience of reviewers.

The following query can be used with this hacked pageinspect, which
visualizes the internal pages:

"""

with recursive index_details as (
  select
    'my_test_index'::text idx
),
size_in_pages_index as (
  select
    (pg_relation_size(idx::regclass) / (2^13))::int4 size_pages
  from
    index_details
),
page_stats as (
  select
    index_details.*,
    stats.*
  from
    index_details,
    size_in_pages_index,
    lateral (select i from generate_series(1, size_pages - 1) i) series,
    lateral (select * from bt_page_stats(idx, i)) stats),
internal_page_stats as (
  select
    *
  from
    page_stats
  where
    type != 'l'),
meta_stats as (
  select
    *
  from
    index_details s,
    lateral (select * from bt_metap(s.idx)) meta),
internal_items as (
  select
    *
  from
    internal_page_stats
  order by
    btpo desc),
-- XXX: Note ordering dependency within this CTE, on internal_items
ordered_internal_items(item, blk, level) as (
  select
    1,
    blkno,
    btpo
  from
    internal_items
  where
    btpo_prev = 0
    and btpo = (select level from meta_stats)
  union
  select
    case when level = btpo then o.item + 1 else 1 end,
    blkno,
    btpo
  from
    internal_items i,
    ordered_internal_items o
  where
    i.btpo_prev = o.blk or (btpo_prev = 0 and btpo = o.level - 1)
)
select
  --idx,
  btpo as level,
  item as l_item,
  blkno,
  --btpo_prev,
  --btpo_next,
  btpo_flags,
  type,
  live_items,
  dead_items,
  avg_item_size,
  page_size,
  free_size,
  -- Only non-rightmost pages have high key.  Show heap TID for both pivot and non-pivot tuples here.
  case when btpo_next != 0 then (select data || coalesce(', (htid)=(''' || htid || ''')', '')
                                 from bt_page_items(idx, blkno) where itemoffset = 1) end as highkey
from
  ordered_internal_items o
  join internal_items i on o.blk = i.blkno
order by btpo desc, item;
"""
---
 contrib/pageinspect/btreefuncs.c              | 91 ++++++++++++++++---
 contrib/pageinspect/expected/btree.out        |  6 +-
 contrib/pageinspect/pageinspect--1.6--1.7.sql | 25 +++++
 3 files changed, 108 insertions(+), 14 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 8d27c9b0f6..b3ea978117 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -29,6 +29,7 @@
 
 #include "pageinspect.h"
 
+#include "access/genam.h"
 #include "access/nbtree.h"
 #include "access/relation.h"
 #include "catalog/namespace.h"
@@ -243,6 +244,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
  */
 struct user_args
 {
+	Relation	rel;
 	Page		page;
 	OffsetNumber offset;
 };
@@ -254,9 +256,9 @@ struct user_args
  * ------------------------------------------------------
  */
 static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset, Relation rel)
 {
-	char	   *values[6];
+	char	   *values[10];
 	HeapTuple	tuple;
 	ItemId		id;
 	IndexTuple	itup;
@@ -265,6 +267,7 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	int			dlen;
 	char	   *dump;
 	char	   *ptr;
+	ItemPointer min_htid, max_htid;
 
 	id = PageGetItemId(page, offset);
 
@@ -283,16 +286,77 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
-	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
-	dump = palloc0(dlen * 3 + 1);
-	values[j] = dump;
-	for (off = 0; off < dlen; off++)
+	if (rel)
 	{
-		if (off > 0)
-			*dump++ = ' ';
-		sprintf(dump, "%02x", *(ptr + off) & 0xff);
-		dump += 2;
+		TupleDesc	itupdesc = RelationGetDescr(rel);
+		Datum		datvalues[INDEX_MAX_KEYS];
+		bool		isnull[INDEX_MAX_KEYS];
+		int			natts;
+		int			indnkeyatts = rel->rd_index->indnkeyatts;
+
+		natts = BTreeTupleGetNAtts(itup, rel);
+
+		itupdesc->natts = Min(indnkeyatts, natts);
+		memset(&isnull, 0xFF, sizeof(isnull));
+		index_deform_tuple(itup, itupdesc, datvalues, isnull);
+		rel->rd_index->indnkeyatts = natts;
+		values[j++] = BuildIndexValueDescription(rel, datvalues, isnull);
+		itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+		rel->rd_index->indnkeyatts = indnkeyatts;
 	}
+	else
+	{
+		dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+		dump = palloc0(dlen * 3 + 1);
+		values[j++] = dump;
+		for (off = 0; off < dlen; off++)
+		{
+			if (off > 0)
+				*dump++ = ' ';
+			sprintf(dump, "%02x", *(ptr + off) & 0xff);
+			dump += 2;
+		}
+	}
+
+	if (rel && !_bt_heapkeyspace(rel))
+	{
+		min_htid = NULL;
+		max_htid = NULL;
+	}
+	else
+	{
+		min_htid = BTreeTupleGetHeapTID(itup);
+		if (BTreeTupleIsPosting(itup))
+			max_htid = BTreeTupleGetMaxTID(itup);
+		else
+			max_htid = NULL;
+	}
+
+	if (min_htid)
+		values[j++] = psprintf("(%u,%u)",
+							 ItemPointerGetBlockNumberNoCheck(min_htid),
+							 ItemPointerGetOffsetNumberNoCheck(min_htid));
+	else
+		values[j++] = NULL;
+
+	if (max_htid)
+		values[j++] = psprintf("(%u,%u)",
+							 ItemPointerGetBlockNumberNoCheck(max_htid),
+							 ItemPointerGetOffsetNumberNoCheck(max_htid));
+	else
+		values[j++] = NULL;
+
+	if (min_htid == NULL)
+		values[j++] = psprintf("0");
+	else if (!BTreeTupleIsPosting(itup))
+		values[j++] = psprintf("1");
+	else
+		values[j++] = psprintf("%d", (int) BTreeTupleGetNPosting(itup));
+
+	if (!ItemIdIsDead(id))
+		values[j++] = psprintf("f");
+	else
+		values[j++] = psprintf("t");
 
 	tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
 
@@ -366,11 +430,11 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = rel;
 		uargs->page = palloc(BLCKSZ);
 		memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
 
 		UnlockReleaseBuffer(buffer);
-		relation_close(rel, AccessShareLock);
 
 		uargs->offset = FirstOffsetNumber;
 
@@ -397,12 +461,13 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, uargs->rel);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
 	else
 	{
+		relation_close(uargs->rel, AccessShareLock);
 		pfree(uargs->page);
 		pfree(uargs);
 		SRF_RETURN_DONE(fctx);
@@ -482,7 +547,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, NULL);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..0f6dccaadc 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -40,7 +40,11 @@ ctid       | (0,1)
 itemlen    | 16
 nulls      | f
 vars       | f
-data       | 01 00 00 00 00 00 00 01
+data       | (a)=(72057594037927937)
+htid       | (0,1)
+max_htid   | 
+nheap_tids | 1
+isdead     | f
 
 SELECT * FROM bt_page_items('test1_a_idx', 2);
 ERROR:  block number out of range
diff --git a/contrib/pageinspect/pageinspect--1.6--1.7.sql b/contrib/pageinspect/pageinspect--1.6--1.7.sql
index 2433a21af2..00473da938 100644
--- a/contrib/pageinspect/pageinspect--1.6--1.7.sql
+++ b/contrib/pageinspect/pageinspect--1.6--1.7.sql
@@ -24,3 +24,28 @@ CREATE FUNCTION bt_metap(IN relname text,
     OUT last_cleanup_num_tuples real)
 AS 'MODULE_PATHNAME', 'bt_metap'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items()
+--
+DROP FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT htid tid,
+    OUT max_htid tid,
+    OUT nheap_tids int4,
+    OUT isdead boolean)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
-- 
2.17.1

v13-0001-Add-deduplication-to-nbtree.patchapplication/octet-stream; name=v13-0001-Add-deduplication-to-nbtree.patchDownload

From 7b25e930eb60750e1e8c9f31182fb6ac8e6dfac0 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 29 Aug 2019 14:35:35 -0700
Subject: [PATCH v13 1/3] Add deduplication to nbtree.

---
 contrib/amcheck/verify_nbtree.c         | 164 +++++--
 src/backend/access/nbtree/README        |  74 +++-
 src/backend/access/nbtree/nbtinsert.c   | 555 +++++++++++++++++++++++-
 src/backend/access/nbtree/nbtpage.c     | 148 ++++++-
 src/backend/access/nbtree/nbtree.c      | 147 +++++--
 src/backend/access/nbtree/nbtsearch.c   | 243 ++++++++++-
 src/backend/access/nbtree/nbtsort.c     | 148 ++++++-
 src/backend/access/nbtree/nbtsplitloc.c |  47 +-
 src/backend/access/nbtree/nbtutils.c    | 253 +++++++++--
 src/backend/access/nbtree/nbtxlog.c     |  88 +++-
 src/backend/access/rmgrdesc/nbtdesc.c   |  16 +-
 src/include/access/nbtree.h             | 242 +++++++++--
 src/include/access/nbtxlog.h            |  36 +-
 src/tools/valgrind.supp                 |  21 +
 14 files changed, 1998 insertions(+), 184 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d678ed..83519cb7cf 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -145,6 +145,7 @@ static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 									  bool tupleIsAlive, void *checkstate);
 static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 									 IndexTuple itup);
+static inline IndexTuple bt_posting_logical_tuple(IndexTuple itup, int n);
 static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 											   OffsetNumber offset);
@@ -419,12 +420,13 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		/*
 		 * Size Bloom filter based on estimated number of tuples in index,
 		 * while conservatively assuming that each block must contain at least
-		 * MaxIndexTuplesPerPage / 5 non-pivot tuples.  (Non-leaf pages cannot
-		 * contain non-pivot tuples.  That's okay because they generally make
-		 * up no more than about 1% of all pages in the index.)
+		 * MaxPostingIndexTuplesPerPage / 3 "logical" tuples.  heapallindexed
+		 * verification fingerprints posting list heap TIDs as plain non-pivot
+		 * tuples, complete with index keys.  This allows its heap scan to
+		 * behave as if posting lists do not exist.
 		 */
 		total_pages = RelationGetNumberOfBlocks(rel);
-		total_elems = Max(total_pages * (MaxIndexTuplesPerPage / 5),
+		total_elems = Max(total_pages * (MaxPostingIndexTuplesPerPage / 3),
 						  (int64) state->rel->rd_rel->reltuples);
 		/* Random seed relies on backend srandom() call to avoid repetition */
 		seed = random();
@@ -924,6 +926,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer	scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -994,29 +997,73 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
 			char	   *itid,
 					   *htid;
+			ItemPointer tid = BTreeTupleGetHeapTID(itup);
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+							ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("could not find tuple using search from root page in index \"%s\"",
 							RelationGetRelationName(state->rel)),
-					 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+					 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
 										itid, htid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is actually a posting list, make sure posting list TIDs
+		 * are in order.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer		current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+				current = BTreeTupleGetPostingN(itup, i);
+
+				if (ItemPointerCompare(current, &last) <= 0)
+				{
+					char	   *itid,
+							   *htid;
+
+					itid = psprintf("(%u,%u)", state->targetblock, offset);
+					htid = psprintf("(%u,%u)",
+									ItemPointerGetBlockNumberNoCheck(current),
+									ItemPointerGetOffsetNumberNoCheck(current));
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+												itid, htid,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1074,12 +1121,32 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				/* Fingerprint all elements as distinct "logical" tuples */
+				for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					IndexTuple	logtuple;
+
+					logtuple = bt_posting_logical_tuple(itup, i);
+					norm = bt_normalize_tuple(state, logtuple);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != logtuple)
+						pfree(norm);
+					pfree(logtuple);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
@@ -1087,7 +1154,8 @@ bt_target_page_check(BtreeCheckState *state)
 		 *
 		 * If there is a high key (if this is not the rightmost page on its
 		 * entire level), check that high key actually is upper bound on all
-		 * page items.
+		 * page items.  If this is a posting list tuple, we'll need to set
+		 * scantid to be highest TID in posting list.
 		 *
 		 * We prefer to check all items against high key rather than checking
 		 * just the last and trusting that the operator class obeys the
@@ -1127,6 +1195,9 @@ bt_target_page_check(BtreeCheckState *state)
 		 * tuple. (See also: "Notes About Data Representation" in the nbtree
 		 * README.)
 		 */
+		scantid = skey->scantid;
+		if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+			skey->scantid = BTreeTupleGetMaxTID(itup);
 		if (!P_RIGHTMOST(topaque) &&
 			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
 			  invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1221,7 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
+		skey->scantid = scantid;
 
 		/*
 		 * * Item order check *
@@ -1164,11 +1236,13 @@ bt_target_page_check(BtreeCheckState *state)
 					   *htid,
 					   *nitid,
 					   *nhtid;
+			ItemPointer tid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetHeapTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 			nitid = psprintf("(%u,%u)", state->targetblock,
 							 OffsetNumberNext(offset));
 
@@ -1177,9 +1251,11 @@ bt_target_page_check(BtreeCheckState *state)
 										  state->target,
 										  OffsetNumberNext(offset));
 			itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+			tid = BTreeTupleGetHeapTID(itup);
 			nhtid = psprintf("(%u,%u)",
-							 ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							 ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							 ItemPointerGetBlockNumberNoCheck(tid),
+							 ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1265,10 @@ bt_target_page_check(BtreeCheckState *state)
 										"higher index tid=%s (points to %s tid=%s) "
 										"page lsn=%X/%X.",
 										itid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										htid,
 										nitid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										nhtid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
@@ -1953,10 +2029,10 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
  * verification.  In particular, it won't try to normalize opclass-equal
  * datums with potentially distinct representations (e.g., btree/numeric_ops
  * index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though.  For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation.  Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
  */
 static IndexTuple
 bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -1969,6 +2045,9 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	IndexTuple	reformed;
 	int			i;
 
+	/* Caller should only pass "logical" non-pivot tuples here */
+	Assert(!BTreeTupleIsPosting(itup) && !BTreeTupleIsPivot(itup));
+
 	/* Easy case: It's immediately clear that tuple has no varlena datums */
 	if (!IndexTupleHasVarwidths(itup))
 		return itup;
@@ -2031,6 +2110,30 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	return reformed;
 }
 
+/*
+ * Produce palloc()'d "logical" tuple for nth posting list entry.
+ *
+ * In general, deduplication is not supposed to change the logical contents of
+ * an index.  Multiple logical index tuples are folded together into one
+ * physical posting list index tuple when convenient.
+ *
+ * heapallindexed verification must normalize-away this variation in
+ * representation by converting posting list tuples into two or more "logical"
+ * tuples.  Each logical tuple must be fingerprinted separately -- there must
+ * be one logical tuple for each corresponding Bloom filter probe during the
+ * heap scan.
+ *
+ * Note: Caller needs to call bt_normalize_tuple() with returned tuple.
+ */
+static inline IndexTuple
+bt_posting_logical_tuple(IndexTuple itup, int n)
+{
+	Assert(BTreeTupleIsPosting(itup));
+
+	/* Returns non-posting-list tuple */
+	return BTreeFormPostingTuple(itup, BTreeTupleGetPostingN(itup, n), 1);
+}
+
 /*
  * Search for itup in index, starting from fast root page.  itup must be a
  * non-pivot tuple.  This is only supported with heapkeyspace indexes, since
@@ -2087,6 +2190,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		insertstate.itup = itup;
 		insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 		insertstate.itup_key = key;
+		insertstate.in_posting_offset = 0;
 		insertstate.bounds_valid = false;
 		insertstate.buf = lbuf;
 
@@ -2094,7 +2198,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		offnum = _bt_binsrch_insert(state->rel, &insertstate);
 		/* Compare first >= matching item on leaf page, if any */
 		page = BufferGetPage(lbuf);
+		/* Should match on first heap TID when tuple has a posting list */
 		if (offnum <= PageGetMaxOffsetNumber(page) &&
+			insertstate.in_posting_offset <= 0 &&
 			_bt_compare(state->rel, key, page, offnum) == 0)
 			exists = true;
 		_bt_relbuf(state->rel, lbuf);
@@ -2560,14 +2666,18 @@ static inline ItemPointer
 BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
 							bool nonpivot)
 {
-	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	ItemPointer result;
 	BlockNumber targetblock = state->targetblock;
 
-	if (result == NULL && nonpivot)
+	/* Shouldn't be called with heapkeyspace index */
+	Assert(state->heapkeyspace);
+	if (BTreeTupleIsPivot(itup) == nonpivot)
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
 						targetblock, RelationGetRelationName(state->rel))));
 
+	result = BTreeTupleGetHeapTID(itup);
+
 	return result;
 }
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6db203e75c..54cb9db49d 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
 like a hint bit for a heap tuple), but physically removing tuples requires
 exclusive lock.  In the current code we try to remove LP_DEAD tuples when
 we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already).  Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)
 
 This leaves the index in a state where it has no entry for a dead tuple
 that still exists in the heap.  This is not a problem for the current
@@ -710,6 +713,75 @@ the fallback strategy assumes that duplicates are mostly inserted in
 ascending heap TID order.  The page is split in a way that leaves the left
 half of the page mostly full, and the right half of the page mostly empty.
 
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits.  Deduplication alters
+the physical representation of tuples without changing the logical contents
+of the index, and without adding overhead to read queries.  Non-pivot
+tuples are folded together into a single physical tuple with a posting list
+(a simple array of heap TIDs with the standard item pointer format).
+Deduplication is always applied lazily, at the point where it would
+otherwise be necessary to perform a page split.  It occurs only when
+LP_DEAD items have been removed, as our last line of defense against
+splitting a leaf page.  We can set the LP_DEAD bit with posting list
+tuples, though only when all table tuples are known dead. (Bitmap scans
+cannot perform LP_DEAD bit setting, and are the common case with indexes
+that contain lots of duplicates, so this downside is considered
+acceptable.)
+
+Large groups of logical duplicates tend to appear together on the same leaf
+page due to the special duplicate logic used when choosing a split point.
+This facilitates lazy/dynamic deduplication.  Deduplication can reliably
+deduplicate a large localized group of duplicates before it can span
+multiple leaf pages.  Posting list tuples are subject to the same 1/3 of a
+page restriction as any other tuple.
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists.  A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple.  An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication.  Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple.  When the incoming
+tuple really does overlap with an existing posting list, a posting list
+split is performed.  Posting list splits work in a way that more or less
+preserves the illusion that all incoming tuples do not need to be merged
+with any existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list.  The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list.  The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list.  We make
+space in the posting list by removing the heap TID that became the new
+item.  The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists.  Also, overall
+space utilization is improved by keeping existing posting lists large.
+
+The representation of posting lists is identical to the posting lists used
+by GIN, so it would be straightforward to apply GIN's varbyte encoding
+compression scheme to individual posting lists.  Posting list compression
+would break the assumptions made by posting list splits about page space
+accounting, though, so it's not clear how compression could be integrated
+with nbtree.  Besides, posting list compression does not offer a compelling
+trade-off for nbtree, since in general nbtree is optimized for consistent
+performance with many concurrent readers and writers.  A major goal of
+nbtree's lazy approach to deduplication is to limit the performance impact
+of deduplication with random updates.  Even concurrent append-only inserts
+of the same key value will tend to have inserts of individual index tuples
+in an order that doesn't quite match heap TID order.  In general, delaying
+deduplication avoids many unnecessary posting list splits, and minimizes
+page level fragmentation.
+
 Notes About Data Representation
 -------------------------------
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b84bf1c3df..52651fcbe4 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,21 +47,26 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   BTStack stack,
 						   IndexTuple itup,
 						   OffsetNumber newitemoff,
+						   int in_posting_offset,
 						   bool split_only_page);
 static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 						Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
-						IndexTuple newitem);
+						IndexTuple newitem, IndexTuple nposting);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 							  BTStack stack, bool is_root, bool is_only);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 						 OffsetNumber itup_off);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+							   Size newitemsz);
+static Size _bt_dedup_insert(Page page, BTDedupState *dedupState);
 
 /*
  *	_bt_doinsert() -- Handle insertion of a single index tuple in the tree.
  *
  *		This routine is called by the public interface routine, btinsert.
- *		By here, itup is filled in, including the TID.
+ *		By here, itup is filled in, including the TID.  Caller should be
+ *		prepared for us to scribble on 'itup'.
  *
  *		If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
  *		will allow duplicates.  Otherwise (UNIQUE_CHECK_YES or
@@ -123,6 +128,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	/* PageAddItem will MAXALIGN(), but be consistent */
 	insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 	insertstate.itup_key = itup_key;
+	insertstate.in_posting_offset = 0;
 	insertstate.bounds_valid = false;
 	insertstate.buf = InvalidBuffer;
 
@@ -300,7 +306,7 @@ top:
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
 		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
-					   itup, newitemoff, false);
+					   itup, newitemoff, insertstate.in_posting_offset, false);
 	}
 	else
 	{
@@ -435,6 +441,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 
 				/* okay, we gotta fetch the heap tuple ... */
 				curitup = (IndexTuple) PageGetItem(page, curitemid);
+				Assert(!BTreeTupleIsPosting(curitup));
 				htid = curitup->t_tid;
 
 				/*
@@ -689,6 +696,7 @@ _bt_findinsertloc(Relation rel,
 	BTScanInsert itup_key = insertstate->itup_key;
 	Page		page = BufferGetPage(insertstate->buf);
 	BTPageOpaque lpageop;
+	OffsetNumber location;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -751,13 +759,23 @@ _bt_findinsertloc(Relation rel,
 
 		/*
 		 * If the target page is full, see if we can obtain enough space by
-		 * erasing LP_DEAD items
+		 * erasing LP_DEAD items.  If that doesn't work out, and if the index
+		 * isn't a unique index, try deduplication.
 		 */
-		if (PageGetFreeSpace(page) < insertstate->itemsz &&
-			P_HAS_GARBAGE(lpageop))
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
 		{
-			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
-			insertstate->bounds_valid = false;
+			if (P_HAS_GARBAGE(lpageop))
+			{
+				_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+				insertstate->bounds_valid = false;
+			}
+
+			if (!checkingunique && PageGetFreeSpace(page) < insertstate->itemsz)
+			{
+				_bt_dedup_one_page(rel, insertstate->buf, heapRel,
+								   insertstate->itemsz);
+				insertstate->bounds_valid = false;	/* paranoia */
+			}
 		}
 	}
 	else
@@ -839,7 +857,31 @@ _bt_findinsertloc(Relation rel,
 	Assert(P_RIGHTMOST(lpageop) ||
 		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
 
-	return _bt_binsrch_insert(rel, insertstate);
+	location = _bt_binsrch_insert(rel, insertstate);
+
+	/*
+	 * Insertion is not prepared for the case where an LP_DEAD posting list
+	 * tuple must be split.  In the unlikely event that this happens, call
+	 * _bt_dedup_one_page() to force it to kill all LP_DEAD items.
+	 */
+	if (unlikely(insertstate->in_posting_offset == -1))
+	{
+		_bt_dedup_one_page(rel, insertstate->buf, heapRel, 0);
+		Assert(!P_HAS_GARBAGE(lpageop));
+
+		/* Must reset insertstate ahead of new _bt_binsrch_insert() call */
+		insertstate->bounds_valid = false;
+		insertstate->in_posting_offset = 0;
+		location = _bt_binsrch_insert(rel, insertstate);
+
+		/*
+		 * Might still have to split some other posting list now, but that
+		 * should never be LP_DEAD
+		 */
+		Assert(insertstate->in_posting_offset >= 0);
+	}
+
+	return location;
 }
 
 /*
@@ -905,10 +947,12 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		This recursive procedure does the following things:
  *
+ *			+  if necessary, splits an existing posting list on page.
+ *			   This is only needed when 'in_posting_offset' is non-zero.
  *			+  if necessary, splits the target page, using 'itup_key' for
  *			   suffix truncation on leaf pages (caller passes NULL for
  *			   non-leaf pages).
- *			+  inserts the tuple.
+ *			+  inserts the new tuple (could be from split posting list).
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
  *			   right using information stored in the parent stack).
@@ -918,7 +962,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		On entry, we must have the correct buffer in which to do the
  *		insertion, and the buffer must be pinned and write-locked.  On return,
- *		we will have dropped both the pin and the lock on the buffer.
+ *		we will have dropped both the pin and the lock on the buffer.  Caller
+ *		should be prepared for us to scribble on 'itup'.
  *
  *		This routine only performs retail tuple insertions.  'itup' should
  *		always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +981,14 @@ _bt_insertonpg(Relation rel,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
+			   int in_posting_offset,
 			   bool split_only_page)
 {
 	Page		page;
 	BTPageOpaque lpageop;
 	Size		itemsz;
+	IndexTuple	nposting = NULL;
+	IndexTuple	oposting;
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1002,8 @@ _bt_insertonpg(Relation rel,
 	Assert(P_ISLEAF(lpageop) ||
 		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
+	/* retail insertions of posting list tuples are disallowed */
+	Assert(!BTreeTupleIsPosting(itup));
 
 	/* The caller should've finished any incomplete splits already. */
 	if (P_INCOMPLETE_SPLIT(lpageop))
@@ -964,6 +1014,72 @@ _bt_insertonpg(Relation rel,
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
+	/*
+	 * Do we need to split an existing posting list item?
+	 */
+	if (in_posting_offset != 0)
+	{
+		ItemId		itemid = PageGetItemId(page, newitemoff);
+		int			nipd;
+		char	   *replacepos;
+		char	   *rightpos;
+		Size		nbytes;
+
+		/*
+		 * The new tuple is a duplicate with a heap TID that falls inside the
+		 * range of an existing posting list tuple, so split posting list.
+		 *
+		 * Posting list splits always replace some existing TID in the posting
+		 * list with the new item's heap TID (based on a posting list offset
+		 * from caller) by removing rightmost heap TID from posting list.  The
+		 * new item's heap TID is swapped with that rightmost heap TID, almost
+		 * as if the tuple inserted never overlapped with a posting list in
+		 * the first place.  This allows the insertion and page split code to
+		 * have minimal special case handling of posting lists.
+		 *
+		 * The only extra handling required is to overwrite the original
+		 * posting list with nposting, which is guaranteed to be the same size
+		 * as the original, keeping the page space accounting simple.  This
+		 * takes place in either the page insert or page split critical
+		 * section.
+		 */
+		Assert(P_ISLEAF(lpageop));
+		Assert(!ItemIdIsDead(itemid));
+		Assert(in_posting_offset > 0);
+		oposting = (IndexTuple) PageGetItem(page, itemid);
+		Assert(BTreeTupleIsPosting(oposting));
+		nipd = BTreeTupleGetNPosting(oposting);
+		Assert(in_posting_offset < nipd);
+
+		nposting = CopyIndexTuple(oposting);
+		replacepos = (char *) BTreeTupleGetPostingN(nposting, in_posting_offset);
+		rightpos = replacepos + sizeof(ItemPointerData);
+		nbytes = (nipd - in_posting_offset - 1) * sizeof(ItemPointerData);
+
+		/*
+		 * Move item pointers in posting list to make a gap for the new item's
+		 * heap TID (shift TIDs one place to the right, losing original
+		 * rightmost TID).
+		 */
+		memmove(rightpos, replacepos, nbytes);
+
+		/*
+		 * Replace newitem's heap TID with rightmost heap TID from original
+		 * posting list
+		 */
+		ItemPointerCopy(&itup->t_tid, (ItemPointer) replacepos);
+
+		/*
+		 * Copy original (not new original) posting list's last TID into new
+		 * item
+		 */
+		ItemPointerCopy(BTreeTupleGetPostingN(oposting, nipd - 1), &itup->t_tid);
+		Assert(ItemPointerCompare(BTreeTupleGetMaxTID(nposting),
+								  BTreeTupleGetHeapTID(itup)) < 0);
+		/* Alter new item offset, since effective new item changed */
+		newitemoff = OffsetNumberNext(newitemoff);
+	}
+
 	/*
 	 * Do we need to split the page to fit the item on it?
 	 *
@@ -996,7 +1112,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+						 nposting);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1075,6 +1192,18 @@ _bt_insertonpg(Relation rel,
 			elog(PANIC, "failed to add new item to block %u in index \"%s\"",
 				 itup_blkno, RelationGetRelationName(rel));
 
+		if (nposting)
+		{
+			/*
+			 * Handle a posting list split by performing an in-place update of
+			 * the existing posting list
+			 */
+			Assert(P_ISLEAF(lpageop));
+			Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+				   MAXALIGN(IndexTupleSize(nposting)));
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+		}
+
 		MarkBufferDirty(buf);
 
 		if (BufferIsValid(metabuf))
@@ -1116,6 +1245,9 @@ _bt_insertonpg(Relation rel,
 			XLogRecPtr	recptr;
 
 			xlrec.offnum = itup_off;
+			xlrec.postingsz = 0;
+			if (nposting)
+				xlrec.postingsz = MAXALIGN(IndexTupleSize(itup));
 
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
@@ -1153,6 +1285,9 @@ _bt_insertonpg(Relation rel,
 
 			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
 			XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+			if (nposting)
+				XLogRegisterBufData(0, (char *) nposting,
+									IndexTupleSize(nposting));
 
 			recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
@@ -1194,6 +1329,10 @@ _bt_insertonpg(Relation rel,
 			_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
 			RelationSetTargetBlock(rel, cachedBlock);
 	}
+
+	/* be tidy */
+	if (nposting)
+		pfree(nposting);
 }
 
 /*
@@ -1211,10 +1350,16 @@ _bt_insertonpg(Relation rel,
  *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
+ *
+ *		nposting is a replacement posting for the posting list at the
+ *		offset immediately before the new item's offset.  This is needed
+ *		when caller performed "posting list split", and corresponds to the
+ *		same step for retail insertions that don't split the page.
  */
 static Buffer
 _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+		  IndexTuple nposting)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1236,12 +1381,20 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	OffsetNumber firstright;
 	OffsetNumber maxoff;
 	OffsetNumber i;
+	OffsetNumber replacepostingoff = InvalidOffsetNumber;
 	bool		newitemonleft,
 				isleaf;
 	IndexTuple	lefthikey;
 	int			indnatts = IndexRelationGetNumberOfAttributes(rel);
 	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
+	/*
+	 * Determine offset number of posting list that will be updated in place
+	 * as part of split that follows a posting list split
+	 */
+	if (nposting != NULL)
+		replacepostingoff = OffsetNumberPrev(newitemoff);
+
 	/*
 	 * origpage is the original page to be split.  leftpage is a temporary
 	 * buffer that receives the left-sibling data, which will be copied back
@@ -1273,6 +1426,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	 * newitemoff == firstright.  In all other cases it's clear which side of
 	 * the split every tuple goes on from context.  newitemonleft is usually
 	 * (but not always) redundant information.
+	 *
+	 * Note: In theory, the split point choice logic should operate against a
+	 * version of the page that already replaced the posting list at offset
+	 * replacepostingoff with nposting where applicable.  We don't bother with
+	 * that, though.  Both versions of the posting list must be the same size
+	 * and have the same key values, so this omission can't affect the split
+	 * point chosen in practice.
 	 */
 	firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
 								  newitem, &newitemonleft);
@@ -1340,6 +1500,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, firstright);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		/* Behave as if origpage posting list has already been swapped */
+		if (firstright == replacepostingoff)
+			item = nposting;
 	}
 
 	/*
@@ -1373,6 +1536,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
 			itemid = PageGetItemId(origpage, lastleftoff);
 			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+			/* Behave as if origpage posting list has already been swapped */
+			if (lastleftoff == replacepostingoff)
+				lastleft = nposting;
 		}
 
 		Assert(lastleft != item);
@@ -1480,8 +1646,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 
+		/*
+		 * did caller pass new replacement posting list tuple due to posting
+		 * list split?
+		 */
+		if (i == replacepostingoff)
+		{
+			/*
+			 * swap origpage posting list with post-posting-list-split version
+			 * from caller
+			 */
+			Assert(isleaf);
+			Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+			item = nposting;
+		}
+
 		/* does new item belong before this one? */
-		if (i == newitemoff)
+		else if (i == newitemoff)
 		{
 			if (newitemonleft)
 			{
@@ -1652,6 +1833,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		xlrec.level = ropaque->btpo.level;
 		xlrec.firstright = firstright;
 		xlrec.newitemoff = newitemoff;
+		xlrec.replacepostingoff = replacepostingoff;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1676,6 +1858,10 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		if (newitemonleft)
 			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
 
+		if (replacepostingoff)
+			XLogRegisterBufData(0, (char *) nposting,
+								MAXALIGN(IndexTupleSize(nposting)));
+
 		/* Log the left page's new high key */
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
@@ -1834,7 +2020,7 @@ _bt_insert_parent(Relation rel,
 
 		/* Recursively insert into the parent */
 		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
-					   new_item, stack->bts_offset + 1,
+					   new_item, stack->bts_offset + 1, 0,
 					   is_only);
 
 		/* be tidy */
@@ -2304,6 +2490,343 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * Note: if we didn't find any LP_DEAD items, then the page's
 	 * BTP_HAS_GARBAGE hint bit is falsely set.  We do not bother expending a
 	 * separate write to clear it, however.  We will clear it when we split
-	 * the page.
+	 * the page (or when deduplication runs).
 	 */
 }
+
+/*
+ * Try to deduplicate items to free some space.  If we don't proceed with
+ * deduplication, buffer will contain old state of the page.
+ *
+ * 'itemsz' is the size of the inserter caller's incoming/new tuple, not
+ * including line pointer overhead.  This is the amount of space we'll need to
+ * free in order to let caller avoid splitting the page.
+ *
+ * This function should be called after LP_DEAD items were removed by
+ * _bt_vacuum_one_page() to prevent a page split.  (It's possible that we'll
+ * have to kill additional LP_DEAD items, but that should be rare.)
+ */
+static void
+_bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+				   Size newitemsz)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buffer);
+	Page		newpage;
+	BTPageOpaque oopaque,
+				nopaque;
+	bool		deduplicate = false;
+	BTDedupState *dedupState = NULL;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+	OffsetNumber deletable[MaxOffsetNumber];
+	int			ndeletable = 0;
+	Size		pagesaving = 0;
+
+	/*
+	 * Don't use deduplication for indexes with INCLUDEd columns and unique
+	 * indexes
+	 */
+	deduplicate = (IndexRelationGetNumberOfKeyAttributes(rel) ==
+				   IndexRelationGetNumberOfAttributes(rel) &&
+				   !rel->rd_index->indisunique);
+	if (!deduplicate)
+		return;
+
+	oopaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	/* init deduplication state needed to build posting tuples */
+	dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+	dedupState->ipd = NULL;
+	dedupState->ntuples = 0;
+	dedupState->alltupsize = 0;
+	dedupState->itupprev = NULL;
+	dedupState->maxitemsize = BTMaxItemSize(page);
+	dedupState->maxpostingsize = 0;
+
+	minoff = P_FIRSTDATAKEY(oopaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/*
+	 * Delete dead tuples if any. We cannot simply skip them in the cycle
+	 * below, because it's necessary to generate special Xlog record
+	 * containing such tuples to compute latestRemovedXid on a standby server
+	 * later.
+	 *
+	 * This should not affect performance, since it only can happen in a rare
+	 * situation when BTP_HAS_GARBAGE flag was not set and _bt_vacuum_one_page
+	 * was not called, or _bt_vacuum_one_page didn't remove all dead items.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+
+		if (ItemIdIsDead(itemid))
+			deletable[ndeletable++] = offnum;
+	}
+
+	if (ndeletable > 0)
+	{
+		/*
+		 * Skip duplication in rare cases where there were LP_DEAD items
+		 * encountered here when that frees sufficient space for caller to
+		 * avoid a page split
+		 */
+		_bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+		if (PageGetFreeSpace(page) >= newitemsz)
+		{
+			pfree(dedupState);
+			return;
+		}
+
+		/* Continue with deduplication */
+		minoff = P_FIRSTDATAKEY(oopaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+	}
+
+	/*
+	 * Scan over all items to see which ones can be deduplicated
+	 */
+	newpage = PageGetTempPageCopySpecial(page);
+	nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+	/* Make sure that new page won't have garbage flag set */
+	nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+	/* Copy High Key if any */
+	if (!P_RIGHTMOST(oopaque))
+	{
+		ItemId		hitemid = PageGetItemId(page, P_HIKEY);
+		Size		hitemsz = ItemIdGetLength(hitemid);
+		IndexTuple	hitem = (IndexTuple) PageGetItem(page, hitemid);
+
+		if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to add highkey during deduplication");
+	}
+
+	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+	newitemsz += sizeof(ItemIdData);
+
+	/*
+	 * Iterate over tuples on the page, try to deduplicate them into posting
+	 * lists and insert into new page.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(!ItemIdIsDead(itemid));
+
+		if (dedupState->itupprev == NULL)
+		{
+			/* Just set up base/first item in first iteration */
+			Assert(offnum == minoff);
+			dedupState->itupprev = CopyIndexTuple(itup);
+			continue;
+		}
+
+		if (deduplicate &&
+			_bt_keep_natts_fast(rel, dedupState->itupprev, itup) > natts)
+		{
+			int			itup_ntuples;
+			Size		projpostingsz;
+
+			/*
+			 * Tuples are equal.
+			 *
+			 * If posting list does not exceed tuple size limit then append
+			 * the tuple to the pending posting list.  Otherwise, insert it on
+			 * page and continue with this tuple as new pending posting list.
+			 */
+			itup_ntuples = BTreeTupleIsPosting(itup) ?
+				BTreeTupleGetNPosting(itup) : 1;
+
+			/*
+			 * Project size of new posting list that would result from merging
+			 * current tup with pending posting list (could just be prev item
+			 * that's "pending").
+			 *
+			 * This accounting looks odd, but it's correct because ...
+			 */
+			projpostingsz = MAXALIGN(IndexTupleSize(dedupState->itupprev) +
+									 (dedupState->ntuples + itup_ntuples + 1) *
+									 sizeof(ItemPointerData));
+
+			if (projpostingsz <= dedupState->maxitemsize)
+				_bt_dedup_item_tid(dedupState, itup);
+			else
+				pagesaving += _bt_dedup_insert(newpage, dedupState);
+		}
+		else
+		{
+			/*
+			 * Tuples are not equal, or we're done deduplicating items on this
+			 * page.
+			 *
+			 * Insert pending posting list on page.  This could just be a
+			 * regular tuple.
+			 */
+			pagesaving += _bt_dedup_insert(newpage, dedupState);
+		}
+
+		pfree(dedupState->itupprev);
+		dedupState->itupprev = CopyIndexTuple(itup);
+	}
+
+	/* Handle the last item */
+	pagesaving += _bt_dedup_insert(newpage, dedupState);
+
+	START_CRIT_SECTION();
+
+	PageRestoreTempPage(newpage, page);
+	MarkBufferDirty(buffer);
+
+	/* Log full page write */
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+
+		recptr = log_newpage_buffer(buffer, true);
+		PageSetLSN(page, recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	/* be tidy */
+	pfree(dedupState);
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in dedupState.
+ *
+ * 'itup' is current tuple on page, which comes immediately after equal
+ * 'itupprev' tuple stashed in dedup state at the point we're called.
+ *
+ * Helper function for _bt_load() and _bt_dedup_one_page(), called when it
+ * becomes clear that pending itupprev item will be part of a new/pending
+ * posting list, or when a pending/new posting list will contain a new heap
+ * TID from itup.
+ *
+ * Note: caller is responsible for the BTMaxItemSize() check.
+ */
+void
+_bt_dedup_item_tid(BTDedupState *dedupState, IndexTuple itup)
+{
+	int			nposting = 0;
+
+	if (dedupState->ntuples == 0)
+	{
+		dedupState->ipd = palloc0(dedupState->maxitemsize);
+		dedupState->alltupsize =
+			MAXALIGN(IndexTupleSize(dedupState->itupprev)) +
+			sizeof(ItemIdData);
+
+		/*
+		 * itupprev hasn't had its posting list TIDs copied into ipd yet (must
+		 * have been first on page and/or in new posting list?).  Do so now.
+		 *
+		 * This is delayed because it wasn't initially clear whether or not
+		 * itupprev would be merged with the next tuple, or stay as-is.  By
+		 * now caller compared it against itup and found that it was equal, so
+		 * we can go ahead and add its TIDs.
+		 */
+		if (!BTreeTupleIsPosting(dedupState->itupprev))
+		{
+			memcpy(dedupState->ipd, dedupState->itupprev,
+				   sizeof(ItemPointerData));
+			dedupState->ntuples++;
+		}
+		else
+		{
+			/* if itupprev is posting, add all its TIDs to the posting list */
+			nposting = BTreeTupleGetNPosting(dedupState->itupprev);
+			memcpy(dedupState->ipd,
+				   BTreeTupleGetPosting(dedupState->itupprev),
+				   sizeof(ItemPointerData) * nposting);
+			dedupState->ntuples += nposting;
+		}
+	}
+
+	/*
+	 * Add current tup to ipd for pending posting list for new version of
+	 * page.
+	 */
+	if (!BTreeTupleIsPosting(itup))
+	{
+		memcpy(dedupState->ipd + dedupState->ntuples, itup,
+			   sizeof(ItemPointerData));
+		dedupState->ntuples++;
+	}
+	else
+	{
+		/*
+		 * if tuple is posting, add all its TIDs to the pending list that will
+		 * become new posting list later on
+		 */
+		nposting = BTreeTupleGetNPosting(itup);
+		memcpy(dedupState->ipd + dedupState->ntuples,
+			   BTreeTupleGetPosting(itup),
+			   sizeof(ItemPointerData) * nposting);
+		dedupState->ntuples += nposting;
+	}
+
+	dedupState->alltupsize +=
+		MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
+}
+
+/*
+ * Add new posting tuple item to the page based on itupprev and saved list of
+ * heap TIDs.
+ *
+ * Returns space saving on page.
+ */
+static Size
+_bt_dedup_insert(Page page, BTDedupState *dedupState)
+{
+	IndexTuple	itup;
+	Size		spacesaving = 0;
+
+	if (dedupState->ntuples == 0)
+	{
+		/*
+		 * Use original itupprev, which may or may not be a posting list
+		 * already from some earlier dedup attempt
+		 */
+		itup = dedupState->itupprev;
+	}
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(dedupState->itupprev,
+											 dedupState->ipd,
+											 dedupState->ntuples);
+
+		spacesaving = dedupState->alltupsize -
+			(MAXALIGN(IndexTupleSize(postingtuple)) + sizeof(ItemIdData));
+		Assert(spacesaving > 0 && spacesaving < BLCKSZ);
+		itup = postingtuple;
+		pfree(dedupState->ipd);
+	}
+
+	Assert(IndexTupleSize(dedupState->itupprev) <= dedupState->maxitemsize);
+	/* Add the new item into the page */
+	if (PageAddItem(page, (Item) itup, IndexTupleSize(itup),
+					OffsetNumberNext(PageGetMaxOffsetNumber(page)), false,
+					false) == InvalidOffsetNumber)
+		elog(ERROR, "deduplication failed to add tuple to page");
+
+	if (dedupState->ntuples > 0)
+		pfree(itup);
+	dedupState->ntuples = 0;
+	dedupState->alltupsize = 0;
+
+	return spacesaving;
+}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 268f869a36..5314bbe2a9 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -42,6 +43,11 @@ static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
 								   BlockNumber *target, BlockNumber *rightsib);
 static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
 							   TransactionId latestRemovedXid);
+static TransactionId _bt_compute_xid_horizon_for_tuples(Relation rel,
+														Relation heapRel,
+														Buffer buf,
+														OffsetNumber *itemnos,
+														int nitems);
 
 /*
  *	_bt_initmetapage() -- Fill a page buffer with a correct metapage image
@@ -983,14 +989,52 @@ _bt_page_recyclable(Page page)
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
 					OffsetNumber *itemnos, int nitems,
+					OffsetNumber *remainingoffset,
+					IndexTuple *remaining, int nremaining,
 					BlockNumber lastBlockVacuumed)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	Size		itemsz;
+	Size		remaining_sz = 0;
+	char	   *remaining_buf = NULL;
+
+	/* XLOG stuff, buffer for remainings */
+	if (nremaining && RelationNeedsWAL(rel))
+	{
+		Size		offset = 0;
+
+		for (int i = 0; i < nremaining; i++)
+			remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+		remaining_buf = palloc0(remaining_sz);
+		for (int i = 0; i < nremaining; i++)
+		{
+			itemsz = IndexTupleSize(remaining[i]);
+			memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+			offset += MAXALIGN(itemsz);
+		}
+		Assert(offset == remaining_sz);
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
+	/* Handle posting tuples here */
+	for (int i = 0; i < nremaining; i++)
+	{
+		/* At first, delete the old tuple. */
+		PageIndexTupleDelete(page, remainingoffset[i]);
+
+		itemsz = IndexTupleSize(remaining[i]);
+		itemsz = MAXALIGN(itemsz);
+
+		/* Add tuple with remaining ItemPointers to the page. */
+		if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to rewrite posting list item in index while doing vacuum");
+	}
+
 	/* Fix the page */
 	if (nitems > 0)
 		PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1064,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+		xlrec_vacuum.nremaining = nremaining;
+		xlrec_vacuum.ndeleted = nitems;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1079,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		if (nitems > 0)
 			XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
 
+		/*
+		 * Here we should save offnums and remaining tuples themselves. It's
+		 * important to restore them in correct order. At first, we must
+		 * handle remaining tuples and only after that other deleted items.
+		 */
+		if (nremaining > 0)
+		{
+			Assert(remaining_buf != NULL);
+			XLogRegisterBufData(0, (char *) remainingoffset,
+								nremaining * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, remaining_buf, remaining_sz);
+		}
+
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
 		PageSetLSN(page, recptr);
@@ -1041,6 +1100,91 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	END_CRIT_SECTION();
 }
 
+/*
+ * Get the latestRemovedXid from the table entries pointed at by the index
+ * tuples being deleted.
+ *
+ * This is a version of index_compute_xid_horizon_for_tuples() specialized to
+ * nbtree, which can handle posting lists.
+ */
+static TransactionId
+_bt_compute_xid_horizon_for_tuples(Relation rel, Relation heapRel,
+								   Buffer buf, OffsetNumber *itemnos,
+								   int nitems)
+{
+	ItemPointerData *ttids;
+	TransactionId latestRemovedXid = InvalidTransactionId;
+	Page		page = BufferGetPage(buf);
+	int			arraynitems;
+	int			finalnitems;
+
+	/*
+	 * Initial size of array can fit everything when it turns out that are no
+	 * posting lists
+	 */
+	arraynitems = nitems;
+	ttids = (ItemPointerData *) palloc(sizeof(ItemPointerData) * arraynitems);
+
+	finalnitems = 0;
+	/* identify what the index tuples about to be deleted point to */
+	for (int i = 0; i < nitems; i++)
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+
+		itemid = PageGetItemId(page, itemnos[i]);
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(ItemIdIsDead(itemid));
+
+		if (!BTreeTupleIsPosting(itup))
+		{
+			/* Make sure that we have space for additional heap TID */
+			if (finalnitems + 1 > arraynitems)
+			{
+				arraynitems = arraynitems * 2;
+				ttids = (ItemPointerData *)
+					repalloc(ttids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			Assert(ItemPointerIsValid(&itup->t_tid));
+			ItemPointerCopy(&itup->t_tid, &ttids[finalnitems]);
+			finalnitems++;
+		}
+		else
+		{
+			int			nposting = BTreeTupleGetNPosting(itup);
+
+			/* Make sure that we have space for additional heap TIDs */
+			if (finalnitems + nposting > arraynitems)
+			{
+				arraynitems = Max(arraynitems * 2, finalnitems + nposting);
+				ttids = (ItemPointerData *)
+					repalloc(ttids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			for (int j = 0; j < nposting; j++)
+			{
+				ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+				Assert(ItemPointerIsValid(htid));
+				ItemPointerCopy(htid, &ttids[finalnitems]);
+				finalnitems++;
+			}
+		}
+	}
+
+	Assert(finalnitems >= nitems);
+
+	/* determine the actual xid horizon */
+	latestRemovedXid =
+		table_compute_xid_horizon_for_tuples(heapRel, ttids, finalnitems);
+
+	pfree(ttids);
+
+	return latestRemovedXid;
+}
+
 /*
  * Delete item(s) from a btree page during single-page cleanup.
  *
@@ -1067,8 +1211,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 
 	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
 		latestRemovedXid =
-			index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
-												 itemnos, nitems);
+			_bt_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+											   itemnos, nitems);
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd5289ad..67595319d7 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static ItemPointer btreevacuumPosting(BTVacState *vstate, IndexTuple itup,
+									  int *nremaining);
 
 
 /*
@@ -263,8 +265,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 				 */
 				if (so->killedItems == NULL)
 					so->killedItems = (int *)
-						palloc(MaxIndexTuplesPerPage * sizeof(int));
-				if (so->numKilled < MaxIndexTuplesPerPage)
+						palloc(MaxPostingIndexTuplesPerPage * sizeof(int));
+				if (so->numKilled < MaxPostingIndexTuplesPerPage)
 					so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 			}
 
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 								 RBM_NORMAL, info->strategy);
 		LockBufferForCleanup(buf);
 		_bt_checkpage(rel, buf);
-		_bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+		_bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+							vstate.lastBlockVacuumed);
 		_bt_relbuf(rel, buf);
 	}
 
@@ -1193,6 +1196,9 @@ restart:
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
+		IndexTuple	remaining[MaxOffsetNumber];
+		OffsetNumber remainingoffset[MaxOffsetNumber];
+		int			nremaining;
 
 		/*
 		 * Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1235,7 @@ restart:
 		 * callback function.
 		 */
 		ndeletable = 0;
+		nremaining = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
 		if (callback)
@@ -1242,31 +1249,79 @@ restart:
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
-				/*
-				 * During Hot Standby we currently assume that
-				 * XLOG_BTREE_VACUUM records do not produce conflicts. That is
-				 * only true as long as the callback function depends only
-				 * upon whether the index tuple refers to heap tuples removed
-				 * in the initial heap scan. When vacuum starts it derives a
-				 * value of OldestXmin. Backends taking later snapshots could
-				 * have a RecentGlobalXmin with a later xid than the vacuum's
-				 * OldestXmin, so it is possible that row versions deleted
-				 * after OldestXmin could be marked as killed by other
-				 * backends. The callback function *could* look at the index
-				 * tuple state in isolation and decide to delete the index
-				 * tuple, though currently it does not. If it ever did, we
-				 * would need to reconsider whether XLOG_BTREE_VACUUM records
-				 * should cause conflicts. If they did cause conflicts they
-				 * would be fairly harsh conflicts, since we haven't yet
-				 * worked out a way to pass a useful value for
-				 * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
-				 * applies to *any* type of index that marks index tuples as
-				 * killed.
-				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				if (BTreeTupleIsPosting(itup))
+				{
+					int			nnewipd = 0;
+					ItemPointer newipd = NULL;
+
+					newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+					if (nnewipd == 0)
+					{
+						/*
+						 * All TIDs from posting list must be deleted, we can
+						 * delete whole tuple in a regular way.
+						 */
+						deletable[ndeletable++] = offnum;
+					}
+					else if (nnewipd == BTreeTupleGetNPosting(itup))
+					{
+						/*
+						 * All TIDs from posting tuple must remain. Do
+						 * nothing, just cleanup.
+						 */
+						pfree(newipd);
+					}
+					else if (nnewipd < BTreeTupleGetNPosting(itup))
+					{
+						/* Some TIDs from posting tuple must remain. */
+						Assert(nnewipd > 0);
+						Assert(newipd != NULL);
+
+						/*
+						 * Form new tuple that contains only remaining TIDs.
+						 * Remember this tuple and the offset of the old tuple
+						 * to update it in place.
+						 */
+						remainingoffset[nremaining] = offnum;
+						remaining[nremaining] =
+							BTreeFormPostingTuple(itup, newipd, nnewipd);
+						nremaining++;
+						pfree(newipd);
+
+						Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+					}
+				}
+				else
+				{
+					htup = &(itup->t_tid);
+
+					/*
+					 * During Hot Standby we currently assume that
+					 * XLOG_BTREE_VACUUM records do not produce conflicts.
+					 * That is only true as long as the callback function
+					 * depends only upon whether the index tuple refers to
+					 * heap tuples removed in the initial heap scan. When
+					 * vacuum starts it derives a value of OldestXmin.
+					 * Backends taking later snapshots could have a
+					 * RecentGlobalXmin with a later xid than the vacuum's
+					 * OldestXmin, so it is possible that row versions deleted
+					 * after OldestXmin could be marked as killed by other
+					 * backends. The callback function *could* look at the
+					 * index tuple state in isolation and decide to delete the
+					 * index tuple, though currently it does not. If it ever
+					 * did, we would need to reconsider whether
+					 * XLOG_BTREE_VACUUM records should cause conflicts. If
+					 * they did cause conflicts they would be fairly harsh
+					 * conflicts, since we haven't yet worked out a way to
+					 * pass a useful value for latestRemovedXid on the
+					 * XLOG_BTREE_VACUUM records. This applies to *any* type
+					 * of index that marks index tuples as killed.
+					 */
+					if (callback(htup, callback_state))
+						deletable[ndeletable++] = offnum;
+				}
 			}
 		}
 
@@ -1274,7 +1329,7 @@ restart:
 		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
 		 * call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nremaining > 0)
 		{
 			/*
 			 * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1346,7 @@ restart:
 			 * that.
 			 */
 			_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+								remainingoffset, remaining, nremaining,
 								vstate->lastBlockVacuumed);
 
 			/*
@@ -1375,6 +1431,41 @@ restart:
 	}
 }
 
+/*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+	int			remaining = 0;
+	int			nitem = BTreeTupleGetNPosting(itup);
+	ItemPointer tmpitems = NULL,
+				items = BTreeTupleGetPosting(itup);
+
+	/*
+	 * Check each tuple in the posting list, save alive tuples into tmpitems
+	 */
+	for (int i = 0; i < nitem; i++)
+	{
+		if (vstate->callback(items + i, vstate->callback_state))
+			continue;
+
+		if (tmpitems == NULL)
+			tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+		tmpitems[remaining++] = items[i];
+	}
+
+	*nremaining = remaining;
+	return tmpitems;
+}
+
 /*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8e512461a0..af5e136af7 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int	_bt_binsrch_posting(BTScanInsert key, Page page,
+								OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static void _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+								  OffsetNumber offnum, ItemPointer heapTid,
+								  IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+									   OffsetNumber offnum,
+									   ItemPointer heapTid);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -434,7 +442,10 @@ _bt_binsrch(Relation rel,
  * low) makes bounds invalid.
  *
  * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's in_posting_offset field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
  */
 OffsetNumber
 _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +464,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 	Assert(P_ISLEAF(opaque));
 	Assert(!key->nextkey);
+	Assert(insertstate->in_posting_offset == 0);
 
 	if (!insertstate->bounds_valid)
 	{
@@ -509,6 +521,17 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 			if (result != 0)
 				stricthigh = high;
 		}
+
+		/*
+		 * If tuple at offset located by binary search is a posting list whose
+		 * TID range overlaps with caller's scantid, perform posting list
+		 * binary search to set in_posting_offset for caller.  Caller must
+		 * split the posting list when in_posting_offset is set.  This should
+		 * happen infrequently.
+		 */
+		if (unlikely(result == 0 && key->scantid != NULL))
+			insertstate->in_posting_offset =
+				_bt_binsrch_posting(key, page, mid);
 	}
 
 	/*
@@ -528,6 +551,68 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	return low;
 }
 
+/*----------
+ *	_bt_binsrch_posting() -- posting list binary search.
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+	IndexTuple	itup;
+	ItemId		itemid;
+	int			low,
+				high,
+				mid,
+				res;
+
+	/*
+	 * If this isn't a posting tuple, then the index must be corrupt (if it is
+	 * an ordinary non-pivot tuple then there must be an existing tuple with a
+	 * heap TID that equals inserter's new heap TID/scantid).  Defensively
+	 * check that tuple is a posting list tuple whose posting list range
+	 * includes caller's scantid.
+	 *
+	 * (This is also needed because contrib/amcheck's rootdescend option needs
+	 * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+	 */
+	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+	Assert(!key->nextkey);
+	Assert(key->scantid != NULL);
+	itemid = PageGetItemId(page, offnum);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+	if (!BTreeTupleIsPosting(itup))
+		return 0;
+
+	/*
+	 * In the unlikely event that posting list tuple has LP_DEAD bit set,
+	 * signal to caller that it should kill the item and restart its binary
+	 * search.
+	 */
+	if (ItemIdIsDead(itemid))
+		return -1;
+
+	/* "high" is past end of posting list for loop invariant */
+	low = 0;
+	high = BTreeTupleGetNPosting(itup);
+	Assert(high >= 2);
+
+	while (high > low)
+	{
+		mid = low + ((high - low) / 2);
+		res = ItemPointerCompare(key->scantid,
+								 BTreeTupleGetPostingN(itup, mid));
+
+		if (res >= 1)
+			low = mid + 1;
+		else
+			high = mid;
+	}
+
+	return low;
+}
+
 /*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
@@ -537,9 +622,18 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
  *			<0 if scankey < tuple at offnum;
  *			 0 if scankey == tuple at offnum;
  *			>0 if scankey > tuple at offnum.
- *		NULLs in the keys are treated as sortable values.  Therefore
- *		"equality" does not necessarily mean that the item should be
- *		returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values.  Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key.  Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid.  There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
+ *
+ * It is generally guaranteed that any possible scankey with scantid set
+ * will have zero or one tuples in the index that are considered equal
+ * here.
  *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
@@ -563,6 +657,7 @@ _bt_compare(Relation rel,
 	ScanKey		scankey;
 	int			ncmpkey;
 	int			ntupatts;
+	int32		result;
 
 	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +692,6 @@ _bt_compare(Relation rel,
 	{
 		Datum		datum;
 		bool		isNull;
-		int32		result;
 
 		datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
 
@@ -713,8 +807,24 @@ _bt_compare(Relation rel,
 	if (heapTid == NULL)
 		return 1;
 
+	/*
+	 * scankey must be treated as equal to a posting list tuple if its scantid
+	 * value falls within the range of the posting list.  In all other cases
+	 * there can only be a single heap TID value, which is compared directly
+	 * as a simple scalar value.
+	 */
 	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	return ItemPointerCompare(key->scantid, heapTid);
+	result = ItemPointerCompare(key->scantid, heapTid);
+	if (!BTreeTupleIsPosting(itup) || result <= 0)
+		return result;
+	else
+	{
+		result = ItemPointerCompare(key->scantid, BTreeTupleGetMaxTID(itup));
+		if (result > 0)
+			return 1;
+	}
+
+	return 0;
 }
 
 /*
@@ -1451,6 +1561,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	/* initialize tuple workspace to empty */
 	so->currPos.nextTupleOffset = 0;
+	so->currPos.postingTupleOffset = 0;
 
 	/*
 	 * Now that the current page has been made consistent, the macro should be
@@ -1485,8 +1596,29 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				if (!BTreeTupleIsPosting(itup))
+				{
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+				else
+				{
+					/*
+					 * Setup state to return posting list, and save first
+					 * "logical" tuple
+					 */
+					_bt_setuppostingitems(so, itemIndex, offnum,
+										  BTreeTupleGetPostingN(itup, 0),
+										  itup);
+					itemIndex++;
+					/* Save additional posting list "logical" tuples */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i));
+						itemIndex++;
+					}
+				}
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1519,7 +1651,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1527,7 +1659,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxPostingIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1569,8 +1701,36 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (passes_quals && tuple_alive)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				if (!BTreeTupleIsPosting(itup))
+				{
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+				else
+				{
+					int			i = BTreeTupleGetNPosting(itup) - 1;
+
+					/*
+					 * Setup state to return posting list, and save last
+					 * "logical" tuple from posting list (since it's the first
+					 * that will be returned to scan).
+					 */
+					itemIndex--;
+					_bt_setuppostingitems(so, itemIndex, offnum,
+										  BTreeTupleGetPostingN(itup, i--),
+										  itup);
+
+					/*
+					 * Return posting list "logical" tuples -- do this in
+					 * descending order, to match overall scan order
+					 */
+					for (; i >= 0; i--)
+					{
+						itemIndex--;
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i));
+					}
+				}
 			}
 			if (!continuescan)
 			{
@@ -1584,8 +1744,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1758,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1610,6 +1772,59 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 	}
 }
 
+/*
+ * Setup state to save posting items from a single posting list tuple.  Saves
+ * the logical tuple that will be returned to scan first in passing.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for logical tuple
+ * that is returned to scan first.  Second or subsequent heap TID for posting
+ * list should be saved by calling _bt_savepostingitem().
+ */
+static void
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					  ItemPointer heapTid, IndexTuple itup)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		/* Save a truncated version of the IndexTuple */
+		Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+		itupsz = MAXALIGN(itupsz);
+		currItem->tupleOffset = so->currPos.nextTupleOffset;
+		memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+		so->currPos.nextTupleOffset += itupsz;
+		so->currPos.postingTupleOffset = currItem->tupleOffset;
+	}
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer heapTid)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+
+	/*
+	 * Have index-only scans return the same truncated IndexTuple for every
+	 * logical tuple that originates from the same posting list
+	 */
+	if (so->currTuples)
+		currItem->tupleOffset = so->currPos.postingTupleOffset;
+}
+
 /*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index ab19692006..9f193768f2 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -288,6 +288,8 @@ static void _bt_sortaddtup(Page page, Size itemsize,
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 						 IndexTuple itup);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void _bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+								 BTDedupState *dedupState);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
 static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -830,6 +832,8 @@ _bt_sortaddtup(Page page,
  * the high key is to be truncated, offset 1 is deleted, and we insert
  * the truncated high key at offset 1.
  *
+ * Note that itup may be a posting list tuple.
+ *
  * 'last' pointer indicates the last offset added to the page.
  *----------
  */
@@ -963,6 +967,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * Overwrite the old item with new truncated high key directly.
 			 * oitup is already located at the physical beginning of tuple
 			 * space, so this should directly reuse the existing tuple space.
+			 *
+			 * If lastleft tuple was a posting tuple, we'll truncate its
+			 * posting list in _bt_truncate as well. Note that it is also
+			 * applicable only to leaf pages, since internal pages never
+			 * contain posting tuples.
 			 */
 			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
 			lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1002,6 +1011,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		 * the minimum key for the new page.
 		 */
 		state->btps_minkey = CopyIndexTuple(oitup);
+		Assert(BTreeTupleIsPivot(state->btps_minkey));
 
 		/*
 		 * Set the sibling links for both pages.
@@ -1043,6 +1053,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(state->btps_minkey == NULL);
 		state->btps_minkey = CopyIndexTuple(itup);
 		/* _bt_sortaddtup() will perform full truncation later */
+		BTreeTupleClearBtIsPosting(state->btps_minkey);
 		BTreeTupleSetNAtts(state->btps_minkey, 0);
 	}
 
@@ -1127,6 +1138,40 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
 }
 
+/*
+ * Add new tuple (posting or non-posting) to the page while building index.
+ */
+static void
+_bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+					 BTDedupState *dedupState)
+{
+	IndexTuple	to_insert;
+
+	/* Return, if there is no tuple to insert */
+	if (state == NULL)
+		return;
+
+	if (dedupState->ntuples == 0)
+		to_insert = dedupState->itupprev;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(dedupState->itupprev,
+											 dedupState->ipd,
+											 dedupState->ntuples);
+		to_insert = postingtuple;
+		pfree(dedupState->ipd);
+	}
+
+	_bt_buildadd(wstate, state, to_insert);
+
+	if (dedupState->ntuples > 0)
+		pfree(to_insert);
+	dedupState->ntuples = 0;
+}
+
 /*
  * Read tuples in correct sort order from tuplesort, and load them into
  * btree leaves.
@@ -1141,9 +1186,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	bool		load1;
 	TupleDesc	tupdes = RelationGetDescr(wstate->index);
 	int			i,
-				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+				natts = IndexRelationGetNumberOfAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		deduplicate = false;
+	BTDedupState *dedupState = NULL;
+
+	/*
+	 * Don't use deduplication for indexes with INCLUDEd columns and unique
+	 * indexes
+	 */
+	deduplicate = (IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+				   IndexRelationGetNumberOfAttributes(wstate->index) &&
+				   !wstate->index->rd_index->indisunique);
 
 	if (merge)
 	{
@@ -1257,19 +1313,89 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	}
 	else
 	{
-		/* merge is unnecessary */
-		while ((itup = tuplesort_getindextuple(btspool->sortstate,
-											   true)) != NULL)
+		if (!deduplicate)
 		{
-			/* When we see first tuple, create first index page */
-			if (state == NULL)
-				state = _bt_pagestate(wstate, 0);
+			/* merge is unnecessary */
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+					state = _bt_pagestate(wstate, 0);
 
-			_bt_buildadd(wstate, state, itup);
+				_bt_buildadd(wstate, state, itup);
 
-			/* Report progress */
-			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
-										 ++tuples_done);
+				/* Report progress */
+				pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+											 ++tuples_done);
+			}
+		}
+		else
+		{
+			/* init deduplication state needed to build posting tuples */
+			dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+			dedupState->ipd = NULL;
+			dedupState->ntuples = 0;
+			dedupState->alltupsize = 0;
+			dedupState->itupprev = NULL;
+			dedupState->maxitemsize = 0;
+			dedupState->maxpostingsize = 0;
+
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+				{
+					state = _bt_pagestate(wstate, 0);
+					dedupState->maxitemsize = BTMaxItemSize(state->btps_page);
+				}
+
+				if (dedupState->itupprev != NULL)
+				{
+					int			n_equal_atts = _bt_keep_natts_fast(wstate->index,
+																   dedupState->itupprev, itup);
+
+					if (n_equal_atts > natts)
+					{
+						/*
+						 * Tuples are equal. Create or update posting.
+						 *
+						 * Else If posting is too big, insert it on page and
+						 * continue.
+						 */
+						if ((dedupState->ntuples + 1) * sizeof(ItemPointerData) <
+							dedupState->maxpostingsize)
+							_bt_dedup_item_tid(dedupState, itup);
+						else
+							_bt_buildadd_posting(wstate, state, dedupState);
+					}
+					else
+					{
+						/*
+						 * Tuples are not equal. Insert itupprev into index.
+						 * Save current tuple for the next iteration.
+						 */
+						_bt_buildadd_posting(wstate, state, dedupState);
+					}
+				}
+
+				/*
+				 * Save the tuple to compare it with the next one and maybe
+				 * unite them into a posting tuple.
+				 */
+				if (dedupState->itupprev)
+					pfree(dedupState->itupprev);
+				dedupState->itupprev = CopyIndexTuple(itup);
+
+				/* compute max size of posting list */
+				dedupState->maxpostingsize = dedupState->maxitemsize -
+					IndexInfoFindDataOffset(dedupState->itupprev->t_info) -
+					MAXALIGN(IndexTupleSize(dedupState->itupprev));
+			}
+
+			/* Handle the last item */
+			_bt_buildadd_posting(wstate, state, dedupState);
 		}
 	}
 
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1c1029b6c4..54cecc85c5 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
 	state.minfirstrightsz = SIZE_MAX;
 	state.newitemoff = newitemoff;
 
+	/* newitem cannot be a posting list item */
+	Assert(!BTreeTupleIsPosting(newitem));
+
 	/*
 	 * maxsplits should never exceed maxoff because there will be at most as
 	 * many candidate split points as there are points _between_ tuples, once
@@ -459,17 +462,52 @@ _bt_recsplitloc(FindSplitData *state,
 	int16		leftfree,
 				rightfree;
 	Size		firstrightitemsz;
+	Size		postingsubhikey = 0;
 	bool		newitemisfirstonright;
 
 	/* Is the new item going to be the first item on the right page? */
 	newitemisfirstonright = (firstoldonright == state->newitemoff
 							 && !newitemonleft);
 
+	/*
+	 * FIXME: Accessing every single tuple like this adds cycles to cases that
+	 * cannot possibly benefit (i.e. cases where we know that there cannot be
+	 * posting lists).  Maybe we should add a way to not bother when we are
+	 * certain that this is the case.
+	 *
+	 * We could either have _bt_split() pass us a flag, or invent a page flag
+	 * that indicates that the page might have posting lists, as an
+	 * optimization.  There is no shortage of btpo_flags bits for stuff like
+	 * this.
+	 */
 	if (newitemisfirstonright)
+	{
 		firstrightitemsz = state->newitemsz;
+
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+			postingsubhikey = IndexTupleSize(state->newitem) -
+				BTreeTupleGetPostingOffset(state->newitem);
+	}
 	else
+	{
 		firstrightitemsz = firstoldonrightsz;
 
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf)
+		{
+			ItemId		itemid;
+			IndexTuple	newhighkey;
+
+			itemid = PageGetItemId(state->page, firstoldonright);
+			newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+			if (BTreeTupleIsPosting(newhighkey))
+				postingsubhikey = IndexTupleSize(newhighkey) -
+					BTreeTupleGetPostingOffset(newhighkey);
+		}
+	}
+
 	/* Account for all the old tuples */
 	leftfree = state->leftspace - olddataitemstoleft;
 	rightfree = state->rightspace -
@@ -492,9 +530,13 @@ _bt_recsplitloc(FindSplitData *state,
 	 * adding a heap TID to the left half's new high key when splitting at the
 	 * leaf level.  In practice the new high key will often be smaller and
 	 * will rarely be larger, but conservatively assume the worst case.
+	 * Truncation always truncates away any posting list that appears in the
+	 * first right tuple, though, so it's safe to subtract that overhead
+	 * (while still conservatively assuming that truncation might have to add
+	 * back a single heap TID using the pivot tuple heap TID representation).
 	 */
 	if (state->is_leaf)
-		leftfree -= (int16) (firstrightitemsz +
+		leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
 							 MAXALIGN(sizeof(ItemPointerData)));
 	else
 		leftfree -= (int16) firstrightitemsz;
@@ -691,7 +733,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
 	tup = (IndexTuple) PageGetItem(state->page, itemid);
 	/* Do cheaper test first */
-	if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+	if (BTreeTupleIsPosting(tup) ||
+		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
 	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index bc855dd25d..f7575ed48c 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -97,8 +97,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	indoption = rel->rd_indoption;
 	tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
 
-	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
 	/*
 	 * We'll execute search using scan key constructed on key columns.
 	 * Truncated attributes and non-key attributes are omitted from the final
@@ -110,9 +108,20 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	key->anynullkeys = false;	/* initial assumption */
 	key->nextkey = false;
 	key->pivotsearch = false;
+	key->scantid = NULL;
 	key->keysz = Min(indnkeyatts, tupnatts);
-	key->scantid = key->heapkeyspace && itup ?
-		BTreeTupleGetHeapTID(itup) : NULL;
+
+	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+	Assert(!itup || !BTreeTupleIsPosting(itup) || key->heapkeyspace);
+
+	/*
+	 * When caller passes a tuple with a heap TID, use it to set scantid. Note
+	 * that this handles posting list tuples by setting scantid to the lowest
+	 * heap TID in the posting list.
+	 */
+	if (itup && key->heapkeyspace)
+		key->scantid = BTreeTupleGetHeapTID(itup);
+
 	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
@@ -1386,6 +1395,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			continue;
 		}
 
@@ -1547,6 +1557,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			cmpresult = 0;
 			if (subkey->sk_flags & SK_ROW_END)
 				break;
@@ -1786,10 +1797,35 @@ _bt_killitems(IndexScanDesc scan)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
+			bool		killtuple = false;
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			if (BTreeTupleIsPosting(ituple))
 			{
-				/* found the item */
+				int			pi = i + 1;
+				int			nposting = BTreeTupleGetNPosting(ituple);
+				int			j;
+
+				for (j = 0; j < nposting; j++)
+				{
+					ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+					if (!ItemPointerEquals(item, &kitem->heapTid))
+						break;	/* out of posting list loop */
+
+					/* Read-ahead to later kitems */
+					if (pi < numKilled)
+						kitem = &so->currPos.items[so->killedItems[pi++]];
+				}
+
+				if (j == nposting)
+					killtuple = true;
+			}
+			else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+				killtuple = true;
+
+			if (killtuple)
+			{
+				/* found the item/all posting list items */
 				ItemIdMarkDead(iid);
 				killedsomething = true;
 				break;			/* out of inner search loop */
@@ -2140,6 +2176,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(firstright))
+		{
+			BTreeTupleClearBtIsPosting(pivot);
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			if (keepnatts == natts)
+			{
+				/*
+				 * index_truncate_tuple() just returned a copy of the
+				 * original, so make sure that the size of the new pivot tuple
+				 * doesn't have posting list overhead
+				 */
+				pivot->t_info &= ~INDEX_SIZE_MASK;
+				pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+			}
+		}
+
+		Assert(!BTreeTupleIsPosting(pivot));
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2156,6 +2210,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * attribute to the new pivot tuple.
 		 */
 		Assert(natts != nkeyatts);
+		Assert(!BTreeTupleIsPosting(lastleft) &&
+			   !BTreeTupleIsPosting(firstright));
 		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
 		tidpivot = palloc0(newsize);
 		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2163,6 +2219,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		pfree(pivot);
 		pivot = tidpivot;
 	}
+	else if (BTreeTupleIsPosting(firstright))
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal.  We
+		 * can always truncate away a posting list, though.
+		 *
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+			MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+		pivot->t_info &= ~INDEX_SIZE_MASK;
+		pivot->t_info |= newsize;
+		BTreeTupleClearBtIsPosting(pivot);
+		BTreeTupleSetAltHeapTID(pivot);
+	}
 	else
 	{
 		/*
@@ -2170,7 +2244,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * It's necessary to add a heap TID attribute to the new pivot tuple.
 		 */
 		Assert(natts == nkeyatts);
-		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+		newsize = MAXALIGN(IndexTupleSize(firstright)) +
+			MAXALIGN(sizeof(ItemPointerData));
 		pivot = palloc0(newsize);
 		memcpy(pivot, firstright, IndexTupleSize(firstright));
 	}
@@ -2188,6 +2263,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * nbtree (e.g., there is no pg_attribute entry).
 	 */
 	Assert(itup_key->heapkeyspace);
+	Assert(!BTreeTupleIsPosting(pivot));
 	pivot->t_info &= ~INDEX_SIZE_MASK;
 	pivot->t_info |= newsize;
 
@@ -2200,7 +2276,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2211,9 +2287,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+							  BTreeTupleGetHeapTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #else
 
 	/*
@@ -2226,7 +2305,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2235,7 +2314,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2316,15 +2396,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
  * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal (once detoasted).  Similarly, result may
- * differ from the _bt_keep_natts result when either tuple has TOASTed datums,
- * though this is barely possible in practice.
+ * unless they're bitwise equal after detoasting.
  *
  * These issues must be acceptable to callers, typically because they're only
  * concerned about making suffix truncation as effective as possible without
  * leaving excessive amounts of free space on either side of page split.
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts.
+ *
+ * When an index only uses opclasses where equality is "precise", this
+ * function is guaranteed to give the same result as _bt_keep_natts().  This
+ * makes it safe to use this function to determine whether or not two tuples
+ * can be folded together into a single posting tuple.  Posting list
+ * deduplication cannot be used with nondeterministic collations for this
+ * reason.
+ *
+ * FIXME: Actually invent the needed "equality-is-precise" opclass
+ * infrastructure.  See dedicated -hackers thread:
+ *
+ * https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2349,8 +2439,38 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 		if (isNull1 != isNull2)
 			break;
 
+		/*
+		 * XXX: The ideal outcome from the point of view of the posting list
+		 * patch is that the definition of an opclass with "precise equality"
+		 * becomes: "equality operator function must give exactly the same
+		 * answer as datum_image_eq() would, provided that we aren't using a
+		 * nondeterministic collation". (Nondeterministic collations are
+		 * clearly not compatible with deduplication.)
+		 *
+		 * This will be a lot faster than actually using the authoritative
+		 * insertion scankey in some cases.  This approach also seems more
+		 * elegant, since suffix truncation gets to follow exactly the same
+		 * definition of "equal" as posting list deduplication -- there is a
+		 * subtle interplay between deduplication and suffix truncation, and
+		 * it would be nice to know for sure that they have exactly the same
+		 * idea about what equality is.
+		 *
+		 * This ideal outcome still avoids problems with TOAST.  We cannot
+		 * repeat bugs like the amcheck bug that was fixed in bugfix commit
+		 * eba775345d23d2c999bbb412ae658b6dab36e3e8.  datum_image_eq()
+		 * considers binary equality, though only _after_ each datum is
+		 * decompressed.
+		 *
+		 * If this ideal solution isn't possible, then we can fall back on
+		 * defining "precise equality" as: "type's output function must
+		 * produce identical textual output for any two datums that compare
+		 * equal when using a safe/equality-is-precise operator class (unless
+		 * using a nondeterministic collation)".  That would mean that we'd
+		 * have to make deduplication call _bt_keep_natts() instead (or some
+		 * other function that uses authoritative insertion scankey).
+		 */
 		if (!isNull1 &&
-			!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+			!datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
 			break;
 
 		keepnatts++;
@@ -2402,22 +2522,30 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
+	/* !heapkeyspace indexes do not support deduplication */
+	if (!heapkeyspace && BTreeTupleIsPosting(itup))
+		return false;
+
+	/* INCLUDE indexes do not support deduplication */
+	if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+		return false;
+
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
 			/*
-			 * Non-pivot tuples currently never use alternative heap TID
-			 * representation -- even those within heapkeyspace indexes
+			 * Non-pivot tuple should never be explicitly marked as a pivot
+			 * tuple
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated.  (Note that tupnatts must have been
-			 * inferred, rather than coming from an explicit on-disk
-			 * representation.)
+			 * inferred, even with a posting list tuple, because only pivot
+			 * tuples store tupnatts directly.)
 			 */
 			return tupnatts == natts;
 		}
@@ -2461,12 +2589,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * non-zero, or when there is no explicit representation and the
 			 * tuple is evidently not a pre-pg_upgrade tuple.
 			 *
-			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
-			 * that to decide if the tuple is a pre-v11 tuple.
+			 * Prior to v11, downlinks always had P_HIKEY as their offset.
+			 * Accept that as an alternative indication of a valid
+			 * !heapkeyspace negative infinity tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
-				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+				ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
 		}
 		else
 		{
@@ -2492,7 +2620,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
+		return false;
+
+	/* Pivot tuple should not use posting list representation (redundant) */
+	if (BTreeTupleIsPosting(itup))
 		return false;
 
 	/*
@@ -2562,11 +2694,74 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					BTMaxItemSizeNoHeapTid(page),
 					RelationGetRelationName(rel)),
 			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
-					   ItemPointerGetBlockNumber(&newtup->t_tid),
-					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+					   ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
 					   RelationGetRelationName(heap)),
 			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
 					 "Consider a function index of an MD5 hash of the value, "
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Given a basic tuple that contains key datum and posting list,
+ * build a posting tuple.
+ *
+ * Basic tuple can be a posting tuple, but we only use key part of it,
+ * all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1 fallback to building a non-posting tuple.
+ * It is necessary to avoid storage overhead after posting tuple was vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+	uint32		keysize,
+				newsize = 0;
+	IndexTuple	itup;
+
+	/* We only need key part of the tuple */
+	if (BTreeTupleIsPosting(tuple))
+		keysize = BTreeTupleGetPostingOffset(tuple);
+	else
+		keysize = IndexTupleSize(tuple);
+
+	Assert(nipd > 0);
+
+	/* Add space needed for posting list */
+	if (nipd > 1)
+		newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+	else
+		newsize = keysize;
+
+	newsize = MAXALIGN(newsize);
+	itup = palloc0(newsize);
+	memcpy(itup, tuple, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+
+	if (nipd > 1)
+	{
+		/* Form posting tuple, fill posting fields */
+
+		/* Set meta info about the posting list */
+		itup->t_info |= INDEX_ALT_TID_MASK;
+		BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+		/* sort the list to preserve TID order invariant */
+		qsort((void *) ipd, nipd, sizeof(ItemPointerData),
+			  (int (*) (const void *, const void *)) ItemPointerCompare);
+
+		/* Copy posting list into the posting tuple */
+		memcpy(BTreeTupleGetPosting(itup), ipd,
+			   sizeof(ItemPointerData) * nipd);
+	}
+	else
+	{
+		/* To finish building of a non-posting tuple, copy TID from ipd */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		ItemPointerCopy(ipd, &itup->t_tid);
+	}
+
+	return itup;
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c1aa..d4d7c09ff0 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -178,12 +178,34 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 	{
 		Size		datalen;
 		char	   *datapos = XLogRecGetBlockData(record, 0, &datalen);
+		IndexTuple	nposting = NULL;
 
 		page = BufferGetPage(buffer);
 
-		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
-						false, false) == InvalidOffsetNumber)
-			elog(PANIC, "btree_xlog_insert: failed to add item");
+		if (xlrec->postingsz > 0)
+		{
+			IndexTuple	oposting;
+
+			Assert(isleaf);
+
+			/* oposting must be at offset before new item */
+			oposting = (IndexTuple) PageGetItem(page,
+												PageGetItemId(page, OffsetNumberPrev(xlrec->offnum)));
+			if (PageAddItem(page, (Item) datapos, xlrec->postingsz,
+							xlrec->offnum, false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+			nposting = (IndexTuple) (datapos + xlrec->postingsz);
+
+			Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+				   MAXALIGN(IndexTupleSize(nposting)));
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+		}
+		else
+		{
+			if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+		}
 
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
@@ -265,9 +287,11 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 		BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
 		OffsetNumber off;
 		IndexTuple	newitem = NULL,
-					left_hikey = NULL;
+					left_hikey = NULL,
+					nposting = NULL;
 		Size		newitemsz = 0,
-					left_hikeysz = 0;
+					left_hikeysz = 0,
+					npostingsz = 0;
 		Page		newlpage;
 		OffsetNumber leftoff;
 
@@ -281,6 +305,17 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 			datalen -= newitemsz;
 		}
 
+		if (xlrec->replacepostingoff)
+		{
+			Assert(xlrec->replacepostingoff ==
+				   OffsetNumberPrev(xlrec->newitemoff));
+
+			nposting = (IndexTuple) datapos;
+			npostingsz = MAXALIGN(IndexTupleSize(nposting));
+			datapos += npostingsz;
+			datalen -= npostingsz;
+		}
+
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
 		left_hikey = (IndexTuple) datapos;
 		left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
@@ -304,6 +339,15 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 			Size		itemsz;
 			IndexTuple	item;
 
+			if (off == xlrec->replacepostingoff)
+			{
+				if (PageAddItem(newlpage, (Item) nposting, npostingsz,
+								leftoff, false, false) == InvalidOffsetNumber)
+					elog(ERROR, "failed to add new item to left page after split");
+				leftoff = OffsetNumberNext(leftoff);
+				continue;
+			}
+
 			/* add the new item if it was inserted on left page */
 			if (onleft && off == xlrec->newitemoff)
 			{
@@ -386,8 +430,8 @@ btree_xlog_vacuum(XLogReaderState *record)
 	Buffer		buffer;
 	Page		page;
 	BTPageOpaque opaque;
-#ifdef UNUSED
 	xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
 
 	/*
 	 * This section of code is thought to be no longer needed, after analysis
@@ -478,14 +522,34 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		if (len > 0)
 		{
-			OffsetNumber *unused;
-			OffsetNumber *unend;
+			if (xlrec->nremaining)
+			{
+				OffsetNumber *remainingoffset;
+				IndexTuple	remaining;
+				Size		itemsz;
 
-			unused = (OffsetNumber *) ptr;
-			unend = (OffsetNumber *) ((char *) ptr + len);
+				remainingoffset = (OffsetNumber *)
+					(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+				remaining = (IndexTuple) ((char *) remainingoffset +
+										  xlrec->nremaining * sizeof(OffsetNumber));
 
-			if ((unend - unused) > 0)
-				PageIndexMultiDelete(page, unused, unend - unused);
+				/* Handle posting tuples */
+				for (int i = 0; i < xlrec->nremaining; i++)
+				{
+					PageIndexTupleDelete(page, remainingoffset[i]);
+
+					itemsz = MAXALIGN(IndexTupleSize(remaining));
+
+					if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+									false, false) == InvalidOffsetNumber)
+						elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+					remaining = (IndexTuple) ((char *) remaining + itemsz);
+				}
+			}
+
+			if (xlrec->ndeleted)
+				PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 		}
 
 		/*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 4ee6d04a68..71763da4c8 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -30,7 +30,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_insert *xlrec = (xl_btree_insert *) rec;
 
-				appendStringInfo(buf, "off %u", xlrec->offnum);
+				appendStringInfo(buf, "off %u; postingsz %u",
+								 xlrec->offnum, xlrec->postingsz);
 				break;
 			}
 		case XLOG_BTREE_SPLIT_L:
@@ -38,16 +39,21 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
-				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d",
-								 xlrec->level, xlrec->firstright, xlrec->newitemoff);
+				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, replacepostingoff %d",
+								 xlrec->level,
+								 xlrec->firstright,
+								 xlrec->newitemoff,
+								 xlrec->replacepostingoff);
 				break;
 			}
 		case XLOG_BTREE_VACUUM:
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "lastBlockVacuumed %u",
-								 xlrec->lastBlockVacuumed);
+				appendStringInfo(buf, "lastBlockVacuumed %u; nremaining %u; ndeleted %u",
+								 xlrec->lastBlockVacuumed,
+								 xlrec->nremaining,
+								 xlrec->ndeleted);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4a80e84aa7..eade328511 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
  * All other types of index tuples ("pivot" tuples) only have key columns,
  * since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,38 @@ typedef struct BTMetaPageData
  * omitted rather than truncated, since its representation is different to
  * the non-pivot representation.)
  *
+ * Non-pivot posting tuple format:
+ *  t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicated keys more effectively, we use special format
+ * of tuples - posting tuples.  posting_list is an array of ItemPointerData.
+ *
+ * Deduplication never applies to unique indexes or indexes with INCLUDEd
+ * columns.
+ *
+ * To differ posting tuples we use INDEX_ALT_TID_MASK flag in t_info and
+ * BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's tid:
+ * - t_tid.ip_blkid contains offset of the posting list.
+ * - t_tid offset field contains number of posting items this tuple contain
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * tuples, which is constrainted by BTMaxItemSize.
+
+ * If page contains so many duplicates, that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), page may contain several posting
+ * tuples with the same key.
+ * Also page can contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
  * Pivot tuple format:
  *
  *  t_tid | t_info | key values | [heap TID]
@@ -281,23 +312,119 @@ typedef struct BTMetaPageData
  * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
  * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
  * number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinct posting tuples from pivot tuples.
  *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_N_POSTING_OFFSET_MASK	0x0FFF
 #define BT_HEAP_TID_ATTR			0x1000
+#define BT_IS_POSTING				0x2000
 
-/* Get/set downlink block number */
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger then
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+	((int) ((BLCKSZ - SizeOfPageHeaderData - \
+			3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+			(sizeof(ItemPointerData)))
+
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * Iterating over tuples during index build or applying deduplication to a
+ * single page, we remember a tuple in itupprev, then compare the next one
+ * with it.  If tuples are equal, save their TIDs in the posting list.
+ * ntuples contains the size of the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTDedupState
+{
+	Size		maxitemsize;
+	Size		maxpostingsize;
+	IndexTuple	itupprev;
+	int			ntuples;
+	Size		alltupsize;
+	ItemPointerData *ipd;
+} BTDedupState;
+
+/*
+ * N.B.: BTreeTupleIsPivot() should only be used in code that deals with
+ * heapkeyspace indexes specifically.  BTreeTupleIsPosting() works with all
+ * nbtree indexes, though.
+ */
+#define BTreeTupleIsPivot(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+	)
+#define BTreeTupleIsPosting(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+	)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleGetNPosting(itup)	\
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+	)
+#define BTreeTupleSetNPosting(itup, n) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+	} while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+	)
+#define BTreeSetPostingMeta(itup, nposting, off) \
+	do { \
+		BTreeTupleSetNPosting(itup, nposting); \
+		Assert(BTreeTupleIsPosting(itup)); \
+		ItemPointerSetBlockNumber(&((itup)->t_tid), (off)); \
+	} while(0)
+
+#define BTreeTupleGetPosting(itup) \
+	(ItemPointer) ((char*) (itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+	(BTreeTupleGetPosting(itup) + (n))
+
+/* Get/set downlink block number  */
 #define BTreeInnerTupleGetDownLink(itup) \
 	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
 #define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,40 +453,73 @@ typedef struct BTMetaPageData
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
 		( \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
 		IndexRelationGetNumberOfAttributes(rel) \
 	)
-#define BTreeTupleSetNAtts(itup, n) \
-	do { \
-		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
-	} while(0)
+
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int n)
+{
+	Assert(!BTreeTupleIsPosting(itup));
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	ItemPointerSetOffsetNumber(&itup->t_tid, n & BT_N_KEYS_OFFSET_MASK);
+}
 
 /*
- * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Get tiebreaker heap TID attribute, if any.  Works with both pivot and
+ * non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
  */
-#define BTreeTupleGetHeapTID(itup) \
-	( \
-	  (itup)->t_info & INDEX_ALT_TID_MASK && \
-	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
-	  ( \
-		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
-					   sizeof(ItemPointerData)) \
-	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
-	)
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+	if (BTreeTupleIsPivot(itup))
+	{
+		/* Pivot tuple heap TID representation? */
+		if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+			 BT_HEAP_TID_ATTR) != 0)
+			return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+								  sizeof(ItemPointerData));
+
+		/* Heap TID attribute was truncated */
+		return NULL;
+	}
+	else if (BTreeTupleIsPosting(itup))
+		return BTreeTupleGetPosting(itup);
+
+	return &(itup->t_tid);
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list tuple.  Works with
+ * non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxTID(IndexTuple itup)
+{
+	Assert(!BTreeTupleIsPivot(itup));
+
+	if (BTreeTupleIsPosting(itup))
+		return (ItemPointer) (BTreeTupleGetPosting(itup) +
+							  (BTreeTupleGetNPosting(itup) - 1));
+
+	return &(itup->t_tid);
+}
+
 /*
  * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * representation
  */
 #define BTreeTupleSetAltHeapTID(itup) \
 	do { \
-		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(BTreeTupleIsPivot(itup)); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
 								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
 	} while(0)
@@ -499,6 +659,13 @@ typedef struct BTInsertStateData
 	/* Buffer containing leaf page we're likely to insert itup on */
 	Buffer		buf;
 
+	/*
+	 * if _bt_binsrch_insert() found the location inside existing posting
+	 * list, save the position inside the list.  This will be -1 in rare cases
+	 * where the overlapping posting list is LP_DEAD.
+	 */
+	int			in_posting_offset;
+
 	/*
 	 * Cache of bounds within the current buffer.  Only used for insertions
 	 * where _bt_check_unique is called.  See _bt_binsrch_insert and
@@ -534,7 +701,9 @@ typedef BTInsertStateData *BTInsertState;
  * If we are doing an index-only scan, we save the entire IndexTuple for each
  * matched item, otherwise only its heap TID and offset.  The IndexTuples go
  * into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array.  Posting list tuples store a version of the
+ * tuple that does not include the posting list, allowing the same key to be
+ * returned for each logical tuple associated with the posting list.
  */
 
 typedef struct BTScanPosItem	/* what we remember about each match */
@@ -563,9 +732,13 @@ typedef struct BTScanPosData
 
 	/*
 	 * If we are doing an index-only scan, nextTupleOffset is the first free
-	 * location in the associated tuple storage workspace.
+	 * location in the associated tuple storage workspace.  Posting list
+	 * tuples need postingTupleOffset to store the current location of the
+	 * tuple that is returned multiple times (once per heap TID in posting
+	 * list).
 	 */
 	int			nextTupleOffset;
+	int			postingTupleOffset;
 
 	/*
 	 * The items array is always ordered in index order (ie, increasing
@@ -578,7 +751,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxPostingIndexTuplesPerPage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -732,6 +905,7 @@ extern bool _bt_doinsert(Relation rel, IndexTuple itup,
 						 IndexUniqueCheck checkUnique, Relation heapRel);
 extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
 extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
+extern void _bt_dedup_item_tid(BTDedupState *dedupState, IndexTuple itup);
 
 /*
  * prototypes for functions in nbtsplitloc.c
@@ -762,6 +936,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems, Relation heapRel);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems,
+								OffsetNumber *remainingoffset,
+								IndexTuple *remaining, int nremaining,
 								BlockNumber lastBlockVacuumed);
 extern int	_bt_pagedel(Relation rel, Buffer buf);
 
@@ -812,6 +988,8 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd,
+										int nipd);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 91b9ee00cf..35a65522f7 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -61,16 +61,26 @@ typedef struct xl_btree_metadata
  * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
  * Note that INSERT_META implies it's not a leaf page.
  *
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page (data contains the inserted tuple);
+ *				 if postingsz is not 0, data also contains 'nposting' -
+ *				 tuple to replace original.
+ *
+ *				 TODO probably it would be enough to keep just a flag to point
+ *				 out that data contains 'nposting' and compute its offset as
+ *				 we know it follows the tuple, but I am afraid that it will
+ *				 break alignment, will it?
+ *
  * Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
  * Backup Blk 2: xl_btree_metadata, if INSERT_META
+ *
  */
 typedef struct xl_btree_insert
 {
 	OffsetNumber offnum;
+	uint32		postingsz;
 } xl_btree_insert;
 
-#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, postingsz) + sizeof(uint32))
 
 /*
  * On insert with split, we save all the items going into the right sibling
@@ -95,6 +105,12 @@ typedef struct xl_btree_insert
  * An IndexTuple representing the high key of the left page must follow with
  * either variant.
  *
+ * In case, split included insertion into the middle of the posting tuple, and
+ * thus required posting tuple replacement, it also contains 'nposting',
+ * which must replace original posting tuple at replaceitemoff offset.
+ * TODO further optimization is to add it to xlog only if it remains on the
+ * left page.
+ *
  * Backup Blk 1: new right page
  *
  * The right page's data portion contains the right page's tuples in the form
@@ -112,9 +128,10 @@ typedef struct xl_btree_split
 	uint32		level;			/* tree level of page being split */
 	OffsetNumber firstright;	/* first item moved to right page */
 	OffsetNumber newitemoff;	/* new item's offset (useful for _L variant) */
+	OffsetNumber replacepostingoff; /* offset of the posting item to replace */
 } xl_btree_split;
 
-#define SizeOfBtreeSplit	(offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit	(offsetof(xl_btree_split, replacepostingoff) + sizeof(OffsetNumber))
 
 /*
  * This is what we need to know about delete of individual leaf index tuples.
@@ -172,10 +189,19 @@ typedef struct xl_btree_vacuum
 {
 	BlockNumber lastBlockVacuumed;
 
-	/* TARGET OFFSET NUMBERS FOLLOW */
+	/*
+	 * This field helps us to find beginning of the remaining tuples from
+	 * postings which follow array of offset numbers.
+	 */
+	uint32		nremaining;
+	uint32		ndeleted;
+
+	/* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+	/* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
+	/* TARGET OFFSET NUMBERS FOLLOW (if any) */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(BlockNumber))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.
diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index ec47a228ae..71a03e39d3 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -212,3 +212,24 @@
    Memcheck:Cond
    fun:PyObject_Realloc
 }
+
+# Temporarily work around bug in datum_image_eq's handling of the cstring
+# (typLen == -2) case.  datumIsEqual() is not affected, but also doesn't handle
+# TOAST'ed values correctly.
+#
+# FIXME: Remove both suppressions when bug is fixed on master branch
+{
+   temporary_workaround_1
+   Memcheck:Addr1
+   fun:bcmp
+   fun:datum_image_eq
+   fun:_bt_keep_natts_fast
+}
+
+{
+   temporary_workaround_8
+   Memcheck:Addr8
+   fun:bcmp
+   fun:datum_image_eq
+   fun:_bt_keep_natts_fast
+}
-- 
2.17.1

#85

Oleg Bartunov

obartunov@postgrespro.ru

over 6 years ago

In reply to: Alexander Korotkov (#3)

Re: [HACKERS] [PROPOSAL] Effective storage of duplicates in B-tree index.

On Tue, Sep 1, 2015 at 12:33 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

Hi, Tomas!

On Mon, Aug 31, 2015 at 6:26 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

On 08/31/2015 09:41 AM, Anastasia Lubennikova wrote:

I'm going to begin work on effective storage of duplicate keys in B-tree
index.
The main idea is to implement posting lists and posting trees for B-tree
index pages as it's already done for GIN.

In a nutshell, effective storing of duplicates in GIN is organised as
follows.
Index stores single index tuple for each unique key. That index tuple
points to posting list which contains pointers to heap tuples (TIDs). If
too many rows having the same key, multiple pages are allocated for the
TIDs and these constitute so called posting tree.
You can find wonderful detailed descriptions in gin readme
<https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README>
and articles <http://www.cybertec.at/gin-just-an-index-type/>.
It also makes possible to apply compression algorithm to posting
list/tree and significantly decrease index size. Read more in
presentation (part 1)
<http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf>.

Now new B-tree index tuple must be inserted for each table row that we
index.
It can possibly cause page split. Because of MVCC even unique index
could contain duplicates.
Storing duplicates in posting list/tree helps to avoid superfluous splits.

So it seems to be very useful improvement. Of course it requires a lot
of changes in B-tree implementation, so I need approval from community.

In general, index size is often a serious issue - cases where indexes need more space than tables are not quite uncommon in my experience. So I think the efforts to lower space requirements for indexes are good.

But if we introduce posting lists into btree indexes, how different are they from GIN? It seems to me that if I create a GIN index (using btree_gin), I do get mostly the same thing you propose, no?

Yes, In general GIN is a btree with effective duplicates handling + support of splitting single datums into multiple keys.
This proposal is mostly porting duplicates handling from GIN to btree.

Is it worth to make a provision to add an ability to control how
duplicates are sorted ? If we speak about GIN, why not take into
account our experiments with RUM (https://github.com/postgrespro/rum)
?

Sure, there are differences - GIN indexes don't handle UNIQUE indexes,

The difference between btree_gin and btree is not only UNIQUE feature.
1) There is no gingettuple in GIN. GIN supports only bitmap scans. And it's not feasible to add gingettuple to GIN. At least with same semantics as it is in btree.
2) GIN doesn't support multicolumn indexes in the way btree does. Multicolumn GIN is more like set of separate singlecolumn GINs: it doesn't have composite keys.
3) btree_gin can't effectively handle range searches. "a < x < b" would be hangle as "a < x" intersect "x < b". That is extremely inefficient. It is possible to fix. However, there is no clear proposal how to fit this case into GIN interface, yet.

but the compression can only be effective when there are duplicate rows. So either the index is not UNIQUE (so the b-tree feature is not needed), or there are many updates.

From my observations users can use btree_gin only in some cases. They like compression, but can't use btree_gin mostly because of #1.

Which brings me to the other benefit of btree indexes - they are designed for high concurrency. How much is this going to be affected by introducing the posting lists?

I'd notice that current duplicates handling in PostgreSQL is hack over original btree. It is designed so in btree access method in PostgreSQL, not btree in general.
Posting lists shouldn't change concurrency much. Currently, in btree you have to lock one page exclusively when you're inserting new value.
When posting list is small and fits one page you have to do similar thing: exclusive lock of one page to insert new value.
When you have posting tree, you have to do exclusive lock on one page of posting tree.

One can say that concurrency would became worse because index would become smaller and number of pages would became smaller too. Since number of pages would be smaller, backends are more likely concur for the same page. But this argument can be user against any compression and for any bloat.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

--
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#86

Anastasia Lubennikova

a.lubennikova@postgrespro.ru

over 6 years ago

In reply to: Peter Geoghegan (#84)

1 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

13.09.2019 4:04, Peter Geoghegan wrote:

On Wed, Sep 11, 2019 at 2:04 PM Peter Geoghegan <pg@bowt.ie> wrote:

I think that the new WAL record has to be created once per posting
list that is generated, not once per page that is deduplicated --
that's the only way that I can see that avoids a huge increase in
total WAL volume. Even if we assume that I am wrong about there being
value in making deduplication incremental, it is still necessary to
make the WAL-logging behave incrementally.

It would be good to hear your thoughts on this _bt_dedup_one_page()
WAL volume/"write amplification" issue.

Attached is v14 based on v12 (v13 changes are not merged).

In this version, I fixed the bug you mentioned and also fixed nbtinsert,
so that it doesn't save newposting in xlog record anymore.

I tested patch with nbtree_wal_test, and found out that the real issue is
not the dedup WAL records themselves, but the full page writes that they
trigger.
Here are test results (config is standard, except fsync=off to speedup
tests):

'FPW on' and 'FPW off' are tests on v14.
NO_IMAGE is the test on v14 with REGBUF_NO_IMAGE in bt_dedup_one_page().

+-------------------+-----------+-----------+----------------+-----------+

| --- | FPW on | FPW off | FORCE_NO_IMAGE | master |

+-------------------+-----------+-----------+----------------+-----------+

| time | 09:12 min | 06:56 min | 06:24 min | 08:10 min |

| nbtree_wal_volume | 8083 MB | 2128 MB | 2327 MB | 2439 MB |

| index_size | 169 MB | 169 MB | 169 MB | 1118 MB |

+-------------------+-----------+-----------+----------------+-----------+

With random insertions into btree it's highly possible that
deduplication will often be
the first write after checkpoint, and thus will trigger FPW, even if
only a few tuples were compressed.
That's why there is no significant difference with log_newpage_buffer()
approach.
And that's why "lazy" deduplication doesn't help to decrease amount of WAL.

Also, since the index is packed way better than before, it probably
benefits less of wal_compression.

One possible "fix" to decrease WAL amplification is to add
REGBUF_NO_IMAGE flag to XLogRegisterBuffer in bt_dedup_one_page().
As you can see from test result, it really eliminates the problem of
inadequate WAL amount.
However, I doubt that it is a crash-safe idea.

Another, and more realistic approach is to make deduplication less
intensive:
if freed space is less than some threshold, fall back to not changing
page at all and not generating xlog record.

Probably that was the reason, why patch became faster after I added
BT_COMPRESS_THRESHOLD in early versions,
not because deduplication itself is cpu bound or something, but because
WAL load decreased.

So I propose to develop this idea. The question is how to choose threshold.
I wouldn't like to introduce new user settings. Any ideas?

I also noticed that the number of checkpoints differ between tests:
select checkpoints_req from pg_stat_bgwriter ;

+-----------------+---------+---------+----------------+--------+

| --- | FPW on | FPW off | FORCE_NO_IMAGE | master |

+-----------------+---------+---------+----------------+--------+

| checkpoints_req | 16 | 7 | 8 | 10 |

+-----------------+---------+---------+----------------+--------+

And I struggle to explain the reason of this.
Do you understand what can cause the difference?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

v14-0001-Add-deduplication-to-nbtree.patchtext/x-patch; name=v14-0001-Add-deduplication-to-nbtree.patchDownload

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d67..399743d 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -924,6 +924,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer	scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -994,29 +995,73 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
 			char	   *itid,
 					   *htid;
+			ItemPointer tid = BTreeTupleGetHeapTID(itup);
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+							ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("could not find tuple using search from root page in index \"%s\"",
 							RelationGetRelationName(state->rel)),
-					 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+					 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
 										itid, htid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is actually a posting list, make sure posting list TIDs
+		 * are in order.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer		current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+				current = BTreeTupleGetPostingN(itup, i);
+
+				if (ItemPointerCompare(current, &last) <= 0)
+				{
+					char	   *itid,
+							   *htid;
+
+					itid = psprintf("(%u,%u)", state->targetblock, offset);
+					htid = psprintf("(%u,%u)",
+									ItemPointerGetBlockNumberNoCheck(current),
+									ItemPointerGetOffsetNumberNoCheck(current));
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+												itid, htid,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1074,12 +1119,33 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				IndexTuple	onetup;
+
+				/* Fingerprint all elements of posting tuple one by one */
+				for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+					norm = bt_normalize_tuple(state, onetup);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != onetup)
+						pfree(norm);
+					pfree(onetup);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
@@ -1087,7 +1153,8 @@ bt_target_page_check(BtreeCheckState *state)
 		 *
 		 * If there is a high key (if this is not the rightmost page on its
 		 * entire level), check that high key actually is upper bound on all
-		 * page items.
+		 * page items.  If this is a posting list tuple, we'll need to set
+		 * scantid to be highest TID in posting list.
 		 *
 		 * We prefer to check all items against high key rather than checking
 		 * just the last and trusting that the operator class obeys the
@@ -1127,6 +1194,9 @@ bt_target_page_check(BtreeCheckState *state)
 		 * tuple. (See also: "Notes About Data Representation" in the nbtree
 		 * README.)
 		 */
+		scantid = skey->scantid;
+		if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+			skey->scantid = BTreeTupleGetMaxTID(itup);
 		if (!P_RIGHTMOST(topaque) &&
 			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
 			  invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1220,7 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
+		skey->scantid = scantid;
 
 		/*
 		 * * Item order check *
@@ -1164,11 +1235,13 @@ bt_target_page_check(BtreeCheckState *state)
 					   *htid,
 					   *nitid,
 					   *nhtid;
+			ItemPointer tid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetHeapTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 			nitid = psprintf("(%u,%u)", state->targetblock,
 							 OffsetNumberNext(offset));
 
@@ -1177,9 +1250,11 @@ bt_target_page_check(BtreeCheckState *state)
 										  state->target,
 										  OffsetNumberNext(offset));
 			itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+			tid = BTreeTupleGetHeapTID(itup);
 			nhtid = psprintf("(%u,%u)",
-							 ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							 ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							 ItemPointerGetBlockNumberNoCheck(tid),
+							 ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1264,10 @@ bt_target_page_check(BtreeCheckState *state)
 										"higher index tid=%s (points to %s tid=%s) "
 										"page lsn=%X/%X.",
 										itid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										htid,
 										nitid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										nhtid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
@@ -1953,10 +2028,10 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
  * verification.  In particular, it won't try to normalize opclass-equal
  * datums with potentially distinct representations (e.g., btree/numeric_ops
  * index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though.  For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation.  Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
  */
 static IndexTuple
 bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -2087,6 +2162,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		insertstate.itup = itup;
 		insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 		insertstate.itup_key = key;
+		insertstate.in_posting_offset = 0;
 		insertstate.bounds_valid = false;
 		insertstate.buf = lbuf;
 
@@ -2094,7 +2170,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		offnum = _bt_binsrch_insert(state->rel, &insertstate);
 		/* Compare first >= matching item on leaf page, if any */
 		page = BufferGetPage(lbuf);
+		/* Should match on first heap TID when tuple has a posting list */
 		if (offnum <= PageGetMaxOffsetNumber(page) &&
+			insertstate.in_posting_offset <= 0 &&
 			_bt_compare(state->rel, key, page, offnum) == 0)
 			exists = true;
 		_bt_relbuf(state->rel, lbuf);
@@ -2560,14 +2638,18 @@ static inline ItemPointer
 BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
 							bool nonpivot)
 {
-	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	ItemPointer result;
 	BlockNumber targetblock = state->targetblock;
 
-	if (result == NULL && nonpivot)
+	/* Shouldn't be called with heapkeyspace index */
+	Assert(state->heapkeyspace);
+	if (BTreeTupleIsPivot(itup) == nonpivot)
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
 						targetblock, RelationGetRelationName(state->rel))));
 
+	result = BTreeTupleGetHeapTID(itup);
+
 	return result;
 }
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6db203e..50ec9ef 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
 like a hint bit for a heap tuple), but physically removing tuples requires
 exclusive lock.  In the current code we try to remove LP_DEAD tuples when
 we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already).  Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)
 
 This leaves the index in a state where it has no entry for a dead tuple
 that still exists in the heap.  This is not a problem for the current
@@ -710,6 +713,77 @@ the fallback strategy assumes that duplicates are mostly inserted in
 ascending heap TID order.  The page is split in a way that leaves the left
 half of the page mostly full, and the right half of the page mostly empty.
 
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits.  Deduplication alters
+the physical representation of tuples without changing the logical contents
+of the index, and without adding overhead to read queries.  Non-pivot
+tuples are folded together into a single physical tuple with a posting list
+(a simple array of heap TIDs with the standard item pointer format).
+Deduplication is always applied lazily, at the point where it would
+otherwise be necessary to perform a page split.  It occurs only when
+LP_DEAD items have been removed, as our last line of defense against
+splitting a leaf page.  We can set the LP_DEAD bit with posting list
+tuples, though only when all table tuples are known dead. (Bitmap scans
+cannot perform LP_DEAD bit setting, and are the common case with indexes
+that contain lots of duplicates, so this downside is considered
+acceptable.)
+
+Large groups of logical duplicates tend to appear together on the same leaf
+page due to the special duplicate logic used when choosing a split point.
+This facilitates lazy/dynamic deduplication.  Deduplication can reliably
+deduplicate a large localized group of duplicates before it can span
+multiple leaf pages.  Posting list tuples are subject to the same 1/3 of a
+page restriction as any other tuple.
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists.  A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple.  An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication.  Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple (lazy deduplication
+avoids rewriting posting lists repeatedly when heap TIDs are inserted
+slightly out of order by concurrent inserters).  When the incoming tuple
+really does overlap with an existing posting list, a posting list split is
+performed.  Posting list splits work in a way that more or less preserves
+the illusion that all incoming tuples do not need to be merged with any
+existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list.  The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list.  The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list.  We make
+space in the posting list by removing the heap TID that became the new
+item.  The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists.  Also, overall
+space utilization is improved by keeping existing posting lists large.
+
+The representation of posting lists is identical to the posting lists used
+by GIN, so it would be straightforward to apply GIN's varbyte encoding
+compression scheme to individual posting lists.  Posting list compression
+would break the assumptions made by posting list splits about page space
+accounting, though, so it's not clear how compression could be integrated
+with nbtree.  Besides, posting list compression does not offer a compelling
+trade-off for nbtree, since in general nbtree is optimized for consistent
+performance with many concurrent readers and writers.  A major goal of
+nbtree's lazy approach to deduplication is to limit the performance impact
+of deduplication with random updates.  Even concurrent append-only inserts
+of the same key value will tend to have inserts of individual index tuples
+in an order that doesn't quite match heap TID order.  In general, delaying
+deduplication avoids many unnecessary posting list splits, and minimizes
+page level fragmentation.
+
 Notes About Data Representation
 -------------------------------
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b84bf1c..605865e 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,21 +47,26 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   BTStack stack,
 						   IndexTuple itup,
 						   OffsetNumber newitemoff,
+						   int in_posting_offset,
 						   bool split_only_page);
 static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 						Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
-						IndexTuple newitem);
+						IndexTuple newitem, IndexTuple original_newitem, IndexTuple nposting,
+						OffsetNumber in_posting_offset);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 							  BTStack stack, bool is_root, bool is_only);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 						 OffsetNumber itup_off);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+							   Size itemsz);
 
 /*
  *	_bt_doinsert() -- Handle insertion of a single index tuple in the tree.
  *
  *		This routine is called by the public interface routine, btinsert.
- *		By here, itup is filled in, including the TID.
+ *		By here, itup is filled in, including the TID.  Caller should be
+ *		prepared for us to scribble on 'itup'.
  *
  *		If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
  *		will allow duplicates.  Otherwise (UNIQUE_CHECK_YES or
@@ -123,6 +128,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	/* PageAddItem will MAXALIGN(), but be consistent */
 	insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 	insertstate.itup_key = itup_key;
+	insertstate.in_posting_offset = 0;
 	insertstate.bounds_valid = false;
 	insertstate.buf = InvalidBuffer;
 
@@ -300,7 +306,7 @@ top:
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
 		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
-					   itup, newitemoff, false);
+					   itup, newitemoff, insertstate.in_posting_offset, false);
 	}
 	else
 	{
@@ -435,6 +441,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 
 				/* okay, we gotta fetch the heap tuple ... */
 				curitup = (IndexTuple) PageGetItem(page, curitemid);
+				Assert(!BTreeTupleIsPosting(curitup));
 				htid = curitup->t_tid;
 
 				/*
@@ -689,6 +696,7 @@ _bt_findinsertloc(Relation rel,
 	BTScanInsert itup_key = insertstate->itup_key;
 	Page		page = BufferGetPage(insertstate->buf);
 	BTPageOpaque lpageop;
+	OffsetNumber location;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -751,13 +759,23 @@ _bt_findinsertloc(Relation rel,
 
 		/*
 		 * If the target page is full, see if we can obtain enough space by
-		 * erasing LP_DEAD items
+		 * erasing LP_DEAD items.  If that doesn't work out, and if the index
+		 * isn't a unique index, try deduplication.
 		 */
-		if (PageGetFreeSpace(page) < insertstate->itemsz &&
-			P_HAS_GARBAGE(lpageop))
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
 		{
-			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
-			insertstate->bounds_valid = false;
+			if (P_HAS_GARBAGE(lpageop))
+			{
+				_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+				insertstate->bounds_valid = false;
+			}
+
+			if (!checkingunique && PageGetFreeSpace(page) < insertstate->itemsz)
+			{
+				_bt_dedup_one_page(rel, insertstate->buf, heapRel,
+								   insertstate->itemsz);
+				insertstate->bounds_valid = false;	/* paranoia */
+			}
 		}
 	}
 	else
@@ -839,7 +857,31 @@ _bt_findinsertloc(Relation rel,
 	Assert(P_RIGHTMOST(lpageop) ||
 		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
 
-	return _bt_binsrch_insert(rel, insertstate);
+	location = _bt_binsrch_insert(rel, insertstate);
+
+	/*
+	 * Insertion is not prepared for the case where an LP_DEAD posting list
+	 * tuple must be split.  In the unlikely event that this happens, call
+	 * _bt_dedup_one_page() to force it to kill all LP_DEAD items.
+	 */
+	if (unlikely(insertstate->in_posting_offset == -1))
+	{
+		_bt_dedup_one_page(rel, insertstate->buf, heapRel, 0);
+		Assert(!P_HAS_GARBAGE(lpageop));
+
+		/* Must reset insertstate ahead of new _bt_binsrch_insert() call */
+		insertstate->bounds_valid = false;
+		insertstate->in_posting_offset = 0;
+		location = _bt_binsrch_insert(rel, insertstate);
+
+		/*
+		 * Might still have to split some other posting list now, but that
+		 * should never be LP_DEAD
+		 */
+		Assert(insertstate->in_posting_offset >= 0);
+	}
+
+	return location;
 }
 
 /*
@@ -900,15 +942,65 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
 	insertstate->bounds_valid = false;
 }
 
+/*
+ * If the new tuple 'itup' is a duplicate with a heap TID that falls inside
+ * the range of an existing posting list tuple 'oposting', generate new
+ * posting tuple to replace original one and update new tuple so that
+ * it's heap TID contains the rightmost heap TID of original posting tuple.
+ */
+IndexTuple
+_bt_form_newposting(IndexTuple itup, IndexTuple oposting,
+				   OffsetNumber in_posting_offset)
+{
+	int			nipd;
+	char	   *replacepos;
+	char	   *rightpos;
+	Size		nbytes;
+	IndexTuple nposting;
+
+	Assert(BTreeTupleIsPosting(oposting));
+	nipd = BTreeTupleGetNPosting(oposting);
+	Assert(in_posting_offset < nipd);
+
+	nposting = CopyIndexTuple(oposting);
+	replacepos = (char *) BTreeTupleGetPostingN(nposting, in_posting_offset);
+	rightpos = replacepos + sizeof(ItemPointerData);
+	nbytes = (nipd - in_posting_offset - 1) * sizeof(ItemPointerData);
+
+	/*
+	 * Move item pointers in posting list to make a gap for the new item's
+	 * heap TID (shift TIDs one place to the right, losing original
+	 * rightmost TID).
+	 */
+	memmove(rightpos, replacepos, nbytes);
+
+	/*
+	 * Fill the gap with the TID of the new item.
+	 */
+	ItemPointerCopy(&itup->t_tid, (ItemPointer) replacepos);
+
+	/*
+	 * Copy original (not new original) posting list's last TID into new
+	 * item
+	 */
+	ItemPointerCopy(BTreeTupleGetPostingN(oposting, nipd - 1), &itup->t_tid);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxTID(nposting),
+								BTreeTupleGetHeapTID(itup)) < 0);
+
+	return nposting;
+}
+
 /*----------
  *	_bt_insertonpg() -- Insert a tuple on a particular page in the index.
  *
  *		This recursive procedure does the following things:
  *
+ *			+  if necessary, splits an existing posting list on page.
+ *			   This is only needed when 'in_posting_offset' is non-zero.
  *			+  if necessary, splits the target page, using 'itup_key' for
  *			   suffix truncation on leaf pages (caller passes NULL for
  *			   non-leaf pages).
- *			+  inserts the tuple.
+ *			+  inserts the new tuple (could be from split posting list).
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
  *			   right using information stored in the parent stack).
@@ -918,7 +1010,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		On entry, we must have the correct buffer in which to do the
  *		insertion, and the buffer must be pinned and write-locked.  On return,
- *		we will have dropped both the pin and the lock on the buffer.
+ *		we will have dropped both the pin and the lock on the buffer.  Caller
+ *		should be prepared for us to scribble on 'itup'.
  *
  *		This routine only performs retail tuple insertions.  'itup' should
  *		always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +1029,15 @@ _bt_insertonpg(Relation rel,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
+			   int in_posting_offset,
 			   bool split_only_page)
 {
 	Page		page;
 	BTPageOpaque lpageop;
 	Size		itemsz;
+	IndexTuple	nposting = NULL;
+	IndexTuple	oposting;
+	IndexTuple	original_itup = NULL;
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1051,8 @@ _bt_insertonpg(Relation rel,
 	Assert(P_ISLEAF(lpageop) ||
 		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
+	/* retail insertions of posting list tuples are disallowed */
+	Assert(!BTreeTupleIsPosting(itup));
 
 	/* The caller should've finished any incomplete splits already. */
 	if (P_INCOMPLETE_SPLIT(lpageop))
@@ -965,6 +1064,47 @@ _bt_insertonpg(Relation rel,
 								 * need to be consistent */
 
 	/*
+	 * Do we need to split an existing posting list item?
+	 */
+	if (in_posting_offset != 0)
+	{
+		ItemId		itemid = PageGetItemId(page, newitemoff);
+
+		/*
+		 * The new tuple is a duplicate with a heap TID that falls inside the
+		 * range of an existing posting list tuple, so split posting list.
+		 *
+		 * Posting list splits always replace some existing TID in the posting
+		 * list with the new item's heap TID (based on a posting list offset
+		 * from caller) by removing rightmost heap TID from posting list.  The
+		 * new item's heap TID is swapped with that rightmost heap TID, almost
+		 * as if the tuple inserted never overlapped with a posting list in
+		 * the first place.  This allows the insertion and page split code to
+		 * have minimal special case handling of posting lists.
+		 *
+		 * The only extra handling required is to overwrite the original
+		 * posting list with nposting, which is guaranteed to be the same size
+		 * as the original, keeping the page space accounting simple.  This
+		 * takes place in either the page insert or page split critical
+		 * section.
+		 */
+		Assert(P_ISLEAF(lpageop));
+		Assert(!ItemIdIsDead(itemid));
+		Assert(in_posting_offset > 0);
+		oposting = (IndexTuple) PageGetItem(page, itemid);
+
+		/* save a copy of itup with unchanged TID to write it into xlog record */
+		original_itup = CopyIndexTuple(itup);
+
+		nposting = _bt_form_newposting(itup, oposting, in_posting_offset);
+
+		Assert(BTreeTupleGetNPosting(nposting) == BTreeTupleGetNPosting(oposting));
+
+		/* Alter new item offset, since effective new item changed */
+		newitemoff = OffsetNumberNext(newitemoff);
+	}
+
+	/*
 	 * Do we need to split the page to fit the item on it?
 	 *
 	 * Note: PageGetFreeSpace() subtracts sizeof(ItemIdData) from its result,
@@ -996,7 +1136,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+						 original_itup, nposting, in_posting_offset);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1075,6 +1216,18 @@ _bt_insertonpg(Relation rel,
 			elog(PANIC, "failed to add new item to block %u in index \"%s\"",
 				 itup_blkno, RelationGetRelationName(rel));
 
+		if (nposting)
+		{
+			/*
+			 * Handle a posting list split by performing an in-place update of
+			 * the existing posting list
+			 */
+			Assert(P_ISLEAF(lpageop));
+			Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+				   MAXALIGN(IndexTupleSize(nposting)));
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+		}
+
 		MarkBufferDirty(buf);
 
 		if (BufferIsValid(metabuf))
@@ -1116,6 +1269,7 @@ _bt_insertonpg(Relation rel,
 			XLogRecPtr	recptr;
 
 			xlrec.offnum = itup_off;
+			xlrec.in_posting_offset = in_posting_offset;
 
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
@@ -1152,7 +1306,10 @@ _bt_insertonpg(Relation rel,
 			}
 
 			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
-			XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+			if (original_itup)
+				XLogRegisterBufData(0, (char *) original_itup, IndexTupleSize(original_itup));
+			else
+				XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
 
 			recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
@@ -1194,6 +1351,13 @@ _bt_insertonpg(Relation rel,
 			_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
 			RelationSetTargetBlock(rel, cachedBlock);
 	}
+
+	/* be tidy */
+	if (nposting)
+		pfree(nposting);
+	if (original_itup)
+		pfree(original_itup);
+
 }
 
 /*
@@ -1211,10 +1375,17 @@ _bt_insertonpg(Relation rel,
  *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
+ *
+ *		nposting is a replacement posting for the posting list at the
+ *		offset immediately before the new item's offset.  This is needed
+ *		when caller performed "posting list split", and corresponds to the
+ *		same step for retail insertions that don't split the page.
  */
 static Buffer
 _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+		  IndexTuple original_newitem,
+		  IndexTuple nposting, OffsetNumber in_posting_offset)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1236,6 +1407,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	OffsetNumber firstright;
 	OffsetNumber maxoff;
 	OffsetNumber i;
+	OffsetNumber replacepostingoff = InvalidOffsetNumber;
 	bool		newitemonleft,
 				isleaf;
 	IndexTuple	lefthikey;
@@ -1243,6 +1415,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
 	/*
+	 * Determine offset number of posting list that will be updated in place
+	 * as part of split that follows a posting list split
+	 */
+	if (nposting != NULL)
+		replacepostingoff = OffsetNumberPrev(newitemoff);
+
+	/*
 	 * origpage is the original page to be split.  leftpage is a temporary
 	 * buffer that receives the left-sibling data, which will be copied back
 	 * into origpage on success.  rightpage is the new page that will receive
@@ -1273,6 +1452,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	 * newitemoff == firstright.  In all other cases it's clear which side of
 	 * the split every tuple goes on from context.  newitemonleft is usually
 	 * (but not always) redundant information.
+	 *
+	 * Note: In theory, the split point choice logic should operate against a
+	 * version of the page that already replaced the posting list at offset
+	 * replacepostingoff with nposting where applicable.  We don't bother with
+	 * that, though.  Both versions of the posting list must be the same size
+	 * and have the same key values, so this omission can't affect the split
+	 * point chosen in practice.
 	 */
 	firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
 								  newitem, &newitemonleft);
@@ -1340,6 +1526,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, firstright);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		/* Behave as if origpage posting list has already been swapped */
+		if (firstright == replacepostingoff)
+			item = nposting;
 	}
 
 	/*
@@ -1373,6 +1562,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
 			itemid = PageGetItemId(origpage, lastleftoff);
 			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+			/* Behave as if origpage posting list has already been swapped */
+			if (lastleftoff == replacepostingoff)
+				lastleft = nposting;
 		}
 
 		Assert(lastleft != item);
@@ -1480,8 +1672,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 
+		/*
+		 * did caller pass new replacement posting list tuple due to posting
+		 * list split?
+		 */
+		if (i == replacepostingoff)
+		{
+			/*
+			 * swap origpage posting list with post-posting-list-split version
+			 * from caller
+			 */
+			Assert(isleaf);
+			Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+			item = nposting;
+		}
+
 		/* does new item belong before this one? */
-		if (i == newitemoff)
+		else if (i == newitemoff)
 		{
 			if (newitemonleft)
 			{
@@ -1653,6 +1860,17 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		xlrec.firstright = firstright;
 		xlrec.newitemoff = newitemoff;
 
+		/*
+		 * If replacing posting item was put on the right page,
+		 * we don't need to explicitly WAL log it because it's included
+		 * with all the other items on the right page.
+		 * Otherwise, save in_posting_offset and newitem to construct
+		 * replacing tuple.
+		 */
+		xlrec.in_posting_offset = InvalidOffsetNumber;
+		if (replacepostingoff < firstright)
+			xlrec.in_posting_offset = in_posting_offset;
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
 
@@ -1672,9 +1890,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		 * is not stored if XLogInsert decides it needs a full-page image of
 		 * the left page.  We store the offset anyway, though, to support
 		 * archive compression of these records.
+		 *
+		 * Also save newitem in case posting split was required
+		 * to construct new posting.
 		 */
-		if (newitemonleft)
-			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+		if (newitemonleft || xlrec.in_posting_offset)
+		{
+			if (xlrec.in_posting_offset)
+			{
+				Assert(original_newitem != NULL);
+				Assert(ItemPointerCompare(&original_newitem->t_tid, &newitem->t_tid) != 0);
+
+				XLogRegisterBufData(0, (char *) original_newitem,
+									MAXALIGN(IndexTupleSize(original_newitem)));
+			}
+			else
+				XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+		}
 
 		/* Log the left page's new high key */
 		itemid = PageGetItemId(origpage, P_HIKEY);
@@ -1834,7 +2066,7 @@ _bt_insert_parent(Relation rel,
 
 		/* Recursively insert into the parent */
 		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
-					   new_item, stack->bts_offset + 1,
+					   new_item, stack->bts_offset + 1, 0,
 					   is_only);
 
 		/* be tidy */
@@ -2304,6 +2536,277 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * Note: if we didn't find any LP_DEAD items, then the page's
 	 * BTP_HAS_GARBAGE hint bit is falsely set.  We do not bother expending a
 	 * separate write to clear it, however.  We will clear it when we split
-	 * the page.
+	 * the page (or when deduplication runs).
+	 */
+}
+
+/*
+ * Try to deduplicate items to free some space.  If we don't proceed with
+ * deduplication, buffer will contain old state of the page.
+ *
+ * 'itemsz' is the size of the inserter caller's incoming/new tuple, not
+ * including line pointer overhead.  This is the amount of space we'll need to
+ * free in order to let caller avoid splitting the page.
+ *
+ * This function should be called after LP_DEAD items were removed by
+ * _bt_vacuum_one_page() to prevent a page split.  (It's possible that we'll
+ * have to kill additional LP_DEAD items, but that should be rare.)
+ */
+static void
+_bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel, Size itemsz)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buffer);
+	Page		newpage;
+	BTPageOpaque oopaque,
+				nopaque;
+	bool		deduplicate = false;
+	BTDedupState *dedupState = NULL;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+	OffsetNumber deletable[MaxOffsetNumber];
+	int			ndeletable = 0;
+
+	/*
+	 * Don't use deduplication for indexes with INCLUDEd columns and unique
+	 * indexes
+	 */
+	deduplicate = (IndexRelationGetNumberOfKeyAttributes(rel) ==
+				   IndexRelationGetNumberOfAttributes(rel) &&
+				   !rel->rd_index->indisunique);
+	if (!deduplicate)
+		return;
+
+	oopaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	/* init deduplication state needed to build posting tuples */
+	dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+	dedupState->ipd = NULL;
+	dedupState->ntuples = 0;
+	dedupState->itupprev = NULL;
+	dedupState->maxitemsize = BTMaxItemSize(page);
+	dedupState->maxpostingsize = 0;
+
+	minoff = P_FIRSTDATAKEY(oopaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/*
+	 * Delete dead tuples if any. We cannot simply skip them in the cycle
+	 * below, because it's necessary to generate special Xlog record
+	 * containing such tuples to compute latestRemovedXid on a standby server
+	 * later.
+	 *
+	 * This should not affect performance, since it only can happen in a rare
+	 * situation when BTP_HAS_GARBAGE flag was not set and _bt_vacuum_one_page
+	 * was not called, or _bt_vacuum_one_page didn't remove all dead items.
 	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+
+		if (ItemIdIsDead(itemid))
+			deletable[ndeletable++] = offnum;
+	}
+
+	if (ndeletable > 0)
+	{
+		/*
+		 * Skip duplication in rare cases where there were LP_DEAD items
+		 * encountered here when that frees sufficient space for caller to
+		 * avoid a page split
+		 */
+		_bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+		if (PageGetFreeSpace(page) >= itemsz)
+		{
+			pfree(dedupState);
+			return;
+		}
+
+		/* Continue with deduplication */
+		minoff = P_FIRSTDATAKEY(oopaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+	}
+
+	/*
+	 * Scan over all items to see which ones can be deduplicated
+	 */
+	newpage = PageGetTempPageCopySpecial(page);
+	nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+	/* Make sure that new page won't have garbage flag set */
+	nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+	/* Copy High Key if any */
+	if (!P_RIGHTMOST(oopaque))
+	{
+		ItemId		hitemid = PageGetItemId(page, P_HIKEY);
+		Size		hitemsz = ItemIdGetLength(hitemid);
+		IndexTuple	hitem = (IndexTuple) PageGetItem(page, hitemid);
+
+		if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to add highkey during deduplication");
+	}
+
+	/*
+	 * Iterate over tuples on the page, try to deduplicate them into posting
+	 * lists and insert into new page.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(!ItemIdIsDead(itemid));
+
+		if (dedupState->itupprev == NULL)
+		{
+			/* Just set up base/first item in first iteration */
+			Assert(offnum == minoff);
+			dedupState->itupprev = CopyIndexTuple(itup);
+			dedupState->itupprev_off = offnum;
+			continue;
+		}
+
+		if (deduplicate &&
+			_bt_keep_natts_fast(rel, dedupState->itupprev, itup) > natts)
+		{
+			int			itup_ntuples;
+			Size		projpostingsz;
+
+			/*
+			 * Tuples are equal.
+			 *
+			 * If posting list does not exceed tuple size limit then append
+			 * the tuple to the pending posting list.  Otherwise, insert it on
+			 * page and continue with this tuple as new pending posting list.
+			 */
+			itup_ntuples = BTreeTupleIsPosting(itup) ?
+				BTreeTupleGetNPosting(itup) : 1;
+
+			/*
+			 * Project size of new posting list that would result from merging
+			 * current tup with pending posting list (could just be prev item
+			 * that's "pending").
+			 *
+			 * This accounting looks odd, but it's correct because ...
+			 */
+			projpostingsz = MAXALIGN(IndexTupleSize(dedupState->itupprev) +
+									 (dedupState->ntuples + itup_ntuples + 1) *
+									 sizeof(ItemPointerData));
+
+			if (projpostingsz <= dedupState->maxitemsize)
+				_bt_stash_item_tid(dedupState, itup, offnum);
+			else
+				_bt_dedup_insert(newpage, dedupState);
+		}
+		else
+		{
+			/*
+			 * Tuples are not equal, or we're done deduplicating this page.
+			 *
+			 * Insert pending posting list on page.  This could just be a
+			 * regular tuple.
+			 */
+			_bt_dedup_insert(newpage, dedupState);
+		}
+
+		pfree(dedupState->itupprev);
+		dedupState->itupprev = CopyIndexTuple(itup);
+		dedupState->itupprev_off = offnum;
+
+		Assert(IndexTupleSize(dedupState->itupprev) <= dedupState->maxitemsize);
+	}
+
+	/* Handle the last item */
+	_bt_dedup_insert(newpage, dedupState);
+
+	/*
+	 * If no items suitable for deduplication were found, newpage must be
+	 * exactly the same as the original page, so just return from function.
+	 */
+	if (dedupState->n_intervals == 0)
+	{
+		pfree(dedupState);
+		return;
+	}
+
+	START_CRIT_SECTION();
+
+	PageRestoreTempPage(newpage, page);
+	MarkBufferDirty(buffer);
+
+	/* Log full page write */
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+		xl_btree_dedup xlrec_dedup;
+
+		xlrec_dedup.n_intervals =  dedupState->n_intervals;
+
+		XLogBeginInsert();
+		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+		XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+		/* only save non-empthy part of the array */
+		if (dedupState->n_intervals > 0)
+			XLogRegisterData((char *) dedupState->dedup_intervals,
+							 dedupState->n_intervals * sizeof(dedupInterval));
+
+		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP_PAGE);
+
+		PageSetLSN(page, recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	/* be tidy */
+	pfree(dedupState);
+}
+
+/*
+ * Add new posting tuple item to the page based on itupprev and saved list of
+ * heap TIDs.
+ */
+void
+_bt_dedup_insert(Page page, BTDedupState *dedupState)
+{
+	IndexTuple	to_insert;
+	OffsetNumber offnum = PageGetMaxOffsetNumber(page);
+
+	if (dedupState->ntuples == 0)
+	{
+		/*
+		 * Use original itupprev, which may or may not be a posting list
+		 * already from some earlier dedup attempt
+		 */
+		to_insert = dedupState->itupprev;
+	}
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(dedupState->itupprev,
+											 dedupState->ipd,
+											 dedupState->ntuples);
+		to_insert = postingtuple;
+		pfree(dedupState->ipd);
+	}
+
+	Assert(IndexTupleSize(dedupState->itupprev) <= dedupState->maxitemsize);
+	/* Add the new item into the page */
+	offnum = OffsetNumberNext(offnum);
+
+	if (PageAddItem(page, (Item) to_insert, IndexTupleSize(to_insert),
+					offnum, false, false) == InvalidOffsetNumber)
+		elog(ERROR, "deduplication failed to add tuple to page");
+
+	if (dedupState->ntuples > 0)
+		pfree(to_insert);
+	dedupState->ntuples = 0;
 }
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 268f869..5314bbe 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -42,6 +43,11 @@ static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
 								   BlockNumber *target, BlockNumber *rightsib);
 static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
 							   TransactionId latestRemovedXid);
+static TransactionId _bt_compute_xid_horizon_for_tuples(Relation rel,
+														Relation heapRel,
+														Buffer buf,
+														OffsetNumber *itemnos,
+														int nitems);
 
 /*
  *	_bt_initmetapage() -- Fill a page buffer with a correct metapage image
@@ -983,14 +989,52 @@ _bt_page_recyclable(Page page)
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
 					OffsetNumber *itemnos, int nitems,
+					OffsetNumber *remainingoffset,
+					IndexTuple *remaining, int nremaining,
 					BlockNumber lastBlockVacuumed)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	Size		itemsz;
+	Size		remaining_sz = 0;
+	char	   *remaining_buf = NULL;
+
+	/* XLOG stuff, buffer for remainings */
+	if (nremaining && RelationNeedsWAL(rel))
+	{
+		Size		offset = 0;
+
+		for (int i = 0; i < nremaining; i++)
+			remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+		remaining_buf = palloc0(remaining_sz);
+		for (int i = 0; i < nremaining; i++)
+		{
+			itemsz = IndexTupleSize(remaining[i]);
+			memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+			offset += MAXALIGN(itemsz);
+		}
+		Assert(offset == remaining_sz);
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
+	/* Handle posting tuples here */
+	for (int i = 0; i < nremaining; i++)
+	{
+		/* At first, delete the old tuple. */
+		PageIndexTupleDelete(page, remainingoffset[i]);
+
+		itemsz = IndexTupleSize(remaining[i]);
+		itemsz = MAXALIGN(itemsz);
+
+		/* Add tuple with remaining ItemPointers to the page. */
+		if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to rewrite posting list item in index while doing vacuum");
+	}
+
 	/* Fix the page */
 	if (nitems > 0)
 		PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1064,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+		xlrec_vacuum.nremaining = nremaining;
+		xlrec_vacuum.ndeleted = nitems;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1079,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		if (nitems > 0)
 			XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
 
+		/*
+		 * Here we should save offnums and remaining tuples themselves. It's
+		 * important to restore them in correct order. At first, we must
+		 * handle remaining tuples and only after that other deleted items.
+		 */
+		if (nremaining > 0)
+		{
+			Assert(remaining_buf != NULL);
+			XLogRegisterBufData(0, (char *) remainingoffset,
+								nremaining * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, remaining_buf, remaining_sz);
+		}
+
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
 		PageSetLSN(page, recptr);
@@ -1042,6 +1101,91 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 }
 
 /*
+ * Get the latestRemovedXid from the table entries pointed at by the index
+ * tuples being deleted.
+ *
+ * This is a version of index_compute_xid_horizon_for_tuples() specialized to
+ * nbtree, which can handle posting lists.
+ */
+static TransactionId
+_bt_compute_xid_horizon_for_tuples(Relation rel, Relation heapRel,
+								   Buffer buf, OffsetNumber *itemnos,
+								   int nitems)
+{
+	ItemPointerData *ttids;
+	TransactionId latestRemovedXid = InvalidTransactionId;
+	Page		page = BufferGetPage(buf);
+	int			arraynitems;
+	int			finalnitems;
+
+	/*
+	 * Initial size of array can fit everything when it turns out that are no
+	 * posting lists
+	 */
+	arraynitems = nitems;
+	ttids = (ItemPointerData *) palloc(sizeof(ItemPointerData) * arraynitems);
+
+	finalnitems = 0;
+	/* identify what the index tuples about to be deleted point to */
+	for (int i = 0; i < nitems; i++)
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+
+		itemid = PageGetItemId(page, itemnos[i]);
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(ItemIdIsDead(itemid));
+
+		if (!BTreeTupleIsPosting(itup))
+		{
+			/* Make sure that we have space for additional heap TID */
+			if (finalnitems + 1 > arraynitems)
+			{
+				arraynitems = arraynitems * 2;
+				ttids = (ItemPointerData *)
+					repalloc(ttids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			Assert(ItemPointerIsValid(&itup->t_tid));
+			ItemPointerCopy(&itup->t_tid, &ttids[finalnitems]);
+			finalnitems++;
+		}
+		else
+		{
+			int			nposting = BTreeTupleGetNPosting(itup);
+
+			/* Make sure that we have space for additional heap TIDs */
+			if (finalnitems + nposting > arraynitems)
+			{
+				arraynitems = Max(arraynitems * 2, finalnitems + nposting);
+				ttids = (ItemPointerData *)
+					repalloc(ttids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			for (int j = 0; j < nposting; j++)
+			{
+				ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+				Assert(ItemPointerIsValid(htid));
+				ItemPointerCopy(htid, &ttids[finalnitems]);
+				finalnitems++;
+			}
+		}
+	}
+
+	Assert(finalnitems >= nitems);
+
+	/* determine the actual xid horizon */
+	latestRemovedXid =
+		table_compute_xid_horizon_for_tuples(heapRel, ttids, finalnitems);
+
+	pfree(ttids);
+
+	return latestRemovedXid;
+}
+
+/*
  * Delete item(s) from a btree page during single-page cleanup.
  *
  * As above, must only be used on leaf pages.
@@ -1067,8 +1211,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 
 	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
 		latestRemovedXid =
-			index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
-												 itemnos, nitems);
+			_bt_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+											   itemnos, nitems);
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd528..6759531 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static ItemPointer btreevacuumPosting(BTVacState *vstate, IndexTuple itup,
+									  int *nremaining);
 
 
 /*
@@ -263,8 +265,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 				 */
 				if (so->killedItems == NULL)
 					so->killedItems = (int *)
-						palloc(MaxIndexTuplesPerPage * sizeof(int));
-				if (so->numKilled < MaxIndexTuplesPerPage)
+						palloc(MaxPostingIndexTuplesPerPage * sizeof(int));
+				if (so->numKilled < MaxPostingIndexTuplesPerPage)
 					so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 			}
 
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 								 RBM_NORMAL, info->strategy);
 		LockBufferForCleanup(buf);
 		_bt_checkpage(rel, buf);
-		_bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+		_bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+							vstate.lastBlockVacuumed);
 		_bt_relbuf(rel, buf);
 	}
 
@@ -1193,6 +1196,9 @@ restart:
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
+		IndexTuple	remaining[MaxOffsetNumber];
+		OffsetNumber remainingoffset[MaxOffsetNumber];
+		int			nremaining;
 
 		/*
 		 * Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1235,7 @@ restart:
 		 * callback function.
 		 */
 		ndeletable = 0;
+		nremaining = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
 		if (callback)
@@ -1242,31 +1249,79 @@ restart:
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
-				/*
-				 * During Hot Standby we currently assume that
-				 * XLOG_BTREE_VACUUM records do not produce conflicts. That is
-				 * only true as long as the callback function depends only
-				 * upon whether the index tuple refers to heap tuples removed
-				 * in the initial heap scan. When vacuum starts it derives a
-				 * value of OldestXmin. Backends taking later snapshots could
-				 * have a RecentGlobalXmin with a later xid than the vacuum's
-				 * OldestXmin, so it is possible that row versions deleted
-				 * after OldestXmin could be marked as killed by other
-				 * backends. The callback function *could* look at the index
-				 * tuple state in isolation and decide to delete the index
-				 * tuple, though currently it does not. If it ever did, we
-				 * would need to reconsider whether XLOG_BTREE_VACUUM records
-				 * should cause conflicts. If they did cause conflicts they
-				 * would be fairly harsh conflicts, since we haven't yet
-				 * worked out a way to pass a useful value for
-				 * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
-				 * applies to *any* type of index that marks index tuples as
-				 * killed.
-				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				if (BTreeTupleIsPosting(itup))
+				{
+					int			nnewipd = 0;
+					ItemPointer newipd = NULL;
+
+					newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+					if (nnewipd == 0)
+					{
+						/*
+						 * All TIDs from posting list must be deleted, we can
+						 * delete whole tuple in a regular way.
+						 */
+						deletable[ndeletable++] = offnum;
+					}
+					else if (nnewipd == BTreeTupleGetNPosting(itup))
+					{
+						/*
+						 * All TIDs from posting tuple must remain. Do
+						 * nothing, just cleanup.
+						 */
+						pfree(newipd);
+					}
+					else if (nnewipd < BTreeTupleGetNPosting(itup))
+					{
+						/* Some TIDs from posting tuple must remain. */
+						Assert(nnewipd > 0);
+						Assert(newipd != NULL);
+
+						/*
+						 * Form new tuple that contains only remaining TIDs.
+						 * Remember this tuple and the offset of the old tuple
+						 * to update it in place.
+						 */
+						remainingoffset[nremaining] = offnum;
+						remaining[nremaining] =
+							BTreeFormPostingTuple(itup, newipd, nnewipd);
+						nremaining++;
+						pfree(newipd);
+
+						Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+					}
+				}
+				else
+				{
+					htup = &(itup->t_tid);
+
+					/*
+					 * During Hot Standby we currently assume that
+					 * XLOG_BTREE_VACUUM records do not produce conflicts.
+					 * That is only true as long as the callback function
+					 * depends only upon whether the index tuple refers to
+					 * heap tuples removed in the initial heap scan. When
+					 * vacuum starts it derives a value of OldestXmin.
+					 * Backends taking later snapshots could have a
+					 * RecentGlobalXmin with a later xid than the vacuum's
+					 * OldestXmin, so it is possible that row versions deleted
+					 * after OldestXmin could be marked as killed by other
+					 * backends. The callback function *could* look at the
+					 * index tuple state in isolation and decide to delete the
+					 * index tuple, though currently it does not. If it ever
+					 * did, we would need to reconsider whether
+					 * XLOG_BTREE_VACUUM records should cause conflicts. If
+					 * they did cause conflicts they would be fairly harsh
+					 * conflicts, since we haven't yet worked out a way to
+					 * pass a useful value for latestRemovedXid on the
+					 * XLOG_BTREE_VACUUM records. This applies to *any* type
+					 * of index that marks index tuples as killed.
+					 */
+					if (callback(htup, callback_state))
+						deletable[ndeletable++] = offnum;
+				}
 			}
 		}
 
@@ -1274,7 +1329,7 @@ restart:
 		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
 		 * call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nremaining > 0)
 		{
 			/*
 			 * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1346,7 @@ restart:
 			 * that.
 			 */
 			_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+								remainingoffset, remaining, nremaining,
 								vstate->lastBlockVacuumed);
 
 			/*
@@ -1376,6 +1432,41 @@ restart:
 }
 
 /*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+	int			remaining = 0;
+	int			nitem = BTreeTupleGetNPosting(itup);
+	ItemPointer tmpitems = NULL,
+				items = BTreeTupleGetPosting(itup);
+
+	/*
+	 * Check each tuple in the posting list, save alive tuples into tmpitems
+	 */
+	for (int i = 0; i < nitem; i++)
+	{
+		if (vstate->callback(items + i, vstate->callback_state))
+			continue;
+
+		if (tmpitems == NULL)
+			tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+		tmpitems[remaining++] = items[i];
+	}
+
+	*nremaining = remaining;
+	return tmpitems;
+}
+
+/*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
  * btrees always do, so this is trivial.
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8e51246..c78c8e6 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int	_bt_binsrch_posting(BTScanInsert key, Page page,
+								OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static void _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+								  OffsetNumber offnum, ItemPointer iptr,
+								  IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+									   OffsetNumber offnum, ItemPointer iptr,
+									   IndexTuple itup);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -434,7 +442,10 @@ _bt_binsrch(Relation rel,
  * low) makes bounds invalid.
  *
  * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's in_posting_offset field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
  */
 OffsetNumber
 _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +464,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 	Assert(P_ISLEAF(opaque));
 	Assert(!key->nextkey);
+	Assert(insertstate->in_posting_offset == 0);
 
 	if (!insertstate->bounds_valid)
 	{
@@ -509,6 +521,17 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 			if (result != 0)
 				stricthigh = high;
 		}
+
+		/*
+		 * If tuple at offset located by binary search is a posting list whose
+		 * TID range overlaps with caller's scantid, perform posting list
+		 * binary search to set in_posting_offset for caller.  Caller must
+		 * split the posting list when in_posting_offset is set.  This should
+		 * happen infrequently.
+		 */
+		if (unlikely(result == 0 && key->scantid != NULL))
+			insertstate->in_posting_offset =
+				_bt_binsrch_posting(key, page, mid);
 	}
 
 	/*
@@ -529,6 +552,68 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 }
 
 /*----------
+ *	_bt_binsrch_posting() -- posting list binary search.
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+	IndexTuple	itup;
+	ItemId		itemid;
+	int			low,
+				high,
+				mid,
+				res;
+
+	/*
+	 * If this isn't a posting tuple, then the index must be corrupt (if it is
+	 * an ordinary non-pivot tuple then there must be an existing tuple with a
+	 * heap TID that equals inserter's new heap TID/scantid).  Defensively
+	 * check that tuple is a posting list tuple whose posting list range
+	 * includes caller's scantid.
+	 *
+	 * (This is also needed because contrib/amcheck's rootdescend option needs
+	 * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+	 */
+	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+	Assert(!key->nextkey);
+	Assert(key->scantid != NULL);
+	itemid = PageGetItemId(page, offnum);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+	if (!BTreeTupleIsPosting(itup))
+		return 0;
+
+	/*
+	 * In the unlikely event that posting list tuple has LP_DEAD bit set,
+	 * signal to caller that it should kill the item and restart its binary
+	 * search.
+	 */
+	if (ItemIdIsDead(itemid))
+		return -1;
+
+	/* "high" is past end of posting list for loop invariant */
+	low = 0;
+	high = BTreeTupleGetNPosting(itup);
+	Assert(high >= 2);
+
+	while (high > low)
+	{
+		mid = low + ((high - low) / 2);
+		res = ItemPointerCompare(key->scantid,
+								 BTreeTupleGetPostingN(itup, mid));
+
+		if (res >= 1)
+			low = mid + 1;
+		else
+			high = mid;
+	}
+
+	return low;
+}
+
+/*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
  *	page/offnum: location of btree item to be compared to.
@@ -537,9 +622,18 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
  *			<0 if scankey < tuple at offnum;
  *			 0 if scankey == tuple at offnum;
  *			>0 if scankey > tuple at offnum.
- *		NULLs in the keys are treated as sortable values.  Therefore
- *		"equality" does not necessarily mean that the item should be
- *		returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values.  Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key.  Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid.  There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
+ *
+ * It is generally guaranteed that any possible scankey with scantid set
+ * will have zero or one tuples in the index that are considered equal
+ * here.
  *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
@@ -563,6 +657,7 @@ _bt_compare(Relation rel,
 	ScanKey		scankey;
 	int			ncmpkey;
 	int			ntupatts;
+	int32		result;
 
 	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +692,6 @@ _bt_compare(Relation rel,
 	{
 		Datum		datum;
 		bool		isNull;
-		int32		result;
 
 		datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
 
@@ -713,8 +807,24 @@ _bt_compare(Relation rel,
 	if (heapTid == NULL)
 		return 1;
 
+	/*
+	 * scankey must be treated as equal to a posting list tuple if its scantid
+	 * value falls within the range of the posting list.  In all other cases
+	 * there can only be a single heap TID value, which is compared directly
+	 * as a simple scalar value.
+	 */
 	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	return ItemPointerCompare(key->scantid, heapTid);
+	result = ItemPointerCompare(key->scantid, heapTid);
+	if (!BTreeTupleIsPosting(itup) || result <= 0)
+		return result;
+	else
+	{
+		result = ItemPointerCompare(key->scantid, BTreeTupleGetMaxTID(itup));
+		if (result > 0)
+			return 1;
+	}
+
+	return 0;
 }
 
 /*
@@ -1451,6 +1561,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	/* initialize tuple workspace to empty */
 	so->currPos.nextTupleOffset = 0;
+	so->currPos.postingTupleOffset = 0;
 
 	/*
 	 * Now that the current page has been made consistent, the macro should be
@@ -1485,8 +1596,30 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				if (!BTreeTupleIsPosting(itup))
+				{
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+				else
+				{
+					/*
+					 * Setup state to return posting list, and save first
+					 * "logical" tuple
+					 */
+					_bt_setuppostingitems(so, itemIndex, offnum,
+										  BTreeTupleGetPostingN(itup, 0),
+										  itup);
+					itemIndex++;
+					/* Save additional posting list "logical" tuples */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											itup);
+						itemIndex++;
+					}
+				}
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1519,7 +1652,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1527,7 +1660,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxPostingIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1569,8 +1702,37 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (passes_quals && tuple_alive)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				if (!BTreeTupleIsPosting(itup))
+				{
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+				else
+				{
+					int			i = BTreeTupleGetNPosting(itup) - 1;
+
+					/*
+					 * Setup state to return posting list, and save last
+					 * "logical" tuple from posting list (since it's the first
+					 * that will be returned to scan).
+					 */
+					itemIndex--;
+					_bt_setuppostingitems(so, itemIndex, offnum,
+										  BTreeTupleGetPostingN(itup, i--),
+										  itup);
+
+					/*
+					 * Return posting list "logical" tuples -- do this in
+					 * descending order, to match overall scan order
+					 */
+					for (; i >= 0; i--)
+					{
+						itemIndex--;
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											itup);
+					}
+				}
 			}
 			if (!continuescan)
 			{
@@ -1584,8 +1746,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1760,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1611,6 +1775,61 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 }
 
 /*
+ * Setup state to save posting items from a single posting list tuple.  Saves
+ * the logical tuple that will be returned to scan first in passing.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for logical tuple
+ * that is returned to scan first.  Second or subsequent heap TID for posting
+ * list should be saved by calling _bt_savepostingitem().
+ */
+static void
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					  ItemPointer iptr, IndexTuple itup)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *iptr;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		/* Save a truncated version of the IndexTuple */
+		Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+		itupsz = MAXALIGN(itupsz);
+		currItem->tupleOffset = so->currPos.nextTupleOffset;
+		memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+		so->currPos.nextTupleOffset += itupsz;
+		so->currPos.postingTupleOffset = currItem->tupleOffset;
+	}
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer iptr, IndexTuple itup)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *iptr;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		/*
+		 * Have index-only scans return the same truncated IndexTuple for
+		 * every logical tuple that originates from the same posting list
+		 */
+		currItem->tupleOffset = so->currPos.postingTupleOffset;
+	}
+}
+
+/*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
  * On entry, if so->currPos.buf is valid the buffer is pinned but not locked;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index ab19692..4198770 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -288,6 +288,8 @@ static void _bt_sortaddtup(Page page, Size itemsize,
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 						 IndexTuple itup);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void _bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+								 BTDedupState *dedupState);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
 static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -830,6 +832,8 @@ _bt_sortaddtup(Page page,
  * the high key is to be truncated, offset 1 is deleted, and we insert
  * the truncated high key at offset 1.
  *
+ * Note that itup may be a posting list tuple.
+ *
  * 'last' pointer indicates the last offset added to the page.
  *----------
  */
@@ -963,6 +967,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * Overwrite the old item with new truncated high key directly.
 			 * oitup is already located at the physical beginning of tuple
 			 * space, so this should directly reuse the existing tuple space.
+			 *
+			 * If lastleft tuple was a posting tuple, we'll truncate its
+			 * posting list in _bt_truncate as well. Note that it is also
+			 * applicable only to leaf pages, since internal pages never
+			 * contain posting tuples.
 			 */
 			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
 			lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1002,6 +1011,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		 * the minimum key for the new page.
 		 */
 		state->btps_minkey = CopyIndexTuple(oitup);
+		Assert(BTreeTupleIsPivot(state->btps_minkey));
 
 		/*
 		 * Set the sibling links for both pages.
@@ -1043,6 +1053,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(state->btps_minkey == NULL);
 		state->btps_minkey = CopyIndexTuple(itup);
 		/* _bt_sortaddtup() will perform full truncation later */
+		BTreeTupleClearBtIsPosting(state->btps_minkey);
 		BTreeTupleSetNAtts(state->btps_minkey, 0);
 	}
 
@@ -1128,6 +1139,136 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 }
 
 /*
+ * Add new tuple (posting or non-posting) to the page while building index.
+ */
+static void
+_bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+					 BTDedupState *dedupState)
+{
+	IndexTuple	to_insert;
+
+	/* Return, if there is no tuple to insert */
+	if (state == NULL)
+		return;
+
+	if (dedupState->ntuples == 0)
+		to_insert = dedupState->itupprev;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(dedupState->itupprev,
+											 dedupState->ipd,
+											 dedupState->ntuples);
+		to_insert = postingtuple;
+		pfree(dedupState->ipd);
+	}
+
+	_bt_buildadd(wstate, state, to_insert);
+
+	if (dedupState->ntuples > 0)
+		pfree(to_insert);
+	dedupState->ntuples = 0;
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in dedupState.
+ *
+ * 'itup' is current tuple on page, which comes immediately after equal
+ * 'itupprev' tuple stashed in dedup state at the point we're called.
+ *
+ * Helper function for _bt_load() and _bt_dedup_one_page(), called when it
+ * becomes clear that pending itupprev item will be part of a new/pending
+ * posting list, or when a pending/new posting list will contain a new heap
+ * TID from itup.
+ *
+ * Note: caller is responsible for the BTMaxItemSize() check.
+ */
+void
+_bt_stash_item_tid(BTDedupState *dedupState, IndexTuple itup,
+				   OffsetNumber itup_offnum)
+{
+	int			nposting = 0;
+
+	if (dedupState->ntuples == 0)
+	{
+		dedupState->ipd = palloc0(dedupState->maxitemsize);
+
+		/*
+		 * itupprev hasn't had its posting list TIDs copied into ipd yet (must
+		 * have been first on page and/or in new posting list?).  Do so now.
+		 *
+		 * This is delayed because it wasn't initially clear whether or not
+		 * itupprev would be merged with the next tuple, or stay as-is.  By
+		 * now caller compared it against itup and found that it was equal, so
+		 * we can go ahead and add its TIDs.
+		 */
+		if (!BTreeTupleIsPosting(dedupState->itupprev))
+		{
+			memcpy(dedupState->ipd, dedupState->itupprev,
+				   sizeof(ItemPointerData));
+			dedupState->ntuples++;
+		}
+		else
+		{
+			/* if itupprev is posting, add all its TIDs to the posting list */
+			nposting = BTreeTupleGetNPosting(dedupState->itupprev);
+			memcpy(dedupState->ipd,
+				   BTreeTupleGetPosting(dedupState->itupprev),
+				   sizeof(ItemPointerData) * nposting);
+			dedupState->ntuples += nposting;
+		}
+
+		/* Save info about deduplicated items for future xlog record */
+		dedupState->n_intervals++;
+		/* Save offnum of the first item belongin to the group */
+		dedupState->dedup_intervals[dedupState->n_intervals - 1].from = dedupState->itupprev_off;
+		/*
+		 * Update the number of deduplicated items, belonging to this group.
+		 * Count each item just once, no matter if it was posting tuple or not
+		 */
+		dedupState->dedup_intervals[dedupState->n_intervals - 1].ntups++;
+	}
+
+	/*
+	 * Add current tup to ipd for pending posting list for new version of
+	 * page.
+	 */
+	if (!BTreeTupleIsPosting(itup))
+	{
+		memcpy(dedupState->ipd + dedupState->ntuples, itup,
+			   sizeof(ItemPointerData));
+		dedupState->ntuples++;
+	}
+	else
+	{
+		/*
+		 * if tuple is posting, add all its TIDs to the pending list that will
+		 * become new posting list later on
+		 */
+		nposting = BTreeTupleGetNPosting(itup);
+		memcpy(dedupState->ipd + dedupState->ntuples,
+			   BTreeTupleGetPosting(itup),
+			   sizeof(ItemPointerData) * nposting);
+		dedupState->ntuples += nposting;
+	}
+
+	/*
+	 * Update the number of deduplicated items, belonging to this group.
+	 * Count each item just once, no matter if it was posting tuple or not
+	 */
+	dedupState->dedup_intervals[dedupState->n_intervals - 1].ntups++;
+
+	/* TODO just a debug message. delete it in final version of the patch */
+	if (itup_offnum != InvalidOffsetNumber)
+		elog(DEBUG4, "_bt_stash_item_tid. N %d : from %u ntups %u",
+				dedupState->n_intervals,
+				dedupState->dedup_intervals[dedupState->n_intervals - 1].from,
+				dedupState->dedup_intervals[dedupState->n_intervals - 1].ntups);
+}
+
+/*
  * Read tuples in correct sort order from tuplesort, and load them into
  * btree leaves.
  */
@@ -1141,9 +1282,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	bool		load1;
 	TupleDesc	tupdes = RelationGetDescr(wstate->index);
 	int			i,
-				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+				natts = IndexRelationGetNumberOfAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		deduplicate = false;
+	BTDedupState *dedupState = NULL;
+
+	/*
+	 * Don't use deduplication for indexes with INCLUDEd columns and unique
+	 * indexes
+	 */
+	deduplicate = (IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+				   IndexRelationGetNumberOfAttributes(wstate->index) &&
+				   !wstate->index->rd_index->indisunique);
 
 	if (merge)
 	{
@@ -1257,19 +1409,88 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	}
 	else
 	{
-		/* merge is unnecessary */
-		while ((itup = tuplesort_getindextuple(btspool->sortstate,
-											   true)) != NULL)
+		if (!deduplicate)
 		{
-			/* When we see first tuple, create first index page */
-			if (state == NULL)
-				state = _bt_pagestate(wstate, 0);
+			/* merge is unnecessary */
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+					state = _bt_pagestate(wstate, 0);
 
-			_bt_buildadd(wstate, state, itup);
+				_bt_buildadd(wstate, state, itup);
 
-			/* Report progress */
-			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
-										 ++tuples_done);
+				/* Report progress */
+				pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+											 ++tuples_done);
+			}
+		}
+		else
+		{
+			/* init deduplication state needed to build posting tuples */
+			dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+			dedupState->ipd = NULL;
+			dedupState->ntuples = 0;
+			dedupState->itupprev = NULL;
+			dedupState->maxitemsize = 0;
+			dedupState->maxpostingsize = 0;
+
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+				{
+					state = _bt_pagestate(wstate, 0);
+					dedupState->maxitemsize = BTMaxItemSize(state->btps_page);
+				}
+
+				if (dedupState->itupprev != NULL)
+				{
+					int			n_equal_atts = _bt_keep_natts_fast(wstate->index,
+																   dedupState->itupprev, itup);
+
+					if (n_equal_atts > natts)
+					{
+						/*
+						 * Tuples are equal. Create or update posting.
+						 *
+						 * Else If posting is too big, insert it on page and
+						 * continue.
+						 */
+						if ((dedupState->ntuples + 1) * sizeof(ItemPointerData) <
+							dedupState->maxpostingsize)
+							_bt_stash_item_tid(dedupState, itup, InvalidOffsetNumber);
+						else
+							_bt_buildadd_posting(wstate, state, dedupState);
+					}
+					else
+					{
+						/*
+						 * Tuples are not equal. Insert itupprev into index.
+						 * Save current tuple for the next iteration.
+						 */
+						_bt_buildadd_posting(wstate, state, dedupState);
+					}
+				}
+
+				/*
+				 * Save the tuple to compare it with the next one and maybe
+				 * unite them into a posting tuple.
+				 */
+				if (dedupState->itupprev)
+					pfree(dedupState->itupprev);
+				dedupState->itupprev = CopyIndexTuple(itup);
+
+				/* compute max size of posting list */
+				dedupState->maxpostingsize = dedupState->maxitemsize -
+					IndexInfoFindDataOffset(dedupState->itupprev->t_info) -
+					MAXALIGN(IndexTupleSize(dedupState->itupprev));
+			}
+
+			/* Handle the last item */
+			_bt_buildadd_posting(wstate, state, dedupState);
 		}
 	}
 
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1c1029b..54cecc8 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
 	state.minfirstrightsz = SIZE_MAX;
 	state.newitemoff = newitemoff;
 
+	/* newitem cannot be a posting list item */
+	Assert(!BTreeTupleIsPosting(newitem));
+
 	/*
 	 * maxsplits should never exceed maxoff because there will be at most as
 	 * many candidate split points as there are points _between_ tuples, once
@@ -459,17 +462,52 @@ _bt_recsplitloc(FindSplitData *state,
 	int16		leftfree,
 				rightfree;
 	Size		firstrightitemsz;
+	Size		postingsubhikey = 0;
 	bool		newitemisfirstonright;
 
 	/* Is the new item going to be the first item on the right page? */
 	newitemisfirstonright = (firstoldonright == state->newitemoff
 							 && !newitemonleft);
 
+	/*
+	 * FIXME: Accessing every single tuple like this adds cycles to cases that
+	 * cannot possibly benefit (i.e. cases where we know that there cannot be
+	 * posting lists).  Maybe we should add a way to not bother when we are
+	 * certain that this is the case.
+	 *
+	 * We could either have _bt_split() pass us a flag, or invent a page flag
+	 * that indicates that the page might have posting lists, as an
+	 * optimization.  There is no shortage of btpo_flags bits for stuff like
+	 * this.
+	 */
 	if (newitemisfirstonright)
+	{
 		firstrightitemsz = state->newitemsz;
+
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+			postingsubhikey = IndexTupleSize(state->newitem) -
+				BTreeTupleGetPostingOffset(state->newitem);
+	}
 	else
+	{
 		firstrightitemsz = firstoldonrightsz;
 
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf)
+		{
+			ItemId		itemid;
+			IndexTuple	newhighkey;
+
+			itemid = PageGetItemId(state->page, firstoldonright);
+			newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+			if (BTreeTupleIsPosting(newhighkey))
+				postingsubhikey = IndexTupleSize(newhighkey) -
+					BTreeTupleGetPostingOffset(newhighkey);
+		}
+	}
+
 	/* Account for all the old tuples */
 	leftfree = state->leftspace - olddataitemstoleft;
 	rightfree = state->rightspace -
@@ -492,9 +530,13 @@ _bt_recsplitloc(FindSplitData *state,
 	 * adding a heap TID to the left half's new high key when splitting at the
 	 * leaf level.  In practice the new high key will often be smaller and
 	 * will rarely be larger, but conservatively assume the worst case.
+	 * Truncation always truncates away any posting list that appears in the
+	 * first right tuple, though, so it's safe to subtract that overhead
+	 * (while still conservatively assuming that truncation might have to add
+	 * back a single heap TID using the pivot tuple heap TID representation).
 	 */
 	if (state->is_leaf)
-		leftfree -= (int16) (firstrightitemsz +
+		leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
 							 MAXALIGN(sizeof(ItemPointerData)));
 	else
 		leftfree -= (int16) firstrightitemsz;
@@ -691,7 +733,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
 	tup = (IndexTuple) PageGetItem(state->page, itemid);
 	/* Do cheaper test first */
-	if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+	if (BTreeTupleIsPosting(tup) ||
+		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
 	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 4c7b2d0..e3d7f4f 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -97,8 +97,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	indoption = rel->rd_indoption;
 	tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
 
-	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
 	/*
 	 * We'll execute search using scan key constructed on key columns.
 	 * Truncated attributes and non-key attributes are omitted from the final
@@ -110,9 +108,20 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	key->anynullkeys = false;	/* initial assumption */
 	key->nextkey = false;
 	key->pivotsearch = false;
+	key->scantid = NULL;
 	key->keysz = Min(indnkeyatts, tupnatts);
-	key->scantid = key->heapkeyspace && itup ?
-		BTreeTupleGetHeapTID(itup) : NULL;
+
+	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+	Assert(!itup || !BTreeTupleIsPosting(itup) || key->heapkeyspace);
+
+	/*
+	 * When caller passes a tuple with a heap TID, use it to set scantid. Note
+	 * that this handles posting list tuples by setting scantid to the lowest
+	 * heap TID in the posting list.
+	 */
+	if (itup && key->heapkeyspace)
+		key->scantid = BTreeTupleGetHeapTID(itup);
+
 	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
@@ -1786,10 +1795,35 @@ _bt_killitems(IndexScanDesc scan)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
+			bool		killtuple = false;
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			if (BTreeTupleIsPosting(ituple))
 			{
-				/* found the item */
+				int			pi = i + 1;
+				int			nposting = BTreeTupleGetNPosting(ituple);
+				int			j;
+
+				for (j = 0; j < nposting; j++)
+				{
+					ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+					if (!ItemPointerEquals(item, &kitem->heapTid))
+						break;	/* out of posting list loop */
+
+					/* Read-ahead to later kitems */
+					if (pi < numKilled)
+						kitem = &so->currPos.items[so->killedItems[pi++]];
+				}
+
+				if (j == nposting)
+					killtuple = true;
+			}
+			else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+				killtuple = true;
+
+			if (killtuple)
+			{
+				/* found the item/all posting list items */
 				ItemIdMarkDead(iid);
 				killedsomething = true;
 				break;			/* out of inner search loop */
@@ -2145,6 +2179,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(firstright))
+		{
+			BTreeTupleClearBtIsPosting(pivot);
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			if (keepnatts == natts)
+			{
+				/*
+				 * index_truncate_tuple() just returned a copy of the
+				 * original, so make sure that the size of the new pivot tuple
+				 * doesn't have posting list overhead
+				 */
+				pivot->t_info &= ~INDEX_SIZE_MASK;
+				pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+			}
+		}
+
+		Assert(!BTreeTupleIsPosting(pivot));
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2161,6 +2213,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * attribute to the new pivot tuple.
 		 */
 		Assert(natts != nkeyatts);
+		Assert(!BTreeTupleIsPosting(lastleft) &&
+			   !BTreeTupleIsPosting(firstright));
 		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
 		tidpivot = palloc0(newsize);
 		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2168,6 +2222,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		pfree(pivot);
 		pivot = tidpivot;
 	}
+	else if (BTreeTupleIsPosting(firstright))
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal.  We
+		 * can always truncate away a posting list, though.
+		 *
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+			MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+		pivot->t_info &= ~INDEX_SIZE_MASK;
+		pivot->t_info |= newsize;
+		BTreeTupleClearBtIsPosting(pivot);
+		BTreeTupleSetAltHeapTID(pivot);
+	}
 	else
 	{
 		/*
@@ -2175,7 +2247,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * It's necessary to add a heap TID attribute to the new pivot tuple.
 		 */
 		Assert(natts == nkeyatts);
-		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+		newsize = MAXALIGN(IndexTupleSize(firstright)) +
+			MAXALIGN(sizeof(ItemPointerData));
 		pivot = palloc0(newsize);
 		memcpy(pivot, firstright, IndexTupleSize(firstright));
 	}
@@ -2193,6 +2266,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * nbtree (e.g., there is no pg_attribute entry).
 	 */
 	Assert(itup_key->heapkeyspace);
+	Assert(!BTreeTupleIsPosting(pivot));
 	pivot->t_info &= ~INDEX_SIZE_MASK;
 	pivot->t_info |= newsize;
 
@@ -2205,7 +2279,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2216,9 +2290,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+							  BTreeTupleGetHeapTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #else
 
 	/*
@@ -2231,7 +2308,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2240,7 +2317,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2321,15 +2399,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
  * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal (once detoasted).  Similarly, result may
- * differ from the _bt_keep_natts result when either tuple has TOASTed datums,
- * though this is barely possible in practice.
+ * unless they're bitwise equal after detoasting.
  *
  * These issues must be acceptable to callers, typically because they're only
  * concerned about making suffix truncation as effective as possible without
  * leaving excessive amounts of free space on either side of page split.
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts.
+ *
+ * When an index only uses opclasses where equality is "precise", this
+ * function is guaranteed to give the same result as _bt_keep_natts().  This
+ * makes it safe to use this function to determine whether or not two tuples
+ * can be folded together into a single posting tuple.  Posting list
+ * deduplication cannot be used with nondeterministic collations for this
+ * reason.
+ *
+ * FIXME: Actually invent the needed "equality-is-precise" opclass
+ * infrastructure.  See dedicated -hackers thread:
+ *
+ * https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2354,8 +2442,38 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 		if (isNull1 != isNull2)
 			break;
 
+		/*
+		 * XXX: The ideal outcome from the point of view of the posting list
+		 * patch is that the definition of an opclass with "precise equality"
+		 * becomes: "equality operator function must give exactly the same
+		 * answer as datum_image_eq() would, provided that we aren't using a
+		 * nondeterministic collation". (Nondeterministic collations are
+		 * clearly not compatible with deduplication.)
+		 *
+		 * This will be a lot faster than actually using the authoritative
+		 * insertion scankey in some cases.  This approach also seems more
+		 * elegant, since suffix truncation gets to follow exactly the same
+		 * definition of "equal" as posting list deduplication -- there is a
+		 * subtle interplay between deduplication and suffix truncation, and
+		 * it would be nice to know for sure that they have exactly the same
+		 * idea about what equality is.
+		 *
+		 * This ideal outcome still avoids problems with TOAST.  We cannot
+		 * repeat bugs like the amcheck bug that was fixed in bugfix commit
+		 * eba775345d23d2c999bbb412ae658b6dab36e3e8.  datum_image_eq()
+		 * considers binary equality, though only _after_ each datum is
+		 * decompressed.
+		 *
+		 * If this ideal solution isn't possible, then we can fall back on
+		 * defining "precise equality" as: "type's output function must
+		 * produce identical textual output for any two datums that compare
+		 * equal when using a safe/equality-is-precise operator class (unless
+		 * using a nondeterministic collation)".  That would mean that we'd
+		 * have to make deduplication call _bt_keep_natts() instead (or some
+		 * other function that uses authoritative insertion scankey).
+		 */
 		if (!isNull1 &&
-			!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+			!datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
 			break;
 
 		keepnatts++;
@@ -2407,22 +2525,30 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
+	/* !heapkeyspace indexes do not support deduplication */
+	if (!heapkeyspace && BTreeTupleIsPosting(itup))
+		return false;
+
+	/* INCLUDE indexes do not support deduplication */
+	if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+		return false;
+
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
 			/*
-			 * Non-pivot tuples currently never use alternative heap TID
-			 * representation -- even those within heapkeyspace indexes
+			 * Non-pivot tuple should never be explicitly marked as a pivot
+			 * tuple
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated.  (Note that tupnatts must have been
-			 * inferred, rather than coming from an explicit on-disk
-			 * representation.)
+			 * inferred, even with a posting list tuple, because only pivot
+			 * tuples store tupnatts directly.)
 			 */
 			return tupnatts == natts;
 		}
@@ -2466,12 +2592,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * non-zero, or when there is no explicit representation and the
 			 * tuple is evidently not a pre-pg_upgrade tuple.
 			 *
-			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
-			 * that to decide if the tuple is a pre-v11 tuple.
+			 * Prior to v11, downlinks always had P_HIKEY as their offset.
+			 * Accept that as an alternative indication of a valid
+			 * !heapkeyspace negative infinity tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
-				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+				ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
 		}
 		else
 		{
@@ -2497,7 +2623,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
+		return false;
+
+	/* Pivot tuple should not use posting list representation (redundant) */
+	if (BTreeTupleIsPosting(itup))
 		return false;
 
 	/*
@@ -2567,11 +2697,87 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					BTMaxItemSizeNoHeapTid(page),
 					RelationGetRelationName(rel)),
 			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
-					   ItemPointerGetBlockNumber(&newtup->t_tid),
-					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+					   ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
 					   RelationGetRelationName(heap)),
 			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
 					 "Consider a function index of an MD5 hash of the value, "
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Given a basic tuple that contains key datum and posting list,
+ * build a posting tuple.
+ *
+ * Basic tuple can be a posting tuple, but we only use key part of it,
+ * all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1 fallback to building a non-posting tuple.
+ * It is necessary to avoid storage overhead after posting tuple was vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+	uint32		keysize,
+				newsize = 0;
+	IndexTuple	itup;
+
+	/* We only need key part of the tuple */
+	if (BTreeTupleIsPosting(tuple))
+		keysize = BTreeTupleGetPostingOffset(tuple);
+	else
+		keysize = IndexTupleSize(tuple);
+
+	Assert(nipd > 0);
+
+	/* Add space needed for posting list */
+	if (nipd > 1)
+		newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+	else
+		newsize = keysize;
+
+	newsize = MAXALIGN(newsize);
+	itup = palloc0(newsize);
+	memcpy(itup, tuple, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+
+	if (nipd > 1)
+	{
+		/* Form posting tuple, fill posting fields */
+
+		/* Set meta info about the posting list */
+		itup->t_info |= INDEX_ALT_TID_MASK;
+		BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+		/* sort the list to preserve TID order invariant */
+		qsort((void *) ipd, nipd, sizeof(ItemPointerData),
+			  (int (*) (const void *, const void *)) ItemPointerCompare);
+
+		/* Copy posting list into the posting tuple */
+		memcpy(BTreeTupleGetPosting(itup), ipd,
+			   sizeof(ItemPointerData) * nipd);
+	}
+	else
+	{
+		/* To finish building of a non-posting tuple, copy TID from ipd */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		ItemPointerCopy(ipd, &itup->t_tid);
+	}
+
+	return itup;
+}
+
+/*
+ * Opposite of BTreeFormPostingTuple.
+ * returns regular tuple that contains the key,
+ * the tid of the new tuple is the nth tid of original tuple's posting list
+ * result tuple palloc'd in a caller's context.
+ */
+IndexTuple
+BTreeGetNthTupleOfPosting(IndexTuple tuple, int n)
+{
+	Assert(BTreeTupleIsPosting(tuple));
+	return BTreeFormPostingTuple(tuple, BTreeTupleGetPostingN(tuple, n), 1);
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c..98ce964 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -181,9 +181,35 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 
 		page = BufferGetPage(buffer);
 
-		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
-						false, false) == InvalidOffsetNumber)
-			elog(PANIC, "btree_xlog_insert: failed to add item");
+		if (xlrec->in_posting_offset != InvalidOffsetNumber)
+		{
+			/* oposting must be at offset before new item */
+			ItemId		itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+			IndexTuple oposting = (IndexTuple) PageGetItem(page, itemid);
+			IndexTuple newitem = (IndexTuple) datapos;
+			IndexTuple nposting;
+
+			nposting = _bt_form_newposting(newitem, oposting,
+										   xlrec->in_posting_offset);
+			Assert(isleaf);
+
+			Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+				   MAXALIGN(IndexTupleSize(nposting)));
+
+			/* replace existing posting */
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+			/* insert new item */
+			if (PageAddItem(page, (Item) newitem, MAXALIGN(IndexTupleSize(newitem)),
+							xlrec->offnum, false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+		}
+		else
+		{
+			if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+		}
 
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
@@ -265,20 +291,45 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 		BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
 		OffsetNumber off;
 		IndexTuple	newitem = NULL,
-					left_hikey = NULL;
+					left_hikey = NULL,
+					nposting = NULL;
 		Size		newitemsz = 0,
 					left_hikeysz = 0;
 		Page		newlpage;
-		OffsetNumber leftoff;
+		OffsetNumber leftoff,
+					 replacepostingoff = InvalidOffsetNumber;
 
 		datapos = XLogRecGetBlockData(record, 0, &datalen);
 
-		if (onleft)
+		if (onleft || xlrec->in_posting_offset)
 		{
 			newitem = (IndexTuple) datapos;
 			newitemsz = MAXALIGN(IndexTupleSize(newitem));
 			datapos += newitemsz;
 			datalen -= newitemsz;
+
+			/*
+			 * Repeat logic implemented in _bt_insertonpg():
+			 *
+			 * If the new tuple is a duplicate with a heap TID that falls
+			 * inside the range of an existing posting list tuple,
+			 * generate new posting tuple to replace original one
+			 * and update new tuple so that it's heap TID contains
+			 * the rightmost heap TID of original posting tuple.
+			 */
+			if (xlrec->in_posting_offset != 0)
+			{
+				ItemId		itemid = PageGetItemId(lpage, OffsetNumberPrev(xlrec->newitemoff));
+				IndexTuple oposting = (IndexTuple) PageGetItem(lpage, itemid);
+
+				nposting = _bt_form_newposting(newitem, oposting,
+											xlrec->in_posting_offset);
+
+				/* Alter new item offset, since effective new item changed */
+				replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+				Assert(BTreeTupleGetNPosting(nposting) == BTreeTupleGetNPosting(oposting));
+			}
 		}
 
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
@@ -304,6 +355,15 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 			Size		itemsz;
 			IndexTuple	item;
 
+			if (off == replacepostingoff)
+			{
+				if (PageAddItem(newlpage, (Item) nposting, MAXALIGN(IndexTupleSize(nposting)),
+								leftoff, false, false) == InvalidOffsetNumber)
+					elog(ERROR, "failed to add new item to left page after split");
+				leftoff = OffsetNumberNext(leftoff);
+				continue;
+			}
+
 			/* add the new item if it was inserted on left page */
 			if (onleft && off == xlrec->newitemoff)
 			{
@@ -380,14 +440,146 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 }
 
 static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		buf;
+	Page		newpage;
+	xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+
+	if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Initialize a temporary empty page and copy all the items
+		 * to that in item number order.
+		 */
+		Page		page = (Page) BufferGetPage(buf);
+		BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		BTPageOpaque nopaque;
+		OffsetNumber offnum, minoff, maxoff;
+		BTDedupState *dedupState = NULL;
+		char *data = ((char *) xlrec + SizeOfBtreeDedup);
+		dedupInterval dedup_intervals[MaxOffsetNumber];
+		int			 nth_interval = 0;
+		OffsetNumber n_dedup_tups = 0;
+
+		dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+		dedupState->ipd = NULL;
+		dedupState->ntuples = 0;
+		dedupState->itupprev = NULL;
+		dedupState->maxitemsize = BTMaxItemSize(page);
+		dedupState->maxpostingsize = 0;
+
+		memcpy(dedup_intervals, data,
+			   xlrec->n_intervals*sizeof(dedupInterval));
+
+		/* Scan over all items to see which ones can be deduplicated */
+		minoff = P_FIRSTDATAKEY(opaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+		newpage = PageGetTempPageCopySpecial(page);
+		nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+		/* Make sure that new page won't have garbage flag set */
+		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+		/* Copy High Key if any */
+		if (!P_RIGHTMOST(opaque))
+		{
+			ItemId		itemid = PageGetItemId(page, P_HIKEY);
+			Size		itemsz = ItemIdGetLength(itemid);
+			IndexTuple	item = (IndexTuple) PageGetItem(page, itemid);
+
+			if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+							false, false) == InvalidOffsetNumber)
+				elog(ERROR, "failed to add highkey during deduplication");
+		}
+
+		/*
+		* Iterate over tuples on the page to deduplicate them into posting
+		* lists and insert into new page
+		*/
+		for (offnum = minoff;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemId = PageGetItemId(page, offnum);
+			IndexTuple	itup = (IndexTuple) PageGetItem(page, itemId);
+
+			elog(DEBUG4, "btree_xlog_dedup. offnum %u, n_intervals %u, from %u ntups %u",
+						offnum,
+						nth_interval,
+						dedup_intervals[nth_interval].from,
+						dedup_intervals[nth_interval].ntups);
+
+			if (dedupState->itupprev == NULL)
+			{
+				/* Just set up base/first item in first iteration */
+				Assert(offnum == minoff);
+				dedupState->itupprev = CopyIndexTuple(itup);
+				dedupState->itupprev_off = offnum;
+				continue;
+			}
+
+			/*
+			 * Instead of comparing tuple's keys, which may be costly, use
+			 * information from xlog record. If current tuple belongs to the
+			 * group of deduplicated items, repeat logic of _bt_dedup_one_page
+			 * and stash it to form a posting list afterwards.
+			 */
+			if (dedupState->itupprev_off >= dedup_intervals[nth_interval].from
+				&& n_dedup_tups < dedup_intervals[nth_interval].ntups)
+			{
+				_bt_stash_item_tid(dedupState, itup, InvalidOffsetNumber);
+
+				elog(DEBUG4, "btree_xlog_dedup. stash offnum %u, nth_interval %u, from %u ntups %u",
+						offnum,
+						nth_interval,
+						dedup_intervals[nth_interval].from,
+						dedup_intervals[nth_interval].ntups);
+
+				/* count first tuple in the group */
+				if (dedupState->itupprev_off == dedup_intervals[nth_interval].from)
+					n_dedup_tups++;
+
+				/* count added tuple */
+				n_dedup_tups++;
+			}
+			else
+			{
+				_bt_dedup_insert(newpage, dedupState);
+
+				/* reset state */
+				if (n_dedup_tups > 0)
+					nth_interval++;
+				n_dedup_tups = 0;
+			}
+
+			pfree(dedupState->itupprev);
+			dedupState->itupprev = CopyIndexTuple(itup);
+			dedupState->itupprev_off = offnum;
+		}
+
+		/* Handle the last item */
+		_bt_dedup_insert(newpage, dedupState);
+
+		PageRestoreTempPage(newpage, page);
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buf);
+	}
+
+	if (BufferIsValid(buf))
+		UnlockReleaseBuffer(buf);
+}
+
+static void
 btree_xlog_vacuum(XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	Buffer		buffer;
 	Page		page;
 	BTPageOpaque opaque;
-#ifdef UNUSED
 	xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
 
 	/*
 	 * This section of code is thought to be no longer needed, after analysis
@@ -478,14 +670,34 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		if (len > 0)
 		{
-			OffsetNumber *unused;
-			OffsetNumber *unend;
+			if (xlrec->nremaining)
+			{
+				OffsetNumber *remainingoffset;
+				IndexTuple	remaining;
+				Size		itemsz;
+
+				remainingoffset = (OffsetNumber *)
+					(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+				remaining = (IndexTuple) ((char *) remainingoffset +
+										  xlrec->nremaining * sizeof(OffsetNumber));
+
+				/* Handle posting tuples */
+				for (int i = 0; i < xlrec->nremaining; i++)
+				{
+					PageIndexTupleDelete(page, remainingoffset[i]);
 
-			unused = (OffsetNumber *) ptr;
-			unend = (OffsetNumber *) ((char *) ptr + len);
+					itemsz = MAXALIGN(IndexTupleSize(remaining));
 
-			if ((unend - unused) > 0)
-				PageIndexMultiDelete(page, unused, unend - unused);
+					if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+									false, false) == InvalidOffsetNumber)
+						elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+					remaining = (IndexTuple) ((char *) remaining + itemsz);
+				}
+			}
+
+			if (xlrec->ndeleted)
+				PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 		}
 
 		/*
@@ -838,6 +1050,9 @@ btree_redo(XLogReaderState *record)
 		case XLOG_BTREE_SPLIT_R:
 			btree_xlog_split(false, record);
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			btree_xlog_dedup(record);
+			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
 			break;
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index a14eb79..802e27b 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -30,7 +30,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_insert *xlrec = (xl_btree_insert *) rec;
 
-				appendStringInfo(buf, "off %u", xlrec->offnum);
+				appendStringInfo(buf, "off %u; in_posting_offset %u",
+								 xlrec->offnum, xlrec->in_posting_offset);
 				break;
 			}
 		case XLOG_BTREE_SPLIT_L:
@@ -38,16 +39,27 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
+				/* FIXME: even master doesn't have newitemoff */
 				appendStringInfo(buf, "level %u, firstright %d",
 								 xlrec->level, xlrec->firstright);
 				break;
 			}
+		case XLOG_BTREE_DEDUP_PAGE:
+			{
+				xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+				appendStringInfo(buf, "items were deduplicated to %d items",
+								 xlrec->n_intervals);
+				break;
+			}
 		case XLOG_BTREE_VACUUM:
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "lastBlockVacuumed %u",
-								 xlrec->lastBlockVacuumed);
+				appendStringInfo(buf, "lastBlockVacuumed %u; nremaining %u; ndeleted %u",
+								 xlrec->lastBlockVacuumed,
+								 xlrec->nremaining,
+								 xlrec->ndeleted);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
@@ -131,6 +143,9 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			id = "DEDUPLICATE";
+			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 52eafe6..d1af18f 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
  * All other types of index tuples ("pivot" tuples) only have key columns,
  * since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,38 @@ typedef struct BTMetaPageData
  * omitted rather than truncated, since its representation is different to
  * the non-pivot representation.)
  *
+ * Non-pivot posting tuple format:
+ *  t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicated keys more effectively, we use special format
+ * of tuples - posting tuples.  posting_list is an array of ItemPointerData.
+ *
+ * Deduplication never applies to unique indexes or indexes with INCLUDEd
+ * columns.
+ *
+ * To differ posting tuples we use INDEX_ALT_TID_MASK flag in t_info and
+ * BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's tid:
+ * - t_tid.ip_blkid contains offset of the posting list.
+ * - t_tid offset field contains number of posting items this tuple contain
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * tuples, which is constrainted by BTMaxItemSize.
+
+ * If page contains so many duplicates, that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), page may contain several posting
+ * tuples with the same key.
+ * Also page can contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
  * Pivot tuple format:
  *
  *  t_tid | t_info | key values | [heap TID]
@@ -281,23 +312,145 @@ typedef struct BTMetaPageData
  * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
  * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
  * number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinct posting tuples from pivot tuples.
  *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_N_POSTING_OFFSET_MASK	0x0FFF
 #define BT_HEAP_TID_ATTR			0x1000
+#define BT_IS_POSTING				0x2000
+
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger then
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+	((int) ((BLCKSZ - SizeOfPageHeaderData - \
+			3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+			(sizeof(ItemPointerData)))
+
+/*
+ * Helper for BTDedupState.
+ * Each entry represents a group of 'ntups' consecutive items starting on
+ * 'from' offset that were deduplicated into a single posting tuple.
+ */
+typedef struct dedupInterval
+{
+	OffsetNumber from;
+	OffsetNumber ntups;
+} dedupInterval;
+
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * Iterating over tuples during index build or applying deduplication to a
+ * single page, we remember a tuple in itupprev, then compare the next one
+ * with it.  If tuples are equal, save their TIDs in the posting list.
+ * ntuples contains the size of the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTDedupState
+{
+	Size		maxitemsize;
+	Size		maxpostingsize;
+	IndexTuple	itupprev;
+
+	/*
+	 * array with info about deduplicated items on the page.
+	 *
+	 * It contains one entry for each group of consecutive items that
+	 * were deduplicated into a single posting tuple.
+	 *
+	 * This array is saved to xlog entry, which allows to replay
+	 * deduplication faster without actually comparing tuple's keys.
+	 */
+	dedupInterval dedup_intervals[MaxOffsetNumber];
+	/* current number of items in dedup_intervals array */
+	int			n_intervals;
+	/* temp state variable to keep a 'possible' start of dedup interval */
+	OffsetNumber itupprev_off;
+
+	int			ntuples;
+	ItemPointerData *ipd;
+} BTDedupState;
+
+/*
+ * N.B.: BTreeTupleIsPivot() should only be used in code that deals with
+ * heapkeyspace indexes specifically.  BTreeTupleIsPosting() works with all
+ * nbtree indexes, though.
+ */
+#define BTreeTupleIsPivot(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+	)
+#define BTreeTupleIsPosting(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+	)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleGetNPosting(itup)	\
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+	)
+#define BTreeTupleSetNPosting(itup, n) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+	} while(0)
 
-/* Get/set downlink block number */
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+	)
+#define BTreeSetPostingMeta(itup, nposting, off) \
+	do { \
+		BTreeTupleSetNPosting(itup, nposting); \
+		Assert(BTreeTupleIsPosting(itup)); \
+		ItemPointerSetBlockNumber(&((itup)->t_tid), (off)); \
+	} while(0)
+
+#define BTreeTupleGetPosting(itup) \
+	(ItemPointer) ((char*) (itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+	(BTreeTupleGetPosting(itup) + (n))
+
+/* Get/set downlink block number  */
 #define BTreeInnerTupleGetDownLink(itup) \
 	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
 #define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,40 +479,73 @@ typedef struct BTMetaPageData
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
 		( \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
 		IndexRelationGetNumberOfAttributes(rel) \
 	)
-#define BTreeTupleSetNAtts(itup, n) \
-	do { \
-		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
-	} while(0)
+
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int n)
+{
+	Assert(!BTreeTupleIsPosting(itup));
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	ItemPointerSetOffsetNumber(&itup->t_tid, n & BT_N_KEYS_OFFSET_MASK);
+}
 
 /*
- * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Get tiebreaker heap TID attribute, if any.  Works with both pivot and
+ * non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
  */
-#define BTreeTupleGetHeapTID(itup) \
-	( \
-	  (itup)->t_info & INDEX_ALT_TID_MASK && \
-	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
-	  ( \
-		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
-					   sizeof(ItemPointerData)) \
-	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
-	)
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+	if (BTreeTupleIsPivot(itup))
+	{
+		/* Pivot tuple heap TID representation? */
+		if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+			 BT_HEAP_TID_ATTR) != 0)
+			return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+								  sizeof(ItemPointerData));
+
+		/* Heap TID attribute was truncated */
+		return NULL;
+	}
+	else if (BTreeTupleIsPosting(itup))
+		return BTreeTupleGetPosting(itup);
+
+	return &(itup->t_tid);
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list tuple.  Works with
+ * non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxTID(IndexTuple itup)
+{
+	Assert(!BTreeTupleIsPivot(itup));
+
+	if (BTreeTupleIsPosting(itup))
+		return (ItemPointer) (BTreeTupleGetPosting(itup) +
+							  (BTreeTupleGetNPosting(itup) - 1));
+
+	return &(itup->t_tid);
+}
+
 /*
  * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * representation
  */
 #define BTreeTupleSetAltHeapTID(itup) \
 	do { \
-		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(BTreeTupleIsPivot(itup)); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
 								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
 	} while(0)
@@ -500,6 +686,13 @@ typedef struct BTInsertStateData
 	Buffer		buf;
 
 	/*
+	 * if _bt_binsrch_insert() found the location inside existing posting
+	 * list, save the position inside the list.  This will be -1 in rare cases
+	 * where the overlapping posting list is LP_DEAD.
+	 */
+	int			in_posting_offset;
+
+	/*
 	 * Cache of bounds within the current buffer.  Only used for insertions
 	 * where _bt_check_unique is called.  See _bt_binsrch_insert and
 	 * _bt_findinsertloc for details.
@@ -534,7 +727,9 @@ typedef BTInsertStateData *BTInsertState;
  * If we are doing an index-only scan, we save the entire IndexTuple for each
  * matched item, otherwise only its heap TID and offset.  The IndexTuples go
  * into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array.  Posting list tuples store a version of the
+ * tuple that does not include the posting list, allowing the same key to be
+ * returned for each logical tuple associated with the posting list.
  */
 
 typedef struct BTScanPosItem	/* what we remember about each match */
@@ -563,9 +758,13 @@ typedef struct BTScanPosData
 
 	/*
 	 * If we are doing an index-only scan, nextTupleOffset is the first free
-	 * location in the associated tuple storage workspace.
+	 * location in the associated tuple storage workspace.  Posting list
+	 * tuples need postingTupleOffset to store the current location of the
+	 * tuple that is returned multiple times (once per heap TID in posting
+	 * list).
 	 */
 	int			nextTupleOffset;
+	int			postingTupleOffset;
 
 	/*
 	 * The items array is always ordered in index order (ie, increasing
@@ -578,7 +777,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxPostingIndexTuplesPerPage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -732,6 +931,9 @@ extern bool _bt_doinsert(Relation rel, IndexTuple itup,
 						 IndexUniqueCheck checkUnique, Relation heapRel);
 extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
 extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
+extern IndexTuple _bt_form_newposting(IndexTuple itup, IndexTuple oposting,
+				   OffsetNumber in_posting_offset);
+extern void _bt_dedup_insert(Page page, BTDedupState *dedupState);
 
 /*
  * prototypes for functions in nbtsplitloc.c
@@ -762,6 +964,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems, Relation heapRel);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems,
+								OffsetNumber *remainingoffset,
+								IndexTuple *remaining, int nremaining,
 								BlockNumber lastBlockVacuumed);
 extern int	_bt_pagedel(Relation rel, Buffer buf);
 
@@ -812,6 +1016,9 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd,
+										int nipd);
+extern IndexTuple BTreeGetNthTupleOfPosting(IndexTuple tuple, int n);
 
 /*
  * prototypes for functions in nbtvalidate.c
@@ -824,5 +1031,7 @@ extern bool btvalidate(Oid opclassoid);
 extern IndexBuildResult *btbuild(Relation heap, Relation index,
 								 struct IndexInfo *indexInfo);
 extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+extern void _bt_stash_item_tid(BTDedupState *dedupState, IndexTuple itup,
+							   OffsetNumber itup_offnum);
 
 #endif							/* NBTREE_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index afa614d..075baaf 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP_PAGE	0x50	/* compactify tuples on the page */
+/* 0x60 is unused */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -61,16 +62,21 @@ typedef struct xl_btree_metadata
  * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
  * Note that INSERT_META implies it's not a leaf page.
  *
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page (data contains the inserted tuple);
+ *				 if in_posting_offset is valid, this is an insertion
+ *				 into existing posting tuple at offnum.
+ *				 redo must repeat logic of bt_insertonpg().
  * Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
  * Backup Blk 2: xl_btree_metadata, if INSERT_META
+ *
  */
 typedef struct xl_btree_insert
 {
 	OffsetNumber offnum;
+	OffsetNumber in_posting_offset;
 } xl_btree_insert;
 
-#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, in_posting_offset) + sizeof(OffsetNumber))
 
 /*
  * On insert with split, we save all the items going into the right sibling
@@ -96,6 +102,11 @@ typedef struct xl_btree_insert
  * An IndexTuple representing the high key of the left page must follow with
  * either variant.
  *
+ * In case, split included insertion into the middle of the posting tuple, and
+ * thus required posting tuple replacement, it also contains 'in_posting_offset',
+ * that is used to form replacing tuple and repean bt_insertonpg() logic.
+ * It is added to xlog only if replacing item remains on the left page.
+ *
  * Backup Blk 1: new right page
  *
  * The right page's data portion contains the right page's tuples in the form
@@ -113,9 +124,26 @@ typedef struct xl_btree_split
 	uint32		level;			/* tree level of page being split */
 	OffsetNumber firstright;	/* first item moved to right page */
 	OffsetNumber newitemoff;	/* new item's offset (if placed on left page) */
+	OffsetNumber in_posting_offset; /* offset inside posting tuple  */
 } xl_btree_split;
 
-#define SizeOfBtreeSplit	(offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit	(offsetof(xl_btree_split, in_posting_offset) + sizeof(OffsetNumber))
+
+/*
+ * When page is deduplicated, consecutive groups of tuples with equal keys
+ * are compactified into posting tuples.
+ * The WAL record keeps number of resulting posting tuples - n_intervals
+ * followed by array of dedupInterval structures, that hold information
+ * needed to replay page deduplication without extra comparisons of tuples keys.
+ */
+typedef struct xl_btree_dedup
+{
+	int			n_intervals;
+
+	/* TARGET DEDUP INTERVALS FOLLOW AT THE END */
+} xl_btree_dedup;
+#define SizeOfBtreeDedup (sizeof(int))
+
 
 /*
  * This is what we need to know about delete of individual leaf index tuples.
@@ -173,10 +201,19 @@ typedef struct xl_btree_vacuum
 {
 	BlockNumber lastBlockVacuumed;
 
-	/* TARGET OFFSET NUMBERS FOLLOW */
+	/*
+	 * This field helps us to find beginning of the remaining tuples from
+	 * postings which follow array of offset numbers.
+	 */
+	uint32		nremaining;
+	uint32		ndeleted;
+
+	/* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+	/* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
+	/* TARGET OFFSET NUMBERS FOLLOW (if any) */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(BlockNumber))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.
diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index ec47a22..71a03e3 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -212,3 +212,24 @@
    Memcheck:Cond
    fun:PyObject_Realloc
 }
+
+# Temporarily work around bug in datum_image_eq's handling of the cstring
+# (typLen == -2) case.  datumIsEqual() is not affected, but also doesn't handle
+# TOAST'ed values correctly.
+#
+# FIXME: Remove both suppressions when bug is fixed on master branch
+{
+   temporary_workaround_1
+   Memcheck:Addr1
+   fun:bcmp
+   fun:datum_image_eq
+   fun:_bt_keep_natts_fast
+}
+
+{
+   temporary_workaround_8
+   Memcheck:Addr8
+   fun:bcmp
+   fun:datum_image_eq
+   fun:_bt_keep_natts_fast
+}

#87

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Anastasia Lubennikova (#86)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Mon, Sep 16, 2019 at 8:48 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

Attached is v14 based on v12 (v13 changes are not merged).

In this version, I fixed the bug you mentioned and also fixed nbtinsert,
so that it doesn't save newposting in xlog record anymore.

Cool.

I tested patch with nbtree_wal_test, and found out that the real issue is
not the dedup WAL records themselves, but the full page writes that they trigger.
Here are test results (config is standard, except fsync=off to speedup tests):

'FPW on' and 'FPW off' are tests on v14.
NO_IMAGE is the test on v14 with REGBUF_NO_IMAGE in bt_dedup_one_page().

I think that is makes sense to focus on synthetic cases without
FPWs/FPIs from checkpoints. At least for now.

With random insertions into btree it's highly possible that deduplication will often be
the first write after checkpoint, and thus will trigger FPW, even if only a few tuples were compressed.

I find that hard to believe. Deduplication only occurs when we're
about to split the page. If that's almost as likely to occur as a
simple insert, then we're in big trouble (maybe it's actually true,
but if it is then that's the real problem). Also, fewer pages for the
index naturally leads to far fewer FPIs after a checkpoint.

I used "pg_waldump -z" and "pg_waldump --stats=record" to evaluate the
same case on v13. It was practically the same as the master branch,
apart from the huge difference in FPIs for the XLOG rmgr. Aside from
that one huge difference, there was a similar volume of the same types
of WAL records in each case. Mostly leaf inserts, and far fewer
internal page inserts. I suppose this isn't surprising.

It probably makes sense for the final version of the patch to increase
the volume of WAL a little overall, since the savings for internal
page stuff cannot make up for the cost of having to WAL log something
extra (deduplication operations) on leaf pages, regardless of the size
of those extra dedup WAL records (I am ignoring FPIs after a
checkpoint in this analysis). So the patch is more or less certain to
add *some* WAL overhead in cases that benefit, and that's okay. But,
it adds way too much WAL overhead today (even in v14), for reasons
that we don't understand yet, which is not okay.

I may have misunderstood your approach to WAL-logging in v12. I
thought that you were WAL-logging things that didn't change, which
doesn't seem to be the case with v14. I thought that v12 was very
similar to v11 (and my v13) in terms of how _bt_dedup_one_page() does
its WAL-logging. v14 looks good, though.

"pg_waldump -z" and "pg_waldump --stats=record" will break down the
contributing factor of FPIs, so it should be possible to account for
the overhead in the test case exactly. We can debug the problem by
using pg_waldump to count the absolute number of each type of record,
and the size of each type of record.

(Thinks some more...)

I think that the problem here is that you didn't copy this old code
from _bt_split() over to _bt_dedup_one_page():

/*
* Copy the original page's LSN into leftpage, which will become the
* updated version of the page. We need this because XLogInsert will
* examine the LSN and possibly dump it in a page image.
*/
PageSetLSN(leftpage, PageGetLSN(origpage));
isleaf = P_ISLEAF(oopaque);

Note that this happens at the start of _bt_split() -- the temp page
buffer based on origpage starts out with the same LSN as origpage.
This is an important step of the WAL volume optimization used by
_bt_split().

That's why there is no significant difference with log_newpage_buffer() approach.
And that's why "lazy" deduplication doesn't help to decrease amount of WAL.

The term "lazy deduplication" is seriously overloaded here. I think
that this could cause miscommunications. Let me list the possible
meanings of that term here:

1. First of all, the basic approach to deduplication is already lazy,
unlike GIN, in the sense that _bt_dedup_one_page() is called to avoid
a page split. I'm 100% sure that we both think that that works well
compared to an eager approach (like GIN's).

2. Second of all, there is the need to incrementally WAL log. It looks
like v14 does that well, in that it doesn't create
"xlrec_dedup.n_intervals" space when it isn't truly needed. That's
good.

3. Third, there is incremental writing of the page itself -- avoiding
using a temp buffer. Not sure where I stand on this.

4. Finally, there is the possibility that we could make deduplication
incremental, in order to avoid work that won't be needed altogether --
this would probably be combined with 3. Not sure where I stand on
this, either.

We should try to be careful when using these terms, as there is a very
real danger of talking past each other.

Another, and more realistic approach is to make deduplication less intensive:
if freed space is less than some threshold, fall back to not changing page at all and not generating xlog record.

I see that v14 uses the "dedupInterval" struct, which provides a
logical description of a deduplicated set of tuples. That general
approach is at least 95% of what I wanted from the
_bt_dedup_one_page() WAL-logging.

Probably that was the reason, why patch became faster after I added BT_COMPRESS_THRESHOLD in early versions,
not because deduplication itself is cpu bound or something, but because WAL load decreased.

I think so too -- BT_COMPRESS_THRESHOLD definitely makes compression
faster as things are. I am not against bringing back
BT_COMPRESS_THRESHOLD. I just don't want to do it right now because I
think that it's a distraction. It may hide problems that we want to
fix. Like the PageSetLSN() problem I mentioned just now, and maybe
others.

We will definitely need to have page space accounting that's a bit
similar to nbtsplitloc.c, to avoid the case where a leaf page is 100%
full (or has 4 bytes left, or something). That happens regularly now.
That must start with teaching _bt_dedup_one_page() about how much
space it will free. Basing it on the number of items on the page or
whatever is not going to work that well.

I think that it would be possible to have something like
BT_COMPRESS_THRESHOLD to prevent thrashing, and *also* make the
deduplication incremental, in the sense that it can give up on
deduplication when it frees enough space (i.e. something like v13's
0002-* patch). I said that these two things are closely related, which
is true, but it's also true that they don't overlap.

Don't forget the reason why I removed BT_COMPRESS_THRESHOLD: Doing so
made a handful of specific indexes (mostly from TPC-H) significantly
smaller. I never tried to debug the problem. It's possible that we
could bring back BT_COMPRESS_THRESHOLD (or something fillfactor-like),
but not use it on rightmost pages, and get the best of both worlds.
IIRC right-heavy low cardinality indexes (e.g. a low cardinality date
column) were improved by removing BT_COMPRESS_THRESHOLD, but we can
debug that when the time comes.

So I propose to develop this idea. The question is how to choose threshold.
I wouldn't like to introduce new user settings. Any ideas?

I think that there should be a target fill factor that sometimes makes
deduplication leave a small amount of free space. Maybe that means
that the last posting list on the page is made a bit smaller than the
other ones. It should be "goal orientated".

The loop within _bt_dedup_one_page() is very confusing in both v13 and
v14 -- I couldn't figure out why the accounting worked like this:

+           /*
+            * Project size of new posting list that would result from merging
+            * current tup with pending posting list (could just be prev item
+            * that's "pending").
+            *
+            * This accounting looks odd, but it's correct because ...
+            */
+           projpostingsz = MAXALIGN(IndexTupleSize(dedupState->itupprev) +
+                                    (dedupState->ntuples + itup_ntuples + 1) *
+                                    sizeof(ItemPointerData));

Why the "+1" here?

I have significantly refactored the _bt_dedup_one_page() loop in a way
that seems like a big improvement. It allowed me to remove all of the
small palloc() calls inside the loop, apart from the
BTreeFormPostingTuple() palloc()s. It's also a lot faster -- it seems
to have shaved about 2 seconds off the "land" unlogged table test,
which was originally about 1 minute 2 seconds with v13's 0001-* patch
(and without v13's 0002-* patch).

It seems like can easily be integrated with the approach to WAL
logging taken in v14, so everything can be integrated soon. I'll work
on that.

I also noticed that the number of checkpoints differ between tests:
select checkpoints_req from pg_stat_bgwriter ;

And I struggle to explain the reason of this.
Do you understand what can cause the difference?

I imagine that the additional WAL volume triggered a checkpoint
earlier than in the more favorable test, which indirectly triggered
more FPIs, which contributed to triggering a checkpoint even
earlier...and so on. Synthetic test cases can avoid this. A useful
synthetic test should have no checkpoints at all, so that we can see
the broken down costs, without any second order effects that add more
cost in weird ways.

--
Peter Geoghegan

#88

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Peter Geoghegan (#87)

2 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Mon, Sep 16, 2019 at 11:58 AM Peter Geoghegan <pg@bowt.ie> wrote:

I think that the problem here is that you didn't copy this old code
from _bt_split() over to _bt_dedup_one_page():

/*
* Copy the original page's LSN into leftpage, which will become the
* updated version of the page. We need this because XLogInsert will
* examine the LSN and possibly dump it in a page image.
*/
PageSetLSN(leftpage, PageGetLSN(origpage));
isleaf = P_ISLEAF(oopaque);

I can confirm that this is what the problem was. Attached are two patches:

* A version of your v14 from today with a couple of tiny changes to
make it work against the current master branch -- I had to rebase the
patch, but the changes made while rebasing were totally trivial. (I
like to keep CFTester green.)

* The second patch actually fixes the PageSetLSN() thing, setting the
temp page buffer's LSN to match the original page before any real work
is done, and before XLogInsert() is called. Just like _bt_split().

The test case now shows exactly what you reported for "FPWs off" when
FPWs are turned on, at least on my machine and with my checkpoint
settings. That is, there are *zero* FPIs/FPWs, so the final nbtree
volume is 2128 MB. This means that the volume of additional WAL
required over what the master branch requires for the same test case
is very small (2128 MB compares well with master's 2011 MB of WAL).
Maybe we could do better than 2128 MB with more work, but this is
definitely already low enough overhead to be acceptable. This also
passes "make check-world" testing.

However, my usual wal_consistency_checking smoke test fails pretty
quickly with the two patches applied:

3634/2019-09-16 13:53:22 PDT FATAL: inconsistent page found, rel
1663/16385/2673, forknum 0, blkno 13
3634/2019-09-16 13:53:22 PDT CONTEXT: WAL redo at 0/3202370 for
Btree/DEDUPLICATE: items were deduplicated to 12 items
3633/2019-09-16 13:53:22 PDT LOG: startup process (PID 3634) exited
with exit code 1

Maybe the lack of the PageSetLSN() thing masked a bug in WAL replay,
since without that we effectively always just replay FPIs, never truly
relying on redo. (I didn't try wal_consistency_checking without the
second patch, but I assume that you did, and found no problems for
this reason.)

Can you produce a new version that integrates the PageSetLSN() thing,
and fixes this bug?

Thanks
--
Peter Geoghegan

Attachments:

v141-0002-Add-_bt_split-style-WAL-optimization.patchapplication/octet-stream; name=v141-0002-Add-_bt_split-style-WAL-optimization.patchDownload

From d39f41ff50e8a72e5228a92102434e600d65a943 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 16 Sep 2019 13:39:21 -0700
Subject: [PATCH v141 2/2] Add _bt_split() style WAL optimization.

---
 src/backend/access/nbtree/nbtinsert.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 605865e85e..a3b7cee0c5 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -2635,6 +2635,13 @@ _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel, Size itemsz)
 	newpage = PageGetTempPageCopySpecial(page);
 	nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
 
+	/*
+	 * Copy the original page's LSN into newpage, which will become the
+	 * updated version of the page.  We need this because XLogInsert will
+	 * examine the LSN and possibly dump it in a page image.
+	 */
+	PageSetLSN(newpage, PageGetLSN(page));
+
 	/* Make sure that new page won't have garbage flag set */
 	nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 
-- 
2.17.1

v141-0001-v14-0001-Add-deduplication-to-nbtree.patch-from.patchapplication/octet-stream; name=v141-0001-v14-0001-Add-deduplication-to-nbtree.patch-from.patchDownload

From a4d17804d9980f845f6a64f61629e0bfde0906bd Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 16 Sep 2019 13:26:58 -0700
Subject: [PATCH v141 1/2] v14-0001-Add-deduplication-to-nbtree.patch from
 Anastasia

---
 contrib/amcheck/verify_nbtree.c         | 128 +++++-
 src/backend/access/nbtree/README        |  76 +++-
 src/backend/access/nbtree/nbtinsert.c   | 541 +++++++++++++++++++++++-
 src/backend/access/nbtree/nbtpage.c     | 148 ++++++-
 src/backend/access/nbtree/nbtree.c      | 147 +++++--
 src/backend/access/nbtree/nbtsearch.c   | 247 ++++++++++-
 src/backend/access/nbtree/nbtsort.c     | 243 ++++++++++-
 src/backend/access/nbtree/nbtsplitloc.c |  47 +-
 src/backend/access/nbtree/nbtutils.c    | 264 ++++++++++--
 src/backend/access/nbtree/nbtxlog.c     | 241 ++++++++++-
 src/backend/access/rmgrdesc/nbtdesc.c   |  27 +-
 src/include/access/nbtree.h             | 275 ++++++++++--
 src/include/access/nbtxlog.h            |  49 ++-
 src/tools/valgrind.supp                 |  21 +
 14 files changed, 2268 insertions(+), 186 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d678ed..399743d4d6 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -924,6 +924,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer	scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -994,29 +995,73 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
 			char	   *itid,
 					   *htid;
+			ItemPointer tid = BTreeTupleGetHeapTID(itup);
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+							ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("could not find tuple using search from root page in index \"%s\"",
 							RelationGetRelationName(state->rel)),
-					 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+					 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
 										itid, htid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is actually a posting list, make sure posting list TIDs
+		 * are in order.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer		current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+				current = BTreeTupleGetPostingN(itup, i);
+
+				if (ItemPointerCompare(current, &last) <= 0)
+				{
+					char	   *itid,
+							   *htid;
+
+					itid = psprintf("(%u,%u)", state->targetblock, offset);
+					htid = psprintf("(%u,%u)",
+									ItemPointerGetBlockNumberNoCheck(current),
+									ItemPointerGetOffsetNumberNoCheck(current));
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+												itid, htid,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1074,12 +1119,33 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				IndexTuple	onetup;
+
+				/* Fingerprint all elements of posting tuple one by one */
+				for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+					norm = bt_normalize_tuple(state, onetup);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != onetup)
+						pfree(norm);
+					pfree(onetup);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
@@ -1087,7 +1153,8 @@ bt_target_page_check(BtreeCheckState *state)
 		 *
 		 * If there is a high key (if this is not the rightmost page on its
 		 * entire level), check that high key actually is upper bound on all
-		 * page items.
+		 * page items.  If this is a posting list tuple, we'll need to set
+		 * scantid to be highest TID in posting list.
 		 *
 		 * We prefer to check all items against high key rather than checking
 		 * just the last and trusting that the operator class obeys the
@@ -1127,6 +1194,9 @@ bt_target_page_check(BtreeCheckState *state)
 		 * tuple. (See also: "Notes About Data Representation" in the nbtree
 		 * README.)
 		 */
+		scantid = skey->scantid;
+		if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+			skey->scantid = BTreeTupleGetMaxTID(itup);
 		if (!P_RIGHTMOST(topaque) &&
 			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
 			  invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1220,7 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
+		skey->scantid = scantid;
 
 		/*
 		 * * Item order check *
@@ -1164,11 +1235,13 @@ bt_target_page_check(BtreeCheckState *state)
 					   *htid,
 					   *nitid,
 					   *nhtid;
+			ItemPointer tid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetHeapTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 			nitid = psprintf("(%u,%u)", state->targetblock,
 							 OffsetNumberNext(offset));
 
@@ -1177,9 +1250,11 @@ bt_target_page_check(BtreeCheckState *state)
 										  state->target,
 										  OffsetNumberNext(offset));
 			itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+			tid = BTreeTupleGetHeapTID(itup);
 			nhtid = psprintf("(%u,%u)",
-							 ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							 ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							 ItemPointerGetBlockNumberNoCheck(tid),
+							 ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1264,10 @@ bt_target_page_check(BtreeCheckState *state)
 										"higher index tid=%s (points to %s tid=%s) "
 										"page lsn=%X/%X.",
 										itid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										htid,
 										nitid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										nhtid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
@@ -1953,10 +2028,10 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
  * verification.  In particular, it won't try to normalize opclass-equal
  * datums with potentially distinct representations (e.g., btree/numeric_ops
  * index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though.  For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation.  Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
  */
 static IndexTuple
 bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -2087,6 +2162,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		insertstate.itup = itup;
 		insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 		insertstate.itup_key = key;
+		insertstate.in_posting_offset = 0;
 		insertstate.bounds_valid = false;
 		insertstate.buf = lbuf;
 
@@ -2094,7 +2170,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		offnum = _bt_binsrch_insert(state->rel, &insertstate);
 		/* Compare first >= matching item on leaf page, if any */
 		page = BufferGetPage(lbuf);
+		/* Should match on first heap TID when tuple has a posting list */
 		if (offnum <= PageGetMaxOffsetNumber(page) &&
+			insertstate.in_posting_offset <= 0 &&
 			_bt_compare(state->rel, key, page, offnum) == 0)
 			exists = true;
 		_bt_relbuf(state->rel, lbuf);
@@ -2560,14 +2638,18 @@ static inline ItemPointer
 BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
 							bool nonpivot)
 {
-	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	ItemPointer result;
 	BlockNumber targetblock = state->targetblock;
 
-	if (result == NULL && nonpivot)
+	/* Shouldn't be called with heapkeyspace index */
+	Assert(state->heapkeyspace);
+	if (BTreeTupleIsPivot(itup) == nonpivot)
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
 						targetblock, RelationGetRelationName(state->rel))));
 
+	result = BTreeTupleGetHeapTID(itup);
+
 	return result;
 }
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6db203e75c..50ec9ef48c 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
 like a hint bit for a heap tuple), but physically removing tuples requires
 exclusive lock.  In the current code we try to remove LP_DEAD tuples when
 we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already).  Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)
 
 This leaves the index in a state where it has no entry for a dead tuple
 that still exists in the heap.  This is not a problem for the current
@@ -710,6 +713,77 @@ the fallback strategy assumes that duplicates are mostly inserted in
 ascending heap TID order.  The page is split in a way that leaves the left
 half of the page mostly full, and the right half of the page mostly empty.
 
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits.  Deduplication alters
+the physical representation of tuples without changing the logical contents
+of the index, and without adding overhead to read queries.  Non-pivot
+tuples are folded together into a single physical tuple with a posting list
+(a simple array of heap TIDs with the standard item pointer format).
+Deduplication is always applied lazily, at the point where it would
+otherwise be necessary to perform a page split.  It occurs only when
+LP_DEAD items have been removed, as our last line of defense against
+splitting a leaf page.  We can set the LP_DEAD bit with posting list
+tuples, though only when all table tuples are known dead. (Bitmap scans
+cannot perform LP_DEAD bit setting, and are the common case with indexes
+that contain lots of duplicates, so this downside is considered
+acceptable.)
+
+Large groups of logical duplicates tend to appear together on the same leaf
+page due to the special duplicate logic used when choosing a split point.
+This facilitates lazy/dynamic deduplication.  Deduplication can reliably
+deduplicate a large localized group of duplicates before it can span
+multiple leaf pages.  Posting list tuples are subject to the same 1/3 of a
+page restriction as any other tuple.
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists.  A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple.  An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication.  Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple (lazy deduplication
+avoids rewriting posting lists repeatedly when heap TIDs are inserted
+slightly out of order by concurrent inserters).  When the incoming tuple
+really does overlap with an existing posting list, a posting list split is
+performed.  Posting list splits work in a way that more or less preserves
+the illusion that all incoming tuples do not need to be merged with any
+existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list.  The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list.  The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list.  We make
+space in the posting list by removing the heap TID that became the new
+item.  The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists.  Also, overall
+space utilization is improved by keeping existing posting lists large.
+
+The representation of posting lists is identical to the posting lists used
+by GIN, so it would be straightforward to apply GIN's varbyte encoding
+compression scheme to individual posting lists.  Posting list compression
+would break the assumptions made by posting list splits about page space
+accounting, though, so it's not clear how compression could be integrated
+with nbtree.  Besides, posting list compression does not offer a compelling
+trade-off for nbtree, since in general nbtree is optimized for consistent
+performance with many concurrent readers and writers.  A major goal of
+nbtree's lazy approach to deduplication is to limit the performance impact
+of deduplication with random updates.  Even concurrent append-only inserts
+of the same key value will tend to have inserts of individual index tuples
+in an order that doesn't quite match heap TID order.  In general, delaying
+deduplication avoids many unnecessary posting list splits, and minimizes
+page level fragmentation.
+
 Notes About Data Representation
 -------------------------------
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b84bf1c3df..605865e85e 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,21 +47,26 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   BTStack stack,
 						   IndexTuple itup,
 						   OffsetNumber newitemoff,
+						   int in_posting_offset,
 						   bool split_only_page);
 static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 						Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
-						IndexTuple newitem);
+						IndexTuple newitem, IndexTuple original_newitem, IndexTuple nposting,
+						OffsetNumber in_posting_offset);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 							  BTStack stack, bool is_root, bool is_only);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 						 OffsetNumber itup_off);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+							   Size itemsz);
 
 /*
  *	_bt_doinsert() -- Handle insertion of a single index tuple in the tree.
  *
  *		This routine is called by the public interface routine, btinsert.
- *		By here, itup is filled in, including the TID.
+ *		By here, itup is filled in, including the TID.  Caller should be
+ *		prepared for us to scribble on 'itup'.
  *
  *		If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
  *		will allow duplicates.  Otherwise (UNIQUE_CHECK_YES or
@@ -123,6 +128,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	/* PageAddItem will MAXALIGN(), but be consistent */
 	insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 	insertstate.itup_key = itup_key;
+	insertstate.in_posting_offset = 0;
 	insertstate.bounds_valid = false;
 	insertstate.buf = InvalidBuffer;
 
@@ -300,7 +306,7 @@ top:
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
 		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
-					   itup, newitemoff, false);
+					   itup, newitemoff, insertstate.in_posting_offset, false);
 	}
 	else
 	{
@@ -435,6 +441,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 
 				/* okay, we gotta fetch the heap tuple ... */
 				curitup = (IndexTuple) PageGetItem(page, curitemid);
+				Assert(!BTreeTupleIsPosting(curitup));
 				htid = curitup->t_tid;
 
 				/*
@@ -689,6 +696,7 @@ _bt_findinsertloc(Relation rel,
 	BTScanInsert itup_key = insertstate->itup_key;
 	Page		page = BufferGetPage(insertstate->buf);
 	BTPageOpaque lpageop;
+	OffsetNumber location;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -751,13 +759,23 @@ _bt_findinsertloc(Relation rel,
 
 		/*
 		 * If the target page is full, see if we can obtain enough space by
-		 * erasing LP_DEAD items
+		 * erasing LP_DEAD items.  If that doesn't work out, and if the index
+		 * isn't a unique index, try deduplication.
 		 */
-		if (PageGetFreeSpace(page) < insertstate->itemsz &&
-			P_HAS_GARBAGE(lpageop))
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
 		{
-			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
-			insertstate->bounds_valid = false;
+			if (P_HAS_GARBAGE(lpageop))
+			{
+				_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+				insertstate->bounds_valid = false;
+			}
+
+			if (!checkingunique && PageGetFreeSpace(page) < insertstate->itemsz)
+			{
+				_bt_dedup_one_page(rel, insertstate->buf, heapRel,
+								   insertstate->itemsz);
+				insertstate->bounds_valid = false;	/* paranoia */
+			}
 		}
 	}
 	else
@@ -839,7 +857,31 @@ _bt_findinsertloc(Relation rel,
 	Assert(P_RIGHTMOST(lpageop) ||
 		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
 
-	return _bt_binsrch_insert(rel, insertstate);
+	location = _bt_binsrch_insert(rel, insertstate);
+
+	/*
+	 * Insertion is not prepared for the case where an LP_DEAD posting list
+	 * tuple must be split.  In the unlikely event that this happens, call
+	 * _bt_dedup_one_page() to force it to kill all LP_DEAD items.
+	 */
+	if (unlikely(insertstate->in_posting_offset == -1))
+	{
+		_bt_dedup_one_page(rel, insertstate->buf, heapRel, 0);
+		Assert(!P_HAS_GARBAGE(lpageop));
+
+		/* Must reset insertstate ahead of new _bt_binsrch_insert() call */
+		insertstate->bounds_valid = false;
+		insertstate->in_posting_offset = 0;
+		location = _bt_binsrch_insert(rel, insertstate);
+
+		/*
+		 * Might still have to split some other posting list now, but that
+		 * should never be LP_DEAD
+		 */
+		Assert(insertstate->in_posting_offset >= 0);
+	}
+
+	return location;
 }
 
 /*
@@ -900,15 +942,65 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
 	insertstate->bounds_valid = false;
 }
 
+/*
+ * If the new tuple 'itup' is a duplicate with a heap TID that falls inside
+ * the range of an existing posting list tuple 'oposting', generate new
+ * posting tuple to replace original one and update new tuple so that
+ * it's heap TID contains the rightmost heap TID of original posting tuple.
+ */
+IndexTuple
+_bt_form_newposting(IndexTuple itup, IndexTuple oposting,
+				   OffsetNumber in_posting_offset)
+{
+	int			nipd;
+	char	   *replacepos;
+	char	   *rightpos;
+	Size		nbytes;
+	IndexTuple nposting;
+
+	Assert(BTreeTupleIsPosting(oposting));
+	nipd = BTreeTupleGetNPosting(oposting);
+	Assert(in_posting_offset < nipd);
+
+	nposting = CopyIndexTuple(oposting);
+	replacepos = (char *) BTreeTupleGetPostingN(nposting, in_posting_offset);
+	rightpos = replacepos + sizeof(ItemPointerData);
+	nbytes = (nipd - in_posting_offset - 1) * sizeof(ItemPointerData);
+
+	/*
+	 * Move item pointers in posting list to make a gap for the new item's
+	 * heap TID (shift TIDs one place to the right, losing original
+	 * rightmost TID).
+	 */
+	memmove(rightpos, replacepos, nbytes);
+
+	/*
+	 * Fill the gap with the TID of the new item.
+	 */
+	ItemPointerCopy(&itup->t_tid, (ItemPointer) replacepos);
+
+	/*
+	 * Copy original (not new original) posting list's last TID into new
+	 * item
+	 */
+	ItemPointerCopy(BTreeTupleGetPostingN(oposting, nipd - 1), &itup->t_tid);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxTID(nposting),
+								BTreeTupleGetHeapTID(itup)) < 0);
+
+	return nposting;
+}
+
 /*----------
  *	_bt_insertonpg() -- Insert a tuple on a particular page in the index.
  *
  *		This recursive procedure does the following things:
  *
+ *			+  if necessary, splits an existing posting list on page.
+ *			   This is only needed when 'in_posting_offset' is non-zero.
  *			+  if necessary, splits the target page, using 'itup_key' for
  *			   suffix truncation on leaf pages (caller passes NULL for
  *			   non-leaf pages).
- *			+  inserts the tuple.
+ *			+  inserts the new tuple (could be from split posting list).
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
  *			   right using information stored in the parent stack).
@@ -918,7 +1010,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		On entry, we must have the correct buffer in which to do the
  *		insertion, and the buffer must be pinned and write-locked.  On return,
- *		we will have dropped both the pin and the lock on the buffer.
+ *		we will have dropped both the pin and the lock on the buffer.  Caller
+ *		should be prepared for us to scribble on 'itup'.
  *
  *		This routine only performs retail tuple insertions.  'itup' should
  *		always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +1029,15 @@ _bt_insertonpg(Relation rel,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
+			   int in_posting_offset,
 			   bool split_only_page)
 {
 	Page		page;
 	BTPageOpaque lpageop;
 	Size		itemsz;
+	IndexTuple	nposting = NULL;
+	IndexTuple	oposting;
+	IndexTuple	original_itup = NULL;
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1051,8 @@ _bt_insertonpg(Relation rel,
 	Assert(P_ISLEAF(lpageop) ||
 		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
+	/* retail insertions of posting list tuples are disallowed */
+	Assert(!BTreeTupleIsPosting(itup));
 
 	/* The caller should've finished any incomplete splits already. */
 	if (P_INCOMPLETE_SPLIT(lpageop))
@@ -964,6 +1063,47 @@ _bt_insertonpg(Relation rel,
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
+	/*
+	 * Do we need to split an existing posting list item?
+	 */
+	if (in_posting_offset != 0)
+	{
+		ItemId		itemid = PageGetItemId(page, newitemoff);
+
+		/*
+		 * The new tuple is a duplicate with a heap TID that falls inside the
+		 * range of an existing posting list tuple, so split posting list.
+		 *
+		 * Posting list splits always replace some existing TID in the posting
+		 * list with the new item's heap TID (based on a posting list offset
+		 * from caller) by removing rightmost heap TID from posting list.  The
+		 * new item's heap TID is swapped with that rightmost heap TID, almost
+		 * as if the tuple inserted never overlapped with a posting list in
+		 * the first place.  This allows the insertion and page split code to
+		 * have minimal special case handling of posting lists.
+		 *
+		 * The only extra handling required is to overwrite the original
+		 * posting list with nposting, which is guaranteed to be the same size
+		 * as the original, keeping the page space accounting simple.  This
+		 * takes place in either the page insert or page split critical
+		 * section.
+		 */
+		Assert(P_ISLEAF(lpageop));
+		Assert(!ItemIdIsDead(itemid));
+		Assert(in_posting_offset > 0);
+		oposting = (IndexTuple) PageGetItem(page, itemid);
+
+		/* save a copy of itup with unchanged TID to write it into xlog record */
+		original_itup = CopyIndexTuple(itup);
+
+		nposting = _bt_form_newposting(itup, oposting, in_posting_offset);
+
+		Assert(BTreeTupleGetNPosting(nposting) == BTreeTupleGetNPosting(oposting));
+
+		/* Alter new item offset, since effective new item changed */
+		newitemoff = OffsetNumberNext(newitemoff);
+	}
+
 	/*
 	 * Do we need to split the page to fit the item on it?
 	 *
@@ -996,7 +1136,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+						 original_itup, nposting, in_posting_offset);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1075,6 +1216,18 @@ _bt_insertonpg(Relation rel,
 			elog(PANIC, "failed to add new item to block %u in index \"%s\"",
 				 itup_blkno, RelationGetRelationName(rel));
 
+		if (nposting)
+		{
+			/*
+			 * Handle a posting list split by performing an in-place update of
+			 * the existing posting list
+			 */
+			Assert(P_ISLEAF(lpageop));
+			Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+				   MAXALIGN(IndexTupleSize(nposting)));
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+		}
+
 		MarkBufferDirty(buf);
 
 		if (BufferIsValid(metabuf))
@@ -1116,6 +1269,7 @@ _bt_insertonpg(Relation rel,
 			XLogRecPtr	recptr;
 
 			xlrec.offnum = itup_off;
+			xlrec.in_posting_offset = in_posting_offset;
 
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
@@ -1152,7 +1306,10 @@ _bt_insertonpg(Relation rel,
 			}
 
 			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
-			XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+			if (original_itup)
+				XLogRegisterBufData(0, (char *) original_itup, IndexTupleSize(original_itup));
+			else
+				XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
 
 			recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
@@ -1194,6 +1351,13 @@ _bt_insertonpg(Relation rel,
 			_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
 			RelationSetTargetBlock(rel, cachedBlock);
 	}
+
+	/* be tidy */
+	if (nposting)
+		pfree(nposting);
+	if (original_itup)
+		pfree(original_itup);
+
 }
 
 /*
@@ -1211,10 +1375,17 @@ _bt_insertonpg(Relation rel,
  *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
+ *
+ *		nposting is a replacement posting for the posting list at the
+ *		offset immediately before the new item's offset.  This is needed
+ *		when caller performed "posting list split", and corresponds to the
+ *		same step for retail insertions that don't split the page.
  */
 static Buffer
 _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+		  IndexTuple original_newitem,
+		  IndexTuple nposting, OffsetNumber in_posting_offset)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1236,12 +1407,20 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	OffsetNumber firstright;
 	OffsetNumber maxoff;
 	OffsetNumber i;
+	OffsetNumber replacepostingoff = InvalidOffsetNumber;
 	bool		newitemonleft,
 				isleaf;
 	IndexTuple	lefthikey;
 	int			indnatts = IndexRelationGetNumberOfAttributes(rel);
 	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
+	/*
+	 * Determine offset number of posting list that will be updated in place
+	 * as part of split that follows a posting list split
+	 */
+	if (nposting != NULL)
+		replacepostingoff = OffsetNumberPrev(newitemoff);
+
 	/*
 	 * origpage is the original page to be split.  leftpage is a temporary
 	 * buffer that receives the left-sibling data, which will be copied back
@@ -1273,6 +1452,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	 * newitemoff == firstright.  In all other cases it's clear which side of
 	 * the split every tuple goes on from context.  newitemonleft is usually
 	 * (but not always) redundant information.
+	 *
+	 * Note: In theory, the split point choice logic should operate against a
+	 * version of the page that already replaced the posting list at offset
+	 * replacepostingoff with nposting where applicable.  We don't bother with
+	 * that, though.  Both versions of the posting list must be the same size
+	 * and have the same key values, so this omission can't affect the split
+	 * point chosen in practice.
 	 */
 	firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
 								  newitem, &newitemonleft);
@@ -1340,6 +1526,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, firstright);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		/* Behave as if origpage posting list has already been swapped */
+		if (firstright == replacepostingoff)
+			item = nposting;
 	}
 
 	/*
@@ -1373,6 +1562,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
 			itemid = PageGetItemId(origpage, lastleftoff);
 			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+			/* Behave as if origpage posting list has already been swapped */
+			if (lastleftoff == replacepostingoff)
+				lastleft = nposting;
 		}
 
 		Assert(lastleft != item);
@@ -1480,8 +1672,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 
+		/*
+		 * did caller pass new replacement posting list tuple due to posting
+		 * list split?
+		 */
+		if (i == replacepostingoff)
+		{
+			/*
+			 * swap origpage posting list with post-posting-list-split version
+			 * from caller
+			 */
+			Assert(isleaf);
+			Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+			item = nposting;
+		}
+
 		/* does new item belong before this one? */
-		if (i == newitemoff)
+		else if (i == newitemoff)
 		{
 			if (newitemonleft)
 			{
@@ -1653,6 +1860,17 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		xlrec.firstright = firstright;
 		xlrec.newitemoff = newitemoff;
 
+		/*
+		 * If replacing posting item was put on the right page,
+		 * we don't need to explicitly WAL log it because it's included
+		 * with all the other items on the right page.
+		 * Otherwise, save in_posting_offset and newitem to construct
+		 * replacing tuple.
+		 */
+		xlrec.in_posting_offset = InvalidOffsetNumber;
+		if (replacepostingoff < firstright)
+			xlrec.in_posting_offset = in_posting_offset;
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
 
@@ -1672,9 +1890,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		 * is not stored if XLogInsert decides it needs a full-page image of
 		 * the left page.  We store the offset anyway, though, to support
 		 * archive compression of these records.
+		 *
+		 * Also save newitem in case posting split was required
+		 * to construct new posting.
 		 */
-		if (newitemonleft)
-			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+		if (newitemonleft || xlrec.in_posting_offset)
+		{
+			if (xlrec.in_posting_offset)
+			{
+				Assert(original_newitem != NULL);
+				Assert(ItemPointerCompare(&original_newitem->t_tid, &newitem->t_tid) != 0);
+
+				XLogRegisterBufData(0, (char *) original_newitem,
+									MAXALIGN(IndexTupleSize(original_newitem)));
+			}
+			else
+				XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+		}
 
 		/* Log the left page's new high key */
 		itemid = PageGetItemId(origpage, P_HIKEY);
@@ -1834,7 +2066,7 @@ _bt_insert_parent(Relation rel,
 
 		/* Recursively insert into the parent */
 		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
-					   new_item, stack->bts_offset + 1,
+					   new_item, stack->bts_offset + 1, 0,
 					   is_only);
 
 		/* be tidy */
@@ -2304,6 +2536,277 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * Note: if we didn't find any LP_DEAD items, then the page's
 	 * BTP_HAS_GARBAGE hint bit is falsely set.  We do not bother expending a
 	 * separate write to clear it, however.  We will clear it when we split
-	 * the page.
+	 * the page (or when deduplication runs).
 	 */
 }
+
+/*
+ * Try to deduplicate items to free some space.  If we don't proceed with
+ * deduplication, buffer will contain old state of the page.
+ *
+ * 'itemsz' is the size of the inserter caller's incoming/new tuple, not
+ * including line pointer overhead.  This is the amount of space we'll need to
+ * free in order to let caller avoid splitting the page.
+ *
+ * This function should be called after LP_DEAD items were removed by
+ * _bt_vacuum_one_page() to prevent a page split.  (It's possible that we'll
+ * have to kill additional LP_DEAD items, but that should be rare.)
+ */
+static void
+_bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel, Size itemsz)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buffer);
+	Page		newpage;
+	BTPageOpaque oopaque,
+				nopaque;
+	bool		deduplicate = false;
+	BTDedupState *dedupState = NULL;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+	OffsetNumber deletable[MaxOffsetNumber];
+	int			ndeletable = 0;
+
+	/*
+	 * Don't use deduplication for indexes with INCLUDEd columns and unique
+	 * indexes
+	 */
+	deduplicate = (IndexRelationGetNumberOfKeyAttributes(rel) ==
+				   IndexRelationGetNumberOfAttributes(rel) &&
+				   !rel->rd_index->indisunique);
+	if (!deduplicate)
+		return;
+
+	oopaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	/* init deduplication state needed to build posting tuples */
+	dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+	dedupState->ipd = NULL;
+	dedupState->ntuples = 0;
+	dedupState->itupprev = NULL;
+	dedupState->maxitemsize = BTMaxItemSize(page);
+	dedupState->maxpostingsize = 0;
+
+	minoff = P_FIRSTDATAKEY(oopaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/*
+	 * Delete dead tuples if any. We cannot simply skip them in the cycle
+	 * below, because it's necessary to generate special Xlog record
+	 * containing such tuples to compute latestRemovedXid on a standby server
+	 * later.
+	 *
+	 * This should not affect performance, since it only can happen in a rare
+	 * situation when BTP_HAS_GARBAGE flag was not set and _bt_vacuum_one_page
+	 * was not called, or _bt_vacuum_one_page didn't remove all dead items.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+
+		if (ItemIdIsDead(itemid))
+			deletable[ndeletable++] = offnum;
+	}
+
+	if (ndeletable > 0)
+	{
+		/*
+		 * Skip duplication in rare cases where there were LP_DEAD items
+		 * encountered here when that frees sufficient space for caller to
+		 * avoid a page split
+		 */
+		_bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+		if (PageGetFreeSpace(page) >= itemsz)
+		{
+			pfree(dedupState);
+			return;
+		}
+
+		/* Continue with deduplication */
+		minoff = P_FIRSTDATAKEY(oopaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+	}
+
+	/*
+	 * Scan over all items to see which ones can be deduplicated
+	 */
+	newpage = PageGetTempPageCopySpecial(page);
+	nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+	/* Make sure that new page won't have garbage flag set */
+	nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+	/* Copy High Key if any */
+	if (!P_RIGHTMOST(oopaque))
+	{
+		ItemId		hitemid = PageGetItemId(page, P_HIKEY);
+		Size		hitemsz = ItemIdGetLength(hitemid);
+		IndexTuple	hitem = (IndexTuple) PageGetItem(page, hitemid);
+
+		if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to add highkey during deduplication");
+	}
+
+	/*
+	 * Iterate over tuples on the page, try to deduplicate them into posting
+	 * lists and insert into new page.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(!ItemIdIsDead(itemid));
+
+		if (dedupState->itupprev == NULL)
+		{
+			/* Just set up base/first item in first iteration */
+			Assert(offnum == minoff);
+			dedupState->itupprev = CopyIndexTuple(itup);
+			dedupState->itupprev_off = offnum;
+			continue;
+		}
+
+		if (deduplicate &&
+			_bt_keep_natts_fast(rel, dedupState->itupprev, itup) > natts)
+		{
+			int			itup_ntuples;
+			Size		projpostingsz;
+
+			/*
+			 * Tuples are equal.
+			 *
+			 * If posting list does not exceed tuple size limit then append
+			 * the tuple to the pending posting list.  Otherwise, insert it on
+			 * page and continue with this tuple as new pending posting list.
+			 */
+			itup_ntuples = BTreeTupleIsPosting(itup) ?
+				BTreeTupleGetNPosting(itup) : 1;
+
+			/*
+			 * Project size of new posting list that would result from merging
+			 * current tup with pending posting list (could just be prev item
+			 * that's "pending").
+			 *
+			 * This accounting looks odd, but it's correct because ...
+			 */
+			projpostingsz = MAXALIGN(IndexTupleSize(dedupState->itupprev) +
+									 (dedupState->ntuples + itup_ntuples + 1) *
+									 sizeof(ItemPointerData));
+
+			if (projpostingsz <= dedupState->maxitemsize)
+				_bt_stash_item_tid(dedupState, itup, offnum);
+			else
+				_bt_dedup_insert(newpage, dedupState);
+		}
+		else
+		{
+			/*
+			 * Tuples are not equal, or we're done deduplicating this page.
+			 *
+			 * Insert pending posting list on page.  This could just be a
+			 * regular tuple.
+			 */
+			_bt_dedup_insert(newpage, dedupState);
+		}
+
+		pfree(dedupState->itupprev);
+		dedupState->itupprev = CopyIndexTuple(itup);
+		dedupState->itupprev_off = offnum;
+
+		Assert(IndexTupleSize(dedupState->itupprev) <= dedupState->maxitemsize);
+	}
+
+	/* Handle the last item */
+	_bt_dedup_insert(newpage, dedupState);
+
+	/*
+	 * If no items suitable for deduplication were found, newpage must be
+	 * exactly the same as the original page, so just return from function.
+	 */
+	if (dedupState->n_intervals == 0)
+	{
+		pfree(dedupState);
+		return;
+	}
+
+	START_CRIT_SECTION();
+
+	PageRestoreTempPage(newpage, page);
+	MarkBufferDirty(buffer);
+
+	/* Log full page write */
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+		xl_btree_dedup xlrec_dedup;
+
+		xlrec_dedup.n_intervals =  dedupState->n_intervals;
+
+		XLogBeginInsert();
+		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+		XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+		/* only save non-empthy part of the array */
+		if (dedupState->n_intervals > 0)
+			XLogRegisterData((char *) dedupState->dedup_intervals,
+							 dedupState->n_intervals * sizeof(dedupInterval));
+
+		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP_PAGE);
+
+		PageSetLSN(page, recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	/* be tidy */
+	pfree(dedupState);
+}
+
+/*
+ * Add new posting tuple item to the page based on itupprev and saved list of
+ * heap TIDs.
+ */
+void
+_bt_dedup_insert(Page page, BTDedupState *dedupState)
+{
+	IndexTuple	to_insert;
+	OffsetNumber offnum = PageGetMaxOffsetNumber(page);
+
+	if (dedupState->ntuples == 0)
+	{
+		/*
+		 * Use original itupprev, which may or may not be a posting list
+		 * already from some earlier dedup attempt
+		 */
+		to_insert = dedupState->itupprev;
+	}
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(dedupState->itupprev,
+											 dedupState->ipd,
+											 dedupState->ntuples);
+		to_insert = postingtuple;
+		pfree(dedupState->ipd);
+	}
+
+	Assert(IndexTupleSize(dedupState->itupprev) <= dedupState->maxitemsize);
+	/* Add the new item into the page */
+	offnum = OffsetNumberNext(offnum);
+
+	if (PageAddItem(page, (Item) to_insert, IndexTupleSize(to_insert),
+					offnum, false, false) == InvalidOffsetNumber)
+		elog(ERROR, "deduplication failed to add tuple to page");
+
+	if (dedupState->ntuples > 0)
+		pfree(to_insert);
+	dedupState->ntuples = 0;
+}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 268f869a36..5314bbe2a9 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -42,6 +43,11 @@ static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
 								   BlockNumber *target, BlockNumber *rightsib);
 static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
 							   TransactionId latestRemovedXid);
+static TransactionId _bt_compute_xid_horizon_for_tuples(Relation rel,
+														Relation heapRel,
+														Buffer buf,
+														OffsetNumber *itemnos,
+														int nitems);
 
 /*
  *	_bt_initmetapage() -- Fill a page buffer with a correct metapage image
@@ -983,14 +989,52 @@ _bt_page_recyclable(Page page)
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
 					OffsetNumber *itemnos, int nitems,
+					OffsetNumber *remainingoffset,
+					IndexTuple *remaining, int nremaining,
 					BlockNumber lastBlockVacuumed)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	Size		itemsz;
+	Size		remaining_sz = 0;
+	char	   *remaining_buf = NULL;
+
+	/* XLOG stuff, buffer for remainings */
+	if (nremaining && RelationNeedsWAL(rel))
+	{
+		Size		offset = 0;
+
+		for (int i = 0; i < nremaining; i++)
+			remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+		remaining_buf = palloc0(remaining_sz);
+		for (int i = 0; i < nremaining; i++)
+		{
+			itemsz = IndexTupleSize(remaining[i]);
+			memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+			offset += MAXALIGN(itemsz);
+		}
+		Assert(offset == remaining_sz);
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
+	/* Handle posting tuples here */
+	for (int i = 0; i < nremaining; i++)
+	{
+		/* At first, delete the old tuple. */
+		PageIndexTupleDelete(page, remainingoffset[i]);
+
+		itemsz = IndexTupleSize(remaining[i]);
+		itemsz = MAXALIGN(itemsz);
+
+		/* Add tuple with remaining ItemPointers to the page. */
+		if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to rewrite posting list item in index while doing vacuum");
+	}
+
 	/* Fix the page */
 	if (nitems > 0)
 		PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1064,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+		xlrec_vacuum.nremaining = nremaining;
+		xlrec_vacuum.ndeleted = nitems;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1079,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		if (nitems > 0)
 			XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
 
+		/*
+		 * Here we should save offnums and remaining tuples themselves. It's
+		 * important to restore them in correct order. At first, we must
+		 * handle remaining tuples and only after that other deleted items.
+		 */
+		if (nremaining > 0)
+		{
+			Assert(remaining_buf != NULL);
+			XLogRegisterBufData(0, (char *) remainingoffset,
+								nremaining * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, remaining_buf, remaining_sz);
+		}
+
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
 		PageSetLSN(page, recptr);
@@ -1041,6 +1100,91 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	END_CRIT_SECTION();
 }
 
+/*
+ * Get the latestRemovedXid from the table entries pointed at by the index
+ * tuples being deleted.
+ *
+ * This is a version of index_compute_xid_horizon_for_tuples() specialized to
+ * nbtree, which can handle posting lists.
+ */
+static TransactionId
+_bt_compute_xid_horizon_for_tuples(Relation rel, Relation heapRel,
+								   Buffer buf, OffsetNumber *itemnos,
+								   int nitems)
+{
+	ItemPointerData *ttids;
+	TransactionId latestRemovedXid = InvalidTransactionId;
+	Page		page = BufferGetPage(buf);
+	int			arraynitems;
+	int			finalnitems;
+
+	/*
+	 * Initial size of array can fit everything when it turns out that are no
+	 * posting lists
+	 */
+	arraynitems = nitems;
+	ttids = (ItemPointerData *) palloc(sizeof(ItemPointerData) * arraynitems);
+
+	finalnitems = 0;
+	/* identify what the index tuples about to be deleted point to */
+	for (int i = 0; i < nitems; i++)
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+
+		itemid = PageGetItemId(page, itemnos[i]);
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(ItemIdIsDead(itemid));
+
+		if (!BTreeTupleIsPosting(itup))
+		{
+			/* Make sure that we have space for additional heap TID */
+			if (finalnitems + 1 > arraynitems)
+			{
+				arraynitems = arraynitems * 2;
+				ttids = (ItemPointerData *)
+					repalloc(ttids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			Assert(ItemPointerIsValid(&itup->t_tid));
+			ItemPointerCopy(&itup->t_tid, &ttids[finalnitems]);
+			finalnitems++;
+		}
+		else
+		{
+			int			nposting = BTreeTupleGetNPosting(itup);
+
+			/* Make sure that we have space for additional heap TIDs */
+			if (finalnitems + nposting > arraynitems)
+			{
+				arraynitems = Max(arraynitems * 2, finalnitems + nposting);
+				ttids = (ItemPointerData *)
+					repalloc(ttids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			for (int j = 0; j < nposting; j++)
+			{
+				ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+				Assert(ItemPointerIsValid(htid));
+				ItemPointerCopy(htid, &ttids[finalnitems]);
+				finalnitems++;
+			}
+		}
+	}
+
+	Assert(finalnitems >= nitems);
+
+	/* determine the actual xid horizon */
+	latestRemovedXid =
+		table_compute_xid_horizon_for_tuples(heapRel, ttids, finalnitems);
+
+	pfree(ttids);
+
+	return latestRemovedXid;
+}
+
 /*
  * Delete item(s) from a btree page during single-page cleanup.
  *
@@ -1067,8 +1211,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 
 	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
 		latestRemovedXid =
-			index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
-												 itemnos, nitems);
+			_bt_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+											   itemnos, nitems);
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd5289ad..67595319d7 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static ItemPointer btreevacuumPosting(BTVacState *vstate, IndexTuple itup,
+									  int *nremaining);
 
 
 /*
@@ -263,8 +265,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 				 */
 				if (so->killedItems == NULL)
 					so->killedItems = (int *)
-						palloc(MaxIndexTuplesPerPage * sizeof(int));
-				if (so->numKilled < MaxIndexTuplesPerPage)
+						palloc(MaxPostingIndexTuplesPerPage * sizeof(int));
+				if (so->numKilled < MaxPostingIndexTuplesPerPage)
 					so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 			}
 
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 								 RBM_NORMAL, info->strategy);
 		LockBufferForCleanup(buf);
 		_bt_checkpage(rel, buf);
-		_bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+		_bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+							vstate.lastBlockVacuumed);
 		_bt_relbuf(rel, buf);
 	}
 
@@ -1193,6 +1196,9 @@ restart:
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
+		IndexTuple	remaining[MaxOffsetNumber];
+		OffsetNumber remainingoffset[MaxOffsetNumber];
+		int			nremaining;
 
 		/*
 		 * Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1235,7 @@ restart:
 		 * callback function.
 		 */
 		ndeletable = 0;
+		nremaining = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
 		if (callback)
@@ -1242,31 +1249,79 @@ restart:
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
-				/*
-				 * During Hot Standby we currently assume that
-				 * XLOG_BTREE_VACUUM records do not produce conflicts. That is
-				 * only true as long as the callback function depends only
-				 * upon whether the index tuple refers to heap tuples removed
-				 * in the initial heap scan. When vacuum starts it derives a
-				 * value of OldestXmin. Backends taking later snapshots could
-				 * have a RecentGlobalXmin with a later xid than the vacuum's
-				 * OldestXmin, so it is possible that row versions deleted
-				 * after OldestXmin could be marked as killed by other
-				 * backends. The callback function *could* look at the index
-				 * tuple state in isolation and decide to delete the index
-				 * tuple, though currently it does not. If it ever did, we
-				 * would need to reconsider whether XLOG_BTREE_VACUUM records
-				 * should cause conflicts. If they did cause conflicts they
-				 * would be fairly harsh conflicts, since we haven't yet
-				 * worked out a way to pass a useful value for
-				 * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
-				 * applies to *any* type of index that marks index tuples as
-				 * killed.
-				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				if (BTreeTupleIsPosting(itup))
+				{
+					int			nnewipd = 0;
+					ItemPointer newipd = NULL;
+
+					newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+					if (nnewipd == 0)
+					{
+						/*
+						 * All TIDs from posting list must be deleted, we can
+						 * delete whole tuple in a regular way.
+						 */
+						deletable[ndeletable++] = offnum;
+					}
+					else if (nnewipd == BTreeTupleGetNPosting(itup))
+					{
+						/*
+						 * All TIDs from posting tuple must remain. Do
+						 * nothing, just cleanup.
+						 */
+						pfree(newipd);
+					}
+					else if (nnewipd < BTreeTupleGetNPosting(itup))
+					{
+						/* Some TIDs from posting tuple must remain. */
+						Assert(nnewipd > 0);
+						Assert(newipd != NULL);
+
+						/*
+						 * Form new tuple that contains only remaining TIDs.
+						 * Remember this tuple and the offset of the old tuple
+						 * to update it in place.
+						 */
+						remainingoffset[nremaining] = offnum;
+						remaining[nremaining] =
+							BTreeFormPostingTuple(itup, newipd, nnewipd);
+						nremaining++;
+						pfree(newipd);
+
+						Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+					}
+				}
+				else
+				{
+					htup = &(itup->t_tid);
+
+					/*
+					 * During Hot Standby we currently assume that
+					 * XLOG_BTREE_VACUUM records do not produce conflicts.
+					 * That is only true as long as the callback function
+					 * depends only upon whether the index tuple refers to
+					 * heap tuples removed in the initial heap scan. When
+					 * vacuum starts it derives a value of OldestXmin.
+					 * Backends taking later snapshots could have a
+					 * RecentGlobalXmin with a later xid than the vacuum's
+					 * OldestXmin, so it is possible that row versions deleted
+					 * after OldestXmin could be marked as killed by other
+					 * backends. The callback function *could* look at the
+					 * index tuple state in isolation and decide to delete the
+					 * index tuple, though currently it does not. If it ever
+					 * did, we would need to reconsider whether
+					 * XLOG_BTREE_VACUUM records should cause conflicts. If
+					 * they did cause conflicts they would be fairly harsh
+					 * conflicts, since we haven't yet worked out a way to
+					 * pass a useful value for latestRemovedXid on the
+					 * XLOG_BTREE_VACUUM records. This applies to *any* type
+					 * of index that marks index tuples as killed.
+					 */
+					if (callback(htup, callback_state))
+						deletable[ndeletable++] = offnum;
+				}
 			}
 		}
 
@@ -1274,7 +1329,7 @@ restart:
 		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
 		 * call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nremaining > 0)
 		{
 			/*
 			 * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1346,7 @@ restart:
 			 * that.
 			 */
 			_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+								remainingoffset, remaining, nremaining,
 								vstate->lastBlockVacuumed);
 
 			/*
@@ -1375,6 +1431,41 @@ restart:
 	}
 }
 
+/*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+	int			remaining = 0;
+	int			nitem = BTreeTupleGetNPosting(itup);
+	ItemPointer tmpitems = NULL,
+				items = BTreeTupleGetPosting(itup);
+
+	/*
+	 * Check each tuple in the posting list, save alive tuples into tmpitems
+	 */
+	for (int i = 0; i < nitem; i++)
+	{
+		if (vstate->callback(items + i, vstate->callback_state))
+			continue;
+
+		if (tmpitems == NULL)
+			tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+		tmpitems[remaining++] = items[i];
+	}
+
+	*nremaining = remaining;
+	return tmpitems;
+}
+
 /*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8e512461a0..c78c8e67b5 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int	_bt_binsrch_posting(BTScanInsert key, Page page,
+								OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static void _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+								  OffsetNumber offnum, ItemPointer iptr,
+								  IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+									   OffsetNumber offnum, ItemPointer iptr,
+									   IndexTuple itup);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -434,7 +442,10 @@ _bt_binsrch(Relation rel,
  * low) makes bounds invalid.
  *
  * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's in_posting_offset field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
  */
 OffsetNumber
 _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +464,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 	Assert(P_ISLEAF(opaque));
 	Assert(!key->nextkey);
+	Assert(insertstate->in_posting_offset == 0);
 
 	if (!insertstate->bounds_valid)
 	{
@@ -509,6 +521,17 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 			if (result != 0)
 				stricthigh = high;
 		}
+
+		/*
+		 * If tuple at offset located by binary search is a posting list whose
+		 * TID range overlaps with caller's scantid, perform posting list
+		 * binary search to set in_posting_offset for caller.  Caller must
+		 * split the posting list when in_posting_offset is set.  This should
+		 * happen infrequently.
+		 */
+		if (unlikely(result == 0 && key->scantid != NULL))
+			insertstate->in_posting_offset =
+				_bt_binsrch_posting(key, page, mid);
 	}
 
 	/*
@@ -528,6 +551,68 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	return low;
 }
 
+/*----------
+ *	_bt_binsrch_posting() -- posting list binary search.
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+	IndexTuple	itup;
+	ItemId		itemid;
+	int			low,
+				high,
+				mid,
+				res;
+
+	/*
+	 * If this isn't a posting tuple, then the index must be corrupt (if it is
+	 * an ordinary non-pivot tuple then there must be an existing tuple with a
+	 * heap TID that equals inserter's new heap TID/scantid).  Defensively
+	 * check that tuple is a posting list tuple whose posting list range
+	 * includes caller's scantid.
+	 *
+	 * (This is also needed because contrib/amcheck's rootdescend option needs
+	 * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+	 */
+	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+	Assert(!key->nextkey);
+	Assert(key->scantid != NULL);
+	itemid = PageGetItemId(page, offnum);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+	if (!BTreeTupleIsPosting(itup))
+		return 0;
+
+	/*
+	 * In the unlikely event that posting list tuple has LP_DEAD bit set,
+	 * signal to caller that it should kill the item and restart its binary
+	 * search.
+	 */
+	if (ItemIdIsDead(itemid))
+		return -1;
+
+	/* "high" is past end of posting list for loop invariant */
+	low = 0;
+	high = BTreeTupleGetNPosting(itup);
+	Assert(high >= 2);
+
+	while (high > low)
+	{
+		mid = low + ((high - low) / 2);
+		res = ItemPointerCompare(key->scantid,
+								 BTreeTupleGetPostingN(itup, mid));
+
+		if (res >= 1)
+			low = mid + 1;
+		else
+			high = mid;
+	}
+
+	return low;
+}
+
 /*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
@@ -537,9 +622,18 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
  *			<0 if scankey < tuple at offnum;
  *			 0 if scankey == tuple at offnum;
  *			>0 if scankey > tuple at offnum.
- *		NULLs in the keys are treated as sortable values.  Therefore
- *		"equality" does not necessarily mean that the item should be
- *		returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values.  Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key.  Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid.  There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
+ *
+ * It is generally guaranteed that any possible scankey with scantid set
+ * will have zero or one tuples in the index that are considered equal
+ * here.
  *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
@@ -563,6 +657,7 @@ _bt_compare(Relation rel,
 	ScanKey		scankey;
 	int			ncmpkey;
 	int			ntupatts;
+	int32		result;
 
 	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +692,6 @@ _bt_compare(Relation rel,
 	{
 		Datum		datum;
 		bool		isNull;
-		int32		result;
 
 		datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
 
@@ -713,8 +807,24 @@ _bt_compare(Relation rel,
 	if (heapTid == NULL)
 		return 1;
 
+	/*
+	 * scankey must be treated as equal to a posting list tuple if its scantid
+	 * value falls within the range of the posting list.  In all other cases
+	 * there can only be a single heap TID value, which is compared directly
+	 * as a simple scalar value.
+	 */
 	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	return ItemPointerCompare(key->scantid, heapTid);
+	result = ItemPointerCompare(key->scantid, heapTid);
+	if (!BTreeTupleIsPosting(itup) || result <= 0)
+		return result;
+	else
+	{
+		result = ItemPointerCompare(key->scantid, BTreeTupleGetMaxTID(itup));
+		if (result > 0)
+			return 1;
+	}
+
+	return 0;
 }
 
 /*
@@ -1451,6 +1561,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	/* initialize tuple workspace to empty */
 	so->currPos.nextTupleOffset = 0;
+	so->currPos.postingTupleOffset = 0;
 
 	/*
 	 * Now that the current page has been made consistent, the macro should be
@@ -1485,8 +1596,30 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				if (!BTreeTupleIsPosting(itup))
+				{
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+				else
+				{
+					/*
+					 * Setup state to return posting list, and save first
+					 * "logical" tuple
+					 */
+					_bt_setuppostingitems(so, itemIndex, offnum,
+										  BTreeTupleGetPostingN(itup, 0),
+										  itup);
+					itemIndex++;
+					/* Save additional posting list "logical" tuples */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											itup);
+						itemIndex++;
+					}
+				}
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1519,7 +1652,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1527,7 +1660,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxPostingIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1569,8 +1702,37 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (passes_quals && tuple_alive)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				if (!BTreeTupleIsPosting(itup))
+				{
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+				else
+				{
+					int			i = BTreeTupleGetNPosting(itup) - 1;
+
+					/*
+					 * Setup state to return posting list, and save last
+					 * "logical" tuple from posting list (since it's the first
+					 * that will be returned to scan).
+					 */
+					itemIndex--;
+					_bt_setuppostingitems(so, itemIndex, offnum,
+										  BTreeTupleGetPostingN(itup, i--),
+										  itup);
+
+					/*
+					 * Return posting list "logical" tuples -- do this in
+					 * descending order, to match overall scan order
+					 */
+					for (; i >= 0; i--)
+					{
+						itemIndex--;
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											itup);
+					}
+				}
 			}
 			if (!continuescan)
 			{
@@ -1584,8 +1746,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1760,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1610,6 +1774,61 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 	}
 }
 
+/*
+ * Setup state to save posting items from a single posting list tuple.  Saves
+ * the logical tuple that will be returned to scan first in passing.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for logical tuple
+ * that is returned to scan first.  Second or subsequent heap TID for posting
+ * list should be saved by calling _bt_savepostingitem().
+ */
+static void
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					  ItemPointer iptr, IndexTuple itup)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *iptr;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		/* Save a truncated version of the IndexTuple */
+		Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+		itupsz = MAXALIGN(itupsz);
+		currItem->tupleOffset = so->currPos.nextTupleOffset;
+		memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+		so->currPos.nextTupleOffset += itupsz;
+		so->currPos.postingTupleOffset = currItem->tupleOffset;
+	}
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer iptr, IndexTuple itup)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *iptr;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		/*
+		 * Have index-only scans return the same truncated IndexTuple for
+		 * every logical tuple that originates from the same posting list
+		 */
+		currItem->tupleOffset = so->currPos.postingTupleOffset;
+	}
+}
+
 /*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index ab19692006..4198770303 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -288,6 +288,8 @@ static void _bt_sortaddtup(Page page, Size itemsize,
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 						 IndexTuple itup);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void _bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+								 BTDedupState *dedupState);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
 static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -830,6 +832,8 @@ _bt_sortaddtup(Page page,
  * the high key is to be truncated, offset 1 is deleted, and we insert
  * the truncated high key at offset 1.
  *
+ * Note that itup may be a posting list tuple.
+ *
  * 'last' pointer indicates the last offset added to the page.
  *----------
  */
@@ -963,6 +967,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * Overwrite the old item with new truncated high key directly.
 			 * oitup is already located at the physical beginning of tuple
 			 * space, so this should directly reuse the existing tuple space.
+			 *
+			 * If lastleft tuple was a posting tuple, we'll truncate its
+			 * posting list in _bt_truncate as well. Note that it is also
+			 * applicable only to leaf pages, since internal pages never
+			 * contain posting tuples.
 			 */
 			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
 			lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1002,6 +1011,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		 * the minimum key for the new page.
 		 */
 		state->btps_minkey = CopyIndexTuple(oitup);
+		Assert(BTreeTupleIsPivot(state->btps_minkey));
 
 		/*
 		 * Set the sibling links for both pages.
@@ -1043,6 +1053,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(state->btps_minkey == NULL);
 		state->btps_minkey = CopyIndexTuple(itup);
 		/* _bt_sortaddtup() will perform full truncation later */
+		BTreeTupleClearBtIsPosting(state->btps_minkey);
 		BTreeTupleSetNAtts(state->btps_minkey, 0);
 	}
 
@@ -1127,6 +1138,136 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
 }
 
+/*
+ * Add new tuple (posting or non-posting) to the page while building index.
+ */
+static void
+_bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+					 BTDedupState *dedupState)
+{
+	IndexTuple	to_insert;
+
+	/* Return, if there is no tuple to insert */
+	if (state == NULL)
+		return;
+
+	if (dedupState->ntuples == 0)
+		to_insert = dedupState->itupprev;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(dedupState->itupprev,
+											 dedupState->ipd,
+											 dedupState->ntuples);
+		to_insert = postingtuple;
+		pfree(dedupState->ipd);
+	}
+
+	_bt_buildadd(wstate, state, to_insert);
+
+	if (dedupState->ntuples > 0)
+		pfree(to_insert);
+	dedupState->ntuples = 0;
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in dedupState.
+ *
+ * 'itup' is current tuple on page, which comes immediately after equal
+ * 'itupprev' tuple stashed in dedup state at the point we're called.
+ *
+ * Helper function for _bt_load() and _bt_dedup_one_page(), called when it
+ * becomes clear that pending itupprev item will be part of a new/pending
+ * posting list, or when a pending/new posting list will contain a new heap
+ * TID from itup.
+ *
+ * Note: caller is responsible for the BTMaxItemSize() check.
+ */
+void
+_bt_stash_item_tid(BTDedupState *dedupState, IndexTuple itup,
+				   OffsetNumber itup_offnum)
+{
+	int			nposting = 0;
+
+	if (dedupState->ntuples == 0)
+	{
+		dedupState->ipd = palloc0(dedupState->maxitemsize);
+
+		/*
+		 * itupprev hasn't had its posting list TIDs copied into ipd yet (must
+		 * have been first on page and/or in new posting list?).  Do so now.
+		 *
+		 * This is delayed because it wasn't initially clear whether or not
+		 * itupprev would be merged with the next tuple, or stay as-is.  By
+		 * now caller compared it against itup and found that it was equal, so
+		 * we can go ahead and add its TIDs.
+		 */
+		if (!BTreeTupleIsPosting(dedupState->itupprev))
+		{
+			memcpy(dedupState->ipd, dedupState->itupprev,
+				   sizeof(ItemPointerData));
+			dedupState->ntuples++;
+		}
+		else
+		{
+			/* if itupprev is posting, add all its TIDs to the posting list */
+			nposting = BTreeTupleGetNPosting(dedupState->itupprev);
+			memcpy(dedupState->ipd,
+				   BTreeTupleGetPosting(dedupState->itupprev),
+				   sizeof(ItemPointerData) * nposting);
+			dedupState->ntuples += nposting;
+		}
+
+		/* Save info about deduplicated items for future xlog record */
+		dedupState->n_intervals++;
+		/* Save offnum of the first item belongin to the group */
+		dedupState->dedup_intervals[dedupState->n_intervals - 1].from = dedupState->itupprev_off;
+		/*
+		 * Update the number of deduplicated items, belonging to this group.
+		 * Count each item just once, no matter if it was posting tuple or not
+		 */
+		dedupState->dedup_intervals[dedupState->n_intervals - 1].ntups++;
+	}
+
+	/*
+	 * Add current tup to ipd for pending posting list for new version of
+	 * page.
+	 */
+	if (!BTreeTupleIsPosting(itup))
+	{
+		memcpy(dedupState->ipd + dedupState->ntuples, itup,
+			   sizeof(ItemPointerData));
+		dedupState->ntuples++;
+	}
+	else
+	{
+		/*
+		 * if tuple is posting, add all its TIDs to the pending list that will
+		 * become new posting list later on
+		 */
+		nposting = BTreeTupleGetNPosting(itup);
+		memcpy(dedupState->ipd + dedupState->ntuples,
+			   BTreeTupleGetPosting(itup),
+			   sizeof(ItemPointerData) * nposting);
+		dedupState->ntuples += nposting;
+	}
+
+	/*
+	 * Update the number of deduplicated items, belonging to this group.
+	 * Count each item just once, no matter if it was posting tuple or not
+	 */
+	dedupState->dedup_intervals[dedupState->n_intervals - 1].ntups++;
+
+	/* TODO just a debug message. delete it in final version of the patch */
+	if (itup_offnum != InvalidOffsetNumber)
+		elog(DEBUG4, "_bt_stash_item_tid. N %d : from %u ntups %u",
+				dedupState->n_intervals,
+				dedupState->dedup_intervals[dedupState->n_intervals - 1].from,
+				dedupState->dedup_intervals[dedupState->n_intervals - 1].ntups);
+}
+
 /*
  * Read tuples in correct sort order from tuplesort, and load them into
  * btree leaves.
@@ -1141,9 +1282,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	bool		load1;
 	TupleDesc	tupdes = RelationGetDescr(wstate->index);
 	int			i,
-				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+				natts = IndexRelationGetNumberOfAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		deduplicate = false;
+	BTDedupState *dedupState = NULL;
+
+	/*
+	 * Don't use deduplication for indexes with INCLUDEd columns and unique
+	 * indexes
+	 */
+	deduplicate = (IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+				   IndexRelationGetNumberOfAttributes(wstate->index) &&
+				   !wstate->index->rd_index->indisunique);
 
 	if (merge)
 	{
@@ -1257,19 +1409,88 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	}
 	else
 	{
-		/* merge is unnecessary */
-		while ((itup = tuplesort_getindextuple(btspool->sortstate,
-											   true)) != NULL)
+		if (!deduplicate)
 		{
-			/* When we see first tuple, create first index page */
-			if (state == NULL)
-				state = _bt_pagestate(wstate, 0);
+			/* merge is unnecessary */
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+					state = _bt_pagestate(wstate, 0);
 
-			_bt_buildadd(wstate, state, itup);
+				_bt_buildadd(wstate, state, itup);
 
-			/* Report progress */
-			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
-										 ++tuples_done);
+				/* Report progress */
+				pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+											 ++tuples_done);
+			}
+		}
+		else
+		{
+			/* init deduplication state needed to build posting tuples */
+			dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+			dedupState->ipd = NULL;
+			dedupState->ntuples = 0;
+			dedupState->itupprev = NULL;
+			dedupState->maxitemsize = 0;
+			dedupState->maxpostingsize = 0;
+
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+				{
+					state = _bt_pagestate(wstate, 0);
+					dedupState->maxitemsize = BTMaxItemSize(state->btps_page);
+				}
+
+				if (dedupState->itupprev != NULL)
+				{
+					int			n_equal_atts = _bt_keep_natts_fast(wstate->index,
+																   dedupState->itupprev, itup);
+
+					if (n_equal_atts > natts)
+					{
+						/*
+						 * Tuples are equal. Create or update posting.
+						 *
+						 * Else If posting is too big, insert it on page and
+						 * continue.
+						 */
+						if ((dedupState->ntuples + 1) * sizeof(ItemPointerData) <
+							dedupState->maxpostingsize)
+							_bt_stash_item_tid(dedupState, itup, InvalidOffsetNumber);
+						else
+							_bt_buildadd_posting(wstate, state, dedupState);
+					}
+					else
+					{
+						/*
+						 * Tuples are not equal. Insert itupprev into index.
+						 * Save current tuple for the next iteration.
+						 */
+						_bt_buildadd_posting(wstate, state, dedupState);
+					}
+				}
+
+				/*
+				 * Save the tuple to compare it with the next one and maybe
+				 * unite them into a posting tuple.
+				 */
+				if (dedupState->itupprev)
+					pfree(dedupState->itupprev);
+				dedupState->itupprev = CopyIndexTuple(itup);
+
+				/* compute max size of posting list */
+				dedupState->maxpostingsize = dedupState->maxitemsize -
+					IndexInfoFindDataOffset(dedupState->itupprev->t_info) -
+					MAXALIGN(IndexTupleSize(dedupState->itupprev));
+			}
+
+			/* Handle the last item */
+			_bt_buildadd_posting(wstate, state, dedupState);
 		}
 	}
 
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1c1029b6c4..54cecc85c5 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
 	state.minfirstrightsz = SIZE_MAX;
 	state.newitemoff = newitemoff;
 
+	/* newitem cannot be a posting list item */
+	Assert(!BTreeTupleIsPosting(newitem));
+
 	/*
 	 * maxsplits should never exceed maxoff because there will be at most as
 	 * many candidate split points as there are points _between_ tuples, once
@@ -459,17 +462,52 @@ _bt_recsplitloc(FindSplitData *state,
 	int16		leftfree,
 				rightfree;
 	Size		firstrightitemsz;
+	Size		postingsubhikey = 0;
 	bool		newitemisfirstonright;
 
 	/* Is the new item going to be the first item on the right page? */
 	newitemisfirstonright = (firstoldonright == state->newitemoff
 							 && !newitemonleft);
 
+	/*
+	 * FIXME: Accessing every single tuple like this adds cycles to cases that
+	 * cannot possibly benefit (i.e. cases where we know that there cannot be
+	 * posting lists).  Maybe we should add a way to not bother when we are
+	 * certain that this is the case.
+	 *
+	 * We could either have _bt_split() pass us a flag, or invent a page flag
+	 * that indicates that the page might have posting lists, as an
+	 * optimization.  There is no shortage of btpo_flags bits for stuff like
+	 * this.
+	 */
 	if (newitemisfirstonright)
+	{
 		firstrightitemsz = state->newitemsz;
+
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+			postingsubhikey = IndexTupleSize(state->newitem) -
+				BTreeTupleGetPostingOffset(state->newitem);
+	}
 	else
+	{
 		firstrightitemsz = firstoldonrightsz;
 
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf)
+		{
+			ItemId		itemid;
+			IndexTuple	newhighkey;
+
+			itemid = PageGetItemId(state->page, firstoldonright);
+			newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+			if (BTreeTupleIsPosting(newhighkey))
+				postingsubhikey = IndexTupleSize(newhighkey) -
+					BTreeTupleGetPostingOffset(newhighkey);
+		}
+	}
+
 	/* Account for all the old tuples */
 	leftfree = state->leftspace - olddataitemstoleft;
 	rightfree = state->rightspace -
@@ -492,9 +530,13 @@ _bt_recsplitloc(FindSplitData *state,
 	 * adding a heap TID to the left half's new high key when splitting at the
 	 * leaf level.  In practice the new high key will often be smaller and
 	 * will rarely be larger, but conservatively assume the worst case.
+	 * Truncation always truncates away any posting list that appears in the
+	 * first right tuple, though, so it's safe to subtract that overhead
+	 * (while still conservatively assuming that truncation might have to add
+	 * back a single heap TID using the pivot tuple heap TID representation).
 	 */
 	if (state->is_leaf)
-		leftfree -= (int16) (firstrightitemsz +
+		leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
 							 MAXALIGN(sizeof(ItemPointerData)));
 	else
 		leftfree -= (int16) firstrightitemsz;
@@ -691,7 +733,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
 	tup = (IndexTuple) PageGetItem(state->page, itemid);
 	/* Do cheaper test first */
-	if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+	if (BTreeTupleIsPosting(tup) ||
+		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
 	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index bc855dd25d..d4710501a1 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -97,8 +97,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	indoption = rel->rd_indoption;
 	tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
 
-	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
 	/*
 	 * We'll execute search using scan key constructed on key columns.
 	 * Truncated attributes and non-key attributes are omitted from the final
@@ -110,9 +108,20 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	key->anynullkeys = false;	/* initial assumption */
 	key->nextkey = false;
 	key->pivotsearch = false;
+	key->scantid = NULL;
 	key->keysz = Min(indnkeyatts, tupnatts);
-	key->scantid = key->heapkeyspace && itup ?
-		BTreeTupleGetHeapTID(itup) : NULL;
+
+	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+	Assert(!itup || !BTreeTupleIsPosting(itup) || key->heapkeyspace);
+
+	/*
+	 * When caller passes a tuple with a heap TID, use it to set scantid. Note
+	 * that this handles posting list tuples by setting scantid to the lowest
+	 * heap TID in the posting list.
+	 */
+	if (itup && key->heapkeyspace)
+		key->scantid = BTreeTupleGetHeapTID(itup);
+
 	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
@@ -1786,10 +1795,35 @@ _bt_killitems(IndexScanDesc scan)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
+			bool		killtuple = false;
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			if (BTreeTupleIsPosting(ituple))
 			{
-				/* found the item */
+				int			pi = i + 1;
+				int			nposting = BTreeTupleGetNPosting(ituple);
+				int			j;
+
+				for (j = 0; j < nposting; j++)
+				{
+					ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+					if (!ItemPointerEquals(item, &kitem->heapTid))
+						break;	/* out of posting list loop */
+
+					/* Read-ahead to later kitems */
+					if (pi < numKilled)
+						kitem = &so->currPos.items[so->killedItems[pi++]];
+				}
+
+				if (j == nposting)
+					killtuple = true;
+			}
+			else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+				killtuple = true;
+
+			if (killtuple)
+			{
+				/* found the item/all posting list items */
 				ItemIdMarkDead(iid);
 				killedsomething = true;
 				break;			/* out of inner search loop */
@@ -2140,6 +2174,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(firstright))
+		{
+			BTreeTupleClearBtIsPosting(pivot);
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			if (keepnatts == natts)
+			{
+				/*
+				 * index_truncate_tuple() just returned a copy of the
+				 * original, so make sure that the size of the new pivot tuple
+				 * doesn't have posting list overhead
+				 */
+				pivot->t_info &= ~INDEX_SIZE_MASK;
+				pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+			}
+		}
+
+		Assert(!BTreeTupleIsPosting(pivot));
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2156,6 +2208,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * attribute to the new pivot tuple.
 		 */
 		Assert(natts != nkeyatts);
+		Assert(!BTreeTupleIsPosting(lastleft) &&
+			   !BTreeTupleIsPosting(firstright));
 		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
 		tidpivot = palloc0(newsize);
 		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2163,6 +2217,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		pfree(pivot);
 		pivot = tidpivot;
 	}
+	else if (BTreeTupleIsPosting(firstright))
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal.  We
+		 * can always truncate away a posting list, though.
+		 *
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+			MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+		pivot->t_info &= ~INDEX_SIZE_MASK;
+		pivot->t_info |= newsize;
+		BTreeTupleClearBtIsPosting(pivot);
+		BTreeTupleSetAltHeapTID(pivot);
+	}
 	else
 	{
 		/*
@@ -2170,7 +2242,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * It's necessary to add a heap TID attribute to the new pivot tuple.
 		 */
 		Assert(natts == nkeyatts);
-		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+		newsize = MAXALIGN(IndexTupleSize(firstright)) +
+			MAXALIGN(sizeof(ItemPointerData));
 		pivot = palloc0(newsize);
 		memcpy(pivot, firstright, IndexTupleSize(firstright));
 	}
@@ -2188,6 +2261,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * nbtree (e.g., there is no pg_attribute entry).
 	 */
 	Assert(itup_key->heapkeyspace);
+	Assert(!BTreeTupleIsPosting(pivot));
 	pivot->t_info &= ~INDEX_SIZE_MASK;
 	pivot->t_info |= newsize;
 
@@ -2200,7 +2274,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2211,9 +2285,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+							  BTreeTupleGetHeapTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #else
 
 	/*
@@ -2226,7 +2303,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2235,7 +2312,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2316,15 +2394,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
  * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal (once detoasted).  Similarly, result may
- * differ from the _bt_keep_natts result when either tuple has TOASTed datums,
- * though this is barely possible in practice.
+ * unless they're bitwise equal after detoasting.
  *
  * These issues must be acceptable to callers, typically because they're only
  * concerned about making suffix truncation as effective as possible without
  * leaving excessive amounts of free space on either side of page split.
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts.
+ *
+ * When an index only uses opclasses where equality is "precise", this
+ * function is guaranteed to give the same result as _bt_keep_natts().  This
+ * makes it safe to use this function to determine whether or not two tuples
+ * can be folded together into a single posting tuple.  Posting list
+ * deduplication cannot be used with nondeterministic collations for this
+ * reason.
+ *
+ * FIXME: Actually invent the needed "equality-is-precise" opclass
+ * infrastructure.  See dedicated -hackers thread:
+ *
+ * https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2349,8 +2437,38 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 		if (isNull1 != isNull2)
 			break;
 
+		/*
+		 * XXX: The ideal outcome from the point of view of the posting list
+		 * patch is that the definition of an opclass with "precise equality"
+		 * becomes: "equality operator function must give exactly the same
+		 * answer as datum_image_eq() would, provided that we aren't using a
+		 * nondeterministic collation". (Nondeterministic collations are
+		 * clearly not compatible with deduplication.)
+		 *
+		 * This will be a lot faster than actually using the authoritative
+		 * insertion scankey in some cases.  This approach also seems more
+		 * elegant, since suffix truncation gets to follow exactly the same
+		 * definition of "equal" as posting list deduplication -- there is a
+		 * subtle interplay between deduplication and suffix truncation, and
+		 * it would be nice to know for sure that they have exactly the same
+		 * idea about what equality is.
+		 *
+		 * This ideal outcome still avoids problems with TOAST.  We cannot
+		 * repeat bugs like the amcheck bug that was fixed in bugfix commit
+		 * eba775345d23d2c999bbb412ae658b6dab36e3e8.  datum_image_eq()
+		 * considers binary equality, though only _after_ each datum is
+		 * decompressed.
+		 *
+		 * If this ideal solution isn't possible, then we can fall back on
+		 * defining "precise equality" as: "type's output function must
+		 * produce identical textual output for any two datums that compare
+		 * equal when using a safe/equality-is-precise operator class (unless
+		 * using a nondeterministic collation)".  That would mean that we'd
+		 * have to make deduplication call _bt_keep_natts() instead (or some
+		 * other function that uses authoritative insertion scankey).
+		 */
 		if (!isNull1 &&
-			!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+			!datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
 			break;
 
 		keepnatts++;
@@ -2402,22 +2520,30 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
+	/* !heapkeyspace indexes do not support deduplication */
+	if (!heapkeyspace && BTreeTupleIsPosting(itup))
+		return false;
+
+	/* INCLUDE indexes do not support deduplication */
+	if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+		return false;
+
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
 			/*
-			 * Non-pivot tuples currently never use alternative heap TID
-			 * representation -- even those within heapkeyspace indexes
+			 * Non-pivot tuple should never be explicitly marked as a pivot
+			 * tuple
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated.  (Note that tupnatts must have been
-			 * inferred, rather than coming from an explicit on-disk
-			 * representation.)
+			 * inferred, even with a posting list tuple, because only pivot
+			 * tuples store tupnatts directly.)
 			 */
 			return tupnatts == natts;
 		}
@@ -2461,12 +2587,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * non-zero, or when there is no explicit representation and the
 			 * tuple is evidently not a pre-pg_upgrade tuple.
 			 *
-			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
-			 * that to decide if the tuple is a pre-v11 tuple.
+			 * Prior to v11, downlinks always had P_HIKEY as their offset.
+			 * Accept that as an alternative indication of a valid
+			 * !heapkeyspace negative infinity tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
-				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+				ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
 		}
 		else
 		{
@@ -2492,7 +2618,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
+		return false;
+
+	/* Pivot tuple should not use posting list representation (redundant) */
+	if (BTreeTupleIsPosting(itup))
 		return false;
 
 	/*
@@ -2562,11 +2692,87 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					BTMaxItemSizeNoHeapTid(page),
 					RelationGetRelationName(rel)),
 			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
-					   ItemPointerGetBlockNumber(&newtup->t_tid),
-					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+					   ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
 					   RelationGetRelationName(heap)),
 			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
 					 "Consider a function index of an MD5 hash of the value, "
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Given a basic tuple that contains key datum and posting list,
+ * build a posting tuple.
+ *
+ * Basic tuple can be a posting tuple, but we only use key part of it,
+ * all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1 fallback to building a non-posting tuple.
+ * It is necessary to avoid storage overhead after posting tuple was vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+	uint32		keysize,
+				newsize = 0;
+	IndexTuple	itup;
+
+	/* We only need key part of the tuple */
+	if (BTreeTupleIsPosting(tuple))
+		keysize = BTreeTupleGetPostingOffset(tuple);
+	else
+		keysize = IndexTupleSize(tuple);
+
+	Assert(nipd > 0);
+
+	/* Add space needed for posting list */
+	if (nipd > 1)
+		newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+	else
+		newsize = keysize;
+
+	newsize = MAXALIGN(newsize);
+	itup = palloc0(newsize);
+	memcpy(itup, tuple, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+
+	if (nipd > 1)
+	{
+		/* Form posting tuple, fill posting fields */
+
+		/* Set meta info about the posting list */
+		itup->t_info |= INDEX_ALT_TID_MASK;
+		BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+		/* sort the list to preserve TID order invariant */
+		qsort((void *) ipd, nipd, sizeof(ItemPointerData),
+			  (int (*) (const void *, const void *)) ItemPointerCompare);
+
+		/* Copy posting list into the posting tuple */
+		memcpy(BTreeTupleGetPosting(itup), ipd,
+			   sizeof(ItemPointerData) * nipd);
+	}
+	else
+	{
+		/* To finish building of a non-posting tuple, copy TID from ipd */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		ItemPointerCopy(ipd, &itup->t_tid);
+	}
+
+	return itup;
+}
+
+/*
+ * Opposite of BTreeFormPostingTuple.
+ * returns regular tuple that contains the key,
+ * the tid of the new tuple is the nth tid of original tuple's posting list
+ * result tuple palloc'd in a caller's context.
+ */
+IndexTuple
+BTreeGetNthTupleOfPosting(IndexTuple tuple, int n)
+{
+	Assert(BTreeTupleIsPosting(tuple));
+	return BTreeFormPostingTuple(tuple, BTreeTupleGetPostingN(tuple, n), 1);
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c1aa..98ce964ea9 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -181,9 +181,35 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 
 		page = BufferGetPage(buffer);
 
-		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
-						false, false) == InvalidOffsetNumber)
-			elog(PANIC, "btree_xlog_insert: failed to add item");
+		if (xlrec->in_posting_offset != InvalidOffsetNumber)
+		{
+			/* oposting must be at offset before new item */
+			ItemId		itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+			IndexTuple oposting = (IndexTuple) PageGetItem(page, itemid);
+			IndexTuple newitem = (IndexTuple) datapos;
+			IndexTuple nposting;
+
+			nposting = _bt_form_newposting(newitem, oposting,
+										   xlrec->in_posting_offset);
+			Assert(isleaf);
+
+			Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+				   MAXALIGN(IndexTupleSize(nposting)));
+
+			/* replace existing posting */
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+			/* insert new item */
+			if (PageAddItem(page, (Item) newitem, MAXALIGN(IndexTupleSize(newitem)),
+							xlrec->offnum, false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+		}
+		else
+		{
+			if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+		}
 
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
@@ -265,20 +291,45 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 		BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
 		OffsetNumber off;
 		IndexTuple	newitem = NULL,
-					left_hikey = NULL;
+					left_hikey = NULL,
+					nposting = NULL;
 		Size		newitemsz = 0,
 					left_hikeysz = 0;
 		Page		newlpage;
-		OffsetNumber leftoff;
+		OffsetNumber leftoff,
+					 replacepostingoff = InvalidOffsetNumber;
 
 		datapos = XLogRecGetBlockData(record, 0, &datalen);
 
-		if (onleft)
+		if (onleft || xlrec->in_posting_offset)
 		{
 			newitem = (IndexTuple) datapos;
 			newitemsz = MAXALIGN(IndexTupleSize(newitem));
 			datapos += newitemsz;
 			datalen -= newitemsz;
+
+			/*
+			 * Repeat logic implemented in _bt_insertonpg():
+			 *
+			 * If the new tuple is a duplicate with a heap TID that falls
+			 * inside the range of an existing posting list tuple,
+			 * generate new posting tuple to replace original one
+			 * and update new tuple so that it's heap TID contains
+			 * the rightmost heap TID of original posting tuple.
+			 */
+			if (xlrec->in_posting_offset != 0)
+			{
+				ItemId		itemid = PageGetItemId(lpage, OffsetNumberPrev(xlrec->newitemoff));
+				IndexTuple oposting = (IndexTuple) PageGetItem(lpage, itemid);
+
+				nposting = _bt_form_newposting(newitem, oposting,
+											xlrec->in_posting_offset);
+
+				/* Alter new item offset, since effective new item changed */
+				replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+				Assert(BTreeTupleGetNPosting(nposting) == BTreeTupleGetNPosting(oposting));
+			}
 		}
 
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
@@ -304,6 +355,15 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 			Size		itemsz;
 			IndexTuple	item;
 
+			if (off == replacepostingoff)
+			{
+				if (PageAddItem(newlpage, (Item) nposting, MAXALIGN(IndexTupleSize(nposting)),
+								leftoff, false, false) == InvalidOffsetNumber)
+					elog(ERROR, "failed to add new item to left page after split");
+				leftoff = OffsetNumberNext(leftoff);
+				continue;
+			}
+
 			/* add the new item if it was inserted on left page */
 			if (onleft && off == xlrec->newitemoff)
 			{
@@ -379,6 +439,138 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 	}
 }
 
+static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		buf;
+	Page		newpage;
+	xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+
+	if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Initialize a temporary empty page and copy all the items
+		 * to that in item number order.
+		 */
+		Page		page = (Page) BufferGetPage(buf);
+		BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		BTPageOpaque nopaque;
+		OffsetNumber offnum, minoff, maxoff;
+		BTDedupState *dedupState = NULL;
+		char *data = ((char *) xlrec + SizeOfBtreeDedup);
+		dedupInterval dedup_intervals[MaxOffsetNumber];
+		int			 nth_interval = 0;
+		OffsetNumber n_dedup_tups = 0;
+
+		dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+		dedupState->ipd = NULL;
+		dedupState->ntuples = 0;
+		dedupState->itupprev = NULL;
+		dedupState->maxitemsize = BTMaxItemSize(page);
+		dedupState->maxpostingsize = 0;
+
+		memcpy(dedup_intervals, data,
+			   xlrec->n_intervals*sizeof(dedupInterval));
+
+		/* Scan over all items to see which ones can be deduplicated */
+		minoff = P_FIRSTDATAKEY(opaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+		newpage = PageGetTempPageCopySpecial(page);
+		nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+		/* Make sure that new page won't have garbage flag set */
+		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+		/* Copy High Key if any */
+		if (!P_RIGHTMOST(opaque))
+		{
+			ItemId		itemid = PageGetItemId(page, P_HIKEY);
+			Size		itemsz = ItemIdGetLength(itemid);
+			IndexTuple	item = (IndexTuple) PageGetItem(page, itemid);
+
+			if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+							false, false) == InvalidOffsetNumber)
+				elog(ERROR, "failed to add highkey during deduplication");
+		}
+
+		/*
+		* Iterate over tuples on the page to deduplicate them into posting
+		* lists and insert into new page
+		*/
+		for (offnum = minoff;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemId = PageGetItemId(page, offnum);
+			IndexTuple	itup = (IndexTuple) PageGetItem(page, itemId);
+
+			elog(DEBUG4, "btree_xlog_dedup. offnum %u, n_intervals %u, from %u ntups %u",
+						offnum,
+						nth_interval,
+						dedup_intervals[nth_interval].from,
+						dedup_intervals[nth_interval].ntups);
+
+			if (dedupState->itupprev == NULL)
+			{
+				/* Just set up base/first item in first iteration */
+				Assert(offnum == minoff);
+				dedupState->itupprev = CopyIndexTuple(itup);
+				dedupState->itupprev_off = offnum;
+				continue;
+			}
+
+			/*
+			 * Instead of comparing tuple's keys, which may be costly, use
+			 * information from xlog record. If current tuple belongs to the
+			 * group of deduplicated items, repeat logic of _bt_dedup_one_page
+			 * and stash it to form a posting list afterwards.
+			 */
+			if (dedupState->itupprev_off >= dedup_intervals[nth_interval].from
+				&& n_dedup_tups < dedup_intervals[nth_interval].ntups)
+			{
+				_bt_stash_item_tid(dedupState, itup, InvalidOffsetNumber);
+
+				elog(DEBUG4, "btree_xlog_dedup. stash offnum %u, nth_interval %u, from %u ntups %u",
+						offnum,
+						nth_interval,
+						dedup_intervals[nth_interval].from,
+						dedup_intervals[nth_interval].ntups);
+
+				/* count first tuple in the group */
+				if (dedupState->itupprev_off == dedup_intervals[nth_interval].from)
+					n_dedup_tups++;
+
+				/* count added tuple */
+				n_dedup_tups++;
+			}
+			else
+			{
+				_bt_dedup_insert(newpage, dedupState);
+
+				/* reset state */
+				if (n_dedup_tups > 0)
+					nth_interval++;
+				n_dedup_tups = 0;
+			}
+
+			pfree(dedupState->itupprev);
+			dedupState->itupprev = CopyIndexTuple(itup);
+			dedupState->itupprev_off = offnum;
+		}
+
+		/* Handle the last item */
+		_bt_dedup_insert(newpage, dedupState);
+
+		PageRestoreTempPage(newpage, page);
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buf);
+	}
+
+	if (BufferIsValid(buf))
+		UnlockReleaseBuffer(buf);
+}
+
 static void
 btree_xlog_vacuum(XLogReaderState *record)
 {
@@ -386,8 +578,8 @@ btree_xlog_vacuum(XLogReaderState *record)
 	Buffer		buffer;
 	Page		page;
 	BTPageOpaque opaque;
-#ifdef UNUSED
 	xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
 
 	/*
 	 * This section of code is thought to be no longer needed, after analysis
@@ -478,14 +670,34 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		if (len > 0)
 		{
-			OffsetNumber *unused;
-			OffsetNumber *unend;
+			if (xlrec->nremaining)
+			{
+				OffsetNumber *remainingoffset;
+				IndexTuple	remaining;
+				Size		itemsz;
 
-			unused = (OffsetNumber *) ptr;
-			unend = (OffsetNumber *) ((char *) ptr + len);
+				remainingoffset = (OffsetNumber *)
+					(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+				remaining = (IndexTuple) ((char *) remainingoffset +
+										  xlrec->nremaining * sizeof(OffsetNumber));
 
-			if ((unend - unused) > 0)
-				PageIndexMultiDelete(page, unused, unend - unused);
+				/* Handle posting tuples */
+				for (int i = 0; i < xlrec->nremaining; i++)
+				{
+					PageIndexTupleDelete(page, remainingoffset[i]);
+
+					itemsz = MAXALIGN(IndexTupleSize(remaining));
+
+					if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+									false, false) == InvalidOffsetNumber)
+						elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+					remaining = (IndexTuple) ((char *) remaining + itemsz);
+				}
+			}
+
+			if (xlrec->ndeleted)
+				PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 		}
 
 		/*
@@ -838,6 +1050,9 @@ btree_redo(XLogReaderState *record)
 		case XLOG_BTREE_SPLIT_R:
 			btree_xlog_split(false, record);
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			btree_xlog_dedup(record);
+			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
 			break;
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 4ee6d04a68..7351cad1d2 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -30,7 +30,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_insert *xlrec = (xl_btree_insert *) rec;
 
-				appendStringInfo(buf, "off %u", xlrec->offnum);
+				appendStringInfo(buf, "off %u; in_posting_offset %u",
+								 xlrec->offnum, xlrec->in_posting_offset);
 				break;
 			}
 		case XLOG_BTREE_SPLIT_L:
@@ -38,16 +39,29 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
-				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d",
-								 xlrec->level, xlrec->firstright, xlrec->newitemoff);
+				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, in_posting_offset %d",
+								 xlrec->level,
+								 xlrec->firstright,
+								 xlrec->newitemoff,
+								 xlrec->in_posting_offset);
+				break;
+			}
+		case XLOG_BTREE_DEDUP_PAGE:
+			{
+				xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+				appendStringInfo(buf, "items were deduplicated to %d items",
+								 xlrec->n_intervals);
 				break;
 			}
 		case XLOG_BTREE_VACUUM:
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "lastBlockVacuumed %u",
-								 xlrec->lastBlockVacuumed);
+				appendStringInfo(buf, "lastBlockVacuumed %u; nremaining %u; ndeleted %u",
+								 xlrec->lastBlockVacuumed,
+								 xlrec->nremaining,
+								 xlrec->ndeleted);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
@@ -131,6 +145,9 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			id = "DEDUPLICATE";
+			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4a80e84aa7..d1af18f864 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
  * All other types of index tuples ("pivot" tuples) only have key columns,
  * since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,38 @@ typedef struct BTMetaPageData
  * omitted rather than truncated, since its representation is different to
  * the non-pivot representation.)
  *
+ * Non-pivot posting tuple format:
+ *  t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicated keys more effectively, we use special format
+ * of tuples - posting tuples.  posting_list is an array of ItemPointerData.
+ *
+ * Deduplication never applies to unique indexes or indexes with INCLUDEd
+ * columns.
+ *
+ * To differ posting tuples we use INDEX_ALT_TID_MASK flag in t_info and
+ * BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's tid:
+ * - t_tid.ip_blkid contains offset of the posting list.
+ * - t_tid offset field contains number of posting items this tuple contain
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * tuples, which is constrainted by BTMaxItemSize.
+
+ * If page contains so many duplicates, that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), page may contain several posting
+ * tuples with the same key.
+ * Also page can contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
  * Pivot tuple format:
  *
  *  t_tid | t_info | key values | [heap TID]
@@ -281,23 +312,145 @@ typedef struct BTMetaPageData
  * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
  * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
  * number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinct posting tuples from pivot tuples.
  *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_N_POSTING_OFFSET_MASK	0x0FFF
 #define BT_HEAP_TID_ATTR			0x1000
+#define BT_IS_POSTING				0x2000
 
-/* Get/set downlink block number */
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger then
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+	((int) ((BLCKSZ - SizeOfPageHeaderData - \
+			3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+			(sizeof(ItemPointerData)))
+
+/*
+ * Helper for BTDedupState.
+ * Each entry represents a group of 'ntups' consecutive items starting on
+ * 'from' offset that were deduplicated into a single posting tuple.
+ */
+typedef struct dedupInterval
+{
+	OffsetNumber from;
+	OffsetNumber ntups;
+} dedupInterval;
+
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * Iterating over tuples during index build or applying deduplication to a
+ * single page, we remember a tuple in itupprev, then compare the next one
+ * with it.  If tuples are equal, save their TIDs in the posting list.
+ * ntuples contains the size of the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTDedupState
+{
+	Size		maxitemsize;
+	Size		maxpostingsize;
+	IndexTuple	itupprev;
+
+	/*
+	 * array with info about deduplicated items on the page.
+	 *
+	 * It contains one entry for each group of consecutive items that
+	 * were deduplicated into a single posting tuple.
+	 *
+	 * This array is saved to xlog entry, which allows to replay
+	 * deduplication faster without actually comparing tuple's keys.
+	 */
+	dedupInterval dedup_intervals[MaxOffsetNumber];
+	/* current number of items in dedup_intervals array */
+	int			n_intervals;
+	/* temp state variable to keep a 'possible' start of dedup interval */
+	OffsetNumber itupprev_off;
+
+	int			ntuples;
+	ItemPointerData *ipd;
+} BTDedupState;
+
+/*
+ * N.B.: BTreeTupleIsPivot() should only be used in code that deals with
+ * heapkeyspace indexes specifically.  BTreeTupleIsPosting() works with all
+ * nbtree indexes, though.
+ */
+#define BTreeTupleIsPivot(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+	)
+#define BTreeTupleIsPosting(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+	)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleGetNPosting(itup)	\
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+	)
+#define BTreeTupleSetNPosting(itup, n) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+	} while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+	)
+#define BTreeSetPostingMeta(itup, nposting, off) \
+	do { \
+		BTreeTupleSetNPosting(itup, nposting); \
+		Assert(BTreeTupleIsPosting(itup)); \
+		ItemPointerSetBlockNumber(&((itup)->t_tid), (off)); \
+	} while(0)
+
+#define BTreeTupleGetPosting(itup) \
+	(ItemPointer) ((char*) (itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+	(BTreeTupleGetPosting(itup) + (n))
+
+/* Get/set downlink block number  */
 #define BTreeInnerTupleGetDownLink(itup) \
 	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
 #define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,40 +479,73 @@ typedef struct BTMetaPageData
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
 		( \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
 		IndexRelationGetNumberOfAttributes(rel) \
 	)
-#define BTreeTupleSetNAtts(itup, n) \
-	do { \
-		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
-	} while(0)
+
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int n)
+{
+	Assert(!BTreeTupleIsPosting(itup));
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	ItemPointerSetOffsetNumber(&itup->t_tid, n & BT_N_KEYS_OFFSET_MASK);
+}
 
 /*
- * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Get tiebreaker heap TID attribute, if any.  Works with both pivot and
+ * non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
  */
-#define BTreeTupleGetHeapTID(itup) \
-	( \
-	  (itup)->t_info & INDEX_ALT_TID_MASK && \
-	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
-	  ( \
-		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
-					   sizeof(ItemPointerData)) \
-	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
-	)
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+	if (BTreeTupleIsPivot(itup))
+	{
+		/* Pivot tuple heap TID representation? */
+		if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+			 BT_HEAP_TID_ATTR) != 0)
+			return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+								  sizeof(ItemPointerData));
+
+		/* Heap TID attribute was truncated */
+		return NULL;
+	}
+	else if (BTreeTupleIsPosting(itup))
+		return BTreeTupleGetPosting(itup);
+
+	return &(itup->t_tid);
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list tuple.  Works with
+ * non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxTID(IndexTuple itup)
+{
+	Assert(!BTreeTupleIsPivot(itup));
+
+	if (BTreeTupleIsPosting(itup))
+		return (ItemPointer) (BTreeTupleGetPosting(itup) +
+							  (BTreeTupleGetNPosting(itup) - 1));
+
+	return &(itup->t_tid);
+}
+
 /*
  * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * representation
  */
 #define BTreeTupleSetAltHeapTID(itup) \
 	do { \
-		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(BTreeTupleIsPivot(itup)); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
 								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
 	} while(0)
@@ -499,6 +685,13 @@ typedef struct BTInsertStateData
 	/* Buffer containing leaf page we're likely to insert itup on */
 	Buffer		buf;
 
+	/*
+	 * if _bt_binsrch_insert() found the location inside existing posting
+	 * list, save the position inside the list.  This will be -1 in rare cases
+	 * where the overlapping posting list is LP_DEAD.
+	 */
+	int			in_posting_offset;
+
 	/*
 	 * Cache of bounds within the current buffer.  Only used for insertions
 	 * where _bt_check_unique is called.  See _bt_binsrch_insert and
@@ -534,7 +727,9 @@ typedef BTInsertStateData *BTInsertState;
  * If we are doing an index-only scan, we save the entire IndexTuple for each
  * matched item, otherwise only its heap TID and offset.  The IndexTuples go
  * into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array.  Posting list tuples store a version of the
+ * tuple that does not include the posting list, allowing the same key to be
+ * returned for each logical tuple associated with the posting list.
  */
 
 typedef struct BTScanPosItem	/* what we remember about each match */
@@ -563,9 +758,13 @@ typedef struct BTScanPosData
 
 	/*
 	 * If we are doing an index-only scan, nextTupleOffset is the first free
-	 * location in the associated tuple storage workspace.
+	 * location in the associated tuple storage workspace.  Posting list
+	 * tuples need postingTupleOffset to store the current location of the
+	 * tuple that is returned multiple times (once per heap TID in posting
+	 * list).
 	 */
 	int			nextTupleOffset;
+	int			postingTupleOffset;
 
 	/*
 	 * The items array is always ordered in index order (ie, increasing
@@ -578,7 +777,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxPostingIndexTuplesPerPage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -730,8 +929,11 @@ extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
  */
 extern bool _bt_doinsert(Relation rel, IndexTuple itup,
 						 IndexUniqueCheck checkUnique, Relation heapRel);
-extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
 extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
+extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
+extern IndexTuple _bt_form_newposting(IndexTuple itup, IndexTuple oposting,
+				   OffsetNumber in_posting_offset);
+extern void _bt_dedup_insert(Page page, BTDedupState *dedupState);
 
 /*
  * prototypes for functions in nbtsplitloc.c
@@ -762,6 +964,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems, Relation heapRel);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems,
+								OffsetNumber *remainingoffset,
+								IndexTuple *remaining, int nremaining,
 								BlockNumber lastBlockVacuumed);
 extern int	_bt_pagedel(Relation rel, Buffer buf);
 
@@ -812,6 +1016,9 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd,
+										int nipd);
+extern IndexTuple BTreeGetNthTupleOfPosting(IndexTuple tuple, int n);
 
 /*
  * prototypes for functions in nbtvalidate.c
@@ -824,5 +1031,7 @@ extern bool btvalidate(Oid opclassoid);
 extern IndexBuildResult *btbuild(Relation heap, Relation index,
 								 struct IndexInfo *indexInfo);
 extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+extern void _bt_stash_item_tid(BTDedupState *dedupState, IndexTuple itup,
+							   OffsetNumber itup_offnum);
 
 #endif							/* NBTREE_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 91b9ee00cf..7d41adccac 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP_PAGE	0x50	/* compactify tuples on the page */
+/* 0x60 is unused */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -61,16 +62,21 @@ typedef struct xl_btree_metadata
  * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
  * Note that INSERT_META implies it's not a leaf page.
  *
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page (data contains the inserted tuple);
+ *				 if in_posting_offset is valid, this is an insertion
+ *				 into existing posting tuple at offnum.
+ *				 redo must repeat logic of bt_insertonpg().
  * Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
  * Backup Blk 2: xl_btree_metadata, if INSERT_META
+ *
  */
 typedef struct xl_btree_insert
 {
 	OffsetNumber offnum;
+	OffsetNumber in_posting_offset;
 } xl_btree_insert;
 
-#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, in_posting_offset) + sizeof(OffsetNumber))
 
 /*
  * On insert with split, we save all the items going into the right sibling
@@ -95,6 +101,11 @@ typedef struct xl_btree_insert
  * An IndexTuple representing the high key of the left page must follow with
  * either variant.
  *
+ * In case, split included insertion into the middle of the posting tuple, and
+ * thus required posting tuple replacement, it also contains 'in_posting_offset',
+ * that is used to form replacing tuple and repean bt_insertonpg() logic.
+ * It is added to xlog only if replacing item remains on the left page.
+ *
  * Backup Blk 1: new right page
  *
  * The right page's data portion contains the right page's tuples in the form
@@ -112,9 +123,26 @@ typedef struct xl_btree_split
 	uint32		level;			/* tree level of page being split */
 	OffsetNumber firstright;	/* first item moved to right page */
 	OffsetNumber newitemoff;	/* new item's offset (useful for _L variant) */
+	OffsetNumber in_posting_offset; /* offset inside posting tuple  */
 } xl_btree_split;
 
-#define SizeOfBtreeSplit	(offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit	(offsetof(xl_btree_split, in_posting_offset) + sizeof(OffsetNumber))
+
+/*
+ * When page is deduplicated, consecutive groups of tuples with equal keys
+ * are compactified into posting tuples.
+ * The WAL record keeps number of resulting posting tuples - n_intervals
+ * followed by array of dedupInterval structures, that hold information
+ * needed to replay page deduplication without extra comparisons of tuples keys.
+ */
+typedef struct xl_btree_dedup
+{
+	int			n_intervals;
+
+	/* TARGET DEDUP INTERVALS FOLLOW AT THE END */
+} xl_btree_dedup;
+#define SizeOfBtreeDedup (sizeof(int))
+
 
 /*
  * This is what we need to know about delete of individual leaf index tuples.
@@ -172,10 +200,19 @@ typedef struct xl_btree_vacuum
 {
 	BlockNumber lastBlockVacuumed;
 
-	/* TARGET OFFSET NUMBERS FOLLOW */
+	/*
+	 * This field helps us to find beginning of the remaining tuples from
+	 * postings which follow array of offset numbers.
+	 */
+	uint32		nremaining;
+	uint32		ndeleted;
+
+	/* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+	/* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
+	/* TARGET OFFSET NUMBERS FOLLOW (if any) */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(BlockNumber))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.
diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index ec47a228ae..71a03e39d3 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -212,3 +212,24 @@
    Memcheck:Cond
    fun:PyObject_Realloc
 }
+
+# Temporarily work around bug in datum_image_eq's handling of the cstring
+# (typLen == -2) case.  datumIsEqual() is not affected, but also doesn't handle
+# TOAST'ed values correctly.
+#
+# FIXME: Remove both suppressions when bug is fixed on master branch
+{
+   temporary_workaround_1
+   Memcheck:Addr1
+   fun:bcmp
+   fun:datum_image_eq
+   fun:_bt_keep_natts_fast
+}
+
+{
+   temporary_workaround_8
+   Memcheck:Addr8
+   fun:bcmp
+   fun:datum_image_eq
+   fun:_bt_keep_natts_fast
+}
-- 
2.17.1

#89

Anastasia Lubennikova

a.lubennikova@postgrespro.ru

over 6 years ago

In reply to: Peter Geoghegan (#87)

1 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

16.09.2019 21:58, Peter Geoghegan wrote:

On Mon, Sep 16, 2019 at 8:48 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

I tested patch with nbtree_wal_test, and found out that the real issue is
not the dedup WAL records themselves, but the full page writes that they trigger.
Here are test results (config is standard, except fsync=off to speedup tests):

'FPW on' and 'FPW off' are tests on v14.
NO_IMAGE is the test on v14 with REGBUF_NO_IMAGE in bt_dedup_one_page().

I think that is makes sense to focus on synthetic cases without
FPWs/FPIs from checkpoints. At least for now.

With random insertions into btree it's highly possible that deduplication will often be
the first write after checkpoint, and thus will trigger FPW, even if only a few tuples were compressed.

<...>

I think that the problem here is that you didn't copy this old code
from _bt_split() over to _bt_dedup_one_page():

/*
* Copy the original page's LSN into leftpage, which will become the
* updated version of the page. We need this because XLogInsert will
* examine the LSN and possibly dump it in a page image.
*/
PageSetLSN(leftpage, PageGetLSN(origpage));
isleaf = P_ISLEAF(oopaque);

Note that this happens at the start of _bt_split() -- the temp page
buffer based on origpage starts out with the same LSN as origpage.
This is an important step of the WAL volume optimization used by
_bt_split().

That's it. I suspected that such enormous amount of FPW reflects some bug.

That's why there is no significant difference with log_newpage_buffer() approach.
And that's why "lazy" deduplication doesn't help to decrease amount of WAL.

My point was that the problem is extra FPWs, so it doesn't matter
whether we deduplicate just several entries to free enough space or all
of them.

The term "lazy deduplication" is seriously overloaded here. I think
that this could cause miscommunications. Let me list the possible
meanings of that term here:

1. First of all, the basic approach to deduplication is already lazy,
unlike GIN, in the sense that _bt_dedup_one_page() is called to avoid
a page split. I'm 100% sure that we both think that that works well
compared to an eager approach (like GIN's).

Sure.

2. Second of all, there is the need to incrementally WAL log. It looks
like v14 does that well, in that it doesn't create
"xlrec_dedup.n_intervals" space when it isn't truly needed. That's
good.

In v12-v15 I mostly concentrated on this feature.
The last version looks good to me.

3. Third, there is incremental writing of the page itself -- avoiding
using a temp buffer. Not sure where I stand on this.

I think it's a good idea. memmove must be much faster than copying
items tuple by tuple.
I'll send next patch by the end of the week.

4. Finally, there is the possibility that we could make deduplication
incremental, in order to avoid work that won't be needed altogether --
this would probably be combined with 3. Not sure where I stand on
this, either.

We should try to be careful when using these terms, as there is a very
real danger of talking past each other.

Another, and more realistic approach is to make deduplication less intensive:
if freed space is less than some threshold, fall back to not changing page at all and not generating xlog record.

I see that v14 uses the "dedupInterval" struct, which provides a
logical description of a deduplicated set of tuples. That general
approach is at least 95% of what I wanted from the
_bt_dedup_one_page() WAL-logging.

Probably that was the reason, why patch became faster after I added BT_COMPRESS_THRESHOLD in early versions,
not because deduplication itself is cpu bound or something, but because WAL load decreased.

I think so too -- BT_COMPRESS_THRESHOLD definitely makes compression
faster as things are. I am not against bringing back
BT_COMPRESS_THRESHOLD. I just don't want to do it right now because I
think that it's a distraction. It may hide problems that we want to
fix. Like the PageSetLSN() problem I mentioned just now, and maybe
others.

We will definitely need to have page space accounting that's a bit
similar to nbtsplitloc.c, to avoid the case where a leaf page is 100%
full (or has 4 bytes left, or something). That happens regularly now.
That must start with teaching _bt_dedup_one_page() about how much
space it will free. Basing it on the number of items on the page or
whatever is not going to work that well.

I think that it would be possible to have something like
BT_COMPRESS_THRESHOLD to prevent thrashing, and *also* make the
deduplication incremental, in the sense that it can give up on
deduplication when it frees enough space (i.e. something like v13's
0002-* patch). I said that these two things are closely related, which
is true, but it's also true that they don't overlap.

Don't forget the reason why I removed BT_COMPRESS_THRESHOLD: Doing so
made a handful of specific indexes (mostly from TPC-H) significantly
smaller. I never tried to debug the problem. It's possible that we
could bring back BT_COMPRESS_THRESHOLD (or something fillfactor-like),
but not use it on rightmost pages, and get the best of both worlds.
IIRC right-heavy low cardinality indexes (e.g. a low cardinality date
column) were improved by removing BT_COMPRESS_THRESHOLD, but we can
debug that when the time comes.

Now that extra FPW are proven to be a bug, I agree that giving up on
deduplication early is not necessary.
My previous considerations were based on the idea that deduplication
always adds considerable overhead,
which is not true after recent optimizations.

So I propose to develop this idea. The question is how to choose threshold.
I wouldn't like to introduce new user settings. Any ideas?

I think that there should be a target fill factor that sometimes makes
deduplication leave a small amount of free space. Maybe that means
that the last posting list on the page is made a bit smaller than the
other ones. It should be "goal orientated".

The loop within _bt_dedup_one_page() is very confusing in both v13 and
v14 -- I couldn't figure out why the accounting worked like this:
+           /*
+            * Project size of new posting list that would result from merging
+            * current tup with pending posting list (could just be prev item
+            * that's "pending").
+            *
+            * This accounting looks odd, but it's correct because ...
+            */
+           projpostingsz = MAXALIGN(IndexTupleSize(dedupState->itupprev) +
+                                    (dedupState->ntuples + itup_ntuples + 1) *
+                                    sizeof(ItemPointerData));
Why the "+1" here?

I'll look at it.

I have significantly refactored the _bt_dedup_one_page() loop in a way
that seems like a big improvement. It allowed me to remove all of the
small palloc() calls inside the loop, apart from the
BTreeFormPostingTuple() palloc()s. It's also a lot faster -- it seems
to have shaved about 2 seconds off the "land" unlogged table test,
which was originally about 1 minute 2 seconds with v13's 0001-* patch
(and without v13's 0002-* patch).

It seems like can easily be integrated with the approach to WAL
logging taken in v14, so everything can be integrated soon. I'll work
on that.

New version is attached.
It is v14 (with PageSetLSN fix) merged with v13.

I also fixed a bug in btree_xlog_dedup(), that was previously masked by FPW.
v15 passes make installcheck.
I haven't tested it with land test yet. Will do it later this week.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

v15-0001-Add-deduplication-to-nbtree.patchtext/x-patch; name=v15-0001-Add-deduplication-to-nbtree.patchDownload

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d67..83519cb 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -145,6 +145,7 @@ static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 									  bool tupleIsAlive, void *checkstate);
 static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 									 IndexTuple itup);
+static inline IndexTuple bt_posting_logical_tuple(IndexTuple itup, int n);
 static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 											   OffsetNumber offset);
@@ -419,12 +420,13 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		/*
 		 * Size Bloom filter based on estimated number of tuples in index,
 		 * while conservatively assuming that each block must contain at least
-		 * MaxIndexTuplesPerPage / 5 non-pivot tuples.  (Non-leaf pages cannot
-		 * contain non-pivot tuples.  That's okay because they generally make
-		 * up no more than about 1% of all pages in the index.)
+		 * MaxPostingIndexTuplesPerPage / 3 "logical" tuples.  heapallindexed
+		 * verification fingerprints posting list heap TIDs as plain non-pivot
+		 * tuples, complete with index keys.  This allows its heap scan to
+		 * behave as if posting lists do not exist.
 		 */
 		total_pages = RelationGetNumberOfBlocks(rel);
-		total_elems = Max(total_pages * (MaxIndexTuplesPerPage / 5),
+		total_elems = Max(total_pages * (MaxPostingIndexTuplesPerPage / 3),
 						  (int64) state->rel->rd_rel->reltuples);
 		/* Random seed relies on backend srandom() call to avoid repetition */
 		seed = random();
@@ -924,6 +926,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer	scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -994,29 +997,73 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
 			char	   *itid,
 					   *htid;
+			ItemPointer tid = BTreeTupleGetHeapTID(itup);
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+							ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("could not find tuple using search from root page in index \"%s\"",
 							RelationGetRelationName(state->rel)),
-					 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+					 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
 										itid, htid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is actually a posting list, make sure posting list TIDs
+		 * are in order.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer		current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+				current = BTreeTupleGetPostingN(itup, i);
+
+				if (ItemPointerCompare(current, &last) <= 0)
+				{
+					char	   *itid,
+							   *htid;
+
+					itid = psprintf("(%u,%u)", state->targetblock, offset);
+					htid = psprintf("(%u,%u)",
+									ItemPointerGetBlockNumberNoCheck(current),
+									ItemPointerGetOffsetNumberNoCheck(current));
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+												itid, htid,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1074,12 +1121,32 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				/* Fingerprint all elements as distinct "logical" tuples */
+				for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					IndexTuple	logtuple;
+
+					logtuple = bt_posting_logical_tuple(itup, i);
+					norm = bt_normalize_tuple(state, logtuple);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != logtuple)
+						pfree(norm);
+					pfree(logtuple);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
@@ -1087,7 +1154,8 @@ bt_target_page_check(BtreeCheckState *state)
 		 *
 		 * If there is a high key (if this is not the rightmost page on its
 		 * entire level), check that high key actually is upper bound on all
-		 * page items.
+		 * page items.  If this is a posting list tuple, we'll need to set
+		 * scantid to be highest TID in posting list.
 		 *
 		 * We prefer to check all items against high key rather than checking
 		 * just the last and trusting that the operator class obeys the
@@ -1127,6 +1195,9 @@ bt_target_page_check(BtreeCheckState *state)
 		 * tuple. (See also: "Notes About Data Representation" in the nbtree
 		 * README.)
 		 */
+		scantid = skey->scantid;
+		if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+			skey->scantid = BTreeTupleGetMaxTID(itup);
 		if (!P_RIGHTMOST(topaque) &&
 			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
 			  invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1221,7 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
+		skey->scantid = scantid;
 
 		/*
 		 * * Item order check *
@@ -1164,11 +1236,13 @@ bt_target_page_check(BtreeCheckState *state)
 					   *htid,
 					   *nitid,
 					   *nhtid;
+			ItemPointer tid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetHeapTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 			nitid = psprintf("(%u,%u)", state->targetblock,
 							 OffsetNumberNext(offset));
 
@@ -1177,9 +1251,11 @@ bt_target_page_check(BtreeCheckState *state)
 										  state->target,
 										  OffsetNumberNext(offset));
 			itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+			tid = BTreeTupleGetHeapTID(itup);
 			nhtid = psprintf("(%u,%u)",
-							 ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							 ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							 ItemPointerGetBlockNumberNoCheck(tid),
+							 ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1265,10 @@ bt_target_page_check(BtreeCheckState *state)
 										"higher index tid=%s (points to %s tid=%s) "
 										"page lsn=%X/%X.",
 										itid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										htid,
 										nitid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										nhtid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
@@ -1953,10 +2029,10 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
  * verification.  In particular, it won't try to normalize opclass-equal
  * datums with potentially distinct representations (e.g., btree/numeric_ops
  * index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though.  For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation.  Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
  */
 static IndexTuple
 bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -1969,6 +2045,9 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	IndexTuple	reformed;
 	int			i;
 
+	/* Caller should only pass "logical" non-pivot tuples here */
+	Assert(!BTreeTupleIsPosting(itup) && !BTreeTupleIsPivot(itup));
+
 	/* Easy case: It's immediately clear that tuple has no varlena datums */
 	if (!IndexTupleHasVarwidths(itup))
 		return itup;
@@ -2032,6 +2111,30 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 }
 
 /*
+ * Produce palloc()'d "logical" tuple for nth posting list entry.
+ *
+ * In general, deduplication is not supposed to change the logical contents of
+ * an index.  Multiple logical index tuples are folded together into one
+ * physical posting list index tuple when convenient.
+ *
+ * heapallindexed verification must normalize-away this variation in
+ * representation by converting posting list tuples into two or more "logical"
+ * tuples.  Each logical tuple must be fingerprinted separately -- there must
+ * be one logical tuple for each corresponding Bloom filter probe during the
+ * heap scan.
+ *
+ * Note: Caller needs to call bt_normalize_tuple() with returned tuple.
+ */
+static inline IndexTuple
+bt_posting_logical_tuple(IndexTuple itup, int n)
+{
+	Assert(BTreeTupleIsPosting(itup));
+
+	/* Returns non-posting-list tuple */
+	return BTreeFormPostingTuple(itup, BTreeTupleGetPostingN(itup, n), 1);
+}
+
+/*
  * Search for itup in index, starting from fast root page.  itup must be a
  * non-pivot tuple.  This is only supported with heapkeyspace indexes, since
  * we rely on having fully unique keys to find a match with only a single
@@ -2087,6 +2190,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		insertstate.itup = itup;
 		insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 		insertstate.itup_key = key;
+		insertstate.in_posting_offset = 0;
 		insertstate.bounds_valid = false;
 		insertstate.buf = lbuf;
 
@@ -2094,7 +2198,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		offnum = _bt_binsrch_insert(state->rel, &insertstate);
 		/* Compare first >= matching item on leaf page, if any */
 		page = BufferGetPage(lbuf);
+		/* Should match on first heap TID when tuple has a posting list */
 		if (offnum <= PageGetMaxOffsetNumber(page) &&
+			insertstate.in_posting_offset <= 0 &&
 			_bt_compare(state->rel, key, page, offnum) == 0)
 			exists = true;
 		_bt_relbuf(state->rel, lbuf);
@@ -2560,14 +2666,18 @@ static inline ItemPointer
 BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
 							bool nonpivot)
 {
-	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	ItemPointer result;
 	BlockNumber targetblock = state->targetblock;
 
-	if (result == NULL && nonpivot)
+	/* Shouldn't be called with heapkeyspace index */
+	Assert(state->heapkeyspace);
+	if (BTreeTupleIsPivot(itup) == nonpivot)
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
 						targetblock, RelationGetRelationName(state->rel))));
 
+	result = BTreeTupleGetHeapTID(itup);
+
 	return result;
 }
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6db203e..54cb9db 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
 like a hint bit for a heap tuple), but physically removing tuples requires
 exclusive lock.  In the current code we try to remove LP_DEAD tuples when
 we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already).  Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)
 
 This leaves the index in a state where it has no entry for a dead tuple
 that still exists in the heap.  This is not a problem for the current
@@ -710,6 +713,75 @@ the fallback strategy assumes that duplicates are mostly inserted in
 ascending heap TID order.  The page is split in a way that leaves the left
 half of the page mostly full, and the right half of the page mostly empty.
 
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits.  Deduplication alters
+the physical representation of tuples without changing the logical contents
+of the index, and without adding overhead to read queries.  Non-pivot
+tuples are folded together into a single physical tuple with a posting list
+(a simple array of heap TIDs with the standard item pointer format).
+Deduplication is always applied lazily, at the point where it would
+otherwise be necessary to perform a page split.  It occurs only when
+LP_DEAD items have been removed, as our last line of defense against
+splitting a leaf page.  We can set the LP_DEAD bit with posting list
+tuples, though only when all table tuples are known dead. (Bitmap scans
+cannot perform LP_DEAD bit setting, and are the common case with indexes
+that contain lots of duplicates, so this downside is considered
+acceptable.)
+
+Large groups of logical duplicates tend to appear together on the same leaf
+page due to the special duplicate logic used when choosing a split point.
+This facilitates lazy/dynamic deduplication.  Deduplication can reliably
+deduplicate a large localized group of duplicates before it can span
+multiple leaf pages.  Posting list tuples are subject to the same 1/3 of a
+page restriction as any other tuple.
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists.  A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple.  An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication.  Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple.  When the incoming
+tuple really does overlap with an existing posting list, a posting list
+split is performed.  Posting list splits work in a way that more or less
+preserves the illusion that all incoming tuples do not need to be merged
+with any existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list.  The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list.  The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list.  We make
+space in the posting list by removing the heap TID that became the new
+item.  The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists.  Also, overall
+space utilization is improved by keeping existing posting lists large.
+
+The representation of posting lists is identical to the posting lists used
+by GIN, so it would be straightforward to apply GIN's varbyte encoding
+compression scheme to individual posting lists.  Posting list compression
+would break the assumptions made by posting list splits about page space
+accounting, though, so it's not clear how compression could be integrated
+with nbtree.  Besides, posting list compression does not offer a compelling
+trade-off for nbtree, since in general nbtree is optimized for consistent
+performance with many concurrent readers and writers.  A major goal of
+nbtree's lazy approach to deduplication is to limit the performance impact
+of deduplication with random updates.  Even concurrent append-only inserts
+of the same key value will tend to have inserts of individual index tuples
+in an order that doesn't quite match heap TID order.  In general, delaying
+deduplication avoids many unnecessary posting list splits, and minimizes
+page level fragmentation.
+
 Notes About Data Representation
 -------------------------------
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b84bf1c..4257406 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,21 +47,26 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   BTStack stack,
 						   IndexTuple itup,
 						   OffsetNumber newitemoff,
+						   int in_posting_offset,
 						   bool split_only_page);
 static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 						Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
-						IndexTuple newitem);
+						IndexTuple newitem, IndexTuple original_newitem, IndexTuple nposting,
+						OffsetNumber in_posting_offset);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 							  BTStack stack, bool is_root, bool is_only);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 						 OffsetNumber itup_off);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+							   Size newitemsz);
 
 /*
  *	_bt_doinsert() -- Handle insertion of a single index tuple in the tree.
  *
  *		This routine is called by the public interface routine, btinsert.
- *		By here, itup is filled in, including the TID.
+ *		By here, itup is filled in, including the TID.  Caller should be
+ *		prepared for us to scribble on 'itup'.
  *
  *		If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
  *		will allow duplicates.  Otherwise (UNIQUE_CHECK_YES or
@@ -123,6 +128,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	/* PageAddItem will MAXALIGN(), but be consistent */
 	insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 	insertstate.itup_key = itup_key;
+	insertstate.in_posting_offset = 0;
 	insertstate.bounds_valid = false;
 	insertstate.buf = InvalidBuffer;
 
@@ -300,7 +306,7 @@ top:
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
 		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
-					   itup, newitemoff, false);
+					   itup, newitemoff, insertstate.in_posting_offset, false);
 	}
 	else
 	{
@@ -435,6 +441,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 
 				/* okay, we gotta fetch the heap tuple ... */
 				curitup = (IndexTuple) PageGetItem(page, curitemid);
+				Assert(!BTreeTupleIsPosting(curitup));
 				htid = curitup->t_tid;
 
 				/*
@@ -689,6 +696,7 @@ _bt_findinsertloc(Relation rel,
 	BTScanInsert itup_key = insertstate->itup_key;
 	Page		page = BufferGetPage(insertstate->buf);
 	BTPageOpaque lpageop;
+	OffsetNumber location;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -751,13 +759,23 @@ _bt_findinsertloc(Relation rel,
 
 		/*
 		 * If the target page is full, see if we can obtain enough space by
-		 * erasing LP_DEAD items
+		 * erasing LP_DEAD items.  If that doesn't work out, and if the index
+		 * isn't a unique index, try deduplication.
 		 */
-		if (PageGetFreeSpace(page) < insertstate->itemsz &&
-			P_HAS_GARBAGE(lpageop))
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
 		{
-			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
-			insertstate->bounds_valid = false;
+			if (P_HAS_GARBAGE(lpageop))
+			{
+				_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+				insertstate->bounds_valid = false;
+			}
+
+			if (!checkingunique && PageGetFreeSpace(page) < insertstate->itemsz)
+			{
+				_bt_dedup_one_page(rel, insertstate->buf, heapRel,
+								   insertstate->itemsz);
+				insertstate->bounds_valid = false;	/* paranoia */
+			}
 		}
 	}
 	else
@@ -839,7 +857,31 @@ _bt_findinsertloc(Relation rel,
 	Assert(P_RIGHTMOST(lpageop) ||
 		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
 
-	return _bt_binsrch_insert(rel, insertstate);
+	location = _bt_binsrch_insert(rel, insertstate);
+
+	/*
+	 * Insertion is not prepared for the case where an LP_DEAD posting list
+	 * tuple must be split.  In the unlikely event that this happens, call
+	 * _bt_dedup_one_page() to force it to kill all LP_DEAD items.
+	 */
+	if (unlikely(insertstate->in_posting_offset == -1))
+	{
+		_bt_dedup_one_page(rel, insertstate->buf, heapRel, 0);
+		Assert(!P_HAS_GARBAGE(lpageop));
+
+		/* Must reset insertstate ahead of new _bt_binsrch_insert() call */
+		insertstate->bounds_valid = false;
+		insertstate->in_posting_offset = 0;
+		location = _bt_binsrch_insert(rel, insertstate);
+
+		/*
+		 * Might still have to split some other posting list now, but that
+		 * should never be LP_DEAD
+		 */
+		Assert(insertstate->in_posting_offset >= 0);
+	}
+
+	return location;
 }
 
 /*
@@ -900,15 +942,65 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
 	insertstate->bounds_valid = false;
 }
 
+/*
+ * If the new tuple 'itup' is a duplicate with a heap TID that falls inside
+ * the range of an existing posting list tuple 'oposting', generate new
+ * posting tuple to replace original one and update new tuple so that
+ * it's heap TID contains the rightmost heap TID of original posting tuple.
+ */
+IndexTuple
+_bt_form_newposting(IndexTuple itup, IndexTuple oposting,
+				   OffsetNumber in_posting_offset)
+{
+	int			nipd;
+	char	   *replacepos;
+	char	   *rightpos;
+	Size		nbytes;
+	IndexTuple nposting;
+
+	Assert(BTreeTupleIsPosting(oposting));
+	nipd = BTreeTupleGetNPosting(oposting);
+	Assert(in_posting_offset < nipd);
+
+	nposting = CopyIndexTuple(oposting);
+	replacepos = (char *) BTreeTupleGetPostingN(nposting, in_posting_offset);
+	rightpos = replacepos + sizeof(ItemPointerData);
+	nbytes = (nipd - in_posting_offset - 1) * sizeof(ItemPointerData);
+
+	/*
+	 * Move item pointers in posting list to make a gap for the new item's
+	 * heap TID (shift TIDs one place to the right, losing original
+	 * rightmost TID).
+	 */
+	memmove(rightpos, replacepos, nbytes);
+
+	/*
+	 * Fill the gap with the TID of the new item.
+	 */
+	ItemPointerCopy(&itup->t_tid, (ItemPointer) replacepos);
+
+	/*
+	 * Copy original (not new original) posting list's last TID into new
+	 * item
+	 */
+	ItemPointerCopy(BTreeTupleGetPostingN(oposting, nipd - 1), &itup->t_tid);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxTID(nposting),
+								BTreeTupleGetHeapTID(itup)) < 0);
+
+	return nposting;
+}
+
 /*----------
  *	_bt_insertonpg() -- Insert a tuple on a particular page in the index.
  *
  *		This recursive procedure does the following things:
  *
+ *			+  if necessary, splits an existing posting list on page.
+ *			   This is only needed when 'in_posting_offset' is non-zero.
  *			+  if necessary, splits the target page, using 'itup_key' for
  *			   suffix truncation on leaf pages (caller passes NULL for
  *			   non-leaf pages).
- *			+  inserts the tuple.
+ *			+  inserts the new tuple (could be from split posting list).
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
  *			   right using information stored in the parent stack).
@@ -918,7 +1010,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		On entry, we must have the correct buffer in which to do the
  *		insertion, and the buffer must be pinned and write-locked.  On return,
- *		we will have dropped both the pin and the lock on the buffer.
+ *		we will have dropped both the pin and the lock on the buffer.  Caller
+ *		should be prepared for us to scribble on 'itup'.
  *
  *		This routine only performs retail tuple insertions.  'itup' should
  *		always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +1029,15 @@ _bt_insertonpg(Relation rel,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
+			   int in_posting_offset,
 			   bool split_only_page)
 {
 	Page		page;
 	BTPageOpaque lpageop;
 	Size		itemsz;
+	IndexTuple	nposting = NULL;
+	IndexTuple	oposting;
+	IndexTuple	original_itup = NULL;
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1051,8 @@ _bt_insertonpg(Relation rel,
 	Assert(P_ISLEAF(lpageop) ||
 		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
+	/* retail insertions of posting list tuples are disallowed */
+	Assert(!BTreeTupleIsPosting(itup));
 
 	/* The caller should've finished any incomplete splits already. */
 	if (P_INCOMPLETE_SPLIT(lpageop))
@@ -965,6 +1064,47 @@ _bt_insertonpg(Relation rel,
 								 * need to be consistent */
 
 	/*
+	 * Do we need to split an existing posting list item?
+	 */
+	if (in_posting_offset != 0)
+	{
+		ItemId		itemid = PageGetItemId(page, newitemoff);
+
+		/*
+		 * The new tuple is a duplicate with a heap TID that falls inside the
+		 * range of an existing posting list tuple, so split posting list.
+		 *
+		 * Posting list splits always replace some existing TID in the posting
+		 * list with the new item's heap TID (based on a posting list offset
+		 * from caller) by removing rightmost heap TID from posting list.  The
+		 * new item's heap TID is swapped with that rightmost heap TID, almost
+		 * as if the tuple inserted never overlapped with a posting list in
+		 * the first place.  This allows the insertion and page split code to
+		 * have minimal special case handling of posting lists.
+		 *
+		 * The only extra handling required is to overwrite the original
+		 * posting list with nposting, which is guaranteed to be the same size
+		 * as the original, keeping the page space accounting simple.  This
+		 * takes place in either the page insert or page split critical
+		 * section.
+		 */
+		Assert(P_ISLEAF(lpageop));
+		Assert(!ItemIdIsDead(itemid));
+		Assert(in_posting_offset > 0);
+		oposting = (IndexTuple) PageGetItem(page, itemid);
+
+		/* save a copy of itup with unchanged TID to write it into xlog record */
+		original_itup = CopyIndexTuple(itup);
+
+		nposting = _bt_form_newposting(itup, oposting, in_posting_offset);
+
+		Assert(BTreeTupleGetNPosting(nposting) == BTreeTupleGetNPosting(oposting));
+
+		/* Alter new item offset, since effective new item changed */
+		newitemoff = OffsetNumberNext(newitemoff);
+	}
+
+	/*
 	 * Do we need to split the page to fit the item on it?
 	 *
 	 * Note: PageGetFreeSpace() subtracts sizeof(ItemIdData) from its result,
@@ -996,7 +1136,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+						 original_itup, nposting, in_posting_offset);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1075,6 +1216,18 @@ _bt_insertonpg(Relation rel,
 			elog(PANIC, "failed to add new item to block %u in index \"%s\"",
 				 itup_blkno, RelationGetRelationName(rel));
 
+		if (nposting)
+		{
+			/*
+			 * Handle a posting list split by performing an in-place update of
+			 * the existing posting list
+			 */
+			Assert(P_ISLEAF(lpageop));
+			Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+				   MAXALIGN(IndexTupleSize(nposting)));
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+		}
+
 		MarkBufferDirty(buf);
 
 		if (BufferIsValid(metabuf))
@@ -1116,6 +1269,7 @@ _bt_insertonpg(Relation rel,
 			XLogRecPtr	recptr;
 
 			xlrec.offnum = itup_off;
+			xlrec.in_posting_offset = in_posting_offset;
 
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
@@ -1152,7 +1306,10 @@ _bt_insertonpg(Relation rel,
 			}
 
 			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
-			XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+			if (original_itup)
+				XLogRegisterBufData(0, (char *) original_itup, IndexTupleSize(original_itup));
+			else
+				XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
 
 			recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
@@ -1194,6 +1351,13 @@ _bt_insertonpg(Relation rel,
 			_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
 			RelationSetTargetBlock(rel, cachedBlock);
 	}
+
+	/* be tidy */
+	if (nposting)
+		pfree(nposting);
+	if (original_itup)
+		pfree(original_itup);
+
 }
 
 /*
@@ -1211,10 +1375,17 @@ _bt_insertonpg(Relation rel,
  *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
+ *
+ *		nposting is a replacement posting for the posting list at the
+ *		offset immediately before the new item's offset.  This is needed
+ *		when caller performed "posting list split", and corresponds to the
+ *		same step for retail insertions that don't split the page.
  */
 static Buffer
 _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+		  IndexTuple original_newitem,
+		  IndexTuple nposting, OffsetNumber in_posting_offset)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1236,6 +1407,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	OffsetNumber firstright;
 	OffsetNumber maxoff;
 	OffsetNumber i;
+	OffsetNumber replacepostingoff = InvalidOffsetNumber;
 	bool		newitemonleft,
 				isleaf;
 	IndexTuple	lefthikey;
@@ -1243,6 +1415,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
 	/*
+	 * Determine offset number of posting list that will be updated in place
+	 * as part of split that follows a posting list split
+	 */
+	if (nposting != NULL)
+		replacepostingoff = OffsetNumberPrev(newitemoff);
+
+	/*
 	 * origpage is the original page to be split.  leftpage is a temporary
 	 * buffer that receives the left-sibling data, which will be copied back
 	 * into origpage on success.  rightpage is the new page that will receive
@@ -1273,6 +1452,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	 * newitemoff == firstright.  In all other cases it's clear which side of
 	 * the split every tuple goes on from context.  newitemonleft is usually
 	 * (but not always) redundant information.
+	 *
+	 * Note: In theory, the split point choice logic should operate against a
+	 * version of the page that already replaced the posting list at offset
+	 * replacepostingoff with nposting where applicable.  We don't bother with
+	 * that, though.  Both versions of the posting list must be the same size
+	 * and have the same key values, so this omission can't affect the split
+	 * point chosen in practice.
 	 */
 	firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
 								  newitem, &newitemonleft);
@@ -1340,6 +1526,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, firstright);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		/* Behave as if origpage posting list has already been swapped */
+		if (firstright == replacepostingoff)
+			item = nposting;
 	}
 
 	/*
@@ -1373,6 +1562,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
 			itemid = PageGetItemId(origpage, lastleftoff);
 			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+			/* Behave as if origpage posting list has already been swapped */
+			if (lastleftoff == replacepostingoff)
+				lastleft = nposting;
 		}
 
 		Assert(lastleft != item);
@@ -1480,8 +1672,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 
+		/*
+		 * did caller pass new replacement posting list tuple due to posting
+		 * list split?
+		 */
+		if (i == replacepostingoff)
+		{
+			/*
+			 * swap origpage posting list with post-posting-list-split version
+			 * from caller
+			 */
+			Assert(isleaf);
+			Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+			item = nposting;
+		}
+
 		/* does new item belong before this one? */
-		if (i == newitemoff)
+		else if (i == newitemoff)
 		{
 			if (newitemonleft)
 			{
@@ -1653,6 +1860,17 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		xlrec.firstright = firstright;
 		xlrec.newitemoff = newitemoff;
 
+		/*
+		 * If replacing posting item was put on the right page,
+		 * we don't need to explicitly WAL log it because it's included
+		 * with all the other items on the right page.
+		 * Otherwise, save in_posting_offset and newitem to construct
+		 * replacing tuple.
+		 */
+		xlrec.in_posting_offset = InvalidOffsetNumber;
+		if (replacepostingoff < firstright)
+			xlrec.in_posting_offset = in_posting_offset;
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
 
@@ -1672,9 +1890,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		 * is not stored if XLogInsert decides it needs a full-page image of
 		 * the left page.  We store the offset anyway, though, to support
 		 * archive compression of these records.
+		 *
+		 * Also save newitem in case posting split was required
+		 * to construct new posting.
 		 */
-		if (newitemonleft)
-			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+		if (newitemonleft || xlrec.in_posting_offset)
+		{
+			if (xlrec.in_posting_offset)
+			{
+				Assert(original_newitem != NULL);
+				Assert(ItemPointerCompare(&original_newitem->t_tid, &newitem->t_tid) != 0);
+
+				XLogRegisterBufData(0, (char *) original_newitem,
+									MAXALIGN(IndexTupleSize(original_newitem)));
+			}
+			else
+				XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+		}
 
 		/* Log the left page's new high key */
 		itemid = PageGetItemId(origpage, P_HIKEY);
@@ -1834,7 +2066,7 @@ _bt_insert_parent(Relation rel,
 
 		/* Recursively insert into the parent */
 		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
-					   new_item, stack->bts_offset + 1,
+					   new_item, stack->bts_offset + 1, 0,
 					   is_only);
 
 		/* be tidy */
@@ -2304,6 +2536,415 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * Note: if we didn't find any LP_DEAD items, then the page's
 	 * BTP_HAS_GARBAGE hint bit is falsely set.  We do not bother expending a
 	 * separate write to clear it, however.  We will clear it when we split
-	 * the page.
+	 * the page (or when deduplication runs).
+	 */
+}
+
+/*
+ * Try to deduplicate items to free some space.  If we don't proceed with
+ * deduplication, buffer will contain old state of the page.
+ *
+ * 'itemsz' is the size of the inserter caller's incoming/new tuple, not
+ * including line pointer overhead.  This is the amount of space we'll need to
+ * free in order to let caller avoid splitting the page.
+ *
+ * This function should be called after LP_DEAD items were removed by
+ * _bt_vacuum_one_page() to prevent a page split.  (It's possible that we'll
+ * have to kill additional LP_DEAD items, but that should be rare.)
+ */
+static void
+_bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+				   Size newitemsz)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buffer);
+	Page		newpage;
+	BTPageOpaque oopaque,
+				nopaque;
+	bool		deduplicate = false;
+	BTDedupState *dedupState = NULL;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+	OffsetNumber deletable[MaxOffsetNumber];
+	int			ndeletable = 0;
+	Size		pagesaving = 0;
+
+	/*
+	 * Don't use deduplication for indexes with INCLUDEd columns and unique
+	 * indexes
+	 */
+	deduplicate = (IndexRelationGetNumberOfKeyAttributes(rel) ==
+				   IndexRelationGetNumberOfAttributes(rel) &&
+				   !rel->rd_index->indisunique);
+	if (!deduplicate)
+		return;
+
+	oopaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	/* init deduplication state needed to build posting tuples */
+	dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+	dedupState->ipd = NULL;
+	dedupState->ntuples = 0;
+	dedupState->alltupsize = 0;
+	dedupState->itupprev = NULL;
+	dedupState->maxitemsize = BTMaxItemSize(page);
+	dedupState->maxpostingsize = 0;
+
+	minoff = P_FIRSTDATAKEY(oopaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/*
+	 * Delete dead tuples if any. We cannot simply skip them in the cycle
+	 * below, because it's necessary to generate special Xlog record
+	 * containing such tuples to compute latestRemovedXid on a standby server
+	 * later.
+	 *
+	 * This should not affect performance, since it only can happen in a rare
+	 * situation when BTP_HAS_GARBAGE flag was not set and _bt_vacuum_one_page
+	 * was not called, or _bt_vacuum_one_page didn't remove all dead items.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+
+		if (ItemIdIsDead(itemid))
+			deletable[ndeletable++] = offnum;
+	}
+
+	if (ndeletable > 0)
+	{
+		/*
+		 * Skip duplication in rare cases where there were LP_DEAD items
+		 * encountered here when that frees sufficient space for caller to
+		 * avoid a page split
+		 */
+		_bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+		if (PageGetFreeSpace(page) >= newitemsz)
+		{
+			pfree(dedupState);
+			return;
+		}
+
+		/* Continue with deduplication */
+		minoff = P_FIRSTDATAKEY(oopaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+	}
+
+	/*
+	 * Scan over all items to see which ones can be deduplicated
+	 */
+	newpage = PageGetTempPageCopySpecial(page);
+	nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+	/*
+	 * Copy the original page's LSN into newpage, which will become the
+	 * updated version of the page.  We need this because XLogInsert will
+	 * examine the LSN and possibly dump it in a page image.
+	 */
+	PageSetLSN(newpage, PageGetLSN(page));
+
+	/* Make sure that new page won't have garbage flag set */
+	nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+	/* Copy High Key if any */
+	if (!P_RIGHTMOST(oopaque))
+	{
+		ItemId		hitemid = PageGetItemId(page, P_HIKEY);
+		Size		hitemsz = ItemIdGetLength(hitemid);
+		IndexTuple	hitem = (IndexTuple) PageGetItem(page, hitemid);
+
+		if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to add highkey during deduplication");
+	}
+
+	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+	newitemsz += sizeof(ItemIdData);
+
+	/*
+	 * Iterate over tuples on the page, try to deduplicate them into posting
+	 * lists and insert into new page.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(!ItemIdIsDead(itemid));
+
+		if (dedupState->itupprev == NULL)
+		{
+			/* Just set up base/first item in first iteration */
+			Assert(offnum == minoff);
+			dedupState->itupprev = CopyIndexTuple(itup);
+			dedupState->itupprev_off = offnum;
+			continue;
+		}
+
+		if (deduplicate &&
+			_bt_keep_natts_fast(rel, dedupState->itupprev, itup) > natts)
+		{
+			int			itup_ntuples;
+			Size		projpostingsz;
+
+			/*
+			 * Tuples are equal.
+			 *
+			 * If posting list does not exceed tuple size limit then append
+			 * the tuple to the pending posting list.  Otherwise, insert it on
+			 * page and continue with this tuple as new pending posting list.
+			 */
+			itup_ntuples = BTreeTupleIsPosting(itup) ?
+				BTreeTupleGetNPosting(itup) : 1;
+
+			/*
+			 * Project size of new posting list that would result from merging
+			 * current tup with pending posting list (could just be prev item
+			 * that's "pending").
+			 *
+			 * This accounting looks odd, but it's correct because ...
+			 */
+			projpostingsz = MAXALIGN(IndexTupleSize(dedupState->itupprev) +
+									 (dedupState->ntuples + itup_ntuples + 1) *
+									 sizeof(ItemPointerData));
+
+			if (projpostingsz <= dedupState->maxitemsize)
+				_bt_stash_item_tid(dedupState, itup, offnum);
+			else
+				pagesaving += _bt_dedup_insert(newpage, dedupState);
+		}
+		else
+		{
+			/*
+			 * Tuples are not equal, or we're done deduplicating items on this
+			 * page.
+			 *
+			 * Insert pending posting list on page.  This could just be a
+			 * regular tuple.
+			 */
+			pagesaving += _bt_dedup_insert(newpage, dedupState);
+		}
+
+		/*
+		 * When we have deduplicated enough to avoid page split, don't bother
+		 * deduplicating any more items.
+		 *
+		 * FIXME: If rewriting the page and doing the WAL logging were
+		 * incremental, we could actually break out of the loop and save real
+		 * work.  As things stand this is a loss for performance, but it
+		 * barely affects space utilization. (The number of blocks are the
+		 * same as before, except for rounding effects.  The minimum number of
+		 * items on each page for each index "increases" when this is enabled,
+		 * however.)
+		 */
+		if (pagesaving >= newitemsz)
+			deduplicate = false;
+
+		pfree(dedupState->itupprev);
+		dedupState->itupprev = CopyIndexTuple(itup);
+		dedupState->itupprev_off = offnum;
+
+		Assert(IndexTupleSize(dedupState->itupprev) <= dedupState->maxitemsize);
+	}
+
+	/* Handle the last item */
+	pagesaving += _bt_dedup_insert(newpage, dedupState);
+
+	/*
+	 * If no items suitable for deduplication were found, newpage must be
+	 * exactly the same as the original page, so just return from function.
+	 */
+	if (dedupState->n_intervals == 0)
+	{
+		pfree(dedupState);
+		return;
+	}
+
+	START_CRIT_SECTION();
+
+	PageRestoreTempPage(newpage, page);
+	MarkBufferDirty(buffer);
+
+	/* Log deduplicated items */
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+		xl_btree_dedup xlrec_dedup;
+
+		xlrec_dedup.n_intervals =  dedupState->n_intervals;
+
+		XLogBeginInsert();
+		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+		XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+		/* only save non-empthy part of the array */
+		if (dedupState->n_intervals > 0)
+			XLogRegisterData((char *) dedupState->dedup_intervals,
+							 dedupState->n_intervals * sizeof(dedupInterval));
+
+		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP_PAGE);
+
+		PageSetLSN(page, recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	/* be tidy */
+	pfree(dedupState);
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in dedupState.
+ *
+ * 'itup' is current tuple on page, which comes immediately after equal
+ * 'itupprev' tuple stashed in dedup state at the point we're called.
+ *
+ * Helper function for _bt_load() and _bt_dedup_one_page(), called when it
+ * becomes clear that pending itupprev item will be part of a new/pending
+ * posting list, or when a pending/new posting list will contain a new heap
+ * TID from itup.
+ *
+ * Note: caller is responsible for the BTMaxItemSize() check.
+ */
+void
+_bt_stash_item_tid(BTDedupState *dedupState, IndexTuple itup,
+				   OffsetNumber itup_offnum)
+{
+	int			nposting = 0;
+
+	if (dedupState->ntuples == 0)
+	{
+		dedupState->ipd = palloc0(dedupState->maxitemsize);
+		dedupState->alltupsize =
+				MAXALIGN(IndexTupleSize(dedupState->itupprev)) +
+				sizeof(ItemIdData);
+
+		/*
+		 * itupprev hasn't had its posting list TIDs copied into ipd yet (must
+		 * have been first on page and/or in new posting list?).  Do so now.
+		 *
+		 * This is delayed because it wasn't initially clear whether or not
+		 * itupprev would be merged with the next tuple, or stay as-is.  By
+		 * now caller compared it against itup and found that it was equal, so
+		 * we can go ahead and add its TIDs.
+		 */
+		if (!BTreeTupleIsPosting(dedupState->itupprev))
+		{
+			memcpy(dedupState->ipd, dedupState->itupprev,
+				   sizeof(ItemPointerData));
+			dedupState->ntuples++;
+		}
+		else
+		{
+			/* if itupprev is posting, add all its TIDs to the posting list */
+			nposting = BTreeTupleGetNPosting(dedupState->itupprev);
+			memcpy(dedupState->ipd,
+				   BTreeTupleGetPosting(dedupState->itupprev),
+				   sizeof(ItemPointerData) * nposting);
+			dedupState->ntuples += nposting;
+		}
+
+		/* Save info about deduplicated items for future xlog record */
+		dedupState->n_intervals++;
+		/* Save offnum of the first item belongin to the group */
+		dedupState->dedup_intervals[dedupState->n_intervals - 1].from = dedupState->itupprev_off;
+		/*
+		 * Update the number of deduplicated items, belonging to this group.
+		 * Count each item just once, no matter if it was posting tuple or not
+		 */
+		dedupState->dedup_intervals[dedupState->n_intervals - 1].ntups++;
+	}
+
+	/*
+	 * Add current tup to ipd for pending posting list for new version of
+	 * page.
+	 */
+	if (!BTreeTupleIsPosting(itup))
+	{
+		memcpy(dedupState->ipd + dedupState->ntuples, itup,
+			   sizeof(ItemPointerData));
+		dedupState->ntuples++;
+	}
+	else
+	{
+		/*
+		 * if tuple is posting, add all its TIDs to the pending list that will
+		 * become new posting list later on
+		 */
+		nposting = BTreeTupleGetNPosting(itup);
+		memcpy(dedupState->ipd + dedupState->ntuples,
+			   BTreeTupleGetPosting(itup),
+			   sizeof(ItemPointerData) * nposting);
+		dedupState->ntuples += nposting;
+	}
+
+	dedupState->alltupsize +=
+			MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
+
+	/*
+	 * Update the number of deduplicated items, belonging to this group.
+	 * Count each item just once, no matter if it was posting tuple or not
 	 */
+	dedupState->dedup_intervals[dedupState->n_intervals - 1].ntups++;
+
+	/* TODO just a debug message. delete it in final version of the patch */
+	if (itup_offnum != InvalidOffsetNumber)
+		elog(DEBUG4, "_bt_stash_item_tid. N %d : from %u ntups %u",
+				dedupState->n_intervals,
+				dedupState->dedup_intervals[dedupState->n_intervals - 1].from,
+				dedupState->dedup_intervals[dedupState->n_intervals - 1].ntups);
+}
+
+/*
+ * Add new posting tuple item to the page based on itupprev and saved list of
+ * heap TIDs.
+ */
+Size
+_bt_dedup_insert(Page page, BTDedupState *dedupState)
+{
+	IndexTuple	itup;
+	Size		spacesaving = 0;
+
+	if (dedupState->ntuples == 0)
+	{
+		/*
+		 * Use original itupprev, which may or may not be a posting list
+		 * already from some earlier dedup attempt
+		 */
+		itup = dedupState->itupprev;
+	}
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(dedupState->itupprev,
+											 dedupState->ipd,
+											 dedupState->ntuples);
+
+		spacesaving = dedupState->alltupsize -
+			(MAXALIGN(IndexTupleSize(postingtuple)) + sizeof(ItemIdData));
+		Assert(spacesaving > 0 && spacesaving < BLCKSZ);
+
+		itup = postingtuple;
+		pfree(dedupState->ipd);
+	}
+
+	Assert(IndexTupleSize(dedupState->itupprev) <= dedupState->maxitemsize);
+	/* Add the new item into the page */
+	if (PageAddItem(page, (Item) itup, IndexTupleSize(itup),
+					OffsetNumberNext(PageGetMaxOffsetNumber(page)),
+					false, false) == InvalidOffsetNumber)
+		elog(ERROR, "deduplication failed to add tuple to page");
+
+	if (dedupState->ntuples > 0)
+		pfree(itup);
+	dedupState->ntuples = 0;
+	dedupState->alltupsize = 0;
+
+	return spacesaving;
 }
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 268f869..5314bbe 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -42,6 +43,11 @@ static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
 								   BlockNumber *target, BlockNumber *rightsib);
 static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
 							   TransactionId latestRemovedXid);
+static TransactionId _bt_compute_xid_horizon_for_tuples(Relation rel,
+														Relation heapRel,
+														Buffer buf,
+														OffsetNumber *itemnos,
+														int nitems);
 
 /*
  *	_bt_initmetapage() -- Fill a page buffer with a correct metapage image
@@ -983,14 +989,52 @@ _bt_page_recyclable(Page page)
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
 					OffsetNumber *itemnos, int nitems,
+					OffsetNumber *remainingoffset,
+					IndexTuple *remaining, int nremaining,
 					BlockNumber lastBlockVacuumed)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	Size		itemsz;
+	Size		remaining_sz = 0;
+	char	   *remaining_buf = NULL;
+
+	/* XLOG stuff, buffer for remainings */
+	if (nremaining && RelationNeedsWAL(rel))
+	{
+		Size		offset = 0;
+
+		for (int i = 0; i < nremaining; i++)
+			remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+		remaining_buf = palloc0(remaining_sz);
+		for (int i = 0; i < nremaining; i++)
+		{
+			itemsz = IndexTupleSize(remaining[i]);
+			memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+			offset += MAXALIGN(itemsz);
+		}
+		Assert(offset == remaining_sz);
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
+	/* Handle posting tuples here */
+	for (int i = 0; i < nremaining; i++)
+	{
+		/* At first, delete the old tuple. */
+		PageIndexTupleDelete(page, remainingoffset[i]);
+
+		itemsz = IndexTupleSize(remaining[i]);
+		itemsz = MAXALIGN(itemsz);
+
+		/* Add tuple with remaining ItemPointers to the page. */
+		if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to rewrite posting list item in index while doing vacuum");
+	}
+
 	/* Fix the page */
 	if (nitems > 0)
 		PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1064,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+		xlrec_vacuum.nremaining = nremaining;
+		xlrec_vacuum.ndeleted = nitems;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1079,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		if (nitems > 0)
 			XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
 
+		/*
+		 * Here we should save offnums and remaining tuples themselves. It's
+		 * important to restore them in correct order. At first, we must
+		 * handle remaining tuples and only after that other deleted items.
+		 */
+		if (nremaining > 0)
+		{
+			Assert(remaining_buf != NULL);
+			XLogRegisterBufData(0, (char *) remainingoffset,
+								nremaining * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, remaining_buf, remaining_sz);
+		}
+
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
 		PageSetLSN(page, recptr);
@@ -1042,6 +1101,91 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 }
 
 /*
+ * Get the latestRemovedXid from the table entries pointed at by the index
+ * tuples being deleted.
+ *
+ * This is a version of index_compute_xid_horizon_for_tuples() specialized to
+ * nbtree, which can handle posting lists.
+ */
+static TransactionId
+_bt_compute_xid_horizon_for_tuples(Relation rel, Relation heapRel,
+								   Buffer buf, OffsetNumber *itemnos,
+								   int nitems)
+{
+	ItemPointerData *ttids;
+	TransactionId latestRemovedXid = InvalidTransactionId;
+	Page		page = BufferGetPage(buf);
+	int			arraynitems;
+	int			finalnitems;
+
+	/*
+	 * Initial size of array can fit everything when it turns out that are no
+	 * posting lists
+	 */
+	arraynitems = nitems;
+	ttids = (ItemPointerData *) palloc(sizeof(ItemPointerData) * arraynitems);
+
+	finalnitems = 0;
+	/* identify what the index tuples about to be deleted point to */
+	for (int i = 0; i < nitems; i++)
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+
+		itemid = PageGetItemId(page, itemnos[i]);
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(ItemIdIsDead(itemid));
+
+		if (!BTreeTupleIsPosting(itup))
+		{
+			/* Make sure that we have space for additional heap TID */
+			if (finalnitems + 1 > arraynitems)
+			{
+				arraynitems = arraynitems * 2;
+				ttids = (ItemPointerData *)
+					repalloc(ttids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			Assert(ItemPointerIsValid(&itup->t_tid));
+			ItemPointerCopy(&itup->t_tid, &ttids[finalnitems]);
+			finalnitems++;
+		}
+		else
+		{
+			int			nposting = BTreeTupleGetNPosting(itup);
+
+			/* Make sure that we have space for additional heap TIDs */
+			if (finalnitems + nposting > arraynitems)
+			{
+				arraynitems = Max(arraynitems * 2, finalnitems + nposting);
+				ttids = (ItemPointerData *)
+					repalloc(ttids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			for (int j = 0; j < nposting; j++)
+			{
+				ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+				Assert(ItemPointerIsValid(htid));
+				ItemPointerCopy(htid, &ttids[finalnitems]);
+				finalnitems++;
+			}
+		}
+	}
+
+	Assert(finalnitems >= nitems);
+
+	/* determine the actual xid horizon */
+	latestRemovedXid =
+		table_compute_xid_horizon_for_tuples(heapRel, ttids, finalnitems);
+
+	pfree(ttids);
+
+	return latestRemovedXid;
+}
+
+/*
  * Delete item(s) from a btree page during single-page cleanup.
  *
  * As above, must only be used on leaf pages.
@@ -1067,8 +1211,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 
 	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
 		latestRemovedXid =
-			index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
-												 itemnos, nitems);
+			_bt_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+											   itemnos, nitems);
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd528..6759531 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static ItemPointer btreevacuumPosting(BTVacState *vstate, IndexTuple itup,
+									  int *nremaining);
 
 
 /*
@@ -263,8 +265,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 				 */
 				if (so->killedItems == NULL)
 					so->killedItems = (int *)
-						palloc(MaxIndexTuplesPerPage * sizeof(int));
-				if (so->numKilled < MaxIndexTuplesPerPage)
+						palloc(MaxPostingIndexTuplesPerPage * sizeof(int));
+				if (so->numKilled < MaxPostingIndexTuplesPerPage)
 					so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 			}
 
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 								 RBM_NORMAL, info->strategy);
 		LockBufferForCleanup(buf);
 		_bt_checkpage(rel, buf);
-		_bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+		_bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+							vstate.lastBlockVacuumed);
 		_bt_relbuf(rel, buf);
 	}
 
@@ -1193,6 +1196,9 @@ restart:
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
+		IndexTuple	remaining[MaxOffsetNumber];
+		OffsetNumber remainingoffset[MaxOffsetNumber];
+		int			nremaining;
 
 		/*
 		 * Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1235,7 @@ restart:
 		 * callback function.
 		 */
 		ndeletable = 0;
+		nremaining = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
 		if (callback)
@@ -1242,31 +1249,79 @@ restart:
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
-				/*
-				 * During Hot Standby we currently assume that
-				 * XLOG_BTREE_VACUUM records do not produce conflicts. That is
-				 * only true as long as the callback function depends only
-				 * upon whether the index tuple refers to heap tuples removed
-				 * in the initial heap scan. When vacuum starts it derives a
-				 * value of OldestXmin. Backends taking later snapshots could
-				 * have a RecentGlobalXmin with a later xid than the vacuum's
-				 * OldestXmin, so it is possible that row versions deleted
-				 * after OldestXmin could be marked as killed by other
-				 * backends. The callback function *could* look at the index
-				 * tuple state in isolation and decide to delete the index
-				 * tuple, though currently it does not. If it ever did, we
-				 * would need to reconsider whether XLOG_BTREE_VACUUM records
-				 * should cause conflicts. If they did cause conflicts they
-				 * would be fairly harsh conflicts, since we haven't yet
-				 * worked out a way to pass a useful value for
-				 * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
-				 * applies to *any* type of index that marks index tuples as
-				 * killed.
-				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				if (BTreeTupleIsPosting(itup))
+				{
+					int			nnewipd = 0;
+					ItemPointer newipd = NULL;
+
+					newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+					if (nnewipd == 0)
+					{
+						/*
+						 * All TIDs from posting list must be deleted, we can
+						 * delete whole tuple in a regular way.
+						 */
+						deletable[ndeletable++] = offnum;
+					}
+					else if (nnewipd == BTreeTupleGetNPosting(itup))
+					{
+						/*
+						 * All TIDs from posting tuple must remain. Do
+						 * nothing, just cleanup.
+						 */
+						pfree(newipd);
+					}
+					else if (nnewipd < BTreeTupleGetNPosting(itup))
+					{
+						/* Some TIDs from posting tuple must remain. */
+						Assert(nnewipd > 0);
+						Assert(newipd != NULL);
+
+						/*
+						 * Form new tuple that contains only remaining TIDs.
+						 * Remember this tuple and the offset of the old tuple
+						 * to update it in place.
+						 */
+						remainingoffset[nremaining] = offnum;
+						remaining[nremaining] =
+							BTreeFormPostingTuple(itup, newipd, nnewipd);
+						nremaining++;
+						pfree(newipd);
+
+						Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+					}
+				}
+				else
+				{
+					htup = &(itup->t_tid);
+
+					/*
+					 * During Hot Standby we currently assume that
+					 * XLOG_BTREE_VACUUM records do not produce conflicts.
+					 * That is only true as long as the callback function
+					 * depends only upon whether the index tuple refers to
+					 * heap tuples removed in the initial heap scan. When
+					 * vacuum starts it derives a value of OldestXmin.
+					 * Backends taking later snapshots could have a
+					 * RecentGlobalXmin with a later xid than the vacuum's
+					 * OldestXmin, so it is possible that row versions deleted
+					 * after OldestXmin could be marked as killed by other
+					 * backends. The callback function *could* look at the
+					 * index tuple state in isolation and decide to delete the
+					 * index tuple, though currently it does not. If it ever
+					 * did, we would need to reconsider whether
+					 * XLOG_BTREE_VACUUM records should cause conflicts. If
+					 * they did cause conflicts they would be fairly harsh
+					 * conflicts, since we haven't yet worked out a way to
+					 * pass a useful value for latestRemovedXid on the
+					 * XLOG_BTREE_VACUUM records. This applies to *any* type
+					 * of index that marks index tuples as killed.
+					 */
+					if (callback(htup, callback_state))
+						deletable[ndeletable++] = offnum;
+				}
 			}
 		}
 
@@ -1274,7 +1329,7 @@ restart:
 		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
 		 * call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nremaining > 0)
 		{
 			/*
 			 * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1346,7 @@ restart:
 			 * that.
 			 */
 			_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+								remainingoffset, remaining, nremaining,
 								vstate->lastBlockVacuumed);
 
 			/*
@@ -1376,6 +1432,41 @@ restart:
 }
 
 /*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+	int			remaining = 0;
+	int			nitem = BTreeTupleGetNPosting(itup);
+	ItemPointer tmpitems = NULL,
+				items = BTreeTupleGetPosting(itup);
+
+	/*
+	 * Check each tuple in the posting list, save alive tuples into tmpitems
+	 */
+	for (int i = 0; i < nitem; i++)
+	{
+		if (vstate->callback(items + i, vstate->callback_state))
+			continue;
+
+		if (tmpitems == NULL)
+			tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+		tmpitems[remaining++] = items[i];
+	}
+
+	*nremaining = remaining;
+	return tmpitems;
+}
+
+/*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
  * btrees always do, so this is trivial.
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8e51246..821e808 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int	_bt_binsrch_posting(BTScanInsert key, Page page,
+								OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static void _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+								  OffsetNumber offnum, ItemPointer heapTid,
+								  IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+									   OffsetNumber offnum,
+									   ItemPointer heapTid);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -434,7 +442,10 @@ _bt_binsrch(Relation rel,
  * low) makes bounds invalid.
  *
  * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's in_posting_offset field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
  */
 OffsetNumber
 _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +464,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 	Assert(P_ISLEAF(opaque));
 	Assert(!key->nextkey);
+	Assert(insertstate->in_posting_offset == 0);
 
 	if (!insertstate->bounds_valid)
 	{
@@ -509,6 +521,17 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 			if (result != 0)
 				stricthigh = high;
 		}
+
+		/*
+		 * If tuple at offset located by binary search is a posting list whose
+		 * TID range overlaps with caller's scantid, perform posting list
+		 * binary search to set in_posting_offset for caller.  Caller must
+		 * split the posting list when in_posting_offset is set.  This should
+		 * happen infrequently.
+		 */
+		if (unlikely(result == 0 && key->scantid != NULL))
+			insertstate->in_posting_offset =
+				_bt_binsrch_posting(key, page, mid);
 	}
 
 	/*
@@ -529,6 +552,68 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 }
 
 /*----------
+ *	_bt_binsrch_posting() -- posting list binary search.
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+	IndexTuple	itup;
+	ItemId		itemid;
+	int			low,
+				high,
+				mid,
+				res;
+
+	/*
+	 * If this isn't a posting tuple, then the index must be corrupt (if it is
+	 * an ordinary non-pivot tuple then there must be an existing tuple with a
+	 * heap TID that equals inserter's new heap TID/scantid).  Defensively
+	 * check that tuple is a posting list tuple whose posting list range
+	 * includes caller's scantid.
+	 *
+	 * (This is also needed because contrib/amcheck's rootdescend option needs
+	 * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+	 */
+	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+	Assert(!key->nextkey);
+	Assert(key->scantid != NULL);
+	itemid = PageGetItemId(page, offnum);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+	if (!BTreeTupleIsPosting(itup))
+		return 0;
+
+	/*
+	 * In the unlikely event that posting list tuple has LP_DEAD bit set,
+	 * signal to caller that it should kill the item and restart its binary
+	 * search.
+	 */
+	if (ItemIdIsDead(itemid))
+		return -1;
+
+	/* "high" is past end of posting list for loop invariant */
+	low = 0;
+	high = BTreeTupleGetNPosting(itup);
+	Assert(high >= 2);
+
+	while (high > low)
+	{
+		mid = low + ((high - low) / 2);
+		res = ItemPointerCompare(key->scantid,
+								 BTreeTupleGetPostingN(itup, mid));
+
+		if (res >= 1)
+			low = mid + 1;
+		else
+			high = mid;
+	}
+
+	return low;
+}
+
+/*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
  *	page/offnum: location of btree item to be compared to.
@@ -537,9 +622,18 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
  *			<0 if scankey < tuple at offnum;
  *			 0 if scankey == tuple at offnum;
  *			>0 if scankey > tuple at offnum.
- *		NULLs in the keys are treated as sortable values.  Therefore
- *		"equality" does not necessarily mean that the item should be
- *		returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values.  Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key.  Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid.  There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
+ *
+ * It is generally guaranteed that any possible scankey with scantid set
+ * will have zero or one tuples in the index that are considered equal
+ * here.
  *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
@@ -563,6 +657,7 @@ _bt_compare(Relation rel,
 	ScanKey		scankey;
 	int			ncmpkey;
 	int			ntupatts;
+	int32		result;
 
 	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +692,6 @@ _bt_compare(Relation rel,
 	{
 		Datum		datum;
 		bool		isNull;
-		int32		result;
 
 		datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
 
@@ -713,8 +807,24 @@ _bt_compare(Relation rel,
 	if (heapTid == NULL)
 		return 1;
 
+	/*
+	 * scankey must be treated as equal to a posting list tuple if its scantid
+	 * value falls within the range of the posting list.  In all other cases
+	 * there can only be a single heap TID value, which is compared directly
+	 * as a simple scalar value.
+	 */
 	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	return ItemPointerCompare(key->scantid, heapTid);
+	result = ItemPointerCompare(key->scantid, heapTid);
+	if (!BTreeTupleIsPosting(itup) || result <= 0)
+		return result;
+	else
+	{
+		result = ItemPointerCompare(key->scantid, BTreeTupleGetMaxTID(itup));
+		if (result > 0)
+			return 1;
+	}
+
+	return 0;
 }
 
 /*
@@ -1451,6 +1561,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	/* initialize tuple workspace to empty */
 	so->currPos.nextTupleOffset = 0;
+	so->currPos.postingTupleOffset = 0;
 
 	/*
 	 * Now that the current page has been made consistent, the macro should be
@@ -1485,8 +1596,29 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				if (!BTreeTupleIsPosting(itup))
+				{
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+				else
+				{
+					/*
+					 * Setup state to return posting list, and save first
+					 * "logical" tuple
+					 */
+					_bt_setuppostingitems(so, itemIndex, offnum,
+										  BTreeTupleGetPostingN(itup, 0),
+										  itup);
+					itemIndex++;
+					/* Save additional posting list "logical" tuples */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i));
+						itemIndex++;
+					}
+				}
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1519,7 +1651,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1527,7 +1659,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxPostingIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1569,8 +1701,36 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (passes_quals && tuple_alive)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				if (!BTreeTupleIsPosting(itup))
+				{
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+				else
+				{
+					int			i = BTreeTupleGetNPosting(itup) - 1;
+
+					/*
+					 * Setup state to return posting list, and save last
+					 * "logical" tuple from posting list (since it's the first
+					 * that will be returned to scan).
+					 */
+					itemIndex--;
+					_bt_setuppostingitems(so, itemIndex, offnum,
+										  BTreeTupleGetPostingN(itup, i--),
+										  itup);
+
+					/*
+					 * Return posting list "logical" tuples -- do this in
+					 * descending order, to match overall scan order
+					 */
+					for (; i >= 0; i--)
+					{
+						itemIndex--;
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i));
+					}
+				}
 			}
 			if (!continuescan)
 			{
@@ -1584,8 +1744,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1758,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1611,6 +1773,59 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 }
 
 /*
+ * Setup state to save posting items from a single posting list tuple.  Saves
+ * the logical tuple that will be returned to scan first in passing.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for logical tuple
+ * that is returned to scan first.  Second or subsequent heap TID for posting
+ * list should be saved by calling _bt_savepostingitem().
+ */
+static void
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					  ItemPointer heapTid, IndexTuple itup)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		/* Save a truncated version of the IndexTuple */
+		Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+		itupsz = MAXALIGN(itupsz);
+		currItem->tupleOffset = so->currPos.nextTupleOffset;
+		memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+		so->currPos.nextTupleOffset += itupsz;
+		so->currPos.postingTupleOffset = currItem->tupleOffset;
+	}
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer heapTid)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+
+	/*
+	 * Have index-only scans return the same truncated IndexTuple for
+	 * every logical tuple that originates from the same posting list
+	 */
+	if (so->currTuples)
+		currItem->tupleOffset = so->currPos.postingTupleOffset;
+}
+
+/*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
  * On entry, if so->currPos.buf is valid the buffer is pinned but not locked;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index ab19692..b51365a 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -288,6 +288,8 @@ static void _bt_sortaddtup(Page page, Size itemsize,
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 						 IndexTuple itup);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void _bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+								 BTDedupState *dedupState);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
 static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -830,6 +832,8 @@ _bt_sortaddtup(Page page,
  * the high key is to be truncated, offset 1 is deleted, and we insert
  * the truncated high key at offset 1.
  *
+ * Note that itup may be a posting list tuple.
+ *
  * 'last' pointer indicates the last offset added to the page.
  *----------
  */
@@ -963,6 +967,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * Overwrite the old item with new truncated high key directly.
 			 * oitup is already located at the physical beginning of tuple
 			 * space, so this should directly reuse the existing tuple space.
+			 *
+			 * If lastleft tuple was a posting tuple, we'll truncate its
+			 * posting list in _bt_truncate as well. Note that it is also
+			 * applicable only to leaf pages, since internal pages never
+			 * contain posting tuples.
 			 */
 			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
 			lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1002,6 +1011,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		 * the minimum key for the new page.
 		 */
 		state->btps_minkey = CopyIndexTuple(oitup);
+		Assert(BTreeTupleIsPivot(state->btps_minkey));
 
 		/*
 		 * Set the sibling links for both pages.
@@ -1043,6 +1053,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(state->btps_minkey == NULL);
 		state->btps_minkey = CopyIndexTuple(itup);
 		/* _bt_sortaddtup() will perform full truncation later */
+		BTreeTupleClearBtIsPosting(state->btps_minkey);
 		BTreeTupleSetNAtts(state->btps_minkey, 0);
 	}
 
@@ -1128,6 +1139,40 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 }
 
 /*
+ * Add new tuple (posting or non-posting) to the page while building index.
+ */
+static void
+_bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+					 BTDedupState *dedupState)
+{
+	IndexTuple	to_insert;
+
+	/* Return, if there is no tuple to insert */
+	if (state == NULL)
+		return;
+
+	if (dedupState->ntuples == 0)
+		to_insert = dedupState->itupprev;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(dedupState->itupprev,
+											 dedupState->ipd,
+											 dedupState->ntuples);
+		to_insert = postingtuple;
+		pfree(dedupState->ipd);
+	}
+
+	_bt_buildadd(wstate, state, to_insert);
+
+	if (dedupState->ntuples > 0)
+		pfree(to_insert);
+	dedupState->ntuples = 0;
+}
+
+/*
  * Read tuples in correct sort order from tuplesort, and load them into
  * btree leaves.
  */
@@ -1141,9 +1186,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	bool		load1;
 	TupleDesc	tupdes = RelationGetDescr(wstate->index);
 	int			i,
-				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+				natts = IndexRelationGetNumberOfAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		deduplicate = false;
+	BTDedupState *dedupState = NULL;
+
+	/*
+	 * Don't use deduplication for indexes with INCLUDEd columns and unique
+	 * indexes
+	 */
+	deduplicate = (IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+				   IndexRelationGetNumberOfAttributes(wstate->index) &&
+				   !wstate->index->rd_index->indisunique);
 
 	if (merge)
 	{
@@ -1257,19 +1313,89 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	}
 	else
 	{
-		/* merge is unnecessary */
-		while ((itup = tuplesort_getindextuple(btspool->sortstate,
-											   true)) != NULL)
+		if (!deduplicate)
 		{
-			/* When we see first tuple, create first index page */
-			if (state == NULL)
-				state = _bt_pagestate(wstate, 0);
+			/* merge is unnecessary */
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+					state = _bt_pagestate(wstate, 0);
 
-			_bt_buildadd(wstate, state, itup);
+				_bt_buildadd(wstate, state, itup);
 
-			/* Report progress */
-			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
-										 ++tuples_done);
+				/* Report progress */
+				pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+											 ++tuples_done);
+			}
+		}
+		else
+		{
+			/* init deduplication state needed to build posting tuples */
+			dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+			dedupState->ipd = NULL;
+			dedupState->ntuples = 0;
+			dedupState->alltupsize = 0;
+			dedupState->itupprev = NULL;
+			dedupState->maxitemsize = 0;
+			dedupState->maxpostingsize = 0;
+
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+				{
+					state = _bt_pagestate(wstate, 0);
+					dedupState->maxitemsize = BTMaxItemSize(state->btps_page);
+				}
+
+				if (dedupState->itupprev != NULL)
+				{
+					int			n_equal_atts = _bt_keep_natts_fast(wstate->index,
+																   dedupState->itupprev, itup);
+
+					if (n_equal_atts > natts)
+					{
+						/*
+						 * Tuples are equal. Create or update posting.
+						 *
+						 * Else If posting is too big, insert it on page and
+						 * continue.
+						 */
+						if ((dedupState->ntuples + 1) * sizeof(ItemPointerData) <
+							dedupState->maxpostingsize)
+							_bt_stash_item_tid(dedupState, itup, InvalidOffsetNumber);
+						else
+							_bt_buildadd_posting(wstate, state, dedupState);
+					}
+					else
+					{
+						/*
+						 * Tuples are not equal. Insert itupprev into index.
+						 * Save current tuple for the next iteration.
+						 */
+						_bt_buildadd_posting(wstate, state, dedupState);
+					}
+				}
+
+				/*
+				 * Save the tuple to compare it with the next one and maybe
+				 * unite them into a posting tuple.
+				 */
+				if (dedupState->itupprev)
+					pfree(dedupState->itupprev);
+				dedupState->itupprev = CopyIndexTuple(itup);
+
+				/* compute max size of posting list */
+				dedupState->maxpostingsize = dedupState->maxitemsize -
+					IndexInfoFindDataOffset(dedupState->itupprev->t_info) -
+					MAXALIGN(IndexTupleSize(dedupState->itupprev));
+			}
+
+			/* Handle the last item */
+			_bt_buildadd_posting(wstate, state, dedupState);
 		}
 	}
 
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1c1029b..54cecc8 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
 	state.minfirstrightsz = SIZE_MAX;
 	state.newitemoff = newitemoff;
 
+	/* newitem cannot be a posting list item */
+	Assert(!BTreeTupleIsPosting(newitem));
+
 	/*
 	 * maxsplits should never exceed maxoff because there will be at most as
 	 * many candidate split points as there are points _between_ tuples, once
@@ -459,17 +462,52 @@ _bt_recsplitloc(FindSplitData *state,
 	int16		leftfree,
 				rightfree;
 	Size		firstrightitemsz;
+	Size		postingsubhikey = 0;
 	bool		newitemisfirstonright;
 
 	/* Is the new item going to be the first item on the right page? */
 	newitemisfirstonright = (firstoldonright == state->newitemoff
 							 && !newitemonleft);
 
+	/*
+	 * FIXME: Accessing every single tuple like this adds cycles to cases that
+	 * cannot possibly benefit (i.e. cases where we know that there cannot be
+	 * posting lists).  Maybe we should add a way to not bother when we are
+	 * certain that this is the case.
+	 *
+	 * We could either have _bt_split() pass us a flag, or invent a page flag
+	 * that indicates that the page might have posting lists, as an
+	 * optimization.  There is no shortage of btpo_flags bits for stuff like
+	 * this.
+	 */
 	if (newitemisfirstonright)
+	{
 		firstrightitemsz = state->newitemsz;
+
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+			postingsubhikey = IndexTupleSize(state->newitem) -
+				BTreeTupleGetPostingOffset(state->newitem);
+	}
 	else
+	{
 		firstrightitemsz = firstoldonrightsz;
 
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf)
+		{
+			ItemId		itemid;
+			IndexTuple	newhighkey;
+
+			itemid = PageGetItemId(state->page, firstoldonright);
+			newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+			if (BTreeTupleIsPosting(newhighkey))
+				postingsubhikey = IndexTupleSize(newhighkey) -
+					BTreeTupleGetPostingOffset(newhighkey);
+		}
+	}
+
 	/* Account for all the old tuples */
 	leftfree = state->leftspace - olddataitemstoleft;
 	rightfree = state->rightspace -
@@ -492,9 +530,13 @@ _bt_recsplitloc(FindSplitData *state,
 	 * adding a heap TID to the left half's new high key when splitting at the
 	 * leaf level.  In practice the new high key will often be smaller and
 	 * will rarely be larger, but conservatively assume the worst case.
+	 * Truncation always truncates away any posting list that appears in the
+	 * first right tuple, though, so it's safe to subtract that overhead
+	 * (while still conservatively assuming that truncation might have to add
+	 * back a single heap TID using the pivot tuple heap TID representation).
 	 */
 	if (state->is_leaf)
-		leftfree -= (int16) (firstrightitemsz +
+		leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
 							 MAXALIGN(sizeof(ItemPointerData)));
 	else
 		leftfree -= (int16) firstrightitemsz;
@@ -691,7 +733,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
 	tup = (IndexTuple) PageGetItem(state->page, itemid);
 	/* Do cheaper test first */
-	if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+	if (BTreeTupleIsPosting(tup) ||
+		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
 	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index bc855dd..f7575ed 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -97,8 +97,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	indoption = rel->rd_indoption;
 	tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
 
-	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
 	/*
 	 * We'll execute search using scan key constructed on key columns.
 	 * Truncated attributes and non-key attributes are omitted from the final
@@ -110,9 +108,20 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	key->anynullkeys = false;	/* initial assumption */
 	key->nextkey = false;
 	key->pivotsearch = false;
+	key->scantid = NULL;
 	key->keysz = Min(indnkeyatts, tupnatts);
-	key->scantid = key->heapkeyspace && itup ?
-		BTreeTupleGetHeapTID(itup) : NULL;
+
+	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+	Assert(!itup || !BTreeTupleIsPosting(itup) || key->heapkeyspace);
+
+	/*
+	 * When caller passes a tuple with a heap TID, use it to set scantid. Note
+	 * that this handles posting list tuples by setting scantid to the lowest
+	 * heap TID in the posting list.
+	 */
+	if (itup && key->heapkeyspace)
+		key->scantid = BTreeTupleGetHeapTID(itup);
+
 	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
@@ -1386,6 +1395,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			continue;
 		}
 
@@ -1547,6 +1557,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			cmpresult = 0;
 			if (subkey->sk_flags & SK_ROW_END)
 				break;
@@ -1786,10 +1797,35 @@ _bt_killitems(IndexScanDesc scan)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
+			bool		killtuple = false;
+
+			if (BTreeTupleIsPosting(ituple))
+			{
+				int			pi = i + 1;
+				int			nposting = BTreeTupleGetNPosting(ituple);
+				int			j;
+
+				for (j = 0; j < nposting; j++)
+				{
+					ItemPointer item = BTreeTupleGetPostingN(ituple, j);
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+					if (!ItemPointerEquals(item, &kitem->heapTid))
+						break;	/* out of posting list loop */
+
+					/* Read-ahead to later kitems */
+					if (pi < numKilled)
+						kitem = &so->currPos.items[so->killedItems[pi++]];
+				}
+
+				if (j == nposting)
+					killtuple = true;
+			}
+			else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+				killtuple = true;
+
+			if (killtuple)
 			{
-				/* found the item */
+				/* found the item/all posting list items */
 				ItemIdMarkDead(iid);
 				killedsomething = true;
 				break;			/* out of inner search loop */
@@ -2140,6 +2176,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(firstright))
+		{
+			BTreeTupleClearBtIsPosting(pivot);
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			if (keepnatts == natts)
+			{
+				/*
+				 * index_truncate_tuple() just returned a copy of the
+				 * original, so make sure that the size of the new pivot tuple
+				 * doesn't have posting list overhead
+				 */
+				pivot->t_info &= ~INDEX_SIZE_MASK;
+				pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+			}
+		}
+
+		Assert(!BTreeTupleIsPosting(pivot));
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2156,6 +2210,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * attribute to the new pivot tuple.
 		 */
 		Assert(natts != nkeyatts);
+		Assert(!BTreeTupleIsPosting(lastleft) &&
+			   !BTreeTupleIsPosting(firstright));
 		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
 		tidpivot = palloc0(newsize);
 		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2163,6 +2219,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		pfree(pivot);
 		pivot = tidpivot;
 	}
+	else if (BTreeTupleIsPosting(firstright))
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal.  We
+		 * can always truncate away a posting list, though.
+		 *
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+			MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+		pivot->t_info &= ~INDEX_SIZE_MASK;
+		pivot->t_info |= newsize;
+		BTreeTupleClearBtIsPosting(pivot);
+		BTreeTupleSetAltHeapTID(pivot);
+	}
 	else
 	{
 		/*
@@ -2170,7 +2244,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * It's necessary to add a heap TID attribute to the new pivot tuple.
 		 */
 		Assert(natts == nkeyatts);
-		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+		newsize = MAXALIGN(IndexTupleSize(firstright)) +
+			MAXALIGN(sizeof(ItemPointerData));
 		pivot = palloc0(newsize);
 		memcpy(pivot, firstright, IndexTupleSize(firstright));
 	}
@@ -2188,6 +2263,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * nbtree (e.g., there is no pg_attribute entry).
 	 */
 	Assert(itup_key->heapkeyspace);
+	Assert(!BTreeTupleIsPosting(pivot));
 	pivot->t_info &= ~INDEX_SIZE_MASK;
 	pivot->t_info |= newsize;
 
@@ -2200,7 +2276,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2211,9 +2287,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+							  BTreeTupleGetHeapTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #else
 
 	/*
@@ -2226,7 +2305,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2235,7 +2314,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2316,15 +2396,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
  * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal (once detoasted).  Similarly, result may
- * differ from the _bt_keep_natts result when either tuple has TOASTed datums,
- * though this is barely possible in practice.
+ * unless they're bitwise equal after detoasting.
  *
  * These issues must be acceptable to callers, typically because they're only
  * concerned about making suffix truncation as effective as possible without
  * leaving excessive amounts of free space on either side of page split.
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts.
+ *
+ * When an index only uses opclasses where equality is "precise", this
+ * function is guaranteed to give the same result as _bt_keep_natts().  This
+ * makes it safe to use this function to determine whether or not two tuples
+ * can be folded together into a single posting tuple.  Posting list
+ * deduplication cannot be used with nondeterministic collations for this
+ * reason.
+ *
+ * FIXME: Actually invent the needed "equality-is-precise" opclass
+ * infrastructure.  See dedicated -hackers thread:
+ *
+ * https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2349,8 +2439,38 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 		if (isNull1 != isNull2)
 			break;
 
+		/*
+		 * XXX: The ideal outcome from the point of view of the posting list
+		 * patch is that the definition of an opclass with "precise equality"
+		 * becomes: "equality operator function must give exactly the same
+		 * answer as datum_image_eq() would, provided that we aren't using a
+		 * nondeterministic collation". (Nondeterministic collations are
+		 * clearly not compatible with deduplication.)
+		 *
+		 * This will be a lot faster than actually using the authoritative
+		 * insertion scankey in some cases.  This approach also seems more
+		 * elegant, since suffix truncation gets to follow exactly the same
+		 * definition of "equal" as posting list deduplication -- there is a
+		 * subtle interplay between deduplication and suffix truncation, and
+		 * it would be nice to know for sure that they have exactly the same
+		 * idea about what equality is.
+		 *
+		 * This ideal outcome still avoids problems with TOAST.  We cannot
+		 * repeat bugs like the amcheck bug that was fixed in bugfix commit
+		 * eba775345d23d2c999bbb412ae658b6dab36e3e8.  datum_image_eq()
+		 * considers binary equality, though only _after_ each datum is
+		 * decompressed.
+		 *
+		 * If this ideal solution isn't possible, then we can fall back on
+		 * defining "precise equality" as: "type's output function must
+		 * produce identical textual output for any two datums that compare
+		 * equal when using a safe/equality-is-precise operator class (unless
+		 * using a nondeterministic collation)".  That would mean that we'd
+		 * have to make deduplication call _bt_keep_natts() instead (or some
+		 * other function that uses authoritative insertion scankey).
+		 */
 		if (!isNull1 &&
-			!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+			!datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
 			break;
 
 		keepnatts++;
@@ -2402,22 +2522,30 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
+	/* !heapkeyspace indexes do not support deduplication */
+	if (!heapkeyspace && BTreeTupleIsPosting(itup))
+		return false;
+
+	/* INCLUDE indexes do not support deduplication */
+	if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+		return false;
+
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
 			/*
-			 * Non-pivot tuples currently never use alternative heap TID
-			 * representation -- even those within heapkeyspace indexes
+			 * Non-pivot tuple should never be explicitly marked as a pivot
+			 * tuple
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated.  (Note that tupnatts must have been
-			 * inferred, rather than coming from an explicit on-disk
-			 * representation.)
+			 * inferred, even with a posting list tuple, because only pivot
+			 * tuples store tupnatts directly.)
 			 */
 			return tupnatts == natts;
 		}
@@ -2461,12 +2589,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * non-zero, or when there is no explicit representation and the
 			 * tuple is evidently not a pre-pg_upgrade tuple.
 			 *
-			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
-			 * that to decide if the tuple is a pre-v11 tuple.
+			 * Prior to v11, downlinks always had P_HIKEY as their offset.
+			 * Accept that as an alternative indication of a valid
+			 * !heapkeyspace negative infinity tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
-				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+				ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
 		}
 		else
 		{
@@ -2492,7 +2620,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
+		return false;
+
+	/* Pivot tuple should not use posting list representation (redundant) */
+	if (BTreeTupleIsPosting(itup))
 		return false;
 
 	/*
@@ -2562,11 +2694,74 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					BTMaxItemSizeNoHeapTid(page),
 					RelationGetRelationName(rel)),
 			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
-					   ItemPointerGetBlockNumber(&newtup->t_tid),
-					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+					   ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
 					   RelationGetRelationName(heap)),
 			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
 					 "Consider a function index of an MD5 hash of the value, "
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Given a basic tuple that contains key datum and posting list,
+ * build a posting tuple.
+ *
+ * Basic tuple can be a posting tuple, but we only use key part of it,
+ * all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1 fallback to building a non-posting tuple.
+ * It is necessary to avoid storage overhead after posting tuple was vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+	uint32		keysize,
+				newsize = 0;
+	IndexTuple	itup;
+
+	/* We only need key part of the tuple */
+	if (BTreeTupleIsPosting(tuple))
+		keysize = BTreeTupleGetPostingOffset(tuple);
+	else
+		keysize = IndexTupleSize(tuple);
+
+	Assert(nipd > 0);
+
+	/* Add space needed for posting list */
+	if (nipd > 1)
+		newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+	else
+		newsize = keysize;
+
+	newsize = MAXALIGN(newsize);
+	itup = palloc0(newsize);
+	memcpy(itup, tuple, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+
+	if (nipd > 1)
+	{
+		/* Form posting tuple, fill posting fields */
+
+		/* Set meta info about the posting list */
+		itup->t_info |= INDEX_ALT_TID_MASK;
+		BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+		/* sort the list to preserve TID order invariant */
+		qsort((void *) ipd, nipd, sizeof(ItemPointerData),
+			  (int (*) (const void *, const void *)) ItemPointerCompare);
+
+		/* Copy posting list into the posting tuple */
+		memcpy(BTreeTupleGetPosting(itup), ipd,
+			   sizeof(ItemPointerData) * nipd);
+	}
+	else
+	{
+		/* To finish building of a non-posting tuple, copy TID from ipd */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		ItemPointerCopy(ipd, &itup->t_tid);
+	}
+
+	return itup;
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c..5eace6e 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -181,9 +181,35 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 
 		page = BufferGetPage(buffer);
 
-		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
-						false, false) == InvalidOffsetNumber)
-			elog(PANIC, "btree_xlog_insert: failed to add item");
+		if (xlrec->in_posting_offset != InvalidOffsetNumber)
+		{
+			/* oposting must be at offset before new item */
+			ItemId		itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+			IndexTuple oposting = (IndexTuple) PageGetItem(page, itemid);
+			IndexTuple newitem = (IndexTuple) datapos;
+			IndexTuple nposting;
+
+			nposting = _bt_form_newposting(newitem, oposting,
+										   xlrec->in_posting_offset);
+			Assert(isleaf);
+
+			Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+				   MAXALIGN(IndexTupleSize(nposting)));
+
+			/* replace existing posting */
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+			/* insert new item */
+			if (PageAddItem(page, (Item) newitem, MAXALIGN(IndexTupleSize(newitem)),
+							xlrec->offnum, false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+		}
+		else
+		{
+			if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+		}
 
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
@@ -265,20 +291,45 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 		BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
 		OffsetNumber off;
 		IndexTuple	newitem = NULL,
-					left_hikey = NULL;
+					left_hikey = NULL,
+					nposting = NULL;
 		Size		newitemsz = 0,
 					left_hikeysz = 0;
 		Page		newlpage;
-		OffsetNumber leftoff;
+		OffsetNumber leftoff,
+					 replacepostingoff = InvalidOffsetNumber;
 
 		datapos = XLogRecGetBlockData(record, 0, &datalen);
 
-		if (onleft)
+		if (onleft || xlrec->in_posting_offset)
 		{
 			newitem = (IndexTuple) datapos;
 			newitemsz = MAXALIGN(IndexTupleSize(newitem));
 			datapos += newitemsz;
 			datalen -= newitemsz;
+
+			/*
+			 * Repeat logic implemented in _bt_insertonpg():
+			 *
+			 * If the new tuple is a duplicate with a heap TID that falls
+			 * inside the range of an existing posting list tuple,
+			 * generate new posting tuple to replace original one
+			 * and update new tuple so that it's heap TID contains
+			 * the rightmost heap TID of original posting tuple.
+			 */
+			if (xlrec->in_posting_offset != 0)
+			{
+				ItemId		itemid = PageGetItemId(lpage, OffsetNumberPrev(xlrec->newitemoff));
+				IndexTuple oposting = (IndexTuple) PageGetItem(lpage, itemid);
+
+				nposting = _bt_form_newposting(newitem, oposting,
+											xlrec->in_posting_offset);
+
+				/* Alter new item offset, since effective new item changed */
+				replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+				Assert(BTreeTupleGetNPosting(nposting) == BTreeTupleGetNPosting(oposting));
+			}
 		}
 
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
@@ -304,6 +355,15 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 			Size		itemsz;
 			IndexTuple	item;
 
+			if (off == replacepostingoff)
+			{
+				if (PageAddItem(newlpage, (Item) nposting, MAXALIGN(IndexTupleSize(nposting)),
+								leftoff, false, false) == InvalidOffsetNumber)
+					elog(ERROR, "failed to add new item to left page after split");
+				leftoff = OffsetNumberNext(leftoff);
+				continue;
+			}
+
 			/* add the new item if it was inserted on left page */
 			if (onleft && off == xlrec->newitemoff)
 			{
@@ -380,14 +440,147 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 }
 
 static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		buf;
+	Page		newpage;
+	xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+
+	if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Initialize a temporary empty page and copy all the items
+		 * to that in item number order.
+		 */
+		Page		page = (Page) BufferGetPage(buf);
+		BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		BTPageOpaque nopaque;
+		OffsetNumber offnum, minoff, maxoff;
+		BTDedupState *dedupState = NULL;
+		char *data = ((char *) xlrec + SizeOfBtreeDedup);
+		dedupInterval dedup_intervals[MaxOffsetNumber];
+		int			 nth_interval = 0;
+		OffsetNumber n_dedup_tups = 0;
+
+		dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+		dedupState->ipd = NULL;
+		dedupState->ntuples = 0;
+		dedupState->itupprev = NULL;
+		dedupState->maxitemsize = BTMaxItemSize(page);
+		dedupState->maxpostingsize = 0;
+
+		memcpy(dedup_intervals, data,
+			   xlrec->n_intervals*sizeof(dedupInterval));
+
+		/* Scan over all items to see which ones can be deduplicated */
+		minoff = P_FIRSTDATAKEY(opaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+		newpage = PageGetTempPageCopySpecial(page);
+		nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+		/* Make sure that new page won't have garbage flag set */
+		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+		/* Copy High Key if any */
+		if (!P_RIGHTMOST(opaque))
+		{
+			ItemId		itemid = PageGetItemId(page, P_HIKEY);
+			Size		itemsz = ItemIdGetLength(itemid);
+			IndexTuple	item = (IndexTuple) PageGetItem(page, itemid);
+
+			if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+							false, false) == InvalidOffsetNumber)
+				elog(ERROR, "failed to add highkey during deduplication");
+		}
+
+		/*
+		* Iterate over tuples on the page to deduplicate them into posting
+		* lists and insert into new page
+		*/
+		for (offnum = minoff;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemId = PageGetItemId(page, offnum);
+			IndexTuple	itup = (IndexTuple) PageGetItem(page, itemId);
+
+			elog(DEBUG4, "btree_xlog_dedup. offnum %u, n_intervals %u, from %u ntups %u",
+						offnum,
+						nth_interval,
+						dedup_intervals[nth_interval].from,
+						dedup_intervals[nth_interval].ntups);
+
+			if (dedupState->itupprev == NULL)
+			{
+				/* Just set up base/first item in first iteration */
+				Assert(offnum == minoff);
+				dedupState->itupprev = CopyIndexTuple(itup);
+				dedupState->itupprev_off = offnum;
+				continue;
+			}
+
+			/*
+			 * Instead of comparing tuple's keys, which may be costly, use
+			 * information from xlog record. If current tuple belongs to the
+			 * group of deduplicated items, repeat logic of _bt_dedup_one_page
+			 * and stash it to form a posting list afterwards.
+			 */
+			if (nth_interval < xlrec->n_intervals &&
+				dedupState->itupprev_off >= dedup_intervals[nth_interval].from
+				&& n_dedup_tups < dedup_intervals[nth_interval].ntups)
+			{
+				_bt_stash_item_tid(dedupState, itup, InvalidOffsetNumber);
+
+				elog(DEBUG4, "btree_xlog_dedup. stash offnum %u, nth_interval %u, from %u ntups %u",
+						offnum,
+						nth_interval,
+						dedup_intervals[nth_interval].from,
+						dedup_intervals[nth_interval].ntups);
+
+				/* count first tuple in the group */
+				if (dedupState->itupprev_off == dedup_intervals[nth_interval].from)
+					n_dedup_tups++;
+
+				/* count added tuple */
+				n_dedup_tups++;
+			}
+			else
+			{
+				_bt_dedup_insert(newpage, dedupState);
+
+				/* reset state */
+				if (n_dedup_tups > 0)
+					nth_interval++;
+				n_dedup_tups = 0;
+			}
+
+			pfree(dedupState->itupprev);
+			dedupState->itupprev = CopyIndexTuple(itup);
+			dedupState->itupprev_off = offnum;
+		}
+
+		/* Handle the last item */
+		_bt_dedup_insert(newpage, dedupState);
+
+		PageRestoreTempPage(newpage, page);
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buf);
+	}
+
+	if (BufferIsValid(buf))
+		UnlockReleaseBuffer(buf);
+}
+
+static void
 btree_xlog_vacuum(XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	Buffer		buffer;
 	Page		page;
 	BTPageOpaque opaque;
-#ifdef UNUSED
 	xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
 
 	/*
 	 * This section of code is thought to be no longer needed, after analysis
@@ -478,14 +671,34 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		if (len > 0)
 		{
-			OffsetNumber *unused;
-			OffsetNumber *unend;
+			if (xlrec->nremaining)
+			{
+				OffsetNumber *remainingoffset;
+				IndexTuple	remaining;
+				Size		itemsz;
+
+				remainingoffset = (OffsetNumber *)
+					(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+				remaining = (IndexTuple) ((char *) remainingoffset +
+										  xlrec->nremaining * sizeof(OffsetNumber));
+
+				/* Handle posting tuples */
+				for (int i = 0; i < xlrec->nremaining; i++)
+				{
+					PageIndexTupleDelete(page, remainingoffset[i]);
 
-			unused = (OffsetNumber *) ptr;
-			unend = (OffsetNumber *) ((char *) ptr + len);
+					itemsz = MAXALIGN(IndexTupleSize(remaining));
 
-			if ((unend - unused) > 0)
-				PageIndexMultiDelete(page, unused, unend - unused);
+					if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+									false, false) == InvalidOffsetNumber)
+						elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+					remaining = (IndexTuple) ((char *) remaining + itemsz);
+				}
+			}
+
+			if (xlrec->ndeleted)
+				PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 		}
 
 		/*
@@ -838,6 +1051,9 @@ btree_redo(XLogReaderState *record)
 		case XLOG_BTREE_SPLIT_R:
 			btree_xlog_split(false, record);
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			btree_xlog_dedup(record);
+			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
 			break;
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 4ee6d04..7351cad 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -30,7 +30,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_insert *xlrec = (xl_btree_insert *) rec;
 
-				appendStringInfo(buf, "off %u", xlrec->offnum);
+				appendStringInfo(buf, "off %u; in_posting_offset %u",
+								 xlrec->offnum, xlrec->in_posting_offset);
 				break;
 			}
 		case XLOG_BTREE_SPLIT_L:
@@ -38,16 +39,29 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
-				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d",
-								 xlrec->level, xlrec->firstright, xlrec->newitemoff);
+				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, in_posting_offset %d",
+								 xlrec->level,
+								 xlrec->firstright,
+								 xlrec->newitemoff,
+								 xlrec->in_posting_offset);
+				break;
+			}
+		case XLOG_BTREE_DEDUP_PAGE:
+			{
+				xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+				appendStringInfo(buf, "items were deduplicated to %d items",
+								 xlrec->n_intervals);
 				break;
 			}
 		case XLOG_BTREE_VACUUM:
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "lastBlockVacuumed %u",
-								 xlrec->lastBlockVacuumed);
+				appendStringInfo(buf, "lastBlockVacuumed %u; nremaining %u; ndeleted %u",
+								 xlrec->lastBlockVacuumed,
+								 xlrec->nremaining,
+								 xlrec->ndeleted);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
@@ -131,6 +145,9 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			id = "DEDUPLICATE";
+			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4a80e84..adf52c9 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
  * All other types of index tuples ("pivot" tuples) only have key columns,
  * since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,38 @@ typedef struct BTMetaPageData
  * omitted rather than truncated, since its representation is different to
  * the non-pivot representation.)
  *
+ * Non-pivot posting tuple format:
+ *  t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicated keys more effectively, we use special format
+ * of tuples - posting tuples.  posting_list is an array of ItemPointerData.
+ *
+ * Deduplication never applies to unique indexes or indexes with INCLUDEd
+ * columns.
+ *
+ * To differ posting tuples we use INDEX_ALT_TID_MASK flag in t_info and
+ * BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's tid:
+ * - t_tid.ip_blkid contains offset of the posting list.
+ * - t_tid offset field contains number of posting items this tuple contain
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * tuples, which is constrainted by BTMaxItemSize.
+
+ * If page contains so many duplicates, that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), page may contain several posting
+ * tuples with the same key.
+ * Also page can contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
  * Pivot tuple format:
  *
  *  t_tid | t_info | key values | [heap TID]
@@ -281,23 +312,146 @@ typedef struct BTMetaPageData
  * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
  * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
  * number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinct posting tuples from pivot tuples.
  *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_N_POSTING_OFFSET_MASK	0x0FFF
 #define BT_HEAP_TID_ATTR			0x1000
+#define BT_IS_POSTING				0x2000
+
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger then
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+	((int) ((BLCKSZ - SizeOfPageHeaderData - \
+			3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+			(sizeof(ItemPointerData)))
+
+/*
+ * Helper for BTDedupState.
+ * Each entry represents a group of 'ntups' consecutive items starting on
+ * 'from' offset that were deduplicated into a single posting tuple.
+ */
+typedef struct dedupInterval
+{
+	OffsetNumber from;
+	OffsetNumber ntups;
+} dedupInterval;
+
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * Iterating over tuples during index build or applying deduplication to a
+ * single page, we remember a tuple in itupprev, then compare the next one
+ * with it.  If tuples are equal, save their TIDs in the posting list.
+ * ntuples contains the size of the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTDedupState
+{
+	Size		maxitemsize;
+	Size		maxpostingsize;
+	IndexTuple	itupprev;
+
+	/*
+	 * array with info about deduplicated items on the page.
+	 *
+	 * It contains one entry for each group of consecutive items that
+	 * were deduplicated into a single posting tuple.
+	 *
+	 * This array is saved to xlog entry, which allows to replay
+	 * deduplication faster without actually comparing tuple's keys.
+	 */
+	dedupInterval dedup_intervals[MaxOffsetNumber];
+	/* current number of items in dedup_intervals array */
+	int			n_intervals;
+	/* temp state variable to keep a 'possible' start of dedup interval */
+	OffsetNumber itupprev_off;
+
+	int			ntuples;
+	Size		alltupsize;
+	ItemPointerData *ipd;
+} BTDedupState;
+
+/*
+ * N.B.: BTreeTupleIsPivot() should only be used in code that deals with
+ * heapkeyspace indexes specifically.  BTreeTupleIsPosting() works with all
+ * nbtree indexes, though.
+ */
+#define BTreeTupleIsPivot(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+	)
+#define BTreeTupleIsPosting(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+	)
 
-/* Get/set downlink block number */
+#define BTreeTupleClearBtIsPosting(itup) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleGetNPosting(itup)	\
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+	)
+#define BTreeTupleSetNPosting(itup, n) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+	} while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+	)
+#define BTreeSetPostingMeta(itup, nposting, off) \
+	do { \
+		BTreeTupleSetNPosting(itup, nposting); \
+		Assert(BTreeTupleIsPosting(itup)); \
+		ItemPointerSetBlockNumber(&((itup)->t_tid), (off)); \
+	} while(0)
+
+#define BTreeTupleGetPosting(itup) \
+	(ItemPointer) ((char*) (itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+	(BTreeTupleGetPosting(itup) + (n))
+
+/* Get/set downlink block number  */
 #define BTreeInnerTupleGetDownLink(itup) \
 	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
 #define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,40 +480,73 @@ typedef struct BTMetaPageData
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
 		( \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
 		IndexRelationGetNumberOfAttributes(rel) \
 	)
-#define BTreeTupleSetNAtts(itup, n) \
-	do { \
-		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
-	} while(0)
+
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int n)
+{
+	Assert(!BTreeTupleIsPosting(itup));
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	ItemPointerSetOffsetNumber(&itup->t_tid, n & BT_N_KEYS_OFFSET_MASK);
+}
 
 /*
- * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Get tiebreaker heap TID attribute, if any.  Works with both pivot and
+ * non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
  */
-#define BTreeTupleGetHeapTID(itup) \
-	( \
-	  (itup)->t_info & INDEX_ALT_TID_MASK && \
-	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
-	  ( \
-		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
-					   sizeof(ItemPointerData)) \
-	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
-	)
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+	if (BTreeTupleIsPivot(itup))
+	{
+		/* Pivot tuple heap TID representation? */
+		if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+			 BT_HEAP_TID_ATTR) != 0)
+			return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+								  sizeof(ItemPointerData));
+
+		/* Heap TID attribute was truncated */
+		return NULL;
+	}
+	else if (BTreeTupleIsPosting(itup))
+		return BTreeTupleGetPosting(itup);
+
+	return &(itup->t_tid);
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list tuple.  Works with
+ * non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxTID(IndexTuple itup)
+{
+	Assert(!BTreeTupleIsPivot(itup));
+
+	if (BTreeTupleIsPosting(itup))
+		return (ItemPointer) (BTreeTupleGetPosting(itup) +
+							  (BTreeTupleGetNPosting(itup) - 1));
+
+	return &(itup->t_tid);
+}
+
 /*
  * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * representation
  */
 #define BTreeTupleSetAltHeapTID(itup) \
 	do { \
-		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(BTreeTupleIsPivot(itup)); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
 								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
 	} while(0)
@@ -500,6 +687,13 @@ typedef struct BTInsertStateData
 	Buffer		buf;
 
 	/*
+	 * if _bt_binsrch_insert() found the location inside existing posting
+	 * list, save the position inside the list.  This will be -1 in rare cases
+	 * where the overlapping posting list is LP_DEAD.
+	 */
+	int			in_posting_offset;
+
+	/*
 	 * Cache of bounds within the current buffer.  Only used for insertions
 	 * where _bt_check_unique is called.  See _bt_binsrch_insert and
 	 * _bt_findinsertloc for details.
@@ -534,7 +728,9 @@ typedef BTInsertStateData *BTInsertState;
  * If we are doing an index-only scan, we save the entire IndexTuple for each
  * matched item, otherwise only its heap TID and offset.  The IndexTuples go
  * into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array.  Posting list tuples store a version of the
+ * tuple that does not include the posting list, allowing the same key to be
+ * returned for each logical tuple associated with the posting list.
  */
 
 typedef struct BTScanPosItem	/* what we remember about each match */
@@ -563,9 +759,13 @@ typedef struct BTScanPosData
 
 	/*
 	 * If we are doing an index-only scan, nextTupleOffset is the first free
-	 * location in the associated tuple storage workspace.
+	 * location in the associated tuple storage workspace.  Posting list
+	 * tuples need postingTupleOffset to store the current location of the
+	 * tuple that is returned multiple times (once per heap TID in posting
+	 * list).
 	 */
 	int			nextTupleOffset;
+	int			postingTupleOffset;
 
 	/*
 	 * The items array is always ordered in index order (ie, increasing
@@ -578,7 +778,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxPostingIndexTuplesPerPage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -730,9 +930,13 @@ extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
  */
 extern bool _bt_doinsert(Relation rel, IndexTuple itup,
 						 IndexUniqueCheck checkUnique, Relation heapRel);
-extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
 extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
-
+extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
+extern IndexTuple _bt_form_newposting(IndexTuple itup, IndexTuple oposting,
+				   OffsetNumber in_posting_offset);
+extern Size _bt_dedup_insert(Page page, BTDedupState *dedupState);
+extern void _bt_stash_item_tid(BTDedupState *dedupState, IndexTuple itup,
+							   OffsetNumber itup_offnum);
 /*
  * prototypes for functions in nbtsplitloc.c
  */
@@ -762,6 +966,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems, Relation heapRel);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems,
+								OffsetNumber *remainingoffset,
+								IndexTuple *remaining, int nremaining,
 								BlockNumber lastBlockVacuumed);
 extern int	_bt_pagedel(Relation rel, Buffer buf);
 
@@ -812,6 +1018,8 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd,
+										int nipd);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 91b9ee0..7d41adc 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP_PAGE	0x50	/* compactify tuples on the page */
+/* 0x60 is unused */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -61,16 +62,21 @@ typedef struct xl_btree_metadata
  * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
  * Note that INSERT_META implies it's not a leaf page.
  *
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page (data contains the inserted tuple);
+ *				 if in_posting_offset is valid, this is an insertion
+ *				 into existing posting tuple at offnum.
+ *				 redo must repeat logic of bt_insertonpg().
  * Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
  * Backup Blk 2: xl_btree_metadata, if INSERT_META
+ *
  */
 typedef struct xl_btree_insert
 {
 	OffsetNumber offnum;
+	OffsetNumber in_posting_offset;
 } xl_btree_insert;
 
-#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, in_posting_offset) + sizeof(OffsetNumber))
 
 /*
  * On insert with split, we save all the items going into the right sibling
@@ -95,6 +101,11 @@ typedef struct xl_btree_insert
  * An IndexTuple representing the high key of the left page must follow with
  * either variant.
  *
+ * In case, split included insertion into the middle of the posting tuple, and
+ * thus required posting tuple replacement, it also contains 'in_posting_offset',
+ * that is used to form replacing tuple and repean bt_insertonpg() logic.
+ * It is added to xlog only if replacing item remains on the left page.
+ *
  * Backup Blk 1: new right page
  *
  * The right page's data portion contains the right page's tuples in the form
@@ -112,9 +123,26 @@ typedef struct xl_btree_split
 	uint32		level;			/* tree level of page being split */
 	OffsetNumber firstright;	/* first item moved to right page */
 	OffsetNumber newitemoff;	/* new item's offset (useful for _L variant) */
+	OffsetNumber in_posting_offset; /* offset inside posting tuple  */
 } xl_btree_split;
 
-#define SizeOfBtreeSplit	(offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit	(offsetof(xl_btree_split, in_posting_offset) + sizeof(OffsetNumber))
+
+/*
+ * When page is deduplicated, consecutive groups of tuples with equal keys
+ * are compactified into posting tuples.
+ * The WAL record keeps number of resulting posting tuples - n_intervals
+ * followed by array of dedupInterval structures, that hold information
+ * needed to replay page deduplication without extra comparisons of tuples keys.
+ */
+typedef struct xl_btree_dedup
+{
+	int			n_intervals;
+
+	/* TARGET DEDUP INTERVALS FOLLOW AT THE END */
+} xl_btree_dedup;
+#define SizeOfBtreeDedup (sizeof(int))
+
 
 /*
  * This is what we need to know about delete of individual leaf index tuples.
@@ -172,10 +200,19 @@ typedef struct xl_btree_vacuum
 {
 	BlockNumber lastBlockVacuumed;
 
-	/* TARGET OFFSET NUMBERS FOLLOW */
+	/*
+	 * This field helps us to find beginning of the remaining tuples from
+	 * postings which follow array of offset numbers.
+	 */
+	uint32		nremaining;
+	uint32		ndeleted;
+
+	/* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+	/* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
+	/* TARGET OFFSET NUMBERS FOLLOW (if any) */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(BlockNumber))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.
diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index ec47a22..71a03e3 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -212,3 +212,24 @@
    Memcheck:Cond
    fun:PyObject_Realloc
 }
+
+# Temporarily work around bug in datum_image_eq's handling of the cstring
+# (typLen == -2) case.  datumIsEqual() is not affected, but also doesn't handle
+# TOAST'ed values correctly.
+#
+# FIXME: Remove both suppressions when bug is fixed on master branch
+{
+   temporary_workaround_1
+   Memcheck:Addr1
+   fun:bcmp
+   fun:datum_image_eq
+   fun:_bt_keep_natts_fast
+}
+
+{
+   temporary_workaround_8
+   Memcheck:Addr8
+   fun:bcmp
+   fun:datum_image_eq
+   fun:_bt_keep_natts_fast
+}

#90

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Anastasia Lubennikova (#89)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Tue, Sep 17, 2019 at 9:43 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

3. Third, there is incremental writing of the page itself -- avoiding
using a temp buffer. Not sure where I stand on this.

I think it's a good idea. memmove must be much faster than copying
items tuple by tuple.
I'll send next patch by the end of the week.

I think that the biggest problem is that we copy all of the tuples,
including existing posting list tuples that can't be merged with
anything. Even if you assume that we'll never finish early (e.g. by
using logic like the "if (pagesaving >= newitemsz) deduplicate =
false;" thing), this can still unnecessarily slow down deduplication.
Very often, _bt_dedup_one_page() is called when 1/2 - 2/3 of the
space on the page is already used by a small number of very large
posting list tuples.

The loop within _bt_dedup_one_page() is very confusing in both v13 and
v14 -- I couldn't figure out why the accounting worked like this:

I'll look at it.

I'm currently working on merging my refactored version of
_bt_dedup_one_page() with your v15 WAL-logging. This is a bit tricky.
(I have finished merging the other WAL-logging stuff, though -- that
was easy.)

The general idea is that the loop in _bt_dedup_one_page() now
explicitly operates with a "base" tuple, rather than *always* saving
the prev tuple from the last loop iteration. We always have a "pending
posting list", which won't be written-out as a posting list if it
isn't possible to merge at least one existing page item. The "base"
tuple doesn't change. "pagesaving" space accounting works in a way
that doesn't care about whether or not the base tuple was already a
posting list -- it saves the size of the IndexTuple without any
existing posting list size, and calculates the contribution to the
total size of the new posting list separately (heap TIDs from the
original base tuple and subsequent tuples are counted together).

This has a number of advantages:

* The loop is a lot clearer now, and seems to have slightly better
space utilization because of improved accounting (with or without the
"if (pagesaving >= newitemsz) deduplicate = false;" thing).

* I think that we're going to need to be disciplined about which tuple
is the "base" tuple for correctness reasons -- we should always use
the leftmost existing tuple to form a new posting list tuple. I am
concerned about rare cases where we deduplicate tuples that are equal
according to _bt_keep_natts_fast()/datum_image_eq() that nonetheless
have different sizes (and are not bitwise equal). There are rare cases
involving TOAST compression where that is just about possible (see the
temp comments I added to _bt_keep_natts_fast() in the patch).

* It's clearly faster, because there is far less palloc() overhead --
the "land" unlogged table test completes in about 95.5% of the time
taken by v15 (I disabled "if (pagesaving >= newitemsz) deduplicate =
false;" for both versions here, to keep it simple and fair).

This also suggests that making _bt_dedup_one_page() do raw page adds
and page deletes to the page in shared_buffers (i.e. don't use a temp
buffer page) could pay off. As I went into at the start of this
e-mail, unnecessarily doing expensive things like copying large
posting lists around is a real concern. Even if it isn't truly useful
for _bt_dedup_one_page() to operate in a very incremental fashion,
incrementalism is probably still a good thing to aim for -- it seems
to make deduplication faster in all cases.

--
Peter Geoghegan

#91

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Peter Geoghegan (#90)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Wed, Sep 18, 2019 at 10:43 AM Peter Geoghegan <pg@bowt.ie> wrote:

This also suggests that making _bt_dedup_one_page() do raw page adds
and page deletes to the page in shared_buffers (i.e. don't use a temp
buffer page) could pay off. As I went into at the start of this
e-mail, unnecessarily doing expensive things like copying large
posting lists around is a real concern. Even if it isn't truly useful
for _bt_dedup_one_page() to operate in a very incremental fashion,
incrementalism is probably still a good thing to aim for -- it seems
to make deduplication faster in all cases.

I think that I forgot to mention that I am concerned that the
kill_prior_tuple/LP_DEAD optimization could be applied less often
because _bt_dedup_one_page() operates too aggressively. That is a big
part of my general concern.

Maybe I'm wrong about this -- who knows? I definitely think that
LP_DEAD setting by _bt_check_unique() is generally a lot more
important than LP_DEAD setting by the kill_prior_tuple optimization,
and the patch won't affect unique indexes. Only very serious
benchmarking can give us a clear answer, though.

--
Peter Geoghegan

#92

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Peter Geoghegan (#90)

1 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Wed, Sep 18, 2019 at 10:43 AM Peter Geoghegan <pg@bowt.ie> wrote:

I'm currently working on merging my refactored version of
_bt_dedup_one_page() with your v15 WAL-logging. This is a bit tricky.
(I have finished merging the other WAL-logging stuff, though -- that
was easy.)

I attach version 16. This revision merges your recent work on WAL
logging with my recent work on simplifying _bt_dedup_one_page(). See
my e-mail from earlier today for details.

Hopefully this will be a bit easier to work with when you go to make
_bt_dedup_one_page() do raw PageIndexMultiDelete() + PageAddItem()
calls against the page contained in a buffer directly (rather than
using a temp version of the page in local memory in the style of
_bt_split()). I find the loop within _bt_dedup_one_page() much easier
to follow now.

While I'm looking forward to seeing the
PageIndexMultiDelete()/PageAddItem() approach that you come up with,
the basic design of _bt_dedup_one_page() seems to be in much better
shape today than it was a few weeks ago. I am going to spend the next
few days teaching _bt_dedup_one_page() about space utilization. I'll
probably make it respect a fillfactor-style target. I've noticed that
it is often too aggressive about filling a page, though less often it
actually shows the opposite problem: it fails to use more than about
2/3 of the page for the same value, again and again (must be something
to do with the exact width of the tuples). In general,
_bt_dedup_one_page() should know a few things about what nbtsplitloc.c
will do when the page is very likely to be split soon.

I'll also spend some more time working on the opclass infrastructure
that we need to disable deduplication with datatypes where it is
unsafe [1]/messages/by-id/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com -- Peter Geoghegan.

Other changes:

* qsort() is no longer used by BTreeFormPostingTuple() in v16 -- we
can easily sorting the array of heap TIDs the caller's responsibility.
Since the heap TID column is sorted in ascending order among
duplicates on a page, and since TIDs within individual posting lists
are also sorted in ascending order, there is no need to resort. I
added a new assertion to BTreeFormPostingTuple() that verifies that
its caller actually gets it right.

* The new nbtpage.c/VACUUM code has been tweaked to minimize the
changes required against master. Nothing significant, though.

It was easier to refactor the _bt_dedup_one_page() stuff by
temporarily making nbtsort.c not use it. I didn't want to delay
getting v16 to you, so I didn't take the time to fix-up nbtsort.c to
use the new stuff. It's actually using its own old copy of stuff that
it should get from nbtinsert.c in v16 -- it calls
_bt_dedup_item_tid_sort(), not the new _bt_dedup_save_htid() function.
I'll update it soon, though.

[1]: /messages/by-id/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com -- Peter Geoghegan
--
Peter Geoghegan

Attachments:

v16-0001-Add-deduplication-to-nbtree.patchapplication/octet-stream; name=v16-0001-Add-deduplication-to-nbtree.patchDownload

From 45931efca014c9550d06a208574d9e508c85800b Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 29 Aug 2019 14:35:35 -0700
Subject: [PATCH v16 1/2] Add deduplication to nbtree.

---
 contrib/amcheck/verify_nbtree.c         | 164 +++++-
 src/backend/access/index/genam.c        |   4 +
 src/backend/access/nbtree/README        |  74 ++-
 src/backend/access/nbtree/nbtinsert.c   | 741 +++++++++++++++++++++++-
 src/backend/access/nbtree/nbtpage.c     | 148 ++++-
 src/backend/access/nbtree/nbtree.c      | 128 +++-
 src/backend/access/nbtree/nbtsearch.c   | 243 +++++++-
 src/backend/access/nbtree/nbtsort.c     | 231 +++++++-
 src/backend/access/nbtree/nbtsplitloc.c |  47 +-
 src/backend/access/nbtree/nbtutils.c    | 264 ++++++++-
 src/backend/access/nbtree/nbtxlog.c     | 249 +++++++-
 src/backend/access/rmgrdesc/nbtdesc.c   |  26 +-
 src/include/access/nbtree.h             | 281 ++++++++-
 src/include/access/nbtxlog.h            |  55 +-
 src/tools/valgrind.supp                 |  21 +
 15 files changed, 2505 insertions(+), 171 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d678ed..83519cb7cf 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -145,6 +145,7 @@ static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 									  bool tupleIsAlive, void *checkstate);
 static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 									 IndexTuple itup);
+static inline IndexTuple bt_posting_logical_tuple(IndexTuple itup, int n);
 static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 											   OffsetNumber offset);
@@ -419,12 +420,13 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		/*
 		 * Size Bloom filter based on estimated number of tuples in index,
 		 * while conservatively assuming that each block must contain at least
-		 * MaxIndexTuplesPerPage / 5 non-pivot tuples.  (Non-leaf pages cannot
-		 * contain non-pivot tuples.  That's okay because they generally make
-		 * up no more than about 1% of all pages in the index.)
+		 * MaxPostingIndexTuplesPerPage / 3 "logical" tuples.  heapallindexed
+		 * verification fingerprints posting list heap TIDs as plain non-pivot
+		 * tuples, complete with index keys.  This allows its heap scan to
+		 * behave as if posting lists do not exist.
 		 */
 		total_pages = RelationGetNumberOfBlocks(rel);
-		total_elems = Max(total_pages * (MaxIndexTuplesPerPage / 5),
+		total_elems = Max(total_pages * (MaxPostingIndexTuplesPerPage / 3),
 						  (int64) state->rel->rd_rel->reltuples);
 		/* Random seed relies on backend srandom() call to avoid repetition */
 		seed = random();
@@ -924,6 +926,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer	scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -994,29 +997,73 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
 			char	   *itid,
 					   *htid;
+			ItemPointer tid = BTreeTupleGetHeapTID(itup);
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+							ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("could not find tuple using search from root page in index \"%s\"",
 							RelationGetRelationName(state->rel)),
-					 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+					 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
 										itid, htid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is actually a posting list, make sure posting list TIDs
+		 * are in order.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer		current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+				current = BTreeTupleGetPostingN(itup, i);
+
+				if (ItemPointerCompare(current, &last) <= 0)
+				{
+					char	   *itid,
+							   *htid;
+
+					itid = psprintf("(%u,%u)", state->targetblock, offset);
+					htid = psprintf("(%u,%u)",
+									ItemPointerGetBlockNumberNoCheck(current),
+									ItemPointerGetOffsetNumberNoCheck(current));
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+												itid, htid,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1074,12 +1121,32 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				/* Fingerprint all elements as distinct "logical" tuples */
+				for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					IndexTuple	logtuple;
+
+					logtuple = bt_posting_logical_tuple(itup, i);
+					norm = bt_normalize_tuple(state, logtuple);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != logtuple)
+						pfree(norm);
+					pfree(logtuple);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
@@ -1087,7 +1154,8 @@ bt_target_page_check(BtreeCheckState *state)
 		 *
 		 * If there is a high key (if this is not the rightmost page on its
 		 * entire level), check that high key actually is upper bound on all
-		 * page items.
+		 * page items.  If this is a posting list tuple, we'll need to set
+		 * scantid to be highest TID in posting list.
 		 *
 		 * We prefer to check all items against high key rather than checking
 		 * just the last and trusting that the operator class obeys the
@@ -1127,6 +1195,9 @@ bt_target_page_check(BtreeCheckState *state)
 		 * tuple. (See also: "Notes About Data Representation" in the nbtree
 		 * README.)
 		 */
+		scantid = skey->scantid;
+		if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+			skey->scantid = BTreeTupleGetMaxTID(itup);
 		if (!P_RIGHTMOST(topaque) &&
 			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
 			  invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1221,7 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
+		skey->scantid = scantid;
 
 		/*
 		 * * Item order check *
@@ -1164,11 +1236,13 @@ bt_target_page_check(BtreeCheckState *state)
 					   *htid,
 					   *nitid,
 					   *nhtid;
+			ItemPointer tid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetHeapTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 			nitid = psprintf("(%u,%u)", state->targetblock,
 							 OffsetNumberNext(offset));
 
@@ -1177,9 +1251,11 @@ bt_target_page_check(BtreeCheckState *state)
 										  state->target,
 										  OffsetNumberNext(offset));
 			itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+			tid = BTreeTupleGetHeapTID(itup);
 			nhtid = psprintf("(%u,%u)",
-							 ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							 ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							 ItemPointerGetBlockNumberNoCheck(tid),
+							 ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1265,10 @@ bt_target_page_check(BtreeCheckState *state)
 										"higher index tid=%s (points to %s tid=%s) "
 										"page lsn=%X/%X.",
 										itid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										htid,
 										nitid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										nhtid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
@@ -1953,10 +2029,10 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
  * verification.  In particular, it won't try to normalize opclass-equal
  * datums with potentially distinct representations (e.g., btree/numeric_ops
  * index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though.  For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation.  Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
  */
 static IndexTuple
 bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -1969,6 +2045,9 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	IndexTuple	reformed;
 	int			i;
 
+	/* Caller should only pass "logical" non-pivot tuples here */
+	Assert(!BTreeTupleIsPosting(itup) && !BTreeTupleIsPivot(itup));
+
 	/* Easy case: It's immediately clear that tuple has no varlena datums */
 	if (!IndexTupleHasVarwidths(itup))
 		return itup;
@@ -2031,6 +2110,30 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	return reformed;
 }
 
+/*
+ * Produce palloc()'d "logical" tuple for nth posting list entry.
+ *
+ * In general, deduplication is not supposed to change the logical contents of
+ * an index.  Multiple logical index tuples are folded together into one
+ * physical posting list index tuple when convenient.
+ *
+ * heapallindexed verification must normalize-away this variation in
+ * representation by converting posting list tuples into two or more "logical"
+ * tuples.  Each logical tuple must be fingerprinted separately -- there must
+ * be one logical tuple for each corresponding Bloom filter probe during the
+ * heap scan.
+ *
+ * Note: Caller needs to call bt_normalize_tuple() with returned tuple.
+ */
+static inline IndexTuple
+bt_posting_logical_tuple(IndexTuple itup, int n)
+{
+	Assert(BTreeTupleIsPosting(itup));
+
+	/* Returns non-posting-list tuple */
+	return BTreeFormPostingTuple(itup, BTreeTupleGetPostingN(itup, n), 1);
+}
+
 /*
  * Search for itup in index, starting from fast root page.  itup must be a
  * non-pivot tuple.  This is only supported with heapkeyspace indexes, since
@@ -2087,6 +2190,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		insertstate.itup = itup;
 		insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 		insertstate.itup_key = key;
+		insertstate.in_posting_offset = 0;
 		insertstate.bounds_valid = false;
 		insertstate.buf = lbuf;
 
@@ -2094,7 +2198,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		offnum = _bt_binsrch_insert(state->rel, &insertstate);
 		/* Compare first >= matching item on leaf page, if any */
 		page = BufferGetPage(lbuf);
+		/* Should match on first heap TID when tuple has a posting list */
 		if (offnum <= PageGetMaxOffsetNumber(page) &&
+			insertstate.in_posting_offset <= 0 &&
 			_bt_compare(state->rel, key, page, offnum) == 0)
 			exists = true;
 		_bt_relbuf(state->rel, lbuf);
@@ -2560,14 +2666,18 @@ static inline ItemPointer
 BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
 							bool nonpivot)
 {
-	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	ItemPointer result;
 	BlockNumber targetblock = state->targetblock;
 
-	if (result == NULL && nonpivot)
+	/* Shouldn't be called with heapkeyspace index */
+	Assert(state->heapkeyspace);
+	if (BTreeTupleIsPivot(itup) == nonpivot)
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
 						targetblock, RelationGetRelationName(state->rel))));
 
+	result = BTreeTupleGetHeapTID(itup);
+
 	return result;
 }
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 2599b5d342..6e1dc596e1 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -276,6 +276,10 @@ BuildIndexValueDescription(Relation indexRelation,
 /*
  * Get the latestRemovedXid from the table entries pointed at by the index
  * tuples being deleted.
+ *
+ * Note: index access methods that don't consistently use the standard
+ * IndexTuple + heap TID item pointer representation will need to provide
+ * their own version of this function.
  */
 TransactionId
 index_compute_xid_horizon_for_tuples(Relation irel,
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6db203e75c..54cb9db49d 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
 like a hint bit for a heap tuple), but physically removing tuples requires
 exclusive lock.  In the current code we try to remove LP_DEAD tuples when
 we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already).  Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)
 
 This leaves the index in a state where it has no entry for a dead tuple
 that still exists in the heap.  This is not a problem for the current
@@ -710,6 +713,75 @@ the fallback strategy assumes that duplicates are mostly inserted in
 ascending heap TID order.  The page is split in a way that leaves the left
 half of the page mostly full, and the right half of the page mostly empty.
 
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits.  Deduplication alters
+the physical representation of tuples without changing the logical contents
+of the index, and without adding overhead to read queries.  Non-pivot
+tuples are folded together into a single physical tuple with a posting list
+(a simple array of heap TIDs with the standard item pointer format).
+Deduplication is always applied lazily, at the point where it would
+otherwise be necessary to perform a page split.  It occurs only when
+LP_DEAD items have been removed, as our last line of defense against
+splitting a leaf page.  We can set the LP_DEAD bit with posting list
+tuples, though only when all table tuples are known dead. (Bitmap scans
+cannot perform LP_DEAD bit setting, and are the common case with indexes
+that contain lots of duplicates, so this downside is considered
+acceptable.)
+
+Large groups of logical duplicates tend to appear together on the same leaf
+page due to the special duplicate logic used when choosing a split point.
+This facilitates lazy/dynamic deduplication.  Deduplication can reliably
+deduplicate a large localized group of duplicates before it can span
+multiple leaf pages.  Posting list tuples are subject to the same 1/3 of a
+page restriction as any other tuple.
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists.  A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple.  An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication.  Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple.  When the incoming
+tuple really does overlap with an existing posting list, a posting list
+split is performed.  Posting list splits work in a way that more or less
+preserves the illusion that all incoming tuples do not need to be merged
+with any existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list.  The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list.  The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list.  We make
+space in the posting list by removing the heap TID that became the new
+item.  The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists.  Also, overall
+space utilization is improved by keeping existing posting lists large.
+
+The representation of posting lists is identical to the posting lists used
+by GIN, so it would be straightforward to apply GIN's varbyte encoding
+compression scheme to individual posting lists.  Posting list compression
+would break the assumptions made by posting list splits about page space
+accounting, though, so it's not clear how compression could be integrated
+with nbtree.  Besides, posting list compression does not offer a compelling
+trade-off for nbtree, since in general nbtree is optimized for consistent
+performance with many concurrent readers and writers.  A major goal of
+nbtree's lazy approach to deduplication is to limit the performance impact
+of deduplication with random updates.  Even concurrent append-only inserts
+of the same key value will tend to have inserts of individual index tuples
+in an order that doesn't quite match heap TID order.  In general, delaying
+deduplication avoids many unnecessary posting list splits, and minimizes
+page level fragmentation.
+
 Notes About Data Representation
 -------------------------------
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b84bf1c3df..710c8d5cd5 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,21 +47,26 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   BTStack stack,
 						   IndexTuple itup,
 						   OffsetNumber newitemoff,
+						   int in_posting_offset,
 						   bool split_only_page);
 static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 						Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
-						IndexTuple newitem);
+						IndexTuple newitem, IndexTuple original_newitem,
+						IndexTuple nposting, OffsetNumber in_posting_offset);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 							  BTStack stack, bool is_root, bool is_only);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 						 OffsetNumber itup_off);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+							   Size newitemsz);
 
 /*
  *	_bt_doinsert() -- Handle insertion of a single index tuple in the tree.
  *
  *		This routine is called by the public interface routine, btinsert.
- *		By here, itup is filled in, including the TID.
+ *		By here, itup is filled in, including the TID.  Caller should be
+ *		prepared for us to scribble on 'itup'.
  *
  *		If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
  *		will allow duplicates.  Otherwise (UNIQUE_CHECK_YES or
@@ -123,6 +128,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	/* PageAddItem will MAXALIGN(), but be consistent */
 	insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 	insertstate.itup_key = itup_key;
+	insertstate.in_posting_offset = 0;
 	insertstate.bounds_valid = false;
 	insertstate.buf = InvalidBuffer;
 
@@ -300,7 +306,7 @@ top:
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
 		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
-					   itup, newitemoff, false);
+					   itup, newitemoff, insertstate.in_posting_offset, false);
 	}
 	else
 	{
@@ -435,6 +441,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 
 				/* okay, we gotta fetch the heap tuple ... */
 				curitup = (IndexTuple) PageGetItem(page, curitemid);
+				Assert(!BTreeTupleIsPosting(curitup));
 				htid = curitup->t_tid;
 
 				/*
@@ -689,6 +696,7 @@ _bt_findinsertloc(Relation rel,
 	BTScanInsert itup_key = insertstate->itup_key;
 	Page		page = BufferGetPage(insertstate->buf);
 	BTPageOpaque lpageop;
+	OffsetNumber location;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -751,13 +759,23 @@ _bt_findinsertloc(Relation rel,
 
 		/*
 		 * If the target page is full, see if we can obtain enough space by
-		 * erasing LP_DEAD items
+		 * erasing LP_DEAD items.  If that doesn't work out, and if the index
+		 * isn't a unique index, try deduplication.
 		 */
-		if (PageGetFreeSpace(page) < insertstate->itemsz &&
-			P_HAS_GARBAGE(lpageop))
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
 		{
-			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
-			insertstate->bounds_valid = false;
+			if (P_HAS_GARBAGE(lpageop))
+			{
+				_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+				insertstate->bounds_valid = false;
+			}
+
+			if (!checkingunique && PageGetFreeSpace(page) < insertstate->itemsz)
+			{
+				_bt_dedup_one_page(rel, insertstate->buf, heapRel,
+								   insertstate->itemsz);
+				insertstate->bounds_valid = false;	/* paranoia */
+			}
 		}
 	}
 	else
@@ -839,7 +857,31 @@ _bt_findinsertloc(Relation rel,
 	Assert(P_RIGHTMOST(lpageop) ||
 		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
 
-	return _bt_binsrch_insert(rel, insertstate);
+	location = _bt_binsrch_insert(rel, insertstate);
+
+	/*
+	 * Insertion is not prepared for the case where an LP_DEAD posting list
+	 * tuple must be split.  In the unlikely event that this happens, call
+	 * _bt_dedup_one_page() to force it to kill all LP_DEAD items.
+	 */
+	if (unlikely(insertstate->in_posting_offset == -1))
+	{
+		_bt_dedup_one_page(rel, insertstate->buf, heapRel, 0);
+		Assert(!P_HAS_GARBAGE(lpageop));
+
+		/* Must reset insertstate ahead of new _bt_binsrch_insert() call */
+		insertstate->bounds_valid = false;
+		insertstate->in_posting_offset = 0;
+		location = _bt_binsrch_insert(rel, insertstate);
+
+		/*
+		 * Might still have to split some other posting list now, but that
+		 * should never be LP_DEAD
+		 */
+		Assert(insertstate->in_posting_offset >= 0);
+	}
+
+	return location;
 }
 
 /*
@@ -900,15 +942,74 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
 	insertstate->bounds_valid = false;
 }
 
+/*
+ * Form a new posting list during a posting split.
+ *
+ * If caller determines that its new tuple 'itup' is a duplicate with a heap
+ * TID that falls inside the range of an existing posting list tuple
+ * 'oposting', it must generate a new posting tuple to replace the original.
+ * It must also change newitem to have the heap TID of the rightmost TID in
+ * the original posting list.
+ *
+ * Note that the WAL-logging considerations for posting list splits are
+ * complicated by the need to WAL-log the original newitem passed here instead
+ * of the effective/final newitem actually inserted on the page.  This routine
+ * is used during recovery to avoid naively WAL-logging posting list returned
+ * here, which is often much larger than the typical newitem.
+ */
+IndexTuple
+_bt_posting_split(IndexTuple newitem, IndexTuple oposting,
+				  OffsetNumber in_posting_offset)
+{
+	int			nhtids;
+	char	   *replacepos;
+	char	   *rightpos;
+	Size		nbytes;
+	IndexTuple	nposting;
+
+	Assert(BTreeTupleIsPosting(oposting));
+	nhtids = BTreeTupleGetNPosting(oposting);
+	Assert(in_posting_offset < nhtids);
+
+	nposting = CopyIndexTuple(oposting);
+	replacepos = (char *) BTreeTupleGetPostingN(nposting, in_posting_offset);
+	rightpos = replacepos + sizeof(ItemPointerData);
+	nbytes = (nhtids - in_posting_offset - 1) * sizeof(ItemPointerData);
+
+	/*
+	 * Move item pointers in posting list to make a gap for the new item's
+	 * heap TID (shift TIDs one place to the right, losing original rightmost
+	 * TID).
+	 */
+	memmove(rightpos, replacepos, nbytes);
+
+	/*
+	 * Fill the gap with the TID of the new item.
+	 */
+	ItemPointerCopy(&newitem->t_tid, (ItemPointer) replacepos);
+
+	/*
+	 * Copy original (not new original) posting list's last TID into new item
+	 */
+	ItemPointerCopy(BTreeTupleGetPostingN(oposting, nhtids - 1),
+					&newitem->t_tid);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxTID(nposting),
+							  BTreeTupleGetHeapTID(newitem)) < 0);
+
+	return nposting;
+}
+
 /*----------
  *	_bt_insertonpg() -- Insert a tuple on a particular page in the index.
  *
  *		This recursive procedure does the following things:
  *
+ *			+  if necessary, splits an existing posting list on page.
+ *			   This is only needed when 'in_posting_offset' is non-zero.
  *			+  if necessary, splits the target page, using 'itup_key' for
  *			   suffix truncation on leaf pages (caller passes NULL for
  *			   non-leaf pages).
- *			+  inserts the tuple.
+ *			+  inserts the new tuple (could be from split posting list).
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
  *			   right using information stored in the parent stack).
@@ -918,7 +1019,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		On entry, we must have the correct buffer in which to do the
  *		insertion, and the buffer must be pinned and write-locked.  On return,
- *		we will have dropped both the pin and the lock on the buffer.
+ *		we will have dropped both the pin and the lock on the buffer.  Caller
+ *		should be prepared for us to scribble on 'itup'.
  *
  *		This routine only performs retail tuple insertions.  'itup' should
  *		always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +1038,15 @@ _bt_insertonpg(Relation rel,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
+			   int in_posting_offset,
 			   bool split_only_page)
 {
 	Page		page;
 	BTPageOpaque lpageop;
 	Size		itemsz;
+	IndexTuple	nposting = NULL;
+	IndexTuple	oposting;
+	IndexTuple	original_itup = NULL;
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1060,8 @@ _bt_insertonpg(Relation rel,
 	Assert(P_ISLEAF(lpageop) ||
 		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
+	/* retail insertions of posting list tuples are disallowed */
+	Assert(!BTreeTupleIsPosting(itup));
 
 	/* The caller should've finished any incomplete splits already. */
 	if (P_INCOMPLETE_SPLIT(lpageop))
@@ -964,6 +1072,46 @@ _bt_insertonpg(Relation rel,
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
+	/*
+	 * Do we need to split an existing posting list item?
+	 */
+	if (in_posting_offset != 0)
+	{
+		ItemId		itemid = PageGetItemId(page, newitemoff);
+
+		/*
+		 * The new tuple is a duplicate with a heap TID that falls inside the
+		 * range of an existing posting list tuple, so split posting list.
+		 *
+		 * Posting list splits always replace some existing TID in the posting
+		 * list with the new item's heap TID (based on a posting list offset
+		 * from caller) by removing rightmost heap TID from posting list.  The
+		 * new item's heap TID is swapped with that rightmost heap TID, almost
+		 * as if the tuple inserted never overlapped with a posting list in
+		 * the first place.  This allows the insertion and page split code to
+		 * have minimal special case handling of posting lists.
+		 *
+		 * The only extra handling required is to overwrite the original
+		 * posting list with nposting, which is guaranteed to be the same size
+		 * as the original, keeping the page space accounting simple.  This
+		 * takes place in either the page insert or page split critical
+		 * section.
+		 */
+		Assert(P_ISLEAF(lpageop));
+		Assert(!ItemIdIsDead(itemid));
+		Assert(in_posting_offset > 0);
+		oposting = (IndexTuple) PageGetItem(page, itemid);
+
+		/* save a copy of itup with unchanged TID to write it into xlog record */
+		original_itup = CopyIndexTuple(itup);
+		nposting = _bt_posting_split(itup, oposting, in_posting_offset);
+
+		Assert(BTreeTupleGetNPosting(nposting) ==
+			   BTreeTupleGetNPosting(oposting));
+		/* Alter new item offset, since effective new item changed */
+		newitemoff = OffsetNumberNext(newitemoff);
+	}
+
 	/*
 	 * Do we need to split the page to fit the item on it?
 	 *
@@ -996,7 +1144,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+						 original_itup, nposting, in_posting_offset);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1075,6 +1224,18 @@ _bt_insertonpg(Relation rel,
 			elog(PANIC, "failed to add new item to block %u in index \"%s\"",
 				 itup_blkno, RelationGetRelationName(rel));
 
+		if (nposting)
+		{
+			/*
+			 * Posting list split requires an in-place update of the existing
+			 * posting list
+			 */
+			Assert(P_ISLEAF(lpageop));
+			Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+				   MAXALIGN(IndexTupleSize(nposting)));
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+		}
+
 		MarkBufferDirty(buf);
 
 		if (BufferIsValid(metabuf))
@@ -1116,6 +1277,7 @@ _bt_insertonpg(Relation rel,
 			XLogRecPtr	recptr;
 
 			xlrec.offnum = itup_off;
+			xlrec.in_posting_offset = in_posting_offset;
 
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
@@ -1152,7 +1314,19 @@ _bt_insertonpg(Relation rel,
 			}
 
 			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
-			XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+
+			/*
+			 * We always write newitem to the page, but when there is an
+			 * original newitem due to a posting list split then we log the
+			 * original item instead.  REDO routine must reconstruct the final
+			 * newitem at the same time it reconstructs nposting.
+			 */
+			if (!original_itup)
+				XLogRegisterBufData(0, (char *) itup,
+									IndexTupleSize(itup));
+			else
+				XLogRegisterBufData(0, (char *) original_itup,
+									IndexTupleSize(original_itup));
 
 			recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
@@ -1194,6 +1368,13 @@ _bt_insertonpg(Relation rel,
 			_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
 			RelationSetTargetBlock(rel, cachedBlock);
 	}
+
+	/* be tidy */
+	if (nposting)
+		pfree(nposting);
+	if (original_itup)
+		pfree(original_itup);
+
 }
 
 /*
@@ -1211,10 +1392,19 @@ _bt_insertonpg(Relation rel,
  *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
+ *
+ *		original_newitem, nposting, and in_posting_offset are needed for
+ *		posting list splits that happen to result in a page split.
+ *		nposting is a replacement tuple for the posting list tuple at the
+ *		offset immediately before the new item's offset.  This is needed
+ *		when caller performed "posting list split", and corresponds to the
+ *		same step for retail insertions that don't split the page.
  */
 static Buffer
 _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+		  IndexTuple original_newitem, IndexTuple nposting,
+		  OffsetNumber in_posting_offset)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1236,12 +1426,20 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	OffsetNumber firstright;
 	OffsetNumber maxoff;
 	OffsetNumber i;
+	OffsetNumber replacepostingoff = InvalidOffsetNumber;
 	bool		newitemonleft,
 				isleaf;
 	IndexTuple	lefthikey;
 	int			indnatts = IndexRelationGetNumberOfAttributes(rel);
 	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
+	/*
+	 * Determine offset number of posting list that will be updated in place
+	 * as part of split that follows a posting list split
+	 */
+	if (nposting != NULL)
+		replacepostingoff = OffsetNumberPrev(newitemoff);
+
 	/*
 	 * origpage is the original page to be split.  leftpage is a temporary
 	 * buffer that receives the left-sibling data, which will be copied back
@@ -1273,6 +1471,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	 * newitemoff == firstright.  In all other cases it's clear which side of
 	 * the split every tuple goes on from context.  newitemonleft is usually
 	 * (but not always) redundant information.
+	 *
+	 * Note: In theory, the split point choice logic should operate against a
+	 * version of the page that already replaced the posting list at offset
+	 * replacepostingoff with nposting where applicable.  We don't bother with
+	 * that, though.  Both versions of the posting list must be the same size
+	 * and have the same key values, so this omission can't affect the split
+	 * point chosen in practice.
 	 */
 	firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
 								  newitem, &newitemonleft);
@@ -1340,6 +1545,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, firstright);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		/* Behave as if origpage posting list has already been swapped */
+		if (firstright == replacepostingoff)
+			item = nposting;
 	}
 
 	/*
@@ -1373,6 +1581,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
 			itemid = PageGetItemId(origpage, lastleftoff);
 			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+			/* Behave as if origpage posting list has already been swapped */
+			if (lastleftoff == replacepostingoff)
+				lastleft = nposting;
 		}
 
 		Assert(lastleft != item);
@@ -1480,8 +1691,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 
+		/*
+		 * did caller pass new replacement posting list tuple due to posting
+		 * list split?
+		 */
+		if (i == replacepostingoff)
+		{
+			/*
+			 * swap origpage posting list with post-posting-list-split version
+			 * from caller
+			 */
+			Assert(isleaf);
+			Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+			item = nposting;
+		}
+
 		/* does new item belong before this one? */
-		if (i == newitemoff)
+		else if (i == newitemoff)
 		{
 			if (newitemonleft)
 			{
@@ -1653,6 +1879,29 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		xlrec.firstright = firstright;
 		xlrec.newitemoff = newitemoff;
 
+		/*
+		 * If the replacement posting list (and final newitem) go on the right
+		 * page then we don't need to explicitly WAL log it for the same
+		 * reason we don't log any kind of newitem when it goes on the right
+		 * page: it's included with all the other items on the right page
+		 * already.
+		 *
+		 * Otherwise, we set in_posting_offset in WAL record, and explicitly
+		 * log the original newitem (not the effective newitem).  This allows
+		 * REDO to reconstruct nposting by following essentially the same
+		 * procedure as our caller used.
+		 *
+		 * Note: It's possible that our split point makes the posting list
+		 * lastleft, and the rewritten newitem firstright.  That's okay, since
+		 * we'll log the original newitem either way. (Only the _final_
+		 * version of newitem is available to REDO as the first data item from
+		 * left page in this case, so explicitly logging the original newitem
+		 * only occurs when strictly necessary.)
+		 */
+		xlrec.in_posting_offset = InvalidOffsetNumber;
+		if (replacepostingoff < firstright)
+			xlrec.in_posting_offset = in_posting_offset;
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
 
@@ -1673,8 +1922,29 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		 * the left page.  We store the offset anyway, though, to support
 		 * archive compression of these records.
 		 */
-		if (newitemonleft)
-			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+		if (newitemonleft || xlrec.in_posting_offset != InvalidOffsetNumber)
+		{
+			if (xlrec.in_posting_offset == InvalidOffsetNumber)
+			{
+				/* simple, common case -- must WAL-log ordinary newitem */
+				Assert(newitemonleft);
+				Assert(nposting == NULL);
+				XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+			}
+			else
+			{
+				/*
+				 * REDO must reconstruct effective/final new item from
+				 * original newitem, while updating existing posting list
+				 * tuple that was split in place.  Log the original new item
+				 * instead of the final new item.
+				 */
+				Assert(ItemPointerCompare(&original_newitem->t_tid,
+										  &newitem->t_tid) != 0);
+				XLogRegisterBufData(0, (char *) original_newitem,
+									MAXALIGN(IndexTupleSize(original_newitem)));
+			}
+		}
 
 		/* Log the left page's new high key */
 		itemid = PageGetItemId(origpage, P_HIKEY);
@@ -1834,7 +2104,7 @@ _bt_insert_parent(Relation rel,
 
 		/* Recursively insert into the parent */
 		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
-					   new_item, stack->bts_offset + 1,
+					   new_item, stack->bts_offset + 1, 0,
 					   is_only);
 
 		/* be tidy */
@@ -2304,6 +2574,439 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * Note: if we didn't find any LP_DEAD items, then the page's
 	 * BTP_HAS_GARBAGE hint bit is falsely set.  We do not bother expending a
 	 * separate write to clear it, however.  We will clear it when we split
-	 * the page.
+	 * the page (or when deduplication runs).
 	 */
 }
+
+/*
+ * Try to deduplicate items to free some space.  If we don't proceed with
+ * deduplication, buffer will contain old state of the page.
+ *
+ * 'itemsz' is the size of the inserter caller's incoming/new tuple, not
+ * including line pointer overhead.  This is the amount of space we'll need to
+ * free in order to let caller avoid splitting the page.
+ *
+ * This function should be called after LP_DEAD items were removed by
+ * _bt_vacuum_one_page() to prevent a page split.  (It's possible that we'll
+ * have to kill additional LP_DEAD items, but that should be rare.)
+ */
+static void
+_bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+				   Size newitemsz)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buffer);
+	Page		newpage;
+	BTPageOpaque oopaque,
+				nopaque;
+	bool		deduplicate = false;
+	BTDedupState *state = NULL;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+	OffsetNumber deletable[MaxIndexTuplesPerPage];
+	int			ndeletable = 0;
+	Size		pagesaving = 0;
+
+	/*
+	 * Don't use deduplication for indexes with INCLUDEd columns and unique
+	 * indexes
+	 */
+	deduplicate = (IndexRelationGetNumberOfKeyAttributes(rel) ==
+				   IndexRelationGetNumberOfAttributes(rel) &&
+				   !rel->rd_index->indisunique);
+	if (!deduplicate)
+		return;
+
+	oopaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	/* init deduplication state needed to build posting tuples */
+	state = (BTDedupState *) palloc(sizeof(BTDedupState));
+
+	/* Convenience variables concerning generic limits */
+	state->maxitemsize = BTMaxItemSize(page);
+	state->maxpostingsize = 0;
+	/* Metadata about current pending posting list */
+	state->htids = NULL;
+	state->nhtids = 0;
+	state->nitems = 0;
+	state->alltupsize = 0;
+	/* Metadata about based tuple of current pending posting list */
+	state->base = NULL;
+	state->base_off = InvalidOffsetNumber;
+	state->basetupsize = 0;
+	/* Finally, n_intervals should be initialized to zero */
+	state->n_intervals = 0;
+
+	minoff = P_FIRSTDATAKEY(oopaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/*
+	 * Delete dead tuples if any. We cannot simply skip them in the cycle
+	 * below, because it's necessary to generate special Xlog record
+	 * containing such tuples to compute latestRemovedXid on a standby server
+	 * later.
+	 *
+	 * This should not affect performance, since it only can happen in a rare
+	 * situation when BTP_HAS_GARBAGE flag was not set and _bt_vacuum_one_page
+	 * was not called, or _bt_vacuum_one_page didn't remove all dead items.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+
+		if (ItemIdIsDead(itemid))
+			deletable[ndeletable++] = offnum;
+	}
+
+	if (ndeletable > 0)
+	{
+		/*
+		 * Skip duplication in rare cases where there were LP_DEAD items
+		 * encountered here when that frees sufficient space for caller to
+		 * avoid a page split
+		 */
+		_bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+		if (PageGetFreeSpace(page) >= newitemsz)
+		{
+			pfree(state);
+			return;
+		}
+
+		/* Continue with deduplication */
+		minoff = P_FIRSTDATAKEY(oopaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+	}
+
+	/*
+	 * Scan over all items to see which ones can be deduplicated
+	 */
+	newpage = PageGetTempPageCopySpecial(page);
+	nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+	/*
+	 * Copy the original page's LSN into newpage, which will become the
+	 * updated version of the page.  We need this because XLogInsert will
+	 * examine the LSN and possibly dump it in a page image.
+	 */
+	PageSetLSN(newpage, PageGetLSN(page));
+
+	/* Make sure that new page won't have garbage flag set */
+	nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+	/* Copy High Key if any */
+	if (!P_RIGHTMOST(oopaque))
+	{
+		ItemId		hitemid = PageGetItemId(page, P_HIKEY);
+		Size		hitemsz = ItemIdGetLength(hitemid);
+		IndexTuple	hitem = (IndexTuple) PageGetItem(page, hitemid);
+
+		if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to add highkey during deduplication");
+	}
+
+	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+	newitemsz += sizeof(ItemIdData);
+	/* Conservatively size array */
+	state->htids = palloc(state->maxitemsize);
+
+	/*
+	 * Iterate over tuples on the page, try to deduplicate them into posting
+	 * lists and insert into new page.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(!ItemIdIsDead(itemid));
+
+		if (offnum == minoff)
+		{
+			/*
+			 * No previous/base tuple for first data item -- use first data
+			 * item as base tuple of first pending posting list
+			 */
+			_bt_dedup_start_pending(state, itup, offnum);
+		}
+		else if (deduplicate &&
+				 _bt_keep_natts_fast(rel, state->base, itup) > natts &&
+				 _bt_dedup_save_htid(state, itup))
+		{
+			/*
+			 * Tuple is equal to base tuple of pending posting list, and
+			 * merging itup into pending posting list won't exceed the
+			 * BTMaxItemSize() limit.  Heap TID(s) for itup have been saved in
+			 * state.  The next iteration will also end up here if it's
+			 * possible to merge the next tuple into the same pending posting
+			 * list.
+			 */
+		}
+		else
+		{
+			/*
+			 * Tuple is not equal to pending posting list tuple, or
+			 * BTMaxItemSize() limit was reached
+			 */
+			pagesaving += _bt_dedup_finish_pending(newpage, state);
+
+			/*
+			 * When we have deduplicated enough to avoid page split, don't
+			 * bother merging together existing tuples to create new posting
+			 * lists.
+			 *
+			 * Note: We deliberately add as many heap TIDs as possible to a
+			 * pending posting list by performing this check at this point
+			 * (just before a new pending posting lists is created).  It would
+			 * be possible to make the final new posting list for each
+			 * successful page deduplication operation as small as possible
+			 * while still avoiding a page split for caller.  We don't want to
+			 * repeatedly merge posting lists around the same range of heap
+			 * TIDs, though.
+			 *
+			 * (Besides, the total number of new posting lists created is the
+			 * cost that this check is supposed to minimize -- there is no
+			 * great reason to be concerned about the absolute number of
+			 * existing tuples that can be killed/replaced.)
+			 */
+#if 0
+			/* Actually, don't do that */
+			/* TODO: Make a final decision on this */
+			if (pagesaving >= newitemsz)
+				deduplicate = false;
+#endif
+
+			/* itup starts new pending posting list */
+			_bt_dedup_start_pending(state, itup, offnum);
+		}
+	}
+
+	/* Handle the last item */
+	pagesaving += _bt_dedup_finish_pending(newpage, state);
+
+	/*
+	 * If no items suitable for deduplication were found, newpage must be
+	 * exactly the same as the original page, so just return from function.
+	 */
+	if (state->n_intervals == 0)
+	{
+		pfree(newpage);
+		pfree(state->htids);
+		pfree(state);
+		return;
+	}
+
+	START_CRIT_SECTION();
+
+	PageRestoreTempPage(newpage, page);
+	MarkBufferDirty(buffer);
+
+	/* Log deduplicated items */
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+		xl_btree_dedup xlrec_dedup;
+
+		xlrec_dedup.n_intervals = state->n_intervals;
+
+		XLogBeginInsert();
+		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+		XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+		/* only save non-empthy part of the array */
+		if (state->n_intervals > 0)
+			XLogRegisterData((char *) state->dedup_intervals,
+							 state->n_intervals * sizeof(dedupInterval));
+
+		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP_PAGE);
+
+		PageSetLSN(page, recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	/* be tidy */
+	pfree(state->htids);
+	pfree(state);
+}
+
+/*
+ * Create a new pending posting list tuple based on caller's tuple.
+ *
+ * Every tuple on the page either becomes the base tuple for a posting list or
+ * gets merged with pending posting list at least once.  It remains to be seen
+ * whether or not it will actually be possible to merge together subsequent
+ * tuples on the page with this one, though.
+ *
+ * Exported for use by recovery.
+ */
+void
+_bt_dedup_start_pending(BTDedupState *state, IndexTuple base,
+						OffsetNumber base_off)
+{
+	Assert(state->nhtids == 0);
+	Assert(state->nitems == 0);
+
+	/*
+	 * Copy heap TIDs from new base tuple for new candidate posting list into
+	 * ipd array.  Assume that we'll eventually create a new posting tuple by
+	 * merging later tuples with this existing one, though we may not.
+	 */
+	if (!BTreeTupleIsPosting(base))
+	{
+		memcpy(state->htids, base, sizeof(ItemPointerData));
+		state->nhtids = 1;
+		/* Save size of tuple without any posting list */
+		state->basetupsize = IndexTupleSize(base);
+	}
+	else
+	{
+		int			nposting;
+
+		nposting = BTreeTupleGetNPosting(base);
+		memcpy(state->htids, BTreeTupleGetPosting(base),
+			   sizeof(ItemPointerData) * nposting);
+		state->nhtids = nposting;
+		/* Save size of tuple without any posting list */
+		state->basetupsize = BTreeTupleGetPostingOffset(base);
+	}
+
+	/*
+	 * Save new base tuple itself -- it'll be needed if we actually create a
+	 * new posting list from new pending posting list.
+	 *
+	 * Must maintain size of all tuples (including line pointer overhead) to
+	 * calculate space savings on page within _bt_dedup_finish_pending().
+	 * Also, save number of base tuple logical tuples so that we can save
+	 * cycles in the common case where an existing posting list can't or won't
+	 * be merged with other tuples on the page.
+	 */
+	state->nitems = 1;
+	state->base = base;
+	state->base_off = base_off;
+	state->alltupsize = MAXALIGN(IndexTupleSize(base)) + sizeof(ItemIdData);
+	/* Also save base_off in pending state for interval */
+	state->dedup_intervals[state->n_intervals].from = state->base_off;
+}
+
+/*
+ * Add new posting tuple item to the page based on base and the saved list of
+ * heap TIDs.
+ *
+ * Returns space saving from deduplicating to make a new posting list tuple.
+ * Note that this includes line pointer overhead.  This is zero in the case
+ * where no deduplication was possible.
+ *
+ * Exported for use by recovery.
+ */
+Size
+_bt_dedup_finish_pending(Page page, BTDedupState *state)
+{
+	IndexTuple	final;
+	Size		finalsz;
+	OffsetNumber finaloff;
+	Size		spacesaving;
+
+	Assert(state->nhtids > 0);
+	Assert(state->nitems >= 1);
+	Assert(state->nitems <= state->nhtids);
+	Assert(state->dedup_intervals[state->n_intervals].from == state->base_off);
+
+	if (state->nitems == 1)
+	{
+		/* Use original, unchanged base tuple */
+		final = state->base;
+		spacesaving = 0;
+		finalsz = IndexTupleSize(final);
+
+		/* Do not increment n_intervals -- skip WAL logging */
+	}
+	else
+	{
+		/* Form a tuple with a posting list */
+		final = BTreeFormPostingTuple(state->base, state->htids,
+									  state->nhtids);
+		finalsz = IndexTupleSize(final);
+		spacesaving = state->alltupsize - (finalsz + sizeof(ItemIdData));
+		/* Must have saved some space */
+		Assert(spacesaving > 0 && spacesaving < BLCKSZ);
+
+		/* Save final number of items for posting list */
+		state->dedup_intervals[state->n_intervals].nitems = state->nitems;
+
+		/* Advance to next candidate */
+		state->n_intervals++;
+	}
+
+
+	finaloff = OffsetNumberNext(PageGetMaxOffsetNumber(page));
+
+	Assert(finalsz == MAXALIGN(IndexTupleSize(final)));
+	Assert(finalsz <= state->maxitemsize);
+	if (PageAddItem(page, (Item) final, finalsz, finaloff, false,
+					false) == InvalidOffsetNumber)
+		elog(ERROR, "deduplication failed to add tuple to page");
+
+	if (final != state->base)
+		pfree(final);
+
+	/* Reset state for next pending posting list */
+	state->nhtids = 0;
+	state->nitems = 0;
+	state->alltupsize = 0;
+
+	return spacesaving;
+}
+
+/*
+ * If it's not possible to merge itup with pending posting list, returns
+ * false; caller should finish the pending posting list, and start a new one
+ * with itup as its base tuple.  Otherwise, saves itup's heap TID(s) to local
+ * state, guaranteeing that at least that many heap TIDs can be merged
+ * together later on, when the current pending posting list is finished.
+ */
+bool
+_bt_dedup_save_htid(BTDedupState *state, IndexTuple itup)
+{
+	int			nhtids;
+	ItemPointer htids;
+	Size		mergedtupsz;
+
+	if (!BTreeTupleIsPosting(itup))
+	{
+		nhtids = 1;
+		htids = &itup->t_tid;
+	}
+	else
+	{
+		nhtids = BTreeTupleGetNPosting(itup);
+		htids = BTreeTupleGetPosting(itup);
+	}
+
+	/*
+	 * Don't append (have caller finish pending posting list as-is) if
+	 * appending heap TID(s) from itup would put us over limit
+	 */
+	mergedtupsz = MAXALIGN(state->basetupsize +
+						   (state->nhtids + nhtids) *
+						   sizeof(ItemPointerData));
+
+	if (mergedtupsz > state->maxitemsize)
+		return false;
+
+	/*
+	 * Save heap TIDs to pending posting list tuple -- itup can be merged into
+	 * pending posting list
+	 */
+	state->nitems++;
+	memcpy(state->htids + state->nhtids, htids,
+		   sizeof(ItemPointerData) * nhtids);
+	state->nhtids += nhtids;
+	state->alltupsize += MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
+
+	return true;
+}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 268f869a36..648825e895 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -42,6 +43,11 @@ static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
 								   BlockNumber *target, BlockNumber *rightsib);
 static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
 							   TransactionId latestRemovedXid);
+static TransactionId _bt_compute_xid_horizon_for_tuples(Relation rel,
+														Relation heapRel,
+														Buffer buf,
+														OffsetNumber *itemnos,
+														int nitems);
 
 /*
  *	_bt_initmetapage() -- Fill a page buffer with a correct metapage image
@@ -983,14 +989,52 @@ _bt_page_recyclable(Page page)
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
 					OffsetNumber *itemnos, int nitems,
+					OffsetNumber *updateitemnos,
+					IndexTuple *updated, int nupdatable,
 					BlockNumber lastBlockVacuumed)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	Size		itemsz;
+	Size		updated_sz = 0;
+	char	   *updated_buf = NULL;
+
+	/* XLOG stuff, buffer for updateds */
+	if (nupdatable > 0 && RelationNeedsWAL(rel))
+	{
+		Size		offset = 0;
+
+		for (int i = 0; i < nupdatable; i++)
+			updated_sz += MAXALIGN(IndexTupleSize(updated[i]));
+
+		updated_buf = palloc0(updated_sz);
+		for (int i = 0; i < nupdatable; i++)
+		{
+			itemsz = IndexTupleSize(updated[i]);
+			memcpy(updated_buf + offset, (char *) updated[i], itemsz);
+			offset += MAXALIGN(itemsz);
+		}
+		Assert(offset == updated_sz);
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
+	/* Handle posting tuples here */
+	for (int i = 0; i < nupdatable; i++)
+	{
+		/* At first, delete the old tuple. */
+		PageIndexTupleDelete(page, updateitemnos[i]);
+
+		itemsz = IndexTupleSize(updated[i]);
+		itemsz = MAXALIGN(itemsz);
+
+		/* Add tuple with updated ItemPointers to the page. */
+		if (PageAddItem(page, (Item) updated[i], itemsz, updateitemnos[i],
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to rewrite posting list item in index while doing vacuum");
+	}
+
 	/* Fix the page */
 	if (nitems > 0)
 		PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1064,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+		xlrec_vacuum.nupdated = nupdatable;
+		xlrec_vacuum.ndeleted = nitems;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1079,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		if (nitems > 0)
 			XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
 
+		/*
+		 * Here we should save offnums and updated tuples themselves. It's
+		 * important to restore them in correct order. At first, we must
+		 * handle updated tuples and only after that other deleted items.
+		 */
+		if (nupdatable > 0)
+		{
+			Assert(updated_buf != NULL);
+			XLogRegisterBufData(0, (char *) updateitemnos,
+								nupdatable * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, updated_buf, updated_sz);
+		}
+
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
 		PageSetLSN(page, recptr);
@@ -1041,6 +1100,91 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	END_CRIT_SECTION();
 }
 
+/*
+ * Get the latestRemovedXid from the table entries pointed at by the index
+ * tuples being deleted.
+ *
+ * This is a version of index_compute_xid_horizon_for_tuples() specialized to
+ * nbtree, which can handle posting lists.
+ */
+static TransactionId
+_bt_compute_xid_horizon_for_tuples(Relation rel, Relation heapRel,
+								   Buffer buf, OffsetNumber *itemnos,
+								   int nitems)
+{
+	ItemPointerData *ttids;
+	TransactionId latestRemovedXid = InvalidTransactionId;
+	Page		page = BufferGetPage(buf);
+	int			arraynitems;
+	int			finalnitems;
+
+	/*
+	 * Initial size of array can fit everything when it turns out that are no
+	 * posting lists
+	 */
+	arraynitems = nitems;
+	ttids = (ItemPointerData *) palloc(sizeof(ItemPointerData) * arraynitems);
+
+	finalnitems = 0;
+	/* identify what the index tuples about to be deleted point to */
+	for (int i = 0; i < nitems; i++)
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+
+		itemid = PageGetItemId(page, itemnos[i]);
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(ItemIdIsDead(itemid));
+
+		if (!BTreeTupleIsPosting(itup))
+		{
+			/* Make sure that we have space for additional heap TID */
+			if (finalnitems + 1 > arraynitems)
+			{
+				arraynitems = arraynitems * 2;
+				ttids = (ItemPointerData *)
+					repalloc(ttids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			Assert(ItemPointerIsValid(&itup->t_tid));
+			ItemPointerCopy(&itup->t_tid, &ttids[finalnitems]);
+			finalnitems++;
+		}
+		else
+		{
+			int			nposting = BTreeTupleGetNPosting(itup);
+
+			/* Make sure that we have space for additional heap TIDs */
+			if (finalnitems + nposting > arraynitems)
+			{
+				arraynitems = Max(arraynitems * 2, finalnitems + nposting);
+				ttids = (ItemPointerData *)
+					repalloc(ttids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			for (int j = 0; j < nposting; j++)
+			{
+				ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+				Assert(ItemPointerIsValid(htid));
+				ItemPointerCopy(htid, &ttids[finalnitems]);
+				finalnitems++;
+			}
+		}
+	}
+
+	Assert(finalnitems >= nitems);
+
+	/* determine the actual xid horizon */
+	latestRemovedXid =
+		table_compute_xid_horizon_for_tuples(heapRel, ttids, finalnitems);
+
+	pfree(ttids);
+
+	return latestRemovedXid;
+}
+
 /*
  * Delete item(s) from a btree page during single-page cleanup.
  *
@@ -1067,8 +1211,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 
 	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
 		latestRemovedXid =
-			index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
-												 itemnos, nitems);
+			_bt_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+											   itemnos, nitems);
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd5289ad..b03bf67c26 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static ItemPointer btreepostingremains(BTVacState *vstate, IndexTuple itup,
+									   int *nremaining);
 
 
 /*
@@ -263,8 +265,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 				 */
 				if (so->killedItems == NULL)
 					so->killedItems = (int *)
-						palloc(MaxIndexTuplesPerPage * sizeof(int));
-				if (so->numKilled < MaxIndexTuplesPerPage)
+						palloc(MaxPostingIndexTuplesPerPage * sizeof(int));
+				if (so->numKilled < MaxPostingIndexTuplesPerPage)
 					so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 			}
 
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 								 RBM_NORMAL, info->strategy);
 		LockBufferForCleanup(buf);
 		_bt_checkpage(rel, buf);
-		_bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+		_bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+							vstate.lastBlockVacuumed);
 		_bt_relbuf(rel, buf);
 	}
 
@@ -1188,8 +1191,15 @@ restart:
 	}
 	else if (P_ISLEAF(opaque))
 	{
+		/* Deletable item state */
 		OffsetNumber deletable[MaxOffsetNumber];
 		int			ndeletable;
+
+		/* Updatable item state (for posting lists_ */
+		IndexTuple	updated[MaxOffsetNumber];
+		OffsetNumber updatable[MaxOffsetNumber];
+		int			nupdatable;
+
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
@@ -1229,6 +1239,7 @@ restart:
 		 * callback function.
 		 */
 		ndeletable = 0;
+		nupdatable = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
 		if (callback)
@@ -1238,11 +1249,9 @@ restart:
 				 offnum = OffsetNumberNext(offnum))
 			{
 				IndexTuple	itup;
-				ItemPointer htup;
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
 				/*
 				 * During Hot Standby we currently assume that
@@ -1265,8 +1274,71 @@ restart:
 				 * applies to *any* type of index that marks index tuples as
 				 * killed.
 				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Regular tuple, standard heap TID representation */
+					ItemPointer htid = &(itup->t_tid);
+
+					if (callback(htid, callback_state))
+						deletable[ndeletable++] = offnum;
+				}
+				else
+				{
+					ItemPointer newhtids;
+					int			nremaining = 0;
+
+					/*
+					 * Posting list tuple, a physical tuple that represents
+					 * two or more logical tuples.
+					 *
+					 * Have to consider the need to VACUUM away the "logical"
+					 * typles contained in posting list tuple
+					 */
+					newhtids = btreepostingremains(vstate, itup, &nremaining);
+					if (nremaining == 0)
+					{
+						/*
+						 * All TIDs/logical tuples from the posting list must
+						 * be deleted, we can delete whole physical tuple as
+						 * if it wasn't a posting list tuple.
+						 */
+						deletable[ndeletable++] = offnum;
+						Assert(newhtids == NULL);
+					}
+					else if (nremaining < BTreeTupleGetNPosting(itup))
+					{
+						IndexTuple	updatedtuple;
+
+						/*
+						 * A subset of the logical tuples/TIDs must remain.
+						 * Perform an update (page delete + page add item) to
+						 * delete some but not all logical tuples in the
+						 * posting list.
+						 *
+						 * Form new tuple that contains only remaining TIDs.
+						 * Remember this tuple and the offset of the old tuple
+						 * to update it in place.
+						 *
+						 * Note that the new tuple won't be a posting list
+						 * tuple when only one remaining logical tuple isn't
+						 * in the process of being killed.
+						 */
+						updatedtuple = BTreeFormPostingTuple(itup, newhtids,
+															 nremaining);
+						updated[nupdatable] = updatedtuple;
+						updatable[nupdatable++] = offnum;
+						pfree(newhtids);
+					}
+					else
+					{
+						/*
+						 * All TIDs/logical tuples from the posting tuple are
+						 * remain, so no update or delete is required.
+						 */
+						Assert(nremaining == BTreeTupleGetNPosting(itup));
+						pfree(newhtids);
+					}
+				}
 			}
 		}
 
@@ -1274,7 +1346,7 @@ restart:
 		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
 		 * call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nupdatable > 0)
 		{
 			/*
 			 * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1290,7 +1362,8 @@ restart:
 			 * doesn't seem worth the amount of bookkeeping it'd take to avoid
 			 * that.
 			 */
-			_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+			_bt_delitems_vacuum(rel, buf, deletable, ndeletable, updatable,
+								updated, nupdatable,
 								vstate->lastBlockVacuumed);
 
 			/*
@@ -1375,6 +1448,43 @@ restart:
 	}
 }
 
+/*
+ * btreepostingremains() -- determines which logical tuples must remain when
+ * VACUUMing a posting list tuple.
+ *
+ * Returns new palloc'd array of item pointers needed to build replacement
+ * posting list.  The array's size is returned by setting *nremaining.
+ *
+ * If all items are dead, returns NULL.
+ */
+static ItemPointer
+btreepostingremains(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+	int			remaining = 0;
+	int			nitem = BTreeTupleGetNPosting(itup);
+	ItemPointer tmpitems = NULL,
+				items = BTreeTupleGetPosting(itup);
+
+	Assert(BTreeTupleIsPosting(itup));
+
+	/*
+	 * Check each tuple in the posting list, save alive tuples into tmpitems
+	 */
+	for (int i = 0; i < nitem; i++)
+	{
+		if (vstate->callback(items + i, vstate->callback_state))
+			continue;
+
+		if (tmpitems == NULL)
+			tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+		tmpitems[remaining++] = items[i];
+	}
+
+	*nremaining = remaining;
+	return tmpitems;
+}
+
 /*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8e512461a0..af5e136af7 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int	_bt_binsrch_posting(BTScanInsert key, Page page,
+								OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static void _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+								  OffsetNumber offnum, ItemPointer heapTid,
+								  IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+									   OffsetNumber offnum,
+									   ItemPointer heapTid);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -434,7 +442,10 @@ _bt_binsrch(Relation rel,
  * low) makes bounds invalid.
  *
  * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's in_posting_offset field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
  */
 OffsetNumber
 _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +464,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 	Assert(P_ISLEAF(opaque));
 	Assert(!key->nextkey);
+	Assert(insertstate->in_posting_offset == 0);
 
 	if (!insertstate->bounds_valid)
 	{
@@ -509,6 +521,17 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 			if (result != 0)
 				stricthigh = high;
 		}
+
+		/*
+		 * If tuple at offset located by binary search is a posting list whose
+		 * TID range overlaps with caller's scantid, perform posting list
+		 * binary search to set in_posting_offset for caller.  Caller must
+		 * split the posting list when in_posting_offset is set.  This should
+		 * happen infrequently.
+		 */
+		if (unlikely(result == 0 && key->scantid != NULL))
+			insertstate->in_posting_offset =
+				_bt_binsrch_posting(key, page, mid);
 	}
 
 	/*
@@ -528,6 +551,68 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	return low;
 }
 
+/*----------
+ *	_bt_binsrch_posting() -- posting list binary search.
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+	IndexTuple	itup;
+	ItemId		itemid;
+	int			low,
+				high,
+				mid,
+				res;
+
+	/*
+	 * If this isn't a posting tuple, then the index must be corrupt (if it is
+	 * an ordinary non-pivot tuple then there must be an existing tuple with a
+	 * heap TID that equals inserter's new heap TID/scantid).  Defensively
+	 * check that tuple is a posting list tuple whose posting list range
+	 * includes caller's scantid.
+	 *
+	 * (This is also needed because contrib/amcheck's rootdescend option needs
+	 * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+	 */
+	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+	Assert(!key->nextkey);
+	Assert(key->scantid != NULL);
+	itemid = PageGetItemId(page, offnum);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+	if (!BTreeTupleIsPosting(itup))
+		return 0;
+
+	/*
+	 * In the unlikely event that posting list tuple has LP_DEAD bit set,
+	 * signal to caller that it should kill the item and restart its binary
+	 * search.
+	 */
+	if (ItemIdIsDead(itemid))
+		return -1;
+
+	/* "high" is past end of posting list for loop invariant */
+	low = 0;
+	high = BTreeTupleGetNPosting(itup);
+	Assert(high >= 2);
+
+	while (high > low)
+	{
+		mid = low + ((high - low) / 2);
+		res = ItemPointerCompare(key->scantid,
+								 BTreeTupleGetPostingN(itup, mid));
+
+		if (res >= 1)
+			low = mid + 1;
+		else
+			high = mid;
+	}
+
+	return low;
+}
+
 /*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
@@ -537,9 +622,18 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
  *			<0 if scankey < tuple at offnum;
  *			 0 if scankey == tuple at offnum;
  *			>0 if scankey > tuple at offnum.
- *		NULLs in the keys are treated as sortable values.  Therefore
- *		"equality" does not necessarily mean that the item should be
- *		returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values.  Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key.  Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid.  There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
+ *
+ * It is generally guaranteed that any possible scankey with scantid set
+ * will have zero or one tuples in the index that are considered equal
+ * here.
  *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
@@ -563,6 +657,7 @@ _bt_compare(Relation rel,
 	ScanKey		scankey;
 	int			ncmpkey;
 	int			ntupatts;
+	int32		result;
 
 	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +692,6 @@ _bt_compare(Relation rel,
 	{
 		Datum		datum;
 		bool		isNull;
-		int32		result;
 
 		datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
 
@@ -713,8 +807,24 @@ _bt_compare(Relation rel,
 	if (heapTid == NULL)
 		return 1;
 
+	/*
+	 * scankey must be treated as equal to a posting list tuple if its scantid
+	 * value falls within the range of the posting list.  In all other cases
+	 * there can only be a single heap TID value, which is compared directly
+	 * as a simple scalar value.
+	 */
 	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	return ItemPointerCompare(key->scantid, heapTid);
+	result = ItemPointerCompare(key->scantid, heapTid);
+	if (!BTreeTupleIsPosting(itup) || result <= 0)
+		return result;
+	else
+	{
+		result = ItemPointerCompare(key->scantid, BTreeTupleGetMaxTID(itup));
+		if (result > 0)
+			return 1;
+	}
+
+	return 0;
 }
 
 /*
@@ -1451,6 +1561,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	/* initialize tuple workspace to empty */
 	so->currPos.nextTupleOffset = 0;
+	so->currPos.postingTupleOffset = 0;
 
 	/*
 	 * Now that the current page has been made consistent, the macro should be
@@ -1485,8 +1596,29 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				if (!BTreeTupleIsPosting(itup))
+				{
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+				else
+				{
+					/*
+					 * Setup state to return posting list, and save first
+					 * "logical" tuple
+					 */
+					_bt_setuppostingitems(so, itemIndex, offnum,
+										  BTreeTupleGetPostingN(itup, 0),
+										  itup);
+					itemIndex++;
+					/* Save additional posting list "logical" tuples */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i));
+						itemIndex++;
+					}
+				}
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1519,7 +1651,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1527,7 +1659,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxPostingIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1569,8 +1701,36 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (passes_quals && tuple_alive)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				if (!BTreeTupleIsPosting(itup))
+				{
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+				else
+				{
+					int			i = BTreeTupleGetNPosting(itup) - 1;
+
+					/*
+					 * Setup state to return posting list, and save last
+					 * "logical" tuple from posting list (since it's the first
+					 * that will be returned to scan).
+					 */
+					itemIndex--;
+					_bt_setuppostingitems(so, itemIndex, offnum,
+										  BTreeTupleGetPostingN(itup, i--),
+										  itup);
+
+					/*
+					 * Return posting list "logical" tuples -- do this in
+					 * descending order, to match overall scan order
+					 */
+					for (; i >= 0; i--)
+					{
+						itemIndex--;
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i));
+					}
+				}
 			}
 			if (!continuescan)
 			{
@@ -1584,8 +1744,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1758,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1610,6 +1772,59 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 	}
 }
 
+/*
+ * Setup state to save posting items from a single posting list tuple.  Saves
+ * the logical tuple that will be returned to scan first in passing.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for logical tuple
+ * that is returned to scan first.  Second or subsequent heap TID for posting
+ * list should be saved by calling _bt_savepostingitem().
+ */
+static void
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					  ItemPointer heapTid, IndexTuple itup)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		/* Save a truncated version of the IndexTuple */
+		Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+		itupsz = MAXALIGN(itupsz);
+		currItem->tupleOffset = so->currPos.nextTupleOffset;
+		memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+		so->currPos.nextTupleOffset += itupsz;
+		so->currPos.postingTupleOffset = currItem->tupleOffset;
+	}
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer heapTid)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+
+	/*
+	 * Have index-only scans return the same truncated IndexTuple for every
+	 * logical tuple that originates from the same posting list
+	 */
+	if (so->currTuples)
+		currItem->tupleOffset = so->currPos.postingTupleOffset;
+}
+
 /*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index ab19692006..480a7824d4 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -285,6 +285,8 @@ static BTPageState *_bt_pagestate(BTWriteState *wstate, uint32 level);
 static void _bt_slideleft(Page page);
 static void _bt_sortaddtup(Page page, Size itemsize,
 						   IndexTuple itup, OffsetNumber itup_off);
+static void _bt_sortdedup(BTWriteState *wstate, BTPageState *state,
+						  BTDedupState *dedupState);
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 						 IndexTuple itup);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
@@ -301,6 +303,7 @@ static void _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 									   BTShared *btshared, Sharedsort *sharedsort,
 									   Sharedsort *sharedsort2, int sortmem,
 									   bool progress);
+static void _bt_dedup_item_tid_sort(BTDedupState *dedupState, IndexTuple itup);
 
 
 /*
@@ -798,6 +801,43 @@ _bt_sortaddtup(Page page,
 		elog(ERROR, "failed to add item to the index page");
 }
 
+/*
+ * Add new tuple (posting or non-posting) to the page being built.
+ *
+ * This is almost like nbtinsert.c's _bt_dedup(), but it avoids incremental
+ * space accounting, and adds a new tuple using nbtsort.c facilities.
+ */
+static void
+_bt_sortdedup(BTWriteState *wstate, BTPageState *state,
+			  BTDedupState *dedupState)
+{
+	IndexTuple	to_insert;
+
+	/* Return, if there is no tuple to insert */
+	if (state == NULL)
+		return;
+
+	if (dedupState->nhtids == 0)
+		to_insert = dedupState->base;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(dedupState->base,
+											 dedupState->htids,
+											 dedupState->nhtids);
+		to_insert = postingtuple;
+		pfree(dedupState->htids);
+	}
+
+	_bt_buildadd(wstate, state, to_insert);
+
+	if (dedupState->nhtids > 0)
+		pfree(to_insert);
+	dedupState->nhtids = 0;
+}
+
 /*----------
  * Add an item to a disk page from the sort output.
  *
@@ -830,6 +870,8 @@ _bt_sortaddtup(Page page,
  * the high key is to be truncated, offset 1 is deleted, and we insert
  * the truncated high key at offset 1.
  *
+ * Note that itup may be a posting list tuple.
+ *
  * 'last' pointer indicates the last offset added to the page.
  *----------
  */
@@ -963,6 +1005,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * Overwrite the old item with new truncated high key directly.
 			 * oitup is already located at the physical beginning of tuple
 			 * space, so this should directly reuse the existing tuple space.
+			 *
+			 * If lastleft tuple was a posting tuple, we'll truncate its
+			 * posting list in _bt_truncate as well. Note that it is also
+			 * applicable only to leaf pages, since internal pages never
+			 * contain posting tuples.
 			 */
 			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
 			lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1002,6 +1049,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		 * the minimum key for the new page.
 		 */
 		state->btps_minkey = CopyIndexTuple(oitup);
+		Assert(BTreeTupleIsPivot(state->btps_minkey));
 
 		/*
 		 * Set the sibling links for both pages.
@@ -1043,6 +1091,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(state->btps_minkey == NULL);
 		state->btps_minkey = CopyIndexTuple(itup);
 		/* _bt_sortaddtup() will perform full truncation later */
+		BTreeTupleClearBtIsPosting(state->btps_minkey);
 		BTreeTupleSetNAtts(state->btps_minkey, 0);
 	}
 
@@ -1141,9 +1190,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	bool		load1;
 	TupleDesc	tupdes = RelationGetDescr(wstate->index);
 	int			i,
-				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+				natts = IndexRelationGetNumberOfAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		deduplicate = false;
+	BTDedupState *dedupState = NULL;
+
+	/*
+	 * Don't use deduplication for indexes with INCLUDEd columns and unique
+	 * indexes
+	 */
+	deduplicate = (IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+				   IndexRelationGetNumberOfAttributes(wstate->index) &&
+				   !wstate->index->rd_index->indisunique);
 
 	if (merge)
 	{
@@ -1257,19 +1317,99 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	}
 	else
 	{
-		/* merge is unnecessary */
-		while ((itup = tuplesort_getindextuple(btspool->sortstate,
-											   true)) != NULL)
+		if (!deduplicate)
 		{
-			/* When we see first tuple, create first index page */
-			if (state == NULL)
-				state = _bt_pagestate(wstate, 0);
+			/* merge is unnecessary */
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+					state = _bt_pagestate(wstate, 0);
 
-			_bt_buildadd(wstate, state, itup);
+				_bt_buildadd(wstate, state, itup);
 
-			/* Report progress */
-			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
-										 ++tuples_done);
+				/* Report progress */
+				pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+											 ++tuples_done);
+			}
+		}
+		else
+		{
+			/* init deduplication state needed to build posting tuples */
+			dedupState = (BTDedupState *) palloc(sizeof(BTDedupState));
+
+			/* Convenience variables concerning generic limits */
+			dedupState->maxitemsize = 0;
+			dedupState->maxpostingsize = 0;
+			/* Metadata about current pending posting list */
+			dedupState->htids = NULL;
+			dedupState->nhtids = 0;
+			dedupState->nitems = 0;
+			dedupState->alltupsize = 0;
+			/* Metadata about based tuple of current pending posting list */
+			dedupState->base = NULL;
+			dedupState->base_off = InvalidOffsetNumber;
+			dedupState->basetupsize = 0;
+			/* Finally, n_intervals should be initialized to zero */
+			dedupState->n_intervals = 0;
+
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+				{
+					state = _bt_pagestate(wstate, 0);
+					dedupState->maxitemsize = BTMaxItemSize(state->btps_page);
+				}
+
+				if (dedupState->base != NULL)
+				{
+					int			n_equal_atts = _bt_keep_natts_fast(wstate->index,
+																   dedupState->base, itup);
+
+					if (n_equal_atts > natts)
+					{
+						/*
+						 * Tuples are equal. Create or update posting.
+						 *
+						 * Else If posting is too big, insert it on page and
+						 * continue.
+						 */
+						if ((dedupState->nhtids + 1) *
+							sizeof(ItemPointerData) <
+							dedupState->maxpostingsize)
+							_bt_dedup_item_tid_sort(dedupState, itup);
+						else
+							_bt_sortdedup(wstate, state, dedupState);
+					}
+					else
+					{
+						/*
+						 * Tuples are not equal. Insert base into index.  Save
+						 * current tuple for the next iteration.
+						 */
+						_bt_sortdedup(wstate, state, dedupState);
+					}
+				}
+
+				/*
+				 * Save the tuple to compare it with the next one and maybe
+				 * unite them into a posting tuple.
+				 */
+				if (dedupState->base)
+					pfree(dedupState->base);
+				dedupState->base = CopyIndexTuple(itup);
+
+				/* compute max size of posting list */
+				dedupState->maxpostingsize = dedupState->maxitemsize -
+					IndexInfoFindDataOffset(dedupState->base->t_info) -
+					MAXALIGN(IndexTupleSize(dedupState->base));
+			}
+
+			/* Handle the last item */
+			_bt_sortdedup(wstate, state, dedupState);
 		}
 	}
 
@@ -1798,3 +1938,72 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	if (btspool2)
 		tuplesort_end(btspool2->sortstate);
 }
+
+/*
+ * FIXME: Merge this with _bt_dedup_item_tid(), which still has global
+ * linkage.
+ */
+static void
+_bt_dedup_item_tid_sort(BTDedupState *dedupState, IndexTuple itup)
+{
+	int			nposting = 0;
+
+	if (dedupState->nhtids == 0)
+	{
+		dedupState->htids = palloc0(dedupState->maxitemsize);
+		dedupState->alltupsize =
+			MAXALIGN(IndexTupleSize(dedupState->base)) +
+			sizeof(ItemIdData);
+
+		/*
+		 * base hasn't had its posting list TIDs copied into htids yet (must
+		 * have been first on page and/or in new posting list?).  Do so now.
+		 *
+		 * This is delayed because it wasn't initially clear whether or not
+		 * base would be merged with the next tuple, or stay as-is.  By now
+		 * caller compared it against itup and found that it was equal, so we
+		 * can go ahead and add its TIDs.
+		 */
+		if (!BTreeTupleIsPosting(dedupState->base))
+		{
+			memcpy(dedupState->htids, dedupState->base,
+				   sizeof(ItemPointerData));
+			dedupState->nhtids++;
+		}
+		else
+		{
+			/* if base is posting, add all its TIDs to the posting list */
+			nposting = BTreeTupleGetNPosting(dedupState->base);
+			memcpy(dedupState->htids,
+				   BTreeTupleGetPosting(dedupState->base),
+				   sizeof(ItemPointerData) * nposting);
+			dedupState->nhtids += nposting;
+		}
+	}
+
+	/*
+	 * Add current tup to htids for pending posting list for new version of
+	 * page.
+	 */
+	if (!BTreeTupleIsPosting(itup))
+	{
+		memcpy(dedupState->htids + dedupState->nhtids, itup,
+			   sizeof(ItemPointerData));
+		dedupState->nhtids++;
+	}
+	else
+	{
+		/*
+		 * if tuple is posting, add all its TIDs to the pending list that will
+		 * become new posting list later on
+		 */
+		nposting = BTreeTupleGetNPosting(itup);
+		memcpy(dedupState->htids + dedupState->nhtids,
+			   BTreeTupleGetPosting(itup),
+			   sizeof(ItemPointerData) * nposting);
+		dedupState->nhtids += nposting;
+	}
+
+	dedupState->alltupsize +=
+		MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
+}
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1c1029b6c4..54cecc85c5 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
 	state.minfirstrightsz = SIZE_MAX;
 	state.newitemoff = newitemoff;
 
+	/* newitem cannot be a posting list item */
+	Assert(!BTreeTupleIsPosting(newitem));
+
 	/*
 	 * maxsplits should never exceed maxoff because there will be at most as
 	 * many candidate split points as there are points _between_ tuples, once
@@ -459,17 +462,52 @@ _bt_recsplitloc(FindSplitData *state,
 	int16		leftfree,
 				rightfree;
 	Size		firstrightitemsz;
+	Size		postingsubhikey = 0;
 	bool		newitemisfirstonright;
 
 	/* Is the new item going to be the first item on the right page? */
 	newitemisfirstonright = (firstoldonright == state->newitemoff
 							 && !newitemonleft);
 
+	/*
+	 * FIXME: Accessing every single tuple like this adds cycles to cases that
+	 * cannot possibly benefit (i.e. cases where we know that there cannot be
+	 * posting lists).  Maybe we should add a way to not bother when we are
+	 * certain that this is the case.
+	 *
+	 * We could either have _bt_split() pass us a flag, or invent a page flag
+	 * that indicates that the page might have posting lists, as an
+	 * optimization.  There is no shortage of btpo_flags bits for stuff like
+	 * this.
+	 */
 	if (newitemisfirstonright)
+	{
 		firstrightitemsz = state->newitemsz;
+
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+			postingsubhikey = IndexTupleSize(state->newitem) -
+				BTreeTupleGetPostingOffset(state->newitem);
+	}
 	else
+	{
 		firstrightitemsz = firstoldonrightsz;
 
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf)
+		{
+			ItemId		itemid;
+			IndexTuple	newhighkey;
+
+			itemid = PageGetItemId(state->page, firstoldonright);
+			newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+			if (BTreeTupleIsPosting(newhighkey))
+				postingsubhikey = IndexTupleSize(newhighkey) -
+					BTreeTupleGetPostingOffset(newhighkey);
+		}
+	}
+
 	/* Account for all the old tuples */
 	leftfree = state->leftspace - olddataitemstoleft;
 	rightfree = state->rightspace -
@@ -492,9 +530,13 @@ _bt_recsplitloc(FindSplitData *state,
 	 * adding a heap TID to the left half's new high key when splitting at the
 	 * leaf level.  In practice the new high key will often be smaller and
 	 * will rarely be larger, but conservatively assume the worst case.
+	 * Truncation always truncates away any posting list that appears in the
+	 * first right tuple, though, so it's safe to subtract that overhead
+	 * (while still conservatively assuming that truncation might have to add
+	 * back a single heap TID using the pivot tuple heap TID representation).
 	 */
 	if (state->is_leaf)
-		leftfree -= (int16) (firstrightitemsz +
+		leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
 							 MAXALIGN(sizeof(ItemPointerData)));
 	else
 		leftfree -= (int16) firstrightitemsz;
@@ -691,7 +733,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
 	tup = (IndexTuple) PageGetItem(state->page, itemid);
 	/* Do cheaper test first */
-	if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+	if (BTreeTupleIsPosting(tup) ||
+		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
 	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index bc855dd25d..7460bf264d 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -97,8 +97,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	indoption = rel->rd_indoption;
 	tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
 
-	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
 	/*
 	 * We'll execute search using scan key constructed on key columns.
 	 * Truncated attributes and non-key attributes are omitted from the final
@@ -110,9 +108,20 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	key->anynullkeys = false;	/* initial assumption */
 	key->nextkey = false;
 	key->pivotsearch = false;
+	key->scantid = NULL;
 	key->keysz = Min(indnkeyatts, tupnatts);
-	key->scantid = key->heapkeyspace && itup ?
-		BTreeTupleGetHeapTID(itup) : NULL;
+
+	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+	Assert(!itup || !BTreeTupleIsPosting(itup) || key->heapkeyspace);
+
+	/*
+	 * When caller passes a tuple with a heap TID, use it to set scantid. Note
+	 * that this handles posting list tuples by setting scantid to the lowest
+	 * heap TID in the posting list.
+	 */
+	if (itup && key->heapkeyspace)
+		key->scantid = BTreeTupleGetHeapTID(itup);
+
 	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
@@ -1386,6 +1395,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			continue;
 		}
 
@@ -1547,6 +1557,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			cmpresult = 0;
 			if (subkey->sk_flags & SK_ROW_END)
 				break;
@@ -1786,10 +1797,35 @@ _bt_killitems(IndexScanDesc scan)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
+			bool		killtuple = false;
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			if (BTreeTupleIsPosting(ituple))
 			{
-				/* found the item */
+				int			pi = i + 1;
+				int			nposting = BTreeTupleGetNPosting(ituple);
+				int			j;
+
+				for (j = 0; j < nposting; j++)
+				{
+					ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+					if (!ItemPointerEquals(item, &kitem->heapTid))
+						break;	/* out of posting list loop */
+
+					/* Read-ahead to later kitems */
+					if (pi < numKilled)
+						kitem = &so->currPos.items[so->killedItems[pi++]];
+				}
+
+				if (j == nposting)
+					killtuple = true;
+			}
+			else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+				killtuple = true;
+
+			if (killtuple)
+			{
+				/* found the item/all posting list items */
 				ItemIdMarkDead(iid);
 				killedsomething = true;
 				break;			/* out of inner search loop */
@@ -2140,6 +2176,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(firstright))
+		{
+			BTreeTupleClearBtIsPosting(pivot);
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			if (keepnatts == natts)
+			{
+				/*
+				 * index_truncate_tuple() just returned a copy of the
+				 * original, so make sure that the size of the new pivot tuple
+				 * doesn't have posting list overhead
+				 */
+				pivot->t_info &= ~INDEX_SIZE_MASK;
+				pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+			}
+		}
+
+		Assert(!BTreeTupleIsPosting(pivot));
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2156,6 +2210,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * attribute to the new pivot tuple.
 		 */
 		Assert(natts != nkeyatts);
+		Assert(!BTreeTupleIsPosting(lastleft) &&
+			   !BTreeTupleIsPosting(firstright));
 		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
 		tidpivot = palloc0(newsize);
 		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2163,6 +2219,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		pfree(pivot);
 		pivot = tidpivot;
 	}
+	else if (BTreeTupleIsPosting(firstright))
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal.  We
+		 * can always truncate away a posting list, though.
+		 *
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+			MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+		pivot->t_info &= ~INDEX_SIZE_MASK;
+		pivot->t_info |= newsize;
+		BTreeTupleClearBtIsPosting(pivot);
+		BTreeTupleSetAltHeapTID(pivot);
+	}
 	else
 	{
 		/*
@@ -2170,7 +2244,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * It's necessary to add a heap TID attribute to the new pivot tuple.
 		 */
 		Assert(natts == nkeyatts);
-		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+		newsize = MAXALIGN(IndexTupleSize(firstright)) +
+			MAXALIGN(sizeof(ItemPointerData));
 		pivot = palloc0(newsize);
 		memcpy(pivot, firstright, IndexTupleSize(firstright));
 	}
@@ -2188,6 +2263,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * nbtree (e.g., there is no pg_attribute entry).
 	 */
 	Assert(itup_key->heapkeyspace);
+	Assert(!BTreeTupleIsPosting(pivot));
 	pivot->t_info &= ~INDEX_SIZE_MASK;
 	pivot->t_info |= newsize;
 
@@ -2200,7 +2276,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2211,9 +2287,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+							  BTreeTupleGetHeapTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #else
 
 	/*
@@ -2226,7 +2305,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2235,7 +2314,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2316,15 +2396,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
  * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal (once detoasted).  Similarly, result may
- * differ from the _bt_keep_natts result when either tuple has TOASTed datums,
- * though this is barely possible in practice.
+ * unless they're bitwise equal after detoasting.
  *
  * These issues must be acceptable to callers, typically because they're only
  * concerned about making suffix truncation as effective as possible without
  * leaving excessive amounts of free space on either side of page split.
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts.
+ *
+ * When an index only uses opclasses where equality is "precise", this
+ * function is guaranteed to give the same result as _bt_keep_natts().  This
+ * makes it safe to use this function to determine whether or not two tuples
+ * can be folded together into a single posting tuple.  Posting list
+ * deduplication cannot be used with nondeterministic collations for this
+ * reason.
+ *
+ * FIXME: Actually invent the needed "equality-is-precise" opclass
+ * infrastructure.  See dedicated -hackers thread:
+ *
+ * https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2349,8 +2439,38 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 		if (isNull1 != isNull2)
 			break;
 
+		/*
+		 * XXX: The ideal outcome from the point of view of the posting list
+		 * patch is that the definition of an opclass with "precise equality"
+		 * becomes: "equality operator function must give exactly the same
+		 * answer as datum_image_eq() would, provided that we aren't using a
+		 * nondeterministic collation". (Nondeterministic collations are
+		 * clearly not compatible with deduplication.)
+		 *
+		 * This will be a lot faster than actually using the authoritative
+		 * insertion scankey in some cases.  This approach also seems more
+		 * elegant, since suffix truncation gets to follow exactly the same
+		 * definition of "equal" as posting list deduplication -- there is a
+		 * subtle interplay between deduplication and suffix truncation, and
+		 * it would be nice to know for sure that they have exactly the same
+		 * idea about what equality is.
+		 *
+		 * This ideal outcome still avoids problems with TOAST.  We cannot
+		 * repeat bugs like the amcheck bug that was fixed in bugfix commit
+		 * eba775345d23d2c999bbb412ae658b6dab36e3e8.  datum_image_eq()
+		 * considers binary equality, though only _after_ each datum is
+		 * decompressed.
+		 *
+		 * If this ideal solution isn't possible, then we can fall back on
+		 * defining "precise equality" as: "type's output function must
+		 * produce identical textual output for any two datums that compare
+		 * equal when using a safe/equality-is-precise operator class (unless
+		 * using a nondeterministic collation)".  That would mean that we'd
+		 * have to make deduplication call _bt_keep_natts() instead (or some
+		 * other function that uses authoritative insertion scankey).
+		 */
 		if (!isNull1 &&
-			!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+			!datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
 			break;
 
 		keepnatts++;
@@ -2402,22 +2522,30 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
+	/* !heapkeyspace indexes do not support deduplication */
+	if (!heapkeyspace && BTreeTupleIsPosting(itup))
+		return false;
+
+	/* INCLUDE indexes do not support deduplication */
+	if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+		return false;
+
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
 			/*
-			 * Non-pivot tuples currently never use alternative heap TID
-			 * representation -- even those within heapkeyspace indexes
+			 * Non-pivot tuple should never be explicitly marked as a pivot
+			 * tuple
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated.  (Note that tupnatts must have been
-			 * inferred, rather than coming from an explicit on-disk
-			 * representation.)
+			 * inferred, even with a posting list tuple, because only pivot
+			 * tuples store tupnatts directly.)
 			 */
 			return tupnatts == natts;
 		}
@@ -2461,12 +2589,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * non-zero, or when there is no explicit representation and the
 			 * tuple is evidently not a pre-pg_upgrade tuple.
 			 *
-			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
-			 * that to decide if the tuple is a pre-v11 tuple.
+			 * Prior to v11, downlinks always had P_HIKEY as their offset.
+			 * Accept that as an alternative indication of a valid
+			 * !heapkeyspace negative infinity tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
-				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+				ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
 		}
 		else
 		{
@@ -2492,7 +2620,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
+		return false;
+
+	/* Pivot tuple should not use posting list representation (redundant) */
+	if (BTreeTupleIsPosting(itup))
 		return false;
 
 	/*
@@ -2562,11 +2694,85 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					BTMaxItemSizeNoHeapTid(page),
 					RelationGetRelationName(rel)),
 			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
-					   ItemPointerGetBlockNumber(&newtup->t_tid),
-					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+					   ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
 					   RelationGetRelationName(heap)),
 			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
 					 "Consider a function index of an MD5 hash of the value, "
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Given a basic tuple that contains key datum and posting list, build a
+ * posting tuple.  Caller's "htids" array must be sorted in ascending order.
+ *
+ * Basic tuple can be a posting tuple, but we only use key part of it, all
+ * ItemPointers must be passed via htids.
+ *
+ * If nhtids == 1, just build a non-posting tuple.  It is necessary to avoid
+ * storage overhead after posting tuple was vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointer htids, int nhtids)
+{
+	uint32		keysize,
+				newsize = 0;
+	IndexTuple	itup;
+
+	/* We only need key part of the tuple */
+	if (BTreeTupleIsPosting(tuple))
+		keysize = BTreeTupleGetPostingOffset(tuple);
+	else
+		keysize = IndexTupleSize(tuple);
+
+	Assert(nhtids > 0);
+
+	/* Add space needed for posting list */
+	if (nhtids > 1)
+		newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nhtids;
+	else
+		newsize = keysize;
+
+	newsize = MAXALIGN(newsize);
+	itup = palloc0(newsize);
+	memcpy(itup, tuple, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+
+	if (nhtids > 1)
+	{
+		/* Form posting tuple, fill posting fields */
+
+		itup->t_info |= INDEX_ALT_TID_MASK;
+		BTreeSetPostingMeta(itup, nhtids, SHORTALIGN(keysize));
+		/* Copy posting list into the posting tuple */
+		memcpy(BTreeTupleGetPosting(itup), htids,
+			   sizeof(ItemPointerData) * nhtids);
+
+#ifdef USE_ASSERT_CHECKING
+		{
+			/* Assert that htid array is sorted and has unique TIDs */
+			ItemPointerData last;
+			ItemPointer current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+				current = BTreeTupleGetPostingN(itup, i);
+				Assert(ItemPointerCompare(current, &last) > 0);
+				ItemPointerCopy(current, &last);
+			}
+		}
+#endif
+	}
+	else
+	{
+		/* To finish building of a non-posting tuple, copy TID from htids */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		ItemPointerCopy(htids, &itup->t_tid);
+	}
+
+	return itup;
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c1aa..ae786404ba 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -181,9 +181,39 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 
 		page = BufferGetPage(buffer);
 
-		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
-						false, false) == InvalidOffsetNumber)
-			elog(PANIC, "btree_xlog_insert: failed to add item");
+		if (xlrec->in_posting_offset == InvalidOffsetNumber)
+		{
+			/* Simple retail insertion */
+			if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+		}
+		else
+		{
+			/* posting list split (of posting list just before new item) */
+			ItemId		itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+			IndexTuple	oposting = (IndexTuple) PageGetItem(page, itemid);
+			IndexTuple	newitem = (IndexTuple) datapos;
+			IndexTuple	nposting;
+
+			/*
+			 * Reconstruct nposting from original newitem, and make original
+			 * newitem into final newitem
+			 */
+			nposting = _bt_posting_split(newitem, oposting,
+										 xlrec->in_posting_offset);
+			Assert(isleaf);
+			Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+				   MAXALIGN(IndexTupleSize(nposting)));
+
+			/* Replace existing/original posting list */
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+			/* insert new item */
+			if (PageAddItem(page, (Item) newitem, MAXALIGN(IndexTupleSize(newitem)),
+							xlrec->offnum, false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+		}
 
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
@@ -265,20 +295,45 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 		BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
 		OffsetNumber off;
 		IndexTuple	newitem = NULL,
-					left_hikey = NULL;
+					left_hikey = NULL,
+					nposting = NULL;
 		Size		newitemsz = 0,
 					left_hikeysz = 0;
 		Page		newlpage;
-		OffsetNumber leftoff;
+		OffsetNumber leftoff,
+					replacepostingoff = InvalidOffsetNumber;
 
 		datapos = XLogRecGetBlockData(record, 0, &datalen);
 
-		if (onleft)
+		if (onleft || xlrec->in_posting_offset)
 		{
 			newitem = (IndexTuple) datapos;
 			newitemsz = MAXALIGN(IndexTupleSize(newitem));
 			datapos += newitemsz;
 			datalen -= newitemsz;
+
+			/*
+			 * Repeat logic implemented in _bt_insertonpg():
+			 *
+			 * If the new tuple is a duplicate with a heap TID that falls
+			 * inside the range of an existing posting list tuple, generate
+			 * new posting tuple to replace original, and update new tuple
+			 * from WAL record so that it becomes the "final" newitem inserted
+			 * originally.
+			 */
+			if (xlrec->in_posting_offset != 0)
+			{
+				ItemId		itemid = PageGetItemId(lpage, OffsetNumberPrev(xlrec->newitemoff));
+				IndexTuple	oposting = (IndexTuple) PageGetItem(lpage, itemid);
+
+				nposting = _bt_posting_split(newitem, oposting,
+											 xlrec->in_posting_offset);
+
+				/* Split posting list must be at offset before new item's */
+				replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+				Assert(BTreeTupleGetNPosting(nposting) == BTreeTupleGetNPosting(oposting));
+			}
 		}
 
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
@@ -304,6 +359,16 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 			Size		itemsz;
 			IndexTuple	item;
 
+			if (off == replacepostingoff)
+			{
+				if (PageAddItem(newlpage, (Item) nposting,
+								MAXALIGN(IndexTupleSize(nposting)), leftoff,
+								false, false) == InvalidOffsetNumber)
+					elog(ERROR, "failed to add new item to left page after split");
+				leftoff = OffsetNumberNext(leftoff);
+				continue;
+			}
+
 			/* add the new item if it was inserted on left page */
 			if (onleft && off == xlrec->newitemoff)
 			{
@@ -379,6 +444,141 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 	}
 }
 
+static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		buf;
+	Page		newpage;
+	xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+
+	if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Initialize a temporary empty page and copy all the items to that in
+		 * item number order.
+		 */
+		Page		page = (Page) BufferGetPage(buf);
+		BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		BTPageOpaque nopaque;
+		OffsetNumber offnum,
+					minoff,
+					maxoff;
+		BTDedupState *state = NULL;
+		char	   *data = ((char *) xlrec + SizeOfBtreeDedup);
+		dedupInterval dedup_intervals[MaxIndexTuplesPerPage];
+		int			nth_interval = 0;
+		OffsetNumber interval_nitems = 0;
+
+		state = (BTDedupState *) palloc(sizeof(BTDedupState));
+
+		/* Convenience variables concerning generic limits */
+		state->maxitemsize = BTMaxItemSize(page);
+		state->maxpostingsize = 0;
+		/* Metadata about current pending posting list */
+		state->htids = NULL;
+		state->nhtids = 0;
+		state->nitems = 0;
+		state->alltupsize = 0;
+		/* Metadata about based tuple of current pending posting list */
+		state->base = NULL;
+		state->base_off = InvalidOffsetNumber;
+		state->basetupsize = 0;
+		state->n_intervals = 0;
+
+		memcpy(dedup_intervals, data,
+			   xlrec->n_intervals * sizeof(dedupInterval));
+
+		/* Scan over all items to see which ones can be deduplicated */
+		minoff = P_FIRSTDATAKEY(opaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+		newpage = PageGetTempPageCopySpecial(page);
+		nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+		/* Make sure that new page won't have garbage flag set */
+		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+		/* Copy High Key if any */
+		if (!P_RIGHTMOST(opaque))
+		{
+			ItemId		itemid = PageGetItemId(page, P_HIKEY);
+			Size		itemsz = ItemIdGetLength(itemid);
+			IndexTuple	item = (IndexTuple) PageGetItem(page, itemid);
+
+			if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+							false, false) == InvalidOffsetNumber)
+				elog(ERROR, "failed to add highkey during deduplication");
+		}
+
+		/* Conservatively size array */
+		state->htids = palloc(state->maxitemsize);
+
+		/*
+		 * Iterate over tuples on the page to deduplicate them into posting
+		 * lists and insert into new page
+		 */
+		for (offnum = minoff;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemid = PageGetItemId(page, offnum);
+			IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+			Assert(!ItemIdIsDead(itemid));
+
+			if (offnum == minoff)
+			{
+				/*
+				 * No previous/base tuple for first data item -- use first
+				 * data item as base tuple of first pending posting list
+				 */
+				_bt_dedup_start_pending(state, itup, offnum);
+				interval_nitems++;
+			}
+			else if (nth_interval < xlrec->n_intervals &&
+					 state->base_off >= dedup_intervals[nth_interval].from &&
+					 interval_nitems < dedup_intervals[nth_interval].nitems)
+			{
+				/*
+				 * Item is a part of pending posting list that will be formed
+				 * using base tuple
+				 */
+				if (!_bt_dedup_save_htid(state, itup))
+					elog(ERROR, "could not add heap tid to pending posting list");
+
+				interval_nitems++;
+			}
+			else
+			{
+				/*
+				 * Tuple was not equal to pending posting list tuple on
+				 * primary, or BTMaxItemSize() limit was reached on primary
+				 */
+				_bt_dedup_finish_pending(newpage, state);
+
+				/* reset state */
+				if (interval_nitems > 1)
+					nth_interval++;
+				interval_nitems = 0;
+
+				/* itup starts new pending posting list */
+				_bt_dedup_start_pending(state, itup, offnum);
+				interval_nitems++;
+			}
+		}
+
+		/* Handle the last item */
+		_bt_dedup_finish_pending(newpage, state);
+
+		PageRestoreTempPage(newpage, page);
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buf);
+	}
+
+	if (BufferIsValid(buf))
+		UnlockReleaseBuffer(buf);
+}
+
 static void
 btree_xlog_vacuum(XLogReaderState *record)
 {
@@ -386,8 +586,8 @@ btree_xlog_vacuum(XLogReaderState *record)
 	Buffer		buffer;
 	Page		page;
 	BTPageOpaque opaque;
-#ifdef UNUSED
 	xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
 
 	/*
 	 * This section of code is thought to be no longer needed, after analysis
@@ -478,14 +678,34 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		if (len > 0)
 		{
-			OffsetNumber *unused;
-			OffsetNumber *unend;
+			if (xlrec->nupdated > 0)
+			{
+				OffsetNumber *updatedoffsets;
+				IndexTuple	updated;
+				Size		itemsz;
 
-			unused = (OffsetNumber *) ptr;
-			unend = (OffsetNumber *) ((char *) ptr + len);
+				updatedoffsets = (OffsetNumber *)
+					(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+				updated = (IndexTuple) ((char *) updatedoffsets +
+										xlrec->nupdated * sizeof(OffsetNumber));
 
-			if ((unend - unused) > 0)
-				PageIndexMultiDelete(page, unused, unend - unused);
+				/* Handle posting tuples */
+				for (int i = 0; i < xlrec->nupdated; i++)
+				{
+					PageIndexTupleDelete(page, updatedoffsets[i]);
+
+					itemsz = MAXALIGN(IndexTupleSize(updated));
+
+					if (PageAddItem(page, (Item) updated, itemsz, updatedoffsets[i],
+									false, false) == InvalidOffsetNumber)
+						elog(PANIC, "btree_xlog_vacuum: failed to add updated posting list item");
+
+					updated = (IndexTuple) ((char *) updated + itemsz);
+				}
+			}
+
+			if (xlrec->ndeleted)
+				PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 		}
 
 		/*
@@ -838,6 +1058,9 @@ btree_redo(XLogReaderState *record)
 		case XLOG_BTREE_SPLIT_R:
 			btree_xlog_split(false, record);
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			btree_xlog_dedup(record);
+			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
 			break;
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 4ee6d04a68..022cf091b1 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -30,7 +30,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_insert *xlrec = (xl_btree_insert *) rec;
 
-				appendStringInfo(buf, "off %u", xlrec->offnum);
+				appendStringInfo(buf, "off %u; in_posting_offset %u",
+								 xlrec->offnum, xlrec->in_posting_offset);
 				break;
 			}
 		case XLOG_BTREE_SPLIT_L:
@@ -38,16 +39,28 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
-				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d",
-								 xlrec->level, xlrec->firstright, xlrec->newitemoff);
+				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, in_posting_offset %d",
+								 xlrec->level,
+								 xlrec->firstright,
+								 xlrec->newitemoff,
+								 xlrec->in_posting_offset);
+				break;
+			}
+		case XLOG_BTREE_DEDUP_PAGE:
+			{
+				xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+				appendStringInfo(buf, "n_intervals %d", xlrec->n_intervals);
 				break;
 			}
 		case XLOG_BTREE_VACUUM:
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "lastBlockVacuumed %u",
-								 xlrec->lastBlockVacuumed);
+				appendStringInfo(buf, "lastBlockVacuumed %u; nupdated %u; ndeleted %u",
+								 xlrec->lastBlockVacuumed,
+								 xlrec->nupdated,
+								 xlrec->ndeleted);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
@@ -131,6 +144,9 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			id = "DEDUPLICATE";
+			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4a80e84aa7..d0346c06c8 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
  * All other types of index tuples ("pivot" tuples) only have key columns,
  * since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,38 @@ typedef struct BTMetaPageData
  * omitted rather than truncated, since its representation is different to
  * the non-pivot representation.)
  *
+ * Non-pivot posting tuple format:
+ *  t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicated keys more effectively, we use special format
+ * of tuples - posting tuples.  posting_list is an array of ItemPointerData.
+ *
+ * Deduplication never applies to unique indexes or indexes with INCLUDEd
+ * columns.
+ *
+ * To differ posting tuples we use INDEX_ALT_TID_MASK flag in t_info and
+ * BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's tid:
+ * - t_tid.ip_blkid contains offset of the posting list.
+ * - t_tid offset field contains number of posting items this tuple contain
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * tuples, which is constrainted by BTMaxItemSize.
+
+ * If page contains so many duplicates, that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), page may contain several posting
+ * tuples with the same key.
+ * Also page can contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
  * Pivot tuple format:
  *
  *  t_tid | t_info | key values | [heap TID]
@@ -281,23 +312,153 @@ typedef struct BTMetaPageData
  * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
  * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
  * number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinct posting tuples from pivot tuples.
  *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_N_POSTING_OFFSET_MASK	0x0FFF
 #define BT_HEAP_TID_ATTR			0x1000
+#define BT_IS_POSTING				0x2000
 
-/* Get/set downlink block number */
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger then
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+	((int) ((BLCKSZ - SizeOfPageHeaderData - \
+			3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+			(sizeof(ItemPointerData)))
+
+/*
+ * State used to representing a pending posting list during deduplication.
+ *
+ * Each entry represents a group of consecutive items from the page, starting
+ * from page offset number 'from', which is the offset number of the "base"
+ * tuple on the page undergoing deduplication.  'nitems' is the total number
+ * of items from the page that will be merged to make a new posting tuple.
+ * (Note: nitems means the number of line pointer items -- the tuples in
+ * question may already be posting list tuples or regular tuples.)
+ */
+typedef struct dedupInterval
+{
+	OffsetNumber from;
+	OffsetNumber nitems;
+} dedupInterval;
+
+/*
+ * Btree-private state needed to build posting tuples.  htids is an array of
+ * ItemPointers for pending posting list.
+ *
+ * Iterating over tuples during index build or applying deduplication to a
+ * single page, we remember a "base" tuple, then compare the next one with it.
+ * If tuples are equal, save their TIDs in the posting list.
+ */
+typedef struct BTDedupState
+{
+	/* Convenience variables concerning generic limits */
+	Size		maxitemsize;	/* BTMaxItemSize() limit for page */
+	Size		maxpostingsize;
+
+	/* Metadata about current pending posting list */
+	ItemPointer htids;			/* Heap TIDs in pending posting list */
+	int			nhtids;			/* # valid heap TIDs in nhtids array */
+	int			nitems;			/* See dedupInterval definition */
+	Size		alltupsize;		/* Includes line pointer overhead */
+
+	/* Metadata about based tuple of current pending posting list */
+	IndexTuple	base;			/* Use to form new posting list */
+	OffsetNumber base_off;		/* original page offset of base */
+	Size		basetupsize;	/* Excludes line pointer overhead */
+
+
+	/*
+	 * array with info about deduplicated items on the page.  Current array
+	 * size is n_intervals.
+	 *
+	 * It contains one entry for each group of consecutive items that were
+	 * deduplicated into a single posting tuple.
+	 *
+	 * This array is saved to xlog entry, which allows to replay deduplication
+	 * faster without actually comparing tuple's keys.
+	 */
+	int			n_intervals;
+	dedupInterval dedup_intervals[MaxIndexTuplesPerPage];
+} BTDedupState;
+
+/*
+ * N.B.: BTreeTupleIsPivot() should only be used in code that deals with
+ * heapkeyspace indexes specifically.  BTreeTupleIsPosting() works with all
+ * nbtree indexes, though.
+ */
+#define BTreeTupleIsPivot(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+	)
+#define BTreeTupleIsPosting(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+	)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleGetNPosting(itup)	\
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+	)
+#define BTreeTupleSetNPosting(itup, n) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+	} while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+	)
+#define BTreeSetPostingMeta(itup, nposting, off) \
+	do { \
+		BTreeTupleSetNPosting(itup, nposting); \
+		Assert(BTreeTupleIsPosting(itup)); \
+		ItemPointerSetBlockNumber(&((itup)->t_tid), (off)); \
+	} while(0)
+
+#define BTreeTupleGetPosting(itup) \
+	(ItemPointer) ((char*) (itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+	(BTreeTupleGetPosting(itup) + (n))
+
+/* Get/set downlink block number  */
 #define BTreeInnerTupleGetDownLink(itup) \
 	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
 #define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,40 +487,73 @@ typedef struct BTMetaPageData
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
 		( \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
 		IndexRelationGetNumberOfAttributes(rel) \
 	)
-#define BTreeTupleSetNAtts(itup, n) \
-	do { \
-		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
-	} while(0)
+
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int n)
+{
+	Assert(!BTreeTupleIsPosting(itup));
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	ItemPointerSetOffsetNumber(&itup->t_tid, n & BT_N_KEYS_OFFSET_MASK);
+}
 
 /*
- * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Get tiebreaker heap TID attribute, if any.  Works with both pivot and
+ * non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
  */
-#define BTreeTupleGetHeapTID(itup) \
-	( \
-	  (itup)->t_info & INDEX_ALT_TID_MASK && \
-	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
-	  ( \
-		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
-					   sizeof(ItemPointerData)) \
-	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
-	)
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+	if (BTreeTupleIsPivot(itup))
+	{
+		/* Pivot tuple heap TID representation? */
+		if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+			 BT_HEAP_TID_ATTR) != 0)
+			return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+								  sizeof(ItemPointerData));
+
+		/* Heap TID attribute was truncated */
+		return NULL;
+	}
+	else if (BTreeTupleIsPosting(itup))
+		return BTreeTupleGetPosting(itup);
+
+	return &(itup->t_tid);
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list tuple.  Works with
+ * non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxTID(IndexTuple itup)
+{
+	Assert(!BTreeTupleIsPivot(itup));
+
+	if (BTreeTupleIsPosting(itup))
+		return (ItemPointer) (BTreeTupleGetPosting(itup) +
+							  (BTreeTupleGetNPosting(itup) - 1));
+
+	return &(itup->t_tid);
+}
+
 /*
  * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * representation
  */
 #define BTreeTupleSetAltHeapTID(itup) \
 	do { \
-		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(BTreeTupleIsPivot(itup)); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
 								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
 	} while(0)
@@ -499,6 +693,13 @@ typedef struct BTInsertStateData
 	/* Buffer containing leaf page we're likely to insert itup on */
 	Buffer		buf;
 
+	/*
+	 * if _bt_binsrch_insert() found the location inside existing posting
+	 * list, save the position inside the list.  This will be -1 in rare cases
+	 * where the overlapping posting list is LP_DEAD.
+	 */
+	int			in_posting_offset;
+
 	/*
 	 * Cache of bounds within the current buffer.  Only used for insertions
 	 * where _bt_check_unique is called.  See _bt_binsrch_insert and
@@ -534,7 +735,9 @@ typedef BTInsertStateData *BTInsertState;
  * If we are doing an index-only scan, we save the entire IndexTuple for each
  * matched item, otherwise only its heap TID and offset.  The IndexTuples go
  * into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array.  Posting list tuples store a version of the
+ * tuple that does not include the posting list, allowing the same key to be
+ * returned for each logical tuple associated with the posting list.
  */
 
 typedef struct BTScanPosItem	/* what we remember about each match */
@@ -563,9 +766,13 @@ typedef struct BTScanPosData
 
 	/*
 	 * If we are doing an index-only scan, nextTupleOffset is the first free
-	 * location in the associated tuple storage workspace.
+	 * location in the associated tuple storage workspace.  Posting list
+	 * tuples need postingTupleOffset to store the current location of the
+	 * tuple that is returned multiple times (once per heap TID in posting
+	 * list).
 	 */
 	int			nextTupleOffset;
+	int			postingTupleOffset;
 
 	/*
 	 * The items array is always ordered in index order (ie, increasing
@@ -578,7 +785,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxPostingIndexTuplesPerPage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -730,8 +937,14 @@ extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
  */
 extern bool _bt_doinsert(Relation rel, IndexTuple itup,
 						 IndexUniqueCheck checkUnique, Relation heapRel);
+extern IndexTuple _bt_posting_split(IndexTuple newitem, IndexTuple oposting,
+									OffsetNumber in_posting_offset);
 extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
 extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
+extern bool _bt_dedup_save_htid(BTDedupState *state, IndexTuple itup);
+extern void _bt_dedup_start_pending(BTDedupState *state, IndexTuple base,
+									OffsetNumber base_off);
+extern Size _bt_dedup_finish_pending(Page page, BTDedupState *state);
 
 /*
  * prototypes for functions in nbtsplitloc.c
@@ -762,6 +975,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems, Relation heapRel);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems,
+								OffsetNumber *updateitemnos,
+								IndexTuple *updated, int nupdateable,
 								BlockNumber lastBlockVacuumed);
 extern int	_bt_pagedel(Relation rel, Buffer buf);
 
@@ -812,6 +1027,8 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointer htids,
+										int nhtids);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 91b9ee00cf..761073ada5 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP_PAGE	0x50	/* deduplicate tuples on leaf page */
+/* 0x60 is unused */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -61,16 +62,21 @@ typedef struct xl_btree_metadata
  * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
  * Note that INSERT_META implies it's not a leaf page.
  *
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page (data contains the inserted tuple);
+ *				 if in_posting_offset is set, this started out as an
+ *				 insertion into an existing posting tuple at the
+ *				 offset before offnum (i.e. it's a posting list split).
+ *				 (REDO will have to update split posting list, too.)
  * Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
  * Backup Blk 2: xl_btree_metadata, if INSERT_META
  */
 typedef struct xl_btree_insert
 {
 	OffsetNumber offnum;
+	OffsetNumber in_posting_offset;
 } xl_btree_insert;
 
-#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, in_posting_offset) + sizeof(OffsetNumber))
 
 /*
  * On insert with split, we save all the items going into the right sibling
@@ -95,6 +101,13 @@ typedef struct xl_btree_insert
  * An IndexTuple representing the high key of the left page must follow with
  * either variant.
  *
+ * The newitem is actually an "original" newitem when a posting list split
+ * occurs that happens to result in a page split.  REDO recognizes this case
+ * when in_posting_offset is set, and must use the posting offset to do an
+ * in-place update of the existing posting list that was actually split, and
+ * change the newitem to the "final" newitem.  This corresponds to the
+ * xl_btree_insert in_posting_offset-set case.
+ *
  * Backup Blk 1: new right page
  *
  * The right page's data portion contains the right page's tuples in the form
@@ -112,9 +125,26 @@ typedef struct xl_btree_split
 	uint32		level;			/* tree level of page being split */
 	OffsetNumber firstright;	/* first item moved to right page */
 	OffsetNumber newitemoff;	/* new item's offset (useful for _L variant) */
+	OffsetNumber in_posting_offset; /* offset inside orig posting tuple  */
 } xl_btree_split;
 
-#define SizeOfBtreeSplit	(offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit	(offsetof(xl_btree_split, in_posting_offset) + sizeof(OffsetNumber))
+
+/*
+ * When page is deduplicated, consecutive groups of tuples with equal keys are
+ * merged together into posting list tuples.
+ *
+ * The WAL record represents the number of posting tuples that should be added
+ * to the page using n_intervals.  An array of dedupInterval structs follows.
+ */
+typedef struct xl_btree_dedup
+{
+	int			n_intervals;
+
+	/* TARGET DEDUP INTERVALS FOLLOW AT THE END */
+} xl_btree_dedup;
+
+#define SizeOfBtreeDedup 	(offsetof(xl_btree_dedup, n_intervals) + sizeof(int))
 
 /*
  * This is what we need to know about delete of individual leaf index tuples.
@@ -166,16 +196,27 @@ typedef struct xl_btree_reuse_page
  * block numbers aren't given.
  *
  * Note that the *last* WAL record in any vacuum of an index is allowed to
- * have a zero length array of offsets. Earlier records must have at least one.
+ * have a zero length array of target offsets (i.e. no deletes or updates).
+ * Earlier records must have at least one.
  */
 typedef struct xl_btree_vacuum
 {
 	BlockNumber lastBlockVacuumed;
 
-	/* TARGET OFFSET NUMBERS FOLLOW */
+	/*
+	 * This field helps us to find beginning of the updated versions of tuples
+	 * which follow array of offset numbers, needed when a posting list is
+	 * vacuumed without killing all of its logical tuples.
+	 */
+	uint32		nupdated;
+	uint32		ndeleted;
+
+	/* UPDATED TARGET OFFSET NUMBERS FOLLOW (if any) */
+	/* UPDATED TUPLES TO ADD BACK FOLLOW (if any) */
+	/* DELETED TARGET OFFSET NUMBERS FOLLOW (if any) */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(BlockNumber))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.
diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index ec47a228ae..71a03e39d3 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -212,3 +212,24 @@
    Memcheck:Cond
    fun:PyObject_Realloc
 }
+
+# Temporarily work around bug in datum_image_eq's handling of the cstring
+# (typLen == -2) case.  datumIsEqual() is not affected, but also doesn't handle
+# TOAST'ed values correctly.
+#
+# FIXME: Remove both suppressions when bug is fixed on master branch
+{
+   temporary_workaround_1
+   Memcheck:Addr1
+   fun:bcmp
+   fun:datum_image_eq
+   fun:_bt_keep_natts_fast
+}
+
+{
+   temporary_workaround_8
+   Memcheck:Addr8
+   fun:bcmp
+   fun:datum_image_eq
+   fun:_bt_keep_natts_fast
+}
-- 
2.17.1

#93

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Peter Geoghegan (#92)

1 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Wed, Sep 18, 2019 at 7:25 PM Peter Geoghegan <pg@bowt.ie> wrote:

I attach version 16. This revision merges your recent work on WAL
logging with my recent work on simplifying _bt_dedup_one_page(). See
my e-mail from earlier today for details.

I attach version 17. This version has changes that are focussed on
further polishing certain things, including fixing some minor bugs. It
seemed worth creating a new version for that. (I didn't get very far
with the space utilization stuff I talked about, so no changes there.)

Changes in v17:

* nbtsort.c now has a loop structure that closely matches
_bt_dedup_one_page() (I put this off in v16).

We now reuse most of the nbtinsert.c deduplication routines.

* Further simplification of btree_xlog_dedup() loop.

Recovery no longer relies on local variables to track the progress of
deduplication -- it uses dedup state (the state managed by
nbtinsert.c's dedup routines) instead. This is easier to follow.

* Reworked _bt_split() comments on posting list splits that coincide
with page splits.

* Fixed memory leaks in recovery code by creating a dedicated memory
context that gets reset regularly. The context is create in a new rmgr
"startup" callback I created for the B-Tree rmgr. We already do this
for both GIN and GiST.

More specifically, the REDO code calls MemoryContextReset() against
its dedicated memory context after every record is processed by REDO,
no matter what. The MemoryContextReset() call usually won't have to
actually free anything, but that's okay because the no-free case does
almost no work. I think that it makes sense to keep things as simple
as possible for memory management during recovery -- it's too easy for
a new memory leak to get introduced when a small change is made to the
nbtinsert.c routines later on.

* Optimize VACUUMing of posting lists: we now only allocate memory for
an array of still-live posting list items when the array will actually
be needed. It is only needed when there are tuples to remove from the
posting list, because only then do we need to create a replacement
posting list that lacks the heap TIDs that VACUUM needs to delete.

It seemed like a really good idea to not allocate any memory in the
common case where VACUUM doesn't need to change a posting list tuple
at all. ginVacuumItemPointers() has exactly the same optimization.

* Fixed an accounting bug in the output of VACCUM VERBOSE by changing
some code in nbtree.c.

The tuples_removed and num_index_tuples fields in
IndexBulkDeleteResult are reported as "index row versions" by VACUUM
VERBOSE. Everything but the index pages stat works at the level of
"index row versions", which should not be affected by the
deduplication patch. Of course, deduplication only changes the
physical representation of items in the index -- never the logical
contents of the index. This is what GIN does already.

Another infrastructure thing that the patch needs to handle to be committable:

We still haven't added an "off" switch to deduplication, which seems
necessary. I suppose that this should look like GIN's "fastupdate"
storage parameter. It's not obvious how to do this in a way that's
easy to work with, though. Maybe we could do something like copy GIN's
GinGetUseFastUpdate() macro, but the situation with nbtree is actually
quite different. There are two questions for nbtree when it comes to
deduplication within an inde: 1) Does the user want to use
deduplication, because that will help performance?, and 2) Is it
safe/possible to use deduplication at all?

I think that we should probably stash this information (deduplication
is both possible and safe) in the metapage. Maybe we can copy it over
to our insertion scankey, just like the "heapkeyspace" field -- that
information also comes from the metapage (it's based on the nbtree
version). The "heapkeyspace" field is a bit ugly, so maybe we
shouldn't go further by adding something similar, but I don't see any
great alternative right now.

--
Peter Geoghegan

Attachments:

v17-0001-Add-deduplication-to-nbtree.patchapplication/octet-stream; name=v17-0001-Add-deduplication-to-nbtree.patchDownload

From 2e0ae900205fa421efabf2854d27e0810c3adf61 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 29 Aug 2019 14:35:35 -0700
Subject: [PATCH v17 1/4] Add deduplication to nbtree.

---
 contrib/amcheck/verify_nbtree.c         | 164 +++++-
 src/backend/access/index/genam.c        |   4 +
 src/backend/access/nbtree/README        |  74 ++-
 src/backend/access/nbtree/nbtinsert.c   | 751 +++++++++++++++++++++++-
 src/backend/access/nbtree/nbtpage.c     | 148 ++++-
 src/backend/access/nbtree/nbtree.c      | 168 +++++-
 src/backend/access/nbtree/nbtsearch.c   | 242 +++++++-
 src/backend/access/nbtree/nbtsort.c     | 138 ++++-
 src/backend/access/nbtree/nbtsplitloc.c |  47 +-
 src/backend/access/nbtree/nbtutils.c    | 264 ++++++++-
 src/backend/access/nbtree/nbtxlog.c     | 268 ++++++++-
 src/backend/access/rmgrdesc/nbtdesc.c   |  26 +-
 src/include/access/nbtree.h             | 278 ++++++++-
 src/include/access/nbtxlog.h            |  68 ++-
 src/include/access/rmgrlist.h           |   2 +-
 src/tools/valgrind.supp                 |  21 +
 16 files changed, 2489 insertions(+), 174 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d678ed..d65e2a76eb 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -145,6 +145,7 @@ static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 									  bool tupleIsAlive, void *checkstate);
 static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 									 IndexTuple itup);
+static inline IndexTuple bt_posting_logical_tuple(IndexTuple itup, int n);
 static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 											   OffsetNumber offset);
@@ -419,12 +420,13 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		/*
 		 * Size Bloom filter based on estimated number of tuples in index,
 		 * while conservatively assuming that each block must contain at least
-		 * MaxIndexTuplesPerPage / 5 non-pivot tuples.  (Non-leaf pages cannot
-		 * contain non-pivot tuples.  That's okay because they generally make
-		 * up no more than about 1% of all pages in the index.)
+		 * MaxPostingIndexTuplesPerPage / 3 "logical" tuples.  heapallindexed
+		 * verification fingerprints posting list heap TIDs as plain non-pivot
+		 * tuples, complete with index keys.  This allows its heap scan to
+		 * behave as if posting lists do not exist.
 		 */
 		total_pages = RelationGetNumberOfBlocks(rel);
-		total_elems = Max(total_pages * (MaxIndexTuplesPerPage / 5),
+		total_elems = Max(total_pages * (MaxPostingIndexTuplesPerPage / 3),
 						  (int64) state->rel->rd_rel->reltuples);
 		/* Random seed relies on backend srandom() call to avoid repetition */
 		seed = random();
@@ -924,6 +926,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer	scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -994,29 +997,73 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
 			char	   *itid,
 					   *htid;
+			ItemPointer tid = BTreeTupleGetHeapTID(itup);
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+							ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("could not find tuple using search from root page in index \"%s\"",
 							RelationGetRelationName(state->rel)),
-					 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+					 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
 										itid, htid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is actually a posting list, make sure posting list TIDs
+		 * are in order.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer		current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+				current = BTreeTupleGetPostingN(itup, i);
+
+				if (ItemPointerCompare(current, &last) <= 0)
+				{
+					char	   *itid,
+							   *htid;
+
+					itid = psprintf("(%u,%u)", state->targetblock, offset);
+					htid = psprintf("(%u,%u)",
+									ItemPointerGetBlockNumberNoCheck(current),
+									ItemPointerGetOffsetNumberNoCheck(current));
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+												itid, htid,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1074,12 +1121,32 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				/* Fingerprint all elements as distinct "logical" tuples */
+				for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					IndexTuple	logtuple;
+
+					logtuple = bt_posting_logical_tuple(itup, i);
+					norm = bt_normalize_tuple(state, logtuple);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != logtuple)
+						pfree(norm);
+					pfree(logtuple);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
@@ -1087,7 +1154,8 @@ bt_target_page_check(BtreeCheckState *state)
 		 *
 		 * If there is a high key (if this is not the rightmost page on its
 		 * entire level), check that high key actually is upper bound on all
-		 * page items.
+		 * page items.  If this is a posting list tuple, we'll need to set
+		 * scantid to be highest TID in posting list.
 		 *
 		 * We prefer to check all items against high key rather than checking
 		 * just the last and trusting that the operator class obeys the
@@ -1127,6 +1195,9 @@ bt_target_page_check(BtreeCheckState *state)
 		 * tuple. (See also: "Notes About Data Representation" in the nbtree
 		 * README.)
 		 */
+		scantid = skey->scantid;
+		if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+			skey->scantid = BTreeTupleGetMaxTID(itup);
 		if (!P_RIGHTMOST(topaque) &&
 			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
 			  invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1221,7 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
+		skey->scantid = scantid;
 
 		/*
 		 * * Item order check *
@@ -1164,11 +1236,13 @@ bt_target_page_check(BtreeCheckState *state)
 					   *htid,
 					   *nitid,
 					   *nhtid;
+			ItemPointer tid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetHeapTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 			nitid = psprintf("(%u,%u)", state->targetblock,
 							 OffsetNumberNext(offset));
 
@@ -1177,9 +1251,11 @@ bt_target_page_check(BtreeCheckState *state)
 										  state->target,
 										  OffsetNumberNext(offset));
 			itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+			tid = BTreeTupleGetHeapTID(itup);
 			nhtid = psprintf("(%u,%u)",
-							 ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							 ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							 ItemPointerGetBlockNumberNoCheck(tid),
+							 ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1265,10 @@ bt_target_page_check(BtreeCheckState *state)
 										"higher index tid=%s (points to %s tid=%s) "
 										"page lsn=%X/%X.",
 										itid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										htid,
 										nitid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										nhtid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
@@ -1953,10 +2029,10 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
  * verification.  In particular, it won't try to normalize opclass-equal
  * datums with potentially distinct representations (e.g., btree/numeric_ops
  * index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though.  For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation.  Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
  */
 static IndexTuple
 bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -1969,6 +2045,9 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	IndexTuple	reformed;
 	int			i;
 
+	/* Caller should only pass "logical" non-pivot tuples here */
+	Assert(!BTreeTupleIsPosting(itup) && !BTreeTupleIsPivot(itup));
+
 	/* Easy case: It's immediately clear that tuple has no varlena datums */
 	if (!IndexTupleHasVarwidths(itup))
 		return itup;
@@ -2031,6 +2110,30 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	return reformed;
 }
 
+/*
+ * Produce palloc()'d "logical" tuple for nth posting list entry.
+ *
+ * In general, deduplication is not supposed to change the logical contents of
+ * an index.  Multiple logical index tuples are folded together into one
+ * physical posting list index tuple when convenient.
+ *
+ * heapallindexed verification must normalize-away this variation in
+ * representation by converting posting list tuples into two or more "logical"
+ * tuples.  Each logical tuple must be fingerprinted separately -- there must
+ * be one logical tuple for each corresponding Bloom filter probe during the
+ * heap scan.
+ *
+ * Note: Caller needs to call bt_normalize_tuple() with returned tuple.
+ */
+static inline IndexTuple
+bt_posting_logical_tuple(IndexTuple itup, int n)
+{
+	Assert(BTreeTupleIsPosting(itup));
+
+	/* Returns non-posting-list tuple */
+	return BTreeFormPostingTuple(itup, BTreeTupleGetPostingN(itup, n), 1);
+}
+
 /*
  * Search for itup in index, starting from fast root page.  itup must be a
  * non-pivot tuple.  This is only supported with heapkeyspace indexes, since
@@ -2087,6 +2190,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		insertstate.itup = itup;
 		insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 		insertstate.itup_key = key;
+		insertstate.postingoff = 0;
 		insertstate.bounds_valid = false;
 		insertstate.buf = lbuf;
 
@@ -2094,7 +2198,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		offnum = _bt_binsrch_insert(state->rel, &insertstate);
 		/* Compare first >= matching item on leaf page, if any */
 		page = BufferGetPage(lbuf);
+		/* Should match on first heap TID when tuple has a posting list */
 		if (offnum <= PageGetMaxOffsetNumber(page) &&
+			insertstate.postingoff <= 0 &&
 			_bt_compare(state->rel, key, page, offnum) == 0)
 			exists = true;
 		_bt_relbuf(state->rel, lbuf);
@@ -2560,14 +2666,18 @@ static inline ItemPointer
 BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
 							bool nonpivot)
 {
-	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	ItemPointer result;
 	BlockNumber targetblock = state->targetblock;
 
-	if (result == NULL && nonpivot)
+	/* Shouldn't be called with heapkeyspace index */
+	Assert(state->heapkeyspace);
+	if (BTreeTupleIsPivot(itup) == nonpivot)
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
 						targetblock, RelationGetRelationName(state->rel))));
 
+	result = BTreeTupleGetHeapTID(itup);
+
 	return result;
 }
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 2599b5d342..6e1dc596e1 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -276,6 +276,10 @@ BuildIndexValueDescription(Relation indexRelation,
 /*
  * Get the latestRemovedXid from the table entries pointed at by the index
  * tuples being deleted.
+ *
+ * Note: index access methods that don't consistently use the standard
+ * IndexTuple + heap TID item pointer representation will need to provide
+ * their own version of this function.
  */
 TransactionId
 index_compute_xid_horizon_for_tuples(Relation irel,
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6db203e75c..54cb9db49d 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
 like a hint bit for a heap tuple), but physically removing tuples requires
 exclusive lock.  In the current code we try to remove LP_DEAD tuples when
 we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already).  Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)
 
 This leaves the index in a state where it has no entry for a dead tuple
 that still exists in the heap.  This is not a problem for the current
@@ -710,6 +713,75 @@ the fallback strategy assumes that duplicates are mostly inserted in
 ascending heap TID order.  The page is split in a way that leaves the left
 half of the page mostly full, and the right half of the page mostly empty.
 
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits.  Deduplication alters
+the physical representation of tuples without changing the logical contents
+of the index, and without adding overhead to read queries.  Non-pivot
+tuples are folded together into a single physical tuple with a posting list
+(a simple array of heap TIDs with the standard item pointer format).
+Deduplication is always applied lazily, at the point where it would
+otherwise be necessary to perform a page split.  It occurs only when
+LP_DEAD items have been removed, as our last line of defense against
+splitting a leaf page.  We can set the LP_DEAD bit with posting list
+tuples, though only when all table tuples are known dead. (Bitmap scans
+cannot perform LP_DEAD bit setting, and are the common case with indexes
+that contain lots of duplicates, so this downside is considered
+acceptable.)
+
+Large groups of logical duplicates tend to appear together on the same leaf
+page due to the special duplicate logic used when choosing a split point.
+This facilitates lazy/dynamic deduplication.  Deduplication can reliably
+deduplicate a large localized group of duplicates before it can span
+multiple leaf pages.  Posting list tuples are subject to the same 1/3 of a
+page restriction as any other tuple.
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists.  A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple.  An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication.  Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple.  When the incoming
+tuple really does overlap with an existing posting list, a posting list
+split is performed.  Posting list splits work in a way that more or less
+preserves the illusion that all incoming tuples do not need to be merged
+with any existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list.  The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list.  The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list.  We make
+space in the posting list by removing the heap TID that became the new
+item.  The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists.  Also, overall
+space utilization is improved by keeping existing posting lists large.
+
+The representation of posting lists is identical to the posting lists used
+by GIN, so it would be straightforward to apply GIN's varbyte encoding
+compression scheme to individual posting lists.  Posting list compression
+would break the assumptions made by posting list splits about page space
+accounting, though, so it's not clear how compression could be integrated
+with nbtree.  Besides, posting list compression does not offer a compelling
+trade-off for nbtree, since in general nbtree is optimized for consistent
+performance with many concurrent readers and writers.  A major goal of
+nbtree's lazy approach to deduplication is to limit the performance impact
+of deduplication with random updates.  Even concurrent append-only inserts
+of the same key value will tend to have inserts of individual index tuples
+in an order that doesn't quite match heap TID order.  In general, delaying
+deduplication avoids many unnecessary posting list splits, and minimizes
+page level fragmentation.
+
 Notes About Data Representation
 -------------------------------
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b84bf1c3df..eb9655bb78 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,21 +47,26 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   BTStack stack,
 						   IndexTuple itup,
 						   OffsetNumber newitemoff,
+						   int postingoff,
 						   bool split_only_page);
 static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 						Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
-						IndexTuple newitem);
+						IndexTuple newitem, IndexTuple orignewitem,
+						IndexTuple nposting, OffsetNumber postingoff);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 							  BTStack stack, bool is_root, bool is_only);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 						 OffsetNumber itup_off);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+							   Size newitemsz);
 
 /*
  *	_bt_doinsert() -- Handle insertion of a single index tuple in the tree.
  *
  *		This routine is called by the public interface routine, btinsert.
- *		By here, itup is filled in, including the TID.
+ *		By here, itup is filled in, including the TID.  Caller should be
+ *		prepared for us to scribble on 'itup'.
  *
  *		If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
  *		will allow duplicates.  Otherwise (UNIQUE_CHECK_YES or
@@ -123,6 +128,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	/* PageAddItem will MAXALIGN(), but be consistent */
 	insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 	insertstate.itup_key = itup_key;
+	insertstate.postingoff = 0;
 	insertstate.bounds_valid = false;
 	insertstate.buf = InvalidBuffer;
 
@@ -300,7 +306,7 @@ top:
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
 		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
-					   itup, newitemoff, false);
+					   itup, newitemoff, insertstate.postingoff, false);
 	}
 	else
 	{
@@ -435,6 +441,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 
 				/* okay, we gotta fetch the heap tuple ... */
 				curitup = (IndexTuple) PageGetItem(page, curitemid);
+				Assert(!BTreeTupleIsPosting(curitup));
 				htid = curitup->t_tid;
 
 				/*
@@ -689,6 +696,7 @@ _bt_findinsertloc(Relation rel,
 	BTScanInsert itup_key = insertstate->itup_key;
 	Page		page = BufferGetPage(insertstate->buf);
 	BTPageOpaque lpageop;
+	OffsetNumber location;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -751,13 +759,23 @@ _bt_findinsertloc(Relation rel,
 
 		/*
 		 * If the target page is full, see if we can obtain enough space by
-		 * erasing LP_DEAD items
+		 * erasing LP_DEAD items.  If that doesn't work out, and if the index
+		 * isn't a unique index, try deduplication.
 		 */
-		if (PageGetFreeSpace(page) < insertstate->itemsz &&
-			P_HAS_GARBAGE(lpageop))
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
 		{
-			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
-			insertstate->bounds_valid = false;
+			if (P_HAS_GARBAGE(lpageop))
+			{
+				_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+				insertstate->bounds_valid = false;
+			}
+
+			if (!checkingunique && PageGetFreeSpace(page) < insertstate->itemsz)
+			{
+				_bt_dedup_one_page(rel, insertstate->buf, heapRel,
+								   insertstate->itemsz);
+				insertstate->bounds_valid = false;	/* paranoia */
+			}
 		}
 	}
 	else
@@ -839,7 +857,31 @@ _bt_findinsertloc(Relation rel,
 	Assert(P_RIGHTMOST(lpageop) ||
 		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
 
-	return _bt_binsrch_insert(rel, insertstate);
+	location = _bt_binsrch_insert(rel, insertstate);
+
+	/*
+	 * Insertion is not prepared for the case where an LP_DEAD posting list
+	 * tuple must be split.  In the unlikely event that this happens, call
+	 * _bt_dedup_one_page() to force it to kill all LP_DEAD items.
+	 */
+	if (unlikely(insertstate->postingoff == -1))
+	{
+		_bt_dedup_one_page(rel, insertstate->buf, heapRel, 0);
+		Assert(!P_HAS_GARBAGE(lpageop));
+
+		/* Must reset insertstate ahead of new _bt_binsrch_insert() call */
+		insertstate->bounds_valid = false;
+		insertstate->postingoff = 0;
+		location = _bt_binsrch_insert(rel, insertstate);
+
+		/*
+		 * Might still have to split some other posting list now, but that
+		 * should never be LP_DEAD
+		 */
+		Assert(insertstate->postingoff >= 0);
+	}
+
+	return location;
 }
 
 /*
@@ -900,15 +942,81 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
 	insertstate->bounds_valid = false;
 }
 
+/*
+ * Form a new posting list during a posting split.
+ *
+ * If caller determines that its new tuple 'newitem' is a duplicate with a
+ * heap TID that falls inside the range of an existing posting list tuple
+ * 'oposting', it must generate a new posting tuple to replace the original.
+ * The new posting list is guaranteed to be the same size as the original.
+ * Caller must also change newitem to have the heap TID of the rightmost TID
+ * in the original posting list.  Both steps are always handled by calling
+ * here.
+ *
+ * Returns new posting list palloc()'d in caller's context.  Also modifies
+ * caller's newitem to contain final/effective heap TID, which is what caller
+ * actually inserts on the page.
+ *
+ * Exported for use by recovery.  Note that recovery path must recreate the
+ * same version of newitem that is passed here on the primary, even though
+ * that differs from the final newitem actually added to the page.  This
+ * optimization avoids explicit WAL-logging of entire posting lists, which
+ * tend to be rather large.
+ */
+IndexTuple
+_bt_posting_split(IndexTuple newitem, IndexTuple oposting,
+				  OffsetNumber postingoff)
+{
+	int			nhtids;
+	char	   *replacepos;
+	char	   *rightpos;
+	Size		nbytes;
+	IndexTuple	nposting;
+
+	Assert(BTreeTupleIsPosting(oposting));
+	nhtids = BTreeTupleGetNPosting(oposting);
+	Assert(postingoff < nhtids);
+
+	nposting = CopyIndexTuple(oposting);
+	replacepos = (char *) BTreeTupleGetPostingN(nposting, postingoff);
+	rightpos = replacepos + sizeof(ItemPointerData);
+	nbytes = (nhtids - postingoff - 1) * sizeof(ItemPointerData);
+
+	/*
+	 * Move item pointers in posting list to make a gap for the new item's
+	 * heap TID (shift TIDs one place to the right, losing original rightmost
+	 * TID).
+	 */
+	memmove(rightpos, replacepos, nbytes);
+
+	/*
+	 * Fill the gap with the TID of the new item.
+	 */
+	ItemPointerCopy(&newitem->t_tid, (ItemPointer) replacepos);
+
+	/*
+	 * Copy original (not new original) posting list's last TID into new item
+	 */
+	ItemPointerCopy(BTreeTupleGetPostingN(oposting, nhtids - 1),
+					&newitem->t_tid);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxTID(nposting),
+							  BTreeTupleGetHeapTID(newitem)) < 0);
+	Assert(BTreeTupleGetNPosting(nposting) == BTreeTupleGetNPosting(oposting));
+
+	return nposting;
+}
+
 /*----------
  *	_bt_insertonpg() -- Insert a tuple on a particular page in the index.
  *
  *		This recursive procedure does the following things:
  *
+ *			+  if necessary, splits an existing posting list on page.
+ *			   This is only needed when 'postingoff' is non-zero.
  *			+  if necessary, splits the target page, using 'itup_key' for
  *			   suffix truncation on leaf pages (caller passes NULL for
  *			   non-leaf pages).
- *			+  inserts the tuple.
+ *			+  inserts the new tuple (could be from split posting list).
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
  *			   right using information stored in the parent stack).
@@ -918,7 +1026,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		On entry, we must have the correct buffer in which to do the
  *		insertion, and the buffer must be pinned and write-locked.  On return,
- *		we will have dropped both the pin and the lock on the buffer.
+ *		we will have dropped both the pin and the lock on the buffer.  Caller
+ *		should be prepared for us to scribble on 'itup'.
  *
  *		This routine only performs retail tuple insertions.  'itup' should
  *		always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +1045,15 @@ _bt_insertonpg(Relation rel,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
+			   int postingoff,
 			   bool split_only_page)
 {
 	Page		page;
 	BTPageOpaque lpageop;
 	Size		itemsz;
+	IndexTuple	oposting;
+	IndexTuple	origitup = NULL;
+	IndexTuple	nposting = NULL;
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1067,8 @@ _bt_insertonpg(Relation rel,
 	Assert(P_ISLEAF(lpageop) ||
 		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
+	/* retail insertions of posting list tuples are disallowed */
+	Assert(!BTreeTupleIsPosting(itup));
 
 	/* The caller should've finished any incomplete splits already. */
 	if (P_INCOMPLETE_SPLIT(lpageop))
@@ -964,6 +1079,46 @@ _bt_insertonpg(Relation rel,
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
+	/*
+	 * Do we need to split an existing posting list item?
+	 */
+	if (postingoff != 0)
+	{
+		ItemId		itemid = PageGetItemId(page, newitemoff);
+
+		/*
+		 * The new tuple is a duplicate with a heap TID that falls inside the
+		 * range of an existing posting list tuple, so split posting list.
+		 *
+		 * Posting list splits always replace some existing TID in the posting
+		 * list with the new item's heap TID (based on a posting list offset
+		 * from caller) by removing rightmost heap TID from posting list.  The
+		 * new item's heap TID is swapped with that rightmost heap TID, almost
+		 * as if the tuple inserted never overlapped with a posting list in
+		 * the first place.  This allows the insertion and page split code to
+		 * have minimal special case handling of posting lists.
+		 *
+		 * The only extra handling required is to overwrite the original
+		 * posting list with nposting, which is guaranteed to be the same size
+		 * as the original, keeping the page space accounting simple.  This
+		 * takes place in either the page insert or page split critical
+		 * section.
+		 */
+		Assert(P_ISLEAF(lpageop));
+		Assert(!ItemIdIsDead(itemid));
+		Assert(postingoff > 0);
+		oposting = (IndexTuple) PageGetItem(page, itemid);
+
+		/* save a copy of itup with unchanged TID to write it into xlog record */
+		origitup = CopyIndexTuple(itup);
+		nposting = _bt_posting_split(itup, oposting, postingoff);
+
+		Assert(BTreeTupleGetNPosting(nposting) ==
+			   BTreeTupleGetNPosting(oposting));
+		/* Alter new item offset, since effective new item changed */
+		newitemoff = OffsetNumberNext(newitemoff);
+	}
+
 	/*
 	 * Do we need to split the page to fit the item on it?
 	 *
@@ -996,7 +1151,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+						 origitup, nposting, postingoff);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1075,6 +1231,18 @@ _bt_insertonpg(Relation rel,
 			elog(PANIC, "failed to add new item to block %u in index \"%s\"",
 				 itup_blkno, RelationGetRelationName(rel));
 
+		if (nposting)
+		{
+			/*
+			 * Posting list split requires an in-place update of the existing
+			 * posting list
+			 */
+			Assert(P_ISLEAF(lpageop));
+			Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+				   MAXALIGN(IndexTupleSize(nposting)));
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+		}
+
 		MarkBufferDirty(buf);
 
 		if (BufferIsValid(metabuf))
@@ -1116,6 +1284,7 @@ _bt_insertonpg(Relation rel,
 			XLogRecPtr	recptr;
 
 			xlrec.offnum = itup_off;
+			xlrec.postingoff = postingoff;
 
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
@@ -1152,7 +1321,19 @@ _bt_insertonpg(Relation rel,
 			}
 
 			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
-			XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+
+			/*
+			 * We always write newitem to the page, but when there is an
+			 * original newitem due to a posting list split then we log the
+			 * original item instead.  REDO routine must reconstruct the final
+			 * newitem at the same time it reconstructs nposting.
+			 */
+			if (postingoff == 0)
+				XLogRegisterBufData(0, (char *) itup,
+									IndexTupleSize(itup));
+			else
+				XLogRegisterBufData(0, (char *) origitup,
+									IndexTupleSize(origitup));
 
 			recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
@@ -1194,6 +1375,13 @@ _bt_insertonpg(Relation rel,
 			_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
 			RelationSetTargetBlock(rel, cachedBlock);
 	}
+
+	/* be tidy */
+	if (postingoff != 0)
+	{
+		pfree(nposting);
+		pfree(origitup);
+	}
 }
 
 /*
@@ -1209,12 +1397,25 @@ _bt_insertonpg(Relation rel,
  *		This function will clear the INCOMPLETE_SPLIT flag on it, and
  *		release the buffer.
  *
+ *		orignewitem, nposting, and postingoff are needed when an insert of
+ *		orignewitem results in both a posting list split and a page split.
+ *		newitem and nposting are replacements for orignewitem and the
+ *		existing posting list on the page respectively.  These extra
+ *		posting list split details are used here in the same way as they
+ *		are used in the more common case where a posting list split does
+ *		not coincide with a page split.  We need to deal with posting list
+ *		splits directly in order to ensure that everything that follows
+ *		from the insert of orignewitem is handled as a single atomic
+ *		operation (though caller's insert of a new pivot/downlink into
+ *		parent page will still be a separate operation).
+ *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
  */
 static Buffer
 _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+		  IndexTuple orignewitem, IndexTuple nposting, OffsetNumber postingoff)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1236,12 +1437,20 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	OffsetNumber firstright;
 	OffsetNumber maxoff;
 	OffsetNumber i;
+	OffsetNumber replacepostingoff = InvalidOffsetNumber;
 	bool		newitemonleft,
 				isleaf;
 	IndexTuple	lefthikey;
 	int			indnatts = IndexRelationGetNumberOfAttributes(rel);
 	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
+	/*
+	 * Determine offset number of existing posting list on page when a split
+	 * of a posting list needs to take place as the page is split
+	 */
+	if (nposting != NULL)
+		replacepostingoff = OffsetNumberPrev(newitemoff);
+
 	/*
 	 * origpage is the original page to be split.  leftpage is a temporary
 	 * buffer that receives the left-sibling data, which will be copied back
@@ -1273,6 +1482,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	 * newitemoff == firstright.  In all other cases it's clear which side of
 	 * the split every tuple goes on from context.  newitemonleft is usually
 	 * (but not always) redundant information.
+	 *
+	 * Note: In theory, the split point choice logic should operate against a
+	 * version of the page that already replaced the posting list at offset
+	 * replacepostingoff with nposting where applicable.  We don't bother with
+	 * that, though.  Both versions of the posting list must be the same size,
+	 * and both will have the same base tuple key values, so split point
+	 * choice is never affected.
 	 */
 	firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
 								  newitem, &newitemonleft);
@@ -1340,6 +1556,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, firstright);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		/* Behave as if origpage posting list has already been swapped */
+		if (firstright == replacepostingoff)
+			item = nposting;
 	}
 
 	/*
@@ -1373,6 +1592,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
 			itemid = PageGetItemId(origpage, lastleftoff);
 			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+			/* Behave as if origpage posting list has already been swapped */
+			if (lastleftoff == replacepostingoff)
+				lastleft = nposting;
 		}
 
 		Assert(lastleft != item);
@@ -1480,8 +1702,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 
+		/*
+		 * did caller pass new replacement posting list tuple due to posting
+		 * list split?
+		 */
+		if (i == replacepostingoff)
+		{
+			/*
+			 * swap origpage posting list with post-posting-list-split version
+			 * from caller
+			 */
+			Assert(isleaf);
+			Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+			item = nposting;
+		}
+
 		/* does new item belong before this one? */
-		if (i == newitemoff)
+		else if (i == newitemoff)
 		{
 			if (newitemonleft)
 			{
@@ -1650,8 +1887,12 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		XLogRecPtr	recptr;
 
 		xlrec.level = ropaque->btpo.level;
+		/* See comments below on newitem, orignewitem, and posting lists */
 		xlrec.firstright = firstright;
 		xlrec.newitemoff = newitemoff;
+		xlrec.postingoff = InvalidOffsetNumber;
+		if (replacepostingoff < firstright)
+			xlrec.postingoff = postingoff;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1670,11 +1911,46 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		 * because it's included with all the other items on the right page.)
 		 * Show the new item as belonging to the left page buffer, so that it
 		 * is not stored if XLogInsert decides it needs a full-page image of
-		 * the left page.  We store the offset anyway, though, to support
-		 * archive compression of these records.
+		 * the left page.  We always store newitemoff in record, though.
+		 *
+		 * The details are often slightly different for page splits that
+		 * coincide with a posting list split.  If both the replacement
+		 * posting list and newitem go on the right page, then we don't need
+		 * to log anything extra, just like the simple !newitemonleft
+		 * no-posting-split case (postingoff isn't set in the WAL record, so
+		 * recovery can't even tell the difference).  Otherwise, we set
+		 * postingoff and log orignewitem instead of newitem, despite having
+		 * actually inserted newitem.  Recovery must reconstruct nposting and
+		 * newitem by repeating the actions of our caller (i.e. by passing
+		 * original posting list and orignewitem to _bt_posting_split()).
+		 *
+		 * Note: It's possible that our page split point is the point that
+		 * makes the posting list lastleft and newitem firstright.  This is
+		 * the only case where we log orignewitem despite newitem going on the
+		 * right page.  If XLogInsert decides that it can omit orignewitem due
+		 * to logging a full-page image of the left page, everything still
+		 * works out, since recovery only needs to log orignewitem for items
+		 * on the left page (just like the regular newitem-logged case).
 		 */
-		if (newitemonleft)
-			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+		if (newitemonleft || xlrec.postingoff != InvalidOffsetNumber)
+		{
+			if (xlrec.postingoff == InvalidOffsetNumber)
+			{
+				/* Must WAL-log newitem, since it's on left page */
+				Assert(newitemonleft);
+				Assert(orignewitem == NULL && nposting == NULL);
+				XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+			}
+			else
+			{
+				/* Must WAL-log orignewitem following posting list split */
+				Assert(newitemonleft || firstright == newitemoff);
+				Assert(ItemPointerCompare(&orignewitem->t_tid,
+										  &newitem->t_tid) < 0);
+				XLogRegisterBufData(0, (char *) orignewitem,
+									MAXALIGN(IndexTupleSize(orignewitem)));
+			}
+		}
 
 		/* Log the left page's new high key */
 		itemid = PageGetItemId(origpage, P_HIKEY);
@@ -1834,7 +2110,7 @@ _bt_insert_parent(Relation rel,
 
 		/* Recursively insert into the parent */
 		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
-					   new_item, stack->bts_offset + 1,
+					   new_item, stack->bts_offset + 1, 0,
 					   is_only);
 
 		/* be tidy */
@@ -2304,6 +2580,439 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * Note: if we didn't find any LP_DEAD items, then the page's
 	 * BTP_HAS_GARBAGE hint bit is falsely set.  We do not bother expending a
 	 * separate write to clear it, however.  We will clear it when we split
-	 * the page.
+	 * the page (or when deduplication runs).
 	 */
 }
+
+/*
+ * Try to deduplicate items to free some space.  If we don't proceed with
+ * deduplication, buffer will contain old state of the page.
+ *
+ * 'itemsz' is the size of the inserter caller's incoming/new tuple, not
+ * including line pointer overhead.  This is the amount of space we'll need to
+ * free in order to let caller avoid splitting the page.
+ *
+ * This function should be called after LP_DEAD items were removed by
+ * _bt_vacuum_one_page() to prevent a page split.  (It's possible that we'll
+ * have to kill additional LP_DEAD items, but that should be rare.)
+ */
+static void
+_bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+				   Size newitemsz)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buffer);
+	Page		newpage;
+	BTPageOpaque oopaque,
+				nopaque;
+	bool		deduplicate;
+	BTDedupState *state = NULL;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+	OffsetNumber deletable[MaxIndexTuplesPerPage];
+	int			ndeletable = 0;
+	Size		pagesaving = 0;
+
+	/*
+	 * Don't use deduplication for indexes with INCLUDEd columns and unique
+	 * indexes
+	 */
+	deduplicate = (IndexRelationGetNumberOfKeyAttributes(rel) ==
+				   IndexRelationGetNumberOfAttributes(rel) &&
+				   !rel->rd_index->indisunique);
+	if (!deduplicate)
+		return;
+
+	oopaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	/* init deduplication state needed to build posting tuples */
+	state = (BTDedupState *) palloc(sizeof(BTDedupState));
+	state->deduplicate = true;
+
+	state->maxitemsize = BTMaxItemSize(page);
+	/* Metadata about current pending posting list */
+	state->htids = NULL;
+	state->nhtids = 0;
+	state->nitems = 0;
+	state->alltupsize = 0;
+	/* Metadata about based tuple of current pending posting list */
+	state->base = NULL;
+	state->baseoff = InvalidOffsetNumber;
+	state->basetupsize = 0;
+	/* Finally, nintervals should be initialized to zero */
+	state->nintervals = 0;
+
+	minoff = P_FIRSTDATAKEY(oopaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/*
+	 * Delete dead tuples if any. We cannot simply skip them in the cycle
+	 * below, because it's necessary to generate special Xlog record
+	 * containing such tuples to compute latestRemovedXid on a standby server
+	 * later.
+	 *
+	 * This should not affect performance, since it only can happen in a rare
+	 * situation when BTP_HAS_GARBAGE flag was not set and _bt_vacuum_one_page
+	 * was not called, or _bt_vacuum_one_page didn't remove all dead items.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+
+		if (ItemIdIsDead(itemid))
+			deletable[ndeletable++] = offnum;
+	}
+
+	if (ndeletable > 0)
+	{
+		/*
+		 * Skip duplication in rare cases where there were LP_DEAD items
+		 * encountered here when that frees sufficient space for caller to
+		 * avoid a page split
+		 */
+		_bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+		if (PageGetFreeSpace(page) >= newitemsz)
+		{
+			pfree(state);
+			return;
+		}
+
+		/* Continue with deduplication */
+		minoff = P_FIRSTDATAKEY(oopaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+	}
+
+	/*
+	 * Scan over all items to see which ones can be deduplicated
+	 */
+	newpage = PageGetTempPageCopySpecial(page);
+	nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+	/*
+	 * Copy the original page's LSN into newpage, which will become the
+	 * updated version of the page.  We need this because XLogInsert will
+	 * examine the LSN and possibly dump it in a page image.
+	 */
+	PageSetLSN(newpage, PageGetLSN(page));
+
+	/* Make sure that new page won't have garbage flag set */
+	nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+	/* Copy High Key if any */
+	if (!P_RIGHTMOST(oopaque))
+	{
+		ItemId		hitemid = PageGetItemId(page, P_HIKEY);
+		Size		hitemsz = ItemIdGetLength(hitemid);
+		IndexTuple	hitem = (IndexTuple) PageGetItem(page, hitemid);
+
+		if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to add highkey during deduplication");
+	}
+
+	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+	newitemsz += sizeof(ItemIdData);
+	/* Conservatively size array */
+	state->htids = palloc(state->maxitemsize);
+
+	/*
+	 * Iterate over tuples on the page, try to deduplicate them into posting
+	 * lists and insert into new page.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(!ItemIdIsDead(itemid));
+
+		if (offnum == minoff)
+		{
+			/*
+			 * No previous/base tuple for first data item -- use first data
+			 * item as base tuple of first pending posting list
+			 */
+			_bt_dedup_start_pending(state, itup, offnum);
+		}
+		else if (state->deduplicate &&
+				 _bt_keep_natts_fast(rel, state->base, itup) > natts &&
+				 _bt_dedup_save_htid(state, itup))
+		{
+			/*
+			 * Tuple is equal to base tuple of pending posting list, and
+			 * merging itup into pending posting list won't exceed the
+			 * BTMaxItemSize() limit.  Heap TID(s) for itup have been saved in
+			 * state.  The next iteration will also end up here if it's
+			 * possible to merge the next tuple into the same pending posting
+			 * list.
+			 */
+		}
+		else
+		{
+			/*
+			 * Tuple is not equal to pending posting list tuple, or
+			 * BTMaxItemSize() limit was reached
+			 */
+			pagesaving += _bt_dedup_finish_pending(newpage, state);
+
+			/*
+			 * When we have deduplicated enough to avoid page split, don't
+			 * bother merging together existing tuples to create new posting
+			 * lists.
+			 *
+			 * Note: We deliberately add as many heap TIDs as possible to a
+			 * pending posting list by performing this check at this point
+			 * (just before a new pending posting lists is created).  It would
+			 * be possible to make the final new posting list for each
+			 * successful page deduplication operation as small as possible
+			 * while still avoiding a page split for caller.  We don't want to
+			 * repeatedly merge posting lists around the same range of heap
+			 * TIDs, though.
+			 *
+			 * (Besides, the total number of new posting lists created is the
+			 * cost that this check is supposed to minimize -- there is no
+			 * great reason to be concerned about the absolute number of
+			 * existing tuples that can be killed/replaced.)
+			 */
+#if 0
+			/* Actually, don't do that */
+			/* TODO: Make a final decision on this */
+			if (pagesaving >= newitemsz)
+				state->deduplicate = false;
+#endif
+
+			/* itup starts new pending posting list */
+			_bt_dedup_start_pending(state, itup, offnum);
+		}
+	}
+
+	/* Handle the last item */
+	pagesaving += _bt_dedup_finish_pending(newpage, state);
+
+	/*
+	 * If no items suitable for deduplication were found, newpage must be
+	 * exactly the same as the original page, so just return from function.
+	 */
+	if (state->nintervals == 0)
+	{
+		pfree(newpage);
+		pfree(state->htids);
+		pfree(state);
+		return;
+	}
+
+	START_CRIT_SECTION();
+
+	PageRestoreTempPage(newpage, page);
+	MarkBufferDirty(buffer);
+
+	/* Log deduplicated items */
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+		xl_btree_dedup xlrec_dedup;
+
+		xlrec_dedup.nintervals = state->nintervals;
+
+		XLogBeginInsert();
+		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+		XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+		Assert(state->nintervals > 0);
+		XLogRegisterData((char *) state->intervals,
+						 state->nintervals * sizeof(BTDedupInterval));
+
+		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP_PAGE);
+
+		PageSetLSN(page, recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	/* be tidy */
+	pfree(state->htids);
+	pfree(state);
+}
+
+/*
+ * Create a new pending posting list tuple based on caller's tuple.
+ *
+ * Every tuple processed by the deduplication routines either becomes the base
+ * tuple for a posting list, or gets its heap TID(s) accepted into a pending
+ * posting list.  A tuple that starts out as the base tuple for a posting list
+ * will only actually be rewritten within _bt_dedup_finish_pending() when
+ * there was at least one successful call to _bt_dedup_save_htid().
+ *
+ * Exported for use by nbtsort.c and recovery.
+ */
+void
+_bt_dedup_start_pending(BTDedupState *state, IndexTuple base,
+						OffsetNumber baseoff)
+{
+	Assert(state->nhtids == 0);
+	Assert(state->nitems == 0);
+
+	/*
+	 * Copy heap TIDs from new base tuple for new candidate posting list into
+	 * ipd array.  Assume that we'll eventually create a new posting tuple by
+	 * merging later tuples with this existing one, though we may not.
+	 */
+	if (!BTreeTupleIsPosting(base))
+	{
+		memcpy(state->htids, base, sizeof(ItemPointerData));
+		state->nhtids = 1;
+		/* Save size of tuple without any posting list */
+		state->basetupsize = IndexTupleSize(base);
+	}
+	else
+	{
+		int			nposting;
+
+		nposting = BTreeTupleGetNPosting(base);
+		memcpy(state->htids, BTreeTupleGetPosting(base),
+			   sizeof(ItemPointerData) * nposting);
+		state->nhtids = nposting;
+		/* Save size of tuple without any posting list */
+		state->basetupsize = BTreeTupleGetPostingOffset(base);
+	}
+
+	/*
+	 * Save new base tuple itself -- it'll be needed if we actually create a
+	 * new posting list from new pending posting list.
+	 *
+	 * Must maintain size of all tuples (including line pointer overhead) to
+	 * calculate space savings on page within _bt_dedup_finish_pending().
+	 * Also, save number of base tuple logical tuples so that we can save
+	 * cycles in the common case where an existing posting list can't or won't
+	 * be merged with other tuples on the page.
+	 */
+	state->nitems = 1;
+	state->base = base;
+	state->baseoff = baseoff;
+	state->alltupsize = MAXALIGN(IndexTupleSize(base)) + sizeof(ItemIdData);
+	/* Also save baseoff in pending state for interval */
+	state->intervals[state->nintervals].baseoff = state->baseoff;
+}
+
+/*
+ * Save itup heap TID(s) into pending posting list where possible.
+ *
+ * Returns bool indicating if the pending posting list managed by state has
+ * itup's heap TID(s) saved.  When this is false, enlarging the pending
+ * posting list by the required amount would exceed the maxitemsize limit, so
+ * caller must finish the pending posting list tuple.  (Generally itup becomes
+ * the base tuple of caller's new pending posting list).
+ *
+ * Exported for use by nbtsort.c and recovery.
+ */
+bool
+_bt_dedup_save_htid(BTDedupState *state, IndexTuple itup)
+{
+	int			nhtids;
+	ItemPointer htids;
+	Size		mergedtupsz;
+
+	if (!BTreeTupleIsPosting(itup))
+	{
+		nhtids = 1;
+		htids = &itup->t_tid;
+	}
+	else
+	{
+		nhtids = BTreeTupleGetNPosting(itup);
+		htids = BTreeTupleGetPosting(itup);
+	}
+
+	/*
+	 * Don't append (have caller finish pending posting list as-is) if
+	 * appending heap TID(s) from itup would put us over limit
+	 */
+	mergedtupsz = MAXALIGN(state->basetupsize +
+						   (state->nhtids + nhtids) *
+						   sizeof(ItemPointerData));
+
+	if (mergedtupsz > state->maxitemsize)
+		return false;
+
+	/*
+	 * Save heap TIDs to pending posting list tuple -- itup can be merged into
+	 * pending posting list
+	 */
+	state->nitems++;
+	memcpy(state->htids + state->nhtids, htids,
+		   sizeof(ItemPointerData) * nhtids);
+	state->nhtids += nhtids;
+	state->alltupsize += MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
+
+	return true;
+}
+
+/*
+ * Finalize pending posting list tuple, and add it to the page.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * Returns space saving from deduplicating to make a new posting list tuple.
+ * Note that this includes line pointer overhead.  This is zero in the case
+ * where no deduplication was possible.
+ *
+ * Exported for use by recovery.
+ */
+Size
+_bt_dedup_finish_pending(Page page, BTDedupState *state)
+{
+	IndexTuple	final;
+	Size		finalsz;
+	OffsetNumber finaloff;
+	Size		spacesaving;
+
+	Assert(state->nitems > 0);
+	Assert(state->nitems <= state->nhtids);
+	Assert(state->intervals[state->nintervals].baseoff == state->baseoff);
+
+	if (state->nitems == 1)
+	{
+		/* Use original, unchanged base tuple */
+		final = state->base;
+		spacesaving = 0;
+		finalsz = IndexTupleSize(final);
+
+		/* Do not increment nintervals -- skip WAL logging/replay */
+	}
+	else
+	{
+		/* Form a tuple with a posting list */
+		final = BTreeFormPostingTuple(state->base, state->htids,
+									  state->nhtids);
+		finalsz = IndexTupleSize(final);
+		spacesaving = state->alltupsize - (finalsz + sizeof(ItemIdData));
+		/* Must have saved some space */
+		Assert(spacesaving > 0 && spacesaving < BLCKSZ);
+
+		/* Save final number of items for posting list */
+		state->intervals[state->nintervals].nitems = state->nitems;
+
+		/* Advance to next candidate */
+		state->nintervals++;
+	}
+
+	finaloff = OffsetNumberNext(PageGetMaxOffsetNumber(page));
+	Assert(finalsz <= state->maxitemsize);
+	Assert(finalsz == MAXALIGN(IndexTupleSize(final)));
+	if (PageAddItem(page, (Item) final, finalsz, finaloff, false,
+					false) == InvalidOffsetNumber)
+		elog(ERROR, "deduplication failed to add tuple to page");
+
+	if (final != state->base)
+		pfree(final);
+
+	/* Reset state for next pending posting list */
+	state->nhtids = 0;
+	state->nitems = 0;
+	state->alltupsize = 0;
+
+	return spacesaving;
+}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 268f869a36..ecf75ef2c0 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -42,6 +43,11 @@ static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
 								   BlockNumber *target, BlockNumber *rightsib);
 static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
 							   TransactionId latestRemovedXid);
+static TransactionId _bt_compute_xid_horizon_for_tuples(Relation rel,
+														Relation heapRel,
+														Buffer buf,
+														OffsetNumber *itemnos,
+														int nitems);
 
 /*
  *	_bt_initmetapage() -- Fill a page buffer with a correct metapage image
@@ -983,14 +989,52 @@ _bt_page_recyclable(Page page)
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
 					OffsetNumber *itemnos, int nitems,
+					OffsetNumber *updateitemnos,
+					IndexTuple *updated, int nupdatable,
 					BlockNumber lastBlockVacuumed)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	Size		itemsz;
+	Size		updated_sz = 0;
+	char	   *updated_buf = NULL;
+
+	/* XLOG stuff, buffer for updateds */
+	if (nupdatable > 0 && RelationNeedsWAL(rel))
+	{
+		Size		offset = 0;
+
+		for (int i = 0; i < nupdatable; i++)
+			updated_sz += MAXALIGN(IndexTupleSize(updated[i]));
+
+		updated_buf = palloc(updated_sz);
+		for (int i = 0; i < nupdatable; i++)
+		{
+			itemsz = IndexTupleSize(updated[i]);
+			memcpy(updated_buf + offset, (char *) updated[i], itemsz);
+			offset += MAXALIGN(itemsz);
+		}
+		Assert(offset == updated_sz);
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
+	/* Handle posting tuples here */
+	for (int i = 0; i < nupdatable; i++)
+	{
+		/* At first, delete the old tuple. */
+		PageIndexTupleDelete(page, updateitemnos[i]);
+
+		itemsz = IndexTupleSize(updated[i]);
+		itemsz = MAXALIGN(itemsz);
+
+		/* Add tuple with updated ItemPointers to the page. */
+		if (PageAddItem(page, (Item) updated[i], itemsz, updateitemnos[i],
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to rewrite posting list item in index while doing vacuum");
+	}
+
 	/* Fix the page */
 	if (nitems > 0)
 		PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1064,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+		xlrec_vacuum.nupdated = nupdatable;
+		xlrec_vacuum.ndeleted = nitems;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1079,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		if (nitems > 0)
 			XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
 
+		/*
+		 * Here we should save offnums and updated tuples themselves. It's
+		 * important to restore them in correct order. At first, we must
+		 * handle updated tuples and only after that other deleted items.
+		 */
+		if (nupdatable > 0)
+		{
+			Assert(updated_buf != NULL);
+			XLogRegisterBufData(0, (char *) updateitemnos,
+								nupdatable * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, updated_buf, updated_sz);
+		}
+
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
 		PageSetLSN(page, recptr);
@@ -1041,6 +1100,91 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	END_CRIT_SECTION();
 }
 
+/*
+ * Get the latestRemovedXid from the table entries pointed at by the index
+ * tuples being deleted.
+ *
+ * This is a version of index_compute_xid_horizon_for_tuples() specialized to
+ * nbtree, which can handle posting lists.
+ */
+static TransactionId
+_bt_compute_xid_horizon_for_tuples(Relation rel, Relation heapRel,
+								   Buffer buf, OffsetNumber *itemnos,
+								   int nitems)
+{
+	ItemPointer htids;
+	TransactionId latestRemovedXid = InvalidTransactionId;
+	Page		page = BufferGetPage(buf);
+	int			arraynitems;
+	int			finalnitems;
+
+	/*
+	 * Initial size of array can fit everything when it turns out that are no
+	 * posting lists
+	 */
+	arraynitems = nitems;
+	htids = (ItemPointer) palloc(sizeof(ItemPointerData) * arraynitems);
+
+	finalnitems = 0;
+	/* identify what the index tuples about to be deleted point to */
+	for (int i = 0; i < nitems; i++)
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+
+		itemid = PageGetItemId(page, itemnos[i]);
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(ItemIdIsDead(itemid));
+
+		if (!BTreeTupleIsPosting(itup))
+		{
+			/* Make sure that we have space for additional heap TID */
+			if (finalnitems + 1 > arraynitems)
+			{
+				arraynitems = arraynitems * 2;
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			Assert(ItemPointerIsValid(&itup->t_tid));
+			ItemPointerCopy(&itup->t_tid, &htids[finalnitems]);
+			finalnitems++;
+		}
+		else
+		{
+			int			nposting = BTreeTupleGetNPosting(itup);
+
+			/* Make sure that we have space for additional heap TIDs */
+			if (finalnitems + nposting > arraynitems)
+			{
+				arraynitems = Max(arraynitems * 2, finalnitems + nposting);
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			for (int j = 0; j < nposting; j++)
+			{
+				ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+				Assert(ItemPointerIsValid(htid));
+				ItemPointerCopy(htid, &htids[finalnitems]);
+				finalnitems++;
+			}
+		}
+	}
+
+	Assert(finalnitems >= nitems);
+
+	/* determine the actual xid horizon */
+	latestRemovedXid =
+		table_compute_xid_horizon_for_tuples(heapRel, htids, finalnitems);
+
+	pfree(htids);
+
+	return latestRemovedXid;
+}
+
 /*
  * Delete item(s) from a btree page during single-page cleanup.
  *
@@ -1067,8 +1211,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 
 	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
 		latestRemovedXid =
-			index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
-												 itemnos, nitems);
+			_bt_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+											   itemnos, nitems);
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd5289ad..baea34ea74 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static ItemPointer btreevacuumposting(BTVacState *vstate, IndexTuple itup,
+									  int *nremaining);
 
 
 /*
@@ -263,8 +265,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 				 */
 				if (so->killedItems == NULL)
 					so->killedItems = (int *)
-						palloc(MaxIndexTuplesPerPage * sizeof(int));
-				if (so->numKilled < MaxIndexTuplesPerPage)
+						palloc(MaxPostingIndexTuplesPerPage * sizeof(int));
+				if (so->numKilled < MaxPostingIndexTuplesPerPage)
 					so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 			}
 
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 								 RBM_NORMAL, info->strategy);
 		LockBufferForCleanup(buf);
 		_bt_checkpage(rel, buf);
-		_bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+		_bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+							vstate.lastBlockVacuumed);
 		_bt_relbuf(rel, buf);
 	}
 
@@ -1188,8 +1191,17 @@ restart:
 	}
 	else if (P_ISLEAF(opaque))
 	{
+		/* Deletable item state */
 		OffsetNumber deletable[MaxOffsetNumber];
 		int			ndeletable;
+		int			nhtidsdead;
+		int			nhtidslive;
+
+		/* Updatable item state (for posting lists) */
+		IndexTuple	updated[MaxOffsetNumber];
+		OffsetNumber updatable[MaxOffsetNumber];
+		int			nupdatable;
+
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
@@ -1229,6 +1241,10 @@ restart:
 		 * callback function.
 		 */
 		ndeletable = 0;
+		nupdatable = 0;
+		/* Maintain stats counters for index tuple versions/heap TIDs */
+		nhtidsdead = 0;
+		nhtidslive = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
 		if (callback)
@@ -1238,11 +1254,9 @@ restart:
 				 offnum = OffsetNumberNext(offnum))
 			{
 				IndexTuple	itup;
-				ItemPointer htup;
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
 				/*
 				 * During Hot Standby we currently assume that
@@ -1265,8 +1279,71 @@ restart:
 				 * applies to *any* type of index that marks index tuples as
 				 * killed.
 				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Regular tuple, standard heap TID representation */
+					ItemPointer htid = &(itup->t_tid);
+
+					if (callback(htid, callback_state))
+					{
+						deletable[ndeletable++] = offnum;
+						nhtidsdead++;
+					}
+					else
+						nhtidslive++;
+				}
+				else
+				{
+					ItemPointer newhtids;
+					int			nremaining;
+
+					/*
+					 * Posting list tuple, a physical tuple that represents
+					 * two or more logical tuples, any of which could be an
+					 * index row version that must be removed
+					 */
+					newhtids = btreevacuumposting(vstate, itup, &nremaining);
+					if (newhtids == NULL)
+					{
+						/*
+						 * All TIDs/logical tuples from the posting tuple
+						 * remain, so no update or delete required
+						 */
+						Assert(nremaining == BTreeTupleGetNPosting(itup));
+					}
+					else if (nremaining > 0)
+					{
+						IndexTuple	updatedtuple;
+
+						/*
+						 * Form new tuple that contains only remaining TIDs.
+						 * Remember this tuple and the offset of the old tuple
+						 * for when we update it in place
+						 */
+						Assert(nremaining < BTreeTupleGetNPosting(itup));
+						updatedtuple = BTreeFormPostingTuple(itup, newhtids,
+															 nremaining);
+						updated[nupdatable] = updatedtuple;
+						updatable[nupdatable++] = offnum;
+						nhtidsdead += BTreeTupleGetNPosting(itup) - nremaining;
+						pfree(newhtids);
+					}
+					else
+					{
+						/*
+						 * All TIDs/logical tuples from the posting list must
+						 * be deleted.  We'll delete the physical tuple
+						 * completely.
+						 */
+						deletable[ndeletable++] = offnum;
+						nhtidsdead += BTreeTupleGetNPosting(itup);
+
+						/* Free empty array of live items */
+						pfree(newhtids);
+					}
+
+					nhtidslive += nremaining;
+				}
 			}
 		}
 
@@ -1274,7 +1351,7 @@ restart:
 		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
 		 * call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nupdatable > 0)
 		{
 			/*
 			 * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1290,7 +1367,8 @@ restart:
 			 * doesn't seem worth the amount of bookkeeping it'd take to avoid
 			 * that.
 			 */
-			_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+			_bt_delitems_vacuum(rel, buf, deletable, ndeletable, updatable,
+								updated, nupdatable,
 								vstate->lastBlockVacuumed);
 
 			/*
@@ -1300,7 +1378,7 @@ restart:
 			if (blkno > vstate->lastBlockVacuumed)
 				vstate->lastBlockVacuumed = blkno;
 
-			stats->tuples_removed += ndeletable;
+			stats->tuples_removed += nhtidsdead;
 			/* must recompute maxoff */
 			maxoff = PageGetMaxOffsetNumber(page);
 		}
@@ -1315,6 +1393,7 @@ restart:
 			 * We treat this like a hint-bit update because there's no need to
 			 * WAL-log it.
 			 */
+			Assert(nhtidsdead == 0);
 			if (vstate->cycleid != 0 &&
 				opaque->btpo_cycleid == vstate->cycleid)
 			{
@@ -1324,15 +1403,16 @@ restart:
 		}
 
 		/*
-		 * If it's now empty, try to delete; else count the live tuples. We
-		 * don't delete when recursing, though, to avoid putting entries into
+		 * If it's now empty, try to delete; else count the live tuples (live
+		 * heap TIDs in posting lists are counted as live tuples).  We don't
+		 * delete when recursing, though, to avoid putting entries into
 		 * freePages out-of-order (doesn't seem worth any extra code to handle
 		 * the case).
 		 */
 		if (minoff > maxoff)
 			delete_now = (blkno == orig_blkno);
 		else
-			stats->num_index_tuples += maxoff - minoff + 1;
+			stats->num_index_tuples += nhtidslive;
 	}
 
 	if (delete_now)
@@ -1375,6 +1455,68 @@ restart:
 	}
 }
 
+/*
+ * btreevacuumposting() -- determines which logical tuples must remain when
+ * VACUUMing a posting list tuple.
+ *
+ * Returns new palloc'd array of item pointers needed to build replacement
+ * posting list without the index row versions that are to be deleted.
+ *
+ * Note that returned array is NULL in the common case where there is nothing
+ * to delete in caller's posting list tuple.  The number of TIDs that should
+ * remain in the posting list tuple is set for caller in *nremaining.  This is
+ * also the size of the returned array (though only when array isn't just
+ * NULL).
+ */
+static ItemPointer
+btreevacuumposting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+	int			live = 0;
+	int			nitem = BTreeTupleGetNPosting(itup);
+	ItemPointer tmpitems = NULL,
+				items = BTreeTupleGetPosting(itup);
+
+	Assert(BTreeTupleIsPosting(itup));
+
+	/*
+	 * Check each tuple in the posting list.  Save live tuples into tmpitems,
+	 * though try to avoid memory allocation as an optimization.
+	 */
+	for (int i = 0; i < nitem; i++)
+	{
+		if (!vstate->callback(items + i, vstate->callback_state))
+		{
+			/*
+			 * Live heap TID.
+			 *
+			 * Only save live TID when we know that we're going to have to
+			 * kill at least one TID, and have already allocated memory.
+			 */
+			if (tmpitems)
+				tmpitems[live] = items[i];
+			live++;
+		}
+
+		/* Dead heap TID */
+		else if (tmpitems == NULL)
+		{
+			/*
+			 * Turns out we need to delete one or more dead heap TIDs, so
+			 * start maintaining an array of live TIDs for caller to
+			 * reconstruct smaller replacement posting list tuple
+			 */
+			tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+			/* Copy live heap TIDs from previous loop iterations */
+			if (live > 0)
+				memcpy(tmpitems, items, sizeof(ItemPointerData) * live);
+		}
+	}
+
+	*nremaining = live;
+	return tmpitems;
+}
+
 /*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8e512461a0..9022ee68ea 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int	_bt_binsrch_posting(BTScanInsert key, Page page,
+								OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static void _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+								  OffsetNumber offnum, ItemPointer heapTid,
+								  IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+									   OffsetNumber offnum,
+									   ItemPointer heapTid);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -434,7 +442,10 @@ _bt_binsrch(Relation rel,
  * low) makes bounds invalid.
  *
  * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
  */
 OffsetNumber
 _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +464,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 	Assert(P_ISLEAF(opaque));
 	Assert(!key->nextkey);
+	Assert(insertstate->postingoff == 0);
 
 	if (!insertstate->bounds_valid)
 	{
@@ -509,6 +521,16 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 			if (result != 0)
 				stricthigh = high;
 		}
+
+		/*
+		 * If tuple at offset located by binary search is a posting list whose
+		 * TID range overlaps with caller's scantid, perform posting list
+		 * binary search to set postingoff for caller.  Caller must split the
+		 * posting list when postingoff is set.  This should happen
+		 * infrequently.
+		 */
+		if (unlikely(result == 0 && key->scantid != NULL))
+			insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
 	}
 
 	/*
@@ -528,6 +550,68 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	return low;
 }
 
+/*----------
+ *	_bt_binsrch_posting() -- posting list binary search.
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+	IndexTuple	itup;
+	ItemId		itemid;
+	int			low,
+				high,
+				mid,
+				res;
+
+	/*
+	 * If this isn't a posting tuple, then the index must be corrupt (if it is
+	 * an ordinary non-pivot tuple then there must be an existing tuple with a
+	 * heap TID that equals inserter's new heap TID/scantid).  Defensively
+	 * check that tuple is a posting list tuple whose posting list range
+	 * includes caller's scantid.
+	 *
+	 * (This is also needed because contrib/amcheck's rootdescend option needs
+	 * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+	 */
+	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+	Assert(!key->nextkey);
+	Assert(key->scantid != NULL);
+	itemid = PageGetItemId(page, offnum);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+	if (!BTreeTupleIsPosting(itup))
+		return 0;
+
+	/*
+	 * In the unlikely event that posting list tuple has LP_DEAD bit set,
+	 * signal to caller that it should kill the item and restart its binary
+	 * search.
+	 */
+	if (ItemIdIsDead(itemid))
+		return -1;
+
+	/* "high" is past end of posting list for loop invariant */
+	low = 0;
+	high = BTreeTupleGetNPosting(itup);
+	Assert(high >= 2);
+
+	while (high > low)
+	{
+		mid = low + ((high - low) / 2);
+		res = ItemPointerCompare(key->scantid,
+								 BTreeTupleGetPostingN(itup, mid));
+
+		if (res >= 1)
+			low = mid + 1;
+		else
+			high = mid;
+	}
+
+	return low;
+}
+
 /*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
@@ -537,9 +621,18 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
  *			<0 if scankey < tuple at offnum;
  *			 0 if scankey == tuple at offnum;
  *			>0 if scankey > tuple at offnum.
- *		NULLs in the keys are treated as sortable values.  Therefore
- *		"equality" does not necessarily mean that the item should be
- *		returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values.  Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key.  Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid.  There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
+ *
+ * It is generally guaranteed that any possible scankey with scantid set
+ * will have zero or one tuples in the index that are considered equal
+ * here.
  *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
@@ -563,6 +656,7 @@ _bt_compare(Relation rel,
 	ScanKey		scankey;
 	int			ncmpkey;
 	int			ntupatts;
+	int32		result;
 
 	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +691,6 @@ _bt_compare(Relation rel,
 	{
 		Datum		datum;
 		bool		isNull;
-		int32		result;
 
 		datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
 
@@ -713,8 +806,24 @@ _bt_compare(Relation rel,
 	if (heapTid == NULL)
 		return 1;
 
+	/*
+	 * scankey must be treated as equal to a posting list tuple if its scantid
+	 * value falls within the range of the posting list.  In all other cases
+	 * there can only be a single heap TID value, which is compared directly
+	 * as a simple scalar value.
+	 */
 	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	return ItemPointerCompare(key->scantid, heapTid);
+	result = ItemPointerCompare(key->scantid, heapTid);
+	if (!BTreeTupleIsPosting(itup) || result <= 0)
+		return result;
+	else
+	{
+		result = ItemPointerCompare(key->scantid, BTreeTupleGetMaxTID(itup));
+		if (result > 0)
+			return 1;
+	}
+
+	return 0;
 }
 
 /*
@@ -1451,6 +1560,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	/* initialize tuple workspace to empty */
 	so->currPos.nextTupleOffset = 0;
+	so->currPos.postingTupleOffset = 0;
 
 	/*
 	 * Now that the current page has been made consistent, the macro should be
@@ -1485,8 +1595,29 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				if (!BTreeTupleIsPosting(itup))
+				{
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+				else
+				{
+					/*
+					 * Setup state to return posting list, and save first
+					 * "logical" tuple
+					 */
+					_bt_setuppostingitems(so, itemIndex, offnum,
+										  BTreeTupleGetPostingN(itup, 0),
+										  itup);
+					itemIndex++;
+					/* Save additional posting list "logical" tuples */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i));
+						itemIndex++;
+					}
+				}
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1519,7 +1650,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1527,7 +1658,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxPostingIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1569,8 +1700,36 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (passes_quals && tuple_alive)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				if (!BTreeTupleIsPosting(itup))
+				{
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+				else
+				{
+					int			i = BTreeTupleGetNPosting(itup) - 1;
+
+					/*
+					 * Setup state to return posting list, and save last
+					 * "logical" tuple from posting list (since it's the first
+					 * that will be returned to scan).
+					 */
+					itemIndex--;
+					_bt_setuppostingitems(so, itemIndex, offnum,
+										  BTreeTupleGetPostingN(itup, i--),
+										  itup);
+
+					/*
+					 * Return posting list "logical" tuples -- do this in
+					 * descending order, to match overall scan order
+					 */
+					for (; i >= 0; i--)
+					{
+						itemIndex--;
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i));
+					}
+				}
 			}
 			if (!continuescan)
 			{
@@ -1584,8 +1743,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1757,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1610,6 +1771,59 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 	}
 }
 
+/*
+ * Setup state to save posting items from a single posting list tuple.  Saves
+ * the logical tuple that will be returned to scan first in passing.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for logical tuple
+ * that is returned to scan first.  Second or subsequent heap TID for posting
+ * list should be saved by calling _bt_savepostingitem().
+ */
+static void
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					  ItemPointer heapTid, IndexTuple itup)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		/* Save a base version of the IndexTuple */
+		Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+		itupsz = MAXALIGN(itupsz);
+		currItem->tupleOffset = so->currPos.nextTupleOffset;
+		memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+		so->currPos.nextTupleOffset += itupsz;
+		so->currPos.postingTupleOffset = currItem->tupleOffset;
+	}
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer heapTid)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+
+	/*
+	 * Have index-only scans return the same base IndexTuple for every logical
+	 * tuple that originates from the same posting list
+	 */
+	if (so->currTuples)
+		currItem->tupleOffset = so->currPos.postingTupleOffset;
+}
+
 /*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index ab19692006..c51cbfb0ba 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -287,6 +287,9 @@ static void _bt_sortaddtup(Page page, Size itemsize,
 						   IndexTuple itup, OffsetNumber itup_off);
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 						 IndexTuple itup);
+static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
+										  BTPageState *state,
+										  BTDedupState *dstate);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
@@ -799,7 +802,8 @@ _bt_sortaddtup(Page page,
 }
 
 /*----------
- * Add an item to a disk page from the sort output.
+ * Add an item to a disk page from the sort output (or add a posting list
+ * item formed from the sort output).
  *
  * We must be careful to observe the page layout conventions of nbtsearch.c:
  * - rightmost pages start data items at P_HIKEY instead of at P_FIRSTKEY.
@@ -1002,6 +1006,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		 * the minimum key for the new page.
 		 */
 		state->btps_minkey = CopyIndexTuple(oitup);
+		Assert(BTreeTupleIsPivot(state->btps_minkey));
 
 		/*
 		 * Set the sibling links for both pages.
@@ -1043,6 +1048,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(state->btps_minkey == NULL);
 		state->btps_minkey = CopyIndexTuple(itup);
 		/* _bt_sortaddtup() will perform full truncation later */
+		BTreeTupleClearBtIsPosting(state->btps_minkey);
 		BTreeTupleSetNAtts(state->btps_minkey, 0);
 	}
 
@@ -1057,6 +1063,42 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	state->btps_lastoff = last_off;
 }
 
+/*
+ * Finalize pending posting list tuple, and add it to the index.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * This is almost like nbtinsert.c's _bt_dedup_finish_pending(), but it adds a
+ * new tuple using _bt_buildadd() and does not maintain the intervals array.
+ */
+static void
+_bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
+							  BTDedupState *dstate)
+{
+	IndexTuple	final;
+
+	Assert(dstate->nitems > 0);
+	if (dstate->nitems == 1)
+		final = dstate->base;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(dstate->base,
+											 dstate->htids,
+											 dstate->nhtids);
+		final = postingtuple;
+	}
+
+	_bt_buildadd(wstate, state, final);
+
+	if (dstate->nitems > 1)
+		pfree(final);
+	/* Don't maintain  dedup_intervals array, or alltupsize */
+	dstate->nhtids = 0;
+	dstate->nitems = 0;
+}
+
 /*
  * Finish writing out the completed btree.
  */
@@ -1144,6 +1186,11 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		deduplicate;
+
+	/* Don't use deduplication for INCLUDE indexes or unique indexes */
+	deduplicate = (keysz == IndexRelationGetNumberOfAttributes(wstate->index) &&
+				   !wstate->index->rd_index->indisunique);
 
 	if (merge)
 	{
@@ -1152,6 +1199,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		 * btspool and btspool2.
 		 */
 
+		Assert(!deduplicate);
 		/* the preparation of merge */
 		itup = tuplesort_getindextuple(btspool->sortstate, true);
 		itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
@@ -1255,9 +1303,95 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		}
 		pfree(sortKeys);
 	}
+	else if (deduplicate)
+	{
+		/* merge is unnecessary, deduplicate into posting lists */
+		BTDedupState *dstate;
+		IndexTuple	newbase;
+
+		dstate = (BTDedupState *) palloc(sizeof(BTDedupState));
+		dstate->deduplicate = true; /* unused */
+		dstate->maxitemsize = 0;	/* set later */
+		/* Metadata about current pending posting list */
+		dstate->htids = NULL;
+		dstate->nhtids = 0;
+		dstate->nitems = 0;
+		dstate->alltupsize = 0; /* unused */
+		/* Metadata about based tuple of current pending posting list */
+		dstate->base = NULL;
+		dstate->baseoff = InvalidOffsetNumber;	/* unused */
+		dstate->basetupsize = 0;
+		dstate->nintervals = 0; /* unused */
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate,
+											   true)) != NULL)
+		{
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+			{
+				state = _bt_pagestate(wstate, 0);
+				dstate->maxitemsize = BTMaxItemSize(state->btps_page);
+				/* Conservatively size array */
+				dstate->htids = palloc(dstate->maxitemsize);
+
+				/*
+				 * No previous/base tuple, since itup is the  first item
+				 * returned by the tuplesort -- use itup as base tuple of
+				 * first pending posting list for entire index build
+				 */
+				newbase = CopyIndexTuple(itup);
+				_bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+			}
+			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
+										 itup) > keysz &&
+					 _bt_dedup_save_htid(dstate, itup))
+			{
+				/*
+				 * Tuple is equal to base tuple of pending posting list, and
+				 * merging itup into pending posting list won't exceed the
+				 * BTMaxItemSize() limit.  Heap TID(s) for itup have been
+				 * saved in state.  The next iteration will also end up here
+				 * if it's possible to merge the next tuple into the same
+				 * pending posting list.
+				 */
+			}
+			else
+			{
+				/*
+				 * Tuple is not equal to pending posting list tuple, or
+				 * BTMaxItemSize() limit was reached
+				 */
+				_bt_sort_dedup_finish_pending(wstate, state, dstate);
+				/* Base tuple is always a copy */
+				pfree(dstate->base);
+
+				/* itup starts new pending posting list */
+				newbase = CopyIndexTuple(itup);
+				_bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+			}
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+
+		/*
+		 * Handle the last item (there must be a last item when the tuplesort
+		 * returned one or more tuples)
+		 */
+		if (state)
+		{
+			_bt_sort_dedup_finish_pending(wstate, state, dstate);
+			/* Base tuple is always a copy */
+			pfree(dstate->base);
+			pfree(dstate->htids);
+		}
+
+		pfree(dstate);
+	}
 	else
 	{
-		/* merge is unnecessary */
+		/* merging and deduplication are both unnecessary */
 		while ((itup = tuplesort_getindextuple(btspool->sortstate,
 											   true)) != NULL)
 		{
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1c1029b6c4..54cecc85c5 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
 	state.minfirstrightsz = SIZE_MAX;
 	state.newitemoff = newitemoff;
 
+	/* newitem cannot be a posting list item */
+	Assert(!BTreeTupleIsPosting(newitem));
+
 	/*
 	 * maxsplits should never exceed maxoff because there will be at most as
 	 * many candidate split points as there are points _between_ tuples, once
@@ -459,17 +462,52 @@ _bt_recsplitloc(FindSplitData *state,
 	int16		leftfree,
 				rightfree;
 	Size		firstrightitemsz;
+	Size		postingsubhikey = 0;
 	bool		newitemisfirstonright;
 
 	/* Is the new item going to be the first item on the right page? */
 	newitemisfirstonright = (firstoldonright == state->newitemoff
 							 && !newitemonleft);
 
+	/*
+	 * FIXME: Accessing every single tuple like this adds cycles to cases that
+	 * cannot possibly benefit (i.e. cases where we know that there cannot be
+	 * posting lists).  Maybe we should add a way to not bother when we are
+	 * certain that this is the case.
+	 *
+	 * We could either have _bt_split() pass us a flag, or invent a page flag
+	 * that indicates that the page might have posting lists, as an
+	 * optimization.  There is no shortage of btpo_flags bits for stuff like
+	 * this.
+	 */
 	if (newitemisfirstonright)
+	{
 		firstrightitemsz = state->newitemsz;
+
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+			postingsubhikey = IndexTupleSize(state->newitem) -
+				BTreeTupleGetPostingOffset(state->newitem);
+	}
 	else
+	{
 		firstrightitemsz = firstoldonrightsz;
 
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf)
+		{
+			ItemId		itemid;
+			IndexTuple	newhighkey;
+
+			itemid = PageGetItemId(state->page, firstoldonright);
+			newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+			if (BTreeTupleIsPosting(newhighkey))
+				postingsubhikey = IndexTupleSize(newhighkey) -
+					BTreeTupleGetPostingOffset(newhighkey);
+		}
+	}
+
 	/* Account for all the old tuples */
 	leftfree = state->leftspace - olddataitemstoleft;
 	rightfree = state->rightspace -
@@ -492,9 +530,13 @@ _bt_recsplitloc(FindSplitData *state,
 	 * adding a heap TID to the left half's new high key when splitting at the
 	 * leaf level.  In practice the new high key will often be smaller and
 	 * will rarely be larger, but conservatively assume the worst case.
+	 * Truncation always truncates away any posting list that appears in the
+	 * first right tuple, though, so it's safe to subtract that overhead
+	 * (while still conservatively assuming that truncation might have to add
+	 * back a single heap TID using the pivot tuple heap TID representation).
 	 */
 	if (state->is_leaf)
-		leftfree -= (int16) (firstrightitemsz +
+		leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
 							 MAXALIGN(sizeof(ItemPointerData)));
 	else
 		leftfree -= (int16) firstrightitemsz;
@@ -691,7 +733,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
 	tup = (IndexTuple) PageGetItem(state->page, itemid);
 	/* Do cheaper test first */
-	if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+	if (BTreeTupleIsPosting(tup) ||
+		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
 	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index bc855dd25d..7460bf264d 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -97,8 +97,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	indoption = rel->rd_indoption;
 	tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
 
-	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
 	/*
 	 * We'll execute search using scan key constructed on key columns.
 	 * Truncated attributes and non-key attributes are omitted from the final
@@ -110,9 +108,20 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	key->anynullkeys = false;	/* initial assumption */
 	key->nextkey = false;
 	key->pivotsearch = false;
+	key->scantid = NULL;
 	key->keysz = Min(indnkeyatts, tupnatts);
-	key->scantid = key->heapkeyspace && itup ?
-		BTreeTupleGetHeapTID(itup) : NULL;
+
+	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+	Assert(!itup || !BTreeTupleIsPosting(itup) || key->heapkeyspace);
+
+	/*
+	 * When caller passes a tuple with a heap TID, use it to set scantid. Note
+	 * that this handles posting list tuples by setting scantid to the lowest
+	 * heap TID in the posting list.
+	 */
+	if (itup && key->heapkeyspace)
+		key->scantid = BTreeTupleGetHeapTID(itup);
+
 	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
@@ -1386,6 +1395,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			continue;
 		}
 
@@ -1547,6 +1557,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			cmpresult = 0;
 			if (subkey->sk_flags & SK_ROW_END)
 				break;
@@ -1786,10 +1797,35 @@ _bt_killitems(IndexScanDesc scan)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
+			bool		killtuple = false;
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			if (BTreeTupleIsPosting(ituple))
 			{
-				/* found the item */
+				int			pi = i + 1;
+				int			nposting = BTreeTupleGetNPosting(ituple);
+				int			j;
+
+				for (j = 0; j < nposting; j++)
+				{
+					ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+					if (!ItemPointerEquals(item, &kitem->heapTid))
+						break;	/* out of posting list loop */
+
+					/* Read-ahead to later kitems */
+					if (pi < numKilled)
+						kitem = &so->currPos.items[so->killedItems[pi++]];
+				}
+
+				if (j == nposting)
+					killtuple = true;
+			}
+			else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+				killtuple = true;
+
+			if (killtuple)
+			{
+				/* found the item/all posting list items */
 				ItemIdMarkDead(iid);
 				killedsomething = true;
 				break;			/* out of inner search loop */
@@ -2140,6 +2176,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(firstright))
+		{
+			BTreeTupleClearBtIsPosting(pivot);
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			if (keepnatts == natts)
+			{
+				/*
+				 * index_truncate_tuple() just returned a copy of the
+				 * original, so make sure that the size of the new pivot tuple
+				 * doesn't have posting list overhead
+				 */
+				pivot->t_info &= ~INDEX_SIZE_MASK;
+				pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+			}
+		}
+
+		Assert(!BTreeTupleIsPosting(pivot));
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2156,6 +2210,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * attribute to the new pivot tuple.
 		 */
 		Assert(natts != nkeyatts);
+		Assert(!BTreeTupleIsPosting(lastleft) &&
+			   !BTreeTupleIsPosting(firstright));
 		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
 		tidpivot = palloc0(newsize);
 		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2163,6 +2219,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		pfree(pivot);
 		pivot = tidpivot;
 	}
+	else if (BTreeTupleIsPosting(firstright))
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal.  We
+		 * can always truncate away a posting list, though.
+		 *
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+			MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+		pivot->t_info &= ~INDEX_SIZE_MASK;
+		pivot->t_info |= newsize;
+		BTreeTupleClearBtIsPosting(pivot);
+		BTreeTupleSetAltHeapTID(pivot);
+	}
 	else
 	{
 		/*
@@ -2170,7 +2244,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * It's necessary to add a heap TID attribute to the new pivot tuple.
 		 */
 		Assert(natts == nkeyatts);
-		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+		newsize = MAXALIGN(IndexTupleSize(firstright)) +
+			MAXALIGN(sizeof(ItemPointerData));
 		pivot = palloc0(newsize);
 		memcpy(pivot, firstright, IndexTupleSize(firstright));
 	}
@@ -2188,6 +2263,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * nbtree (e.g., there is no pg_attribute entry).
 	 */
 	Assert(itup_key->heapkeyspace);
+	Assert(!BTreeTupleIsPosting(pivot));
 	pivot->t_info &= ~INDEX_SIZE_MASK;
 	pivot->t_info |= newsize;
 
@@ -2200,7 +2276,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2211,9 +2287,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+							  BTreeTupleGetHeapTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #else
 
 	/*
@@ -2226,7 +2305,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2235,7 +2314,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2316,15 +2396,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
  * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal (once detoasted).  Similarly, result may
- * differ from the _bt_keep_natts result when either tuple has TOASTed datums,
- * though this is barely possible in practice.
+ * unless they're bitwise equal after detoasting.
  *
  * These issues must be acceptable to callers, typically because they're only
  * concerned about making suffix truncation as effective as possible without
  * leaving excessive amounts of free space on either side of page split.
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts.
+ *
+ * When an index only uses opclasses where equality is "precise", this
+ * function is guaranteed to give the same result as _bt_keep_natts().  This
+ * makes it safe to use this function to determine whether or not two tuples
+ * can be folded together into a single posting tuple.  Posting list
+ * deduplication cannot be used with nondeterministic collations for this
+ * reason.
+ *
+ * FIXME: Actually invent the needed "equality-is-precise" opclass
+ * infrastructure.  See dedicated -hackers thread:
+ *
+ * https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2349,8 +2439,38 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 		if (isNull1 != isNull2)
 			break;
 
+		/*
+		 * XXX: The ideal outcome from the point of view of the posting list
+		 * patch is that the definition of an opclass with "precise equality"
+		 * becomes: "equality operator function must give exactly the same
+		 * answer as datum_image_eq() would, provided that we aren't using a
+		 * nondeterministic collation". (Nondeterministic collations are
+		 * clearly not compatible with deduplication.)
+		 *
+		 * This will be a lot faster than actually using the authoritative
+		 * insertion scankey in some cases.  This approach also seems more
+		 * elegant, since suffix truncation gets to follow exactly the same
+		 * definition of "equal" as posting list deduplication -- there is a
+		 * subtle interplay between deduplication and suffix truncation, and
+		 * it would be nice to know for sure that they have exactly the same
+		 * idea about what equality is.
+		 *
+		 * This ideal outcome still avoids problems with TOAST.  We cannot
+		 * repeat bugs like the amcheck bug that was fixed in bugfix commit
+		 * eba775345d23d2c999bbb412ae658b6dab36e3e8.  datum_image_eq()
+		 * considers binary equality, though only _after_ each datum is
+		 * decompressed.
+		 *
+		 * If this ideal solution isn't possible, then we can fall back on
+		 * defining "precise equality" as: "type's output function must
+		 * produce identical textual output for any two datums that compare
+		 * equal when using a safe/equality-is-precise operator class (unless
+		 * using a nondeterministic collation)".  That would mean that we'd
+		 * have to make deduplication call _bt_keep_natts() instead (or some
+		 * other function that uses authoritative insertion scankey).
+		 */
 		if (!isNull1 &&
-			!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+			!datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
 			break;
 
 		keepnatts++;
@@ -2402,22 +2522,30 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
+	/* !heapkeyspace indexes do not support deduplication */
+	if (!heapkeyspace && BTreeTupleIsPosting(itup))
+		return false;
+
+	/* INCLUDE indexes do not support deduplication */
+	if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+		return false;
+
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
 			/*
-			 * Non-pivot tuples currently never use alternative heap TID
-			 * representation -- even those within heapkeyspace indexes
+			 * Non-pivot tuple should never be explicitly marked as a pivot
+			 * tuple
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated.  (Note that tupnatts must have been
-			 * inferred, rather than coming from an explicit on-disk
-			 * representation.)
+			 * inferred, even with a posting list tuple, because only pivot
+			 * tuples store tupnatts directly.)
 			 */
 			return tupnatts == natts;
 		}
@@ -2461,12 +2589,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * non-zero, or when there is no explicit representation and the
 			 * tuple is evidently not a pre-pg_upgrade tuple.
 			 *
-			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
-			 * that to decide if the tuple is a pre-v11 tuple.
+			 * Prior to v11, downlinks always had P_HIKEY as their offset.
+			 * Accept that as an alternative indication of a valid
+			 * !heapkeyspace negative infinity tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
-				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+				ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
 		}
 		else
 		{
@@ -2492,7 +2620,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
+		return false;
+
+	/* Pivot tuple should not use posting list representation (redundant) */
+	if (BTreeTupleIsPosting(itup))
 		return false;
 
 	/*
@@ -2562,11 +2694,85 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					BTMaxItemSizeNoHeapTid(page),
 					RelationGetRelationName(rel)),
 			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
-					   ItemPointerGetBlockNumber(&newtup->t_tid),
-					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+					   ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
 					   RelationGetRelationName(heap)),
 			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
 					 "Consider a function index of an MD5 hash of the value, "
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Given a basic tuple that contains key datum and posting list, build a
+ * posting tuple.  Caller's "htids" array must be sorted in ascending order.
+ *
+ * Basic tuple can be a posting tuple, but we only use key part of it, all
+ * ItemPointers must be passed via htids.
+ *
+ * If nhtids == 1, just build a non-posting tuple.  It is necessary to avoid
+ * storage overhead after posting tuple was vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointer htids, int nhtids)
+{
+	uint32		keysize,
+				newsize = 0;
+	IndexTuple	itup;
+
+	/* We only need key part of the tuple */
+	if (BTreeTupleIsPosting(tuple))
+		keysize = BTreeTupleGetPostingOffset(tuple);
+	else
+		keysize = IndexTupleSize(tuple);
+
+	Assert(nhtids > 0);
+
+	/* Add space needed for posting list */
+	if (nhtids > 1)
+		newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nhtids;
+	else
+		newsize = keysize;
+
+	newsize = MAXALIGN(newsize);
+	itup = palloc0(newsize);
+	memcpy(itup, tuple, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+
+	if (nhtids > 1)
+	{
+		/* Form posting tuple, fill posting fields */
+
+		itup->t_info |= INDEX_ALT_TID_MASK;
+		BTreeSetPostingMeta(itup, nhtids, SHORTALIGN(keysize));
+		/* Copy posting list into the posting tuple */
+		memcpy(BTreeTupleGetPosting(itup), htids,
+			   sizeof(ItemPointerData) * nhtids);
+
+#ifdef USE_ASSERT_CHECKING
+		{
+			/* Assert that htid array is sorted and has unique TIDs */
+			ItemPointerData last;
+			ItemPointer current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+				current = BTreeTupleGetPostingN(itup, i);
+				Assert(ItemPointerCompare(current, &last) > 0);
+				ItemPointerCopy(current, &last);
+			}
+		}
+#endif
+	}
+	else
+	{
+		/* To finish building of a non-posting tuple, copy TID from htids */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		ItemPointerCopy(htids, &itup->t_tid);
+	}
+
+	return itup;
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c1aa..365f0b4c79 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -21,8 +21,11 @@
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "storage/procarray.h"
+#include "utils/memutils.h"
 #include "miscadmin.h"
 
+static MemoryContext opCtx;		/* working memory for operations */
+
 /*
  * _bt_restore_page -- re-enter all the index tuples on a page
  *
@@ -181,9 +184,46 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 
 		page = BufferGetPage(buffer);
 
-		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
-						false, false) == InvalidOffsetNumber)
-			elog(PANIC, "btree_xlog_insert: failed to add item");
+		if (xlrec->postingoff == InvalidOffsetNumber)
+		{
+			/* Simple retail insertion */
+			if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+		}
+		else
+		{
+			ItemId		itemid;
+			IndexTuple	oposting,
+						newitem,
+						nposting;
+
+			/*
+			 * A posting list split occurred during insertion.
+			 *
+			 * Use _bt_posting_split() to repeat posting list split steps from
+			 * primary.  Note that newitem from WAL record is 'orignewitem',
+			 * not the final version of newitem that is actually inserted on
+			 * page.
+			 */
+			Assert(isleaf);
+			itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+			oposting = (IndexTuple) PageGetItem(page, itemid);
+
+			/* newitem must be mutable copy for _bt_posting_split() */
+			newitem = CopyIndexTuple((IndexTuple) datapos);
+			nposting = _bt_posting_split(newitem, oposting,
+										 xlrec->postingoff);
+
+			/* Replace existing posting list with post-split version */
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+			/* insert new item */
+			Assert(IndexTupleSize(newitem) == datalen);
+			if (PageAddItem(page, (Item) newitem, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add posting split new item");
+		}
 
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
@@ -265,20 +305,42 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 		BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
 		OffsetNumber off;
 		IndexTuple	newitem = NULL,
-					left_hikey = NULL;
+					left_hikey = NULL,
+					nposting = NULL;
 		Size		newitemsz = 0,
 					left_hikeysz = 0;
 		Page		newlpage;
-		OffsetNumber leftoff;
+		OffsetNumber leftoff,
+					replacepostingoff = InvalidOffsetNumber;
 
 		datapos = XLogRecGetBlockData(record, 0, &datalen);
 
-		if (onleft)
+		if (onleft || xlrec->postingoff != 0)
 		{
 			newitem = (IndexTuple) datapos;
 			newitemsz = MAXALIGN(IndexTupleSize(newitem));
 			datapos += newitemsz;
 			datalen -= newitemsz;
+
+			if (xlrec->postingoff != 0)
+			{
+				/*
+				 * Use _bt_posting_split() to repeat posting list split steps
+				 * from primary
+				 */
+				ItemId		itemid;
+				IndexTuple	oposting;
+
+				/* Posting list must be at offset number before new item's */
+				replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+				/* newitem must be mutable copy for _bt_posting_split() */
+				newitem = CopyIndexTuple(newitem);
+				itemid = PageGetItemId(lpage, replacepostingoff);
+				oposting = (IndexTuple) PageGetItem(lpage, itemid);
+				nposting = _bt_posting_split(newitem, oposting,
+											 xlrec->postingoff);
+			}
 		}
 
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
@@ -304,8 +366,20 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 			Size		itemsz;
 			IndexTuple	item;
 
+			/* Add replacement posting list when required */
+			if (off == replacepostingoff)
+			{
+				Assert(onleft || xlrec->firstright == xlrec->newitemoff);
+				if (PageAddItem(newlpage, (Item) nposting,
+								MAXALIGN(IndexTupleSize(nposting)), leftoff,
+								false, false) == InvalidOffsetNumber)
+					elog(ERROR, "failed to add new posting list item to left page after split");
+				leftoff = OffsetNumberNext(leftoff);
+				continue;
+			}
+
 			/* add the new item if it was inserted on left page */
-			if (onleft && off == xlrec->newitemoff)
+			else if (onleft && off == xlrec->newitemoff)
 			{
 				if (PageAddItem(newlpage, (Item) newitem, newitemsz, leftoff,
 								false, false) == InvalidOffsetNumber)
@@ -379,6 +453,130 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 	}
 }
 
+static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		buf;
+	Page		newpage;
+	xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+
+	if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Initialize a temporary empty page and copy all the items to that in
+		 * item number order.
+		 */
+		Page		page = (Page) BufferGetPage(buf);
+		BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		BTPageOpaque nopaque;
+		OffsetNumber offnum,
+					minoff,
+					maxoff;
+		BTDedupState *state;
+		BTDedupInterval *intervals;
+
+		/* Get 'nintervals'-sized array of intervals to process */
+		intervals = (BTDedupInterval *) ((char *) xlrec + SizeOfBtreeDedup);
+
+		state = (BTDedupState *) palloc(sizeof(BTDedupState));
+
+		state->deduplicate = true;	/* unused */
+		state->maxitemsize = BTMaxItemSize(page);
+		/* Metadata about current pending posting list */
+		state->htids = NULL;
+		state->nhtids = 0;
+		state->nitems = 0;
+		state->alltupsize = 0;
+		/* Metadata about based tuple of current pending posting list */
+		state->base = NULL;
+		state->baseoff = InvalidOffsetNumber;
+		state->basetupsize = 0;
+		state->nintervals = 0;
+
+		/* Scan over all items to see which ones can be deduplicated */
+		minoff = P_FIRSTDATAKEY(opaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+		newpage = PageGetTempPageCopySpecial(page);
+		nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+		/* Make sure that new page won't have garbage flag set */
+		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+		/* Copy High Key if any */
+		if (!P_RIGHTMOST(opaque))
+		{
+			ItemId		itemid = PageGetItemId(page, P_HIKEY);
+			Size		itemsz = ItemIdGetLength(itemid);
+			IndexTuple	item = (IndexTuple) PageGetItem(page, itemid);
+
+			if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+							false, false) == InvalidOffsetNumber)
+				elog(ERROR, "failed to add highkey during deduplication");
+		}
+
+		/* Conservatively size array */
+		state->htids = palloc(state->maxitemsize);
+
+		/*
+		 * Iterate over tuples on the page to deduplicate them into posting
+		 * lists and insert into new page
+		 */
+		for (offnum = minoff;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemid = PageGetItemId(page, offnum);
+			IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+			Assert(!ItemIdIsDead(itemid));
+
+			if (offnum == minoff)
+			{
+				/*
+				 * No previous/base tuple for first data item -- use first
+				 * data item as base tuple of first pending posting list
+				 */
+				_bt_dedup_start_pending(state, itup, offnum);
+			}
+			else if (state->nintervals < xlrec->nintervals &&
+					 state->baseoff == intervals[state->nintervals].baseoff &&
+					 state->nitems < intervals[state->nintervals].nitems)
+			{
+				/* Heap TID(s) for itup will be saved in state */
+				if (!_bt_dedup_save_htid(state, itup))
+					elog(ERROR, "could not add heap tid to pending posting list");
+			}
+			else
+			{
+				/*
+				 * Tuple was not equal to pending posting list tuple on
+				 * primary, or BTMaxItemSize() limit was reached on primary
+				 */
+				_bt_dedup_finish_pending(newpage, state);
+
+				/* itup starts new pending posting list */
+				_bt_dedup_start_pending(state, itup, offnum);
+			}
+		}
+
+		/* Handle the last item */
+		_bt_dedup_finish_pending(newpage, state);
+
+		/* Assert that final working state matches WAL record state */
+		Assert(state->nintervals == xlrec->nintervals);
+		Assert(memcmp(state->intervals, intervals,
+					  state->nintervals * sizeof(BTDedupInterval)) == 0);
+
+		PageRestoreTempPage(newpage, page);
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buf);
+	}
+
+	if (BufferIsValid(buf))
+		UnlockReleaseBuffer(buf);
+}
+
 static void
 btree_xlog_vacuum(XLogReaderState *record)
 {
@@ -386,8 +584,8 @@ btree_xlog_vacuum(XLogReaderState *record)
 	Buffer		buffer;
 	Page		page;
 	BTPageOpaque opaque;
-#ifdef UNUSED
 	xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
 
 	/*
 	 * This section of code is thought to be no longer needed, after analysis
@@ -478,14 +676,34 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		if (len > 0)
 		{
-			OffsetNumber *unused;
-			OffsetNumber *unend;
+			if (xlrec->nupdated > 0)
+			{
+				OffsetNumber *updatedoffsets;
+				IndexTuple	updated;
+				Size		itemsz;
 
-			unused = (OffsetNumber *) ptr;
-			unend = (OffsetNumber *) ((char *) ptr + len);
+				updatedoffsets = (OffsetNumber *)
+					(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+				updated = (IndexTuple) ((char *) updatedoffsets +
+										xlrec->nupdated * sizeof(OffsetNumber));
 
-			if ((unend - unused) > 0)
-				PageIndexMultiDelete(page, unused, unend - unused);
+				/* Handle posting tuples */
+				for (int i = 0; i < xlrec->nupdated; i++)
+				{
+					PageIndexTupleDelete(page, updatedoffsets[i]);
+
+					itemsz = MAXALIGN(IndexTupleSize(updated));
+
+					if (PageAddItem(page, (Item) updated, itemsz, updatedoffsets[i],
+									false, false) == InvalidOffsetNumber)
+						elog(PANIC, "btree_xlog_vacuum: failed to add updated posting list item");
+
+					updated = (IndexTuple) ((char *) updated + itemsz);
+				}
+			}
+
+			if (xlrec->ndeleted)
+				PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 		}
 
 		/*
@@ -820,7 +1038,9 @@ void
 btree_redo(XLogReaderState *record)
 {
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	MemoryContext oldCtx;
 
+	oldCtx = MemoryContextSwitchTo(opCtx);
 	switch (info)
 	{
 		case XLOG_BTREE_INSERT_LEAF:
@@ -838,6 +1058,9 @@ btree_redo(XLogReaderState *record)
 		case XLOG_BTREE_SPLIT_R:
 			btree_xlog_split(false, record);
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			btree_xlog_dedup(record);
+			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
 			break;
@@ -863,6 +1086,23 @@ btree_redo(XLogReaderState *record)
 		default:
 			elog(PANIC, "btree_redo: unknown op code %u", info);
 	}
+	MemoryContextSwitchTo(oldCtx);
+	MemoryContextReset(opCtx);
+}
+
+void
+btree_xlog_startup(void)
+{
+	opCtx = AllocSetContextCreate(CurrentMemoryContext,
+								  "Btree recovery temporary context",
+								  ALLOCSET_DEFAULT_SIZES);
+}
+
+void
+btree_xlog_cleanup(void)
+{
+	MemoryContextDelete(opCtx);
+	opCtx = NULL;
 }
 
 /*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 4ee6d04a68..177875224a 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -30,7 +30,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_insert *xlrec = (xl_btree_insert *) rec;
 
-				appendStringInfo(buf, "off %u", xlrec->offnum);
+				appendStringInfo(buf, "off %u; postingoff %u",
+								 xlrec->offnum, xlrec->postingoff);
 				break;
 			}
 		case XLOG_BTREE_SPLIT_L:
@@ -38,16 +39,28 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
-				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d",
-								 xlrec->level, xlrec->firstright, xlrec->newitemoff);
+				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, postingoff %d",
+								 xlrec->level,
+								 xlrec->firstright,
+								 xlrec->newitemoff,
+								 xlrec->postingoff);
+				break;
+			}
+		case XLOG_BTREE_DEDUP_PAGE:
+			{
+				xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+				appendStringInfo(buf, "nintervals %d", xlrec->nintervals);
 				break;
 			}
 		case XLOG_BTREE_VACUUM:
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "lastBlockVacuumed %u",
-								 xlrec->lastBlockVacuumed);
+				appendStringInfo(buf, "lastBlockVacuumed %u; nupdated %u; ndeleted %u",
+								 xlrec->lastBlockVacuumed,
+								 xlrec->nupdated,
+								 xlrec->ndeleted);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
@@ -131,6 +144,9 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			id = "DEDUPLICATE";
+			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4a80e84aa7..da3c8f76a3 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
  * All other types of index tuples ("pivot" tuples) only have key columns,
  * since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,38 @@ typedef struct BTMetaPageData
  * omitted rather than truncated, since its representation is different to
  * the non-pivot representation.)
  *
+ * Non-pivot posting tuple format:
+ *  t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicated keys more effectively, we use special format
+ * of tuples - posting tuples.  posting_list is an array of ItemPointerData.
+ *
+ * Deduplication never applies to unique indexes or indexes with INCLUDEd
+ * columns.
+ *
+ * To differ posting tuples we use INDEX_ALT_TID_MASK flag in t_info and
+ * BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's tid:
+ * - t_tid.ip_blkid contains offset of the posting list.
+ * - t_tid offset field contains number of posting items this tuple contain
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * tuples, which is constrainted by BTMaxItemSize.
+
+ * If page contains so many duplicates, that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), page may contain several posting
+ * tuples with the same key.
+ * Also page can contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
  * Pivot tuple format:
  *
  *  t_tid | t_info | key values | [heap TID]
@@ -281,23 +312,150 @@ typedef struct BTMetaPageData
  * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
  * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
  * number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinct posting tuples from pivot tuples.
  *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_N_POSTING_OFFSET_MASK	0x0FFF
 #define BT_HEAP_TID_ATTR			0x1000
+#define BT_IS_POSTING				0x2000
 
-/* Get/set downlink block number */
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger then
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+	((int) ((BLCKSZ - SizeOfPageHeaderData - \
+			3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+			(sizeof(ItemPointerData)))
+
+/*
+ * State used to representing a pending posting list during deduplication.
+ *
+ * Each entry represents a group of consecutive items from the page, starting
+ * from page offset number 'baseoff', which is the offset number of the "base"
+ * tuple on the page undergoing deduplication.  'nitems' is the total number
+ * of items from the page that will be merged to make a new posting tuple.
+ *
+ * Note: 'nitems' means the number of physical index tuples/line pointers on
+ * the page, starting with and including the item at offset number 'baseoff'
+ * (so nitems should be at least 2 when interval is used).  These existing
+ * tuples may be posting list tuples or regular tuples.
+ */
+typedef struct BTDedupInterval
+{
+	OffsetNumber baseoff;
+	OffsetNumber nitems;
+} BTDedupInterval;
+
+/*
+ * Btree-private state needed to build posting tuples.  htids is an array of
+ * ItemPointers for pending posting list.
+ *
+ * Iterating over tuples during index build or applying deduplication to a
+ * single page, we remember a "base" tuple, then compare the next one with it.
+ * If tuples are equal, save their TIDs in the posting list.
+ */
+typedef struct BTDedupState
+{
+	/* Deduplication status info for entire page/operation */
+	bool		deduplicate;	/* Still deduplicating page? */
+	Size		maxitemsize;	/* BTMaxItemSize() limit for page */
+
+	/* Metadata about current pending posting list */
+	ItemPointer htids;			/* Heap TIDs in pending posting list */
+	int			nhtids;			/* # valid heap TIDs in nhtids array */
+	int			nitems;			/* See BTDedupInterval definition */
+	Size		alltupsize;		/* Includes line pointer overhead */
+
+	/* Metadata about based tuple of current pending posting list */
+	IndexTuple	base;			/* Use to form new posting list */
+	OffsetNumber baseoff;		/* original page offset of base */
+	Size		basetupsize;	/* base size without posting list */
+
+	/*
+	 * Array of pending posting lists.  Contains one entry for each group of
+	 * consecutive items that will be deduplicated by creating a new posting
+	 * list tuple.
+	 */
+	int			nintervals;		/* current size of intervals array */
+	BTDedupInterval intervals[MaxIndexTuplesPerPage];
+} BTDedupState;
+
+/*
+ * N.B.: BTreeTupleIsPivot() should only be used in code that deals with
+ * heapkeyspace indexes specifically.  BTreeTupleIsPosting() works with all
+ * nbtree indexes, though.
+ */
+#define BTreeTupleIsPivot(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+	)
+#define BTreeTupleIsPosting(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+	)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleGetNPosting(itup)	\
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+	)
+#define BTreeTupleSetNPosting(itup, n) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+	} while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+	)
+#define BTreeSetPostingMeta(itup, nposting, off) \
+	do { \
+		BTreeTupleSetNPosting(itup, nposting); \
+		Assert(BTreeTupleIsPosting(itup)); \
+		ItemPointerSetBlockNumber(&((itup)->t_tid), (off)); \
+	} while(0)
+
+#define BTreeTupleGetPosting(itup) \
+	(ItemPointer) ((char*) (itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+	(BTreeTupleGetPosting(itup) + (n))
+
+/* Get/set downlink block number  */
 #define BTreeInnerTupleGetDownLink(itup) \
 	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
 #define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,40 +484,73 @@ typedef struct BTMetaPageData
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
 		( \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
 		IndexRelationGetNumberOfAttributes(rel) \
 	)
-#define BTreeTupleSetNAtts(itup, n) \
-	do { \
-		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
-	} while(0)
+
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int n)
+{
+	Assert(!BTreeTupleIsPosting(itup));
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	ItemPointerSetOffsetNumber(&itup->t_tid, n & BT_N_KEYS_OFFSET_MASK);
+}
 
 /*
- * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Get tiebreaker heap TID attribute, if any.  Works with both pivot and
+ * non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
  */
-#define BTreeTupleGetHeapTID(itup) \
-	( \
-	  (itup)->t_info & INDEX_ALT_TID_MASK && \
-	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
-	  ( \
-		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
-					   sizeof(ItemPointerData)) \
-	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
-	)
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+	if (BTreeTupleIsPivot(itup))
+	{
+		/* Pivot tuple heap TID representation? */
+		if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+			 BT_HEAP_TID_ATTR) != 0)
+			return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+								  sizeof(ItemPointerData));
+
+		/* Heap TID attribute was truncated */
+		return NULL;
+	}
+	else if (BTreeTupleIsPosting(itup))
+		return BTreeTupleGetPosting(itup);
+
+	return &(itup->t_tid);
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list tuple.  Works with
+ * non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxTID(IndexTuple itup)
+{
+	Assert(!BTreeTupleIsPivot(itup));
+
+	if (BTreeTupleIsPosting(itup))
+		return (ItemPointer) (BTreeTupleGetPosting(itup) +
+							  (BTreeTupleGetNPosting(itup) - 1));
+
+	return &(itup->t_tid);
+}
+
 /*
  * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * representation
  */
 #define BTreeTupleSetAltHeapTID(itup) \
 	do { \
-		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(BTreeTupleIsPivot(itup)); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
 								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
 	} while(0)
@@ -499,6 +690,13 @@ typedef struct BTInsertStateData
 	/* Buffer containing leaf page we're likely to insert itup on */
 	Buffer		buf;
 
+	/*
+	 * if _bt_binsrch_insert() found the location inside existing posting
+	 * list, save the position inside the list.  This will be -1 in rare cases
+	 * where the overlapping posting list is LP_DEAD.
+	 */
+	int			postingoff;
+
 	/*
 	 * Cache of bounds within the current buffer.  Only used for insertions
 	 * where _bt_check_unique is called.  See _bt_binsrch_insert and
@@ -534,7 +732,9 @@ typedef BTInsertStateData *BTInsertState;
  * If we are doing an index-only scan, we save the entire IndexTuple for each
  * matched item, otherwise only its heap TID and offset.  The IndexTuples go
  * into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array.  Posting list tuples store a version of the
+ * tuple that does not include the posting list, allowing the same key to be
+ * returned for each logical tuple associated with the posting list.
  */
 
 typedef struct BTScanPosItem	/* what we remember about each match */
@@ -563,9 +763,13 @@ typedef struct BTScanPosData
 
 	/*
 	 * If we are doing an index-only scan, nextTupleOffset is the first free
-	 * location in the associated tuple storage workspace.
+	 * location in the associated tuple storage workspace.  Posting list
+	 * tuples need postingTupleOffset to store the current location of the
+	 * tuple that is returned multiple times (once per heap TID in posting
+	 * list).
 	 */
 	int			nextTupleOffset;
+	int			postingTupleOffset;
 
 	/*
 	 * The items array is always ordered in index order (ie, increasing
@@ -578,7 +782,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxPostingIndexTuplesPerPage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -730,8 +934,14 @@ extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
  */
 extern bool _bt_doinsert(Relation rel, IndexTuple itup,
 						 IndexUniqueCheck checkUnique, Relation heapRel);
+extern IndexTuple _bt_posting_split(IndexTuple newitem, IndexTuple oposting,
+									OffsetNumber postingoff);
 extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
 extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
+extern void _bt_dedup_start_pending(BTDedupState *state, IndexTuple base,
+									OffsetNumber base_off);
+extern bool _bt_dedup_save_htid(BTDedupState *state, IndexTuple itup);
+extern Size _bt_dedup_finish_pending(Page page, BTDedupState *state);
 
 /*
  * prototypes for functions in nbtsplitloc.c
@@ -762,6 +972,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems, Relation heapRel);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems,
+								OffsetNumber *updateitemnos,
+								IndexTuple *updated, int nupdateable,
 								BlockNumber lastBlockVacuumed);
 extern int	_bt_pagedel(Relation rel, Buffer buf);
 
@@ -812,6 +1024,8 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointer htids,
+										int nhtids);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 91b9ee00cf..affdd910ec 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP_PAGE	0x50	/* deduplicate tuples on leaf page */
+/* 0x60 is unused */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -61,16 +62,21 @@ typedef struct xl_btree_metadata
  * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
  * Note that INSERT_META implies it's not a leaf page.
  *
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page (data contains the inserted tuple);
+ *				 if postingoff is set, this started out as an insertion
+ *				 into an existing posting tuple at the offset before
+ *				 offnum (i.e. it's a posting list split).  (REDO will
+ *				 have to update split posting list, too.)
  * Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
  * Backup Blk 2: xl_btree_metadata, if INSERT_META
  */
 typedef struct xl_btree_insert
 {
 	OffsetNumber offnum;
+	OffsetNumber postingoff;
 } xl_btree_insert;
 
-#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, postingoff) + sizeof(OffsetNumber))
 
 /*
  * On insert with split, we save all the items going into the right sibling
@@ -91,9 +97,19 @@ typedef struct xl_btree_insert
  *
  * Backup Blk 0: original page / new left page
  *
- * The left page's data portion contains the new item, if it's the _L variant.
- * An IndexTuple representing the high key of the left page must follow with
- * either variant.
+ * The left page's data portion contains the new item, if it's the _L variant
+ * (though _R variant page split records with a posting list split sometimes
+ * need to include newitem).  An IndexTuple representing the high key of the
+ * left page must follow in all cases.
+ *
+ * The newitem is actually an "original" newitem when a posting list split
+ * occurs that requires than the original posting list be updated in passing.
+ * Recovery recognizes this case when postingoff is set, and must use the
+ * posting offset to do an in-place update of the existing posting list that
+ * was actually split, and change the newitem to the "final" newitem.  This
+ * corresponds to the xl_btree_insert postingoff-is-set case.  postingoff
+ * won't be set when a posting list split occurs where both original posting
+ * list and newitem go on the right page.
  *
  * Backup Blk 1: new right page
  *
@@ -111,10 +127,27 @@ typedef struct xl_btree_split
 {
 	uint32		level;			/* tree level of page being split */
 	OffsetNumber firstright;	/* first item moved to right page */
-	OffsetNumber newitemoff;	/* new item's offset (useful for _L variant) */
+	OffsetNumber newitemoff;	/* new item's offset */
+	OffsetNumber postingoff;	/* offset inside orig posting tuple */
 } xl_btree_split;
 
-#define SizeOfBtreeSplit	(offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit	(offsetof(xl_btree_split, postingoff) + sizeof(OffsetNumber))
+
+/*
+ * When page is deduplicated, consecutive groups of tuples with equal keys are
+ * merged together into posting list tuples.
+ *
+ * The WAL record represents the number of posting tuples that should be added
+ * to the page using nintervals.  An array of dedupInterval structs follows.
+ */
+typedef struct xl_btree_dedup
+{
+	int			nintervals;
+
+	/* TARGET DEDUP INTERVALS FOLLOW AT THE END */
+} xl_btree_dedup;
+
+#define SizeOfBtreeDedup 	(offsetof(xl_btree_dedup, nintervals) + sizeof(int))
 
 /*
  * This is what we need to know about delete of individual leaf index tuples.
@@ -166,16 +199,27 @@ typedef struct xl_btree_reuse_page
  * block numbers aren't given.
  *
  * Note that the *last* WAL record in any vacuum of an index is allowed to
- * have a zero length array of offsets. Earlier records must have at least one.
+ * have a zero length array of target offsets (i.e. no deletes or updates).
+ * Earlier records must have at least one.
  */
 typedef struct xl_btree_vacuum
 {
 	BlockNumber lastBlockVacuumed;
 
-	/* TARGET OFFSET NUMBERS FOLLOW */
+	/*
+	 * This field helps us to find beginning of the updated versions of tuples
+	 * which follow array of offset numbers, needed when a posting list is
+	 * vacuumed without killing all of its logical tuples.
+	 */
+	uint32		nupdated;
+	uint32		ndeleted;
+
+	/* UPDATED TARGET OFFSET NUMBERS FOLLOW (if any) */
+	/* UPDATED TUPLES TO ADD BACK FOLLOW (if any) */
+	/* DELETED TARGET OFFSET NUMBERS FOLLOW (if any) */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(BlockNumber))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.
@@ -256,6 +300,8 @@ typedef struct xl_btree_newroot
 extern void btree_redo(XLogReaderState *record);
 extern void btree_desc(StringInfo buf, XLogReaderState *record);
 extern const char *btree_identify(uint8 info);
+extern void btree_xlog_startup(void);
+extern void btree_xlog_cleanup(void);
 extern void btree_mask(char *pagedata, BlockNumber blkno);
 
 #endif							/* NBTXLOG_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 3c0db2ccf5..2b8c6c7fc8 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -36,7 +36,7 @@ PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL,
 PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
 PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
 PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask)
 PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask)
 PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
 PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index ec47a228ae..71a03e39d3 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -212,3 +212,24 @@
    Memcheck:Cond
    fun:PyObject_Realloc
 }
+
+# Temporarily work around bug in datum_image_eq's handling of the cstring
+# (typLen == -2) case.  datumIsEqual() is not affected, but also doesn't handle
+# TOAST'ed values correctly.
+#
+# FIXME: Remove both suppressions when bug is fixed on master branch
+{
+   temporary_workaround_1
+   Memcheck:Addr1
+   fun:bcmp
+   fun:datum_image_eq
+   fun:_bt_keep_natts_fast
+}
+
+{
+   temporary_workaround_8
+   Memcheck:Addr8
+   fun:bcmp
+   fun:datum_image_eq
+   fun:_bt_keep_natts_fast
+}
-- 
2.17.1

#94

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Peter Geoghegan (#93)

1 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Mon, Sep 23, 2019 at 5:13 PM Peter Geoghegan <pg@bowt.ie> wrote:

I attach version 17.

I attach a patch that applies on top of v17. It adds support for
deduplication within unique indexes. Actually, this is a snippet of
code that appeared in my prototype from August 5 (we need very little
extra code for this now). Unique index support kind of looked like a
bad idea at the time, but things have changed a lot since then.

I benchmarked this overnight using a custom pgbench-based test that
used a Zipfian distribution, with a single-row SELECT and an UPDATE of
pgbench_accounts per xact. I used regular/logged tables this time
around, since WAL-logging is now fairly efficient. I added a separate
low cardinality index on pgbench_accounts(abalance). A low cardinality
index is the most interesting case for this patch, obviously, but it
also serves to prevent all HOT updates, increasing bloat for both
indexes. We want a realistic case that creates a lot of index bloat.

This wasn't a rigorous enough benchmark to present here in full, but
the results were very encouraging. With reasonable client counts for
the underlying hardware, we seem to have a small increase in TPS, and
a small decrease in latency. There is a regression with 128 clients,
when contention is ridiculously high (this is my home server, which
only has 4 cores). More importantly:

* The low cardinality index is almost 3x smaller with the patch -- no
surprises there.

* The read latency is where latency goes up, if it goes up at all.
Whatever it is that might increase latency, it doesn't look like it's
deduplication itself. Yeah, deduplication is expensive, but it's not
nearly as expensive as a page split. (I'm talking about the immediate
cost, not the bigger picture, though the bigger picture matters even
more.)

* The growth in primary key size over time is the thing I find really
interesting. The patch seems to really control the number of pages
splits over many hours with lots of non-HOT updates. I think that a
timeline of days or weeks could be really interesting.

I am now about 75% convinced that adding deduplication to unique
indexes is a good idea, at least as an option that is disabled by
default. We're already doing well here, even though there has been no
work on optimizing deduplication in unique indexes. Further
optimizations seem quite possible, though. I'm mostly thinking about
optimizing non-HOT updates by teaching nbtree some basic things about
versioning with unique indexes.

For example, we could remember "recently dead" duplicates of the value
we are about to insert (as part of an UPDATE statement) from within
_bt_check_unique(). Then, when it looks like a page split may be
necessary, we can hint to _bt_dedup_one_page(): "please just
deduplicate the group of duplicates starting from this offset, which
are duplicates of the this new item I am inserting -- do not create a
posting list that I will have to split, though". We already cache the
binary search bounds established within _bt_check_unique() in
insertstate, so perhaps we could reuse that to make this work. The
goal here is that the the old/recently dead versions end up together
in their own posting list (or maybe two posting lists), whereas our
new/most current tuple is on its own. There is a very good chance that
our transaction will commit, leaving somebody else to set the LP_DEAD
bit on the posting list that contains those old versions. In short,
we'd be making deduplication and opportunistic garbage collection
cooperate closely.

This can work both ways -- maybe we should also teach
_bt_vacuum_one_page() to cooperate with _bt_dedup_one_page(). That is,
if we add the mechanism I just described in the last paragraph, maybe
_bt_dedup_one_page() marks the posting list that is likely to get its
LP_DEAD bit set soon with a new hint bit -- the LP_REDIRECT bit. Here,
LP_REDIRECT means "somebody is probably going to set the LP_DEAD bit
on this posting list tuple very soon". That way, if nobody actually
does set the LP_DEAD bit, _bt_vacuum_one_page() still has options. If
it goes to the heap and finds the latest version, and that that
version is visible to any possible MVCC snapshot, that means that it's
safe to kill all the other versions, even without the LP_DEAD bit set
-- this is a unique index. So, it often gets to kill lots of extra
garbage that it wouldn't get to kill, preventing page splits. The cost
is pretty low: the risk that the single heap page check will be a
wasted effort. (Of course, we still have to visit the heap pages of
things that we go on to kill, to get the XIDs to generate recovery
conflicts -- the important point is that we only need to visit one
heap page in _bt_vacuum_one_page(), to *decide* if it's possible to do
all this -- cases that don't benefit at all also don't pay very much.)

I don't think that we need to do either of these two other things to
justify committing the patch with unique index support. But, teaching
nbtree a little bit about versioning like this could work rather well
in practice, without it really mattering that it will get the wrong
idea at times (e.g. when transactions abort a lot). This all seems
promising as a family of techniques for unique indexes. It's worth
doing extra work if it might delay a page split, since delaying can
actually fully prevent page splits that are mostly caused by non-HOT
updates. Most primary key indexes are serial PKs, or some kind of
counter. Postgres should mostly do page splits for these kinds of
primary keys indexes in the places that make sense based on the
dataset, and not because of "write amplification".

--
Peter Geoghegan

Attachments:

v17-0005-Reintroduce-unique-index-support.patchapplication/octet-stream; name=v17-0005-Reintroduce-unique-index-support.patchDownload

From 4884272644e6772a3f5ae9d87fae2236e5ac1f01 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 23 Sep 2019 20:28:20 -0700
Subject: [PATCH v17 5/5] Reintroduce unique index support

---
 src/backend/access/nbtree/nbtinsert.c | 70 +++++++++++++++++++++++----
 1 file changed, 60 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index eb9655bb78..1912fe9ee4 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -434,15 +434,36 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 			if (!ItemIdIsDead(curitemid))
 			{
 				ItemPointerData htid;
+				bool		posting;
 				bool		all_dead;
+				bool		posting_all_dead;
+				int			npost;
+
 
 				if (_bt_compare(rel, itup_key, page, offset) != 0)
 					break;		/* we're past all the equal tuples */
 
 				/* okay, we gotta fetch the heap tuple ... */
 				curitup = (IndexTuple) PageGetItem(page, curitemid);
-				Assert(!BTreeTupleIsPosting(curitup));
-				htid = curitup->t_tid;
+
+				if (!BTreeTupleIsPosting(curitup))
+				{
+					htid = curitup->t_tid;
+					posting = false;
+					posting_all_dead = true;
+				}
+				else
+				{
+					posting = true;
+					/* Initial assumption */
+					posting_all_dead = true;
+				}
+
+				npost = 0;
+doposttup:
+				if (posting)
+					htid = *BTreeTupleGetPostingN(curitup, npost);
+
 
 				/*
 				 * If we are doing a recheck, we expect to find the tuple we
@@ -453,6 +474,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					ItemPointerCompare(&htid, &itup->t_tid) == 0)
 				{
 					found = true;
+					posting_all_dead = false;
+					if (posting)
+						goto nextpost;
 				}
 
 				/*
@@ -518,8 +542,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					 * not part of this chain because it had a different index
 					 * entry.
 					 */
-					htid = itup->t_tid;
-					if (table_index_fetch_tuple_check(heapRel, &htid,
+					if (table_index_fetch_tuple_check(heapRel, &itup->t_tid,
 													  SnapshotSelf, NULL))
 					{
 						/* Normal case --- it's still live */
@@ -577,7 +600,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 													RelationGetRelationName(rel))));
 					}
 				}
-				else if (all_dead)
+				else if (all_dead && !posting)
 				{
 					/*
 					 * The conflicting tuple (or whole HOT chain) is dead to
@@ -596,6 +619,35 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					else
 						MarkBufferDirtyHint(insertstate->buf, true);
 				}
+				else if (posting)
+				{
+nextpost:
+					if (!all_dead)
+						posting_all_dead = false;
+
+					/* Iterate over single posting list tuple */
+					npost++;
+					if (npost < BTreeTupleGetNPosting(curitup))
+						goto doposttup;
+
+					/*
+					 * Mark posting tuple dead if all hot chains whose root is
+					 * contained in posting tuple have tuples that are all
+					 * dead
+					 */
+					if (posting_all_dead)
+					{
+						ItemIdMarkDead(curitemid);
+						opaque->btpo_flags |= BTP_HAS_GARBAGE;
+
+						if (nbuf != InvalidBuffer)
+							MarkBufferDirtyHint(nbuf, true);
+						else
+							MarkBufferDirtyHint(insertstate->buf, true);
+					}
+
+					/* Move on to next index tuple */
+				}
 			}
 		}
 
@@ -770,7 +822,7 @@ _bt_findinsertloc(Relation rel,
 				insertstate->bounds_valid = false;
 			}
 
-			if (!checkingunique && PageGetFreeSpace(page) < insertstate->itemsz)
+			if (PageGetFreeSpace(page) < insertstate->itemsz)
 			{
 				_bt_dedup_one_page(rel, insertstate->buf, heapRel,
 								   insertstate->itemsz);
@@ -2615,12 +2667,10 @@ _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
 	Size		pagesaving = 0;
 
 	/*
-	 * Don't use deduplication for indexes with INCLUDEd columns and unique
-	 * indexes
+	 * Don't use deduplication for indexes with INCLUDEd columns
 	 */
 	deduplicate = (IndexRelationGetNumberOfKeyAttributes(rel) ==
-				   IndexRelationGetNumberOfAttributes(rel) &&
-				   !rel->rd_index->indisunique);
+				   IndexRelationGetNumberOfAttributes(rel));
 	if (!deduplicate)
 		return;
 
-- 
2.17.1

#95

Anastasia Lubennikova

a.lubennikova@postgrespro.ru

over 6 years ago

In reply to: Peter Geoghegan (#93)

1 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

24.09.2019 3:13, Peter Geoghegan wrote:

On Wed, Sep 18, 2019 at 7:25 PM Peter Geoghegan <pg@bowt.ie> wrote:

I attach version 16. This revision merges your recent work on WAL
logging with my recent work on simplifying _bt_dedup_one_page(). See
my e-mail from earlier today for details.

I attach version 17. This version has changes that are focussed on
further polishing certain things, including fixing some minor bugs. It
seemed worth creating a new version for that. (I didn't get very far
with the space utilization stuff I talked about, so no changes there.)

Attached is v18. In this version bt_dedup_one_page() is refactored so that:
- no temp page is used, all updates are applied to the original page.
- each posting tuple wal logged separately.
This also allowed to simplify btree_xlog_dedup significantly.

Another infrastructure thing that the patch needs to handle to be committable:

We still haven't added an "off" switch to deduplication, which seems
necessary. I suppose that this should look like GIN's "fastupdate"
storage parameter. It's not obvious how to do this in a way that's
easy to work with, though. Maybe we could do something like copy GIN's
GinGetUseFastUpdate() macro, but the situation with nbtree is actually
quite different. There are two questions for nbtree when it comes to
deduplication within an inde: 1) Does the user want to use
deduplication, because that will help performance?, and 2) Is it
safe/possible to use deduplication at all?

I'll send another version with dedup option soon.

I think that we should probably stash this information (deduplication
is both possible and safe) in the metapage. Maybe we can copy it over
to our insertion scankey, just like the "heapkeyspace" field -- that
information also comes from the metapage (it's based on the nbtree
version). The "heapkeyspace" field is a bit ugly, so maybe we
shouldn't go further by adding something similar, but I don't see any
great alternative right now.

Why is it necessary to save this information somewhere but rel->rd_options,
while we can easily access this field from _bt_findinsertloc() and
_bt_load().

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

v18-0001-Add-deduplication-to-nbtree.patchtext/x-patch; name=v18-0001-Add-deduplication-to-nbtree.patchDownload

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d67..d65e2a7 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -145,6 +145,7 @@ static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 									  bool tupleIsAlive, void *checkstate);
 static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 									 IndexTuple itup);
+static inline IndexTuple bt_posting_logical_tuple(IndexTuple itup, int n);
 static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 											   OffsetNumber offset);
@@ -419,12 +420,13 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		/*
 		 * Size Bloom filter based on estimated number of tuples in index,
 		 * while conservatively assuming that each block must contain at least
-		 * MaxIndexTuplesPerPage / 5 non-pivot tuples.  (Non-leaf pages cannot
-		 * contain non-pivot tuples.  That's okay because they generally make
-		 * up no more than about 1% of all pages in the index.)
+		 * MaxPostingIndexTuplesPerPage / 3 "logical" tuples.  heapallindexed
+		 * verification fingerprints posting list heap TIDs as plain non-pivot
+		 * tuples, complete with index keys.  This allows its heap scan to
+		 * behave as if posting lists do not exist.
 		 */
 		total_pages = RelationGetNumberOfBlocks(rel);
-		total_elems = Max(total_pages * (MaxIndexTuplesPerPage / 5),
+		total_elems = Max(total_pages * (MaxPostingIndexTuplesPerPage / 3),
 						  (int64) state->rel->rd_rel->reltuples);
 		/* Random seed relies on backend srandom() call to avoid repetition */
 		seed = random();
@@ -924,6 +926,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer	scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -994,29 +997,73 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
 			char	   *itid,
 					   *htid;
+			ItemPointer tid = BTreeTupleGetHeapTID(itup);
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+							ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("could not find tuple using search from root page in index \"%s\"",
 							RelationGetRelationName(state->rel)),
-					 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+					 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
 										itid, htid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is actually a posting list, make sure posting list TIDs
+		 * are in order.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer		current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+				current = BTreeTupleGetPostingN(itup, i);
+
+				if (ItemPointerCompare(current, &last) <= 0)
+				{
+					char	   *itid,
+							   *htid;
+
+					itid = psprintf("(%u,%u)", state->targetblock, offset);
+					htid = psprintf("(%u,%u)",
+									ItemPointerGetBlockNumberNoCheck(current),
+									ItemPointerGetOffsetNumberNoCheck(current));
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+												itid, htid,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1074,12 +1121,32 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				/* Fingerprint all elements as distinct "logical" tuples */
+				for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					IndexTuple	logtuple;
+
+					logtuple = bt_posting_logical_tuple(itup, i);
+					norm = bt_normalize_tuple(state, logtuple);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != logtuple)
+						pfree(norm);
+					pfree(logtuple);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
@@ -1087,7 +1154,8 @@ bt_target_page_check(BtreeCheckState *state)
 		 *
 		 * If there is a high key (if this is not the rightmost page on its
 		 * entire level), check that high key actually is upper bound on all
-		 * page items.
+		 * page items.  If this is a posting list tuple, we'll need to set
+		 * scantid to be highest TID in posting list.
 		 *
 		 * We prefer to check all items against high key rather than checking
 		 * just the last and trusting that the operator class obeys the
@@ -1127,6 +1195,9 @@ bt_target_page_check(BtreeCheckState *state)
 		 * tuple. (See also: "Notes About Data Representation" in the nbtree
 		 * README.)
 		 */
+		scantid = skey->scantid;
+		if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+			skey->scantid = BTreeTupleGetMaxTID(itup);
 		if (!P_RIGHTMOST(topaque) &&
 			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
 			  invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1221,7 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
+		skey->scantid = scantid;
 
 		/*
 		 * * Item order check *
@@ -1164,11 +1236,13 @@ bt_target_page_check(BtreeCheckState *state)
 					   *htid,
 					   *nitid,
 					   *nhtid;
+			ItemPointer tid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetHeapTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 			nitid = psprintf("(%u,%u)", state->targetblock,
 							 OffsetNumberNext(offset));
 
@@ -1177,9 +1251,11 @@ bt_target_page_check(BtreeCheckState *state)
 										  state->target,
 										  OffsetNumberNext(offset));
 			itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+			tid = BTreeTupleGetHeapTID(itup);
 			nhtid = psprintf("(%u,%u)",
-							 ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							 ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							 ItemPointerGetBlockNumberNoCheck(tid),
+							 ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1265,10 @@ bt_target_page_check(BtreeCheckState *state)
 										"higher index tid=%s (points to %s tid=%s) "
 										"page lsn=%X/%X.",
 										itid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										htid,
 										nitid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										nhtid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
@@ -1953,10 +2029,10 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
  * verification.  In particular, it won't try to normalize opclass-equal
  * datums with potentially distinct representations (e.g., btree/numeric_ops
  * index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though.  For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation.  Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
  */
 static IndexTuple
 bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -1969,6 +2045,9 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	IndexTuple	reformed;
 	int			i;
 
+	/* Caller should only pass "logical" non-pivot tuples here */
+	Assert(!BTreeTupleIsPosting(itup) && !BTreeTupleIsPivot(itup));
+
 	/* Easy case: It's immediately clear that tuple has no varlena datums */
 	if (!IndexTupleHasVarwidths(itup))
 		return itup;
@@ -2032,6 +2111,30 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 }
 
 /*
+ * Produce palloc()'d "logical" tuple for nth posting list entry.
+ *
+ * In general, deduplication is not supposed to change the logical contents of
+ * an index.  Multiple logical index tuples are folded together into one
+ * physical posting list index tuple when convenient.
+ *
+ * heapallindexed verification must normalize-away this variation in
+ * representation by converting posting list tuples into two or more "logical"
+ * tuples.  Each logical tuple must be fingerprinted separately -- there must
+ * be one logical tuple for each corresponding Bloom filter probe during the
+ * heap scan.
+ *
+ * Note: Caller needs to call bt_normalize_tuple() with returned tuple.
+ */
+static inline IndexTuple
+bt_posting_logical_tuple(IndexTuple itup, int n)
+{
+	Assert(BTreeTupleIsPosting(itup));
+
+	/* Returns non-posting-list tuple */
+	return BTreeFormPostingTuple(itup, BTreeTupleGetPostingN(itup, n), 1);
+}
+
+/*
  * Search for itup in index, starting from fast root page.  itup must be a
  * non-pivot tuple.  This is only supported with heapkeyspace indexes, since
  * we rely on having fully unique keys to find a match with only a single
@@ -2087,6 +2190,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		insertstate.itup = itup;
 		insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 		insertstate.itup_key = key;
+		insertstate.postingoff = 0;
 		insertstate.bounds_valid = false;
 		insertstate.buf = lbuf;
 
@@ -2094,7 +2198,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		offnum = _bt_binsrch_insert(state->rel, &insertstate);
 		/* Compare first >= matching item on leaf page, if any */
 		page = BufferGetPage(lbuf);
+		/* Should match on first heap TID when tuple has a posting list */
 		if (offnum <= PageGetMaxOffsetNumber(page) &&
+			insertstate.postingoff <= 0 &&
 			_bt_compare(state->rel, key, page, offnum) == 0)
 			exists = true;
 		_bt_relbuf(state->rel, lbuf);
@@ -2560,14 +2666,18 @@ static inline ItemPointer
 BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
 							bool nonpivot)
 {
-	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	ItemPointer result;
 	BlockNumber targetblock = state->targetblock;
 
-	if (result == NULL && nonpivot)
+	/* Shouldn't be called with heapkeyspace index */
+	Assert(state->heapkeyspace);
+	if (BTreeTupleIsPivot(itup) == nonpivot)
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
 						targetblock, RelationGetRelationName(state->rel))));
 
+	result = BTreeTupleGetHeapTID(itup);
+
 	return result;
 }
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 2599b5d..6e1dc59 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -276,6 +276,10 @@ BuildIndexValueDescription(Relation indexRelation,
 /*
  * Get the latestRemovedXid from the table entries pointed at by the index
  * tuples being deleted.
+ *
+ * Note: index access methods that don't consistently use the standard
+ * IndexTuple + heap TID item pointer representation will need to provide
+ * their own version of this function.
  */
 TransactionId
 index_compute_xid_horizon_for_tuples(Relation irel,
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6db203e..54cb9db 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
 like a hint bit for a heap tuple), but physically removing tuples requires
 exclusive lock.  In the current code we try to remove LP_DEAD tuples when
 we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already).  Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)
 
 This leaves the index in a state where it has no entry for a dead tuple
 that still exists in the heap.  This is not a problem for the current
@@ -710,6 +713,75 @@ the fallback strategy assumes that duplicates are mostly inserted in
 ascending heap TID order.  The page is split in a way that leaves the left
 half of the page mostly full, and the right half of the page mostly empty.
 
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits.  Deduplication alters
+the physical representation of tuples without changing the logical contents
+of the index, and without adding overhead to read queries.  Non-pivot
+tuples are folded together into a single physical tuple with a posting list
+(a simple array of heap TIDs with the standard item pointer format).
+Deduplication is always applied lazily, at the point where it would
+otherwise be necessary to perform a page split.  It occurs only when
+LP_DEAD items have been removed, as our last line of defense against
+splitting a leaf page.  We can set the LP_DEAD bit with posting list
+tuples, though only when all table tuples are known dead. (Bitmap scans
+cannot perform LP_DEAD bit setting, and are the common case with indexes
+that contain lots of duplicates, so this downside is considered
+acceptable.)
+
+Large groups of logical duplicates tend to appear together on the same leaf
+page due to the special duplicate logic used when choosing a split point.
+This facilitates lazy/dynamic deduplication.  Deduplication can reliably
+deduplicate a large localized group of duplicates before it can span
+multiple leaf pages.  Posting list tuples are subject to the same 1/3 of a
+page restriction as any other tuple.
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists.  A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple.  An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication.  Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple.  When the incoming
+tuple really does overlap with an existing posting list, a posting list
+split is performed.  Posting list splits work in a way that more or less
+preserves the illusion that all incoming tuples do not need to be merged
+with any existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list.  The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list.  The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list.  We make
+space in the posting list by removing the heap TID that became the new
+item.  The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists.  Also, overall
+space utilization is improved by keeping existing posting lists large.
+
+The representation of posting lists is identical to the posting lists used
+by GIN, so it would be straightforward to apply GIN's varbyte encoding
+compression scheme to individual posting lists.  Posting list compression
+would break the assumptions made by posting list splits about page space
+accounting, though, so it's not clear how compression could be integrated
+with nbtree.  Besides, posting list compression does not offer a compelling
+trade-off for nbtree, since in general nbtree is optimized for consistent
+performance with many concurrent readers and writers.  A major goal of
+nbtree's lazy approach to deduplication is to limit the performance impact
+of deduplication with random updates.  Even concurrent append-only inserts
+of the same key value will tend to have inserts of individual index tuples
+in an order that doesn't quite match heap TID order.  In general, delaying
+deduplication avoids many unnecessary posting list splits, and minimizes
+page level fragmentation.
+
 Notes About Data Representation
 -------------------------------
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b84bf1c..c81f545 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,21 +47,26 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   BTStack stack,
 						   IndexTuple itup,
 						   OffsetNumber newitemoff,
+						   int postingoff,
 						   bool split_only_page);
 static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 						Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
-						IndexTuple newitem);
+						IndexTuple newitem, IndexTuple orignewitem,
+						IndexTuple nposting, OffsetNumber postingoff);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 							  BTStack stack, bool is_root, bool is_only);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 						 OffsetNumber itup_off);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+							   Size newitemsz);
 
 /*
  *	_bt_doinsert() -- Handle insertion of a single index tuple in the tree.
  *
  *		This routine is called by the public interface routine, btinsert.
- *		By here, itup is filled in, including the TID.
+ *		By here, itup is filled in, including the TID.  Caller should be
+ *		prepared for us to scribble on 'itup'.
  *
  *		If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
  *		will allow duplicates.  Otherwise (UNIQUE_CHECK_YES or
@@ -123,6 +128,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	/* PageAddItem will MAXALIGN(), but be consistent */
 	insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 	insertstate.itup_key = itup_key;
+	insertstate.postingoff = 0;
 	insertstate.bounds_valid = false;
 	insertstate.buf = InvalidBuffer;
 
@@ -300,7 +306,7 @@ top:
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
 		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
-					   itup, newitemoff, false);
+					   itup, newitemoff, insertstate.postingoff, false);
 	}
 	else
 	{
@@ -435,6 +441,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 
 				/* okay, we gotta fetch the heap tuple ... */
 				curitup = (IndexTuple) PageGetItem(page, curitemid);
+				Assert(!BTreeTupleIsPosting(curitup));
 				htid = curitup->t_tid;
 
 				/*
@@ -689,6 +696,7 @@ _bt_findinsertloc(Relation rel,
 	BTScanInsert itup_key = insertstate->itup_key;
 	Page		page = BufferGetPage(insertstate->buf);
 	BTPageOpaque lpageop;
+	OffsetNumber location;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -751,13 +759,23 @@ _bt_findinsertloc(Relation rel,
 
 		/*
 		 * If the target page is full, see if we can obtain enough space by
-		 * erasing LP_DEAD items
+		 * erasing LP_DEAD items.  If that doesn't work out, and if the index
+		 * isn't a unique index, try deduplication.
 		 */
-		if (PageGetFreeSpace(page) < insertstate->itemsz &&
-			P_HAS_GARBAGE(lpageop))
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
 		{
-			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
-			insertstate->bounds_valid = false;
+			if (P_HAS_GARBAGE(lpageop))
+			{
+				_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+				insertstate->bounds_valid = false;
+			}
+
+			if (!checkingunique && PageGetFreeSpace(page) < insertstate->itemsz)
+			{
+				_bt_dedup_one_page(rel, insertstate->buf, heapRel,
+								   insertstate->itemsz);
+				insertstate->bounds_valid = false;	/* paranoia */
+			}
 		}
 	}
 	else
@@ -839,7 +857,31 @@ _bt_findinsertloc(Relation rel,
 	Assert(P_RIGHTMOST(lpageop) ||
 		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
 
-	return _bt_binsrch_insert(rel, insertstate);
+	location = _bt_binsrch_insert(rel, insertstate);
+
+	/*
+	 * Insertion is not prepared for the case where an LP_DEAD posting list
+	 * tuple must be split.  In the unlikely event that this happens, call
+	 * _bt_dedup_one_page() to force it to kill all LP_DEAD items.
+	 */
+	if (unlikely(insertstate->postingoff == -1))
+	{
+		_bt_dedup_one_page(rel, insertstate->buf, heapRel, 0);
+		Assert(!P_HAS_GARBAGE(lpageop));
+
+		/* Must reset insertstate ahead of new _bt_binsrch_insert() call */
+		insertstate->bounds_valid = false;
+		insertstate->postingoff = 0;
+		location = _bt_binsrch_insert(rel, insertstate);
+
+		/*
+		 * Might still have to split some other posting list now, but that
+		 * should never be LP_DEAD
+		 */
+		Assert(insertstate->postingoff >= 0);
+	}
+
+	return location;
 }
 
 /*
@@ -900,15 +942,81 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
 	insertstate->bounds_valid = false;
 }
 
+/*
+ * Form a new posting list during a posting split.
+ *
+ * If caller determines that its new tuple 'newitem' is a duplicate with a
+ * heap TID that falls inside the range of an existing posting list tuple
+ * 'oposting', it must generate a new posting tuple to replace the original.
+ * The new posting list is guaranteed to be the same size as the original.
+ * Caller must also change newitem to have the heap TID of the rightmost TID
+ * in the original posting list.  Both steps are always handled by calling
+ * here.
+ *
+ * Returns new posting list palloc()'d in caller's context.  Also modifies
+ * caller's newitem to contain final/effective heap TID, which is what caller
+ * actually inserts on the page.
+ *
+ * Exported for use by recovery.  Note that recovery path must recreate the
+ * same version of newitem that is passed here on the primary, even though
+ * that differs from the final newitem actually added to the page.  This
+ * optimization avoids explicit WAL-logging of entire posting lists, which
+ * tend to be rather large.
+ */
+IndexTuple
+_bt_posting_split(IndexTuple newitem, IndexTuple oposting,
+				  OffsetNumber postingoff)
+{
+	int			nhtids;
+	char	   *replacepos;
+	char	   *rightpos;
+	Size		nbytes;
+	IndexTuple	nposting;
+
+	Assert(BTreeTupleIsPosting(oposting));
+	nhtids = BTreeTupleGetNPosting(oposting);
+	Assert(postingoff < nhtids);
+
+	nposting = CopyIndexTuple(oposting);
+	replacepos = (char *) BTreeTupleGetPostingN(nposting, postingoff);
+	rightpos = replacepos + sizeof(ItemPointerData);
+	nbytes = (nhtids - postingoff - 1) * sizeof(ItemPointerData);
+
+	/*
+	 * Move item pointers in posting list to make a gap for the new item's
+	 * heap TID (shift TIDs one place to the right, losing original rightmost
+	 * TID).
+	 */
+	memmove(rightpos, replacepos, nbytes);
+
+	/*
+	 * Fill the gap with the TID of the new item.
+	 */
+	ItemPointerCopy(&newitem->t_tid, (ItemPointer) replacepos);
+
+	/*
+	 * Copy original (not new original) posting list's last TID into new item
+	 */
+	ItemPointerCopy(BTreeTupleGetPostingN(oposting, nhtids - 1),
+					&newitem->t_tid);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxTID(nposting),
+							  BTreeTupleGetHeapTID(newitem)) < 0);
+	Assert(BTreeTupleGetNPosting(nposting) == BTreeTupleGetNPosting(oposting));
+
+	return nposting;
+}
+
 /*----------
  *	_bt_insertonpg() -- Insert a tuple on a particular page in the index.
  *
  *		This recursive procedure does the following things:
  *
+ *			+  if necessary, splits an existing posting list on page.
+ *			   This is only needed when 'postingoff' is non-zero.
  *			+  if necessary, splits the target page, using 'itup_key' for
  *			   suffix truncation on leaf pages (caller passes NULL for
  *			   non-leaf pages).
- *			+  inserts the tuple.
+ *			+  inserts the new tuple (could be from split posting list).
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
  *			   right using information stored in the parent stack).
@@ -918,7 +1026,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		On entry, we must have the correct buffer in which to do the
  *		insertion, and the buffer must be pinned and write-locked.  On return,
- *		we will have dropped both the pin and the lock on the buffer.
+ *		we will have dropped both the pin and the lock on the buffer.  Caller
+ *		should be prepared for us to scribble on 'itup'.
  *
  *		This routine only performs retail tuple insertions.  'itup' should
  *		always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +1045,15 @@ _bt_insertonpg(Relation rel,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
+			   int postingoff,
 			   bool split_only_page)
 {
 	Page		page;
 	BTPageOpaque lpageop;
 	Size		itemsz;
+	IndexTuple	oposting;
+	IndexTuple	origitup = NULL;
+	IndexTuple	nposting = NULL;
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1067,8 @@ _bt_insertonpg(Relation rel,
 	Assert(P_ISLEAF(lpageop) ||
 		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
+	/* retail insertions of posting list tuples are disallowed */
+	Assert(!BTreeTupleIsPosting(itup));
 
 	/* The caller should've finished any incomplete splits already. */
 	if (P_INCOMPLETE_SPLIT(lpageop))
@@ -965,6 +1080,46 @@ _bt_insertonpg(Relation rel,
 								 * need to be consistent */
 
 	/*
+	 * Do we need to split an existing posting list item?
+	 */
+	if (postingoff != 0)
+	{
+		ItemId		itemid = PageGetItemId(page, newitemoff);
+
+		/*
+		 * The new tuple is a duplicate with a heap TID that falls inside the
+		 * range of an existing posting list tuple, so split posting list.
+		 *
+		 * Posting list splits always replace some existing TID in the posting
+		 * list with the new item's heap TID (based on a posting list offset
+		 * from caller) by removing rightmost heap TID from posting list.  The
+		 * new item's heap TID is swapped with that rightmost heap TID, almost
+		 * as if the tuple inserted never overlapped with a posting list in
+		 * the first place.  This allows the insertion and page split code to
+		 * have minimal special case handling of posting lists.
+		 *
+		 * The only extra handling required is to overwrite the original
+		 * posting list with nposting, which is guaranteed to be the same size
+		 * as the original, keeping the page space accounting simple.  This
+		 * takes place in either the page insert or page split critical
+		 * section.
+		 */
+		Assert(P_ISLEAF(lpageop));
+		Assert(!ItemIdIsDead(itemid));
+		Assert(postingoff > 0);
+		oposting = (IndexTuple) PageGetItem(page, itemid);
+
+		/* save a copy of itup with unchanged TID to write it into xlog record */
+		origitup = CopyIndexTuple(itup);
+		nposting = _bt_posting_split(itup, oposting, postingoff);
+
+		Assert(BTreeTupleGetNPosting(nposting) ==
+			   BTreeTupleGetNPosting(oposting));
+		/* Alter new item offset, since effective new item changed */
+		newitemoff = OffsetNumberNext(newitemoff);
+	}
+
+	/*
 	 * Do we need to split the page to fit the item on it?
 	 *
 	 * Note: PageGetFreeSpace() subtracts sizeof(ItemIdData) from its result,
@@ -996,7 +1151,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+						 origitup, nposting, postingoff);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1075,6 +1231,18 @@ _bt_insertonpg(Relation rel,
 			elog(PANIC, "failed to add new item to block %u in index \"%s\"",
 				 itup_blkno, RelationGetRelationName(rel));
 
+		if (nposting)
+		{
+			/*
+			 * Posting list split requires an in-place update of the existing
+			 * posting list
+			 */
+			Assert(P_ISLEAF(lpageop));
+			Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+				   MAXALIGN(IndexTupleSize(nposting)));
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+		}
+
 		MarkBufferDirty(buf);
 
 		if (BufferIsValid(metabuf))
@@ -1116,6 +1284,7 @@ _bt_insertonpg(Relation rel,
 			XLogRecPtr	recptr;
 
 			xlrec.offnum = itup_off;
+			xlrec.postingoff = postingoff;
 
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
@@ -1152,7 +1321,19 @@ _bt_insertonpg(Relation rel,
 			}
 
 			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
-			XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+
+			/*
+			 * We always write newitem to the page, but when there is an
+			 * original newitem due to a posting list split then we log the
+			 * original item instead.  REDO routine must reconstruct the final
+			 * newitem at the same time it reconstructs nposting.
+			 */
+			if (postingoff == 0)
+				XLogRegisterBufData(0, (char *) itup,
+									IndexTupleSize(itup));
+			else
+				XLogRegisterBufData(0, (char *) origitup,
+									IndexTupleSize(origitup));
 
 			recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
@@ -1194,6 +1375,13 @@ _bt_insertonpg(Relation rel,
 			_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
 			RelationSetTargetBlock(rel, cachedBlock);
 	}
+
+	/* be tidy */
+	if (postingoff != 0)
+	{
+		pfree(nposting);
+		pfree(origitup);
+	}
 }
 
 /*
@@ -1209,12 +1397,25 @@ _bt_insertonpg(Relation rel,
  *		This function will clear the INCOMPLETE_SPLIT flag on it, and
  *		release the buffer.
  *
+ *		orignewitem, nposting, and postingoff are needed when an insert of
+ *		orignewitem results in both a posting list split and a page split.
+ *		newitem and nposting are replacements for orignewitem and the
+ *		existing posting list on the page respectively.  These extra
+ *		posting list split details are used here in the same way as they
+ *		are used in the more common case where a posting list split does
+ *		not coincide with a page split.  We need to deal with posting list
+ *		splits directly in order to ensure that everything that follows
+ *		from the insert of orignewitem is handled as a single atomic
+ *		operation (though caller's insert of a new pivot/downlink into
+ *		parent page will still be a separate operation).
+ *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
  */
 static Buffer
 _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+		  IndexTuple orignewitem, IndexTuple nposting, OffsetNumber postingoff)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1236,6 +1437,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	OffsetNumber firstright;
 	OffsetNumber maxoff;
 	OffsetNumber i;
+	OffsetNumber replacepostingoff = InvalidOffsetNumber;
 	bool		newitemonleft,
 				isleaf;
 	IndexTuple	lefthikey;
@@ -1243,6 +1445,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
 	/*
+	 * Determine offset number of existing posting list on page when a split
+	 * of a posting list needs to take place as the page is split
+	 */
+	if (nposting != NULL)
+		replacepostingoff = OffsetNumberPrev(newitemoff);
+
+	/*
 	 * origpage is the original page to be split.  leftpage is a temporary
 	 * buffer that receives the left-sibling data, which will be copied back
 	 * into origpage on success.  rightpage is the new page that will receive
@@ -1273,6 +1482,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	 * newitemoff == firstright.  In all other cases it's clear which side of
 	 * the split every tuple goes on from context.  newitemonleft is usually
 	 * (but not always) redundant information.
+	 *
+	 * Note: In theory, the split point choice logic should operate against a
+	 * version of the page that already replaced the posting list at offset
+	 * replacepostingoff with nposting where applicable.  We don't bother with
+	 * that, though.  Both versions of the posting list must be the same size,
+	 * and both will have the same base tuple key values, so split point
+	 * choice is never affected.
 	 */
 	firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
 								  newitem, &newitemonleft);
@@ -1340,6 +1556,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, firstright);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		/* Behave as if origpage posting list has already been swapped */
+		if (firstright == replacepostingoff)
+			item = nposting;
 	}
 
 	/*
@@ -1373,6 +1592,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
 			itemid = PageGetItemId(origpage, lastleftoff);
 			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+			/* Behave as if origpage posting list has already been swapped */
+			if (lastleftoff == replacepostingoff)
+				lastleft = nposting;
 		}
 
 		Assert(lastleft != item);
@@ -1480,8 +1702,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 
+		/*
+		 * did caller pass new replacement posting list tuple due to posting
+		 * list split?
+		 */
+		if (i == replacepostingoff)
+		{
+			/*
+			 * swap origpage posting list with post-posting-list-split version
+			 * from caller
+			 */
+			Assert(isleaf);
+			Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+			item = nposting;
+		}
+
 		/* does new item belong before this one? */
-		if (i == newitemoff)
+		else if (i == newitemoff)
 		{
 			if (newitemonleft)
 			{
@@ -1650,8 +1887,12 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		XLogRecPtr	recptr;
 
 		xlrec.level = ropaque->btpo.level;
+		/* See comments below on newitem, orignewitem, and posting lists */
 		xlrec.firstright = firstright;
 		xlrec.newitemoff = newitemoff;
+		xlrec.postingoff = InvalidOffsetNumber;
+		if (replacepostingoff < firstright)
+			xlrec.postingoff = postingoff;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1670,11 +1911,46 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		 * because it's included with all the other items on the right page.)
 		 * Show the new item as belonging to the left page buffer, so that it
 		 * is not stored if XLogInsert decides it needs a full-page image of
-		 * the left page.  We store the offset anyway, though, to support
-		 * archive compression of these records.
+		 * the left page.  We always store newitemoff in record, though.
+		 *
+		 * The details are often slightly different for page splits that
+		 * coincide with a posting list split.  If both the replacement
+		 * posting list and newitem go on the right page, then we don't need
+		 * to log anything extra, just like the simple !newitemonleft
+		 * no-posting-split case (postingoff isn't set in the WAL record, so
+		 * recovery can't even tell the difference).  Otherwise, we set
+		 * postingoff and log orignewitem instead of newitem, despite having
+		 * actually inserted newitem.  Recovery must reconstruct nposting and
+		 * newitem by repeating the actions of our caller (i.e. by passing
+		 * original posting list and orignewitem to _bt_posting_split()).
+		 *
+		 * Note: It's possible that our page split point is the point that
+		 * makes the posting list lastleft and newitem firstright.  This is
+		 * the only case where we log orignewitem despite newitem going on the
+		 * right page.  If XLogInsert decides that it can omit orignewitem due
+		 * to logging a full-page image of the left page, everything still
+		 * works out, since recovery only needs to log orignewitem for items
+		 * on the left page (just like the regular newitem-logged case).
 		 */
-		if (newitemonleft)
-			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+		if (newitemonleft || xlrec.postingoff != InvalidOffsetNumber)
+		{
+			if (xlrec.postingoff == InvalidOffsetNumber)
+			{
+				/* Must WAL-log newitem, since it's on left page */
+				Assert(newitemonleft);
+				Assert(orignewitem == NULL && nposting == NULL);
+				XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+			}
+			else
+			{
+				/* Must WAL-log orignewitem following posting list split */
+				Assert(newitemonleft || firstright == newitemoff);
+				Assert(ItemPointerCompare(&orignewitem->t_tid,
+										  &newitem->t_tid) < 0);
+				XLogRegisterBufData(0, (char *) orignewitem,
+									MAXALIGN(IndexTupleSize(orignewitem)));
+			}
+		}
 
 		/* Log the left page's new high key */
 		itemid = PageGetItemId(origpage, P_HIKEY);
@@ -1834,7 +2110,7 @@ _bt_insert_parent(Relation rel,
 
 		/* Recursively insert into the parent */
 		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
-					   new_item, stack->bts_offset + 1,
+					   new_item, stack->bts_offset + 1, 0,
 					   is_only);
 
 		/* be tidy */
@@ -2304,6 +2580,405 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * Note: if we didn't find any LP_DEAD items, then the page's
 	 * BTP_HAS_GARBAGE hint bit is falsely set.  We do not bother expending a
 	 * separate write to clear it, however.  We will clear it when we split
-	 * the page.
+	 * the page (or when deduplication runs).
 	 */
 }
+
+/*
+ * Try to deduplicate items to free some space.  If we don't proceed with
+ * deduplication, buffer will contain old state of the page.
+ *
+ * 'itemsz' is the size of the inserter caller's incoming/new tuple, not
+ * including line pointer overhead.  This is the amount of space we'll need to
+ * free in order to let caller avoid splitting the page.
+ *
+ * This function should be called after LP_DEAD items were removed by
+ * _bt_vacuum_one_page() to prevent a page split.  (It's possible that we'll
+ * have to kill additional LP_DEAD items, but that should be rare.)
+ */
+static void
+_bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+				   Size newitemsz)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buffer);
+	BTPageOpaque oopaque;
+	bool		deduplicate;
+	BTDedupState *state = NULL;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+	OffsetNumber deletable[MaxIndexTuplesPerPage];
+	int			ndeletable = 0;
+	Size		pagesaving = 0;
+
+	/*
+	 * Don't use deduplication for indexes with INCLUDEd columns and unique
+	 * indexes
+	 */
+	deduplicate = (IndexRelationGetNumberOfKeyAttributes(rel) ==
+				   IndexRelationGetNumberOfAttributes(rel) &&
+				   !rel->rd_index->indisunique);
+	if (!deduplicate)
+		return;
+
+	oopaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	/* init deduplication state needed to build posting tuples */
+	state = (BTDedupState *) palloc(sizeof(BTDedupState));
+	state->deduplicate = true;
+
+	state->maxitemsize = BTMaxItemSize(page);
+	/* Metadata about current pending posting list */
+	state->htids = NULL;
+	state->nhtids = 0;
+	state->nitems = 0;
+	state->alltupsize = 0;
+	/* Metadata about based tuple of current pending posting list */
+	state->base = NULL;
+	state->baseoff = InvalidOffsetNumber;
+	state->basetupsize = 0;
+
+	minoff = P_FIRSTDATAKEY(oopaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/*
+	 * Delete dead tuples if any. We cannot simply skip them in the cycle
+	 * below, because it's necessary to generate special Xlog record
+	 * containing such tuples to compute latestRemovedXid on a standby server
+	 * later.
+	 *
+	 * This should not affect performance, since it only can happen in a rare
+	 * situation when BTP_HAS_GARBAGE flag was not set and _bt_vacuum_one_page
+	 * was not called, or _bt_vacuum_one_page didn't remove all dead items.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+
+		if (ItemIdIsDead(itemid))
+			deletable[ndeletable++] = offnum;
+	}
+
+	if (ndeletable > 0)
+	{
+		/*
+		 * Skip duplication in rare cases where there were LP_DEAD items
+		 * encountered here when that frees sufficient space for caller to
+		 * avoid a page split
+		 */
+		_bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+		if (PageGetFreeSpace(page) >= newitemsz)
+		{
+			pfree(state);
+			return;
+		}
+
+		/* Continue with deduplication */
+		minoff = P_FIRSTDATAKEY(oopaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+	}
+
+	/* Make sure that new page won't have garbage flag set */
+	oopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+	newitemsz += sizeof(ItemIdData);
+	/* Conservatively size array */
+	state->htids = palloc(state->maxitemsize);
+
+	/*
+	 * Iterate over tuples on the page, try to deduplicate them into posting
+	 * lists and insert into new page.
+	 * NOTE It's essential to calculate max offset on each iteration,
+	 * since it could have changed if several items were replaced with a
+	 * single posting tuple.
+	 */
+	offnum = minoff;
+	while (offnum <= PageGetMaxOffsetNumber(page))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(!ItemIdIsDead(itemid));
+
+		if (state->nitems == 0)
+		{
+			/*
+			 * No previous/base tuple for the data item -- use the data
+			 * item as base tuple of pending posting list
+			 */
+			_bt_dedup_start_pending(state, itup, offnum);
+		}
+		else if (state->deduplicate &&
+				 _bt_keep_natts_fast(rel, state->base, itup) > natts &&
+				 _bt_dedup_save_htid(state, itup))
+		{
+			/*
+			 * Tuple is equal to base tuple of pending posting list, and
+			 * merging itup into pending posting list won't exceed the
+			 * BTMaxItemSize() limit.  Heap TID(s) for itup have been saved in
+			 * state.  The next iteration will also end up here if it's
+			 * possible to merge the next tuple into the same pending posting
+			 * list.
+			 */
+		}
+		else
+		{
+			/*
+			 * Tuple is not equal to pending posting list tuple, or
+			 * BTMaxItemSize() limit was reached.
+			 *
+			 * If state contains pending posting list with more than one item,
+			 * form new posting tuple, and update the page,
+			 * otherwise, just reset the state and move on.
+			 */
+			pagesaving += _bt_dedup_finish_pending(buffer, state, RelationNeedsWAL(rel));
+			/*
+			 * When we have deduplicated enough to avoid page split, don't
+			 * bother merging together existing tuples to create new posting
+			 * lists.
+			 *
+			 * Note: We deliberately add as many heap TIDs as possible to a
+			 * pending posting list by performing this check at this point
+			 * (just before a new pending posting lists is created).  It would
+			 * be possible to make the final new posting list for each
+			 * successful page deduplication operation as small as possible
+			 * while still avoiding a page split for caller.  We don't want to
+			 * repeatedly merge posting lists around the same range of heap
+			 * TIDs, though.
+			 *
+			 * (Besides, the total number of new posting lists created is the
+			 * cost that this check is supposed to minimize -- there is no
+			 * great reason to be concerned about the absolute number of
+			 * existing tuples that can be killed/replaced.)
+			 */
+#if 0
+			/* Actually, don't do that */
+			/* TODO: Make a final decision on this */
+			if (pagesaving >= newitemsz)
+				state->deduplicate = false;
+#endif
+
+			/* Continue iteration from base tuple's offnum */
+			offnum = state->baseoff;
+
+		}
+
+		offnum = OffsetNumberNext(offnum);
+	}
+
+	/*
+	 * Handle the last item, if pending posting list is not empty.
+	 */
+	if (state->nitems != 0)
+		pagesaving += _bt_dedup_finish_pending(buffer, state, RelationNeedsWAL(rel));
+
+	/* be tidy */
+	pfree(state->htids);
+	pfree(state);
+}
+
+/*
+ * Create a new pending posting list tuple based on caller's tuple.
+ *
+ * Every tuple processed by the deduplication routines either becomes the base
+ * tuple for a posting list, or gets its heap TID(s) accepted into a pending
+ * posting list.  A tuple that starts out as the base tuple for a posting list
+ * will only actually be rewritten within _bt_dedup_finish_pending() when
+ * there was at least one successful call to _bt_dedup_save_htid().
+ *
+ * Exported for use by nbtsort.c and recovery.
+ */
+void
+_bt_dedup_start_pending(BTDedupState *state, IndexTuple base,
+						OffsetNumber baseoff)
+{
+	Assert(state->nhtids == 0);
+	Assert(state->nitems == 0);
+
+	/*
+	 * Copy heap TIDs from new base tuple for new candidate posting list into
+	 * ipd array.  Assume that we'll eventually create a new posting tuple by
+	 * merging later tuples with this existing one, though we may not.
+	 */
+	if (!BTreeTupleIsPosting(base))
+	{
+		memcpy(state->htids, base, sizeof(ItemPointerData));
+		state->nhtids = 1;
+		/* Save size of tuple without any posting list */
+		state->basetupsize = IndexTupleSize(base);
+	}
+	else
+	{
+		int			nposting;
+
+		nposting = BTreeTupleGetNPosting(base);
+		memcpy(state->htids, BTreeTupleGetPosting(base),
+			   sizeof(ItemPointerData) * nposting);
+		state->nhtids = nposting;
+		/* Save size of tuple without any posting list */
+		state->basetupsize = BTreeTupleGetPostingOffset(base);
+	}
+
+	/*
+	 * Save new base tuple itself -- it'll be needed if we actually create a
+	 * new posting list from new pending posting list.
+	 *
+	 * Must maintain size of all tuples (including line pointer overhead) to
+	 * calculate space savings on page within _bt_dedup_finish_pending().
+	 * Also, save number of base tuple logical tuples so that we can save
+	 * cycles in the common case where an existing posting list can't or won't
+	 * be merged with other tuples on the page.
+	 */
+	state->nitems = 1;
+	state->base = base;
+	state->baseoff = baseoff;
+	state->alltupsize = MAXALIGN(IndexTupleSize(base)) + sizeof(ItemIdData);
+	/* Also save baseoff in pending state for interval */
+	state->interval.baseoff = state->baseoff;
+}
+
+/*
+ * Save itup heap TID(s) into pending posting list where possible.
+ *
+ * Returns bool indicating if the pending posting list managed by state has
+ * itup's heap TID(s) saved.  When this is false, enlarging the pending
+ * posting list by the required amount would exceed the maxitemsize limit, so
+ * caller must finish the pending posting list tuple.  (Generally itup becomes
+ * the base tuple of caller's new pending posting list).
+ *
+ * Exported for use by nbtsort.c and recovery.
+ */
+bool
+_bt_dedup_save_htid(BTDedupState *state, IndexTuple itup)
+{
+	int			nhtids;
+	ItemPointer htids;
+	Size		mergedtupsz;
+
+	if (!BTreeTupleIsPosting(itup))
+	{
+		nhtids = 1;
+		htids = &itup->t_tid;
+	}
+	else
+	{
+		nhtids = BTreeTupleGetNPosting(itup);
+		htids = BTreeTupleGetPosting(itup);
+	}
+
+	/*
+	 * Don't append (have caller finish pending posting list as-is) if
+	 * appending heap TID(s) from itup would put us over limit
+	 */
+	mergedtupsz = MAXALIGN(state->basetupsize +
+						   (state->nhtids + nhtids) *
+						   sizeof(ItemPointerData));
+
+	if (mergedtupsz > state->maxitemsize)
+		return false;
+
+	/*
+	 * Save heap TIDs to pending posting list tuple -- itup can be merged into
+	 * pending posting list
+	 */
+	state->nitems++;
+	memcpy(state->htids + state->nhtids, htids,
+		   sizeof(ItemPointerData) * nhtids);
+	state->nhtids += nhtids;
+	state->alltupsize += MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
+
+	return true;
+}
+
+/*
+ * Finalize pending posting list tuple, and add it to the page.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * Returns space saving from deduplicating to make a new posting list tuple.
+ * Note that this includes line pointer overhead.  This is zero in the case
+ * where no deduplication was possible.
+ *
+ * Exported for use by recovery.
+ */
+Size
+_bt_dedup_finish_pending(Buffer buffer, BTDedupState *state, bool need_wal)
+{
+	Size		spacesaving = 0;
+	Page		page = BufferGetPage(buffer);
+
+	Assert(state->nitems > 0);
+	Assert(state->nitems <= state->nhtids);
+	Assert(state->interval.baseoff == state->baseoff);
+
+	if (state->nitems > 1)
+	{
+		IndexTuple	final;
+		Size		finalsz;
+		OffsetNumber offnum;
+		OffsetNumber deletable[MaxOffsetNumber];
+		int 		 ndeletable = 0;
+
+		/* find all tuples that will be replaced with this new posting tuple */
+		for (offnum = state->baseoff;
+			 offnum < state->baseoff + state->nitems;
+			 offnum = OffsetNumberNext(offnum))
+			deletable[ndeletable++] = offnum;
+
+		/* Form a tuple with a posting list */
+		final = BTreeFormPostingTuple(state->base, state->htids,
+									state->nhtids);
+		finalsz = IndexTupleSize(final);
+		spacesaving = state->alltupsize - (finalsz + sizeof(ItemIdData));
+		/* Must have saved some space */
+		Assert(spacesaving > 0 && spacesaving < BLCKSZ);
+
+		/* Save final number of items for posting list */
+		state->interval.nitems = state->nitems;
+
+		Assert(finalsz <= state->maxitemsize);
+		Assert(finalsz == MAXALIGN(IndexTupleSize(final)));
+
+		START_CRIT_SECTION();
+
+		/* Delete items to replace */
+		PageIndexMultiDelete(page, deletable, ndeletable);
+		/* Insert posting tuple */
+		if (PageAddItem(page, (Item) final, finalsz, state->baseoff, false,
+						false) == InvalidOffsetNumber)
+			elog(ERROR, "deduplication failed to add tuple to page");
+
+		MarkBufferDirty(buffer);
+
+		/* Log deduplicated items */
+		if (need_wal)
+		{
+			XLogRecPtr	recptr;
+			xl_btree_dedup xlrec_dedup;
+
+			xlrec_dedup.baseoff = state->interval.baseoff;
+			xlrec_dedup.nitems = state->interval.nitems;
+
+			XLogBeginInsert();
+			XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+			XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+			recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP_PAGE);
+
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
+
+		pfree(final);
+	}
+
+	/* Reset state for next pending posting list */
+	state->nhtids = 0;
+	state->nitems = 0;
+	state->alltupsize = 0;
+
+	return spacesaving;
+}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 268f869..ecf75ef 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -42,6 +43,11 @@ static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
 								   BlockNumber *target, BlockNumber *rightsib);
 static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
 							   TransactionId latestRemovedXid);
+static TransactionId _bt_compute_xid_horizon_for_tuples(Relation rel,
+														Relation heapRel,
+														Buffer buf,
+														OffsetNumber *itemnos,
+														int nitems);
 
 /*
  *	_bt_initmetapage() -- Fill a page buffer with a correct metapage image
@@ -983,14 +989,52 @@ _bt_page_recyclable(Page page)
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
 					OffsetNumber *itemnos, int nitems,
+					OffsetNumber *updateitemnos,
+					IndexTuple *updated, int nupdatable,
 					BlockNumber lastBlockVacuumed)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	Size		itemsz;
+	Size		updated_sz = 0;
+	char	   *updated_buf = NULL;
+
+	/* XLOG stuff, buffer for updateds */
+	if (nupdatable > 0 && RelationNeedsWAL(rel))
+	{
+		Size		offset = 0;
+
+		for (int i = 0; i < nupdatable; i++)
+			updated_sz += MAXALIGN(IndexTupleSize(updated[i]));
+
+		updated_buf = palloc(updated_sz);
+		for (int i = 0; i < nupdatable; i++)
+		{
+			itemsz = IndexTupleSize(updated[i]);
+			memcpy(updated_buf + offset, (char *) updated[i], itemsz);
+			offset += MAXALIGN(itemsz);
+		}
+		Assert(offset == updated_sz);
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
+	/* Handle posting tuples here */
+	for (int i = 0; i < nupdatable; i++)
+	{
+		/* At first, delete the old tuple. */
+		PageIndexTupleDelete(page, updateitemnos[i]);
+
+		itemsz = IndexTupleSize(updated[i]);
+		itemsz = MAXALIGN(itemsz);
+
+		/* Add tuple with updated ItemPointers to the page. */
+		if (PageAddItem(page, (Item) updated[i], itemsz, updateitemnos[i],
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to rewrite posting list item in index while doing vacuum");
+	}
+
 	/* Fix the page */
 	if (nitems > 0)
 		PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1064,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+		xlrec_vacuum.nupdated = nupdatable;
+		xlrec_vacuum.ndeleted = nitems;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1079,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		if (nitems > 0)
 			XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
 
+		/*
+		 * Here we should save offnums and updated tuples themselves. It's
+		 * important to restore them in correct order. At first, we must
+		 * handle updated tuples and only after that other deleted items.
+		 */
+		if (nupdatable > 0)
+		{
+			Assert(updated_buf != NULL);
+			XLogRegisterBufData(0, (char *) updateitemnos,
+								nupdatable * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, updated_buf, updated_sz);
+		}
+
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
 		PageSetLSN(page, recptr);
@@ -1042,6 +1101,91 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 }
 
 /*
+ * Get the latestRemovedXid from the table entries pointed at by the index
+ * tuples being deleted.
+ *
+ * This is a version of index_compute_xid_horizon_for_tuples() specialized to
+ * nbtree, which can handle posting lists.
+ */
+static TransactionId
+_bt_compute_xid_horizon_for_tuples(Relation rel, Relation heapRel,
+								   Buffer buf, OffsetNumber *itemnos,
+								   int nitems)
+{
+	ItemPointer htids;
+	TransactionId latestRemovedXid = InvalidTransactionId;
+	Page		page = BufferGetPage(buf);
+	int			arraynitems;
+	int			finalnitems;
+
+	/*
+	 * Initial size of array can fit everything when it turns out that are no
+	 * posting lists
+	 */
+	arraynitems = nitems;
+	htids = (ItemPointer) palloc(sizeof(ItemPointerData) * arraynitems);
+
+	finalnitems = 0;
+	/* identify what the index tuples about to be deleted point to */
+	for (int i = 0; i < nitems; i++)
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+
+		itemid = PageGetItemId(page, itemnos[i]);
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(ItemIdIsDead(itemid));
+
+		if (!BTreeTupleIsPosting(itup))
+		{
+			/* Make sure that we have space for additional heap TID */
+			if (finalnitems + 1 > arraynitems)
+			{
+				arraynitems = arraynitems * 2;
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			Assert(ItemPointerIsValid(&itup->t_tid));
+			ItemPointerCopy(&itup->t_tid, &htids[finalnitems]);
+			finalnitems++;
+		}
+		else
+		{
+			int			nposting = BTreeTupleGetNPosting(itup);
+
+			/* Make sure that we have space for additional heap TIDs */
+			if (finalnitems + nposting > arraynitems)
+			{
+				arraynitems = Max(arraynitems * 2, finalnitems + nposting);
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			for (int j = 0; j < nposting; j++)
+			{
+				ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+				Assert(ItemPointerIsValid(htid));
+				ItemPointerCopy(htid, &htids[finalnitems]);
+				finalnitems++;
+			}
+		}
+	}
+
+	Assert(finalnitems >= nitems);
+
+	/* determine the actual xid horizon */
+	latestRemovedXid =
+		table_compute_xid_horizon_for_tuples(heapRel, htids, finalnitems);
+
+	pfree(htids);
+
+	return latestRemovedXid;
+}
+
+/*
  * Delete item(s) from a btree page during single-page cleanup.
  *
  * As above, must only be used on leaf pages.
@@ -1067,8 +1211,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 
 	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
 		latestRemovedXid =
-			index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
-												 itemnos, nitems);
+			_bt_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+											   itemnos, nitems);
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd528..baea34e 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static ItemPointer btreevacuumposting(BTVacState *vstate, IndexTuple itup,
+									  int *nremaining);
 
 
 /*
@@ -263,8 +265,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 				 */
 				if (so->killedItems == NULL)
 					so->killedItems = (int *)
-						palloc(MaxIndexTuplesPerPage * sizeof(int));
-				if (so->numKilled < MaxIndexTuplesPerPage)
+						palloc(MaxPostingIndexTuplesPerPage * sizeof(int));
+				if (so->numKilled < MaxPostingIndexTuplesPerPage)
 					so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 			}
 
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 								 RBM_NORMAL, info->strategy);
 		LockBufferForCleanup(buf);
 		_bt_checkpage(rel, buf);
-		_bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+		_bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+							vstate.lastBlockVacuumed);
 		_bt_relbuf(rel, buf);
 	}
 
@@ -1188,8 +1191,17 @@ restart:
 	}
 	else if (P_ISLEAF(opaque))
 	{
+		/* Deletable item state */
 		OffsetNumber deletable[MaxOffsetNumber];
 		int			ndeletable;
+		int			nhtidsdead;
+		int			nhtidslive;
+
+		/* Updatable item state (for posting lists) */
+		IndexTuple	updated[MaxOffsetNumber];
+		OffsetNumber updatable[MaxOffsetNumber];
+		int			nupdatable;
+
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
@@ -1229,6 +1241,10 @@ restart:
 		 * callback function.
 		 */
 		ndeletable = 0;
+		nupdatable = 0;
+		/* Maintain stats counters for index tuple versions/heap TIDs */
+		nhtidsdead = 0;
+		nhtidslive = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
 		if (callback)
@@ -1238,11 +1254,9 @@ restart:
 				 offnum = OffsetNumberNext(offnum))
 			{
 				IndexTuple	itup;
-				ItemPointer htup;
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
 				/*
 				 * During Hot Standby we currently assume that
@@ -1265,8 +1279,71 @@ restart:
 				 * applies to *any* type of index that marks index tuples as
 				 * killed.
 				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Regular tuple, standard heap TID representation */
+					ItemPointer htid = &(itup->t_tid);
+
+					if (callback(htid, callback_state))
+					{
+						deletable[ndeletable++] = offnum;
+						nhtidsdead++;
+					}
+					else
+						nhtidslive++;
+				}
+				else
+				{
+					ItemPointer newhtids;
+					int			nremaining;
+
+					/*
+					 * Posting list tuple, a physical tuple that represents
+					 * two or more logical tuples, any of which could be an
+					 * index row version that must be removed
+					 */
+					newhtids = btreevacuumposting(vstate, itup, &nremaining);
+					if (newhtids == NULL)
+					{
+						/*
+						 * All TIDs/logical tuples from the posting tuple
+						 * remain, so no update or delete required
+						 */
+						Assert(nremaining == BTreeTupleGetNPosting(itup));
+					}
+					else if (nremaining > 0)
+					{
+						IndexTuple	updatedtuple;
+
+						/*
+						 * Form new tuple that contains only remaining TIDs.
+						 * Remember this tuple and the offset of the old tuple
+						 * for when we update it in place
+						 */
+						Assert(nremaining < BTreeTupleGetNPosting(itup));
+						updatedtuple = BTreeFormPostingTuple(itup, newhtids,
+															 nremaining);
+						updated[nupdatable] = updatedtuple;
+						updatable[nupdatable++] = offnum;
+						nhtidsdead += BTreeTupleGetNPosting(itup) - nremaining;
+						pfree(newhtids);
+					}
+					else
+					{
+						/*
+						 * All TIDs/logical tuples from the posting list must
+						 * be deleted.  We'll delete the physical tuple
+						 * completely.
+						 */
+						deletable[ndeletable++] = offnum;
+						nhtidsdead += BTreeTupleGetNPosting(itup);
+
+						/* Free empty array of live items */
+						pfree(newhtids);
+					}
+
+					nhtidslive += nremaining;
+				}
 			}
 		}
 
@@ -1274,7 +1351,7 @@ restart:
 		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
 		 * call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nupdatable > 0)
 		{
 			/*
 			 * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1290,7 +1367,8 @@ restart:
 			 * doesn't seem worth the amount of bookkeeping it'd take to avoid
 			 * that.
 			 */
-			_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+			_bt_delitems_vacuum(rel, buf, deletable, ndeletable, updatable,
+								updated, nupdatable,
 								vstate->lastBlockVacuumed);
 
 			/*
@@ -1300,7 +1378,7 @@ restart:
 			if (blkno > vstate->lastBlockVacuumed)
 				vstate->lastBlockVacuumed = blkno;
 
-			stats->tuples_removed += ndeletable;
+			stats->tuples_removed += nhtidsdead;
 			/* must recompute maxoff */
 			maxoff = PageGetMaxOffsetNumber(page);
 		}
@@ -1315,6 +1393,7 @@ restart:
 			 * We treat this like a hint-bit update because there's no need to
 			 * WAL-log it.
 			 */
+			Assert(nhtidsdead == 0);
 			if (vstate->cycleid != 0 &&
 				opaque->btpo_cycleid == vstate->cycleid)
 			{
@@ -1324,15 +1403,16 @@ restart:
 		}
 
 		/*
-		 * If it's now empty, try to delete; else count the live tuples. We
-		 * don't delete when recursing, though, to avoid putting entries into
+		 * If it's now empty, try to delete; else count the live tuples (live
+		 * heap TIDs in posting lists are counted as live tuples).  We don't
+		 * delete when recursing, though, to avoid putting entries into
 		 * freePages out-of-order (doesn't seem worth any extra code to handle
 		 * the case).
 		 */
 		if (minoff > maxoff)
 			delete_now = (blkno == orig_blkno);
 		else
-			stats->num_index_tuples += maxoff - minoff + 1;
+			stats->num_index_tuples += nhtidslive;
 	}
 
 	if (delete_now)
@@ -1376,6 +1456,68 @@ restart:
 }
 
 /*
+ * btreevacuumposting() -- determines which logical tuples must remain when
+ * VACUUMing a posting list tuple.
+ *
+ * Returns new palloc'd array of item pointers needed to build replacement
+ * posting list without the index row versions that are to be deleted.
+ *
+ * Note that returned array is NULL in the common case where there is nothing
+ * to delete in caller's posting list tuple.  The number of TIDs that should
+ * remain in the posting list tuple is set for caller in *nremaining.  This is
+ * also the size of the returned array (though only when array isn't just
+ * NULL).
+ */
+static ItemPointer
+btreevacuumposting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+	int			live = 0;
+	int			nitem = BTreeTupleGetNPosting(itup);
+	ItemPointer tmpitems = NULL,
+				items = BTreeTupleGetPosting(itup);
+
+	Assert(BTreeTupleIsPosting(itup));
+
+	/*
+	 * Check each tuple in the posting list.  Save live tuples into tmpitems,
+	 * though try to avoid memory allocation as an optimization.
+	 */
+	for (int i = 0; i < nitem; i++)
+	{
+		if (!vstate->callback(items + i, vstate->callback_state))
+		{
+			/*
+			 * Live heap TID.
+			 *
+			 * Only save live TID when we know that we're going to have to
+			 * kill at least one TID, and have already allocated memory.
+			 */
+			if (tmpitems)
+				tmpitems[live] = items[i];
+			live++;
+		}
+
+		/* Dead heap TID */
+		else if (tmpitems == NULL)
+		{
+			/*
+			 * Turns out we need to delete one or more dead heap TIDs, so
+			 * start maintaining an array of live TIDs for caller to
+			 * reconstruct smaller replacement posting list tuple
+			 */
+			tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+			/* Copy live heap TIDs from previous loop iterations */
+			if (live > 0)
+				memcpy(tmpitems, items, sizeof(ItemPointerData) * live);
+		}
+	}
+
+	*nremaining = live;
+	return tmpitems;
+}
+
+/*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
  * btrees always do, so this is trivial.
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8e51246..9022ee6 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int	_bt_binsrch_posting(BTScanInsert key, Page page,
+								OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static void _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+								  OffsetNumber offnum, ItemPointer heapTid,
+								  IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+									   OffsetNumber offnum,
+									   ItemPointer heapTid);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -434,7 +442,10 @@ _bt_binsrch(Relation rel,
  * low) makes bounds invalid.
  *
  * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
  */
 OffsetNumber
 _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +464,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 	Assert(P_ISLEAF(opaque));
 	Assert(!key->nextkey);
+	Assert(insertstate->postingoff == 0);
 
 	if (!insertstate->bounds_valid)
 	{
@@ -509,6 +521,16 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 			if (result != 0)
 				stricthigh = high;
 		}
+
+		/*
+		 * If tuple at offset located by binary search is a posting list whose
+		 * TID range overlaps with caller's scantid, perform posting list
+		 * binary search to set postingoff for caller.  Caller must split the
+		 * posting list when postingoff is set.  This should happen
+		 * infrequently.
+		 */
+		if (unlikely(result == 0 && key->scantid != NULL))
+			insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
 	}
 
 	/*
@@ -529,6 +551,68 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 }
 
 /*----------
+ *	_bt_binsrch_posting() -- posting list binary search.
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+	IndexTuple	itup;
+	ItemId		itemid;
+	int			low,
+				high,
+				mid,
+				res;
+
+	/*
+	 * If this isn't a posting tuple, then the index must be corrupt (if it is
+	 * an ordinary non-pivot tuple then there must be an existing tuple with a
+	 * heap TID that equals inserter's new heap TID/scantid).  Defensively
+	 * check that tuple is a posting list tuple whose posting list range
+	 * includes caller's scantid.
+	 *
+	 * (This is also needed because contrib/amcheck's rootdescend option needs
+	 * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+	 */
+	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+	Assert(!key->nextkey);
+	Assert(key->scantid != NULL);
+	itemid = PageGetItemId(page, offnum);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+	if (!BTreeTupleIsPosting(itup))
+		return 0;
+
+	/*
+	 * In the unlikely event that posting list tuple has LP_DEAD bit set,
+	 * signal to caller that it should kill the item and restart its binary
+	 * search.
+	 */
+	if (ItemIdIsDead(itemid))
+		return -1;
+
+	/* "high" is past end of posting list for loop invariant */
+	low = 0;
+	high = BTreeTupleGetNPosting(itup);
+	Assert(high >= 2);
+
+	while (high > low)
+	{
+		mid = low + ((high - low) / 2);
+		res = ItemPointerCompare(key->scantid,
+								 BTreeTupleGetPostingN(itup, mid));
+
+		if (res >= 1)
+			low = mid + 1;
+		else
+			high = mid;
+	}
+
+	return low;
+}
+
+/*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
  *	page/offnum: location of btree item to be compared to.
@@ -537,9 +621,18 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
  *			<0 if scankey < tuple at offnum;
  *			 0 if scankey == tuple at offnum;
  *			>0 if scankey > tuple at offnum.
- *		NULLs in the keys are treated as sortable values.  Therefore
- *		"equality" does not necessarily mean that the item should be
- *		returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values.  Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key.  Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid.  There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
+ *
+ * It is generally guaranteed that any possible scankey with scantid set
+ * will have zero or one tuples in the index that are considered equal
+ * here.
  *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
@@ -563,6 +656,7 @@ _bt_compare(Relation rel,
 	ScanKey		scankey;
 	int			ncmpkey;
 	int			ntupatts;
+	int32		result;
 
 	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +691,6 @@ _bt_compare(Relation rel,
 	{
 		Datum		datum;
 		bool		isNull;
-		int32		result;
 
 		datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
 
@@ -713,8 +806,24 @@ _bt_compare(Relation rel,
 	if (heapTid == NULL)
 		return 1;
 
+	/*
+	 * scankey must be treated as equal to a posting list tuple if its scantid
+	 * value falls within the range of the posting list.  In all other cases
+	 * there can only be a single heap TID value, which is compared directly
+	 * as a simple scalar value.
+	 */
 	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	return ItemPointerCompare(key->scantid, heapTid);
+	result = ItemPointerCompare(key->scantid, heapTid);
+	if (!BTreeTupleIsPosting(itup) || result <= 0)
+		return result;
+	else
+	{
+		result = ItemPointerCompare(key->scantid, BTreeTupleGetMaxTID(itup));
+		if (result > 0)
+			return 1;
+	}
+
+	return 0;
 }
 
 /*
@@ -1451,6 +1560,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	/* initialize tuple workspace to empty */
 	so->currPos.nextTupleOffset = 0;
+	so->currPos.postingTupleOffset = 0;
 
 	/*
 	 * Now that the current page has been made consistent, the macro should be
@@ -1485,8 +1595,29 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				if (!BTreeTupleIsPosting(itup))
+				{
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+				else
+				{
+					/*
+					 * Setup state to return posting list, and save first
+					 * "logical" tuple
+					 */
+					_bt_setuppostingitems(so, itemIndex, offnum,
+										  BTreeTupleGetPostingN(itup, 0),
+										  itup);
+					itemIndex++;
+					/* Save additional posting list "logical" tuples */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i));
+						itemIndex++;
+					}
+				}
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1519,7 +1650,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1527,7 +1658,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxPostingIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1569,8 +1700,36 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (passes_quals && tuple_alive)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				if (!BTreeTupleIsPosting(itup))
+				{
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+				else
+				{
+					int			i = BTreeTupleGetNPosting(itup) - 1;
+
+					/*
+					 * Setup state to return posting list, and save last
+					 * "logical" tuple from posting list (since it's the first
+					 * that will be returned to scan).
+					 */
+					itemIndex--;
+					_bt_setuppostingitems(so, itemIndex, offnum,
+										  BTreeTupleGetPostingN(itup, i--),
+										  itup);
+
+					/*
+					 * Return posting list "logical" tuples -- do this in
+					 * descending order, to match overall scan order
+					 */
+					for (; i >= 0; i--)
+					{
+						itemIndex--;
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i));
+					}
+				}
 			}
 			if (!continuescan)
 			{
@@ -1584,8 +1743,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1757,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1611,6 +1772,59 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 }
 
 /*
+ * Setup state to save posting items from a single posting list tuple.  Saves
+ * the logical tuple that will be returned to scan first in passing.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for logical tuple
+ * that is returned to scan first.  Second or subsequent heap TID for posting
+ * list should be saved by calling _bt_savepostingitem().
+ */
+static void
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					  ItemPointer heapTid, IndexTuple itup)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		/* Save a base version of the IndexTuple */
+		Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+		itupsz = MAXALIGN(itupsz);
+		currItem->tupleOffset = so->currPos.nextTupleOffset;
+		memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+		so->currPos.nextTupleOffset += itupsz;
+		so->currPos.postingTupleOffset = currItem->tupleOffset;
+	}
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer heapTid)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+
+	/*
+	 * Have index-only scans return the same base IndexTuple for every logical
+	 * tuple that originates from the same posting list
+	 */
+	if (so->currTuples)
+		currItem->tupleOffset = so->currPos.postingTupleOffset;
+}
+
+/*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
  * On entry, if so->currPos.buf is valid the buffer is pinned but not locked;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index ab19692..f6ca690 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -287,6 +287,9 @@ static void _bt_sortaddtup(Page page, Size itemsize,
 						   IndexTuple itup, OffsetNumber itup_off);
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 						 IndexTuple itup);
+static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
+										  BTPageState *state,
+										  BTDedupState *dstate);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
@@ -799,7 +802,8 @@ _bt_sortaddtup(Page page,
 }
 
 /*----------
- * Add an item to a disk page from the sort output.
+ * Add an item to a disk page from the sort output (or add a posting list
+ * item formed from the sort output).
  *
  * We must be careful to observe the page layout conventions of nbtsearch.c:
  * - rightmost pages start data items at P_HIKEY instead of at P_FIRSTKEY.
@@ -1002,6 +1006,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		 * the minimum key for the new page.
 		 */
 		state->btps_minkey = CopyIndexTuple(oitup);
+		Assert(BTreeTupleIsPivot(state->btps_minkey));
 
 		/*
 		 * Set the sibling links for both pages.
@@ -1043,6 +1048,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(state->btps_minkey == NULL);
 		state->btps_minkey = CopyIndexTuple(itup);
 		/* _bt_sortaddtup() will perform full truncation later */
+		BTreeTupleClearBtIsPosting(state->btps_minkey);
 		BTreeTupleSetNAtts(state->btps_minkey, 0);
 	}
 
@@ -1058,6 +1064,42 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 }
 
 /*
+ * Finalize pending posting list tuple, and add it to the index.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * This is almost like nbtinsert.c's _bt_dedup_finish_pending(), but it adds a
+ * new tuple using _bt_buildadd() and does not maintain the intervals array.
+ */
+static void
+_bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
+							  BTDedupState *dstate)
+{
+	IndexTuple	final;
+
+	Assert(dstate->nitems > 0);
+	if (dstate->nitems == 1)
+		final = dstate->base;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(dstate->base,
+											 dstate->htids,
+											 dstate->nhtids);
+		final = postingtuple;
+	}
+
+	_bt_buildadd(wstate, state, final);
+
+	if (dstate->nitems > 1)
+		pfree(final);
+	/* Don't maintain  dedup_intervals array, or alltupsize */
+	dstate->nhtids = 0;
+	dstate->nitems = 0;
+}
+
+/*
  * Finish writing out the completed btree.
  */
 static void
@@ -1144,6 +1186,11 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		deduplicate;
+
+	/* Don't use deduplication for INCLUDE indexes or unique indexes */
+	deduplicate = (keysz == IndexRelationGetNumberOfAttributes(wstate->index) &&
+				   !wstate->index->rd_index->indisunique);
 
 	if (merge)
 	{
@@ -1152,6 +1199,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		 * btspool and btspool2.
 		 */
 
+		Assert(!deduplicate);
 		/* the preparation of merge */
 		itup = tuplesort_getindextuple(btspool->sortstate, true);
 		itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
@@ -1255,9 +1303,94 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		}
 		pfree(sortKeys);
 	}
+	else if (deduplicate)
+	{
+		/* merge is unnecessary, deduplicate into posting lists */
+		BTDedupState *dstate;
+		IndexTuple	newbase;
+
+		dstate = (BTDedupState *) palloc(sizeof(BTDedupState));
+		dstate->deduplicate = true; /* unused */
+		dstate->maxitemsize = 0;	/* set later */
+		/* Metadata about current pending posting list */
+		dstate->htids = NULL;
+		dstate->nhtids = 0;
+		dstate->nitems = 0;
+		dstate->alltupsize = 0; /* unused */
+		/* Metadata about based tuple of current pending posting list */
+		dstate->base = NULL;
+		dstate->baseoff = InvalidOffsetNumber;	/* unused */
+		dstate->basetupsize = 0;
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate,
+											   true)) != NULL)
+		{
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+			{
+				state = _bt_pagestate(wstate, 0);
+				dstate->maxitemsize = BTMaxItemSize(state->btps_page);
+				/* Conservatively size array */
+				dstate->htids = palloc(dstate->maxitemsize);
+
+				/*
+				 * No previous/base tuple, since itup is the  first item
+				 * returned by the tuplesort -- use itup as base tuple of
+				 * first pending posting list for entire index build
+				 */
+				newbase = CopyIndexTuple(itup);
+				_bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+			}
+			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
+										 itup) > keysz &&
+					 _bt_dedup_save_htid(dstate, itup))
+			{
+				/*
+				 * Tuple is equal to base tuple of pending posting list, and
+				 * merging itup into pending posting list won't exceed the
+				 * BTMaxItemSize() limit.  Heap TID(s) for itup have been
+				 * saved in state.  The next iteration will also end up here
+				 * if it's possible to merge the next tuple into the same
+				 * pending posting list.
+				 */
+			}
+			else
+			{
+				/*
+				 * Tuple is not equal to pending posting list tuple, or
+				 * BTMaxItemSize() limit was reached
+				 */
+				_bt_sort_dedup_finish_pending(wstate, state, dstate);
+				/* Base tuple is always a copy */
+				pfree(dstate->base);
+
+				/* itup starts new pending posting list */
+				newbase = CopyIndexTuple(itup);
+				_bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+			}
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+
+		/*
+		 * Handle the last item (there must be a last item when the tuplesort
+		 * returned one or more tuples)
+		 */
+		if (state)
+		{
+			_bt_sort_dedup_finish_pending(wstate, state, dstate);
+			/* Base tuple is always a copy */
+			pfree(dstate->base);
+			pfree(dstate->htids);
+		}
+
+		pfree(dstate);
+	}
 	else
 	{
-		/* merge is unnecessary */
+		/* merging and deduplication are both unnecessary */
 		while ((itup = tuplesort_getindextuple(btspool->sortstate,
 											   true)) != NULL)
 		{
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1c1029b..54cecc8 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
 	state.minfirstrightsz = SIZE_MAX;
 	state.newitemoff = newitemoff;
 
+	/* newitem cannot be a posting list item */
+	Assert(!BTreeTupleIsPosting(newitem));
+
 	/*
 	 * maxsplits should never exceed maxoff because there will be at most as
 	 * many candidate split points as there are points _between_ tuples, once
@@ -459,17 +462,52 @@ _bt_recsplitloc(FindSplitData *state,
 	int16		leftfree,
 				rightfree;
 	Size		firstrightitemsz;
+	Size		postingsubhikey = 0;
 	bool		newitemisfirstonright;
 
 	/* Is the new item going to be the first item on the right page? */
 	newitemisfirstonright = (firstoldonright == state->newitemoff
 							 && !newitemonleft);
 
+	/*
+	 * FIXME: Accessing every single tuple like this adds cycles to cases that
+	 * cannot possibly benefit (i.e. cases where we know that there cannot be
+	 * posting lists).  Maybe we should add a way to not bother when we are
+	 * certain that this is the case.
+	 *
+	 * We could either have _bt_split() pass us a flag, or invent a page flag
+	 * that indicates that the page might have posting lists, as an
+	 * optimization.  There is no shortage of btpo_flags bits for stuff like
+	 * this.
+	 */
 	if (newitemisfirstonright)
+	{
 		firstrightitemsz = state->newitemsz;
+
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+			postingsubhikey = IndexTupleSize(state->newitem) -
+				BTreeTupleGetPostingOffset(state->newitem);
+	}
 	else
+	{
 		firstrightitemsz = firstoldonrightsz;
 
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf)
+		{
+			ItemId		itemid;
+			IndexTuple	newhighkey;
+
+			itemid = PageGetItemId(state->page, firstoldonright);
+			newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+			if (BTreeTupleIsPosting(newhighkey))
+				postingsubhikey = IndexTupleSize(newhighkey) -
+					BTreeTupleGetPostingOffset(newhighkey);
+		}
+	}
+
 	/* Account for all the old tuples */
 	leftfree = state->leftspace - olddataitemstoleft;
 	rightfree = state->rightspace -
@@ -492,9 +530,13 @@ _bt_recsplitloc(FindSplitData *state,
 	 * adding a heap TID to the left half's new high key when splitting at the
 	 * leaf level.  In practice the new high key will often be smaller and
 	 * will rarely be larger, but conservatively assume the worst case.
+	 * Truncation always truncates away any posting list that appears in the
+	 * first right tuple, though, so it's safe to subtract that overhead
+	 * (while still conservatively assuming that truncation might have to add
+	 * back a single heap TID using the pivot tuple heap TID representation).
 	 */
 	if (state->is_leaf)
-		leftfree -= (int16) (firstrightitemsz +
+		leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
 							 MAXALIGN(sizeof(ItemPointerData)));
 	else
 		leftfree -= (int16) firstrightitemsz;
@@ -691,7 +733,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
 	tup = (IndexTuple) PageGetItem(state->page, itemid);
 	/* Do cheaper test first */
-	if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+	if (BTreeTupleIsPosting(tup) ||
+		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
 	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index bc855dd..7460bf2 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -97,8 +97,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	indoption = rel->rd_indoption;
 	tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
 
-	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
 	/*
 	 * We'll execute search using scan key constructed on key columns.
 	 * Truncated attributes and non-key attributes are omitted from the final
@@ -110,9 +108,20 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	key->anynullkeys = false;	/* initial assumption */
 	key->nextkey = false;
 	key->pivotsearch = false;
+	key->scantid = NULL;
 	key->keysz = Min(indnkeyatts, tupnatts);
-	key->scantid = key->heapkeyspace && itup ?
-		BTreeTupleGetHeapTID(itup) : NULL;
+
+	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+	Assert(!itup || !BTreeTupleIsPosting(itup) || key->heapkeyspace);
+
+	/*
+	 * When caller passes a tuple with a heap TID, use it to set scantid. Note
+	 * that this handles posting list tuples by setting scantid to the lowest
+	 * heap TID in the posting list.
+	 */
+	if (itup && key->heapkeyspace)
+		key->scantid = BTreeTupleGetHeapTID(itup);
+
 	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
@@ -1386,6 +1395,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			continue;
 		}
 
@@ -1547,6 +1557,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			cmpresult = 0;
 			if (subkey->sk_flags & SK_ROW_END)
 				break;
@@ -1786,10 +1797,35 @@ _bt_killitems(IndexScanDesc scan)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
+			bool		killtuple = false;
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			if (BTreeTupleIsPosting(ituple))
 			{
-				/* found the item */
+				int			pi = i + 1;
+				int			nposting = BTreeTupleGetNPosting(ituple);
+				int			j;
+
+				for (j = 0; j < nposting; j++)
+				{
+					ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+					if (!ItemPointerEquals(item, &kitem->heapTid))
+						break;	/* out of posting list loop */
+
+					/* Read-ahead to later kitems */
+					if (pi < numKilled)
+						kitem = &so->currPos.items[so->killedItems[pi++]];
+				}
+
+				if (j == nposting)
+					killtuple = true;
+			}
+			else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+				killtuple = true;
+
+			if (killtuple)
+			{
+				/* found the item/all posting list items */
 				ItemIdMarkDead(iid);
 				killedsomething = true;
 				break;			/* out of inner search loop */
@@ -2140,6 +2176,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(firstright))
+		{
+			BTreeTupleClearBtIsPosting(pivot);
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			if (keepnatts == natts)
+			{
+				/*
+				 * index_truncate_tuple() just returned a copy of the
+				 * original, so make sure that the size of the new pivot tuple
+				 * doesn't have posting list overhead
+				 */
+				pivot->t_info &= ~INDEX_SIZE_MASK;
+				pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+			}
+		}
+
+		Assert(!BTreeTupleIsPosting(pivot));
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2156,6 +2210,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * attribute to the new pivot tuple.
 		 */
 		Assert(natts != nkeyatts);
+		Assert(!BTreeTupleIsPosting(lastleft) &&
+			   !BTreeTupleIsPosting(firstright));
 		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
 		tidpivot = palloc0(newsize);
 		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2163,6 +2219,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		pfree(pivot);
 		pivot = tidpivot;
 	}
+	else if (BTreeTupleIsPosting(firstright))
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal.  We
+		 * can always truncate away a posting list, though.
+		 *
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+			MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+		pivot->t_info &= ~INDEX_SIZE_MASK;
+		pivot->t_info |= newsize;
+		BTreeTupleClearBtIsPosting(pivot);
+		BTreeTupleSetAltHeapTID(pivot);
+	}
 	else
 	{
 		/*
@@ -2170,7 +2244,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * It's necessary to add a heap TID attribute to the new pivot tuple.
 		 */
 		Assert(natts == nkeyatts);
-		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+		newsize = MAXALIGN(IndexTupleSize(firstright)) +
+			MAXALIGN(sizeof(ItemPointerData));
 		pivot = palloc0(newsize);
 		memcpy(pivot, firstright, IndexTupleSize(firstright));
 	}
@@ -2188,6 +2263,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * nbtree (e.g., there is no pg_attribute entry).
 	 */
 	Assert(itup_key->heapkeyspace);
+	Assert(!BTreeTupleIsPosting(pivot));
 	pivot->t_info &= ~INDEX_SIZE_MASK;
 	pivot->t_info |= newsize;
 
@@ -2200,7 +2276,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2211,9 +2287,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+							  BTreeTupleGetHeapTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #else
 
 	/*
@@ -2226,7 +2305,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2235,7 +2314,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2316,15 +2396,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
  * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal (once detoasted).  Similarly, result may
- * differ from the _bt_keep_natts result when either tuple has TOASTed datums,
- * though this is barely possible in practice.
+ * unless they're bitwise equal after detoasting.
  *
  * These issues must be acceptable to callers, typically because they're only
  * concerned about making suffix truncation as effective as possible without
  * leaving excessive amounts of free space on either side of page split.
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts.
+ *
+ * When an index only uses opclasses where equality is "precise", this
+ * function is guaranteed to give the same result as _bt_keep_natts().  This
+ * makes it safe to use this function to determine whether or not two tuples
+ * can be folded together into a single posting tuple.  Posting list
+ * deduplication cannot be used with nondeterministic collations for this
+ * reason.
+ *
+ * FIXME: Actually invent the needed "equality-is-precise" opclass
+ * infrastructure.  See dedicated -hackers thread:
+ *
+ * https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2349,8 +2439,38 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 		if (isNull1 != isNull2)
 			break;
 
+		/*
+		 * XXX: The ideal outcome from the point of view of the posting list
+		 * patch is that the definition of an opclass with "precise equality"
+		 * becomes: "equality operator function must give exactly the same
+		 * answer as datum_image_eq() would, provided that we aren't using a
+		 * nondeterministic collation". (Nondeterministic collations are
+		 * clearly not compatible with deduplication.)
+		 *
+		 * This will be a lot faster than actually using the authoritative
+		 * insertion scankey in some cases.  This approach also seems more
+		 * elegant, since suffix truncation gets to follow exactly the same
+		 * definition of "equal" as posting list deduplication -- there is a
+		 * subtle interplay between deduplication and suffix truncation, and
+		 * it would be nice to know for sure that they have exactly the same
+		 * idea about what equality is.
+		 *
+		 * This ideal outcome still avoids problems with TOAST.  We cannot
+		 * repeat bugs like the amcheck bug that was fixed in bugfix commit
+		 * eba775345d23d2c999bbb412ae658b6dab36e3e8.  datum_image_eq()
+		 * considers binary equality, though only _after_ each datum is
+		 * decompressed.
+		 *
+		 * If this ideal solution isn't possible, then we can fall back on
+		 * defining "precise equality" as: "type's output function must
+		 * produce identical textual output for any two datums that compare
+		 * equal when using a safe/equality-is-precise operator class (unless
+		 * using a nondeterministic collation)".  That would mean that we'd
+		 * have to make deduplication call _bt_keep_natts() instead (or some
+		 * other function that uses authoritative insertion scankey).
+		 */
 		if (!isNull1 &&
-			!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+			!datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
 			break;
 
 		keepnatts++;
@@ -2402,22 +2522,30 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
+	/* !heapkeyspace indexes do not support deduplication */
+	if (!heapkeyspace && BTreeTupleIsPosting(itup))
+		return false;
+
+	/* INCLUDE indexes do not support deduplication */
+	if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+		return false;
+
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
 			/*
-			 * Non-pivot tuples currently never use alternative heap TID
-			 * representation -- even those within heapkeyspace indexes
+			 * Non-pivot tuple should never be explicitly marked as a pivot
+			 * tuple
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated.  (Note that tupnatts must have been
-			 * inferred, rather than coming from an explicit on-disk
-			 * representation.)
+			 * inferred, even with a posting list tuple, because only pivot
+			 * tuples store tupnatts directly.)
 			 */
 			return tupnatts == natts;
 		}
@@ -2461,12 +2589,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * non-zero, or when there is no explicit representation and the
 			 * tuple is evidently not a pre-pg_upgrade tuple.
 			 *
-			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
-			 * that to decide if the tuple is a pre-v11 tuple.
+			 * Prior to v11, downlinks always had P_HIKEY as their offset.
+			 * Accept that as an alternative indication of a valid
+			 * !heapkeyspace negative infinity tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
-				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+				ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
 		}
 		else
 		{
@@ -2492,7 +2620,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
+		return false;
+
+	/* Pivot tuple should not use posting list representation (redundant) */
+	if (BTreeTupleIsPosting(itup))
 		return false;
 
 	/*
@@ -2562,11 +2694,85 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					BTMaxItemSizeNoHeapTid(page),
 					RelationGetRelationName(rel)),
 			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
-					   ItemPointerGetBlockNumber(&newtup->t_tid),
-					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+					   ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
 					   RelationGetRelationName(heap)),
 			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
 					 "Consider a function index of an MD5 hash of the value, "
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Given a basic tuple that contains key datum and posting list, build a
+ * posting tuple.  Caller's "htids" array must be sorted in ascending order.
+ *
+ * Basic tuple can be a posting tuple, but we only use key part of it, all
+ * ItemPointers must be passed via htids.
+ *
+ * If nhtids == 1, just build a non-posting tuple.  It is necessary to avoid
+ * storage overhead after posting tuple was vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointer htids, int nhtids)
+{
+	uint32		keysize,
+				newsize = 0;
+	IndexTuple	itup;
+
+	/* We only need key part of the tuple */
+	if (BTreeTupleIsPosting(tuple))
+		keysize = BTreeTupleGetPostingOffset(tuple);
+	else
+		keysize = IndexTupleSize(tuple);
+
+	Assert(nhtids > 0);
+
+	/* Add space needed for posting list */
+	if (nhtids > 1)
+		newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nhtids;
+	else
+		newsize = keysize;
+
+	newsize = MAXALIGN(newsize);
+	itup = palloc0(newsize);
+	memcpy(itup, tuple, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+
+	if (nhtids > 1)
+	{
+		/* Form posting tuple, fill posting fields */
+
+		itup->t_info |= INDEX_ALT_TID_MASK;
+		BTreeSetPostingMeta(itup, nhtids, SHORTALIGN(keysize));
+		/* Copy posting list into the posting tuple */
+		memcpy(BTreeTupleGetPosting(itup), htids,
+			   sizeof(ItemPointerData) * nhtids);
+
+#ifdef USE_ASSERT_CHECKING
+		{
+			/* Assert that htid array is sorted and has unique TIDs */
+			ItemPointerData last;
+			ItemPointer current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+				current = BTreeTupleGetPostingN(itup, i);
+				Assert(ItemPointerCompare(current, &last) > 0);
+				ItemPointerCopy(current, &last);
+			}
+		}
+#endif
+	}
+	else
+	{
+		/* To finish building of a non-posting tuple, copy TID from htids */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		ItemPointerCopy(htids, &itup->t_tid);
+	}
+
+	return itup;
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c..2f741e1 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -21,8 +21,11 @@
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "storage/procarray.h"
+#include "utils/memutils.h"
 #include "miscadmin.h"
 
+static MemoryContext opCtx;		/* working memory for operations */
+
 /*
  * _bt_restore_page -- re-enter all the index tuples on a page
  *
@@ -181,9 +184,46 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 
 		page = BufferGetPage(buffer);
 
-		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
-						false, false) == InvalidOffsetNumber)
-			elog(PANIC, "btree_xlog_insert: failed to add item");
+		if (xlrec->postingoff == InvalidOffsetNumber)
+		{
+			/* Simple retail insertion */
+			if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+		}
+		else
+		{
+			ItemId		itemid;
+			IndexTuple	oposting,
+						newitem,
+						nposting;
+
+			/*
+			 * A posting list split occurred during insertion.
+			 *
+			 * Use _bt_posting_split() to repeat posting list split steps from
+			 * primary.  Note that newitem from WAL record is 'orignewitem',
+			 * not the final version of newitem that is actually inserted on
+			 * page.
+			 */
+			Assert(isleaf);
+			itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+			oposting = (IndexTuple) PageGetItem(page, itemid);
+
+			/* newitem must be mutable copy for _bt_posting_split() */
+			newitem = CopyIndexTuple((IndexTuple) datapos);
+			nposting = _bt_posting_split(newitem, oposting,
+										 xlrec->postingoff);
+
+			/* Replace existing posting list with post-split version */
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+			/* insert new item */
+			Assert(IndexTupleSize(newitem) == datalen);
+			if (PageAddItem(page, (Item) newitem, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add posting split new item");
+		}
 
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
@@ -265,20 +305,42 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 		BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
 		OffsetNumber off;
 		IndexTuple	newitem = NULL,
-					left_hikey = NULL;
+					left_hikey = NULL,
+					nposting = NULL;
 		Size		newitemsz = 0,
 					left_hikeysz = 0;
 		Page		newlpage;
-		OffsetNumber leftoff;
+		OffsetNumber leftoff,
+					replacepostingoff = InvalidOffsetNumber;
 
 		datapos = XLogRecGetBlockData(record, 0, &datalen);
 
-		if (onleft)
+		if (onleft || xlrec->postingoff != 0)
 		{
 			newitem = (IndexTuple) datapos;
 			newitemsz = MAXALIGN(IndexTupleSize(newitem));
 			datapos += newitemsz;
 			datalen -= newitemsz;
+
+			if (xlrec->postingoff != 0)
+			{
+				/*
+				 * Use _bt_posting_split() to repeat posting list split steps
+				 * from primary
+				 */
+				ItemId		itemid;
+				IndexTuple	oposting;
+
+				/* Posting list must be at offset number before new item's */
+				replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+				/* newitem must be mutable copy for _bt_posting_split() */
+				newitem = CopyIndexTuple(newitem);
+				itemid = PageGetItemId(lpage, replacepostingoff);
+				oposting = (IndexTuple) PageGetItem(lpage, itemid);
+				nposting = _bt_posting_split(newitem, oposting,
+											 xlrec->postingoff);
+			}
 		}
 
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
@@ -304,8 +366,20 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 			Size		itemsz;
 			IndexTuple	item;
 
+			/* Add replacement posting list when required */
+			if (off == replacepostingoff)
+			{
+				Assert(onleft || xlrec->firstright == xlrec->newitemoff);
+				if (PageAddItem(newlpage, (Item) nposting,
+								MAXALIGN(IndexTupleSize(nposting)), leftoff,
+								false, false) == InvalidOffsetNumber)
+					elog(ERROR, "failed to add new posting list item to left page after split");
+				leftoff = OffsetNumberNext(leftoff);
+				continue;
+			}
+
 			/* add the new item if it was inserted on left page */
-			if (onleft && off == xlrec->newitemoff)
+			else if (onleft && off == xlrec->newitemoff)
 			{
 				if (PageAddItem(newlpage, (Item) newitem, newitemsz, leftoff,
 								false, false) == InvalidOffsetNumber)
@@ -380,14 +454,89 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 }
 
 static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		buf;
+	xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+
+	if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Initialize a temporary empty page and copy all the items to that in
+		 * item number order.
+		 */
+		Page		page = (Page) BufferGetPage(buf);
+		OffsetNumber offnum;
+		BTDedupState *state;
+
+		state = (BTDedupState *) palloc(sizeof(BTDedupState));
+
+		state->deduplicate = true;	/* unused */
+		state->maxitemsize = BTMaxItemSize(page);
+		/* Metadata about current pending posting list */
+		state->htids = NULL;
+		state->nhtids = 0;
+		state->nitems = 0;
+		state->alltupsize = 0;
+		/* Metadata about based tuple of current pending posting list */
+		state->base = NULL;
+		state->baseoff = InvalidOffsetNumber;
+		state->basetupsize = 0;
+
+		/* Conservatively size array */
+		state->htids = palloc(state->maxitemsize);
+
+		/*
+		 * Iterate over tuples on the page belonging to the interval
+		 * to deduplicate them into a posting list.
+		 */
+		for (offnum = xlrec->baseoff;
+			 offnum < xlrec->baseoff + xlrec->nitems;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemid = PageGetItemId(page, offnum);
+			IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+			Assert(!ItemIdIsDead(itemid));
+
+			if (offnum == xlrec->baseoff)
+			{
+				/*
+				 * No previous/base tuple for first data item -- use first
+				 * data item as base tuple of first pending posting list
+				 */
+				_bt_dedup_start_pending(state, itup, offnum);
+			}
+			else
+			{
+				/* Heap TID(s) for itup will be saved in state */
+				if (!_bt_dedup_save_htid(state, itup))
+					elog(ERROR, "could not add heap tid to pending posting list");
+			}
+		}
+
+		Assert(state->nitems == xlrec->nitems);
+		/* Handle the last item */
+		_bt_dedup_finish_pending(buf, state, false);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buf);
+	}
+
+	if (BufferIsValid(buf))
+		UnlockReleaseBuffer(buf);
+}
+
+static void
 btree_xlog_vacuum(XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	Buffer		buffer;
 	Page		page;
 	BTPageOpaque opaque;
-#ifdef UNUSED
 	xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
 
 	/*
 	 * This section of code is thought to be no longer needed, after analysis
@@ -478,14 +627,34 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		if (len > 0)
 		{
-			OffsetNumber *unused;
-			OffsetNumber *unend;
+			if (xlrec->nupdated > 0)
+			{
+				OffsetNumber *updatedoffsets;
+				IndexTuple	updated;
+				Size		itemsz;
+
+				updatedoffsets = (OffsetNumber *)
+					(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+				updated = (IndexTuple) ((char *) updatedoffsets +
+										xlrec->nupdated * sizeof(OffsetNumber));
 
-			unused = (OffsetNumber *) ptr;
-			unend = (OffsetNumber *) ((char *) ptr + len);
+				/* Handle posting tuples */
+				for (int i = 0; i < xlrec->nupdated; i++)
+				{
+					PageIndexTupleDelete(page, updatedoffsets[i]);
+
+					itemsz = MAXALIGN(IndexTupleSize(updated));
+
+					if (PageAddItem(page, (Item) updated, itemsz, updatedoffsets[i],
+									false, false) == InvalidOffsetNumber)
+						elog(PANIC, "btree_xlog_vacuum: failed to add updated posting list item");
+
+					updated = (IndexTuple) ((char *) updated + itemsz);
+				}
+			}
 
-			if ((unend - unused) > 0)
-				PageIndexMultiDelete(page, unused, unend - unused);
+			if (xlrec->ndeleted)
+				PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 		}
 
 		/*
@@ -820,7 +989,9 @@ void
 btree_redo(XLogReaderState *record)
 {
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	MemoryContext oldCtx;
 
+	oldCtx = MemoryContextSwitchTo(opCtx);
 	switch (info)
 	{
 		case XLOG_BTREE_INSERT_LEAF:
@@ -838,6 +1009,9 @@ btree_redo(XLogReaderState *record)
 		case XLOG_BTREE_SPLIT_R:
 			btree_xlog_split(false, record);
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			btree_xlog_dedup(record);
+			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
 			break;
@@ -863,6 +1037,23 @@ btree_redo(XLogReaderState *record)
 		default:
 			elog(PANIC, "btree_redo: unknown op code %u", info);
 	}
+	MemoryContextSwitchTo(oldCtx);
+	MemoryContextReset(opCtx);
+}
+
+void
+btree_xlog_startup(void)
+{
+	opCtx = AllocSetContextCreate(CurrentMemoryContext,
+								  "Btree recovery temporary context",
+								  ALLOCSET_DEFAULT_SIZES);
+}
+
+void
+btree_xlog_cleanup(void)
+{
+	MemoryContextDelete(opCtx);
+	opCtx = NULL;
 }
 
 /*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 4ee6d04..1dde2da 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -30,7 +30,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_insert *xlrec = (xl_btree_insert *) rec;
 
-				appendStringInfo(buf, "off %u", xlrec->offnum);
+				appendStringInfo(buf, "off %u; postingoff %u",
+								 xlrec->offnum, xlrec->postingoff);
 				break;
 			}
 		case XLOG_BTREE_SPLIT_L:
@@ -38,16 +39,30 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
-				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d",
-								 xlrec->level, xlrec->firstright, xlrec->newitemoff);
+				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, postingoff %d",
+								 xlrec->level,
+								 xlrec->firstright,
+								 xlrec->newitemoff,
+								 xlrec->postingoff);
+				break;
+			}
+		case XLOG_BTREE_DEDUP_PAGE:
+			{
+				xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+				appendStringInfo(buf, "baseoff %u; nitems %u",
+								 xlrec->baseoff,
+								 xlrec->nitems);
 				break;
 			}
 		case XLOG_BTREE_VACUUM:
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "lastBlockVacuumed %u",
-								 xlrec->lastBlockVacuumed);
+				appendStringInfo(buf, "lastBlockVacuumed %u; nupdated %u; ndeleted %u",
+								 xlrec->lastBlockVacuumed,
+								 xlrec->nupdated,
+								 xlrec->ndeleted);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
@@ -131,6 +146,9 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			id = "DEDUPLICATE";
+			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4a80e84..22b2e93 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
  * All other types of index tuples ("pivot" tuples) only have key columns,
  * since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,38 @@ typedef struct BTMetaPageData
  * omitted rather than truncated, since its representation is different to
  * the non-pivot representation.)
  *
+ * Non-pivot posting tuple format:
+ *  t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicated keys more effectively, we use special format
+ * of tuples - posting tuples.  posting_list is an array of ItemPointerData.
+ *
+ * Deduplication never applies to unique indexes or indexes with INCLUDEd
+ * columns.
+ *
+ * To differ posting tuples we use INDEX_ALT_TID_MASK flag in t_info and
+ * BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's tid:
+ * - t_tid.ip_blkid contains offset of the posting list.
+ * - t_tid offset field contains number of posting items this tuple contain
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * tuples, which is constrainted by BTMaxItemSize.
+
+ * If page contains so many duplicates, that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), page may contain several posting
+ * tuples with the same key.
+ * Also page can contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
  * Pivot tuple format:
  *
  *  t_tid | t_info | key values | [heap TID]
@@ -281,23 +312,149 @@ typedef struct BTMetaPageData
  * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
  * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
  * number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinct posting tuples from pivot tuples.
  *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_N_POSTING_OFFSET_MASK	0x0FFF
 #define BT_HEAP_TID_ATTR			0x1000
+#define BT_IS_POSTING				0x2000
+
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger then
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+	((int) ((BLCKSZ - SizeOfPageHeaderData - \
+			3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+			(sizeof(ItemPointerData)))
+
+/*
+ * State used to representing a pending posting list during deduplication.
+ *
+ * Each entry represents a group of consecutive items from the page, starting
+ * from page offset number 'baseoff', which is the offset number of the "base"
+ * tuple on the page undergoing deduplication.  'nitems' is the total number
+ * of items from the page that will be merged to make a new posting tuple.
+ *
+ * Note: 'nitems' means the number of physical index tuples/line pointers on
+ * the page, starting with and including the item at offset number 'baseoff'
+ * (so nitems should be at least 2 when interval is used).  These existing
+ * tuples may be posting list tuples or regular tuples.
+ */
+typedef struct BTDedupInterval
+{
+	OffsetNumber baseoff;
+	OffsetNumber nitems;
+} BTDedupInterval;
+
+/*
+ * Btree-private state needed to build posting tuples.  htids is an array of
+ * ItemPointers for pending posting list.
+ *
+ * Iterating over tuples during index build or applying deduplication to a
+ * single page, we remember a "base" tuple, then compare the next one with it.
+ * If tuples are equal, save their TIDs in the posting list.
+ */
+typedef struct BTDedupState
+{
+	/* Deduplication status info for entire page/operation */
+	bool		deduplicate;	/* Still deduplicating page? */
+	Size		maxitemsize;	/* BTMaxItemSize() limit for page */
+
+	/* Metadata about current pending posting list */
+	ItemPointer htids;			/* Heap TIDs in pending posting list */
+	int			nhtids;			/* # valid heap TIDs in nhtids array */
+	int			nitems;			/* See BTDedupInterval definition */
+	Size		alltupsize;		/* Includes line pointer overhead */
+
+	/* Metadata about based tuple of current pending posting list */
+	IndexTuple	base;			/* Use to form new posting list */
+	OffsetNumber baseoff;		/* original page offset of base */
+	Size		basetupsize;	/* base size without posting list */
+
+	/*
+	 * Pending posting list.  Contains information about a group of
+	 * consecutive items that will be deduplicated by creating a new posting
+	 * list tuple.
+	 */
+	BTDedupInterval interval;
+} BTDedupState;
+
+/*
+ * N.B.: BTreeTupleIsPivot() should only be used in code that deals with
+ * heapkeyspace indexes specifically.  BTreeTupleIsPosting() works with all
+ * nbtree indexes, though.
+ */
+#define BTreeTupleIsPivot(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+	)
+#define BTreeTupleIsPosting(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+	)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleGetNPosting(itup)	\
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+	)
+#define BTreeTupleSetNPosting(itup, n) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+	} while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+	)
+#define BTreeSetPostingMeta(itup, nposting, off) \
+	do { \
+		BTreeTupleSetNPosting(itup, nposting); \
+		Assert(BTreeTupleIsPosting(itup)); \
+		ItemPointerSetBlockNumber(&((itup)->t_tid), (off)); \
+	} while(0)
+
+#define BTreeTupleGetPosting(itup) \
+	(ItemPointer) ((char*) (itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+	(BTreeTupleGetPosting(itup) + (n))
 
-/* Get/set downlink block number */
+/* Get/set downlink block number  */
 #define BTreeInnerTupleGetDownLink(itup) \
 	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
 #define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,40 +483,73 @@ typedef struct BTMetaPageData
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
 		( \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
 		IndexRelationGetNumberOfAttributes(rel) \
 	)
-#define BTreeTupleSetNAtts(itup, n) \
-	do { \
-		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
-	} while(0)
+
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int n)
+{
+	Assert(!BTreeTupleIsPosting(itup));
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	ItemPointerSetOffsetNumber(&itup->t_tid, n & BT_N_KEYS_OFFSET_MASK);
+}
 
 /*
- * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Get tiebreaker heap TID attribute, if any.  Works with both pivot and
+ * non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
  */
-#define BTreeTupleGetHeapTID(itup) \
-	( \
-	  (itup)->t_info & INDEX_ALT_TID_MASK && \
-	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
-	  ( \
-		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
-					   sizeof(ItemPointerData)) \
-	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
-	)
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+	if (BTreeTupleIsPivot(itup))
+	{
+		/* Pivot tuple heap TID representation? */
+		if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+			 BT_HEAP_TID_ATTR) != 0)
+			return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+								  sizeof(ItemPointerData));
+
+		/* Heap TID attribute was truncated */
+		return NULL;
+	}
+	else if (BTreeTupleIsPosting(itup))
+		return BTreeTupleGetPosting(itup);
+
+	return &(itup->t_tid);
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list tuple.  Works with
+ * non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxTID(IndexTuple itup)
+{
+	Assert(!BTreeTupleIsPivot(itup));
+
+	if (BTreeTupleIsPosting(itup))
+		return (ItemPointer) (BTreeTupleGetPosting(itup) +
+							  (BTreeTupleGetNPosting(itup) - 1));
+
+	return &(itup->t_tid);
+}
+
 /*
  * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * representation
  */
 #define BTreeTupleSetAltHeapTID(itup) \
 	do { \
-		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(BTreeTupleIsPivot(itup)); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
 								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
 	} while(0)
@@ -500,6 +690,13 @@ typedef struct BTInsertStateData
 	Buffer		buf;
 
 	/*
+	 * if _bt_binsrch_insert() found the location inside existing posting
+	 * list, save the position inside the list.  This will be -1 in rare cases
+	 * where the overlapping posting list is LP_DEAD.
+	 */
+	int			postingoff;
+
+	/*
 	 * Cache of bounds within the current buffer.  Only used for insertions
 	 * where _bt_check_unique is called.  See _bt_binsrch_insert and
 	 * _bt_findinsertloc for details.
@@ -534,7 +731,9 @@ typedef BTInsertStateData *BTInsertState;
  * If we are doing an index-only scan, we save the entire IndexTuple for each
  * matched item, otherwise only its heap TID and offset.  The IndexTuples go
  * into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array.  Posting list tuples store a version of the
+ * tuple that does not include the posting list, allowing the same key to be
+ * returned for each logical tuple associated with the posting list.
  */
 
 typedef struct BTScanPosItem	/* what we remember about each match */
@@ -563,9 +762,13 @@ typedef struct BTScanPosData
 
 	/*
 	 * If we are doing an index-only scan, nextTupleOffset is the first free
-	 * location in the associated tuple storage workspace.
+	 * location in the associated tuple storage workspace.  Posting list
+	 * tuples need postingTupleOffset to store the current location of the
+	 * tuple that is returned multiple times (once per heap TID in posting
+	 * list).
 	 */
 	int			nextTupleOffset;
+	int			postingTupleOffset;
 
 	/*
 	 * The items array is always ordered in index order (ie, increasing
@@ -578,7 +781,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxPostingIndexTuplesPerPage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -730,8 +933,14 @@ extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
  */
 extern bool _bt_doinsert(Relation rel, IndexTuple itup,
 						 IndexUniqueCheck checkUnique, Relation heapRel);
+extern IndexTuple _bt_posting_split(IndexTuple newitem, IndexTuple oposting,
+									OffsetNumber postingoff);
 extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
 extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
+extern void _bt_dedup_start_pending(BTDedupState *state, IndexTuple base,
+									OffsetNumber base_off);
+extern bool _bt_dedup_save_htid(BTDedupState *state, IndexTuple itup);
+Size _bt_dedup_finish_pending(Buffer buffer, BTDedupState* state, bool need_wal);
 
 /*
  * prototypes for functions in nbtsplitloc.c
@@ -762,6 +971,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems, Relation heapRel);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems,
+								OffsetNumber *updateitemnos,
+								IndexTuple *updated, int nupdateable,
 								BlockNumber lastBlockVacuumed);
 extern int	_bt_pagedel(Relation rel, Buffer buf);
 
@@ -812,6 +1023,8 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointer htids,
+										int nhtids);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 91b9ee0..ebb39de 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP_PAGE	0x50	/* deduplicate tuples on leaf page */
+/* 0x60 is unused */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -61,16 +62,21 @@ typedef struct xl_btree_metadata
  * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
  * Note that INSERT_META implies it's not a leaf page.
  *
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page (data contains the inserted tuple);
+ *				 if postingoff is set, this started out as an insertion
+ *				 into an existing posting tuple at the offset before
+ *				 offnum (i.e. it's a posting list split).  (REDO will
+ *				 have to update split posting list, too.)
  * Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
  * Backup Blk 2: xl_btree_metadata, if INSERT_META
  */
 typedef struct xl_btree_insert
 {
 	OffsetNumber offnum;
+	OffsetNumber postingoff;
 } xl_btree_insert;
 
-#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, postingoff) + sizeof(OffsetNumber))
 
 /*
  * On insert with split, we save all the items going into the right sibling
@@ -91,9 +97,19 @@ typedef struct xl_btree_insert
  *
  * Backup Blk 0: original page / new left page
  *
- * The left page's data portion contains the new item, if it's the _L variant.
- * An IndexTuple representing the high key of the left page must follow with
- * either variant.
+ * The left page's data portion contains the new item, if it's the _L variant
+ * (though _R variant page split records with a posting list split sometimes
+ * need to include newitem).  An IndexTuple representing the high key of the
+ * left page must follow in all cases.
+ *
+ * The newitem is actually an "original" newitem when a posting list split
+ * occurs that requires than the original posting list be updated in passing.
+ * Recovery recognizes this case when postingoff is set, and must use the
+ * posting offset to do an in-place update of the existing posting list that
+ * was actually split, and change the newitem to the "final" newitem.  This
+ * corresponds to the xl_btree_insert postingoff-is-set case.  postingoff
+ * won't be set when a posting list split occurs where both original posting
+ * list and newitem go on the right page.
  *
  * Backup Blk 1: new right page
  *
@@ -111,10 +127,26 @@ typedef struct xl_btree_split
 {
 	uint32		level;			/* tree level of page being split */
 	OffsetNumber firstright;	/* first item moved to right page */
-	OffsetNumber newitemoff;	/* new item's offset (useful for _L variant) */
+	OffsetNumber newitemoff;	/* new item's offset */
+	OffsetNumber postingoff;	/* offset inside orig posting tuple */
 } xl_btree_split;
 
-#define SizeOfBtreeSplit	(offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit	(offsetof(xl_btree_split, postingoff) + sizeof(OffsetNumber))
+
+/*
+ * When page is deduplicated, consecutive groups of tuples with equal keys are
+ * merged together into posting list tuples.
+ *
+ * The WAL record represents the interval that describes the posing tuple
+ * that should be added to the page.
+ */
+typedef struct xl_btree_dedup
+{
+	OffsetNumber baseoff;
+	OffsetNumber nitems;
+} xl_btree_dedup;
+
+#define SizeOfBtreeDedup 	(offsetof(xl_btree_dedup, nitems) + sizeof(OffsetNumber))
 
 /*
  * This is what we need to know about delete of individual leaf index tuples.
@@ -166,16 +198,27 @@ typedef struct xl_btree_reuse_page
  * block numbers aren't given.
  *
  * Note that the *last* WAL record in any vacuum of an index is allowed to
- * have a zero length array of offsets. Earlier records must have at least one.
+ * have a zero length array of target offsets (i.e. no deletes or updates).
+ * Earlier records must have at least one.
  */
 typedef struct xl_btree_vacuum
 {
 	BlockNumber lastBlockVacuumed;
 
-	/* TARGET OFFSET NUMBERS FOLLOW */
+	/*
+	 * This field helps us to find beginning of the updated versions of tuples
+	 * which follow array of offset numbers, needed when a posting list is
+	 * vacuumed without killing all of its logical tuples.
+	 */
+	uint32		nupdated;
+	uint32		ndeleted;
+
+	/* UPDATED TARGET OFFSET NUMBERS FOLLOW (if any) */
+	/* UPDATED TUPLES TO ADD BACK FOLLOW (if any) */
+	/* DELETED TARGET OFFSET NUMBERS FOLLOW (if any) */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(BlockNumber))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.
@@ -256,6 +299,8 @@ typedef struct xl_btree_newroot
 extern void btree_redo(XLogReaderState *record);
 extern void btree_desc(StringInfo buf, XLogReaderState *record);
 extern const char *btree_identify(uint8 info);
+extern void btree_xlog_startup(void);
+extern void btree_xlog_cleanup(void);
 extern void btree_mask(char *pagedata, BlockNumber blkno);
 
 #endif							/* NBTXLOG_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 3c0db2c..2b8c6c7 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -36,7 +36,7 @@ PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL,
 PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
 PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
 PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask)
 PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask)
 PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
 PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index ec47a22..71a03e3 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -212,3 +212,24 @@
    Memcheck:Cond
    fun:PyObject_Realloc
 }
+
+# Temporarily work around bug in datum_image_eq's handling of the cstring
+# (typLen == -2) case.  datumIsEqual() is not affected, but also doesn't handle
+# TOAST'ed values correctly.
+#
+# FIXME: Remove both suppressions when bug is fixed on master branch
+{
+   temporary_workaround_1
+   Memcheck:Addr1
+   fun:bcmp
+   fun:datum_image_eq
+   fun:_bt_keep_natts_fast
+}
+
+{
+   temporary_workaround_8
+   Memcheck:Addr8
+   fun:bcmp
+   fun:datum_image_eq
+   fun:_bt_keep_natts_fast
+}

#96

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Anastasia Lubennikova (#95)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Wed, Sep 25, 2019 at 8:05 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

Attached is v18. In this version bt_dedup_one_page() is refactored so that:
- no temp page is used, all updates are applied to the original page.
- each posting tuple wal logged separately.
This also allowed to simplify btree_xlog_dedup significantly.

This looks great! Even if it isn't faster than using a temp page
buffer, the flexibility seems like an important advantage. We can do
things like have the _bt_dedup_one_page() caller hint that
deduplication should start at a particular offset number. If that
doesn't work out by the time the end of the page is reached (whatever
"works out" may mean), then we can just start at the beginning of the
page, and work through the items we skipped over initially.

We still haven't added an "off" switch to deduplication, which seems
necessary. I suppose that this should look like GIN's "fastupdate"
storage parameter.

Why is it necessary to save this information somewhere but rel->rd_options,
while we can easily access this field from _bt_findinsertloc() and
_bt_load().

Maybe, but we also need to access a flag that says it's safe to use
deduplication. Obviously deduplication is not safe for datatypes like
numeric and text with a nondeterministic collation. The "is
deduplication safe for this index?" mechanism will probably work by
doing several catalog lookups. This doesn't seem like something we
want to do very often, especially with a buffer lock held -- ideally
it will be somewhere that's convenient to access.

Do we want to do that separately, and have a storage parameter that
says "I would like to use deduplication in principle, if it's safe"?
Or, do we store both pieces of information together, and forbid
setting the storage parameter to on when it's known to be unsafe for
the underlying opclasses used by the index? I don't know.

I think that you can start working on this without knowing exactly how
we'll do those catalog lookups. What you come up with has to work with
that before the patch can be committed, though.

--
Peter Geoghegan

#97

Anastasia Lubennikova

a.lubennikova@postgrespro.ru

over 6 years ago

In reply to: Peter Geoghegan (#96)

1 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

25.09.2019 22:14, Peter Geoghegan wrote:

We still haven't added an "off" switch to deduplication, which seems
necessary. I suppose that this should look like GIN's "fastupdate"
storage parameter.

Why is it necessary to save this information somewhere but rel->rd_options,
while we can easily access this field from _bt_findinsertloc() and
_bt_load().

Maybe, but we also need to access a flag that says it's safe to use
deduplication. Obviously deduplication is not safe for datatypes like
numeric and text with a nondeterministic collation. The "is
deduplication safe for this index?" mechanism will probably work by
doing several catalog lookups. This doesn't seem like something we
want to do very often, especially with a buffer lock held -- ideally
it will be somewhere that's convenient to access.

Do we want to do that separately, and have a storage parameter that
says "I would like to use deduplication in principle, if it's safe"?
Or, do we store both pieces of information together, and forbid
setting the storage parameter to on when it's known to be unsafe for
the underlying opclasses used by the index? I don't know.

I think that you can start working on this without knowing exactly how
we'll do those catalog lookups. What you come up with has to work with
that before the patch can be committed, though.

Attached is v19.

* It adds new btree reloption "deduplication".
I decided to refactor the code and move BtreeOptions into a separate
structure,
rather than adding new btree specific value to StdRelOptions.
Now it can be set even for indexes that do not support deduplication.
In that case it will be ignored. Should we add this check to option
validation?

* By default deduplication is on for non-unique indexes and off for
unique ones.

* New function _bt_dedup_is_possible() is intended to be a single place
to perform all the checks. Now it's just a stub to ensure that it works.

Is there a way to extract this from existing opclass information,
or we need to add new opclass field? Have you already started this work?
I recall there was another thread, but didn't manage to find it.

* I also integrated into this version your latest patch that enables
deduplication on unique indexes,
since now it can be easily switched on/off.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

v19-0001-Add-deduplication-to-nbtree.patchtext/x-patch; name=v19-0001-Add-deduplication-to-nbtree.patchDownload

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d67..d65e2a7 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -145,6 +145,7 @@ static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 									  bool tupleIsAlive, void *checkstate);
 static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 									 IndexTuple itup);
+static inline IndexTuple bt_posting_logical_tuple(IndexTuple itup, int n);
 static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 											   OffsetNumber offset);
@@ -419,12 +420,13 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		/*
 		 * Size Bloom filter based on estimated number of tuples in index,
 		 * while conservatively assuming that each block must contain at least
-		 * MaxIndexTuplesPerPage / 5 non-pivot tuples.  (Non-leaf pages cannot
-		 * contain non-pivot tuples.  That's okay because they generally make
-		 * up no more than about 1% of all pages in the index.)
+		 * MaxPostingIndexTuplesPerPage / 3 "logical" tuples.  heapallindexed
+		 * verification fingerprints posting list heap TIDs as plain non-pivot
+		 * tuples, complete with index keys.  This allows its heap scan to
+		 * behave as if posting lists do not exist.
 		 */
 		total_pages = RelationGetNumberOfBlocks(rel);
-		total_elems = Max(total_pages * (MaxIndexTuplesPerPage / 5),
+		total_elems = Max(total_pages * (MaxPostingIndexTuplesPerPage / 3),
 						  (int64) state->rel->rd_rel->reltuples);
 		/* Random seed relies on backend srandom() call to avoid repetition */
 		seed = random();
@@ -924,6 +926,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer	scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -994,29 +997,73 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
 			char	   *itid,
 					   *htid;
+			ItemPointer tid = BTreeTupleGetHeapTID(itup);
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+							ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("could not find tuple using search from root page in index \"%s\"",
 							RelationGetRelationName(state->rel)),
-					 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+					 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
 										itid, htid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is actually a posting list, make sure posting list TIDs
+		 * are in order.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer		current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+				current = BTreeTupleGetPostingN(itup, i);
+
+				if (ItemPointerCompare(current, &last) <= 0)
+				{
+					char	   *itid,
+							   *htid;
+
+					itid = psprintf("(%u,%u)", state->targetblock, offset);
+					htid = psprintf("(%u,%u)",
+									ItemPointerGetBlockNumberNoCheck(current),
+									ItemPointerGetOffsetNumberNoCheck(current));
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+												itid, htid,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1074,12 +1121,32 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				/* Fingerprint all elements as distinct "logical" tuples */
+				for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					IndexTuple	logtuple;
+
+					logtuple = bt_posting_logical_tuple(itup, i);
+					norm = bt_normalize_tuple(state, logtuple);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != logtuple)
+						pfree(norm);
+					pfree(logtuple);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
@@ -1087,7 +1154,8 @@ bt_target_page_check(BtreeCheckState *state)
 		 *
 		 * If there is a high key (if this is not the rightmost page on its
 		 * entire level), check that high key actually is upper bound on all
-		 * page items.
+		 * page items.  If this is a posting list tuple, we'll need to set
+		 * scantid to be highest TID in posting list.
 		 *
 		 * We prefer to check all items against high key rather than checking
 		 * just the last and trusting that the operator class obeys the
@@ -1127,6 +1195,9 @@ bt_target_page_check(BtreeCheckState *state)
 		 * tuple. (See also: "Notes About Data Representation" in the nbtree
 		 * README.)
 		 */
+		scantid = skey->scantid;
+		if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+			skey->scantid = BTreeTupleGetMaxTID(itup);
 		if (!P_RIGHTMOST(topaque) &&
 			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
 			  invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1221,7 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
+		skey->scantid = scantid;
 
 		/*
 		 * * Item order check *
@@ -1164,11 +1236,13 @@ bt_target_page_check(BtreeCheckState *state)
 					   *htid,
 					   *nitid,
 					   *nhtid;
+			ItemPointer tid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetHeapTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 			nitid = psprintf("(%u,%u)", state->targetblock,
 							 OffsetNumberNext(offset));
 
@@ -1177,9 +1251,11 @@ bt_target_page_check(BtreeCheckState *state)
 										  state->target,
 										  OffsetNumberNext(offset));
 			itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+			tid = BTreeTupleGetHeapTID(itup);
 			nhtid = psprintf("(%u,%u)",
-							 ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							 ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							 ItemPointerGetBlockNumberNoCheck(tid),
+							 ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1265,10 @@ bt_target_page_check(BtreeCheckState *state)
 										"higher index tid=%s (points to %s tid=%s) "
 										"page lsn=%X/%X.",
 										itid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										htid,
 										nitid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										nhtid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
@@ -1953,10 +2029,10 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
  * verification.  In particular, it won't try to normalize opclass-equal
  * datums with potentially distinct representations (e.g., btree/numeric_ops
  * index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though.  For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation.  Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
  */
 static IndexTuple
 bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -1969,6 +2045,9 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	IndexTuple	reformed;
 	int			i;
 
+	/* Caller should only pass "logical" non-pivot tuples here */
+	Assert(!BTreeTupleIsPosting(itup) && !BTreeTupleIsPivot(itup));
+
 	/* Easy case: It's immediately clear that tuple has no varlena datums */
 	if (!IndexTupleHasVarwidths(itup))
 		return itup;
@@ -2032,6 +2111,30 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 }
 
 /*
+ * Produce palloc()'d "logical" tuple for nth posting list entry.
+ *
+ * In general, deduplication is not supposed to change the logical contents of
+ * an index.  Multiple logical index tuples are folded together into one
+ * physical posting list index tuple when convenient.
+ *
+ * heapallindexed verification must normalize-away this variation in
+ * representation by converting posting list tuples into two or more "logical"
+ * tuples.  Each logical tuple must be fingerprinted separately -- there must
+ * be one logical tuple for each corresponding Bloom filter probe during the
+ * heap scan.
+ *
+ * Note: Caller needs to call bt_normalize_tuple() with returned tuple.
+ */
+static inline IndexTuple
+bt_posting_logical_tuple(IndexTuple itup, int n)
+{
+	Assert(BTreeTupleIsPosting(itup));
+
+	/* Returns non-posting-list tuple */
+	return BTreeFormPostingTuple(itup, BTreeTupleGetPostingN(itup, n), 1);
+}
+
+/*
  * Search for itup in index, starting from fast root page.  itup must be a
  * non-pivot tuple.  This is only supported with heapkeyspace indexes, since
  * we rely on having fully unique keys to find a match with only a single
@@ -2087,6 +2190,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		insertstate.itup = itup;
 		insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 		insertstate.itup_key = key;
+		insertstate.postingoff = 0;
 		insertstate.bounds_valid = false;
 		insertstate.buf = lbuf;
 
@@ -2094,7 +2198,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		offnum = _bt_binsrch_insert(state->rel, &insertstate);
 		/* Compare first >= matching item on leaf page, if any */
 		page = BufferGetPage(lbuf);
+		/* Should match on first heap TID when tuple has a posting list */
 		if (offnum <= PageGetMaxOffsetNumber(page) &&
+			insertstate.postingoff <= 0 &&
 			_bt_compare(state->rel, key, page, offnum) == 0)
 			exists = true;
 		_bt_relbuf(state->rel, lbuf);
@@ -2560,14 +2666,18 @@ static inline ItemPointer
 BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
 							bool nonpivot)
 {
-	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	ItemPointer result;
 	BlockNumber targetblock = state->targetblock;
 
-	if (result == NULL && nonpivot)
+	/* Shouldn't be called with heapkeyspace index */
+	Assert(state->heapkeyspace);
+	if (BTreeTupleIsPivot(itup) == nonpivot)
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
 						targetblock, RelationGetRelationName(state->rel))));
 
+	result = BTreeTupleGetHeapTID(itup);
+
 	return result;
 }
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 20f4ed3..3fdf3a5 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -158,6 +158,15 @@ static relopt_bool boolRelOpts[] =
 		},
 		true
 	},
+	{
+		{
+			"deduplication",
+			"Enables deduplication on btree index leaf pages",
+			RELOPT_KIND_BTREE,
+			ShareUpdateExclusiveLock
+		},
+		true
+	},
 	/* list terminator */
 	{{NULL}}
 };
@@ -1407,8 +1416,6 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		offsetof(StdRdOptions, user_catalog_table)},
 		{"parallel_workers", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, parallel_workers)},
-		{"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
-		offsetof(StdRdOptions, vacuum_cleanup_index_scale_factor)},
 		{"vacuum_index_cleanup", RELOPT_TYPE_BOOL,
 		offsetof(StdRdOptions, vacuum_index_cleanup)},
 		{"vacuum_truncate", RELOPT_TYPE_BOOL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 2599b5d..6e1dc59 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -276,6 +276,10 @@ BuildIndexValueDescription(Relation indexRelation,
 /*
  * Get the latestRemovedXid from the table entries pointed at by the index
  * tuples being deleted.
+ *
+ * Note: index access methods that don't consistently use the standard
+ * IndexTuple + heap TID item pointer representation will need to provide
+ * their own version of this function.
  */
 TransactionId
 index_compute_xid_horizon_for_tuples(Relation irel,
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6db203e..54cb9db 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
 like a hint bit for a heap tuple), but physically removing tuples requires
 exclusive lock.  In the current code we try to remove LP_DEAD tuples when
 we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already).  Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)
 
 This leaves the index in a state where it has no entry for a dead tuple
 that still exists in the heap.  This is not a problem for the current
@@ -710,6 +713,75 @@ the fallback strategy assumes that duplicates are mostly inserted in
 ascending heap TID order.  The page is split in a way that leaves the left
 half of the page mostly full, and the right half of the page mostly empty.
 
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits.  Deduplication alters
+the physical representation of tuples without changing the logical contents
+of the index, and without adding overhead to read queries.  Non-pivot
+tuples are folded together into a single physical tuple with a posting list
+(a simple array of heap TIDs with the standard item pointer format).
+Deduplication is always applied lazily, at the point where it would
+otherwise be necessary to perform a page split.  It occurs only when
+LP_DEAD items have been removed, as our last line of defense against
+splitting a leaf page.  We can set the LP_DEAD bit with posting list
+tuples, though only when all table tuples are known dead. (Bitmap scans
+cannot perform LP_DEAD bit setting, and are the common case with indexes
+that contain lots of duplicates, so this downside is considered
+acceptable.)
+
+Large groups of logical duplicates tend to appear together on the same leaf
+page due to the special duplicate logic used when choosing a split point.
+This facilitates lazy/dynamic deduplication.  Deduplication can reliably
+deduplicate a large localized group of duplicates before it can span
+multiple leaf pages.  Posting list tuples are subject to the same 1/3 of a
+page restriction as any other tuple.
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists.  A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple.  An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication.  Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple.  When the incoming
+tuple really does overlap with an existing posting list, a posting list
+split is performed.  Posting list splits work in a way that more or less
+preserves the illusion that all incoming tuples do not need to be merged
+with any existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list.  The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list.  The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list.  We make
+space in the posting list by removing the heap TID that became the new
+item.  The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists.  Also, overall
+space utilization is improved by keeping existing posting lists large.
+
+The representation of posting lists is identical to the posting lists used
+by GIN, so it would be straightforward to apply GIN's varbyte encoding
+compression scheme to individual posting lists.  Posting list compression
+would break the assumptions made by posting list splits about page space
+accounting, though, so it's not clear how compression could be integrated
+with nbtree.  Besides, posting list compression does not offer a compelling
+trade-off for nbtree, since in general nbtree is optimized for consistent
+performance with many concurrent readers and writers.  A major goal of
+nbtree's lazy approach to deduplication is to limit the performance impact
+of deduplication with random updates.  Even concurrent append-only inserts
+of the same key value will tend to have inserts of individual index tuples
+in an order that doesn't quite match heap TID order.  In general, delaying
+deduplication avoids many unnecessary posting list splits, and minimizes
+page level fragmentation.
+
 Notes About Data Representation
 -------------------------------
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b84bf1c..3ef44cd 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,21 +47,26 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   BTStack stack,
 						   IndexTuple itup,
 						   OffsetNumber newitemoff,
+						   int postingoff,
 						   bool split_only_page);
 static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 						Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
-						IndexTuple newitem);
+						IndexTuple newitem, IndexTuple orignewitem,
+						IndexTuple nposting, OffsetNumber postingoff);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 							  BTStack stack, bool is_root, bool is_only);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 						 OffsetNumber itup_off);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+							   Size newitemsz);
 
 /*
  *	_bt_doinsert() -- Handle insertion of a single index tuple in the tree.
  *
  *		This routine is called by the public interface routine, btinsert.
- *		By here, itup is filled in, including the TID.
+ *		By here, itup is filled in, including the TID.  Caller should be
+ *		prepared for us to scribble on 'itup'.
  *
  *		If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
  *		will allow duplicates.  Otherwise (UNIQUE_CHECK_YES or
@@ -123,6 +128,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	/* PageAddItem will MAXALIGN(), but be consistent */
 	insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 	insertstate.itup_key = itup_key;
+	insertstate.postingoff = 0;
 	insertstate.bounds_valid = false;
 	insertstate.buf = InvalidBuffer;
 
@@ -300,7 +306,7 @@ top:
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
 		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
-					   itup, newitemoff, false);
+					   itup, newitemoff, insertstate.postingoff, false);
 	}
 	else
 	{
@@ -428,14 +434,36 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 			if (!ItemIdIsDead(curitemid))
 			{
 				ItemPointerData htid;
+				bool		posting;
 				bool		all_dead;
+				bool		posting_all_dead;
+				int			npost;
+
 
 				if (_bt_compare(rel, itup_key, page, offset) != 0)
 					break;		/* we're past all the equal tuples */
 
 				/* okay, we gotta fetch the heap tuple ... */
 				curitup = (IndexTuple) PageGetItem(page, curitemid);
-				htid = curitup->t_tid;
+
+				if (!BTreeTupleIsPosting(curitup))
+				{
+					htid = curitup->t_tid;
+					posting = false;
+					posting_all_dead = true;
+				}
+				else
+				{
+					posting = true;
+					/* Initial assumption */
+					posting_all_dead = true;
+				}
+
+				npost = 0;
+doposttup:
+				if (posting)
+					htid = *BTreeTupleGetPostingN(curitup, npost);
+
 
 				/*
 				 * If we are doing a recheck, we expect to find the tuple we
@@ -446,6 +474,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					ItemPointerCompare(&htid, &itup->t_tid) == 0)
 				{
 					found = true;
+					posting_all_dead = false;
+					if (posting)
+						goto nextpost;
 				}
 
 				/*
@@ -511,8 +542,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					 * not part of this chain because it had a different index
 					 * entry.
 					 */
-					htid = itup->t_tid;
-					if (table_index_fetch_tuple_check(heapRel, &htid,
+					if (table_index_fetch_tuple_check(heapRel, &itup->t_tid,
 													  SnapshotSelf, NULL))
 					{
 						/* Normal case --- it's still live */
@@ -570,7 +600,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 													RelationGetRelationName(rel))));
 					}
 				}
-				else if (all_dead)
+				else if (all_dead && !posting)
 				{
 					/*
 					 * The conflicting tuple (or whole HOT chain) is dead to
@@ -589,6 +619,35 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					else
 						MarkBufferDirtyHint(insertstate->buf, true);
 				}
+				else if (posting)
+				{
+nextpost:
+					if (!all_dead)
+						posting_all_dead = false;
+
+					/* Iterate over single posting list tuple */
+					npost++;
+					if (npost < BTreeTupleGetNPosting(curitup))
+						goto doposttup;
+
+					/*
+					 * Mark posting tuple dead if all hot chains whose root is
+					 * contained in posting tuple have tuples that are all
+					 * dead
+					 */
+					if (posting_all_dead)
+					{
+						ItemIdMarkDead(curitemid);
+						opaque->btpo_flags |= BTP_HAS_GARBAGE;
+
+						if (nbuf != InvalidBuffer)
+							MarkBufferDirtyHint(nbuf, true);
+						else
+							MarkBufferDirtyHint(insertstate->buf, true);
+					}
+
+					/* Move on to next index tuple */
+				}
 			}
 		}
 
@@ -689,6 +748,7 @@ _bt_findinsertloc(Relation rel,
 	BTScanInsert itup_key = insertstate->itup_key;
 	Page		page = BufferGetPage(insertstate->buf);
 	BTPageOpaque lpageop;
+	OffsetNumber location;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -751,13 +811,25 @@ _bt_findinsertloc(Relation rel,
 
 		/*
 		 * If the target page is full, see if we can obtain enough space by
-		 * erasing LP_DEAD items
+		 * erasing LP_DEAD items.  If that doesn't work out, and if the index
+		 * deduplication is both possible and enabled, try deduplication.
 		 */
-		if (PageGetFreeSpace(page) < insertstate->itemsz &&
-			P_HAS_GARBAGE(lpageop))
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
 		{
-			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
-			insertstate->bounds_valid = false;
+			if (P_HAS_GARBAGE(lpageop))
+			{
+				_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+				insertstate->bounds_valid = false;
+			}
+
+			if (insertstate->itup_key->dedup_is_possible &&
+				BtreeGetDoDedupOption(rel) &&
+				PageGetFreeSpace(page) < insertstate->itemsz)
+			{
+				_bt_dedup_one_page(rel, insertstate->buf, heapRel,
+								   insertstate->itemsz);
+				insertstate->bounds_valid = false;	/* paranoia */
+			}
 		}
 	}
 	else
@@ -839,7 +911,37 @@ _bt_findinsertloc(Relation rel,
 	Assert(P_RIGHTMOST(lpageop) ||
 		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
 
-	return _bt_binsrch_insert(rel, insertstate);
+	location = _bt_binsrch_insert(rel, insertstate);
+
+	/*
+	 * Insertion is not prepared for the case where an LP_DEAD posting list
+	 * tuple must be split.  In the unlikely event that this happens, call
+	 * _bt_dedup_one_page() to force it to kill all LP_DEAD items.
+	 */
+	if (unlikely(insertstate->postingoff == -1))
+	{
+		Assert(insertstate->itup_key->dedup_is_possible);
+		/*
+		 * Don't check if the option is enabled,
+		 * since no actual deduplication will be done, just cleanup.
+		 * TODO Shouldn't we use _bt_vacuum_one_page() instead?
+		 */
+		_bt_dedup_one_page(rel, insertstate->buf, heapRel, 0);
+		Assert(!P_HAS_GARBAGE(lpageop));
+
+		/* Must reset insertstate ahead of new _bt_binsrch_insert() call */
+		insertstate->bounds_valid = false;
+		insertstate->postingoff = 0;
+		location = _bt_binsrch_insert(rel, insertstate);
+
+		/*
+		 * Might still have to split some other posting list now, but that
+		 * should never be LP_DEAD
+		 */
+		Assert(insertstate->postingoff >= 0);
+	}
+
+	return location;
 }
 
 /*
@@ -900,15 +1002,81 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
 	insertstate->bounds_valid = false;
 }
 
+/*
+ * Form a new posting list during a posting split.
+ *
+ * If caller determines that its new tuple 'newitem' is a duplicate with a
+ * heap TID that falls inside the range of an existing posting list tuple
+ * 'oposting', it must generate a new posting tuple to replace the original.
+ * The new posting list is guaranteed to be the same size as the original.
+ * Caller must also change newitem to have the heap TID of the rightmost TID
+ * in the original posting list.  Both steps are always handled by calling
+ * here.
+ *
+ * Returns new posting list palloc()'d in caller's context.  Also modifies
+ * caller's newitem to contain final/effective heap TID, which is what caller
+ * actually inserts on the page.
+ *
+ * Exported for use by recovery.  Note that recovery path must recreate the
+ * same version of newitem that is passed here on the primary, even though
+ * that differs from the final newitem actually added to the page.  This
+ * optimization avoids explicit WAL-logging of entire posting lists, which
+ * tend to be rather large.
+ */
+IndexTuple
+_bt_posting_split(IndexTuple newitem, IndexTuple oposting,
+				  OffsetNumber postingoff)
+{
+	int			nhtids;
+	char	   *replacepos;
+	char	   *rightpos;
+	Size		nbytes;
+	IndexTuple	nposting;
+
+	Assert(BTreeTupleIsPosting(oposting));
+	nhtids = BTreeTupleGetNPosting(oposting);
+	Assert(postingoff < nhtids);
+
+	nposting = CopyIndexTuple(oposting);
+	replacepos = (char *) BTreeTupleGetPostingN(nposting, postingoff);
+	rightpos = replacepos + sizeof(ItemPointerData);
+	nbytes = (nhtids - postingoff - 1) * sizeof(ItemPointerData);
+
+	/*
+	 * Move item pointers in posting list to make a gap for the new item's
+	 * heap TID (shift TIDs one place to the right, losing original rightmost
+	 * TID).
+	 */
+	memmove(rightpos, replacepos, nbytes);
+
+	/*
+	 * Fill the gap with the TID of the new item.
+	 */
+	ItemPointerCopy(&newitem->t_tid, (ItemPointer) replacepos);
+
+	/*
+	 * Copy original (not new original) posting list's last TID into new item
+	 */
+	ItemPointerCopy(BTreeTupleGetPostingN(oposting, nhtids - 1),
+					&newitem->t_tid);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxTID(nposting),
+							  BTreeTupleGetHeapTID(newitem)) < 0);
+	Assert(BTreeTupleGetNPosting(nposting) == BTreeTupleGetNPosting(oposting));
+
+	return nposting;
+}
+
 /*----------
  *	_bt_insertonpg() -- Insert a tuple on a particular page in the index.
  *
  *		This recursive procedure does the following things:
  *
+ *			+  if necessary, splits an existing posting list on page.
+ *			   This is only needed when 'postingoff' is non-zero.
  *			+  if necessary, splits the target page, using 'itup_key' for
  *			   suffix truncation on leaf pages (caller passes NULL for
  *			   non-leaf pages).
- *			+  inserts the tuple.
+ *			+  inserts the new tuple (could be from split posting list).
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
  *			   right using information stored in the parent stack).
@@ -918,7 +1086,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		On entry, we must have the correct buffer in which to do the
  *		insertion, and the buffer must be pinned and write-locked.  On return,
- *		we will have dropped both the pin and the lock on the buffer.
+ *		we will have dropped both the pin and the lock on the buffer.  Caller
+ *		should be prepared for us to scribble on 'itup'.
  *
  *		This routine only performs retail tuple insertions.  'itup' should
  *		always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +1105,15 @@ _bt_insertonpg(Relation rel,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
+			   int postingoff,
 			   bool split_only_page)
 {
 	Page		page;
 	BTPageOpaque lpageop;
 	Size		itemsz;
+	IndexTuple	oposting;
+	IndexTuple	origitup = NULL;
+	IndexTuple	nposting = NULL;
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1127,8 @@ _bt_insertonpg(Relation rel,
 	Assert(P_ISLEAF(lpageop) ||
 		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
+	/* retail insertions of posting list tuples are disallowed */
+	Assert(!BTreeTupleIsPosting(itup));
 
 	/* The caller should've finished any incomplete splits already. */
 	if (P_INCOMPLETE_SPLIT(lpageop))
@@ -965,6 +1140,46 @@ _bt_insertonpg(Relation rel,
 								 * need to be consistent */
 
 	/*
+	 * Do we need to split an existing posting list item?
+	 */
+	if (postingoff != 0)
+	{
+		ItemId		itemid = PageGetItemId(page, newitemoff);
+
+		/*
+		 * The new tuple is a duplicate with a heap TID that falls inside the
+		 * range of an existing posting list tuple, so split posting list.
+		 *
+		 * Posting list splits always replace some existing TID in the posting
+		 * list with the new item's heap TID (based on a posting list offset
+		 * from caller) by removing rightmost heap TID from posting list.  The
+		 * new item's heap TID is swapped with that rightmost heap TID, almost
+		 * as if the tuple inserted never overlapped with a posting list in
+		 * the first place.  This allows the insertion and page split code to
+		 * have minimal special case handling of posting lists.
+		 *
+		 * The only extra handling required is to overwrite the original
+		 * posting list with nposting, which is guaranteed to be the same size
+		 * as the original, keeping the page space accounting simple.  This
+		 * takes place in either the page insert or page split critical
+		 * section.
+		 */
+		Assert(P_ISLEAF(lpageop));
+		Assert(!ItemIdIsDead(itemid));
+		Assert(postingoff > 0);
+		oposting = (IndexTuple) PageGetItem(page, itemid);
+
+		/* save a copy of itup with unchanged TID to write it into xlog record */
+		origitup = CopyIndexTuple(itup);
+		nposting = _bt_posting_split(itup, oposting, postingoff);
+
+		Assert(BTreeTupleGetNPosting(nposting) ==
+			   BTreeTupleGetNPosting(oposting));
+		/* Alter new item offset, since effective new item changed */
+		newitemoff = OffsetNumberNext(newitemoff);
+	}
+
+	/*
 	 * Do we need to split the page to fit the item on it?
 	 *
 	 * Note: PageGetFreeSpace() subtracts sizeof(ItemIdData) from its result,
@@ -996,7 +1211,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+						 origitup, nposting, postingoff);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1075,6 +1291,18 @@ _bt_insertonpg(Relation rel,
 			elog(PANIC, "failed to add new item to block %u in index \"%s\"",
 				 itup_blkno, RelationGetRelationName(rel));
 
+		if (nposting)
+		{
+			/*
+			 * Posting list split requires an in-place update of the existing
+			 * posting list
+			 */
+			Assert(P_ISLEAF(lpageop));
+			Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+				   MAXALIGN(IndexTupleSize(nposting)));
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+		}
+
 		MarkBufferDirty(buf);
 
 		if (BufferIsValid(metabuf))
@@ -1116,6 +1344,7 @@ _bt_insertonpg(Relation rel,
 			XLogRecPtr	recptr;
 
 			xlrec.offnum = itup_off;
+			xlrec.postingoff = postingoff;
 
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
@@ -1144,6 +1373,7 @@ _bt_insertonpg(Relation rel,
 				xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 				xlmeta.last_cleanup_num_heap_tuples =
 					metad->btm_last_cleanup_num_heap_tuples;
+				xlmeta.btm_dedup_is_possible = metad->btm_dedup_is_possible;
 
 				XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 				XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));
@@ -1152,7 +1382,19 @@ _bt_insertonpg(Relation rel,
 			}
 
 			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
-			XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+
+			/*
+			 * We always write newitem to the page, but when there is an
+			 * original newitem due to a posting list split then we log the
+			 * original item instead.  REDO routine must reconstruct the final
+			 * newitem at the same time it reconstructs nposting.
+			 */
+			if (postingoff == 0)
+				XLogRegisterBufData(0, (char *) itup,
+									IndexTupleSize(itup));
+			else
+				XLogRegisterBufData(0, (char *) origitup,
+									IndexTupleSize(origitup));
 
 			recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
@@ -1194,6 +1436,13 @@ _bt_insertonpg(Relation rel,
 			_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
 			RelationSetTargetBlock(rel, cachedBlock);
 	}
+
+	/* be tidy */
+	if (postingoff != 0)
+	{
+		pfree(nposting);
+		pfree(origitup);
+	}
 }
 
 /*
@@ -1209,12 +1458,25 @@ _bt_insertonpg(Relation rel,
  *		This function will clear the INCOMPLETE_SPLIT flag on it, and
  *		release the buffer.
  *
+ *		orignewitem, nposting, and postingoff are needed when an insert of
+ *		orignewitem results in both a posting list split and a page split.
+ *		newitem and nposting are replacements for orignewitem and the
+ *		existing posting list on the page respectively.  These extra
+ *		posting list split details are used here in the same way as they
+ *		are used in the more common case where a posting list split does
+ *		not coincide with a page split.  We need to deal with posting list
+ *		splits directly in order to ensure that everything that follows
+ *		from the insert of orignewitem is handled as a single atomic
+ *		operation (though caller's insert of a new pivot/downlink into
+ *		parent page will still be a separate operation).
+ *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
  */
 static Buffer
 _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+		  IndexTuple orignewitem, IndexTuple nposting, OffsetNumber postingoff)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1236,6 +1498,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	OffsetNumber firstright;
 	OffsetNumber maxoff;
 	OffsetNumber i;
+	OffsetNumber replacepostingoff = InvalidOffsetNumber;
 	bool		newitemonleft,
 				isleaf;
 	IndexTuple	lefthikey;
@@ -1243,6 +1506,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
 	/*
+	 * Determine offset number of existing posting list on page when a split
+	 * of a posting list needs to take place as the page is split
+	 */
+	if (nposting != NULL)
+		replacepostingoff = OffsetNumberPrev(newitemoff);
+
+	/*
 	 * origpage is the original page to be split.  leftpage is a temporary
 	 * buffer that receives the left-sibling data, which will be copied back
 	 * into origpage on success.  rightpage is the new page that will receive
@@ -1273,6 +1543,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	 * newitemoff == firstright.  In all other cases it's clear which side of
 	 * the split every tuple goes on from context.  newitemonleft is usually
 	 * (but not always) redundant information.
+	 *
+	 * Note: In theory, the split point choice logic should operate against a
+	 * version of the page that already replaced the posting list at offset
+	 * replacepostingoff with nposting where applicable.  We don't bother with
+	 * that, though.  Both versions of the posting list must be the same size,
+	 * and both will have the same base tuple key values, so split point
+	 * choice is never affected.
 	 */
 	firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
 								  newitem, &newitemonleft);
@@ -1340,6 +1617,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, firstright);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		/* Behave as if origpage posting list has already been swapped */
+		if (firstright == replacepostingoff)
+			item = nposting;
 	}
 
 	/*
@@ -1373,6 +1653,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
 			itemid = PageGetItemId(origpage, lastleftoff);
 			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+			/* Behave as if origpage posting list has already been swapped */
+			if (lastleftoff == replacepostingoff)
+				lastleft = nposting;
 		}
 
 		Assert(lastleft != item);
@@ -1480,8 +1763,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 
+		/*
+		 * did caller pass new replacement posting list tuple due to posting
+		 * list split?
+		 */
+		if (i == replacepostingoff)
+		{
+			/*
+			 * swap origpage posting list with post-posting-list-split version
+			 * from caller
+			 */
+			Assert(isleaf);
+			Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+			item = nposting;
+		}
+
 		/* does new item belong before this one? */
-		if (i == newitemoff)
+		else if (i == newitemoff)
 		{
 			if (newitemonleft)
 			{
@@ -1650,8 +1948,12 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		XLogRecPtr	recptr;
 
 		xlrec.level = ropaque->btpo.level;
+		/* See comments below on newitem, orignewitem, and posting lists */
 		xlrec.firstright = firstright;
 		xlrec.newitemoff = newitemoff;
+		xlrec.postingoff = InvalidOffsetNumber;
+		if (replacepostingoff < firstright)
+			xlrec.postingoff = postingoff;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1670,11 +1972,46 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		 * because it's included with all the other items on the right page.)
 		 * Show the new item as belonging to the left page buffer, so that it
 		 * is not stored if XLogInsert decides it needs a full-page image of
-		 * the left page.  We store the offset anyway, though, to support
-		 * archive compression of these records.
+		 * the left page.  We always store newitemoff in record, though.
+		 *
+		 * The details are often slightly different for page splits that
+		 * coincide with a posting list split.  If both the replacement
+		 * posting list and newitem go on the right page, then we don't need
+		 * to log anything extra, just like the simple !newitemonleft
+		 * no-posting-split case (postingoff isn't set in the WAL record, so
+		 * recovery can't even tell the difference).  Otherwise, we set
+		 * postingoff and log orignewitem instead of newitem, despite having
+		 * actually inserted newitem.  Recovery must reconstruct nposting and
+		 * newitem by repeating the actions of our caller (i.e. by passing
+		 * original posting list and orignewitem to _bt_posting_split()).
+		 *
+		 * Note: It's possible that our page split point is the point that
+		 * makes the posting list lastleft and newitem firstright.  This is
+		 * the only case where we log orignewitem despite newitem going on the
+		 * right page.  If XLogInsert decides that it can omit orignewitem due
+		 * to logging a full-page image of the left page, everything still
+		 * works out, since recovery only needs to log orignewitem for items
+		 * on the left page (just like the regular newitem-logged case).
 		 */
-		if (newitemonleft)
-			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+		if (newitemonleft || xlrec.postingoff != InvalidOffsetNumber)
+		{
+			if (xlrec.postingoff == InvalidOffsetNumber)
+			{
+				/* Must WAL-log newitem, since it's on left page */
+				Assert(newitemonleft);
+				Assert(orignewitem == NULL && nposting == NULL);
+				XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+			}
+			else
+			{
+				/* Must WAL-log orignewitem following posting list split */
+				Assert(newitemonleft || firstright == newitemoff);
+				Assert(ItemPointerCompare(&orignewitem->t_tid,
+										  &newitem->t_tid) < 0);
+				XLogRegisterBufData(0, (char *) orignewitem,
+									MAXALIGN(IndexTupleSize(orignewitem)));
+			}
+		}
 
 		/* Log the left page's new high key */
 		itemid = PageGetItemId(origpage, P_HIKEY);
@@ -1834,7 +2171,7 @@ _bt_insert_parent(Relation rel,
 
 		/* Recursively insert into the parent */
 		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
-					   new_item, stack->bts_offset + 1,
+					   new_item, stack->bts_offset + 1, 0,
 					   is_only);
 
 		/* be tidy */
@@ -2190,6 +2527,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 		md.fastlevel = metad->btm_level;
 		md.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 		md.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+		md.btm_dedup_is_possible = metad->btm_dedup_is_possible;
 
 		XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -2304,6 +2642,394 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * Note: if we didn't find any LP_DEAD items, then the page's
 	 * BTP_HAS_GARBAGE hint bit is falsely set.  We do not bother expending a
 	 * separate write to clear it, however.  We will clear it when we split
-	 * the page.
+	 * the page (or when deduplication runs).
 	 */
 }
+
+/*
+ * Try to deduplicate items to free some space.  If we don't proceed with
+ * deduplication, buffer will contain old state of the page.
+ *
+ * 'itemsz' is the size of the inserter caller's incoming/new tuple, not
+ * including line pointer overhead.  This is the amount of space we'll need to
+ * free in order to let caller avoid splitting the page.
+ *
+ * This function should be called after LP_DEAD items were removed by
+ * _bt_vacuum_one_page() to prevent a page split.  (It's possible that we'll
+ * have to kill additional LP_DEAD items, but that should be rare.)
+ */
+static void
+_bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+				   Size newitemsz)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buffer);
+	BTPageOpaque oopaque;
+	BTDedupState *state = NULL;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+	OffsetNumber deletable[MaxIndexTuplesPerPage];
+	int			ndeletable = 0;
+	Size		pagesaving = 0;
+
+	oopaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	/* init deduplication state needed to build posting tuples */
+	state = (BTDedupState *) palloc(sizeof(BTDedupState));
+	state->deduplicate = true;
+
+	state->maxitemsize = BTMaxItemSize(page);
+	/* Metadata about current pending posting list */
+	state->htids = NULL;
+	state->nhtids = 0;
+	state->nitems = 0;
+	state->alltupsize = 0;
+	/* Metadata about based tuple of current pending posting list */
+	state->base = NULL;
+	state->baseoff = InvalidOffsetNumber;
+	state->basetupsize = 0;
+
+	minoff = P_FIRSTDATAKEY(oopaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/*
+	 * Delete dead tuples if any. We cannot simply skip them in the cycle
+	 * below, because it's necessary to generate special Xlog record
+	 * containing such tuples to compute latestRemovedXid on a standby server
+	 * later.
+	 *
+	 * This should not affect performance, since it only can happen in a rare
+	 * situation when BTP_HAS_GARBAGE flag was not set and _bt_vacuum_one_page
+	 * was not called, or _bt_vacuum_one_page didn't remove all dead items.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+
+		if (ItemIdIsDead(itemid))
+			deletable[ndeletable++] = offnum;
+	}
+
+	if (ndeletable > 0)
+	{
+		/*
+		 * Skip duplication in rare cases where there were LP_DEAD items
+		 * encountered here when that frees sufficient space for caller to
+		 * avoid a page split
+		 */
+		_bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+		if (PageGetFreeSpace(page) >= newitemsz)
+		{
+			pfree(state);
+			return;
+		}
+
+		/* Continue with deduplication */
+		minoff = P_FIRSTDATAKEY(oopaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+	}
+
+	/* Make sure that new page won't have garbage flag set */
+	oopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+	newitemsz += sizeof(ItemIdData);
+	/* Conservatively size array */
+	state->htids = palloc(state->maxitemsize);
+
+	/*
+	 * Iterate over tuples on the page, try to deduplicate them into posting
+	 * lists and insert into new page.
+	 * NOTE It's essential to calculate max offset on each iteration,
+	 * since it could have changed if several items were replaced with a
+	 * single posting tuple.
+	 */
+	offnum = minoff;
+	while (offnum <= PageGetMaxOffsetNumber(page))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(!ItemIdIsDead(itemid));
+
+		if (state->nitems == 0)
+		{
+			/*
+			 * No previous/base tuple for the data item -- use the data
+			 * item as base tuple of pending posting list
+			 */
+			_bt_dedup_start_pending(state, itup, offnum);
+		}
+		else if (state->deduplicate &&
+				 _bt_keep_natts_fast(rel, state->base, itup) > natts &&
+				 _bt_dedup_save_htid(state, itup))
+		{
+			/*
+			 * Tuple is equal to base tuple of pending posting list, and
+			 * merging itup into pending posting list won't exceed the
+			 * BTMaxItemSize() limit.  Heap TID(s) for itup have been saved in
+			 * state.  The next iteration will also end up here if it's
+			 * possible to merge the next tuple into the same pending posting
+			 * list.
+			 */
+		}
+		else
+		{
+			/*
+			 * Tuple is not equal to pending posting list tuple, or
+			 * BTMaxItemSize() limit was reached.
+			 *
+			 * If state contains pending posting list with more than one item,
+			 * form new posting tuple, and update the page,
+			 * otherwise, just reset the state and move on.
+			 */
+			pagesaving += _bt_dedup_finish_pending(buffer, state, RelationNeedsWAL(rel));
+			/*
+			 * When we have deduplicated enough to avoid page split, don't
+			 * bother merging together existing tuples to create new posting
+			 * lists.
+			 *
+			 * Note: We deliberately add as many heap TIDs as possible to a
+			 * pending posting list by performing this check at this point
+			 * (just before a new pending posting lists is created).  It would
+			 * be possible to make the final new posting list for each
+			 * successful page deduplication operation as small as possible
+			 * while still avoiding a page split for caller.  We don't want to
+			 * repeatedly merge posting lists around the same range of heap
+			 * TIDs, though.
+			 *
+			 * (Besides, the total number of new posting lists created is the
+			 * cost that this check is supposed to minimize -- there is no
+			 * great reason to be concerned about the absolute number of
+			 * existing tuples that can be killed/replaced.)
+			 */
+#if 0
+			/* Actually, don't do that */
+			/* TODO: Make a final decision on this */
+			if (pagesaving >= newitemsz)
+				state->deduplicate = false;
+#endif
+
+			/* Continue iteration from base tuple's offnum */
+			offnum = state->baseoff;
+
+		}
+
+		offnum = OffsetNumberNext(offnum);
+	}
+
+	/*
+	 * Handle the last item, if pending posting list is not empty.
+	 */
+	if (state->nitems != 0)
+		pagesaving += _bt_dedup_finish_pending(buffer, state, RelationNeedsWAL(rel));
+
+	/* be tidy */
+	pfree(state->htids);
+	pfree(state);
+}
+
+/*
+ * Create a new pending posting list tuple based on caller's tuple.
+ *
+ * Every tuple processed by the deduplication routines either becomes the base
+ * tuple for a posting list, or gets its heap TID(s) accepted into a pending
+ * posting list.  A tuple that starts out as the base tuple for a posting list
+ * will only actually be rewritten within _bt_dedup_finish_pending() when
+ * there was at least one successful call to _bt_dedup_save_htid().
+ *
+ * Exported for use by nbtsort.c and recovery.
+ */
+void
+_bt_dedup_start_pending(BTDedupState *state, IndexTuple base,
+						OffsetNumber baseoff)
+{
+	Assert(state->nhtids == 0);
+	Assert(state->nitems == 0);
+
+	/*
+	 * Copy heap TIDs from new base tuple for new candidate posting list into
+	 * ipd array.  Assume that we'll eventually create a new posting tuple by
+	 * merging later tuples with this existing one, though we may not.
+	 */
+	if (!BTreeTupleIsPosting(base))
+	{
+		memcpy(state->htids, base, sizeof(ItemPointerData));
+		state->nhtids = 1;
+		/* Save size of tuple without any posting list */
+		state->basetupsize = IndexTupleSize(base);
+	}
+	else
+	{
+		int			nposting;
+
+		nposting = BTreeTupleGetNPosting(base);
+		memcpy(state->htids, BTreeTupleGetPosting(base),
+			   sizeof(ItemPointerData) * nposting);
+		state->nhtids = nposting;
+		/* Save size of tuple without any posting list */
+		state->basetupsize = BTreeTupleGetPostingOffset(base);
+	}
+
+	/*
+	 * Save new base tuple itself -- it'll be needed if we actually create a
+	 * new posting list from new pending posting list.
+	 *
+	 * Must maintain size of all tuples (including line pointer overhead) to
+	 * calculate space savings on page within _bt_dedup_finish_pending().
+	 * Also, save number of base tuple logical tuples so that we can save
+	 * cycles in the common case where an existing posting list can't or won't
+	 * be merged with other tuples on the page.
+	 */
+	state->nitems = 1;
+	state->base = base;
+	state->baseoff = baseoff;
+	state->alltupsize = MAXALIGN(IndexTupleSize(base)) + sizeof(ItemIdData);
+	/* Also save baseoff in pending state for interval */
+	state->interval.baseoff = state->baseoff;
+}
+
+/*
+ * Save itup heap TID(s) into pending posting list where possible.
+ *
+ * Returns bool indicating if the pending posting list managed by state has
+ * itup's heap TID(s) saved.  When this is false, enlarging the pending
+ * posting list by the required amount would exceed the maxitemsize limit, so
+ * caller must finish the pending posting list tuple.  (Generally itup becomes
+ * the base tuple of caller's new pending posting list).
+ *
+ * Exported for use by nbtsort.c and recovery.
+ */
+bool
+_bt_dedup_save_htid(BTDedupState *state, IndexTuple itup)
+{
+	int			nhtids;
+	ItemPointer htids;
+	Size		mergedtupsz;
+
+	if (!BTreeTupleIsPosting(itup))
+	{
+		nhtids = 1;
+		htids = &itup->t_tid;
+	}
+	else
+	{
+		nhtids = BTreeTupleGetNPosting(itup);
+		htids = BTreeTupleGetPosting(itup);
+	}
+
+	/*
+	 * Don't append (have caller finish pending posting list as-is) if
+	 * appending heap TID(s) from itup would put us over limit
+	 */
+	mergedtupsz = MAXALIGN(state->basetupsize +
+						   (state->nhtids + nhtids) *
+						   sizeof(ItemPointerData));
+
+	if (mergedtupsz > state->maxitemsize)
+		return false;
+
+	/*
+	 * Save heap TIDs to pending posting list tuple -- itup can be merged into
+	 * pending posting list
+	 */
+	state->nitems++;
+	memcpy(state->htids + state->nhtids, htids,
+		   sizeof(ItemPointerData) * nhtids);
+	state->nhtids += nhtids;
+	state->alltupsize += MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
+
+	return true;
+}
+
+/*
+ * Finalize pending posting list tuple, and add it to the page.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * Returns space saving from deduplicating to make a new posting list tuple.
+ * Note that this includes line pointer overhead.  This is zero in the case
+ * where no deduplication was possible.
+ *
+ * Exported for use by recovery.
+ */
+Size
+_bt_dedup_finish_pending(Buffer buffer, BTDedupState *state, bool need_wal)
+{
+	Size		spacesaving = 0;
+	Page		page = BufferGetPage(buffer);
+
+	Assert(state->nitems > 0);
+	Assert(state->nitems <= state->nhtids);
+	Assert(state->interval.baseoff == state->baseoff);
+
+	if (state->nitems > 1)
+	{
+		IndexTuple	final;
+		Size		finalsz;
+		OffsetNumber offnum;
+		OffsetNumber deletable[MaxOffsetNumber];
+		int 		 ndeletable = 0;
+
+		/* find all tuples that will be replaced with this new posting tuple */
+		for (offnum = state->baseoff;
+			 offnum < state->baseoff + state->nitems;
+			 offnum = OffsetNumberNext(offnum))
+			deletable[ndeletable++] = offnum;
+
+		/* Form a tuple with a posting list */
+		final = BTreeFormPostingTuple(state->base, state->htids,
+									state->nhtids);
+		finalsz = IndexTupleSize(final);
+		spacesaving = state->alltupsize - (finalsz + sizeof(ItemIdData));
+		/* Must have saved some space */
+		Assert(spacesaving > 0 && spacesaving < BLCKSZ);
+
+		/* Save final number of items for posting list */
+		state->interval.nitems = state->nitems;
+
+		Assert(finalsz <= state->maxitemsize);
+		Assert(finalsz == MAXALIGN(IndexTupleSize(final)));
+
+		START_CRIT_SECTION();
+
+		/* Delete items to replace */
+		PageIndexMultiDelete(page, deletable, ndeletable);
+		/* Insert posting tuple */
+		if (PageAddItem(page, (Item) final, finalsz, state->baseoff, false,
+						false) == InvalidOffsetNumber)
+			elog(ERROR, "deduplication failed to add tuple to page");
+
+		MarkBufferDirty(buffer);
+
+		/* Log deduplicated items */
+		if (need_wal)
+		{
+			XLogRecPtr	recptr;
+			xl_btree_dedup xlrec_dedup;
+
+			xlrec_dedup.baseoff = state->interval.baseoff;
+			xlrec_dedup.nitems = state->interval.nitems;
+
+			XLogBeginInsert();
+			XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+			XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+			recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP_PAGE);
+
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
+
+		pfree(final);
+	}
+
+	/* Reset state for next pending posting list */
+	state->nhtids = 0;
+	state->nitems = 0;
+	state->alltupsize = 0;
+
+	return spacesaving;
+}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 268f869..1b1134c2 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -42,12 +43,17 @@ static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
 								   BlockNumber *target, BlockNumber *rightsib);
 static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
 							   TransactionId latestRemovedXid);
+static TransactionId _bt_compute_xid_horizon_for_tuples(Relation rel,
+														Relation heapRel,
+														Buffer buf,
+														OffsetNumber *itemnos,
+														int nitems);
 
 /*
  *	_bt_initmetapage() -- Fill a page buffer with a correct metapage image
  */
 void
-_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
+_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level, bool dedup_is_possible)
 {
 	BTMetaPageData *metad;
 	BTPageOpaque metaopaque;
@@ -63,6 +69,7 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
 	metad->btm_fastlevel = level;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+	metad->btm_dedup_is_possible = dedup_is_possible;
 
 	metaopaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	metaopaque->btpo_flags = BTP_META;
@@ -213,6 +220,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 		md.fastlevel = metad->btm_fastlevel;
 		md.oldest_btpo_xact = oldestBtpoXact;
 		md.last_cleanup_num_heap_tuples = numHeapTuples;
+		md.btm_dedup_is_possible = metad->btm_dedup_is_possible;
 
 		XLogRegisterBufData(0, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -394,6 +402,7 @@ _bt_getroot(Relation rel, int access)
 			md.fastlevel = 0;
 			md.oldest_btpo_xact = InvalidTransactionId;
 			md.last_cleanup_num_heap_tuples = -1.0;
+			md.btm_dedup_is_possible = metad->btm_dedup_is_possible;
 
 			XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -684,6 +693,59 @@ _bt_heapkeyspace(Relation rel)
 }
 
 /*
+ *	_bt_get_dedupispossible() -- is deduplication possible for the index?
+ * 	get information from metapage
+ */
+bool
+_bt_getdedupispossible(Relation rel)
+{
+	BTMetaPageData *metad;
+
+	if (rel->rd_amcache == NULL)
+	{
+		Buffer		metabuf;
+
+		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+		metad = _bt_getmeta(rel, metabuf);
+
+		/*
+		 * If there's no root page yet, _bt_getroot() doesn't expect a cache
+		 * to be made, so just stop here.  (XXX perhaps _bt_getroot() should
+		 * be changed to allow this case.)
+		 */
+		if (metad->btm_root == P_NONE)
+		{
+			_bt_relbuf(rel, metabuf);
+			return metad->btm_dedup_is_possible;;
+		}
+
+		/*
+		 * Cache the metapage data for next time
+		 *
+		 * An on-the-fly version upgrade performed by _bt_upgrademetapage()
+		 * can change the nbtree version for an index without invalidating any
+		 * local cache.  This is okay because it can only happen when moving
+		 * from version 2 to version 3, both of which are !heapkeyspace
+		 * versions.
+		 */
+		rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
+											 sizeof(BTMetaPageData));
+		memcpy(rel->rd_amcache, metad, sizeof(BTMetaPageData));
+		_bt_relbuf(rel, metabuf);
+	}
+
+	/* Get cached page */
+	metad = (BTMetaPageData *) rel->rd_amcache;
+	/* We shouldn't have cached it if any of these fail */
+	Assert(metad->btm_magic == BTREE_MAGIC);
+	Assert(metad->btm_version >= BTREE_MIN_VERSION);
+	Assert(metad->btm_version <= BTREE_VERSION);
+	Assert(metad->btm_fastroot != P_NONE);
+
+	return metad->btm_dedup_is_possible;
+}
+
+/*
  *	_bt_checkpage() -- Verify that a freshly-read page looks sane.
  */
 void
@@ -983,14 +1045,52 @@ _bt_page_recyclable(Page page)
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
 					OffsetNumber *itemnos, int nitems,
+					OffsetNumber *updateitemnos,
+					IndexTuple *updated, int nupdatable,
 					BlockNumber lastBlockVacuumed)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	Size		itemsz;
+	Size		updated_sz = 0;
+	char	   *updated_buf = NULL;
+
+	/* XLOG stuff, buffer for updateds */
+	if (nupdatable > 0 && RelationNeedsWAL(rel))
+	{
+		Size		offset = 0;
+
+		for (int i = 0; i < nupdatable; i++)
+			updated_sz += MAXALIGN(IndexTupleSize(updated[i]));
+
+		updated_buf = palloc(updated_sz);
+		for (int i = 0; i < nupdatable; i++)
+		{
+			itemsz = IndexTupleSize(updated[i]);
+			memcpy(updated_buf + offset, (char *) updated[i], itemsz);
+			offset += MAXALIGN(itemsz);
+		}
+		Assert(offset == updated_sz);
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
+	/* Handle posting tuples here */
+	for (int i = 0; i < nupdatable; i++)
+	{
+		/* At first, delete the old tuple. */
+		PageIndexTupleDelete(page, updateitemnos[i]);
+
+		itemsz = IndexTupleSize(updated[i]);
+		itemsz = MAXALIGN(itemsz);
+
+		/* Add tuple with updated ItemPointers to the page. */
+		if (PageAddItem(page, (Item) updated[i], itemsz, updateitemnos[i],
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to rewrite posting list item in index while doing vacuum");
+	}
+
 	/* Fix the page */
 	if (nitems > 0)
 		PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1120,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+		xlrec_vacuum.nupdated = nupdatable;
+		xlrec_vacuum.ndeleted = nitems;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1135,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		if (nitems > 0)
 			XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
 
+		/*
+		 * Here we should save offnums and updated tuples themselves. It's
+		 * important to restore them in correct order. At first, we must
+		 * handle updated tuples and only after that other deleted items.
+		 */
+		if (nupdatable > 0)
+		{
+			Assert(updated_buf != NULL);
+			XLogRegisterBufData(0, (char *) updateitemnos,
+								nupdatable * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, updated_buf, updated_sz);
+		}
+
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
 		PageSetLSN(page, recptr);
@@ -1042,6 +1157,91 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 }
 
 /*
+ * Get the latestRemovedXid from the table entries pointed at by the index
+ * tuples being deleted.
+ *
+ * This is a version of index_compute_xid_horizon_for_tuples() specialized to
+ * nbtree, which can handle posting lists.
+ */
+static TransactionId
+_bt_compute_xid_horizon_for_tuples(Relation rel, Relation heapRel,
+								   Buffer buf, OffsetNumber *itemnos,
+								   int nitems)
+{
+	ItemPointer htids;
+	TransactionId latestRemovedXid = InvalidTransactionId;
+	Page		page = BufferGetPage(buf);
+	int			arraynitems;
+	int			finalnitems;
+
+	/*
+	 * Initial size of array can fit everything when it turns out that are no
+	 * posting lists
+	 */
+	arraynitems = nitems;
+	htids = (ItemPointer) palloc(sizeof(ItemPointerData) * arraynitems);
+
+	finalnitems = 0;
+	/* identify what the index tuples about to be deleted point to */
+	for (int i = 0; i < nitems; i++)
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+
+		itemid = PageGetItemId(page, itemnos[i]);
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(ItemIdIsDead(itemid));
+
+		if (!BTreeTupleIsPosting(itup))
+		{
+			/* Make sure that we have space for additional heap TID */
+			if (finalnitems + 1 > arraynitems)
+			{
+				arraynitems = arraynitems * 2;
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			Assert(ItemPointerIsValid(&itup->t_tid));
+			ItemPointerCopy(&itup->t_tid, &htids[finalnitems]);
+			finalnitems++;
+		}
+		else
+		{
+			int			nposting = BTreeTupleGetNPosting(itup);
+
+			/* Make sure that we have space for additional heap TIDs */
+			if (finalnitems + nposting > arraynitems)
+			{
+				arraynitems = Max(arraynitems * 2, finalnitems + nposting);
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			for (int j = 0; j < nposting; j++)
+			{
+				ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+				Assert(ItemPointerIsValid(htid));
+				ItemPointerCopy(htid, &htids[finalnitems]);
+				finalnitems++;
+			}
+		}
+	}
+
+	Assert(finalnitems >= nitems);
+
+	/* determine the actual xid horizon */
+	latestRemovedXid =
+		table_compute_xid_horizon_for_tuples(heapRel, htids, finalnitems);
+
+	pfree(htids);
+
+	return latestRemovedXid;
+}
+
+/*
  * Delete item(s) from a btree page during single-page cleanup.
  *
  * As above, must only be used on leaf pages.
@@ -1067,8 +1267,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 
 	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
 		latestRemovedXid =
-			index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
-												 itemnos, nitems);
+			_bt_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+											   itemnos, nitems);
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
@@ -2066,6 +2266,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 			xlmeta.fastlevel = metad->btm_fastlevel;
 			xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 			xlmeta.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+			xlmeta.btm_dedup_is_possible = metad->btm_dedup_is_possible;
 
 			XLogRegisterBufData(4, (char *) &xlmeta, sizeof(xl_btree_metadata));
 			xlinfo = XLOG_BTREE_UNLINK_PAGE_META;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd528..0d89961 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static ItemPointer btreevacuumposting(BTVacState *vstate, IndexTuple itup,
+									  int *nremaining);
 
 
 /*
@@ -157,10 +159,11 @@ void
 btbuildempty(Relation index)
 {
 	Page		metapage;
+	bool dedup_is_possible = _bt_dedup_is_possible(index);
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, P_NONE, 0);
+	_bt_initmetapage(metapage, P_NONE, 0, dedup_is_possible);
 
 	/*
 	 * Write the page and log it.  It might seem that an immediate sync would
@@ -263,8 +266,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 				 */
 				if (so->killedItems == NULL)
 					so->killedItems = (int *)
-						palloc(MaxIndexTuplesPerPage * sizeof(int));
-				if (so->numKilled < MaxIndexTuplesPerPage)
+						palloc(MaxPostingIndexTuplesPerPage * sizeof(int));
+				if (so->numKilled < MaxPostingIndexTuplesPerPage)
 					so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 			}
 
@@ -816,7 +819,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 	}
 	else
 	{
-		StdRdOptions *relopts;
+		BtreeOptions *relopts;
 		float8		cleanup_scale_factor;
 		float8		prev_num_heap_tuples;
 
@@ -827,7 +830,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 		 * tuples exceeds vacuum_cleanup_index_scale_factor fraction of
 		 * original tuples count.
 		 */
-		relopts = (StdRdOptions *) info->index->rd_options;
+		relopts = (BtreeOptions *) info->index->rd_options;
 		cleanup_scale_factor = (relopts &&
 								relopts->vacuum_cleanup_index_scale_factor >= 0)
 			? relopts->vacuum_cleanup_index_scale_factor
@@ -1069,7 +1072,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 								 RBM_NORMAL, info->strategy);
 		LockBufferForCleanup(buf);
 		_bt_checkpage(rel, buf);
-		_bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+		_bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+							vstate.lastBlockVacuumed);
 		_bt_relbuf(rel, buf);
 	}
 
@@ -1188,8 +1192,17 @@ restart:
 	}
 	else if (P_ISLEAF(opaque))
 	{
+		/* Deletable item state */
 		OffsetNumber deletable[MaxOffsetNumber];
 		int			ndeletable;
+		int			nhtidsdead;
+		int			nhtidslive;
+
+		/* Updatable item state (for posting lists) */
+		IndexTuple	updated[MaxOffsetNumber];
+		OffsetNumber updatable[MaxOffsetNumber];
+		int			nupdatable;
+
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
@@ -1229,6 +1242,10 @@ restart:
 		 * callback function.
 		 */
 		ndeletable = 0;
+		nupdatable = 0;
+		/* Maintain stats counters for index tuple versions/heap TIDs */
+		nhtidsdead = 0;
+		nhtidslive = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
 		if (callback)
@@ -1238,11 +1255,9 @@ restart:
 				 offnum = OffsetNumberNext(offnum))
 			{
 				IndexTuple	itup;
-				ItemPointer htup;
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
 				/*
 				 * During Hot Standby we currently assume that
@@ -1265,8 +1280,71 @@ restart:
 				 * applies to *any* type of index that marks index tuples as
 				 * killed.
 				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Regular tuple, standard heap TID representation */
+					ItemPointer htid = &(itup->t_tid);
+
+					if (callback(htid, callback_state))
+					{
+						deletable[ndeletable++] = offnum;
+						nhtidsdead++;
+					}
+					else
+						nhtidslive++;
+				}
+				else
+				{
+					ItemPointer newhtids;
+					int			nremaining;
+
+					/*
+					 * Posting list tuple, a physical tuple that represents
+					 * two or more logical tuples, any of which could be an
+					 * index row version that must be removed
+					 */
+					newhtids = btreevacuumposting(vstate, itup, &nremaining);
+					if (newhtids == NULL)
+					{
+						/*
+						 * All TIDs/logical tuples from the posting tuple
+						 * remain, so no update or delete required
+						 */
+						Assert(nremaining == BTreeTupleGetNPosting(itup));
+					}
+					else if (nremaining > 0)
+					{
+						IndexTuple	updatedtuple;
+
+						/*
+						 * Form new tuple that contains only remaining TIDs.
+						 * Remember this tuple and the offset of the old tuple
+						 * for when we update it in place
+						 */
+						Assert(nremaining < BTreeTupleGetNPosting(itup));
+						updatedtuple = BTreeFormPostingTuple(itup, newhtids,
+															 nremaining);
+						updated[nupdatable] = updatedtuple;
+						updatable[nupdatable++] = offnum;
+						nhtidsdead += BTreeTupleGetNPosting(itup) - nremaining;
+						pfree(newhtids);
+					}
+					else
+					{
+						/*
+						 * All TIDs/logical tuples from the posting list must
+						 * be deleted.  We'll delete the physical tuple
+						 * completely.
+						 */
+						deletable[ndeletable++] = offnum;
+						nhtidsdead += BTreeTupleGetNPosting(itup);
+
+						/* Free empty array of live items */
+						pfree(newhtids);
+					}
+
+					nhtidslive += nremaining;
+				}
 			}
 		}
 
@@ -1274,7 +1352,7 @@ restart:
 		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
 		 * call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nupdatable > 0)
 		{
 			/*
 			 * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1290,7 +1368,8 @@ restart:
 			 * doesn't seem worth the amount of bookkeeping it'd take to avoid
 			 * that.
 			 */
-			_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+			_bt_delitems_vacuum(rel, buf, deletable, ndeletable, updatable,
+								updated, nupdatable,
 								vstate->lastBlockVacuumed);
 
 			/*
@@ -1300,7 +1379,7 @@ restart:
 			if (blkno > vstate->lastBlockVacuumed)
 				vstate->lastBlockVacuumed = blkno;
 
-			stats->tuples_removed += ndeletable;
+			stats->tuples_removed += nhtidsdead;
 			/* must recompute maxoff */
 			maxoff = PageGetMaxOffsetNumber(page);
 		}
@@ -1315,6 +1394,7 @@ restart:
 			 * We treat this like a hint-bit update because there's no need to
 			 * WAL-log it.
 			 */
+			Assert(nhtidsdead == 0);
 			if (vstate->cycleid != 0 &&
 				opaque->btpo_cycleid == vstate->cycleid)
 			{
@@ -1324,15 +1404,16 @@ restart:
 		}
 
 		/*
-		 * If it's now empty, try to delete; else count the live tuples. We
-		 * don't delete when recursing, though, to avoid putting entries into
+		 * If it's now empty, try to delete; else count the live tuples (live
+		 * heap TIDs in posting lists are counted as live tuples).  We don't
+		 * delete when recursing, though, to avoid putting entries into
 		 * freePages out-of-order (doesn't seem worth any extra code to handle
 		 * the case).
 		 */
 		if (minoff > maxoff)
 			delete_now = (blkno == orig_blkno);
 		else
-			stats->num_index_tuples += maxoff - minoff + 1;
+			stats->num_index_tuples += nhtidslive;
 	}
 
 	if (delete_now)
@@ -1376,6 +1457,68 @@ restart:
 }
 
 /*
+ * btreevacuumposting() -- determines which logical tuples must remain when
+ * VACUUMing a posting list tuple.
+ *
+ * Returns new palloc'd array of item pointers needed to build replacement
+ * posting list without the index row versions that are to be deleted.
+ *
+ * Note that returned array is NULL in the common case where there is nothing
+ * to delete in caller's posting list tuple.  The number of TIDs that should
+ * remain in the posting list tuple is set for caller in *nremaining.  This is
+ * also the size of the returned array (though only when array isn't just
+ * NULL).
+ */
+static ItemPointer
+btreevacuumposting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+	int			live = 0;
+	int			nitem = BTreeTupleGetNPosting(itup);
+	ItemPointer tmpitems = NULL,
+				items = BTreeTupleGetPosting(itup);
+
+	Assert(BTreeTupleIsPosting(itup));
+
+	/*
+	 * Check each tuple in the posting list.  Save live tuples into tmpitems,
+	 * though try to avoid memory allocation as an optimization.
+	 */
+	for (int i = 0; i < nitem; i++)
+	{
+		if (!vstate->callback(items + i, vstate->callback_state))
+		{
+			/*
+			 * Live heap TID.
+			 *
+			 * Only save live TID when we know that we're going to have to
+			 * kill at least one TID, and have already allocated memory.
+			 */
+			if (tmpitems)
+				tmpitems[live] = items[i];
+			live++;
+		}
+
+		/* Dead heap TID */
+		else if (tmpitems == NULL)
+		{
+			/*
+			 * Turns out we need to delete one or more dead heap TIDs, so
+			 * start maintaining an array of live TIDs for caller to
+			 * reconstruct smaller replacement posting list tuple
+			 */
+			tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+			/* Copy live heap TIDs from previous loop iterations */
+			if (live > 0)
+				memcpy(tmpitems, items, sizeof(ItemPointerData) * live);
+		}
+	}
+
+	*nremaining = live;
+	return tmpitems;
+}
+
+/*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
  * btrees always do, so this is trivial.
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8e51246..9022ee6 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int	_bt_binsrch_posting(BTScanInsert key, Page page,
+								OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static void _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+								  OffsetNumber offnum, ItemPointer heapTid,
+								  IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+									   OffsetNumber offnum,
+									   ItemPointer heapTid);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -434,7 +442,10 @@ _bt_binsrch(Relation rel,
  * low) makes bounds invalid.
  *
  * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
  */
 OffsetNumber
 _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +464,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 	Assert(P_ISLEAF(opaque));
 	Assert(!key->nextkey);
+	Assert(insertstate->postingoff == 0);
 
 	if (!insertstate->bounds_valid)
 	{
@@ -509,6 +521,16 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 			if (result != 0)
 				stricthigh = high;
 		}
+
+		/*
+		 * If tuple at offset located by binary search is a posting list whose
+		 * TID range overlaps with caller's scantid, perform posting list
+		 * binary search to set postingoff for caller.  Caller must split the
+		 * posting list when postingoff is set.  This should happen
+		 * infrequently.
+		 */
+		if (unlikely(result == 0 && key->scantid != NULL))
+			insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
 	}
 
 	/*
@@ -529,6 +551,68 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 }
 
 /*----------
+ *	_bt_binsrch_posting() -- posting list binary search.
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+	IndexTuple	itup;
+	ItemId		itemid;
+	int			low,
+				high,
+				mid,
+				res;
+
+	/*
+	 * If this isn't a posting tuple, then the index must be corrupt (if it is
+	 * an ordinary non-pivot tuple then there must be an existing tuple with a
+	 * heap TID that equals inserter's new heap TID/scantid).  Defensively
+	 * check that tuple is a posting list tuple whose posting list range
+	 * includes caller's scantid.
+	 *
+	 * (This is also needed because contrib/amcheck's rootdescend option needs
+	 * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+	 */
+	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+	Assert(!key->nextkey);
+	Assert(key->scantid != NULL);
+	itemid = PageGetItemId(page, offnum);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+	if (!BTreeTupleIsPosting(itup))
+		return 0;
+
+	/*
+	 * In the unlikely event that posting list tuple has LP_DEAD bit set,
+	 * signal to caller that it should kill the item and restart its binary
+	 * search.
+	 */
+	if (ItemIdIsDead(itemid))
+		return -1;
+
+	/* "high" is past end of posting list for loop invariant */
+	low = 0;
+	high = BTreeTupleGetNPosting(itup);
+	Assert(high >= 2);
+
+	while (high > low)
+	{
+		mid = low + ((high - low) / 2);
+		res = ItemPointerCompare(key->scantid,
+								 BTreeTupleGetPostingN(itup, mid));
+
+		if (res >= 1)
+			low = mid + 1;
+		else
+			high = mid;
+	}
+
+	return low;
+}
+
+/*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
  *	page/offnum: location of btree item to be compared to.
@@ -537,9 +621,18 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
  *			<0 if scankey < tuple at offnum;
  *			 0 if scankey == tuple at offnum;
  *			>0 if scankey > tuple at offnum.
- *		NULLs in the keys are treated as sortable values.  Therefore
- *		"equality" does not necessarily mean that the item should be
- *		returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values.  Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key.  Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid.  There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
+ *
+ * It is generally guaranteed that any possible scankey with scantid set
+ * will have zero or one tuples in the index that are considered equal
+ * here.
  *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
@@ -563,6 +656,7 @@ _bt_compare(Relation rel,
 	ScanKey		scankey;
 	int			ncmpkey;
 	int			ntupatts;
+	int32		result;
 
 	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +691,6 @@ _bt_compare(Relation rel,
 	{
 		Datum		datum;
 		bool		isNull;
-		int32		result;
 
 		datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
 
@@ -713,8 +806,24 @@ _bt_compare(Relation rel,
 	if (heapTid == NULL)
 		return 1;
 
+	/*
+	 * scankey must be treated as equal to a posting list tuple if its scantid
+	 * value falls within the range of the posting list.  In all other cases
+	 * there can only be a single heap TID value, which is compared directly
+	 * as a simple scalar value.
+	 */
 	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	return ItemPointerCompare(key->scantid, heapTid);
+	result = ItemPointerCompare(key->scantid, heapTid);
+	if (!BTreeTupleIsPosting(itup) || result <= 0)
+		return result;
+	else
+	{
+		result = ItemPointerCompare(key->scantid, BTreeTupleGetMaxTID(itup));
+		if (result > 0)
+			return 1;
+	}
+
+	return 0;
 }
 
 /*
@@ -1451,6 +1560,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	/* initialize tuple workspace to empty */
 	so->currPos.nextTupleOffset = 0;
+	so->currPos.postingTupleOffset = 0;
 
 	/*
 	 * Now that the current page has been made consistent, the macro should be
@@ -1485,8 +1595,29 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				if (!BTreeTupleIsPosting(itup))
+				{
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+				else
+				{
+					/*
+					 * Setup state to return posting list, and save first
+					 * "logical" tuple
+					 */
+					_bt_setuppostingitems(so, itemIndex, offnum,
+										  BTreeTupleGetPostingN(itup, 0),
+										  itup);
+					itemIndex++;
+					/* Save additional posting list "logical" tuples */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i));
+						itemIndex++;
+					}
+				}
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1519,7 +1650,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1527,7 +1658,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxPostingIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1569,8 +1700,36 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (passes_quals && tuple_alive)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				if (!BTreeTupleIsPosting(itup))
+				{
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+				else
+				{
+					int			i = BTreeTupleGetNPosting(itup) - 1;
+
+					/*
+					 * Setup state to return posting list, and save last
+					 * "logical" tuple from posting list (since it's the first
+					 * that will be returned to scan).
+					 */
+					itemIndex--;
+					_bt_setuppostingitems(so, itemIndex, offnum,
+										  BTreeTupleGetPostingN(itup, i--),
+										  itup);
+
+					/*
+					 * Return posting list "logical" tuples -- do this in
+					 * descending order, to match overall scan order
+					 */
+					for (; i >= 0; i--)
+					{
+						itemIndex--;
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i));
+					}
+				}
 			}
 			if (!continuescan)
 			{
@@ -1584,8 +1743,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1757,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1611,6 +1772,59 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 }
 
 /*
+ * Setup state to save posting items from a single posting list tuple.  Saves
+ * the logical tuple that will be returned to scan first in passing.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for logical tuple
+ * that is returned to scan first.  Second or subsequent heap TID for posting
+ * list should be saved by calling _bt_savepostingitem().
+ */
+static void
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					  ItemPointer heapTid, IndexTuple itup)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		/* Save a base version of the IndexTuple */
+		Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+		itupsz = MAXALIGN(itupsz);
+		currItem->tupleOffset = so->currPos.nextTupleOffset;
+		memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+		so->currPos.nextTupleOffset += itupsz;
+		so->currPos.postingTupleOffset = currItem->tupleOffset;
+	}
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer heapTid)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+
+	/*
+	 * Have index-only scans return the same base IndexTuple for every logical
+	 * tuple that originates from the same posting list
+	 */
+	if (so->currTuples)
+		currItem->tupleOffset = so->currPos.postingTupleOffset;
+}
+
+/*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
  * On entry, if so->currPos.buf is valid the buffer is pinned but not locked;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index ab19692..cff252b 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -287,6 +287,9 @@ static void _bt_sortaddtup(Page page, Size itemsize,
 						   IndexTuple itup, OffsetNumber itup_off);
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 						 IndexTuple itup);
+static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
+										  BTPageState *state,
+										  BTDedupState *dstate);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
@@ -725,7 +728,7 @@ _bt_pagestate(BTWriteState *wstate, uint32 level)
 	if (level > 0)
 		state->btps_full = (BLCKSZ * (100 - BTREE_NONLEAF_FILLFACTOR) / 100);
 	else
-		state->btps_full = RelationGetTargetPageFreeSpace(wstate->index,
+		state->btps_full = BtreeGetTargetPageFreeSpace(wstate->index,
 														  BTREE_DEFAULT_FILLFACTOR);
 	/* no parent level, yet */
 	state->btps_next = NULL;
@@ -799,7 +802,8 @@ _bt_sortaddtup(Page page,
 }
 
 /*----------
- * Add an item to a disk page from the sort output.
+ * Add an item to a disk page from the sort output (or add a posting list
+ * item formed from the sort output).
  *
  * We must be careful to observe the page layout conventions of nbtsearch.c:
  * - rightmost pages start data items at P_HIKEY instead of at P_FIRSTKEY.
@@ -1002,6 +1006,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		 * the minimum key for the new page.
 		 */
 		state->btps_minkey = CopyIndexTuple(oitup);
+		Assert(BTreeTupleIsPivot(state->btps_minkey));
 
 		/*
 		 * Set the sibling links for both pages.
@@ -1043,6 +1048,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(state->btps_minkey == NULL);
 		state->btps_minkey = CopyIndexTuple(itup);
 		/* _bt_sortaddtup() will perform full truncation later */
+		BTreeTupleClearBtIsPosting(state->btps_minkey);
 		BTreeTupleSetNAtts(state->btps_minkey, 0);
 	}
 
@@ -1058,6 +1064,42 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 }
 
 /*
+ * Finalize pending posting list tuple, and add it to the index.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * This is almost like nbtinsert.c's _bt_dedup_finish_pending(), but it adds a
+ * new tuple using _bt_buildadd() and does not maintain the intervals array.
+ */
+static void
+_bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
+							  BTDedupState *dstate)
+{
+	IndexTuple	final;
+
+	Assert(dstate->nitems > 0);
+	if (dstate->nitems == 1)
+		final = dstate->base;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(dstate->base,
+											 dstate->htids,
+											 dstate->nhtids);
+		final = postingtuple;
+	}
+
+	_bt_buildadd(wstate, state, final);
+
+	if (dstate->nitems > 1)
+		pfree(final);
+	/* Don't maintain  dedup_intervals array, or alltupsize */
+	dstate->nhtids = 0;
+	dstate->nitems = 0;
+}
+
+/*
  * Finish writing out the completed btree.
  */
 static void
@@ -1123,7 +1165,8 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 	 * by filling in a valid magic number in the metapage.
 	 */
 	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, rootblkno, rootlevel);
+
+	_bt_initmetapage(metapage, rootblkno, rootlevel, wstate->inskey->dedup_is_possible);
 	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
 }
 
@@ -1144,6 +1187,10 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		deduplicate;
+
+	deduplicate = wstate->inskey->dedup_is_possible
+				  && BtreeGetDoDedupOption(wstate->index);
 
 	if (merge)
 	{
@@ -1152,6 +1199,13 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		 * btspool and btspool2.
 		 */
 
+		/*
+		 * Unique indexes may support deduplication, but this case
+		 * it seems unworthy.
+		 * TODO Probably we can just delete the assertion.
+		 */
+		deduplicate = false;
+		Assert(!deduplicate);
 		/* the preparation of merge */
 		itup = tuplesort_getindextuple(btspool->sortstate, true);
 		itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
@@ -1255,9 +1309,94 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		}
 		pfree(sortKeys);
 	}
+	else if (deduplicate)
+	{
+		/* merge is unnecessary, deduplicate into posting lists */
+		BTDedupState *dstate;
+		IndexTuple	newbase;
+
+		dstate = (BTDedupState *) palloc(sizeof(BTDedupState));
+		dstate->deduplicate = true; /* unused */
+		dstate->maxitemsize = 0;	/* set later */
+		/* Metadata about current pending posting list */
+		dstate->htids = NULL;
+		dstate->nhtids = 0;
+		dstate->nitems = 0;
+		dstate->alltupsize = 0; /* unused */
+		/* Metadata about based tuple of current pending posting list */
+		dstate->base = NULL;
+		dstate->baseoff = InvalidOffsetNumber;	/* unused */
+		dstate->basetupsize = 0;
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate,
+											   true)) != NULL)
+		{
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+			{
+				state = _bt_pagestate(wstate, 0);
+				dstate->maxitemsize = BTMaxItemSize(state->btps_page);
+				/* Conservatively size array */
+				dstate->htids = palloc(dstate->maxitemsize);
+
+				/*
+				 * No previous/base tuple, since itup is the  first item
+				 * returned by the tuplesort -- use itup as base tuple of
+				 * first pending posting list for entire index build
+				 */
+				newbase = CopyIndexTuple(itup);
+				_bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+			}
+			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
+										 itup) > keysz &&
+					 _bt_dedup_save_htid(dstate, itup))
+			{
+				/*
+				 * Tuple is equal to base tuple of pending posting list, and
+				 * merging itup into pending posting list won't exceed the
+				 * BTMaxItemSize() limit.  Heap TID(s) for itup have been
+				 * saved in state.  The next iteration will also end up here
+				 * if it's possible to merge the next tuple into the same
+				 * pending posting list.
+				 */
+			}
+			else
+			{
+				/*
+				 * Tuple is not equal to pending posting list tuple, or
+				 * BTMaxItemSize() limit was reached
+				 */
+				_bt_sort_dedup_finish_pending(wstate, state, dstate);
+				/* Base tuple is always a copy */
+				pfree(dstate->base);
+
+				/* itup starts new pending posting list */
+				newbase = CopyIndexTuple(itup);
+				_bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+			}
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+
+		/*
+		 * Handle the last item (there must be a last item when the tuplesort
+		 * returned one or more tuples)
+		 */
+		if (state)
+		{
+			_bt_sort_dedup_finish_pending(wstate, state, dstate);
+			/* Base tuple is always a copy */
+			pfree(dstate->base);
+			pfree(dstate->htids);
+		}
+
+		pfree(dstate);
+	}
 	else
 	{
-		/* merge is unnecessary */
+		/* merging and deduplication are both unnecessary */
 		while ((itup = tuplesort_getindextuple(btspool->sortstate,
 											   true)) != NULL)
 		{
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1c1029b..df976d4 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -167,7 +167,7 @@ _bt_findsplitloc(Relation rel,
 
 	/* Count up total space in data items before actually scanning 'em */
 	olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page);
-	leaffillfactor = RelationGetFillFactor(rel, BTREE_DEFAULT_FILLFACTOR);
+	leaffillfactor = BtreeGetFillFactor(rel, BTREE_DEFAULT_FILLFACTOR);
 
 	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
 	newitemsz += sizeof(ItemIdData);
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
 	state.minfirstrightsz = SIZE_MAX;
 	state.newitemoff = newitemoff;
 
+	/* newitem cannot be a posting list item */
+	Assert(!BTreeTupleIsPosting(newitem));
+
 	/*
 	 * maxsplits should never exceed maxoff because there will be at most as
 	 * many candidate split points as there are points _between_ tuples, once
@@ -459,17 +462,52 @@ _bt_recsplitloc(FindSplitData *state,
 	int16		leftfree,
 				rightfree;
 	Size		firstrightitemsz;
+	Size		postingsubhikey = 0;
 	bool		newitemisfirstonright;
 
 	/* Is the new item going to be the first item on the right page? */
 	newitemisfirstonright = (firstoldonright == state->newitemoff
 							 && !newitemonleft);
 
+	/*
+	 * FIXME: Accessing every single tuple like this adds cycles to cases that
+	 * cannot possibly benefit (i.e. cases where we know that there cannot be
+	 * posting lists).  Maybe we should add a way to not bother when we are
+	 * certain that this is the case.
+	 *
+	 * We could either have _bt_split() pass us a flag, or invent a page flag
+	 * that indicates that the page might have posting lists, as an
+	 * optimization.  There is no shortage of btpo_flags bits for stuff like
+	 * this.
+	 */
 	if (newitemisfirstonright)
+	{
 		firstrightitemsz = state->newitemsz;
+
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+			postingsubhikey = IndexTupleSize(state->newitem) -
+				BTreeTupleGetPostingOffset(state->newitem);
+	}
 	else
+	{
 		firstrightitemsz = firstoldonrightsz;
 
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf)
+		{
+			ItemId		itemid;
+			IndexTuple	newhighkey;
+
+			itemid = PageGetItemId(state->page, firstoldonright);
+			newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+			if (BTreeTupleIsPosting(newhighkey))
+				postingsubhikey = IndexTupleSize(newhighkey) -
+					BTreeTupleGetPostingOffset(newhighkey);
+		}
+	}
+
 	/* Account for all the old tuples */
 	leftfree = state->leftspace - olddataitemstoleft;
 	rightfree = state->rightspace -
@@ -492,9 +530,13 @@ _bt_recsplitloc(FindSplitData *state,
 	 * adding a heap TID to the left half's new high key when splitting at the
 	 * leaf level.  In practice the new high key will often be smaller and
 	 * will rarely be larger, but conservatively assume the worst case.
+	 * Truncation always truncates away any posting list that appears in the
+	 * first right tuple, though, so it's safe to subtract that overhead
+	 * (while still conservatively assuming that truncation might have to add
+	 * back a single heap TID using the pivot tuple heap TID representation).
 	 */
 	if (state->is_leaf)
-		leftfree -= (int16) (firstrightitemsz +
+		leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
 							 MAXALIGN(sizeof(ItemPointerData)));
 	else
 		leftfree -= (int16) firstrightitemsz;
@@ -691,7 +733,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
 	tup = (IndexTuple) PageGetItem(state->page, itemid);
 	/* Do cheaper test first */
-	if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+	if (BTreeTupleIsPosting(tup) ||
+		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
 	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index bc855dd..e6a64f8 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -97,8 +97,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	indoption = rel->rd_indoption;
 	tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
 
-	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
 	/*
 	 * We'll execute search using scan key constructed on key columns.
 	 * Truncated attributes and non-key attributes are omitted from the final
@@ -110,9 +108,23 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	key->anynullkeys = false;	/* initial assumption */
 	key->nextkey = false;
 	key->pivotsearch = false;
+	key->scantid = NULL;
 	key->keysz = Min(indnkeyatts, tupnatts);
-	key->scantid = key->heapkeyspace && itup ?
-		BTreeTupleGetHeapTID(itup) : NULL;
+	/* get information from relation info or from btree metapage */
+	key->dedup_is_possible = (itup == NULL) ? _bt_dedup_is_possible(rel) :
+											  _bt_getdedupispossible(rel);
+
+	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+	Assert(!itup || !BTreeTupleIsPosting(itup) || key->heapkeyspace);
+
+	/*
+	 * When caller passes a tuple with a heap TID, use it to set scantid. Note
+	 * that this handles posting list tuples by setting scantid to the lowest
+	 * heap TID in the posting list.
+	 */
+	if (itup && key->heapkeyspace)
+		key->scantid = BTreeTupleGetHeapTID(itup);
+
 	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
@@ -1386,6 +1398,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			continue;
 		}
 
@@ -1547,6 +1560,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			cmpresult = 0;
 			if (subkey->sk_flags & SK_ROW_END)
 				break;
@@ -1786,10 +1800,35 @@ _bt_killitems(IndexScanDesc scan)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
+			bool		killtuple = false;
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			if (BTreeTupleIsPosting(ituple))
 			{
-				/* found the item */
+				int			pi = i + 1;
+				int			nposting = BTreeTupleGetNPosting(ituple);
+				int			j;
+
+				for (j = 0; j < nposting; j++)
+				{
+					ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+					if (!ItemPointerEquals(item, &kitem->heapTid))
+						break;	/* out of posting list loop */
+
+					/* Read-ahead to later kitems */
+					if (pi < numKilled)
+						kitem = &so->currPos.items[so->killedItems[pi++]];
+				}
+
+				if (j == nposting)
+					killtuple = true;
+			}
+			else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+				killtuple = true;
+
+			if (killtuple)
+			{
+				/* found the item/all posting list items */
 				ItemIdMarkDead(iid);
 				killedsomething = true;
 				break;			/* out of inner search loop */
@@ -2027,7 +2066,30 @@ BTreeShmemInit(void)
 bytea *
 btoptions(Datum reloptions, bool validate)
 {
-	return default_reloptions(reloptions, validate, RELOPT_KIND_BTREE);
+	relopt_value *options;
+	BtreeOptions *rdopts;
+	int			numoptions;
+	static const relopt_parse_elt tab[] = {
+		{"fillfactor", RELOPT_TYPE_INT, offsetof(BtreeOptions, fillfactor)},
+		{"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
+					offsetof(BtreeOptions, vacuum_cleanup_index_scale_factor)},
+		{"deduplication", RELOPT_TYPE_BOOL, offsetof(BtreeOptions, do_deduplication)}
+	};
+
+	options = parseRelOptions(reloptions, validate, RELOPT_KIND_BTREE,
+							  &numoptions);
+
+	/* if none set, we're done */
+	if (numoptions == 0)
+		return NULL;
+
+	rdopts = allocateReloptStruct(sizeof(BtreeOptions), options, numoptions);
+
+	fillRelOptions((void *) rdopts, sizeof(BtreeOptions), options, numoptions,
+				   validate, tab, lengthof(tab));
+
+	pfree(options);
+	return (bytea *) rdopts;
 }
 
 /*
@@ -2140,6 +2202,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(firstright))
+		{
+			BTreeTupleClearBtIsPosting(pivot);
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			if (keepnatts == natts)
+			{
+				/*
+				 * index_truncate_tuple() just returned a copy of the
+				 * original, so make sure that the size of the new pivot tuple
+				 * doesn't have posting list overhead
+				 */
+				pivot->t_info &= ~INDEX_SIZE_MASK;
+				pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+			}
+		}
+
+		Assert(!BTreeTupleIsPosting(pivot));
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2156,6 +2236,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * attribute to the new pivot tuple.
 		 */
 		Assert(natts != nkeyatts);
+		Assert(!BTreeTupleIsPosting(lastleft) &&
+			   !BTreeTupleIsPosting(firstright));
 		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
 		tidpivot = palloc0(newsize);
 		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2163,6 +2245,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		pfree(pivot);
 		pivot = tidpivot;
 	}
+	else if (BTreeTupleIsPosting(firstright))
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal.  We
+		 * can always truncate away a posting list, though.
+		 *
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+			MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+		pivot->t_info &= ~INDEX_SIZE_MASK;
+		pivot->t_info |= newsize;
+		BTreeTupleClearBtIsPosting(pivot);
+		BTreeTupleSetAltHeapTID(pivot);
+	}
 	else
 	{
 		/*
@@ -2170,7 +2270,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * It's necessary to add a heap TID attribute to the new pivot tuple.
 		 */
 		Assert(natts == nkeyatts);
-		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+		newsize = MAXALIGN(IndexTupleSize(firstright)) +
+			MAXALIGN(sizeof(ItemPointerData));
 		pivot = palloc0(newsize);
 		memcpy(pivot, firstright, IndexTupleSize(firstright));
 	}
@@ -2188,6 +2289,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * nbtree (e.g., there is no pg_attribute entry).
 	 */
 	Assert(itup_key->heapkeyspace);
+	Assert(!BTreeTupleIsPosting(pivot));
 	pivot->t_info &= ~INDEX_SIZE_MASK;
 	pivot->t_info |= newsize;
 
@@ -2200,7 +2302,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2211,9 +2313,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+							  BTreeTupleGetHeapTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #else
 
 	/*
@@ -2226,7 +2331,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2235,7 +2340,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2316,15 +2422,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
  * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal (once detoasted).  Similarly, result may
- * differ from the _bt_keep_natts result when either tuple has TOASTed datums,
- * though this is barely possible in practice.
+ * unless they're bitwise equal after detoasting.
  *
  * These issues must be acceptable to callers, typically because they're only
  * concerned about making suffix truncation as effective as possible without
  * leaving excessive amounts of free space on either side of page split.
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts.
+ *
+ * When an index only uses opclasses where equality is "precise", this
+ * function is guaranteed to give the same result as _bt_keep_natts().  This
+ * makes it safe to use this function to determine whether or not two tuples
+ * can be folded together into a single posting tuple.  Posting list
+ * deduplication cannot be used with nondeterministic collations for this
+ * reason.
+ *
+ * FIXME: Actually invent the needed "equality-is-precise" opclass
+ * infrastructure.  See dedicated -hackers thread:
+ *
+ * https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2349,8 +2465,38 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 		if (isNull1 != isNull2)
 			break;
 
+		/*
+		 * XXX: The ideal outcome from the point of view of the posting list
+		 * patch is that the definition of an opclass with "precise equality"
+		 * becomes: "equality operator function must give exactly the same
+		 * answer as datum_image_eq() would, provided that we aren't using a
+		 * nondeterministic collation". (Nondeterministic collations are
+		 * clearly not compatible with deduplication.)
+		 *
+		 * This will be a lot faster than actually using the authoritative
+		 * insertion scankey in some cases.  This approach also seems more
+		 * elegant, since suffix truncation gets to follow exactly the same
+		 * definition of "equal" as posting list deduplication -- there is a
+		 * subtle interplay between deduplication and suffix truncation, and
+		 * it would be nice to know for sure that they have exactly the same
+		 * idea about what equality is.
+		 *
+		 * This ideal outcome still avoids problems with TOAST.  We cannot
+		 * repeat bugs like the amcheck bug that was fixed in bugfix commit
+		 * eba775345d23d2c999bbb412ae658b6dab36e3e8.  datum_image_eq()
+		 * considers binary equality, though only _after_ each datum is
+		 * decompressed.
+		 *
+		 * If this ideal solution isn't possible, then we can fall back on
+		 * defining "precise equality" as: "type's output function must
+		 * produce identical textual output for any two datums that compare
+		 * equal when using a safe/equality-is-precise operator class (unless
+		 * using a nondeterministic collation)".  That would mean that we'd
+		 * have to make deduplication call _bt_keep_natts() instead (or some
+		 * other function that uses authoritative insertion scankey).
+		 */
 		if (!isNull1 &&
-			!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+			!datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
 			break;
 
 		keepnatts++;
@@ -2402,22 +2548,30 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
+	/* !heapkeyspace indexes do not support deduplication */
+	if (!heapkeyspace && BTreeTupleIsPosting(itup))
+		return false;
+
+	/* INCLUDE indexes do not support deduplication */
+	if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+		return false;
+
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
 			/*
-			 * Non-pivot tuples currently never use alternative heap TID
-			 * representation -- even those within heapkeyspace indexes
+			 * Non-pivot tuple should never be explicitly marked as a pivot
+			 * tuple
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated.  (Note that tupnatts must have been
-			 * inferred, rather than coming from an explicit on-disk
-			 * representation.)
+			 * inferred, even with a posting list tuple, because only pivot
+			 * tuples store tupnatts directly.)
 			 */
 			return tupnatts == natts;
 		}
@@ -2461,12 +2615,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * non-zero, or when there is no explicit representation and the
 			 * tuple is evidently not a pre-pg_upgrade tuple.
 			 *
-			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
-			 * that to decide if the tuple is a pre-v11 tuple.
+			 * Prior to v11, downlinks always had P_HIKEY as their offset.
+			 * Accept that as an alternative indication of a valid
+			 * !heapkeyspace negative infinity tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
-				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+				ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
 		}
 		else
 		{
@@ -2492,7 +2646,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
+		return false;
+
+	/* Pivot tuple should not use posting list representation (redundant) */
+	if (BTreeTupleIsPosting(itup))
 		return false;
 
 	/*
@@ -2562,11 +2720,115 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					BTMaxItemSizeNoHeapTid(page),
 					RelationGetRelationName(rel)),
 			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
-					   ItemPointerGetBlockNumber(&newtup->t_tid),
-					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+					   ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
 					   RelationGetRelationName(heap)),
 			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
 					 "Consider a function index of an MD5 hash of the value, "
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Given a basic tuple that contains key datum and posting list, build a
+ * posting tuple.  Caller's "htids" array must be sorted in ascending order.
+ *
+ * Basic tuple can be a posting tuple, but we only use key part of it, all
+ * ItemPointers must be passed via htids.
+ *
+ * If nhtids == 1, just build a non-posting tuple.  It is necessary to avoid
+ * storage overhead after posting tuple was vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointer htids, int nhtids)
+{
+	uint32		keysize,
+				newsize = 0;
+	IndexTuple	itup;
+
+	/* We only need key part of the tuple */
+	if (BTreeTupleIsPosting(tuple))
+		keysize = BTreeTupleGetPostingOffset(tuple);
+	else
+		keysize = IndexTupleSize(tuple);
+
+	Assert(nhtids > 0);
+
+	/* Add space needed for posting list */
+	if (nhtids > 1)
+		newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nhtids;
+	else
+		newsize = keysize;
+
+	newsize = MAXALIGN(newsize);
+	itup = palloc0(newsize);
+	memcpy(itup, tuple, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+
+	if (nhtids > 1)
+	{
+		/* Form posting tuple, fill posting fields */
+
+		itup->t_info |= INDEX_ALT_TID_MASK;
+		BTreeSetPostingMeta(itup, nhtids, SHORTALIGN(keysize));
+		/* Copy posting list into the posting tuple */
+		memcpy(BTreeTupleGetPosting(itup), htids,
+			   sizeof(ItemPointerData) * nhtids);
+
+#ifdef USE_ASSERT_CHECKING
+		{
+			/* Assert that htid array is sorted and has unique TIDs */
+			ItemPointerData last;
+			ItemPointer current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+				current = BTreeTupleGetPostingN(itup, i);
+				Assert(ItemPointerCompare(current, &last) > 0);
+				ItemPointerCopy(current, &last);
+			}
+		}
+#endif
+	}
+	else
+	{
+		/* To finish building of a non-posting tuple, copy TID from htids */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		ItemPointerCopy(htids, &itup->t_tid);
+	}
+
+	return itup;
+}
+
+bool
+_bt_dedup_is_possible(Relation index)
+{
+	int dedup_is_possible = false;
+
+	if (IndexRelationGetNumberOfAttributes(index)
+		== IndexRelationGetNumberOfKeyAttributes(index))
+	{
+		int i;
+
+		dedup_is_possible = true;
+
+		for (i = 0; i < IndexRelationGetNumberOfKeyAttributes(index); i++)
+		{
+			Oid opfamily = index->rd_opfamily[i];
+			Oid collation = index->rd_indcollation[i];
+
+			// TODO add adequate check of opclasses and collations
+			elog(DEBUG4, "index %s column i %d opfamilyOid %u collationOid %u",
+				RelationGetRelationName(index), i, opfamily, collation);
+			if (opfamily == 1988) //NUMERIC BTREE OPFAMILY
+			{
+				return false;
+			}
+		}
+	}
+
+	return dedup_is_possible;
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c..3489cf2 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -21,8 +21,11 @@
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "storage/procarray.h"
+#include "utils/memutils.h"
 #include "miscadmin.h"
 
+static MemoryContext opCtx;		/* working memory for operations */
+
 /*
  * _bt_restore_page -- re-enter all the index tuples on a page
  *
@@ -111,6 +114,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
 	Assert(md->btm_version >= BTREE_NOVAC_VERSION);
 	md->btm_oldest_btpo_xact = xlrec->oldest_btpo_xact;
 	md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
+	md->btm_dedup_is_possible = xlrec->btm_dedup_is_possible;
 
 	pageop = (BTPageOpaque) PageGetSpecialPointer(metapg);
 	pageop->btpo_flags = BTP_META;
@@ -181,9 +185,46 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 
 		page = BufferGetPage(buffer);
 
-		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
-						false, false) == InvalidOffsetNumber)
-			elog(PANIC, "btree_xlog_insert: failed to add item");
+		if (xlrec->postingoff == InvalidOffsetNumber)
+		{
+			/* Simple retail insertion */
+			if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+		}
+		else
+		{
+			ItemId		itemid;
+			IndexTuple	oposting,
+						newitem,
+						nposting;
+
+			/*
+			 * A posting list split occurred during insertion.
+			 *
+			 * Use _bt_posting_split() to repeat posting list split steps from
+			 * primary.  Note that newitem from WAL record is 'orignewitem',
+			 * not the final version of newitem that is actually inserted on
+			 * page.
+			 */
+			Assert(isleaf);
+			itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+			oposting = (IndexTuple) PageGetItem(page, itemid);
+
+			/* newitem must be mutable copy for _bt_posting_split() */
+			newitem = CopyIndexTuple((IndexTuple) datapos);
+			nposting = _bt_posting_split(newitem, oposting,
+										 xlrec->postingoff);
+
+			/* Replace existing posting list with post-split version */
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+			/* insert new item */
+			Assert(IndexTupleSize(newitem) == datalen);
+			if (PageAddItem(page, (Item) newitem, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add posting split new item");
+		}
 
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
@@ -265,20 +306,42 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 		BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
 		OffsetNumber off;
 		IndexTuple	newitem = NULL,
-					left_hikey = NULL;
+					left_hikey = NULL,
+					nposting = NULL;
 		Size		newitemsz = 0,
 					left_hikeysz = 0;
 		Page		newlpage;
-		OffsetNumber leftoff;
+		OffsetNumber leftoff,
+					replacepostingoff = InvalidOffsetNumber;
 
 		datapos = XLogRecGetBlockData(record, 0, &datalen);
 
-		if (onleft)
+		if (onleft || xlrec->postingoff != 0)
 		{
 			newitem = (IndexTuple) datapos;
 			newitemsz = MAXALIGN(IndexTupleSize(newitem));
 			datapos += newitemsz;
 			datalen -= newitemsz;
+
+			if (xlrec->postingoff != 0)
+			{
+				/*
+				 * Use _bt_posting_split() to repeat posting list split steps
+				 * from primary
+				 */
+				ItemId		itemid;
+				IndexTuple	oposting;
+
+				/* Posting list must be at offset number before new item's */
+				replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+				/* newitem must be mutable copy for _bt_posting_split() */
+				newitem = CopyIndexTuple(newitem);
+				itemid = PageGetItemId(lpage, replacepostingoff);
+				oposting = (IndexTuple) PageGetItem(lpage, itemid);
+				nposting = _bt_posting_split(newitem, oposting,
+											 xlrec->postingoff);
+			}
 		}
 
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
@@ -304,8 +367,20 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 			Size		itemsz;
 			IndexTuple	item;
 
+			/* Add replacement posting list when required */
+			if (off == replacepostingoff)
+			{
+				Assert(onleft || xlrec->firstright == xlrec->newitemoff);
+				if (PageAddItem(newlpage, (Item) nposting,
+								MAXALIGN(IndexTupleSize(nposting)), leftoff,
+								false, false) == InvalidOffsetNumber)
+					elog(ERROR, "failed to add new posting list item to left page after split");
+				leftoff = OffsetNumberNext(leftoff);
+				continue;
+			}
+
 			/* add the new item if it was inserted on left page */
-			if (onleft && off == xlrec->newitemoff)
+			else if (onleft && off == xlrec->newitemoff)
 			{
 				if (PageAddItem(newlpage, (Item) newitem, newitemsz, leftoff,
 								false, false) == InvalidOffsetNumber)
@@ -380,14 +455,89 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 }
 
 static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		buf;
+	xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+
+	if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Initialize a temporary empty page and copy all the items to that in
+		 * item number order.
+		 */
+		Page		page = (Page) BufferGetPage(buf);
+		OffsetNumber offnum;
+		BTDedupState *state;
+
+		state = (BTDedupState *) palloc(sizeof(BTDedupState));
+
+		state->deduplicate = true;	/* unused */
+		state->maxitemsize = BTMaxItemSize(page);
+		/* Metadata about current pending posting list */
+		state->htids = NULL;
+		state->nhtids = 0;
+		state->nitems = 0;
+		state->alltupsize = 0;
+		/* Metadata about based tuple of current pending posting list */
+		state->base = NULL;
+		state->baseoff = InvalidOffsetNumber;
+		state->basetupsize = 0;
+
+		/* Conservatively size array */
+		state->htids = palloc(state->maxitemsize);
+
+		/*
+		 * Iterate over tuples on the page belonging to the interval
+		 * to deduplicate them into a posting list.
+		 */
+		for (offnum = xlrec->baseoff;
+			 offnum < xlrec->baseoff + xlrec->nitems;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemid = PageGetItemId(page, offnum);
+			IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+			Assert(!ItemIdIsDead(itemid));
+
+			if (offnum == xlrec->baseoff)
+			{
+				/*
+				 * No previous/base tuple for first data item -- use first
+				 * data item as base tuple of first pending posting list
+				 */
+				_bt_dedup_start_pending(state, itup, offnum);
+			}
+			else
+			{
+				/* Heap TID(s) for itup will be saved in state */
+				if (!_bt_dedup_save_htid(state, itup))
+					elog(ERROR, "could not add heap tid to pending posting list");
+			}
+		}
+
+		Assert(state->nitems == xlrec->nitems);
+		/* Handle the last item */
+		_bt_dedup_finish_pending(buf, state, false);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buf);
+	}
+
+	if (BufferIsValid(buf))
+		UnlockReleaseBuffer(buf);
+}
+
+static void
 btree_xlog_vacuum(XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	Buffer		buffer;
 	Page		page;
 	BTPageOpaque opaque;
-#ifdef UNUSED
 	xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
 
 	/*
 	 * This section of code is thought to be no longer needed, after analysis
@@ -478,14 +628,34 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		if (len > 0)
 		{
-			OffsetNumber *unused;
-			OffsetNumber *unend;
+			if (xlrec->nupdated > 0)
+			{
+				OffsetNumber *updatedoffsets;
+				IndexTuple	updated;
+				Size		itemsz;
+
+				updatedoffsets = (OffsetNumber *)
+					(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+				updated = (IndexTuple) ((char *) updatedoffsets +
+										xlrec->nupdated * sizeof(OffsetNumber));
 
-			unused = (OffsetNumber *) ptr;
-			unend = (OffsetNumber *) ((char *) ptr + len);
+				/* Handle posting tuples */
+				for (int i = 0; i < xlrec->nupdated; i++)
+				{
+					PageIndexTupleDelete(page, updatedoffsets[i]);
+
+					itemsz = MAXALIGN(IndexTupleSize(updated));
+
+					if (PageAddItem(page, (Item) updated, itemsz, updatedoffsets[i],
+									false, false) == InvalidOffsetNumber)
+						elog(PANIC, "btree_xlog_vacuum: failed to add updated posting list item");
+
+					updated = (IndexTuple) ((char *) updated + itemsz);
+				}
+			}
 
-			if ((unend - unused) > 0)
-				PageIndexMultiDelete(page, unused, unend - unused);
+			if (xlrec->ndeleted)
+				PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 		}
 
 		/*
@@ -820,7 +990,9 @@ void
 btree_redo(XLogReaderState *record)
 {
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	MemoryContext oldCtx;
 
+	oldCtx = MemoryContextSwitchTo(opCtx);
 	switch (info)
 	{
 		case XLOG_BTREE_INSERT_LEAF:
@@ -838,6 +1010,9 @@ btree_redo(XLogReaderState *record)
 		case XLOG_BTREE_SPLIT_R:
 			btree_xlog_split(false, record);
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			btree_xlog_dedup(record);
+			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
 			break;
@@ -863,6 +1038,23 @@ btree_redo(XLogReaderState *record)
 		default:
 			elog(PANIC, "btree_redo: unknown op code %u", info);
 	}
+	MemoryContextSwitchTo(oldCtx);
+	MemoryContextReset(opCtx);
+}
+
+void
+btree_xlog_startup(void)
+{
+	opCtx = AllocSetContextCreate(CurrentMemoryContext,
+								  "Btree recovery temporary context",
+								  ALLOCSET_DEFAULT_SIZES);
+}
+
+void
+btree_xlog_cleanup(void)
+{
+	MemoryContextDelete(opCtx);
+	opCtx = NULL;
 }
 
 /*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 4ee6d04..1dde2da 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -30,7 +30,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_insert *xlrec = (xl_btree_insert *) rec;
 
-				appendStringInfo(buf, "off %u", xlrec->offnum);
+				appendStringInfo(buf, "off %u; postingoff %u",
+								 xlrec->offnum, xlrec->postingoff);
 				break;
 			}
 		case XLOG_BTREE_SPLIT_L:
@@ -38,16 +39,30 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
-				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d",
-								 xlrec->level, xlrec->firstright, xlrec->newitemoff);
+				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, postingoff %d",
+								 xlrec->level,
+								 xlrec->firstright,
+								 xlrec->newitemoff,
+								 xlrec->postingoff);
+				break;
+			}
+		case XLOG_BTREE_DEDUP_PAGE:
+			{
+				xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+				appendStringInfo(buf, "baseoff %u; nitems %u",
+								 xlrec->baseoff,
+								 xlrec->nitems);
 				break;
 			}
 		case XLOG_BTREE_VACUUM:
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "lastBlockVacuumed %u",
-								 xlrec->lastBlockVacuumed);
+				appendStringInfo(buf, "lastBlockVacuumed %u; nupdated %u; ndeleted %u",
+								 xlrec->lastBlockVacuumed,
+								 xlrec->nupdated,
+								 xlrec->ndeleted);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
@@ -131,6 +146,9 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			id = "DEDUPLICATE";
+			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4a80e84..3ef752c 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -107,11 +107,40 @@ typedef struct BTMetaPageData
 										 * pages */
 	float8		btm_last_cleanup_num_heap_tuples;	/* number of heap tuples
 													 * during last cleanup */
+	bool		btm_dedup_is_possible; /* whether the deduplication
+										* can be applied to the index */
 } BTMetaPageData;
 
 #define BTPageGetMeta(p) \
 	((BTMetaPageData *) PageGetContents(p))
 
+/* Storage type for Btree's reloptions */
+typedef struct BtreeOptions
+{
+	int32		vl_len_;		/* varlena header (do not touch directly!) */
+	int	 fillfactor;
+	double vacuum_cleanup_index_scale_factor;
+	bool do_deduplication;
+} BtreeOptions;
+
+/*
+ * By default deduplication is enabled for non unique indexes
+ * and disabled for unique ones
+ */
+#define BtreeDefaultDoDedup(relation) \
+	(relation->rd_index->indisunique ? false : true)
+
+#define BtreeGetDoDedupOption(relation) \
+	((relation)->rd_options ? \
+	 ((BtreeOptions *) (relation)->rd_options)->do_deduplication : BtreeDefaultDoDedup(relation))
+
+#define BtreeGetFillFactor(relation, defaultff) \
+	((relation)->rd_options ? \
+	 ((BtreeOptions *) (relation)->rd_options)->fillfactor : (defaultff))
+
+#define BtreeGetTargetPageFreeSpace(relation, defaultff) \
+	(BLCKSZ * (100 - BtreeGetFillFactor(relation, defaultff)) / 100)
+
 /*
  * The current Btree version is 4.  That's what you'll get when you create
  * a new index.
@@ -234,8 +263,7 @@ typedef struct BTMetaPageData
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
  * All other types of index tuples ("pivot" tuples) only have key columns,
  * since pivot tuples only exist to represent how the key space is
@@ -252,6 +280,38 @@ typedef struct BTMetaPageData
  * omitted rather than truncated, since its representation is different to
  * the non-pivot representation.)
  *
+ * Non-pivot posting tuple format:
+ *  t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicated keys more effectively, we use special format
+ * of tuples - posting tuples.  posting_list is an array of ItemPointerData.
+ *
+ * Deduplication never applies to unique indexes or indexes with INCLUDEd
+ * columns.
+ *
+ * To differ posting tuples we use INDEX_ALT_TID_MASK flag in t_info and
+ * BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's tid:
+ * - t_tid.ip_blkid contains offset of the posting list.
+ * - t_tid offset field contains number of posting items this tuple contain
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * tuples, which is constrainted by BTMaxItemSize.
+
+ * If page contains so many duplicates, that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), page may contain several posting
+ * tuples with the same key.
+ * Also page can contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
  * Pivot tuple format:
  *
  *  t_tid | t_info | key values | [heap TID]
@@ -281,23 +341,149 @@ typedef struct BTMetaPageData
  * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
  * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
  * number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinct posting tuples from pivot tuples.
  *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_N_POSTING_OFFSET_MASK	0x0FFF
 #define BT_HEAP_TID_ATTR			0x1000
+#define BT_IS_POSTING				0x2000
+
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger then
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+	((int) ((BLCKSZ - SizeOfPageHeaderData - \
+			3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+			(sizeof(ItemPointerData)))
+
+/*
+ * State used to representing a pending posting list during deduplication.
+ *
+ * Each entry represents a group of consecutive items from the page, starting
+ * from page offset number 'baseoff', which is the offset number of the "base"
+ * tuple on the page undergoing deduplication.  'nitems' is the total number
+ * of items from the page that will be merged to make a new posting tuple.
+ *
+ * Note: 'nitems' means the number of physical index tuples/line pointers on
+ * the page, starting with and including the item at offset number 'baseoff'
+ * (so nitems should be at least 2 when interval is used).  These existing
+ * tuples may be posting list tuples or regular tuples.
+ */
+typedef struct BTDedupInterval
+{
+	OffsetNumber baseoff;
+	OffsetNumber nitems;
+} BTDedupInterval;
+
+/*
+ * Btree-private state needed to build posting tuples.  htids is an array of
+ * ItemPointers for pending posting list.
+ *
+ * Iterating over tuples during index build or applying deduplication to a
+ * single page, we remember a "base" tuple, then compare the next one with it.
+ * If tuples are equal, save their TIDs in the posting list.
+ */
+typedef struct BTDedupState
+{
+	/* Deduplication status info for entire page/operation */
+	bool		deduplicate;	/* Still deduplicating page? */
+	Size		maxitemsize;	/* BTMaxItemSize() limit for page */
+
+	/* Metadata about current pending posting list */
+	ItemPointer htids;			/* Heap TIDs in pending posting list */
+	int			nhtids;			/* # valid heap TIDs in nhtids array */
+	int			nitems;			/* See BTDedupInterval definition */
+	Size		alltupsize;		/* Includes line pointer overhead */
+
+	/* Metadata about based tuple of current pending posting list */
+	IndexTuple	base;			/* Use to form new posting list */
+	OffsetNumber baseoff;		/* original page offset of base */
+	Size		basetupsize;	/* base size without posting list */
+
+	/*
+	 * Pending posting list.  Contains information about a group of
+	 * consecutive items that will be deduplicated by creating a new posting
+	 * list tuple.
+	 */
+	BTDedupInterval interval;
+} BTDedupState;
+
+/*
+ * N.B.: BTreeTupleIsPivot() should only be used in code that deals with
+ * heapkeyspace indexes specifically.  BTreeTupleIsPosting() works with all
+ * nbtree indexes, though.
+ */
+#define BTreeTupleIsPivot(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+	)
+#define BTreeTupleIsPosting(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+	)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleGetNPosting(itup)	\
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+	)
+#define BTreeTupleSetNPosting(itup, n) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+	} while(0)
 
-/* Get/set downlink block number */
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+	)
+#define BTreeSetPostingMeta(itup, nposting, off) \
+	do { \
+		BTreeTupleSetNPosting(itup, nposting); \
+		Assert(BTreeTupleIsPosting(itup)); \
+		ItemPointerSetBlockNumber(&((itup)->t_tid), (off)); \
+	} while(0)
+
+#define BTreeTupleGetPosting(itup) \
+	(ItemPointer) ((char*) (itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+	(BTreeTupleGetPosting(itup) + (n))
+
+/* Get/set downlink block number  */
 #define BTreeInnerTupleGetDownLink(itup) \
 	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
 #define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,40 +512,73 @@ typedef struct BTMetaPageData
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
 		( \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
 		IndexRelationGetNumberOfAttributes(rel) \
 	)
-#define BTreeTupleSetNAtts(itup, n) \
-	do { \
-		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
-	} while(0)
+
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int n)
+{
+	Assert(!BTreeTupleIsPosting(itup));
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	ItemPointerSetOffsetNumber(&itup->t_tid, n & BT_N_KEYS_OFFSET_MASK);
+}
 
 /*
- * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Get tiebreaker heap TID attribute, if any.  Works with both pivot and
+ * non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
  */
-#define BTreeTupleGetHeapTID(itup) \
-	( \
-	  (itup)->t_info & INDEX_ALT_TID_MASK && \
-	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
-	  ( \
-		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
-					   sizeof(ItemPointerData)) \
-	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
-	)
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+	if (BTreeTupleIsPivot(itup))
+	{
+		/* Pivot tuple heap TID representation? */
+		if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+			 BT_HEAP_TID_ATTR) != 0)
+			return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+								  sizeof(ItemPointerData));
+
+		/* Heap TID attribute was truncated */
+		return NULL;
+	}
+	else if (BTreeTupleIsPosting(itup))
+		return BTreeTupleGetPosting(itup);
+
+	return &(itup->t_tid);
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list tuple.  Works with
+ * non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxTID(IndexTuple itup)
+{
+	Assert(!BTreeTupleIsPivot(itup));
+
+	if (BTreeTupleIsPosting(itup))
+		return (ItemPointer) (BTreeTupleGetPosting(itup) +
+							  (BTreeTupleGetNPosting(itup) - 1));
+
+	return &(itup->t_tid);
+}
+
 /*
  * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * representation
  */
 #define BTreeTupleSetAltHeapTID(itup) \
 	do { \
-		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(BTreeTupleIsPivot(itup)); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
 								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
 	} while(0)
@@ -472,6 +691,7 @@ typedef struct BTScanInsertData
 	bool		anynullkeys;
 	bool		nextkey;
 	bool		pivotsearch;
+	bool		dedup_is_possible;
 	ItemPointer scantid;		/* tiebreaker for scankeys */
 	int			keysz;			/* Size of scankeys array */
 	ScanKeyData scankeys[INDEX_MAX_KEYS];	/* Must appear last */
@@ -500,6 +720,13 @@ typedef struct BTInsertStateData
 	Buffer		buf;
 
 	/*
+	 * if _bt_binsrch_insert() found the location inside existing posting
+	 * list, save the position inside the list.  This will be -1 in rare cases
+	 * where the overlapping posting list is LP_DEAD.
+	 */
+	int			postingoff;
+
+	/*
 	 * Cache of bounds within the current buffer.  Only used for insertions
 	 * where _bt_check_unique is called.  See _bt_binsrch_insert and
 	 * _bt_findinsertloc for details.
@@ -534,7 +761,9 @@ typedef BTInsertStateData *BTInsertState;
  * If we are doing an index-only scan, we save the entire IndexTuple for each
  * matched item, otherwise only its heap TID and offset.  The IndexTuples go
  * into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array.  Posting list tuples store a version of the
+ * tuple that does not include the posting list, allowing the same key to be
+ * returned for each logical tuple associated with the posting list.
  */
 
 typedef struct BTScanPosItem	/* what we remember about each match */
@@ -563,9 +792,13 @@ typedef struct BTScanPosData
 
 	/*
 	 * If we are doing an index-only scan, nextTupleOffset is the first free
-	 * location in the associated tuple storage workspace.
+	 * location in the associated tuple storage workspace.  Posting list
+	 * tuples need postingTupleOffset to store the current location of the
+	 * tuple that is returned multiple times (once per heap TID in posting
+	 * list).
 	 */
 	int			nextTupleOffset;
+	int			postingTupleOffset;
 
 	/*
 	 * The items array is always ordered in index order (ie, increasing
@@ -578,7 +811,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxPostingIndexTuplesPerPage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -730,8 +963,14 @@ extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
  */
 extern bool _bt_doinsert(Relation rel, IndexTuple itup,
 						 IndexUniqueCheck checkUnique, Relation heapRel);
+extern IndexTuple _bt_posting_split(IndexTuple newitem, IndexTuple oposting,
+									OffsetNumber postingoff);
 extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
 extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
+extern void _bt_dedup_start_pending(BTDedupState *state, IndexTuple base,
+									OffsetNumber base_off);
+extern bool _bt_dedup_save_htid(BTDedupState *state, IndexTuple itup);
+Size _bt_dedup_finish_pending(Buffer buffer, BTDedupState* state, bool need_wal);
 
 /*
  * prototypes for functions in nbtsplitloc.c
@@ -743,7 +982,8 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
 /*
  * prototypes for functions in nbtpage.c
  */
-extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level);
+extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+							 bool dedup_is_possible);
 extern void _bt_update_meta_cleanup_info(Relation rel,
 										 TransactionId oldestBtpoXact, float8 numHeapTuples);
 extern void _bt_upgrademetapage(Page page);
@@ -751,6 +991,7 @@ extern Buffer _bt_getroot(Relation rel, int access);
 extern Buffer _bt_gettrueroot(Relation rel);
 extern int	_bt_getrootheight(Relation rel);
 extern bool _bt_heapkeyspace(Relation rel);
+extern bool _bt_getdedupispossible(Relation rel);
 extern void _bt_checkpage(Relation rel, Buffer buf);
 extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
 extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -762,6 +1003,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems, Relation heapRel);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems,
+								OffsetNumber *updateitemnos,
+								IndexTuple *updated, int nupdateable,
 								BlockNumber lastBlockVacuumed);
 extern int	_bt_pagedel(Relation rel, Buffer buf);
 
@@ -812,6 +1055,9 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointer htids,
+										int nhtids);
+extern bool _bt_dedup_is_possible(Relation index);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 91b9ee0..71f6568 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP_PAGE	0x50	/* deduplicate tuples on leaf page */
+/* 0x60 is unused */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -53,6 +54,7 @@ typedef struct xl_btree_metadata
 	uint32		fastlevel;
 	TransactionId oldest_btpo_xact;
 	float8		last_cleanup_num_heap_tuples;
+	bool		btm_dedup_is_possible;
 } xl_btree_metadata;
 
 /*
@@ -61,16 +63,21 @@ typedef struct xl_btree_metadata
  * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
  * Note that INSERT_META implies it's not a leaf page.
  *
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page (data contains the inserted tuple);
+ *				 if postingoff is set, this started out as an insertion
+ *				 into an existing posting tuple at the offset before
+ *				 offnum (i.e. it's a posting list split).  (REDO will
+ *				 have to update split posting list, too.)
  * Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
  * Backup Blk 2: xl_btree_metadata, if INSERT_META
  */
 typedef struct xl_btree_insert
 {
 	OffsetNumber offnum;
+	OffsetNumber postingoff;
 } xl_btree_insert;
 
-#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, postingoff) + sizeof(OffsetNumber))
 
 /*
  * On insert with split, we save all the items going into the right sibling
@@ -91,9 +98,19 @@ typedef struct xl_btree_insert
  *
  * Backup Blk 0: original page / new left page
  *
- * The left page's data portion contains the new item, if it's the _L variant.
- * An IndexTuple representing the high key of the left page must follow with
- * either variant.
+ * The left page's data portion contains the new item, if it's the _L variant
+ * (though _R variant page split records with a posting list split sometimes
+ * need to include newitem).  An IndexTuple representing the high key of the
+ * left page must follow in all cases.
+ *
+ * The newitem is actually an "original" newitem when a posting list split
+ * occurs that requires than the original posting list be updated in passing.
+ * Recovery recognizes this case when postingoff is set, and must use the
+ * posting offset to do an in-place update of the existing posting list that
+ * was actually split, and change the newitem to the "final" newitem.  This
+ * corresponds to the xl_btree_insert postingoff-is-set case.  postingoff
+ * won't be set when a posting list split occurs where both original posting
+ * list and newitem go on the right page.
  *
  * Backup Blk 1: new right page
  *
@@ -111,10 +128,26 @@ typedef struct xl_btree_split
 {
 	uint32		level;			/* tree level of page being split */
 	OffsetNumber firstright;	/* first item moved to right page */
-	OffsetNumber newitemoff;	/* new item's offset (useful for _L variant) */
+	OffsetNumber newitemoff;	/* new item's offset */
+	OffsetNumber postingoff;	/* offset inside orig posting tuple */
 } xl_btree_split;
 
-#define SizeOfBtreeSplit	(offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit	(offsetof(xl_btree_split, postingoff) + sizeof(OffsetNumber))
+
+/*
+ * When page is deduplicated, consecutive groups of tuples with equal keys are
+ * merged together into posting list tuples.
+ *
+ * The WAL record represents the interval that describes the posing tuple
+ * that should be added to the page.
+ */
+typedef struct xl_btree_dedup
+{
+	OffsetNumber baseoff;
+	OffsetNumber nitems;
+} xl_btree_dedup;
+
+#define SizeOfBtreeDedup 	(offsetof(xl_btree_dedup, nitems) + sizeof(OffsetNumber))
 
 /*
  * This is what we need to know about delete of individual leaf index tuples.
@@ -166,16 +199,27 @@ typedef struct xl_btree_reuse_page
  * block numbers aren't given.
  *
  * Note that the *last* WAL record in any vacuum of an index is allowed to
- * have a zero length array of offsets. Earlier records must have at least one.
+ * have a zero length array of target offsets (i.e. no deletes or updates).
+ * Earlier records must have at least one.
  */
 typedef struct xl_btree_vacuum
 {
 	BlockNumber lastBlockVacuumed;
 
-	/* TARGET OFFSET NUMBERS FOLLOW */
+	/*
+	 * This field helps us to find beginning of the updated versions of tuples
+	 * which follow array of offset numbers, needed when a posting list is
+	 * vacuumed without killing all of its logical tuples.
+	 */
+	uint32		nupdated;
+	uint32		ndeleted;
+
+	/* UPDATED TARGET OFFSET NUMBERS FOLLOW (if any) */
+	/* UPDATED TUPLES TO ADD BACK FOLLOW (if any) */
+	/* DELETED TARGET OFFSET NUMBERS FOLLOW (if any) */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(BlockNumber))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.
@@ -256,6 +300,8 @@ typedef struct xl_btree_newroot
 extern void btree_redo(XLogReaderState *record);
 extern void btree_desc(StringInfo buf, XLogReaderState *record);
 extern const char *btree_identify(uint8 info);
+extern void btree_xlog_startup(void);
+extern void btree_xlog_cleanup(void);
 extern void btree_mask(char *pagedata, BlockNumber blkno);
 
 #endif							/* NBTXLOG_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 3c0db2c..2b8c6c7 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -36,7 +36,7 @@ PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL,
 PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
 PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
 PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask)
 PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask)
 PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
 PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index ec47a22..71a03e3 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -212,3 +212,24 @@
    Memcheck:Cond
    fun:PyObject_Realloc
 }
+
+# Temporarily work around bug in datum_image_eq's handling of the cstring
+# (typLen == -2) case.  datumIsEqual() is not affected, but also doesn't handle
+# TOAST'ed values correctly.
+#
+# FIXME: Remove both suppressions when bug is fixed on master branch
+{
+   temporary_workaround_1
+   Memcheck:Addr1
+   fun:bcmp
+   fun:datum_image_eq
+   fun:_bt_keep_natts_fast
+}
+
+{
+   temporary_workaround_8
+   Memcheck:Addr8
+   fun:bcmp
+   fun:datum_image_eq
+   fun:_bt_keep_natts_fast
+}

#98

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Anastasia Lubennikova (#97)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Fri, Sep 27, 2019 at 9:43 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

Attached is v19.

Cool.

* By default deduplication is on for non-unique indexes and off for
unique ones.

I think that it makes sense to enable deduplication by default -- even
with unique indexes. It looks like deduplication can be very helpful
with non-HOT updates. I have been benchmarking this using more or less
standard pgbench at scale 200, with one big difference -- I also
create an index on "pgbench_accounts (abalance)". This is a low
cardinality index, which ends up about 3x smaller with the patch, as
expected. It also makes all updates non-HOT updates, greatly
increasing index bloat in the primary key of the accounts table --
this is what I found really interesting about this workload.

The theory behind deduplication within unique indexes seems quite
different to the cases we've focussed on so far -- that's why my
working copy of the patch makes a few small changes to how
_bt_dedup_one_page() works with unique indexes specifically (more on
that later). With unique indexes, deduplication doesn't help by
creating space -- it helps by creating *time* for garbage collection
to run before the real "damage" is done -- it delays page splits. This
is only truly valuable when page splits caused by non-HOT updates are
delayed by so much that they're actually prevented entirely, typically
because the _bt_vacuum_one_page() stuff can now happen before pages
split, not after. In general, these page splits are bad because they
degrade the B-Tree structure, more or less permanently (it's certainly
permanent with this workload). Having a huge number of page splits
*purely* because of non-HOT updates is particular bad -- it's just
awful. I believe that this is the single biggest problem with the
Postgres approach to versioned storage (we know that other DB systems
have no primary key page splits with this kind of workload).

Anyway, if you run this pgbench workload without rate-limiting, so
that a patched Postgres does as much work as physically possible, the
accounts table primary key (pgbench_accounts_pkey) at least grows at a
slower rate -- the patch clearly beats master at the start of the
benchmark/test (as measured by index size). As the clients are ramped
up by my testing script, and as time goes on, eventually the size of
the pgbench_accounts_pkey index "catches up" with master. The patch
delays page splits, but ultimately the system as a whole cannot
prevent the page splits altogether, since the server is being
absolutely hammered by pgbench. Actually, the index is *exactly* the
same size for both the master case and the patch case when we reach
this "bloat saturation point". We can delay the problem, but we cannot
prevent it. But what about a more realistic workload, with
rate-limiting?

When I add some rate limiting, so that the TPS/throughput is at about
50% of what I got the first time around (i.e. 50% of what is
possible), or 15k TPS, it's very different. _bt_dedup_one_page() can
now effectively cooperate with _bt_vacuum_one_page(). Now
deduplication is able to "soak up all the extra garbage tuples" for
long enough to delay and ultimately *prevent* almost all page splits.
pgbench_accounts_pkey starts off at 428 MB for both master and patch
(CREATE INDEX makes it that size). After about an hour, the index is
447 MB with the patch. The master case ends up with a
pgbench_accounts_pkey size of 854 MB, though (this is very close to
857 MB, the "saturation point" index size from before).

This is a very significant improvement, obviously -- the patch has an
index that is ~52% of the size seen for the same index with the master
branch!

Here is how I changed _bt_dedup_one_page() for unique indexes to get
this result:

* We limit the size of posting lists to 5 heap TIDs in the
checkingunique case. Right now, we will actually accept a
checkingunique page split before we'll merge together items that
result in a posting list with more heap TIDs than that (not sure about
these details at all, though).

* Avoid creating a new posting list that caller will have to split
immediately anyway (this is based on details of _bt_dedup_one_page()
caller's newitem tuple).

(Not sure how much this customization contributes to this favorable
test result -- maybe it doesn't make that much difference.)

The goal here is for duplicates that are close together in both time
and space to get "clumped together" into their own distinct, small-ish
posting list tuples with no more than 5 TIDs. This is intended to help
_bt_vacuum_one_page(), which is known to be a very important mechanism
for indexes like our pgbench_accounts_pkey index (LP_DEAD bits are set
very frequently within _bt_check_unique()). The general idea is to
balance deduplication against LP_DEAD killing, and to increase
spatial/temporal locality within these smaller posting lists. If we
have one huge posting list for each value, then we can't set the
LP_DEAD bit on anything at all, which is very bad. If we have a few
posting lists that are not so big for each distinct value, we can
often kill most of them within _bt_vacuum_one_page(), which is very
good, and has minimal downside (i.e. we still get most of the benefits
of aggressive deduplication).

Interestingly, these non-HOT page splits all seem to "come in waves".
I noticed this because I carefully monitored the benchmark/test case
over time. The patch doesn't prevent the "waves of page splits"
pattern, but it does make it much much less noticeable.

* New function _bt_dedup_is_possible() is intended to be a single place
to perform all the checks. Now it's just a stub to ensure that it works.

Is there a way to extract this from existing opclass information,
or we need to add new opclass field? Have you already started this work?
I recall there was another thread, but didn't manage to find it.

The thread is here:
/messages/by-id/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com

--
Peter Geoghegan

#99

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Peter Geoghegan (#98)

2 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Fri, Sep 27, 2019 at 7:02 PM Peter Geoghegan <pg@bowt.ie> wrote:

I think that it makes sense to enable deduplication by default -- even
with unique indexes. It looks like deduplication can be very helpful
with non-HOT updates.

Attached is v20, which adds a custom strategy for the checkingunique
(unique index) case to _bt_dedup_one_page(). It also makes
deduplication the default for both unique and non-unique indexes. I
simply altered your new BtreeDefaultDoDedup() macro from v19 to make
nbtree use deduplication wherever it is safe to do so. This default
may not be the best one in the end, though deduplication in unique
indexes now looks very compelling.

The new checkingunique heuristics added to _bt_dedup_one_page() were
developed experimentally, based on pgbench tests. The general idea
with the new checkingunique stuff is to make deduplication *extremely*
lazy. We want to avoid making _bt_vacuum_one_page() garbage collection
less effective by being too aggressive with deduplication -- workloads
with lots of non-HOT-updates into unique indexes are greatly dependent
on the LP_DEAD bit setting in _bt_check_unique(). At the same time,
_bt_dedup_one_page() can be just as effective at delaying page splits
as it is with non-unique indexes.

I've found that my "regular pgbench, but with a low cardinality index
on pgbench_accounts(abalance)" benchmark works best with the specific
heuristics used in the patch, especially over many hours. I spent
nearly 24 hours running the test at full speed (no throttling this
time), at scale 500, and with very very aggressive autovacuum settings
(autovacuum_vacuum_cost_delay=0ms,
autovacuum_vacuum_scale_factor=0.02). Each run lasted one hour, with
alternating runs of 4, 8, and 16 clients. Towards the end, the patch
had about 5% greater throughput at lower client counts, and never
seemed to be significantly slower (it was very slightly slower once or
twice, but I think that that was just noise).

More importantly, the indexes looked like this on master:

bloated_abalance: 3017 MB
pgbench_accounts_pkey: 2142 MB
pgbench_branches_pkey: 1352 kB
pgbench_tellers_pkey: 3416 kB

And like this with the patch:

bloated_abalance: 1015 MB
pgbench_accounts_pkey: 1745 MB
pgbench_branches_pkey: 296 kB
pgbench_tellers_pkey: 888 kB

* bloated_abalance is about 3x smaller here, as usual -- no surprises there.

* pgbench_accounts_pkey is the most interesting case.

You might think that it isn't that great that pgbench_accounts_pkey is
1745 MB with the patch, since it starts out at only 1071 MB (and would
go back down to 1071 MB again if we were to do a REINDEX). However,
you have to bear in mind that it takes a long time for it to get that
big -- the growth over time is very important here. Even after the
first run with 16 clients, it only reached 1160 MB -- that's an
increase of ~8%. The master case had already reached 2142 MB ("bloat
saturation point") by then, though. I could easily have stopped the
benchmark there, or used rate-limiting, or excluded the 16 client case
-- that would have allowed me to claim that the growth was under 10%
for a workload where the master case has an index that doubles in
size. On the other hand, if autovacuum wasn't configured to run very
frequently, then the patch wouldn't look nearly this good.
Deduplication helped autovacuum by "soaking up" the "recently dead"
index tuples that cannot be killed right away. In short, the patch
ameliorates weaknesses of the existing garbage collection mechanisms
without changing them. The patch smoothed out the growth of
pgbench_accounts_pkey over many hours. As I said, it was only 1160 MB
after the first 3 hours/first 16 client run. It was 1356 MB after the
second 16 client run (i.e. after running another round of one hour
4/8/16 client runs), finally finishing up at 1745 MB. So the growth in
the size of pgbench_accounts_pkey for the patch was significantly
improved, and the *rate* of growth over time was also improved.

The master branch had a terrible jerky growth in the size of
pgbench_accounts_pkey. The master branch did mostly keep up at first
(i.e. the size of pgbench_accounts_pkey wasn't too different at
first). But once we got to 16 clients for the first time, after a
couple of hours, pgbench_accounts_pkey almost doubled in size over a
period of only 10 or 20 minutes! The index size *exploded* in a very
short period of time, starting only a few hours into the benchmark.
(Maybe we don't see this anything like this with the patch because
with the patch backends are more concerned about helping VACUUM, and
less concerned about creating a mess that VACUUM must clean up. Not
sure.)

* We also manage to make the small pgbench indexes
(pgbench_branches_pkey and pgbench_tellers_pkey) over 4x smaller here
(without doing anything to force more non-HOT updates on the
underlying tables).

This result for the two small indexes looks good, but I should point
out that we still only fit ~15 or so tuples on each leaf page with the
patch when everything is over -- far far less than the number that
CREATE INDEX stored on the leaf pages immediately (it leaves 366 items
on each leaf page). This is kind of an extreme case, because there is
so much contention, but space utilization with the patch is actually
very bad here. The master branch is very very very bad, though, so
we're at least down to only a single "very" here. Progress.

Any thoughts on the approach taken for unique indexes within
_bt_dedup_one_page() in v20? Obviously that stuff needs to be examined
critically -- it's possible that it wouldn't do as well as it could or
should with other workloads that I haven't thought about. Please take
a look at the details.

--
Peter Geoghegan

Attachments:

v20-0002-DEBUG-Add-pageinspect-instrumentation.patchapplication/octet-stream; name=v20-0002-DEBUG-Add-pageinspect-instrumentation.patchDownload

From 5662b08800e0caef28ebe8d27c3512a993d40130 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v20 2/2] DEBUG: Add pageinspect instrumentation.

Have pageinspect display user-visible attribute values, heap TID, max
heap TID, and the number of TIDs in a tuple (can be > 1 in the case of
posting list tuples).  Also adds a column that shows whether or not the
LP_DEAD bit has been set.

This patch is not proposed for inclusion in PostgreSQL; it's included
for the convenience of reviewers.

The following query can be used with this hacked pageinspect, which
visualizes the internal pages:

"""

with recursive index_details as (
  select
    'my_test_index'::text idx
),
size_in_pages_index as (
  select
    (pg_relation_size(idx::regclass) / (2^13))::int4 size_pages
  from
    index_details
),
page_stats as (
  select
    index_details.*,
    stats.*
  from
    index_details,
    size_in_pages_index,
    lateral (select i from generate_series(1, size_pages - 1) i) series,
    lateral (select * from bt_page_stats(idx, i)) stats),
internal_page_stats as (
  select
    *
  from
    page_stats
  where
    type != 'l'),
meta_stats as (
  select
    *
  from
    index_details s,
    lateral (select * from bt_metap(s.idx)) meta),
internal_items as (
  select
    *
  from
    internal_page_stats
  order by
    btpo desc),
-- XXX: Note ordering dependency within this CTE, on internal_items
ordered_internal_items(item, blk, level) as (
  select
    1,
    blkno,
    btpo
  from
    internal_items
  where
    btpo_prev = 0
    and btpo = (select level from meta_stats)
  union
  select
    case when level = btpo then o.item + 1 else 1 end,
    blkno,
    btpo
  from
    internal_items i,
    ordered_internal_items o
  where
    i.btpo_prev = o.blk or (btpo_prev = 0 and btpo = o.level - 1)
)
select
  --idx,
  btpo as level,
  item as l_item,
  blkno,
  --btpo_prev,
  --btpo_next,
  btpo_flags,
  type,
  live_items,
  dead_items,
  avg_item_size,
  page_size,
  free_size,
  -- Only non-rightmost pages have high key.  Show heap TID for both pivot and non-pivot tuples here.
  case when btpo_next != 0 then (select data || coalesce(', (htid)=(''' || htid || ''')', '')
                                 from bt_page_items(idx, blkno) where itemoffset = 1) end as highkey
from
  ordered_internal_items o
  join internal_items i on o.blk = i.blkno
order by btpo desc, item;
"""
---
 contrib/pageinspect/btreefuncs.c              | 92 ++++++++++++++++---
 contrib/pageinspect/expected/btree.out        |  6 +-
 contrib/pageinspect/pageinspect--1.6--1.7.sql | 25 +++++
 3 files changed, 109 insertions(+), 14 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 8d27c9b0f6..e88875107f 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -29,6 +29,7 @@
 
 #include "pageinspect.h"
 
+#include "access/genam.h"
 #include "access/nbtree.h"
 #include "access/relation.h"
 #include "catalog/namespace.h"
@@ -243,6 +244,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
  */
 struct user_args
 {
+	Relation	rel;
 	Page		page;
 	OffsetNumber offset;
 };
@@ -254,9 +256,9 @@ struct user_args
  * ------------------------------------------------------
  */
 static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset, Relation rel)
 {
-	char	   *values[6];
+	char	   *values[10];
 	HeapTuple	tuple;
 	ItemId		id;
 	IndexTuple	itup;
@@ -265,6 +267,8 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	int			dlen;
 	char	   *dump;
 	char	   *ptr;
+	ItemPointer min_htid,
+				max_htid;
 
 	id = PageGetItemId(page, offset);
 
@@ -283,16 +287,77 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
-	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
-	dump = palloc0(dlen * 3 + 1);
-	values[j] = dump;
-	for (off = 0; off < dlen; off++)
+	if (rel)
 	{
-		if (off > 0)
-			*dump++ = ' ';
-		sprintf(dump, "%02x", *(ptr + off) & 0xff);
-		dump += 2;
+		TupleDesc	itupdesc = RelationGetDescr(rel);
+		Datum		datvalues[INDEX_MAX_KEYS];
+		bool		isnull[INDEX_MAX_KEYS];
+		int			natts;
+		int			indnkeyatts = rel->rd_index->indnkeyatts;
+
+		natts = BTreeTupleGetNAtts(itup, rel);
+
+		itupdesc->natts = Min(indnkeyatts, natts);
+		memset(&isnull, 0xFF, sizeof(isnull));
+		index_deform_tuple(itup, itupdesc, datvalues, isnull);
+		rel->rd_index->indnkeyatts = natts;
+		values[j++] = BuildIndexValueDescription(rel, datvalues, isnull);
+		itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+		rel->rd_index->indnkeyatts = indnkeyatts;
 	}
+	else
+	{
+		dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+		dump = palloc0(dlen * 3 + 1);
+		values[j++] = dump;
+		for (off = 0; off < dlen; off++)
+		{
+			if (off > 0)
+				*dump++ = ' ';
+			sprintf(dump, "%02x", *(ptr + off) & 0xff);
+			dump += 2;
+		}
+	}
+
+	if (rel && !_bt_heapkeyspace(rel))
+	{
+		min_htid = NULL;
+		max_htid = NULL;
+	}
+	else
+	{
+		min_htid = BTreeTupleGetHeapTID(itup);
+		if (BTreeTupleIsPosting(itup))
+			max_htid = BTreeTupleGetMaxHeapTID(itup);
+		else
+			max_htid = NULL;
+	}
+
+	if (min_htid)
+		values[j++] = psprintf("(%u,%u)",
+							   ItemPointerGetBlockNumberNoCheck(min_htid),
+							   ItemPointerGetOffsetNumberNoCheck(min_htid));
+	else
+		values[j++] = NULL;
+
+	if (max_htid)
+		values[j++] = psprintf("(%u,%u)",
+							   ItemPointerGetBlockNumberNoCheck(max_htid),
+							   ItemPointerGetOffsetNumberNoCheck(max_htid));
+	else
+		values[j++] = NULL;
+
+	if (min_htid == NULL)
+		values[j++] = psprintf("0");
+	else if (!BTreeTupleIsPosting(itup))
+		values[j++] = psprintf("1");
+	else
+		values[j++] = psprintf("%d", (int) BTreeTupleGetNPosting(itup));
+
+	if (!ItemIdIsDead(id))
+		values[j++] = psprintf("f");
+	else
+		values[j++] = psprintf("t");
 
 	tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
 
@@ -366,11 +431,11 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = rel;
 		uargs->page = palloc(BLCKSZ);
 		memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
 
 		UnlockReleaseBuffer(buffer);
-		relation_close(rel, AccessShareLock);
 
 		uargs->offset = FirstOffsetNumber;
 
@@ -397,12 +462,13 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, uargs->rel);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
 	else
 	{
+		relation_close(uargs->rel, AccessShareLock);
 		pfree(uargs->page);
 		pfree(uargs);
 		SRF_RETURN_DONE(fctx);
@@ -482,7 +548,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, NULL);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..0f6dccaadc 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -40,7 +40,11 @@ ctid       | (0,1)
 itemlen    | 16
 nulls      | f
 vars       | f
-data       | 01 00 00 00 00 00 00 01
+data       | (a)=(72057594037927937)
+htid       | (0,1)
+max_htid   | 
+nheap_tids | 1
+isdead     | f
 
 SELECT * FROM bt_page_items('test1_a_idx', 2);
 ERROR:  block number out of range
diff --git a/contrib/pageinspect/pageinspect--1.6--1.7.sql b/contrib/pageinspect/pageinspect--1.6--1.7.sql
index 2433a21af2..00473da938 100644
--- a/contrib/pageinspect/pageinspect--1.6--1.7.sql
+++ b/contrib/pageinspect/pageinspect--1.6--1.7.sql
@@ -24,3 +24,28 @@ CREATE FUNCTION bt_metap(IN relname text,
     OUT last_cleanup_num_tuples real)
 AS 'MODULE_PATHNAME', 'bt_metap'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items()
+--
+DROP FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT htid tid,
+    OUT max_htid tid,
+    OUT nheap_tids int4,
+    OUT isdead boolean)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
-- 
2.17.1

v20-0001-Add-deduplication-to-nbtree.patchapplication/octet-stream; name=v20-0001-Add-deduplication-to-nbtree.patchDownload

From 4fd6fa5c21b79f56f5d3f8f8881778a3d8fb82c5 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 25 Sep 2019 10:08:53 -0700
Subject: [PATCH v20 1/2] Add deduplication to nbtree

---
 contrib/amcheck/verify_nbtree.c         | 164 ++++-
 src/backend/access/common/reloptions.c  |  11 +-
 src/backend/access/index/genam.c        |   4 +
 src/backend/access/nbtree/README        |  74 +-
 src/backend/access/nbtree/nbtinsert.c   | 860 +++++++++++++++++++++++-
 src/backend/access/nbtree/nbtpage.c     | 211 +++++-
 src/backend/access/nbtree/nbtree.c      | 175 ++++-
 src/backend/access/nbtree/nbtsearch.c   | 244 ++++++-
 src/backend/access/nbtree/nbtsort.c     | 144 +++-
 src/backend/access/nbtree/nbtsplitloc.c |  49 +-
 src/backend/access/nbtree/nbtutils.c    | 326 ++++++++-
 src/backend/access/nbtree/nbtxlog.c     | 222 +++++-
 src/backend/access/rmgrdesc/nbtdesc.c   |  28 +-
 src/include/access/nbtree.h             | 319 ++++++++-
 src/include/access/nbtxlog.h            |  68 +-
 src/include/access/rmgrlist.h           |   2 +-
 src/tools/valgrind.supp                 |  21 +
 17 files changed, 2732 insertions(+), 190 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d678ed..bdb0ede577 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -145,6 +145,7 @@ static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 									  bool tupleIsAlive, void *checkstate);
 static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 									 IndexTuple itup);
+static inline IndexTuple bt_posting_logical_tuple(IndexTuple itup, int n);
 static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 											   OffsetNumber offset);
@@ -419,12 +420,13 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		/*
 		 * Size Bloom filter based on estimated number of tuples in index,
 		 * while conservatively assuming that each block must contain at least
-		 * MaxIndexTuplesPerPage / 5 non-pivot tuples.  (Non-leaf pages cannot
-		 * contain non-pivot tuples.  That's okay because they generally make
-		 * up no more than about 1% of all pages in the index.)
+		 * MaxPostingIndexTuplesPerPage / 3 "logical" tuples.  heapallindexed
+		 * verification fingerprints posting list heap TIDs as plain non-pivot
+		 * tuples, complete with index keys.  This allows its heap scan to
+		 * behave as if posting lists do not exist.
 		 */
 		total_pages = RelationGetNumberOfBlocks(rel);
-		total_elems = Max(total_pages * (MaxIndexTuplesPerPage / 5),
+		total_elems = Max(total_pages * (MaxPostingIndexTuplesPerPage / 3),
 						  (int64) state->rel->rd_rel->reltuples);
 		/* Random seed relies on backend srandom() call to avoid repetition */
 		seed = random();
@@ -924,6 +926,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer	scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -994,29 +997,73 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
 			char	   *itid,
 					   *htid;
+			ItemPointer tid = BTreeTupleGetHeapTID(itup);
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+							ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("could not find tuple using search from root page in index \"%s\"",
 							RelationGetRelationName(state->rel)),
-					 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+					 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
 										itid, htid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is actually a posting list, make sure posting list TIDs
+		 * are in order.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer		current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+				current = BTreeTupleGetPostingN(itup, i);
+
+				if (ItemPointerCompare(current, &last) <= 0)
+				{
+					char	   *itid,
+							   *htid;
+
+					itid = psprintf("(%u,%u)", state->targetblock, offset);
+					htid = psprintf("(%u,%u)",
+									ItemPointerGetBlockNumberNoCheck(current),
+									ItemPointerGetOffsetNumberNoCheck(current));
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+												itid, htid,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1074,12 +1121,32 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				/* Fingerprint all elements as distinct "logical" tuples */
+				for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					IndexTuple	logtuple;
+
+					logtuple = bt_posting_logical_tuple(itup, i);
+					norm = bt_normalize_tuple(state, logtuple);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != logtuple)
+						pfree(norm);
+					pfree(logtuple);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
@@ -1087,7 +1154,8 @@ bt_target_page_check(BtreeCheckState *state)
 		 *
 		 * If there is a high key (if this is not the rightmost page on its
 		 * entire level), check that high key actually is upper bound on all
-		 * page items.
+		 * page items.  If this is a posting list tuple, we'll need to set
+		 * scantid to be highest TID in posting list.
 		 *
 		 * We prefer to check all items against high key rather than checking
 		 * just the last and trusting that the operator class obeys the
@@ -1127,6 +1195,9 @@ bt_target_page_check(BtreeCheckState *state)
 		 * tuple. (See also: "Notes About Data Representation" in the nbtree
 		 * README.)
 		 */
+		scantid = skey->scantid;
+		if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+			skey->scantid = BTreeTupleGetMaxHeapTID(itup);
 		if (!P_RIGHTMOST(topaque) &&
 			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
 			  invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1221,7 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
+		skey->scantid = scantid;
 
 		/*
 		 * * Item order check *
@@ -1164,11 +1236,13 @@ bt_target_page_check(BtreeCheckState *state)
 					   *htid,
 					   *nitid,
 					   *nhtid;
+			ItemPointer tid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetHeapTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 			nitid = psprintf("(%u,%u)", state->targetblock,
 							 OffsetNumberNext(offset));
 
@@ -1177,9 +1251,11 @@ bt_target_page_check(BtreeCheckState *state)
 										  state->target,
 										  OffsetNumberNext(offset));
 			itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+			tid = BTreeTupleGetHeapTID(itup);
 			nhtid = psprintf("(%u,%u)",
-							 ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							 ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							 ItemPointerGetBlockNumberNoCheck(tid),
+							 ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1265,10 @@ bt_target_page_check(BtreeCheckState *state)
 										"higher index tid=%s (points to %s tid=%s) "
 										"page lsn=%X/%X.",
 										itid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										htid,
 										nitid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										nhtid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
@@ -1953,10 +2029,10 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
  * verification.  In particular, it won't try to normalize opclass-equal
  * datums with potentially distinct representations (e.g., btree/numeric_ops
  * index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though.  For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation.  Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
  */
 static IndexTuple
 bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -1969,6 +2045,9 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	IndexTuple	reformed;
 	int			i;
 
+	/* Caller should only pass "logical" non-pivot tuples here */
+	Assert(!BTreeTupleIsPosting(itup) && !BTreeTupleIsPivot(itup));
+
 	/* Easy case: It's immediately clear that tuple has no varlena datums */
 	if (!IndexTupleHasVarwidths(itup))
 		return itup;
@@ -2031,6 +2110,30 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	return reformed;
 }
 
+/*
+ * Produce palloc()'d "logical" tuple for nth posting list entry.
+ *
+ * In general, deduplication is not supposed to change the logical contents of
+ * an index.  Multiple logical index tuples are folded together into one
+ * physical posting list index tuple when convenient.
+ *
+ * heapallindexed verification must normalize-away this variation in
+ * representation by converting posting list tuples into two or more "logical"
+ * tuples.  Each logical tuple must be fingerprinted separately -- there must
+ * be one logical tuple for each corresponding Bloom filter probe during the
+ * heap scan.
+ *
+ * Note: Caller needs to call bt_normalize_tuple() with returned tuple.
+ */
+static inline IndexTuple
+bt_posting_logical_tuple(IndexTuple itup, int n)
+{
+	Assert(BTreeTupleIsPosting(itup));
+
+	/* Returns non-posting-list tuple */
+	return BTreeFormPostingTuple(itup, BTreeTupleGetPostingN(itup, n), 1);
+}
+
 /*
  * Search for itup in index, starting from fast root page.  itup must be a
  * non-pivot tuple.  This is only supported with heapkeyspace indexes, since
@@ -2087,6 +2190,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		insertstate.itup = itup;
 		insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 		insertstate.itup_key = key;
+		insertstate.postingoff = 0;
 		insertstate.bounds_valid = false;
 		insertstate.buf = lbuf;
 
@@ -2094,7 +2198,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		offnum = _bt_binsrch_insert(state->rel, &insertstate);
 		/* Compare first >= matching item on leaf page, if any */
 		page = BufferGetPage(lbuf);
+		/* Should match on first heap TID when tuple has a posting list */
 		if (offnum <= PageGetMaxOffsetNumber(page) &&
+			insertstate.postingoff <= 0 &&
 			_bt_compare(state->rel, key, page, offnum) == 0)
 			exists = true;
 		_bt_relbuf(state->rel, lbuf);
@@ -2560,14 +2666,18 @@ static inline ItemPointer
 BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
 							bool nonpivot)
 {
-	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	ItemPointer result;
 	BlockNumber targetblock = state->targetblock;
 
-	if (result == NULL && nonpivot)
+	/* Shouldn't be called with heapkeyspace index */
+	Assert(state->heapkeyspace);
+	if (BTreeTupleIsPivot(itup) == nonpivot)
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
 						targetblock, RelationGetRelationName(state->rel))));
 
+	result = BTreeTupleGetHeapTID(itup);
+
 	return result;
 }
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index b5072c00fe..e6448e4a86 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -158,6 +158,15 @@ static relopt_bool boolRelOpts[] =
 		},
 		true
 	},
+	{
+		{
+			"deduplication",
+			"Enables deduplication on btree index leaf pages",
+			RELOPT_KIND_BTREE,
+			ShareUpdateExclusiveLock
+		},
+		true
+	},
 	/* list terminator */
 	{{NULL}}
 };
@@ -1513,8 +1522,6 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		offsetof(StdRdOptions, user_catalog_table)},
 		{"parallel_workers", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, parallel_workers)},
-		{"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
-		offsetof(StdRdOptions, vacuum_cleanup_index_scale_factor)},
 		{"vacuum_index_cleanup", RELOPT_TYPE_BOOL,
 		offsetof(StdRdOptions, vacuum_index_cleanup)},
 		{"vacuum_truncate", RELOPT_TYPE_BOOL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 2599b5d342..6e1dc596e1 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -276,6 +276,10 @@ BuildIndexValueDescription(Relation indexRelation,
 /*
  * Get the latestRemovedXid from the table entries pointed at by the index
  * tuples being deleted.
+ *
+ * Note: index access methods that don't consistently use the standard
+ * IndexTuple + heap TID item pointer representation will need to provide
+ * their own version of this function.
  */
 TransactionId
 index_compute_xid_horizon_for_tuples(Relation irel,
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6db203e75c..54cb9db49d 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
 like a hint bit for a heap tuple), but physically removing tuples requires
 exclusive lock.  In the current code we try to remove LP_DEAD tuples when
 we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already).  Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)
 
 This leaves the index in a state where it has no entry for a dead tuple
 that still exists in the heap.  This is not a problem for the current
@@ -710,6 +713,75 @@ the fallback strategy assumes that duplicates are mostly inserted in
 ascending heap TID order.  The page is split in a way that leaves the left
 half of the page mostly full, and the right half of the page mostly empty.
 
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits.  Deduplication alters
+the physical representation of tuples without changing the logical contents
+of the index, and without adding overhead to read queries.  Non-pivot
+tuples are folded together into a single physical tuple with a posting list
+(a simple array of heap TIDs with the standard item pointer format).
+Deduplication is always applied lazily, at the point where it would
+otherwise be necessary to perform a page split.  It occurs only when
+LP_DEAD items have been removed, as our last line of defense against
+splitting a leaf page.  We can set the LP_DEAD bit with posting list
+tuples, though only when all table tuples are known dead. (Bitmap scans
+cannot perform LP_DEAD bit setting, and are the common case with indexes
+that contain lots of duplicates, so this downside is considered
+acceptable.)
+
+Large groups of logical duplicates tend to appear together on the same leaf
+page due to the special duplicate logic used when choosing a split point.
+This facilitates lazy/dynamic deduplication.  Deduplication can reliably
+deduplicate a large localized group of duplicates before it can span
+multiple leaf pages.  Posting list tuples are subject to the same 1/3 of a
+page restriction as any other tuple.
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists.  A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple.  An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication.  Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple.  When the incoming
+tuple really does overlap with an existing posting list, a posting list
+split is performed.  Posting list splits work in a way that more or less
+preserves the illusion that all incoming tuples do not need to be merged
+with any existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list.  The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list.  The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list.  We make
+space in the posting list by removing the heap TID that became the new
+item.  The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists.  Also, overall
+space utilization is improved by keeping existing posting lists large.
+
+The representation of posting lists is identical to the posting lists used
+by GIN, so it would be straightforward to apply GIN's varbyte encoding
+compression scheme to individual posting lists.  Posting list compression
+would break the assumptions made by posting list splits about page space
+accounting, though, so it's not clear how compression could be integrated
+with nbtree.  Besides, posting list compression does not offer a compelling
+trade-off for nbtree, since in general nbtree is optimized for consistent
+performance with many concurrent readers and writers.  A major goal of
+nbtree's lazy approach to deduplication is to limit the performance impact
+of deduplication with random updates.  Even concurrent append-only inserts
+of the same key value will tend to have inserts of individual index tuples
+in an order that doesn't quite match heap TID order.  In general, delaying
+deduplication avoids many unnecessary posting list splits, and minimizes
+page level fragmentation.
+
 Notes About Data Representation
 -------------------------------
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b84bf1c3df..3d213dfd2d 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,21 +47,27 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   BTStack stack,
 						   IndexTuple itup,
 						   OffsetNumber newitemoff,
+						   int postingoff,
 						   bool split_only_page);
 static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 						Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
-						IndexTuple newitem);
+						IndexTuple newitem, IndexTuple orignewitem,
+						IndexTuple nposting, OffsetNumber postingoff);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 							  BTStack stack, bool is_root, bool is_only);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 						 OffsetNumber itup_off);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+							   IndexTuple newitem, Size newitemsz,
+							   bool checkingunique);
 
 /*
  *	_bt_doinsert() -- Handle insertion of a single index tuple in the tree.
  *
  *		This routine is called by the public interface routine, btinsert.
- *		By here, itup is filled in, including the TID.
+ *		By here, itup is filled in, including the TID.  Caller should be
+ *		prepared for us to scribble on 'itup'.
  *
  *		If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
  *		will allow duplicates.  Otherwise (UNIQUE_CHECK_YES or
@@ -123,6 +129,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	/* PageAddItem will MAXALIGN(), but be consistent */
 	insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 	insertstate.itup_key = itup_key;
+	insertstate.postingoff = 0;
 	insertstate.bounds_valid = false;
 	insertstate.buf = InvalidBuffer;
 
@@ -300,7 +307,7 @@ top:
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
 		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
-					   itup, newitemoff, false);
+					   itup, newitemoff, insertstate.postingoff, false);
 	}
 	else
 	{
@@ -428,14 +435,36 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 			if (!ItemIdIsDead(curitemid))
 			{
 				ItemPointerData htid;
+				bool		posting;
 				bool		all_dead;
+				bool		posting_all_dead;
+				int			npost;
+
 
 				if (_bt_compare(rel, itup_key, page, offset) != 0)
 					break;		/* we're past all the equal tuples */
 
 				/* okay, we gotta fetch the heap tuple ... */
 				curitup = (IndexTuple) PageGetItem(page, curitemid);
-				htid = curitup->t_tid;
+
+				if (!BTreeTupleIsPosting(curitup))
+				{
+					htid = curitup->t_tid;
+					posting = false;
+					posting_all_dead = true;
+				}
+				else
+				{
+					posting = true;
+					/* Initial assumption */
+					posting_all_dead = true;
+				}
+
+				npost = 0;
+		doposttup:
+				if (posting)
+					htid = *BTreeTupleGetPostingN(curitup, npost);
+
 
 				/*
 				 * If we are doing a recheck, we expect to find the tuple we
@@ -446,6 +475,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					ItemPointerCompare(&htid, &itup->t_tid) == 0)
 				{
 					found = true;
+					posting_all_dead = false;
+					if (posting)
+						goto nextpost;
 				}
 
 				/*
@@ -511,8 +543,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					 * not part of this chain because it had a different index
 					 * entry.
 					 */
-					htid = itup->t_tid;
-					if (table_index_fetch_tuple_check(heapRel, &htid,
+					if (table_index_fetch_tuple_check(heapRel, &itup->t_tid,
 													  SnapshotSelf, NULL))
 					{
 						/* Normal case --- it's still live */
@@ -570,7 +601,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 													RelationGetRelationName(rel))));
 					}
 				}
-				else if (all_dead)
+				else if (all_dead && !posting)
 				{
 					/*
 					 * The conflicting tuple (or whole HOT chain) is dead to
@@ -589,6 +620,35 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					else
 						MarkBufferDirtyHint(insertstate->buf, true);
 				}
+				else if (posting)
+				{
+			nextpost:
+					if (!all_dead)
+						posting_all_dead = false;
+
+					/* Iterate over single posting list tuple */
+					npost++;
+					if (npost < BTreeTupleGetNPosting(curitup))
+						goto doposttup;
+
+					/*
+					 * Mark posting tuple dead if all hot chains whose root is
+					 * contained in posting tuple have tuples that are all
+					 * dead
+					 */
+					if (posting_all_dead)
+					{
+						ItemIdMarkDead(curitemid);
+						opaque->btpo_flags |= BTP_HAS_GARBAGE;
+
+						if (nbuf != InvalidBuffer)
+							MarkBufferDirtyHint(nbuf, true);
+						else
+							MarkBufferDirtyHint(insertstate->buf, true);
+					}
+
+					/* Move on to next index tuple */
+				}
 			}
 		}
 
@@ -689,6 +749,7 @@ _bt_findinsertloc(Relation rel,
 	BTScanInsert itup_key = insertstate->itup_key;
 	Page		page = BufferGetPage(insertstate->buf);
 	BTPageOpaque lpageop;
+	OffsetNumber location;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -751,13 +812,26 @@ _bt_findinsertloc(Relation rel,
 
 		/*
 		 * If the target page is full, see if we can obtain enough space by
-		 * erasing LP_DEAD items
+		 * erasing LP_DEAD items.  If that doesn't work out, and if the index
+		 * deduplication is both possible and enabled, try deduplication.
 		 */
-		if (PageGetFreeSpace(page) < insertstate->itemsz &&
-			P_HAS_GARBAGE(lpageop))
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
 		{
-			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
-			insertstate->bounds_valid = false;
+			if (P_HAS_GARBAGE(lpageop))
+			{
+				_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+				insertstate->bounds_valid = false;
+			}
+
+			if (insertstate->itup_key->dedup_is_possible &&
+				BtreeGetDoDedupOption(rel) &&
+				PageGetFreeSpace(page) < insertstate->itemsz)
+			{
+				_bt_dedup_one_page(rel, insertstate->buf, heapRel,
+								   insertstate->itup, insertstate->itemsz,
+								   checkingunique);
+				insertstate->bounds_valid = false;
+			}
 		}
 	}
 	else
@@ -839,7 +913,38 @@ _bt_findinsertloc(Relation rel,
 	Assert(P_RIGHTMOST(lpageop) ||
 		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
 
-	return _bt_binsrch_insert(rel, insertstate);
+	location = _bt_binsrch_insert(rel, insertstate);
+
+	/*
+	 * Insertion is not prepared for the case where an LP_DEAD posting list
+	 * tuple must be split.  In the unlikely event that this happens, call
+	 * _bt_dedup_one_page() to force it to kill all LP_DEAD items.
+	 */
+	if (unlikely(insertstate->postingoff == -1))
+	{
+		Assert(insertstate->itup_key->dedup_is_possible);
+
+		/*
+		 * Don't check if the option is enabled, since no actual deduplication
+		 * will be done, just cleanup.
+		 */
+		_bt_dedup_one_page(rel, insertstate->buf, heapRel, insertstate->itup,
+						   0, checkingunique);
+		Assert(!P_HAS_GARBAGE(lpageop));
+
+		/* Must reset insertstate ahead of new _bt_binsrch_insert() call */
+		insertstate->bounds_valid = false;
+		insertstate->postingoff = 0;
+		location = _bt_binsrch_insert(rel, insertstate);
+
+		/*
+		 * Might still have to split some other posting list now, but that
+		 * should never be LP_DEAD
+		 */
+		Assert(insertstate->postingoff >= 0);
+	}
+
+	return location;
 }
 
 /*
@@ -900,15 +1005,81 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
 	insertstate->bounds_valid = false;
 }
 
+/*
+ * Form a new posting list during a posting split.
+ *
+ * If caller determines that its new tuple 'newitem' is a duplicate with a
+ * heap TID that falls inside the range of an existing posting list tuple
+ * 'oposting', it must generate a new posting tuple to replace the original.
+ * The new posting list is guaranteed to be the same size as the original.
+ * Caller must also change newitem to have the heap TID of the rightmost TID
+ * in the original posting list.  Both steps are always handled by calling
+ * here.
+ *
+ * Returns new posting list palloc()'d in caller's context.  Also modifies
+ * caller's newitem to contain final/effective heap TID, which is what caller
+ * actually inserts on the page.
+ *
+ * Exported for use by recovery.  Note that recovery path must recreate the
+ * same version of newitem that is passed here on the primary, even though
+ * that differs from the final newitem actually added to the page.  This
+ * optimization avoids explicit WAL-logging of entire posting lists, which
+ * tend to be rather large.
+ */
+IndexTuple
+_bt_posting_split(IndexTuple newitem, IndexTuple oposting,
+				  OffsetNumber postingoff)
+{
+	int			nhtids;
+	char	   *replacepos;
+	char	   *rightpos;
+	Size		nbytes;
+	IndexTuple	nposting;
+
+	Assert(BTreeTupleIsPosting(oposting));
+	nhtids = BTreeTupleGetNPosting(oposting);
+	Assert(postingoff < nhtids);
+
+	nposting = CopyIndexTuple(oposting);
+	replacepos = (char *) BTreeTupleGetPostingN(nposting, postingoff);
+	rightpos = replacepos + sizeof(ItemPointerData);
+	nbytes = (nhtids - postingoff - 1) * sizeof(ItemPointerData);
+
+	/*
+	 * Move item pointers in posting list to make a gap for the new item's
+	 * heap TID (shift TIDs one place to the right, losing original rightmost
+	 * TID).
+	 */
+	memmove(rightpos, replacepos, nbytes);
+
+	/*
+	 * Fill the gap with the TID of the new item.
+	 */
+	ItemPointerCopy(&newitem->t_tid, (ItemPointer) replacepos);
+
+	/*
+	 * Copy original (not new original) posting list's last TID into new item
+	 */
+	ItemPointerCopy(BTreeTupleGetPostingN(oposting, nhtids - 1),
+					&newitem->t_tid);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(nposting),
+							  BTreeTupleGetHeapTID(newitem)) < 0);
+	Assert(BTreeTupleGetNPosting(nposting) == BTreeTupleGetNPosting(oposting));
+
+	return nposting;
+}
+
 /*----------
  *	_bt_insertonpg() -- Insert a tuple on a particular page in the index.
  *
  *		This recursive procedure does the following things:
  *
+ *			+  if necessary, splits an existing posting list on page.
+ *			   This is only needed when 'postingoff' is non-zero.
  *			+  if necessary, splits the target page, using 'itup_key' for
  *			   suffix truncation on leaf pages (caller passes NULL for
  *			   non-leaf pages).
- *			+  inserts the tuple.
+ *			+  inserts the new tuple (could be from split posting list).
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
  *			   right using information stored in the parent stack).
@@ -918,7 +1089,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		On entry, we must have the correct buffer in which to do the
  *		insertion, and the buffer must be pinned and write-locked.  On return,
- *		we will have dropped both the pin and the lock on the buffer.
+ *		we will have dropped both the pin and the lock on the buffer.  Caller
+ *		should be prepared for us to scribble on 'itup'.
  *
  *		This routine only performs retail tuple insertions.  'itup' should
  *		always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +1108,15 @@ _bt_insertonpg(Relation rel,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
+			   int postingoff,
 			   bool split_only_page)
 {
 	Page		page;
 	BTPageOpaque lpageop;
 	Size		itemsz;
+	IndexTuple	oposting;
+	IndexTuple	origitup = NULL;
+	IndexTuple	nposting = NULL;
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1130,8 @@ _bt_insertonpg(Relation rel,
 	Assert(P_ISLEAF(lpageop) ||
 		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
+	/* retail insertions of posting list tuples are disallowed */
+	Assert(!BTreeTupleIsPosting(itup));
 
 	/* The caller should've finished any incomplete splits already. */
 	if (P_INCOMPLETE_SPLIT(lpageop))
@@ -964,6 +1142,46 @@ _bt_insertonpg(Relation rel,
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
+	/*
+	 * Do we need to split an existing posting list item?
+	 */
+	if (postingoff != 0)
+	{
+		ItemId		itemid = PageGetItemId(page, newitemoff);
+
+		/*
+		 * The new tuple is a duplicate with a heap TID that falls inside the
+		 * range of an existing posting list tuple, so split posting list.
+		 *
+		 * Posting list splits always replace some existing TID in the posting
+		 * list with the new item's heap TID (based on a posting list offset
+		 * from caller) by removing rightmost heap TID from posting list.  The
+		 * new item's heap TID is swapped with that rightmost heap TID, almost
+		 * as if the tuple inserted never overlapped with a posting list in
+		 * the first place.  This allows the insertion and page split code to
+		 * have minimal special case handling of posting lists.
+		 *
+		 * The only extra handling required is to overwrite the original
+		 * posting list with nposting, which is guaranteed to be the same size
+		 * as the original, keeping the page space accounting simple.  This
+		 * takes place in either the page insert or page split critical
+		 * section.
+		 */
+		Assert(P_ISLEAF(lpageop));
+		Assert(!ItemIdIsDead(itemid));
+		Assert(postingoff > 0);
+		oposting = (IndexTuple) PageGetItem(page, itemid);
+
+		/* save a copy of itup with unchanged TID to write it into xlog record */
+		origitup = CopyIndexTuple(itup);
+		nposting = _bt_posting_split(itup, oposting, postingoff);
+
+		Assert(BTreeTupleGetNPosting(nposting) ==
+			   BTreeTupleGetNPosting(oposting));
+		/* Alter new item offset, since effective new item changed */
+		newitemoff = OffsetNumberNext(newitemoff);
+	}
+
 	/*
 	 * Do we need to split the page to fit the item on it?
 	 *
@@ -996,7 +1214,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+						 origitup, nposting, postingoff);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1075,6 +1294,18 @@ _bt_insertonpg(Relation rel,
 			elog(PANIC, "failed to add new item to block %u in index \"%s\"",
 				 itup_blkno, RelationGetRelationName(rel));
 
+		if (nposting)
+		{
+			/*
+			 * Posting list split requires an in-place update of the existing
+			 * posting list
+			 */
+			Assert(P_ISLEAF(lpageop));
+			Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+				   MAXALIGN(IndexTupleSize(nposting)));
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+		}
+
 		MarkBufferDirty(buf);
 
 		if (BufferIsValid(metabuf))
@@ -1116,6 +1347,7 @@ _bt_insertonpg(Relation rel,
 			XLogRecPtr	recptr;
 
 			xlrec.offnum = itup_off;
+			xlrec.postingoff = postingoff;
 
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
@@ -1144,6 +1376,7 @@ _bt_insertonpg(Relation rel,
 				xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 				xlmeta.last_cleanup_num_heap_tuples =
 					metad->btm_last_cleanup_num_heap_tuples;
+				xlmeta.btm_dedup_is_possible = metad->btm_dedup_is_possible;
 
 				XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 				XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));
@@ -1152,7 +1385,19 @@ _bt_insertonpg(Relation rel,
 			}
 
 			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
-			XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+
+			/*
+			 * We always write newitem to the page, but when there is an
+			 * original newitem due to a posting list split then we log the
+			 * original item instead.  REDO routine must reconstruct the final
+			 * newitem at the same time it reconstructs nposting.
+			 */
+			if (postingoff == 0)
+				XLogRegisterBufData(0, (char *) itup,
+									IndexTupleSize(itup));
+			else
+				XLogRegisterBufData(0, (char *) origitup,
+									IndexTupleSize(origitup));
 
 			recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
@@ -1194,6 +1439,13 @@ _bt_insertonpg(Relation rel,
 			_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
 			RelationSetTargetBlock(rel, cachedBlock);
 	}
+
+	/* be tidy */
+	if (postingoff != 0)
+	{
+		pfree(nposting);
+		pfree(origitup);
+	}
 }
 
 /*
@@ -1209,12 +1461,25 @@ _bt_insertonpg(Relation rel,
  *		This function will clear the INCOMPLETE_SPLIT flag on it, and
  *		release the buffer.
  *
+ *		orignewitem, nposting, and postingoff are needed when an insert of
+ *		orignewitem results in both a posting list split and a page split.
+ *		newitem and nposting are replacements for orignewitem and the
+ *		existing posting list on the page respectively.  These extra
+ *		posting list split details are used here in the same way as they
+ *		are used in the more common case where a posting list split does
+ *		not coincide with a page split.  We need to deal with posting list
+ *		splits directly in order to ensure that everything that follows
+ *		from the insert of orignewitem is handled as a single atomic
+ *		operation (though caller's insert of a new pivot/downlink into
+ *		parent page will still be a separate operation).
+ *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
  */
 static Buffer
 _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+		  IndexTuple orignewitem, IndexTuple nposting, OffsetNumber postingoff)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1236,12 +1501,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	OffsetNumber firstright;
 	OffsetNumber maxoff;
 	OffsetNumber i;
+	OffsetNumber replacepostingoff = InvalidOffsetNumber;
 	bool		newitemonleft,
 				isleaf;
 	IndexTuple	lefthikey;
 	int			indnatts = IndexRelationGetNumberOfAttributes(rel);
 	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
+	/*
+	 * Determine offset number of existing posting list on page when a split
+	 * of a posting list needs to take place as the page is split
+	 */
+	if (nposting != NULL)
+	{
+		Assert(itup_key->heapkeyspace);
+		replacepostingoff = OffsetNumberPrev(newitemoff);
+	}
+
 	/*
 	 * origpage is the original page to be split.  leftpage is a temporary
 	 * buffer that receives the left-sibling data, which will be copied back
@@ -1273,6 +1549,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	 * newitemoff == firstright.  In all other cases it's clear which side of
 	 * the split every tuple goes on from context.  newitemonleft is usually
 	 * (but not always) redundant information.
+	 *
+	 * Note: In theory, the split point choice logic should operate against a
+	 * version of the page that already replaced the posting list at offset
+	 * replacepostingoff with nposting where applicable.  We don't bother with
+	 * that, though.  Both versions of the posting list must be the same size,
+	 * and both will have the same base tuple key values, so split point
+	 * choice is never affected.
 	 */
 	firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
 								  newitem, &newitemonleft);
@@ -1340,6 +1623,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, firstright);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		/* Behave as if origpage posting list has already been swapped */
+		if (firstright == replacepostingoff)
+			item = nposting;
 	}
 
 	/*
@@ -1373,6 +1659,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
 			itemid = PageGetItemId(origpage, lastleftoff);
 			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+			/* Behave as if origpage posting list has already been swapped */
+			if (lastleftoff == replacepostingoff)
+				lastleft = nposting;
 		}
 
 		Assert(lastleft != item);
@@ -1480,8 +1769,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 
+		/*
+		 * did caller pass new replacement posting list tuple due to posting
+		 * list split?
+		 */
+		if (i == replacepostingoff)
+		{
+			/*
+			 * swap origpage posting list with post-posting-list-split version
+			 * from caller
+			 */
+			Assert(isleaf);
+			Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+			item = nposting;
+		}
+
 		/* does new item belong before this one? */
-		if (i == newitemoff)
+		else if (i == newitemoff)
 		{
 			if (newitemonleft)
 			{
@@ -1650,8 +1954,12 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		XLogRecPtr	recptr;
 
 		xlrec.level = ropaque->btpo.level;
+		/* See comments below on newitem, orignewitem, and posting lists */
 		xlrec.firstright = firstright;
 		xlrec.newitemoff = newitemoff;
+		xlrec.postingoff = InvalidOffsetNumber;
+		if (replacepostingoff < firstright)
+			xlrec.postingoff = postingoff;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1670,11 +1978,46 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		 * because it's included with all the other items on the right page.)
 		 * Show the new item as belonging to the left page buffer, so that it
 		 * is not stored if XLogInsert decides it needs a full-page image of
-		 * the left page.  We store the offset anyway, though, to support
-		 * archive compression of these records.
+		 * the left page.  We always store newitemoff in record, though.
+		 *
+		 * The details are often slightly different for page splits that
+		 * coincide with a posting list split.  If both the replacement
+		 * posting list and newitem go on the right page, then we don't need
+		 * to log anything extra, just like the simple !newitemonleft
+		 * no-posting-split case (postingoff isn't set in the WAL record, so
+		 * recovery can't even tell the difference).  Otherwise, we set
+		 * postingoff and log orignewitem instead of newitem, despite having
+		 * actually inserted newitem.  Recovery must reconstruct nposting and
+		 * newitem by repeating the actions of our caller (i.e. by passing
+		 * original posting list and orignewitem to _bt_posting_split()).
+		 *
+		 * Note: It's possible that our page split point is the point that
+		 * makes the posting list lastleft and newitem firstright.  This is
+		 * the only case where we log orignewitem despite newitem going on the
+		 * right page.  If XLogInsert decides that it can omit orignewitem due
+		 * to logging a full-page image of the left page, everything still
+		 * works out, since recovery only needs to log orignewitem for items
+		 * on the left page (just like the regular newitem-logged case).
 		 */
-		if (newitemonleft)
-			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+		if (newitemonleft || xlrec.postingoff != InvalidOffsetNumber)
+		{
+			if (xlrec.postingoff == InvalidOffsetNumber)
+			{
+				/* Must WAL-log newitem, since it's on left page */
+				Assert(newitemonleft);
+				Assert(orignewitem == NULL && nposting == NULL);
+				XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+			}
+			else
+			{
+				/* Must WAL-log orignewitem following posting list split */
+				Assert(newitemonleft || firstright == newitemoff);
+				Assert(ItemPointerCompare(&orignewitem->t_tid,
+										  &newitem->t_tid) < 0);
+				XLogRegisterBufData(0, (char *) orignewitem,
+									MAXALIGN(IndexTupleSize(orignewitem)));
+			}
+		}
 
 		/* Log the left page's new high key */
 		itemid = PageGetItemId(origpage, P_HIKEY);
@@ -1834,7 +2177,7 @@ _bt_insert_parent(Relation rel,
 
 		/* Recursively insert into the parent */
 		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
-					   new_item, stack->bts_offset + 1,
+					   new_item, stack->bts_offset + 1, 0,
 					   is_only);
 
 		/* be tidy */
@@ -2190,6 +2533,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 		md.fastlevel = metad->btm_level;
 		md.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 		md.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+		md.btm_dedup_is_possible = metad->btm_dedup_is_possible;
 
 		XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -2304,6 +2648,472 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * Note: if we didn't find any LP_DEAD items, then the page's
 	 * BTP_HAS_GARBAGE hint bit is falsely set.  We do not bother expending a
 	 * separate write to clear it, however.  We will clear it when we split
-	 * the page.
+	 * the page (or when deduplication runs).
 	 */
 }
+
+/*
+ * Try to deduplicate items to free at least enough space to avoid a page
+ * split.  This function should be called after LP_DEAD items were removed by
+ * _bt_vacuum_one_page() to prevent a page split.  (We'll have to kill LP_DEAD
+ * items here when the page's BTP_HAS_GARBAGE hint was not set, but that
+ * should be rare.)
+ *
+ * The strategy for !checkingunique callers is to perform as much
+ * deduplication as possible to free as much space as possible now, since
+ * making it harder to set LP_DEAD bits is considered an acceptable price for
+ * not having to deduplicate the same page many times.  It is unlikely that
+ * the items on the page will have their LP_DEAD bit set in the future, since
+ * that hasn't happened before now (besides, entire posting lists can still
+ * have their LP_DEAD bit set).
+ *
+ * The strategy for checkingunique callers is rather different, since the
+ * overall goal is different.  Deduplication cooperates with and enhances
+ * garbage collection, especially the LP_DEAD bit setting that takes place in
+ * _bt_check_unique().  Deduplication does as little as possible while still
+ * preventing a page split for caller, since it's less likely that posting
+ * lists will have their LP_DEAD bit set.  Deduplication avoids creating new
+ * posting lists with only two heap TIDs, and also avoids creating new posting
+ * lists from an existing posting list.  Deduplication is only useful when it
+ * delays a page split long enough for garbage collection to prevent the page
+ * split altogether.  checkingunique deduplication can make all the difference
+ * in cases where VACUUM keeps up with dead index tuples, but "recently dead"
+ * index tuples are still numerous enough to cause page splits that are truly
+ * unnecessary.
+ *
+ * Note: If newitem contains NULL values in key attributes, caller will be
+ * !checkingunique even when rel is a unique index.  The page in question will
+ * usually have many existing items with NULLs.
+ */
+static void
+_bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+				   IndexTuple newitem, Size newitemsz, bool checkingunique)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buffer);
+	BTPageOpaque oopaque;
+	BTDedupState *state = NULL;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+	OffsetNumber deletable[MaxIndexTuplesPerPage];
+	bool		minimal = checkingunique;
+	int			ndeletable = 0;
+	Size		pagesaving = 0;
+
+	oopaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	/* init deduplication state needed to build posting tuples */
+	state = (BTDedupState *) palloc(sizeof(BTDedupState));
+	state->rel = rel;
+
+	state->maxitemsize = BTMaxItemSize(page);
+	state->newitem = newitem;
+	state->checkingunique = checkingunique;
+	/* Metadata about current pending posting list */
+	state->htids = NULL;
+	state->nhtids = 0;
+	state->nitems = 0;
+	state->alltupsize = 0;
+	state->overlap = false;
+	/* Metadata about based tuple of current pending posting list */
+	state->base = NULL;
+	state->baseoff = InvalidOffsetNumber;
+	state->basetupsize = 0;
+
+	minoff = P_FIRSTDATAKEY(oopaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/*
+	 * Delete dead tuples if any. We cannot simply skip them in the cycle
+	 * below, because it's necessary to generate special Xlog record
+	 * containing such tuples to compute latestRemovedXid on a standby server
+	 * later.
+	 *
+	 * This should not affect performance, since it only can happen in a rare
+	 * situation when BTP_HAS_GARBAGE flag was not set and _bt_vacuum_one_page
+	 * was not called, or _bt_vacuum_one_page didn't remove all dead items.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+
+		if (ItemIdIsDead(itemid))
+			deletable[ndeletable++] = offnum;
+	}
+
+	if (ndeletable > 0)
+	{
+		/*
+		 * Skip duplication in rare cases where there were LP_DEAD items
+		 * encountered here when that frees sufficient space for caller to
+		 * avoid a page split
+		 */
+		_bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+		if (PageGetFreeSpace(page) >= newitemsz)
+		{
+			pfree(state);
+			return;
+		}
+
+		/* Continue with deduplication */
+		minoff = P_FIRSTDATAKEY(oopaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+	}
+
+	/* Make sure that new page won't have garbage flag set */
+	oopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+	newitemsz += sizeof(ItemIdData);
+	/* Conservatively size array */
+	state->htids = palloc(state->maxitemsize);
+
+	/*
+	 * Iterate over tuples on the page, try to deduplicate them into posting
+	 * lists and insert into new page.  NOTE: It's essential to reassess the
+	 * max offset on each iteration, since it will change as items are
+	 * deduplicated.
+	 */
+retry:
+	offnum = minoff;
+	while (offnum <= PageGetMaxOffsetNumber(page))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(!ItemIdIsDead(itemid));
+
+		if (state->nitems == 0)
+		{
+			/*
+			 * No previous/base tuple for the data item -- use the data item
+			 * as base tuple of pending posting list
+			 */
+			_bt_dedup_start_pending(state, itup, offnum);
+		}
+		else if (_bt_keep_natts_fast(rel, state->base, itup) > natts &&
+				 _bt_dedup_save_htid(state, itup))
+		{
+			/*
+			 * Tuple is equal to base tuple of pending posting list, and
+			 * merging itup into pending posting list won't exceed the
+			 * BTMaxItemSize() limit.  Heap TID(s) for itup have been saved in
+			 * state.  The next iteration will also end up here if it's
+			 * possible to merge the next tuple into the same pending posting
+			 * list.
+			 */
+		}
+		else
+		{
+			/*
+			 * Tuple is not equal to pending posting list tuple, or
+			 * BTMaxItemSize() limit was reached.
+			 *
+			 * If state contains pending posting list with more than one item,
+			 * form new posting tuple, and update the page, otherwise, just
+			 * reset the state and move on.
+			 */
+			pagesaving += _bt_dedup_finish_pending(buffer, state,
+												   RelationNeedsWAL(rel));
+
+			/*
+			 * When caller is a checkingunique caller and we have deduplicated
+			 * enough to avoid a page split, do minimal deduplication.  Don't
+			 * prematurely deduplicate items that could still have their
+			 * LP_DEAD bits set.
+			 */
+			if (minimal && pagesaving >= newitemsz)
+				break;
+
+			/* Continue iteration from base tuple's offnum */
+			offnum = state->baseoff;
+		}
+
+		offnum = OffsetNumberNext(offnum);
+	}
+
+	/* Handle the last item when pending posting list is not empty */
+	if (state->nitems != 0)
+		pagesaving += _bt_dedup_finish_pending(buffer, state,
+											   RelationNeedsWAL(rel));
+
+	if (state->checkingunique && pagesaving < newitemsz)
+	{
+		/*
+		 * Try again.  The second pass over the page may deduplicate items
+		 * that were passed over the first time due to concerns about limiting
+		 * the effectiveness of LP_DEAD bit setting within _bt_check_unique().
+		 * Note that we will still stop deduplicating as soon as enough space
+		 * has been freed to avoid caller's page split.
+		 *
+		 * FIXME: Don't bother with this when it's clearly a total waste of
+		 * time.  Maybe don't do any checkingunique deduplication for the
+		 * rightmost page, either.
+		 */
+		state->checkingunique = false;
+		state->alltupsize = 0;
+		state->nitems = 0;
+		state->base = NULL;
+		state->baseoff = InvalidOffsetNumber;
+		state->basetupsize = 0;
+		goto retry;
+	}
+
+	/* be tidy */
+	pfree(state->htids);
+	pfree(state);
+}
+
+/*
+ * Create a new pending posting list tuple based on caller's tuple.
+ *
+ * Every tuple processed by the deduplication routines either becomes the base
+ * tuple for a posting list, or gets its heap TID(s) accepted into a pending
+ * posting list.  A tuple that starts out as the base tuple for a posting list
+ * will only actually be rewritten within _bt_dedup_finish_pending() when
+ * there was at least one successful call to _bt_dedup_save_htid().
+ *
+ * Exported for use by nbtsort.c and recovery.
+ */
+void
+_bt_dedup_start_pending(BTDedupState *state, IndexTuple base,
+						OffsetNumber baseoff)
+{
+	Assert(state->nhtids == 0);
+	Assert(state->nitems == 0);
+
+	/*
+	 * Copy heap TIDs from new base tuple for new candidate posting list into
+	 * ipd array.  Assume that we'll eventually create a new posting tuple by
+	 * merging later tuples with this existing one, though we may not.
+	 */
+	if (!BTreeTupleIsPosting(base))
+	{
+		memcpy(state->htids, base, sizeof(ItemPointerData));
+		state->nhtids = 1;
+		/* Save size of tuple without any posting list */
+		state->basetupsize = IndexTupleSize(base);
+	}
+	else
+	{
+		int			nposting;
+
+		nposting = BTreeTupleGetNPosting(base);
+		memcpy(state->htids, BTreeTupleGetPosting(base),
+			   sizeof(ItemPointerData) * nposting);
+		state->nhtids = nposting;
+		/* Save size of tuple without any posting list */
+		state->basetupsize = BTreeTupleGetPostingOffset(base);
+	}
+
+	/*
+	 * Save new base tuple itself -- it'll be needed if we actually create a
+	 * new posting list from new pending posting list.
+	 *
+	 * Must maintain size of all tuples (including line pointer overhead) to
+	 * calculate space savings on page within _bt_dedup_finish_pending().
+	 * Also, save number of base tuple logical tuples so that we can save
+	 * cycles in the common case where an existing posting list can't or won't
+	 * be merged with other tuples on the page.
+	 */
+	state->nitems = 1;
+	state->base = base;
+	state->baseoff = baseoff;
+	state->alltupsize = MAXALIGN(IndexTupleSize(base)) + sizeof(ItemIdData);
+	/* Also save baseoff in pending state for interval */
+	state->interval.baseoff = state->baseoff;
+	state->overlap = false;
+	if (state->newitem)
+	{
+		/* Might overlap with new item -- mark it as possible if it is */
+		if (BTreeTupleGetHeapTID(base) < BTreeTupleGetHeapTID(state->newitem))
+			state->overlap = true;
+	}
+}
+
+/*
+ * Save itup heap TID(s) into pending posting list where possible.
+ *
+ * Returns bool indicating if the pending posting list managed by state has
+ * itup's heap TID(s) saved.  When this is false, enlarging the pending
+ * posting list by the required amount would exceed the maxitemsize limit, so
+ * caller must finish the pending posting list tuple.  (Generally itup becomes
+ * the base tuple of caller's new pending posting list).
+ *
+ * Exported for use by nbtsort.c and recovery.
+ */
+bool
+_bt_dedup_save_htid(BTDedupState *state, IndexTuple itup)
+{
+	int			nhtids;
+	ItemPointer htids;
+	Size		mergedtupsz;
+
+	if (!BTreeTupleIsPosting(itup))
+	{
+		nhtids = 1;
+		htids = &itup->t_tid;
+	}
+	else
+	{
+		nhtids = BTreeTupleGetNPosting(itup);
+		htids = BTreeTupleGetPosting(itup);
+	}
+
+	/*
+	 * Don't append (have caller finish pending posting list as-is) if
+	 * appending heap TID(s) from itup would put us over limit
+	 */
+	mergedtupsz = MAXALIGN(state->basetupsize +
+						   (state->nhtids + nhtids) *
+						   sizeof(ItemPointerData));
+
+	if (mergedtupsz > state->maxitemsize)
+		return false;
+
+	/* Don't merge existing posting lists with checkingunique */
+	if (state->checkingunique && BTreeTupleIsPosting(state->base))
+		return false;
+	if (state->checkingunique && nhtids > 1)
+		return false;
+
+	if (state->overlap)
+	{
+		if (BTreeTupleGetMaxHeapTID(itup) > BTreeTupleGetHeapTID(state->newitem))
+		{
+			/*
+			 * newitem has heap TID in the range of the would-be new posting
+			 * list.  Avoid an immediate posting list split for caller.
+			 */
+			if (_bt_keep_natts_fast(state->rel, state->newitem, itup) >
+				IndexRelationGetNumberOfAttributes(state->rel))
+			{
+				state->newitem = NULL;	/* avoid unnecessary comparisons */
+				return false;
+			}
+		}
+	}
+
+	/*
+	 * Save heap TIDs to pending posting list tuple -- itup can be merged into
+	 * pending posting list
+	 */
+	state->nitems++;
+	memcpy(state->htids + state->nhtids, htids,
+		   sizeof(ItemPointerData) * nhtids);
+	state->nhtids += nhtids;
+	state->alltupsize += MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
+
+	return true;
+}
+
+/*
+ * Finalize pending posting list tuple, and add it to the page.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * Returns space saving from deduplicating to make a new posting list tuple.
+ * Note that this includes line pointer overhead.  This is zero in the case
+ * where no deduplication was possible.
+ *
+ * Exported for use by recovery.
+ */
+Size
+_bt_dedup_finish_pending(Buffer buffer, BTDedupState *state, bool need_wal)
+{
+	Size		spacesaving = 0;
+	Page		page = BufferGetPage(buffer);
+	int			minimum = 2;
+
+	Assert(state->nitems > 0);
+	Assert(state->nitems <= state->nhtids);
+	Assert(state->interval.baseoff == state->baseoff);
+
+	/*
+	 * Only create a posting list when at least 3 heap TIDs will appear in the
+	 * checkingunique case (checkingunique strategy won't merge existing
+	 * posting list tuples, so we know that the number of items here must also
+	 * be the total number of heap TIDs).  Creating a new posting lists with
+	 * only two heap TIDs won't even save enough space to fit another
+	 * duplicate with the same key as the posting list.  This is a bad
+	 * trade-off if there is a chance that the LP_DEAD bit can be set for
+	 * either existing tuple by putting off deduplication.
+	 *
+	 * (Note that a second pass over the page can deduplicate the item if that
+	 * is truly the only way to avoid a page split for checkingunique caller)
+	 */
+	Assert(!state->checkingunique ||
+		   state->nitems == 1 || state->nhtids == state->nitems);
+	if (state->checkingunique)
+		minimum = 3;
+
+	if (state->nitems >= minimum)
+	{
+		IndexTuple	final;
+		Size		finalsz;
+		OffsetNumber offnum;
+		OffsetNumber deletable[MaxOffsetNumber];
+		int			ndeletable = 0;
+
+		/* find all tuples that will be replaced with this new posting tuple */
+		for (offnum = state->baseoff;
+			 offnum < state->baseoff + state->nitems;
+			 offnum = OffsetNumberNext(offnum))
+			deletable[ndeletable++] = offnum;
+
+		/* Form a tuple with a posting list */
+		final = BTreeFormPostingTuple(state->base, state->htids,
+									  state->nhtids);
+		finalsz = IndexTupleSize(final);
+		spacesaving = state->alltupsize - (finalsz + sizeof(ItemIdData));
+		/* Must have saved some space */
+		Assert(spacesaving > 0 && spacesaving < BLCKSZ);
+
+		/* Save final number of items for posting list */
+		state->interval.nitems = state->nitems;
+
+		Assert(finalsz <= state->maxitemsize);
+		Assert(finalsz == MAXALIGN(IndexTupleSize(final)));
+
+		START_CRIT_SECTION();
+
+		/* Delete items to replace */
+		PageIndexMultiDelete(page, deletable, ndeletable);
+		/* Insert posting tuple */
+		if (PageAddItem(page, (Item) final, finalsz, state->baseoff, false,
+						false) == InvalidOffsetNumber)
+			elog(ERROR, "deduplication failed to add tuple to page");
+
+		MarkBufferDirty(buffer);
+
+		/* Log deduplicated items */
+		if (need_wal)
+		{
+			XLogRecPtr	recptr;
+			xl_btree_dedup xlrec_dedup;
+
+			xlrec_dedup.baseoff = state->interval.baseoff;
+			xlrec_dedup.nitems = state->interval.nitems;
+
+			XLogBeginInsert();
+			XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+			XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+			recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP_PAGE);
+
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
+
+		pfree(final);
+	}
+
+	/* Reset state for next pending posting list */
+	state->nhtids = 0;
+	state->nitems = 0;
+	state->alltupsize = 0;
+
+	return spacesaving;
+}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 268f869a36..c08f850595 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -42,12 +43,17 @@ static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
 								   BlockNumber *target, BlockNumber *rightsib);
 static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
 							   TransactionId latestRemovedXid);
+static TransactionId _bt_compute_xid_horizon_for_tuples(Relation rel,
+														Relation heapRel,
+														Buffer buf,
+														OffsetNumber *itemnos,
+														int nitems);
 
 /*
  *	_bt_initmetapage() -- Fill a page buffer with a correct metapage image
  */
 void
-_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
+_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level, bool dedup_is_possible)
 {
 	BTMetaPageData *metad;
 	BTPageOpaque metaopaque;
@@ -63,6 +69,7 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
 	metad->btm_fastlevel = level;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+	metad->btm_dedup_is_possible = dedup_is_possible;
 
 	metaopaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	metaopaque->btpo_flags = BTP_META;
@@ -213,6 +220,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 		md.fastlevel = metad->btm_fastlevel;
 		md.oldest_btpo_xact = oldestBtpoXact;
 		md.last_cleanup_num_heap_tuples = numHeapTuples;
+		md.btm_dedup_is_possible = metad->btm_dedup_is_possible;
 
 		XLogRegisterBufData(0, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -394,6 +402,7 @@ _bt_getroot(Relation rel, int access)
 			md.fastlevel = 0;
 			md.oldest_btpo_xact = InvalidTransactionId;
 			md.last_cleanup_num_heap_tuples = -1.0;
+			md.btm_dedup_is_possible = metad->btm_dedup_is_possible;
 
 			XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -683,6 +692,63 @@ _bt_heapkeyspace(Relation rel)
 	return metad->btm_version > BTREE_NOVAC_VERSION;
 }
 
+/*
+ *	_bt_get_dedupispossible() -- is deduplication possible for the index?
+ * 	get information from metapage
+ */
+bool
+_bt_getdedupispossible(Relation rel)
+{
+	BTMetaPageData *metad;
+
+	if (rel->rd_amcache == NULL)
+	{
+		Buffer		metabuf;
+
+		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+		metad = _bt_getmeta(rel, metabuf);
+
+		/*
+		 * If there's no root page yet, _bt_getroot() doesn't expect a cache
+		 * to be made, so just stop here.  (XXX perhaps _bt_getroot() should
+		 * be changed to allow this case.)
+		 *
+		 * FIXME: Think some more about pg_upgrade'd !heapkeyspace indexes
+		 * here, and the need for aa version bump to go with new metapage
+		 * field.
+		 */
+		if (metad->btm_root == P_NONE)
+		{
+			_bt_relbuf(rel, metabuf);
+			return metad->btm_dedup_is_possible;;
+		}
+
+		/*
+		 * Cache the metapage data for next time
+		 *
+		 * An on-the-fly version upgrade performed by _bt_upgrademetapage()
+		 * can change the nbtree version for an index without invalidating any
+		 * local cache.  This is okay because it can only happen when moving
+		 * from version 2 to version 3, both of which are !heapkeyspace
+		 * versions.
+		 */
+		rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
+											 sizeof(BTMetaPageData));
+		memcpy(rel->rd_amcache, metad, sizeof(BTMetaPageData));
+		_bt_relbuf(rel, metabuf);
+	}
+
+	/* Get cached page */
+	metad = (BTMetaPageData *) rel->rd_amcache;
+	/* We shouldn't have cached it if any of these fail */
+	Assert(metad->btm_magic == BTREE_MAGIC);
+	Assert(metad->btm_version >= BTREE_MIN_VERSION);
+	Assert(metad->btm_version <= BTREE_VERSION);
+	Assert(metad->btm_fastroot != P_NONE);
+
+	return metad->btm_dedup_is_possible;
+}
+
 /*
  *	_bt_checkpage() -- Verify that a freshly-read page looks sane.
  */
@@ -983,14 +1049,52 @@ _bt_page_recyclable(Page page)
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
 					OffsetNumber *itemnos, int nitems,
+					OffsetNumber *updateitemnos,
+					IndexTuple *updated, int nupdatable,
 					BlockNumber lastBlockVacuumed)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	Size		itemsz;
+	Size		updated_sz = 0;
+	char	   *updated_buf = NULL;
+
+	/* XLOG stuff, buffer for updateds */
+	if (nupdatable > 0 && RelationNeedsWAL(rel))
+	{
+		Size		offset = 0;
+
+		for (int i = 0; i < nupdatable; i++)
+			updated_sz += MAXALIGN(IndexTupleSize(updated[i]));
+
+		updated_buf = palloc(updated_sz);
+		for (int i = 0; i < nupdatable; i++)
+		{
+			itemsz = IndexTupleSize(updated[i]);
+			memcpy(updated_buf + offset, (char *) updated[i], itemsz);
+			offset += MAXALIGN(itemsz);
+		}
+		Assert(offset == updated_sz);
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
+	/* Handle posting tuples here */
+	for (int i = 0; i < nupdatable; i++)
+	{
+		/* At first, delete the old tuple. */
+		PageIndexTupleDelete(page, updateitemnos[i]);
+
+		itemsz = IndexTupleSize(updated[i]);
+		itemsz = MAXALIGN(itemsz);
+
+		/* Add tuple with updated ItemPointers to the page. */
+		if (PageAddItem(page, (Item) updated[i], itemsz, updateitemnos[i],
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to rewrite posting list item in index while doing vacuum");
+	}
+
 	/* Fix the page */
 	if (nitems > 0)
 		PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1124,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+		xlrec_vacuum.nupdated = nupdatable;
+		xlrec_vacuum.ndeleted = nitems;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1139,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		if (nitems > 0)
 			XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
 
+		/*
+		 * Here we should save offnums and updated tuples themselves. It's
+		 * important to restore them in correct order. At first, we must
+		 * handle updated tuples and only after that other deleted items.
+		 */
+		if (nupdatable > 0)
+		{
+			Assert(updated_buf != NULL);
+			XLogRegisterBufData(0, (char *) updateitemnos,
+								nupdatable * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, updated_buf, updated_sz);
+		}
+
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
 		PageSetLSN(page, recptr);
@@ -1041,6 +1160,91 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	END_CRIT_SECTION();
 }
 
+/*
+ * Get the latestRemovedXid from the table entries pointed at by the index
+ * tuples being deleted.
+ *
+ * This is a version of index_compute_xid_horizon_for_tuples() specialized to
+ * nbtree, which can handle posting lists.
+ */
+static TransactionId
+_bt_compute_xid_horizon_for_tuples(Relation rel, Relation heapRel,
+								   Buffer buf, OffsetNumber *itemnos,
+								   int nitems)
+{
+	ItemPointer htids;
+	TransactionId latestRemovedXid = InvalidTransactionId;
+	Page		page = BufferGetPage(buf);
+	int			arraynitems;
+	int			finalnitems;
+
+	/*
+	 * Initial size of array can fit everything when it turns out that are no
+	 * posting lists
+	 */
+	arraynitems = nitems;
+	htids = (ItemPointer) palloc(sizeof(ItemPointerData) * arraynitems);
+
+	finalnitems = 0;
+	/* identify what the index tuples about to be deleted point to */
+	for (int i = 0; i < nitems; i++)
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+
+		itemid = PageGetItemId(page, itemnos[i]);
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(ItemIdIsDead(itemid));
+
+		if (!BTreeTupleIsPosting(itup))
+		{
+			/* Make sure that we have space for additional heap TID */
+			if (finalnitems + 1 > arraynitems)
+			{
+				arraynitems = arraynitems * 2;
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			Assert(ItemPointerIsValid(&itup->t_tid));
+			ItemPointerCopy(&itup->t_tid, &htids[finalnitems]);
+			finalnitems++;
+		}
+		else
+		{
+			int			nposting = BTreeTupleGetNPosting(itup);
+
+			/* Make sure that we have space for additional heap TIDs */
+			if (finalnitems + nposting > arraynitems)
+			{
+				arraynitems = Max(arraynitems * 2, finalnitems + nposting);
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			for (int j = 0; j < nposting; j++)
+			{
+				ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+				Assert(ItemPointerIsValid(htid));
+				ItemPointerCopy(htid, &htids[finalnitems]);
+				finalnitems++;
+			}
+		}
+	}
+
+	Assert(finalnitems >= nitems);
+
+	/* determine the actual xid horizon */
+	latestRemovedXid =
+		table_compute_xid_horizon_for_tuples(heapRel, htids, finalnitems);
+
+	pfree(htids);
+
+	return latestRemovedXid;
+}
+
 /*
  * Delete item(s) from a btree page during single-page cleanup.
  *
@@ -1067,8 +1271,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 
 	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
 		latestRemovedXid =
-			index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
-												 itemnos, nitems);
+			_bt_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+											   itemnos, nitems);
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
@@ -2066,6 +2270,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 			xlmeta.fastlevel = metad->btm_fastlevel;
 			xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 			xlmeta.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+			xlmeta.btm_dedup_is_possible = metad->btm_dedup_is_possible;
 
 			XLogRegisterBufData(4, (char *) &xlmeta, sizeof(xl_btree_metadata));
 			xlinfo = XLOG_BTREE_UNLINK_PAGE_META;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd5289ad..d70607e71a 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static ItemPointer btreevacuumposting(BTVacState *vstate, IndexTuple itup,
+									  int *nremaining);
 
 
 /*
@@ -157,10 +159,11 @@ void
 btbuildempty(Relation index)
 {
 	Page		metapage;
+	bool		dedup_is_possible = _bt_dedup_is_possible(index);
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, P_NONE, 0);
+	_bt_initmetapage(metapage, P_NONE, 0, dedup_is_possible);
 
 	/*
 	 * Write the page and log it.  It might seem that an immediate sync would
@@ -263,8 +266,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 				 */
 				if (so->killedItems == NULL)
 					so->killedItems = (int *)
-						palloc(MaxIndexTuplesPerPage * sizeof(int));
-				if (so->numKilled < MaxIndexTuplesPerPage)
+						palloc(MaxPostingIndexTuplesPerPage * sizeof(int));
+				if (so->numKilled < MaxPostingIndexTuplesPerPage)
 					so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 			}
 
@@ -816,7 +819,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 	}
 	else
 	{
-		StdRdOptions *relopts;
+		BtreeOptions *relopts;
 		float8		cleanup_scale_factor;
 		float8		prev_num_heap_tuples;
 
@@ -827,7 +830,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 		 * tuples exceeds vacuum_cleanup_index_scale_factor fraction of
 		 * original tuples count.
 		 */
-		relopts = (StdRdOptions *) info->index->rd_options;
+		relopts = (BtreeOptions *) info->index->rd_options;
 		cleanup_scale_factor = (relopts &&
 								relopts->vacuum_cleanup_index_scale_factor >= 0)
 			? relopts->vacuum_cleanup_index_scale_factor
@@ -1069,7 +1072,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 								 RBM_NORMAL, info->strategy);
 		LockBufferForCleanup(buf);
 		_bt_checkpage(rel, buf);
-		_bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+		_bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+							vstate.lastBlockVacuumed);
 		_bt_relbuf(rel, buf);
 	}
 
@@ -1188,8 +1192,17 @@ restart:
 	}
 	else if (P_ISLEAF(opaque))
 	{
+		/* Deletable item state */
 		OffsetNumber deletable[MaxOffsetNumber];
 		int			ndeletable;
+		int			nhtidsdead;
+		int			nhtidslive;
+
+		/* Updatable item state (for posting lists) */
+		IndexTuple	updated[MaxOffsetNumber];
+		OffsetNumber updatable[MaxOffsetNumber];
+		int			nupdatable;
+
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
@@ -1229,6 +1242,10 @@ restart:
 		 * callback function.
 		 */
 		ndeletable = 0;
+		nupdatable = 0;
+		/* Maintain stats counters for index tuple versions/heap TIDs */
+		nhtidsdead = 0;
+		nhtidslive = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
 		if (callback)
@@ -1238,11 +1255,9 @@ restart:
 				 offnum = OffsetNumberNext(offnum))
 			{
 				IndexTuple	itup;
-				ItemPointer htup;
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
 				/*
 				 * During Hot Standby we currently assume that
@@ -1265,8 +1280,71 @@ restart:
 				 * applies to *any* type of index that marks index tuples as
 				 * killed.
 				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Regular tuple, standard heap TID representation */
+					ItemPointer htid = &(itup->t_tid);
+
+					if (callback(htid, callback_state))
+					{
+						deletable[ndeletable++] = offnum;
+						nhtidsdead++;
+					}
+					else
+						nhtidslive++;
+				}
+				else
+				{
+					ItemPointer newhtids;
+					int			nremaining;
+
+					/*
+					 * Posting list tuple, a physical tuple that represents
+					 * two or more logical tuples, any of which could be an
+					 * index row version that must be removed
+					 */
+					newhtids = btreevacuumposting(vstate, itup, &nremaining);
+					if (newhtids == NULL)
+					{
+						/*
+						 * All TIDs/logical tuples from the posting tuple
+						 * remain, so no update or delete required
+						 */
+						Assert(nremaining == BTreeTupleGetNPosting(itup));
+					}
+					else if (nremaining > 0)
+					{
+						IndexTuple	updatedtuple;
+
+						/*
+						 * Form new tuple that contains only remaining TIDs.
+						 * Remember this tuple and the offset of the old tuple
+						 * for when we update it in place
+						 */
+						Assert(nremaining < BTreeTupleGetNPosting(itup));
+						updatedtuple = BTreeFormPostingTuple(itup, newhtids,
+															 nremaining);
+						updated[nupdatable] = updatedtuple;
+						updatable[nupdatable++] = offnum;
+						nhtidsdead += BTreeTupleGetNPosting(itup) - nremaining;
+						pfree(newhtids);
+					}
+					else
+					{
+						/*
+						 * All TIDs/logical tuples from the posting list must
+						 * be deleted.  We'll delete the physical tuple
+						 * completely.
+						 */
+						deletable[ndeletable++] = offnum;
+						nhtidsdead += BTreeTupleGetNPosting(itup);
+
+						/* Free empty array of live items */
+						pfree(newhtids);
+					}
+
+					nhtidslive += nremaining;
+				}
 			}
 		}
 
@@ -1274,7 +1352,7 @@ restart:
 		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
 		 * call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nupdatable > 0)
 		{
 			/*
 			 * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1290,7 +1368,8 @@ restart:
 			 * doesn't seem worth the amount of bookkeeping it'd take to avoid
 			 * that.
 			 */
-			_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+			_bt_delitems_vacuum(rel, buf, deletable, ndeletable, updatable,
+								updated, nupdatable,
 								vstate->lastBlockVacuumed);
 
 			/*
@@ -1300,7 +1379,7 @@ restart:
 			if (blkno > vstate->lastBlockVacuumed)
 				vstate->lastBlockVacuumed = blkno;
 
-			stats->tuples_removed += ndeletable;
+			stats->tuples_removed += nhtidsdead;
 			/* must recompute maxoff */
 			maxoff = PageGetMaxOffsetNumber(page);
 		}
@@ -1315,6 +1394,7 @@ restart:
 			 * We treat this like a hint-bit update because there's no need to
 			 * WAL-log it.
 			 */
+			Assert(nhtidsdead == 0);
 			if (vstate->cycleid != 0 &&
 				opaque->btpo_cycleid == vstate->cycleid)
 			{
@@ -1324,15 +1404,16 @@ restart:
 		}
 
 		/*
-		 * If it's now empty, try to delete; else count the live tuples. We
-		 * don't delete when recursing, though, to avoid putting entries into
+		 * If it's now empty, try to delete; else count the live tuples (live
+		 * heap TIDs in posting lists are counted as live tuples).  We don't
+		 * delete when recursing, though, to avoid putting entries into
 		 * freePages out-of-order (doesn't seem worth any extra code to handle
 		 * the case).
 		 */
 		if (minoff > maxoff)
 			delete_now = (blkno == orig_blkno);
 		else
-			stats->num_index_tuples += maxoff - minoff + 1;
+			stats->num_index_tuples += nhtidslive;
 	}
 
 	if (delete_now)
@@ -1375,6 +1456,68 @@ restart:
 	}
 }
 
+/*
+ * btreevacuumposting() -- determines which logical tuples must remain when
+ * VACUUMing a posting list tuple.
+ *
+ * Returns new palloc'd array of item pointers needed to build replacement
+ * posting list without the index row versions that are to be deleted.
+ *
+ * Note that returned array is NULL in the common case where there is nothing
+ * to delete in caller's posting list tuple.  The number of TIDs that should
+ * remain in the posting list tuple is set for caller in *nremaining.  This is
+ * also the size of the returned array (though only when array isn't just
+ * NULL).
+ */
+static ItemPointer
+btreevacuumposting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+	int			live = 0;
+	int			nitem = BTreeTupleGetNPosting(itup);
+	ItemPointer tmpitems = NULL,
+				items = BTreeTupleGetPosting(itup);
+
+	Assert(BTreeTupleIsPosting(itup));
+
+	/*
+	 * Check each tuple in the posting list.  Save live tuples into tmpitems,
+	 * though try to avoid memory allocation as an optimization.
+	 */
+	for (int i = 0; i < nitem; i++)
+	{
+		if (!vstate->callback(items + i, vstate->callback_state))
+		{
+			/*
+			 * Live heap TID.
+			 *
+			 * Only save live TID when we know that we're going to have to
+			 * kill at least one TID, and have already allocated memory.
+			 */
+			if (tmpitems)
+				tmpitems[live] = items[i];
+			live++;
+		}
+
+		/* Dead heap TID */
+		else if (tmpitems == NULL)
+		{
+			/*
+			 * Turns out we need to delete one or more dead heap TIDs, so
+			 * start maintaining an array of live TIDs for caller to
+			 * reconstruct smaller replacement posting list tuple
+			 */
+			tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+			/* Copy live heap TIDs from previous loop iterations */
+			if (live > 0)
+				memcpy(tmpitems, items, sizeof(ItemPointerData) * live);
+		}
+	}
+
+	*nremaining = live;
+	return tmpitems;
+}
+
 /*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8e512461a0..9db73d070d 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int	_bt_binsrch_posting(BTScanInsert key, Page page,
+								OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static void _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+								  OffsetNumber offnum, ItemPointer heapTid,
+								  IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+									   OffsetNumber offnum,
+									   ItemPointer heapTid);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -434,7 +442,10 @@ _bt_binsrch(Relation rel,
  * low) makes bounds invalid.
  *
  * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
  */
 OffsetNumber
 _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +464,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 	Assert(P_ISLEAF(opaque));
 	Assert(!key->nextkey);
+	Assert(insertstate->postingoff == 0);
 
 	if (!insertstate->bounds_valid)
 	{
@@ -509,6 +521,16 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 			if (result != 0)
 				stricthigh = high;
 		}
+
+		/*
+		 * If tuple at offset located by binary search is a posting list whose
+		 * TID range overlaps with caller's scantid, perform posting list
+		 * binary search to set postingoff for caller.  Caller must split the
+		 * posting list when postingoff is set.  This should happen
+		 * infrequently.
+		 */
+		if (unlikely(result == 0 && key->scantid != NULL))
+			insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
 	}
 
 	/*
@@ -528,6 +550,68 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	return low;
 }
 
+/*----------
+ *	_bt_binsrch_posting() -- posting list binary search.
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+	IndexTuple	itup;
+	ItemId		itemid;
+	int			low,
+				high,
+				mid,
+				res;
+
+	/*
+	 * If this isn't a posting tuple, then the index must be corrupt (if it is
+	 * an ordinary non-pivot tuple then there must be an existing tuple with a
+	 * heap TID that equals inserter's new heap TID/scantid).  Defensively
+	 * check that tuple is a posting list tuple whose posting list range
+	 * includes caller's scantid.
+	 *
+	 * (This is also needed because contrib/amcheck's rootdescend option needs
+	 * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+	 */
+	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+	Assert(!key->nextkey);
+	Assert(key->scantid != NULL);
+	itemid = PageGetItemId(page, offnum);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+	if (!BTreeTupleIsPosting(itup))
+		return 0;
+
+	/*
+	 * In the unlikely event that posting list tuple has LP_DEAD bit set,
+	 * signal to caller that it should kill the item and restart its binary
+	 * search.
+	 */
+	if (ItemIdIsDead(itemid))
+		return -1;
+
+	/* "high" is past end of posting list for loop invariant */
+	low = 0;
+	high = BTreeTupleGetNPosting(itup);
+	Assert(high >= 2);
+
+	while (high > low)
+	{
+		mid = low + ((high - low) / 2);
+		res = ItemPointerCompare(key->scantid,
+								 BTreeTupleGetPostingN(itup, mid));
+
+		if (res >= 1)
+			low = mid + 1;
+		else
+			high = mid;
+	}
+
+	return low;
+}
+
 /*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
@@ -537,9 +621,18 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
  *			<0 if scankey < tuple at offnum;
  *			 0 if scankey == tuple at offnum;
  *			>0 if scankey > tuple at offnum.
- *		NULLs in the keys are treated as sortable values.  Therefore
- *		"equality" does not necessarily mean that the item should be
- *		returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values.  Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key.  Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid.  There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
+ *
+ * It is generally guaranteed that any possible scankey with scantid set
+ * will have zero or one tuples in the index that are considered equal
+ * here.
  *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
@@ -563,6 +656,7 @@ _bt_compare(Relation rel,
 	ScanKey		scankey;
 	int			ncmpkey;
 	int			ntupatts;
+	int32		result;
 
 	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +691,6 @@ _bt_compare(Relation rel,
 	{
 		Datum		datum;
 		bool		isNull;
-		int32		result;
 
 		datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
 
@@ -713,8 +806,25 @@ _bt_compare(Relation rel,
 	if (heapTid == NULL)
 		return 1;
 
+	/*
+	 * scankey must be treated as equal to a posting list tuple if its scantid
+	 * value falls within the range of the posting list.  In all other cases
+	 * there can only be a single heap TID value, which is compared directly
+	 * as a simple scalar value.
+	 */
 	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	return ItemPointerCompare(key->scantid, heapTid);
+	result = ItemPointerCompare(key->scantid, heapTid);
+	if (!BTreeTupleIsPosting(itup) || result <= 0)
+		return result;
+	else
+	{
+		result = ItemPointerCompare(key->scantid,
+									BTreeTupleGetMaxHeapTID(itup));
+		if (result > 0)
+			return 1;
+	}
+
+	return 0;
 }
 
 /*
@@ -1233,6 +1343,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	inskey.anynullkeys = false; /* unused */
 	inskey.nextkey = nextkey;
 	inskey.pivotsearch = false;
+	inskey.dedup_is_possible = false;
 	inskey.scantid = NULL;
 	inskey.keysz = keysCount;
 
@@ -1451,6 +1562,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	/* initialize tuple workspace to empty */
 	so->currPos.nextTupleOffset = 0;
+	so->currPos.postingTupleOffset = 0;
 
 	/*
 	 * Now that the current page has been made consistent, the macro should be
@@ -1485,8 +1597,29 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				if (!BTreeTupleIsPosting(itup))
+				{
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+				else
+				{
+					/*
+					 * Setup state to return posting list, and save first
+					 * "logical" tuple
+					 */
+					_bt_setuppostingitems(so, itemIndex, offnum,
+										  BTreeTupleGetPostingN(itup, 0),
+										  itup);
+					itemIndex++;
+					/* Save additional posting list "logical" tuples */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i));
+						itemIndex++;
+					}
+				}
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1519,7 +1652,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1527,7 +1660,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxPostingIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1569,8 +1702,36 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (passes_quals && tuple_alive)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				if (!BTreeTupleIsPosting(itup))
+				{
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+				else
+				{
+					int			i = BTreeTupleGetNPosting(itup) - 1;
+
+					/*
+					 * Setup state to return posting list, and save last
+					 * "logical" tuple from posting list (since it's the first
+					 * that will be returned to scan).
+					 */
+					itemIndex--;
+					_bt_setuppostingitems(so, itemIndex, offnum,
+										  BTreeTupleGetPostingN(itup, i--),
+										  itup);
+
+					/*
+					 * Return posting list "logical" tuples -- do this in
+					 * descending order, to match overall scan order
+					 */
+					for (; i >= 0; i--)
+					{
+						itemIndex--;
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i));
+					}
+				}
 			}
 			if (!continuescan)
 			{
@@ -1584,8 +1745,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1759,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1610,6 +1773,59 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 	}
 }
 
+/*
+ * Setup state to save posting items from a single posting list tuple.  Saves
+ * the logical tuple that will be returned to scan first in passing.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for logical tuple
+ * that is returned to scan first.  Second or subsequent heap TID for posting
+ * list should be saved by calling _bt_savepostingitem().
+ */
+static void
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					  ItemPointer heapTid, IndexTuple itup)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		/* Save a base version of the IndexTuple */
+		Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+		itupsz = MAXALIGN(itupsz);
+		currItem->tupleOffset = so->currPos.nextTupleOffset;
+		memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+		so->currPos.nextTupleOffset += itupsz;
+		so->currPos.postingTupleOffset = currItem->tupleOffset;
+	}
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer heapTid)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+
+	/*
+	 * Have index-only scans return the same base IndexTuple for every logical
+	 * tuple that originates from the same posting list
+	 */
+	if (so->currTuples)
+		currItem->tupleOffset = so->currPos.postingTupleOffset;
+}
+
 /*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index ab19692006..a138fafeb1 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -287,6 +287,9 @@ static void _bt_sortaddtup(Page page, Size itemsize,
 						   IndexTuple itup, OffsetNumber itup_off);
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 						 IndexTuple itup);
+static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
+										  BTPageState *state,
+										  BTDedupState *dstate);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
@@ -725,8 +728,8 @@ _bt_pagestate(BTWriteState *wstate, uint32 level)
 	if (level > 0)
 		state->btps_full = (BLCKSZ * (100 - BTREE_NONLEAF_FILLFACTOR) / 100);
 	else
-		state->btps_full = RelationGetTargetPageFreeSpace(wstate->index,
-														  BTREE_DEFAULT_FILLFACTOR);
+		state->btps_full = BtreeGetTargetPageFreeSpace(wstate->index,
+													   BTREE_DEFAULT_FILLFACTOR);
 	/* no parent level, yet */
 	state->btps_next = NULL;
 
@@ -799,7 +802,8 @@ _bt_sortaddtup(Page page,
 }
 
 /*----------
- * Add an item to a disk page from the sort output.
+ * Add an item to a disk page from the sort output (or add a posting list
+ * item formed from the sort output).
  *
  * We must be careful to observe the page layout conventions of nbtsearch.c:
  * - rightmost pages start data items at P_HIKEY instead of at P_FIRSTKEY.
@@ -1002,6 +1006,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		 * the minimum key for the new page.
 		 */
 		state->btps_minkey = CopyIndexTuple(oitup);
+		Assert(BTreeTupleIsPivot(state->btps_minkey));
 
 		/*
 		 * Set the sibling links for both pages.
@@ -1043,6 +1048,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(state->btps_minkey == NULL);
 		state->btps_minkey = CopyIndexTuple(itup);
 		/* _bt_sortaddtup() will perform full truncation later */
+		BTreeTupleClearBtIsPosting(state->btps_minkey);
 		BTreeTupleSetNAtts(state->btps_minkey, 0);
 	}
 
@@ -1057,6 +1063,42 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	state->btps_lastoff = last_off;
 }
 
+/*
+ * Finalize pending posting list tuple, and add it to the index.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * This is almost like nbtinsert.c's _bt_dedup_finish_pending(), but it adds a
+ * new tuple using _bt_buildadd() and does not maintain the intervals array.
+ */
+static void
+_bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
+							  BTDedupState *dstate)
+{
+	IndexTuple	final;
+
+	Assert(dstate->nitems > 0);
+	if (dstate->nitems == 1)
+		final = dstate->base;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(dstate->base,
+											 dstate->htids,
+											 dstate->nhtids);
+		final = postingtuple;
+	}
+
+	_bt_buildadd(wstate, state, final);
+
+	if (dstate->nitems > 1)
+		pfree(final);
+	/* Don't maintain  dedup_intervals array, or alltupsize */
+	dstate->nhtids = 0;
+	dstate->nitems = 0;
+}
+
 /*
  * Finish writing out the completed btree.
  */
@@ -1123,7 +1165,8 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 	 * by filling in a valid magic number in the metapage.
 	 */
 	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, rootblkno, rootlevel);
+
+	_bt_initmetapage(metapage, rootblkno, rootlevel, wstate->inskey->dedup_is_possible);
 	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
 }
 
@@ -1144,6 +1187,10 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		deduplicate;
+
+	deduplicate = wstate->inskey->dedup_is_possible &&
+		BtreeGetDoDedupOption(wstate->index);
 
 	if (merge)
 	{
@@ -1255,9 +1302,96 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		}
 		pfree(sortKeys);
 	}
+	else if (deduplicate)
+	{
+		/* merge is unnecessary, deduplicate into posting lists */
+		BTDedupState *dstate;
+		IndexTuple	newbase;
+
+		dstate = (BTDedupState *) palloc(sizeof(BTDedupState));
+		dstate->maxitemsize = 0;	/* set later */
+		dstate->checkingunique = false; /* unused */
+		dstate->newitem = NULL;
+		/* Metadata about current pending posting list */
+		dstate->htids = NULL;
+		dstate->nhtids = 0;
+		dstate->nitems = 0;
+		dstate->overlap = false;
+		dstate->alltupsize = 0; /* unused */
+		/* Metadata about based tuple of current pending posting list */
+		dstate->base = NULL;
+		dstate->baseoff = InvalidOffsetNumber;	/* unused */
+		dstate->basetupsize = 0;
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate,
+											   true)) != NULL)
+		{
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+			{
+				state = _bt_pagestate(wstate, 0);
+				dstate->maxitemsize = BTMaxItemSize(state->btps_page);
+				/* Conservatively size array */
+				dstate->htids = palloc(dstate->maxitemsize);
+
+				/*
+				 * No previous/base tuple, since itup is the  first item
+				 * returned by the tuplesort -- use itup as base tuple of
+				 * first pending posting list for entire index build
+				 */
+				newbase = CopyIndexTuple(itup);
+				_bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+			}
+			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
+										 itup) > keysz &&
+					 _bt_dedup_save_htid(dstate, itup))
+			{
+				/*
+				 * Tuple is equal to base tuple of pending posting list, and
+				 * merging itup into pending posting list won't exceed the
+				 * BTMaxItemSize() limit.  Heap TID(s) for itup have been
+				 * saved in state.  The next iteration will also end up here
+				 * if it's possible to merge the next tuple into the same
+				 * pending posting list.
+				 */
+			}
+			else
+			{
+				/*
+				 * Tuple is not equal to pending posting list tuple, or
+				 * BTMaxItemSize() limit was reached
+				 */
+				_bt_sort_dedup_finish_pending(wstate, state, dstate);
+				/* Base tuple is always a copy */
+				pfree(dstate->base);
+
+				/* itup starts new pending posting list */
+				newbase = CopyIndexTuple(itup);
+				_bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+			}
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+
+		/*
+		 * Handle the last item (there must be a last item when the tuplesort
+		 * returned one or more tuples)
+		 */
+		if (state)
+		{
+			_bt_sort_dedup_finish_pending(wstate, state, dstate);
+			/* Base tuple is always a copy */
+			pfree(dstate->base);
+			pfree(dstate->htids);
+		}
+
+		pfree(dstate);
+	}
 	else
 	{
-		/* merge is unnecessary */
+		/* merging and deduplication are both unnecessary */
 		while ((itup = tuplesort_getindextuple(btspool->sortstate,
 											   true)) != NULL)
 		{
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1c1029b6c4..df976d4b7a 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -167,7 +167,7 @@ _bt_findsplitloc(Relation rel,
 
 	/* Count up total space in data items before actually scanning 'em */
 	olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page);
-	leaffillfactor = RelationGetFillFactor(rel, BTREE_DEFAULT_FILLFACTOR);
+	leaffillfactor = BtreeGetFillFactor(rel, BTREE_DEFAULT_FILLFACTOR);
 
 	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
 	newitemsz += sizeof(ItemIdData);
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
 	state.minfirstrightsz = SIZE_MAX;
 	state.newitemoff = newitemoff;
 
+	/* newitem cannot be a posting list item */
+	Assert(!BTreeTupleIsPosting(newitem));
+
 	/*
 	 * maxsplits should never exceed maxoff because there will be at most as
 	 * many candidate split points as there are points _between_ tuples, once
@@ -459,17 +462,52 @@ _bt_recsplitloc(FindSplitData *state,
 	int16		leftfree,
 				rightfree;
 	Size		firstrightitemsz;
+	Size		postingsubhikey = 0;
 	bool		newitemisfirstonright;
 
 	/* Is the new item going to be the first item on the right page? */
 	newitemisfirstonright = (firstoldonright == state->newitemoff
 							 && !newitemonleft);
 
+	/*
+	 * FIXME: Accessing every single tuple like this adds cycles to cases that
+	 * cannot possibly benefit (i.e. cases where we know that there cannot be
+	 * posting lists).  Maybe we should add a way to not bother when we are
+	 * certain that this is the case.
+	 *
+	 * We could either have _bt_split() pass us a flag, or invent a page flag
+	 * that indicates that the page might have posting lists, as an
+	 * optimization.  There is no shortage of btpo_flags bits for stuff like
+	 * this.
+	 */
 	if (newitemisfirstonright)
+	{
 		firstrightitemsz = state->newitemsz;
+
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+			postingsubhikey = IndexTupleSize(state->newitem) -
+				BTreeTupleGetPostingOffset(state->newitem);
+	}
 	else
+	{
 		firstrightitemsz = firstoldonrightsz;
 
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf)
+		{
+			ItemId		itemid;
+			IndexTuple	newhighkey;
+
+			itemid = PageGetItemId(state->page, firstoldonright);
+			newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+			if (BTreeTupleIsPosting(newhighkey))
+				postingsubhikey = IndexTupleSize(newhighkey) -
+					BTreeTupleGetPostingOffset(newhighkey);
+		}
+	}
+
 	/* Account for all the old tuples */
 	leftfree = state->leftspace - olddataitemstoleft;
 	rightfree = state->rightspace -
@@ -492,9 +530,13 @@ _bt_recsplitloc(FindSplitData *state,
 	 * adding a heap TID to the left half's new high key when splitting at the
 	 * leaf level.  In practice the new high key will often be smaller and
 	 * will rarely be larger, but conservatively assume the worst case.
+	 * Truncation always truncates away any posting list that appears in the
+	 * first right tuple, though, so it's safe to subtract that overhead
+	 * (while still conservatively assuming that truncation might have to add
+	 * back a single heap TID using the pivot tuple heap TID representation).
 	 */
 	if (state->is_leaf)
-		leftfree -= (int16) (firstrightitemsz +
+		leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
 							 MAXALIGN(sizeof(ItemPointerData)));
 	else
 		leftfree -= (int16) firstrightitemsz;
@@ -691,7 +733,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
 	tup = (IndexTuple) PageGetItem(state->page, itemid);
 	/* Do cheaper test first */
-	if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+	if (BTreeTupleIsPosting(tup) ||
+		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
 	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index bc855dd25d..6fdd776ea5 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -97,8 +97,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	indoption = rel->rd_indoption;
 	tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
 
-	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
 	/*
 	 * We'll execute search using scan key constructed on key columns.
 	 * Truncated attributes and non-key attributes are omitted from the final
@@ -110,9 +108,23 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	key->anynullkeys = false;	/* initial assumption */
 	key->nextkey = false;
 	key->pivotsearch = false;
+	key->scantid = NULL;
 	key->keysz = Min(indnkeyatts, tupnatts);
-	key->scantid = key->heapkeyspace && itup ?
-		BTreeTupleGetHeapTID(itup) : NULL;
+	/* get information from relation info or from btree metapage */
+	key->dedup_is_possible = (itup == NULL) ? _bt_dedup_is_possible(rel) :
+		_bt_getdedupispossible(rel);
+
+	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+	Assert(!itup || !BTreeTupleIsPosting(itup) || key->heapkeyspace);
+
+	/*
+	 * When caller passes a tuple with a heap TID, use it to set scantid. Note
+	 * that this handles posting list tuples by setting scantid to the lowest
+	 * heap TID in the posting list.
+	 */
+	if (itup && key->heapkeyspace)
+		key->scantid = BTreeTupleGetHeapTID(itup);
+
 	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
@@ -1386,6 +1398,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			continue;
 		}
 
@@ -1547,6 +1560,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			cmpresult = 0;
 			if (subkey->sk_flags & SK_ROW_END)
 				break;
@@ -1786,10 +1800,35 @@ _bt_killitems(IndexScanDesc scan)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
+			bool		killtuple = false;
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			if (BTreeTupleIsPosting(ituple))
 			{
-				/* found the item */
+				int			pi = i + 1;
+				int			nposting = BTreeTupleGetNPosting(ituple);
+				int			j;
+
+				for (j = 0; j < nposting; j++)
+				{
+					ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+					if (!ItemPointerEquals(item, &kitem->heapTid))
+						break;	/* out of posting list loop */
+
+					/* Read-ahead to later kitems */
+					if (pi < numKilled)
+						kitem = &so->currPos.items[so->killedItems[pi++]];
+				}
+
+				if (j == nposting)
+					killtuple = true;
+			}
+			else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+				killtuple = true;
+
+			if (killtuple)
+			{
+				/* found the item/all posting list items */
 				ItemIdMarkDead(iid);
 				killedsomething = true;
 				break;			/* out of inner search loop */
@@ -2027,7 +2066,30 @@ BTreeShmemInit(void)
 bytea *
 btoptions(Datum reloptions, bool validate)
 {
-	return default_reloptions(reloptions, validate, RELOPT_KIND_BTREE);
+	relopt_value *options;
+	BtreeOptions *rdopts;
+	int			numoptions;
+	static const relopt_parse_elt tab[] = {
+		{"fillfactor", RELOPT_TYPE_INT, offsetof(BtreeOptions, fillfactor)},
+		{"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
+		offsetof(BtreeOptions, vacuum_cleanup_index_scale_factor)},
+		{"deduplication", RELOPT_TYPE_BOOL, offsetof(BtreeOptions, do_deduplication)}
+	};
+
+	options = parseRelOptions(reloptions, validate, RELOPT_KIND_BTREE,
+							  &numoptions);
+
+	/* if none set, we're done */
+	if (numoptions == 0)
+		return NULL;
+
+	rdopts = allocateReloptStruct(sizeof(BtreeOptions), options, numoptions);
+
+	fillRelOptions((void *) rdopts, sizeof(BtreeOptions), options, numoptions,
+				   validate, tab, lengthof(tab));
+
+	pfree(options);
+	return (bytea *) rdopts;
 }
 
 /*
@@ -2140,6 +2202,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(firstright))
+		{
+			BTreeTupleClearBtIsPosting(pivot);
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			if (keepnatts == natts)
+			{
+				/*
+				 * index_truncate_tuple() just returned a copy of the
+				 * original, so make sure that the size of the new pivot tuple
+				 * doesn't have posting list overhead
+				 */
+				pivot->t_info &= ~INDEX_SIZE_MASK;
+				pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+			}
+		}
+
+		Assert(!BTreeTupleIsPosting(pivot));
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2156,6 +2236,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * attribute to the new pivot tuple.
 		 */
 		Assert(natts != nkeyatts);
+		Assert(!BTreeTupleIsPosting(lastleft) &&
+			   !BTreeTupleIsPosting(firstright));
 		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
 		tidpivot = palloc0(newsize);
 		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2163,6 +2245,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		pfree(pivot);
 		pivot = tidpivot;
 	}
+	else if (BTreeTupleIsPosting(firstright))
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal.  We
+		 * can always truncate away a posting list, though.
+		 *
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+			MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+		pivot->t_info &= ~INDEX_SIZE_MASK;
+		pivot->t_info |= newsize;
+		BTreeTupleClearBtIsPosting(pivot);
+		BTreeTupleSetAltHeapTID(pivot);
+	}
 	else
 	{
 		/*
@@ -2170,7 +2270,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * It's necessary to add a heap TID attribute to the new pivot tuple.
 		 */
 		Assert(natts == nkeyatts);
-		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+		newsize = MAXALIGN(IndexTupleSize(firstright)) +
+			MAXALIGN(sizeof(ItemPointerData));
 		pivot = palloc0(newsize);
 		memcpy(pivot, firstright, IndexTupleSize(firstright));
 	}
@@ -2188,6 +2289,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * nbtree (e.g., there is no pg_attribute entry).
 	 */
 	Assert(itup_key->heapkeyspace);
+	Assert(!BTreeTupleIsPosting(pivot));
 	pivot->t_info &= ~INDEX_SIZE_MASK;
 	pivot->t_info |= newsize;
 
@@ -2200,7 +2302,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2211,9 +2313,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
+							  BTreeTupleGetHeapTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #else
 
 	/*
@@ -2226,7 +2331,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2235,7 +2340,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2316,15 +2422,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
  * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal (once detoasted).  Similarly, result may
- * differ from the _bt_keep_natts result when either tuple has TOASTed datums,
- * though this is barely possible in practice.
+ * unless they're bitwise equal after detoasting.
  *
  * These issues must be acceptable to callers, typically because they're only
  * concerned about making suffix truncation as effective as possible without
  * leaving excessive amounts of free space on either side of page split.
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts.
+ *
+ * When an index only uses opclasses where equality is "precise", this
+ * function is guaranteed to give the same result as _bt_keep_natts().  This
+ * makes it safe to use this function to determine whether or not two tuples
+ * can be folded together into a single posting tuple.  Posting list
+ * deduplication cannot be used with nondeterministic collations for this
+ * reason.
+ *
+ * FIXME: Actually invent the needed "equality-is-precise" opclass
+ * infrastructure.  See dedicated -hackers thread:
+ *
+ * https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2349,8 +2465,38 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 		if (isNull1 != isNull2)
 			break;
 
+		/*
+		 * XXX: The ideal outcome from the point of view of the posting list
+		 * patch is that the definition of an opclass with "precise equality"
+		 * becomes: "equality operator function must give exactly the same
+		 * answer as datum_image_eq() would, provided that we aren't using a
+		 * nondeterministic collation". (Nondeterministic collations are
+		 * clearly not compatible with deduplication.)
+		 *
+		 * This will be a lot faster than actually using the authoritative
+		 * insertion scankey in some cases.  This approach also seems more
+		 * elegant, since suffix truncation gets to follow exactly the same
+		 * definition of "equal" as posting list deduplication -- there is a
+		 * subtle interplay between deduplication and suffix truncation, and
+		 * it would be nice to know for sure that they have exactly the same
+		 * idea about what equality is.
+		 *
+		 * This ideal outcome still avoids problems with TOAST.  We cannot
+		 * repeat bugs like the amcheck bug that was fixed in bugfix commit
+		 * eba775345d23d2c999bbb412ae658b6dab36e3e8.  datum_image_eq()
+		 * considers binary equality, though only _after_ each datum is
+		 * decompressed.
+		 *
+		 * If this ideal solution isn't possible, then we can fall back on
+		 * defining "precise equality" as: "type's output function must
+		 * produce identical textual output for any two datums that compare
+		 * equal when using a safe/equality-is-precise operator class (unless
+		 * using a nondeterministic collation)".  That would mean that we'd
+		 * have to make deduplication call _bt_keep_natts() instead (or some
+		 * other function that uses authoritative insertion scankey).
+		 */
 		if (!isNull1 &&
-			!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+			!datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
 			break;
 
 		keepnatts++;
@@ -2402,22 +2548,30 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
+	/* !heapkeyspace indexes do not support deduplication */
+	if (!heapkeyspace && BTreeTupleIsPosting(itup))
+		return false;
+
+	/* INCLUDE indexes do not support deduplication */
+	if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+		return false;
+
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
 			/*
-			 * Non-pivot tuples currently never use alternative heap TID
-			 * representation -- even those within heapkeyspace indexes
+			 * Non-pivot tuple should never be explicitly marked as a pivot
+			 * tuple
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated.  (Note that tupnatts must have been
-			 * inferred, rather than coming from an explicit on-disk
-			 * representation.)
+			 * inferred, even with a posting list tuple, because only pivot
+			 * tuples store tupnatts directly.)
 			 */
 			return tupnatts == natts;
 		}
@@ -2461,12 +2615,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * non-zero, or when there is no explicit representation and the
 			 * tuple is evidently not a pre-pg_upgrade tuple.
 			 *
-			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
-			 * that to decide if the tuple is a pre-v11 tuple.
+			 * Prior to v11, downlinks always had P_HIKEY as their offset.
+			 * Accept that as an alternative indication of a valid
+			 * !heapkeyspace negative infinity tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
-				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+				ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
 		}
 		else
 		{
@@ -2492,7 +2646,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
+		return false;
+
+	/* Pivot tuple should not use posting list representation (redundant) */
+	if (BTreeTupleIsPosting(itup))
 		return false;
 
 	/*
@@ -2562,11 +2720,119 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					BTMaxItemSizeNoHeapTid(page),
 					RelationGetRelationName(rel)),
 			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
-					   ItemPointerGetBlockNumber(&newtup->t_tid),
-					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+					   ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
 					   RelationGetRelationName(heap)),
 			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
 					 "Consider a function index of an MD5 hash of the value, "
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Given a basic tuple that contains key datum and posting list, build a
+ * posting tuple.  Caller's "htids" array must be sorted in ascending order.
+ *
+ * Basic tuple can be a posting tuple, but we only use key part of it, all
+ * ItemPointers must be passed via htids.
+ *
+ * If nhtids == 1, just build a non-posting tuple.  It is necessary to avoid
+ * storage overhead after posting tuple was vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointer htids, int nhtids)
+{
+	uint32		keysize,
+				newsize = 0;
+	IndexTuple	itup;
+
+	/* We only need key part of the tuple */
+	if (BTreeTupleIsPosting(tuple))
+		keysize = BTreeTupleGetPostingOffset(tuple);
+	else
+		keysize = IndexTupleSize(tuple);
+
+	Assert(nhtids > 0);
+
+	/* Add space needed for posting list */
+	if (nhtids > 1)
+		newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nhtids;
+	else
+		newsize = keysize;
+
+	newsize = MAXALIGN(newsize);
+	itup = palloc0(newsize);
+	memcpy(itup, tuple, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+
+	if (nhtids > 1)
+	{
+		/* Form posting tuple, fill posting fields */
+
+		itup->t_info |= INDEX_ALT_TID_MASK;
+		BTreeSetPostingMeta(itup, nhtids, SHORTALIGN(keysize));
+		/* Copy posting list into the posting tuple */
+		memcpy(BTreeTupleGetPosting(itup), htids,
+			   sizeof(ItemPointerData) * nhtids);
+
+#ifdef USE_ASSERT_CHECKING
+		{
+			/* Assert that htid array is sorted and has unique TIDs */
+			ItemPointerData last;
+			ItemPointer current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+				current = BTreeTupleGetPostingN(itup, i);
+				Assert(ItemPointerCompare(current, &last) > 0);
+				ItemPointerCopy(current, &last);
+			}
+		}
+#endif
+	}
+	else
+	{
+		/* To finish building of a non-posting tuple, copy TID from htids */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		ItemPointerCopy(htids, &itup->t_tid);
+	}
+
+	return itup;
+}
+
+/*
+ * Note: This does not account for pg_uggrade'd !heapkeyspace indexes
+ */
+bool
+_bt_dedup_is_possible(Relation index)
+{
+	int			dedup_is_possible = false;
+
+	if (IndexRelationGetNumberOfAttributes(index) ==
+		IndexRelationGetNumberOfKeyAttributes(index))
+	{
+		int			i;
+
+		dedup_is_possible = true;
+
+		for (i = 0; i < IndexRelationGetNumberOfKeyAttributes(index); i++)
+		{
+			Oid			opfamily = index->rd_opfamily[i];
+			Oid			collation = index->rd_indcollation[i];
+
+			/* TODO add adequate check of opclasses and collations */
+			elog(DEBUG4, "index %s column i %d opfamilyOid %u collationOid %u",
+				 RelationGetRelationName(index), i, opfamily, collation);
+			/* NUMERIC BTREE OPFAMILY OID is 1988 */
+			if (opfamily == 1988)
+			{
+				return false;
+			}
+		}
+	}
+
+	return dedup_is_possible;
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c1aa..747ab4235c 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -21,8 +21,11 @@
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "storage/procarray.h"
+#include "utils/memutils.h"
 #include "miscadmin.h"
 
+static MemoryContext opCtx;		/* working memory for operations */
+
 /*
  * _bt_restore_page -- re-enter all the index tuples on a page
  *
@@ -111,6 +114,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
 	Assert(md->btm_version >= BTREE_NOVAC_VERSION);
 	md->btm_oldest_btpo_xact = xlrec->oldest_btpo_xact;
 	md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
+	md->btm_dedup_is_possible = xlrec->btm_dedup_is_possible;
 
 	pageop = (BTPageOpaque) PageGetSpecialPointer(metapg);
 	pageop->btpo_flags = BTP_META;
@@ -181,9 +185,46 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 
 		page = BufferGetPage(buffer);
 
-		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
-						false, false) == InvalidOffsetNumber)
-			elog(PANIC, "btree_xlog_insert: failed to add item");
+		if (xlrec->postingoff == InvalidOffsetNumber)
+		{
+			/* Simple retail insertion */
+			if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+		}
+		else
+		{
+			ItemId		itemid;
+			IndexTuple	oposting,
+						newitem,
+						nposting;
+
+			/*
+			 * A posting list split occurred during insertion.
+			 *
+			 * Use _bt_posting_split() to repeat posting list split steps from
+			 * primary.  Note that newitem from WAL record is 'orignewitem',
+			 * not the final version of newitem that is actually inserted on
+			 * page.
+			 */
+			Assert(isleaf);
+			itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+			oposting = (IndexTuple) PageGetItem(page, itemid);
+
+			/* newitem must be mutable copy for _bt_posting_split() */
+			newitem = CopyIndexTuple((IndexTuple) datapos);
+			nposting = _bt_posting_split(newitem, oposting,
+										 xlrec->postingoff);
+
+			/* Replace existing posting list with post-split version */
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+			/* insert new item */
+			Assert(IndexTupleSize(newitem) == datalen);
+			if (PageAddItem(page, (Item) newitem, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add posting split new item");
+		}
 
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
@@ -265,20 +306,42 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 		BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
 		OffsetNumber off;
 		IndexTuple	newitem = NULL,
-					left_hikey = NULL;
+					left_hikey = NULL,
+					nposting = NULL;
 		Size		newitemsz = 0,
 					left_hikeysz = 0;
 		Page		newlpage;
-		OffsetNumber leftoff;
+		OffsetNumber leftoff,
+					replacepostingoff = InvalidOffsetNumber;
 
 		datapos = XLogRecGetBlockData(record, 0, &datalen);
 
-		if (onleft)
+		if (onleft || xlrec->postingoff != 0)
 		{
 			newitem = (IndexTuple) datapos;
 			newitemsz = MAXALIGN(IndexTupleSize(newitem));
 			datapos += newitemsz;
 			datalen -= newitemsz;
+
+			if (xlrec->postingoff != 0)
+			{
+				/*
+				 * Use _bt_posting_split() to repeat posting list split steps
+				 * from primary
+				 */
+				ItemId		itemid;
+				IndexTuple	oposting;
+
+				/* Posting list must be at offset number before new item's */
+				replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+				/* newitem must be mutable copy for _bt_posting_split() */
+				newitem = CopyIndexTuple(newitem);
+				itemid = PageGetItemId(lpage, replacepostingoff);
+				oposting = (IndexTuple) PageGetItem(lpage, itemid);
+				nposting = _bt_posting_split(newitem, oposting,
+											 xlrec->postingoff);
+			}
 		}
 
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
@@ -304,8 +367,20 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 			Size		itemsz;
 			IndexTuple	item;
 
+			/* Add replacement posting list when required */
+			if (off == replacepostingoff)
+			{
+				Assert(onleft || xlrec->firstright == xlrec->newitemoff);
+				if (PageAddItem(newlpage, (Item) nposting,
+								MAXALIGN(IndexTupleSize(nposting)), leftoff,
+								false, false) == InvalidOffsetNumber)
+					elog(ERROR, "failed to add new posting list item to left page after split");
+				leftoff = OffsetNumberNext(leftoff);
+				continue;
+			}
+
 			/* add the new item if it was inserted on left page */
-			if (onleft && off == xlrec->newitemoff)
+			else if (onleft && off == xlrec->newitemoff)
 			{
 				if (PageAddItem(newlpage, (Item) newitem, newitemsz, leftoff,
 								false, false) == InvalidOffsetNumber)
@@ -379,6 +454,83 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 	}
 }
 
+static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		buf;
+	xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+
+	if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Initialize a temporary empty page and copy all the items to that in
+		 * item number order.
+		 */
+		Page		page = (Page) BufferGetPage(buf);
+		OffsetNumber offnum;
+		BTDedupState *state;
+
+		state = (BTDedupState *) palloc(sizeof(BTDedupState));
+
+		state->maxitemsize = BTMaxItemSize(page);
+		state->checkingunique = false;	/* unused */
+		state->newitem = NULL;
+		/* Metadata about current pending posting list */
+		state->htids = NULL;
+		state->nhtids = 0;
+		state->nitems = 0;
+		state->alltupsize = 0;
+		state->overlap = false;
+		/* Metadata about based tuple of current pending posting list */
+		state->base = NULL;
+		state->baseoff = InvalidOffsetNumber;
+		state->basetupsize = 0;
+
+		/* Conservatively size array */
+		state->htids = palloc(state->maxitemsize);
+
+		/*
+		 * Iterate over tuples on the page belonging to the interval to
+		 * deduplicate them into a posting list.
+		 */
+		for (offnum = xlrec->baseoff;
+			 offnum < xlrec->baseoff + xlrec->nitems;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemid = PageGetItemId(page, offnum);
+			IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+			Assert(!ItemIdIsDead(itemid));
+
+			if (offnum == xlrec->baseoff)
+			{
+				/*
+				 * No previous/base tuple for first data item -- use first
+				 * data item as base tuple of first pending posting list
+				 */
+				_bt_dedup_start_pending(state, itup, offnum);
+			}
+			else
+			{
+				/* Heap TID(s) for itup will be saved in state */
+				if (!_bt_dedup_save_htid(state, itup))
+					elog(ERROR, "could not add heap tid to pending posting list");
+			}
+		}
+
+		Assert(state->nitems == xlrec->nitems);
+		/* Handle the last item */
+		_bt_dedup_finish_pending(buf, state, false);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buf);
+	}
+
+	if (BufferIsValid(buf))
+		UnlockReleaseBuffer(buf);
+}
+
 static void
 btree_xlog_vacuum(XLogReaderState *record)
 {
@@ -386,8 +538,8 @@ btree_xlog_vacuum(XLogReaderState *record)
 	Buffer		buffer;
 	Page		page;
 	BTPageOpaque opaque;
-#ifdef UNUSED
 	xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
 
 	/*
 	 * This section of code is thought to be no longer needed, after analysis
@@ -478,14 +630,34 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		if (len > 0)
 		{
-			OffsetNumber *unused;
-			OffsetNumber *unend;
+			if (xlrec->nupdated > 0)
+			{
+				OffsetNumber *updatedoffsets;
+				IndexTuple	updated;
+				Size		itemsz;
 
-			unused = (OffsetNumber *) ptr;
-			unend = (OffsetNumber *) ((char *) ptr + len);
+				updatedoffsets = (OffsetNumber *)
+					(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+				updated = (IndexTuple) ((char *) updatedoffsets +
+										xlrec->nupdated * sizeof(OffsetNumber));
 
-			if ((unend - unused) > 0)
-				PageIndexMultiDelete(page, unused, unend - unused);
+				/* Handle posting tuples */
+				for (int i = 0; i < xlrec->nupdated; i++)
+				{
+					PageIndexTupleDelete(page, updatedoffsets[i]);
+
+					itemsz = MAXALIGN(IndexTupleSize(updated));
+
+					if (PageAddItem(page, (Item) updated, itemsz, updatedoffsets[i],
+									false, false) == InvalidOffsetNumber)
+						elog(PANIC, "btree_xlog_vacuum: failed to add updated posting list item");
+
+					updated = (IndexTuple) ((char *) updated + itemsz);
+				}
+			}
+
+			if (xlrec->ndeleted)
+				PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 		}
 
 		/*
@@ -820,7 +992,9 @@ void
 btree_redo(XLogReaderState *record)
 {
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	MemoryContext oldCtx;
 
+	oldCtx = MemoryContextSwitchTo(opCtx);
 	switch (info)
 	{
 		case XLOG_BTREE_INSERT_LEAF:
@@ -838,6 +1012,9 @@ btree_redo(XLogReaderState *record)
 		case XLOG_BTREE_SPLIT_R:
 			btree_xlog_split(false, record);
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			btree_xlog_dedup(record);
+			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
 			break;
@@ -863,6 +1040,23 @@ btree_redo(XLogReaderState *record)
 		default:
 			elog(PANIC, "btree_redo: unknown op code %u", info);
 	}
+	MemoryContextSwitchTo(oldCtx);
+	MemoryContextReset(opCtx);
+}
+
+void
+btree_xlog_startup(void)
+{
+	opCtx = AllocSetContextCreate(CurrentMemoryContext,
+								  "Btree recovery temporary context",
+								  ALLOCSET_DEFAULT_SIZES);
+}
+
+void
+btree_xlog_cleanup(void)
+{
+	MemoryContextDelete(opCtx);
+	opCtx = NULL;
 }
 
 /*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 4ee6d04a68..1dde2da285 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -30,7 +30,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_insert *xlrec = (xl_btree_insert *) rec;
 
-				appendStringInfo(buf, "off %u", xlrec->offnum);
+				appendStringInfo(buf, "off %u; postingoff %u",
+								 xlrec->offnum, xlrec->postingoff);
 				break;
 			}
 		case XLOG_BTREE_SPLIT_L:
@@ -38,16 +39,30 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
-				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d",
-								 xlrec->level, xlrec->firstright, xlrec->newitemoff);
+				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, postingoff %d",
+								 xlrec->level,
+								 xlrec->firstright,
+								 xlrec->newitemoff,
+								 xlrec->postingoff);
+				break;
+			}
+		case XLOG_BTREE_DEDUP_PAGE:
+			{
+				xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+				appendStringInfo(buf, "baseoff %u; nitems %u",
+								 xlrec->baseoff,
+								 xlrec->nitems);
 				break;
 			}
 		case XLOG_BTREE_VACUUM:
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "lastBlockVacuumed %u",
-								 xlrec->lastBlockVacuumed);
+				appendStringInfo(buf, "lastBlockVacuumed %u; nupdated %u; ndeleted %u",
+								 xlrec->lastBlockVacuumed,
+								 xlrec->nupdated,
+								 xlrec->ndeleted);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
@@ -131,6 +146,9 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			id = "DEDUPLICATE";
+			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4a80e84aa7..593f74c26e 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -107,11 +107,43 @@ typedef struct BTMetaPageData
 										 * pages */
 	float8		btm_last_cleanup_num_heap_tuples;	/* number of heap tuples
 													 * during last cleanup */
+	bool		btm_dedup_is_possible;	/* whether the deduplication can be
+										 * applied to the index */
 } BTMetaPageData;
 
 #define BTPageGetMeta(p) \
 	((BTMetaPageData *) PageGetContents(p))
 
+/* Storage type for Btree's reloptions */
+typedef struct BtreeOptions
+{
+	int32		vl_len_;		/* varlena header (do not touch directly!) */
+	int			fillfactor;
+	double		vacuum_cleanup_index_scale_factor;
+	bool		do_deduplication;
+} BtreeOptions;
+
+/*
+ * By default deduplication is enabled for non unique indexes
+ * and disabled for unique ones
+ *
+ * XXX: Actually, we use deduplication everywhere for now.  Re-review this
+ * decision later on.
+ */
+#define BtreeDefaultDoDedup(relation) \
+	(relation->rd_index->indisunique ? true : true)
+
+#define BtreeGetDoDedupOption(relation) \
+	((relation)->rd_options ? \
+	 ((BtreeOptions *) (relation)->rd_options)->do_deduplication : BtreeDefaultDoDedup(relation))
+
+#define BtreeGetFillFactor(relation, defaultff) \
+	((relation)->rd_options ? \
+	 ((BtreeOptions *) (relation)->rd_options)->fillfactor : (defaultff))
+
+#define BtreeGetTargetPageFreeSpace(relation, defaultff) \
+	(BLCKSZ * (100 - BtreeGetFillFactor(relation, defaultff)) / 100)
+
 /*
  * The current Btree version is 4.  That's what you'll get when you create
  * a new index.
@@ -234,8 +266,7 @@ typedef struct BTMetaPageData
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
  * All other types of index tuples ("pivot" tuples) only have key columns,
  * since pivot tuples only exist to represent how the key space is
@@ -252,6 +283,38 @@ typedef struct BTMetaPageData
  * omitted rather than truncated, since its representation is different to
  * the non-pivot representation.)
  *
+ * Non-pivot posting tuple format:
+ *  t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicated keys more effectively, we use special format
+ * of tuples - posting tuples.  posting_list is an array of ItemPointerData.
+ *
+ * Deduplication never applies to unique indexes or indexes with INCLUDEd
+ * columns.
+ *
+ * To differ posting tuples we use INDEX_ALT_TID_MASK flag in t_info and
+ * BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's tid:
+ * - t_tid.ip_blkid contains offset of the posting list.
+ * - t_tid offset field contains number of posting items this tuple contain
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * tuples, which is constrainted by BTMaxItemSize.
+
+ * If page contains so many duplicates, that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), page may contain several posting
+ * tuples with the same key.
+ * Also page can contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
  * Pivot tuple format:
  *
  *  t_tid | t_info | key values | [heap TID]
@@ -281,23 +344,152 @@ typedef struct BTMetaPageData
  * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
  * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
  * number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinct posting tuples from pivot tuples.
  *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_N_POSTING_OFFSET_MASK	0x0FFF
 #define BT_HEAP_TID_ATTR			0x1000
+#define BT_IS_POSTING				0x2000
 
-/* Get/set downlink block number */
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger then
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+	((int) ((BLCKSZ - SizeOfPageHeaderData - \
+			3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+			(sizeof(ItemPointerData)))
+
+/*
+ * State used to representing a pending posting list during deduplication.
+ *
+ * Each entry represents a group of consecutive items from the page, starting
+ * from page offset number 'baseoff', which is the offset number of the "base"
+ * tuple on the page undergoing deduplication.  'nitems' is the total number
+ * of items from the page that will be merged to make a new posting tuple.
+ *
+ * Note: 'nitems' means the number of physical index tuples/line pointers on
+ * the page, starting with and including the item at offset number 'baseoff'
+ * (so nitems should be at least 2 when interval is used).  These existing
+ * tuples may be posting list tuples or regular tuples.
+ */
+typedef struct BTDedupInterval
+{
+	OffsetNumber baseoff;
+	OffsetNumber nitems;
+} BTDedupInterval;
+
+/*
+ * Btree-private state needed to build posting tuples.  htids is an array of
+ * ItemPointers for pending posting list.
+ *
+ * Iterating over tuples during index build or applying deduplication to a
+ * single page, we remember a "base" tuple, then compare the next one with it.
+ * If tuples are equal, save their TIDs in the posting list.
+ */
+typedef struct BTDedupState
+{
+	Relation	rel;
+	/* Deduplication status info for entire page/operation */
+	Size		maxitemsize;	/* BTMaxItemSize() limit for page */
+	IndexTuple	newitem;
+	bool		checkingunique;
+
+	/* Metadata about current pending posting list */
+	ItemPointer htids;			/* Heap TIDs in pending posting list */
+	int			nhtids;			/* # valid heap TIDs in nhtids array */
+	int			nitems;			/* See BTDedupInterval definition */
+	Size		alltupsize;		/* Includes line pointer overhead */
+	bool		overlap;		/* Avoid overlapping posting lists? */
+
+	/* Metadata about base tuple of current pending posting list */
+	IndexTuple	base;			/* Use to form new posting list */
+	OffsetNumber baseoff;		/* page offset of base */
+	Size		basetupsize;	/* base size without posting list */
+
+	/*
+	 * Pending posting list.  Contains information about a group of
+	 * consecutive items that will be deduplicated by creating a new posting
+	 * list tuple.
+	 */
+	BTDedupInterval interval;
+} BTDedupState;
+
+/*
+ * N.B.: BTreeTupleIsPivot() should only be used in code that deals with
+ * heapkeyspace indexes specifically.  BTreeTupleIsPosting() works with all
+ * nbtree indexes, though.
+ */
+#define BTreeTupleIsPivot(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+	)
+#define BTreeTupleIsPosting(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+	)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleGetNPosting(itup)	\
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+	)
+#define BTreeTupleSetNPosting(itup, n) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+	} while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+	)
+#define BTreeSetPostingMeta(itup, nposting, off) \
+	do { \
+		BTreeTupleSetNPosting(itup, nposting); \
+		Assert(BTreeTupleIsPosting(itup)); \
+		ItemPointerSetBlockNumber(&((itup)->t_tid), (off)); \
+	} while(0)
+
+#define BTreeTupleGetPosting(itup) \
+	(ItemPointer) ((char*) (itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+	(BTreeTupleGetPosting(itup) + (n))
+
+/* Get/set downlink block number  */
 #define BTreeInnerTupleGetDownLink(itup) \
 	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
 #define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,40 +518,73 @@ typedef struct BTMetaPageData
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
 		( \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
 		IndexRelationGetNumberOfAttributes(rel) \
 	)
-#define BTreeTupleSetNAtts(itup, n) \
-	do { \
-		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
-	} while(0)
+
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int n)
+{
+	Assert(!BTreeTupleIsPosting(itup));
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	ItemPointerSetOffsetNumber(&itup->t_tid, n & BT_N_KEYS_OFFSET_MASK);
+}
 
 /*
- * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Get tiebreaker heap TID attribute, if any.  Works with both pivot and
+ * non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
  */
-#define BTreeTupleGetHeapTID(itup) \
-	( \
-	  (itup)->t_info & INDEX_ALT_TID_MASK && \
-	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
-	  ( \
-		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
-					   sizeof(ItemPointerData)) \
-	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
-	)
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+	if (BTreeTupleIsPivot(itup))
+	{
+		/* Pivot tuple heap TID representation? */
+		if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+			 BT_HEAP_TID_ATTR) != 0)
+			return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+								  sizeof(ItemPointerData));
+
+		/* Heap TID attribute was truncated */
+		return NULL;
+	}
+	else if (BTreeTupleIsPosting(itup))
+		return BTreeTupleGetPosting(itup);
+
+	return &(itup->t_tid);
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list tuple.  Works with
+ * non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxHeapTID(IndexTuple itup)
+{
+	Assert(!BTreeTupleIsPivot(itup));
+
+	if (BTreeTupleIsPosting(itup))
+		return (ItemPointer) (BTreeTupleGetPosting(itup) +
+							  (BTreeTupleGetNPosting(itup) - 1));
+
+	return &(itup->t_tid);
+}
+
 /*
  * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * representation
  */
 #define BTreeTupleSetAltHeapTID(itup) \
 	do { \
-		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(BTreeTupleIsPivot(itup)); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
 								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
 	} while(0)
@@ -472,6 +697,7 @@ typedef struct BTScanInsertData
 	bool		anynullkeys;
 	bool		nextkey;
 	bool		pivotsearch;
+	bool		dedup_is_possible;
 	ItemPointer scantid;		/* tiebreaker for scankeys */
 	int			keysz;			/* Size of scankeys array */
 	ScanKeyData scankeys[INDEX_MAX_KEYS];	/* Must appear last */
@@ -499,6 +725,13 @@ typedef struct BTInsertStateData
 	/* Buffer containing leaf page we're likely to insert itup on */
 	Buffer		buf;
 
+	/*
+	 * if _bt_binsrch_insert() found the location inside existing posting
+	 * list, save the position inside the list.  This will be -1 in rare cases
+	 * where the overlapping posting list is LP_DEAD.
+	 */
+	int			postingoff;
+
 	/*
 	 * Cache of bounds within the current buffer.  Only used for insertions
 	 * where _bt_check_unique is called.  See _bt_binsrch_insert and
@@ -534,7 +767,9 @@ typedef BTInsertStateData *BTInsertState;
  * If we are doing an index-only scan, we save the entire IndexTuple for each
  * matched item, otherwise only its heap TID and offset.  The IndexTuples go
  * into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array.  Posting list tuples store a version of the
+ * tuple that does not include the posting list, allowing the same key to be
+ * returned for each logical tuple associated with the posting list.
  */
 
 typedef struct BTScanPosItem	/* what we remember about each match */
@@ -563,9 +798,13 @@ typedef struct BTScanPosData
 
 	/*
 	 * If we are doing an index-only scan, nextTupleOffset is the first free
-	 * location in the associated tuple storage workspace.
+	 * location in the associated tuple storage workspace.  Posting list
+	 * tuples need postingTupleOffset to store the current location of the
+	 * tuple that is returned multiple times (once per heap TID in posting
+	 * list).
 	 */
 	int			nextTupleOffset;
+	int			postingTupleOffset;
 
 	/*
 	 * The items array is always ordered in index order (ie, increasing
@@ -578,7 +817,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxPostingIndexTuplesPerPage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -730,8 +969,15 @@ extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
  */
 extern bool _bt_doinsert(Relation rel, IndexTuple itup,
 						 IndexUniqueCheck checkUnique, Relation heapRel);
+extern IndexTuple _bt_posting_split(IndexTuple newitem, IndexTuple oposting,
+									OffsetNumber postingoff);
 extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
 extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
+extern void _bt_dedup_start_pending(BTDedupState *state, IndexTuple base,
+									OffsetNumber base_off);
+extern bool _bt_dedup_save_htid(BTDedupState *state, IndexTuple itup);
+extern Size _bt_dedup_finish_pending(Buffer buffer, BTDedupState *state,
+									 bool need_wal);
 
 /*
  * prototypes for functions in nbtsplitloc.c
@@ -743,7 +989,8 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
 /*
  * prototypes for functions in nbtpage.c
  */
-extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level);
+extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+							 bool dedup_is_possible);
 extern void _bt_update_meta_cleanup_info(Relation rel,
 										 TransactionId oldestBtpoXact, float8 numHeapTuples);
 extern void _bt_upgrademetapage(Page page);
@@ -751,6 +998,7 @@ extern Buffer _bt_getroot(Relation rel, int access);
 extern Buffer _bt_gettrueroot(Relation rel);
 extern int	_bt_getrootheight(Relation rel);
 extern bool _bt_heapkeyspace(Relation rel);
+extern bool _bt_getdedupispossible(Relation rel);
 extern void _bt_checkpage(Relation rel, Buffer buf);
 extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
 extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -762,6 +1010,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems, Relation heapRel);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems,
+								OffsetNumber *updateitemnos,
+								IndexTuple *updated, int nupdateable,
 								BlockNumber lastBlockVacuumed);
 extern int	_bt_pagedel(Relation rel, Buffer buf);
 
@@ -812,6 +1062,9 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointer htids,
+										int nhtids);
+extern bool _bt_dedup_is_possible(Relation index);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 91b9ee00cf..71f6568234 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP_PAGE	0x50	/* deduplicate tuples on leaf page */
+/* 0x60 is unused */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -53,6 +54,7 @@ typedef struct xl_btree_metadata
 	uint32		fastlevel;
 	TransactionId oldest_btpo_xact;
 	float8		last_cleanup_num_heap_tuples;
+	bool		btm_dedup_is_possible;
 } xl_btree_metadata;
 
 /*
@@ -61,16 +63,21 @@ typedef struct xl_btree_metadata
  * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
  * Note that INSERT_META implies it's not a leaf page.
  *
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page (data contains the inserted tuple);
+ *				 if postingoff is set, this started out as an insertion
+ *				 into an existing posting tuple at the offset before
+ *				 offnum (i.e. it's a posting list split).  (REDO will
+ *				 have to update split posting list, too.)
  * Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
  * Backup Blk 2: xl_btree_metadata, if INSERT_META
  */
 typedef struct xl_btree_insert
 {
 	OffsetNumber offnum;
+	OffsetNumber postingoff;
 } xl_btree_insert;
 
-#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, postingoff) + sizeof(OffsetNumber))
 
 /*
  * On insert with split, we save all the items going into the right sibling
@@ -91,9 +98,19 @@ typedef struct xl_btree_insert
  *
  * Backup Blk 0: original page / new left page
  *
- * The left page's data portion contains the new item, if it's the _L variant.
- * An IndexTuple representing the high key of the left page must follow with
- * either variant.
+ * The left page's data portion contains the new item, if it's the _L variant
+ * (though _R variant page split records with a posting list split sometimes
+ * need to include newitem).  An IndexTuple representing the high key of the
+ * left page must follow in all cases.
+ *
+ * The newitem is actually an "original" newitem when a posting list split
+ * occurs that requires than the original posting list be updated in passing.
+ * Recovery recognizes this case when postingoff is set, and must use the
+ * posting offset to do an in-place update of the existing posting list that
+ * was actually split, and change the newitem to the "final" newitem.  This
+ * corresponds to the xl_btree_insert postingoff-is-set case.  postingoff
+ * won't be set when a posting list split occurs where both original posting
+ * list and newitem go on the right page.
  *
  * Backup Blk 1: new right page
  *
@@ -111,10 +128,26 @@ typedef struct xl_btree_split
 {
 	uint32		level;			/* tree level of page being split */
 	OffsetNumber firstright;	/* first item moved to right page */
-	OffsetNumber newitemoff;	/* new item's offset (useful for _L variant) */
+	OffsetNumber newitemoff;	/* new item's offset */
+	OffsetNumber postingoff;	/* offset inside orig posting tuple */
 } xl_btree_split;
 
-#define SizeOfBtreeSplit	(offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit	(offsetof(xl_btree_split, postingoff) + sizeof(OffsetNumber))
+
+/*
+ * When page is deduplicated, consecutive groups of tuples with equal keys are
+ * merged together into posting list tuples.
+ *
+ * The WAL record represents the interval that describes the posing tuple
+ * that should be added to the page.
+ */
+typedef struct xl_btree_dedup
+{
+	OffsetNumber baseoff;
+	OffsetNumber nitems;
+} xl_btree_dedup;
+
+#define SizeOfBtreeDedup 	(offsetof(xl_btree_dedup, nitems) + sizeof(OffsetNumber))
 
 /*
  * This is what we need to know about delete of individual leaf index tuples.
@@ -166,16 +199,27 @@ typedef struct xl_btree_reuse_page
  * block numbers aren't given.
  *
  * Note that the *last* WAL record in any vacuum of an index is allowed to
- * have a zero length array of offsets. Earlier records must have at least one.
+ * have a zero length array of target offsets (i.e. no deletes or updates).
+ * Earlier records must have at least one.
  */
 typedef struct xl_btree_vacuum
 {
 	BlockNumber lastBlockVacuumed;
 
-	/* TARGET OFFSET NUMBERS FOLLOW */
+	/*
+	 * This field helps us to find beginning of the updated versions of tuples
+	 * which follow array of offset numbers, needed when a posting list is
+	 * vacuumed without killing all of its logical tuples.
+	 */
+	uint32		nupdated;
+	uint32		ndeleted;
+
+	/* UPDATED TARGET OFFSET NUMBERS FOLLOW (if any) */
+	/* UPDATED TUPLES TO ADD BACK FOLLOW (if any) */
+	/* DELETED TARGET OFFSET NUMBERS FOLLOW (if any) */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(BlockNumber))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.
@@ -256,6 +300,8 @@ typedef struct xl_btree_newroot
 extern void btree_redo(XLogReaderState *record);
 extern void btree_desc(StringInfo buf, XLogReaderState *record);
 extern const char *btree_identify(uint8 info);
+extern void btree_xlog_startup(void);
+extern void btree_xlog_cleanup(void);
 extern void btree_mask(char *pagedata, BlockNumber blkno);
 
 #endif							/* NBTXLOG_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 3c0db2ccf5..2b8c6c7fc8 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -36,7 +36,7 @@ PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL,
 PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
 PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
 PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask)
 PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask)
 PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
 PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index ec47a228ae..71a03e39d3 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -212,3 +212,24 @@
    Memcheck:Cond
    fun:PyObject_Realloc
 }
+
+# Temporarily work around bug in datum_image_eq's handling of the cstring
+# (typLen == -2) case.  datumIsEqual() is not affected, but also doesn't handle
+# TOAST'ed values correctly.
+#
+# FIXME: Remove both suppressions when bug is fixed on master branch
+{
+   temporary_workaround_1
+   Memcheck:Addr1
+   fun:bcmp
+   fun:datum_image_eq
+   fun:_bt_keep_natts_fast
+}
+
+{
+   temporary_workaround_8
+   Memcheck:Addr8
+   fun:bcmp
+   fun:datum_image_eq
+   fun:_bt_keep_natts_fast
+}
-- 
2.17.1

#100

Peter Geoghegan

pg@bowt.ie

over 6 years ago

In reply to: Peter Geoghegan (#99)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Mon, Sep 30, 2019 at 7:39 PM Peter Geoghegan <pg@bowt.ie> wrote:

I've found that my "regular pgbench, but with a low cardinality index
on pgbench_accounts(abalance)" benchmark works best with the specific
heuristics used in the patch, especially over many hours.

I ran pgbench without the pgbench_accounts(abalance) index, and with
slightly adjusted client counts -- you could say that this was a
classic pgbench benchmark of v20 of the patch. Still scale 500, with
single hour runs.

Here are the results for each 1 hour run, with client counts of 8, 16,
and 32, with two rounds of runs total:

master_1_run_8.out: "tps = 25156.689415 (including connections establishing)"
patch_1_run_8.out: "tps = 25135.472084 (including connections establishing)"
master_1_run_16.out: "tps = 30947.053714 (including connections establishing)"
patch_1_run_16.out: "tps = 31225.044305 (including connections establishing)"
master_1_run_32.out: "tps = 29550.231969 (including connections establishing)"
patch_1_run_32.out: "tps = 29425.011249 (including connections establishing)"

master_2_run_8.out: "tps = 24678.792084 (including connections establishing)"
patch_2_run_8.out: "tps = 24891.130465 (including connections establishing)"
master_2_run_16.out: "tps = 30878.930585 (including connections establishing)"
patch_2_run_16.out: "tps = 30982.306091 (including connections establishing)"
master_2_run_32.out: "tps = 29555.453436 (including connections establishing)"
patch_2_run_32.out: "tps = 29591.767136 (including connections establishing)"

This interlaced order is the same order that each 1 hour pgbench run
actually ran in. The patch wasn't expected to do any better here -- it
was expected to not be any slower for a workload that it cannot really
help. Though the two small pgbench indexes do remain a lot smaller
with the patch.

While a lot of work remains to validate the performance of the patch,
this looks good to me.

--
Peter Geoghegan

#101

Peter Geoghegan

pg@bowt.ie

about 6 years ago

In reply to: Peter Geoghegan (#99)

3 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Mon, Sep 30, 2019 at 7:39 PM Peter Geoghegan <pg@bowt.ie> wrote:

Attached is v20, which adds a custom strategy for the checkingunique
(unique index) case to _bt_dedup_one_page(). It also makes
deduplication the default for both unique and non-unique indexes. I
simply altered your new BtreeDefaultDoDedup() macro from v19 to make
nbtree use deduplication wherever it is safe to do so. This default
may not be the best one in the end, though deduplication in unique
indexes now looks very compelling.

Attached is v21, which fixes some bitrot -- v20 of the patch was made
totally unusable by today's commit 8557a6f1. Other changes:

* New datum_image_eq() patch fixes up datum_image_eq() to work with
cstring/name columns, which we rely on. No need for a Valgrind
suppressions anymore. The suppression was only needed to paper over
the fact that datum_image_eq() would not really work properly with
cstring datums (the suppression was papering over a legitimate
complaint, but we fix the underlying problem with 8557a6f1 and the
v21-0001-* patch).

* New nbtdedup.c file added. This has all of the functions that dealt
with deduplication and posting lists that were previously in
nbtinsert.c and nbtutils.c. I think that this separation is somewhat
cleaner.

* Additional tweaks to the custom checkingunique algorithm used by
deduplication. This is based on further tuning from benchmarking. This
is certainly not final yet.

* Greatly simplified the code for unique index LP_DEAD killing in
_bt_check_unique(). This was pretty sloppy in v20 of the patch (it had
two "goto" labels). Now it works with the existing loop conditions
that advance to the next equal item on the page.

* Additional adjustments to the nbtree.h comments about the on-disk format.

Can you take a quick look at the first patch (the v21-0001-* patch),
Anastasia? I would like to get that one out of the way soon.

--
Peter Geoghegan

Attachments:

v21-0001-Teach-datum_image_eq-about-cstring-datums.patchapplication/x-patch; name=v21-0001-Teach-datum_image_eq-about-cstring-datums.patchDownload

From 49d1be9007130c0e80e423f99c7b043df654b0cc Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 4 Nov 2019 09:07:13 -0800
Subject: [PATCH v21 1/3] Teach datum_image_eq() about cstring datums.

An upcoming patch to add deduplication to nbtree indexes needs to be
able to use datum_image_eq() as a drop-in replacement for opclass
equality in certain contexts.  This includes comparisons of TOASTable
datatypes such as text (at least when deterministic collations are in
use), and cstring datums in system catalog indexes.  cstring is used as
the storage type of "name" columns when indexed by nbtree, despite the
fact that cstring is a pseudo-type.

Discussion: https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
---
 src/backend/utils/adt/datum.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/adt/datum.c b/src/backend/utils/adt/datum.c
index 73703efe05..b20d0640ea 100644
--- a/src/backend/utils/adt/datum.c
+++ b/src/backend/utils/adt/datum.c
@@ -263,6 +263,8 @@ datumIsEqual(Datum value1, Datum value2, bool typByVal, int typLen)
 bool
 datum_image_eq(Datum value1, Datum value2, bool typByVal, int typLen)
 {
+	Size		len1,
+				len2;
 	bool		result = true;
 
 	if (typByVal)
@@ -277,9 +279,6 @@ datum_image_eq(Datum value1, Datum value2, bool typByVal, int typLen)
 	}
 	else if (typLen == -1)
 	{
-		Size		len1,
-					len2;
-
 		len1 = toast_raw_datum_size(value1);
 		len2 = toast_raw_datum_size(value2);
 		/* No need to de-toast if lengths don't match. */
@@ -304,6 +303,20 @@ datum_image_eq(Datum value1, Datum value2, bool typByVal, int typLen)
 				pfree(arg2val);
 		}
 	}
+	else if (typLen == -2)
+	{
+		char	   *s1,
+				   *s2;
+
+		/* Compare cstring datums */
+		s1 = DatumGetCString(value1);
+		s2 = DatumGetCString(value2);
+		len1 = strlen(s1) + 1;
+		len2 = strlen(s2) + 1;
+		if (len1 != len2)
+			return false;
+		result = (memcmp(s1, s2, len1) == 0);
+	}
 	else
 		elog(ERROR, "unexpected typLen: %d", typLen);
 
-- 
2.17.1

v21-0003-DEBUG-Add-pageinspect-instrumentation.patchapplication/x-patch; name=v21-0003-DEBUG-Add-pageinspect-instrumentation.patchDownload

From d72829b729b2e028048ebc9fcbdb9a7f47c724b8 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v21 3/3] DEBUG: Add pageinspect instrumentation.

Have pageinspect display user-visible attribute values, heap TID, max
heap TID, and the number of TIDs in a tuple (can be > 1 in the case of
posting list tuples).  Also adds a column that shows whether or not the
LP_DEAD bit has been set.

This patch is not proposed for inclusion in PostgreSQL; it's included
for the convenience of reviewers.

The following query can be used with this hacked pageinspect, which
visualizes the internal pages:

"""

with recursive index_details as (
  select
    'my_test_index'::text idx
),
size_in_pages_index as (
  select
    (pg_relation_size(idx::regclass) / (2^13))::int4 size_pages
  from
    index_details
),
page_stats as (
  select
    index_details.*,
    stats.*
  from
    index_details,
    size_in_pages_index,
    lateral (select i from generate_series(1, size_pages - 1) i) series,
    lateral (select * from bt_page_stats(idx, i)) stats),
internal_page_stats as (
  select
    *
  from
    page_stats
  where
    type != 'l'),
meta_stats as (
  select
    *
  from
    index_details s,
    lateral (select * from bt_metap(s.idx)) meta),
internal_items as (
  select
    *
  from
    internal_page_stats
  order by
    btpo desc),
-- XXX: Note ordering dependency within this CTE, on internal_items
ordered_internal_items(item, blk, level) as (
  select
    1,
    blkno,
    btpo
  from
    internal_items
  where
    btpo_prev = 0
    and btpo = (select level from meta_stats)
  union
  select
    case when level = btpo then o.item + 1 else 1 end,
    blkno,
    btpo
  from
    internal_items i,
    ordered_internal_items o
  where
    i.btpo_prev = o.blk or (btpo_prev = 0 and btpo = o.level - 1)
)
select
  --idx,
  btpo as level,
  item as l_item,
  blkno,
  --btpo_prev,
  --btpo_next,
  btpo_flags,
  type,
  live_items,
  dead_items,
  avg_item_size,
  page_size,
  free_size,
  -- Only non-rightmost pages have high key.  Show heap TID for both pivot and non-pivot tuples here.
  case when btpo_next != 0 then (select data || coalesce(', (htid)=(''' || htid || ''')', '')
                                 from bt_page_items(idx, blkno) where itemoffset = 1) end as highkey
from
  ordered_internal_items o
  join internal_items i on o.blk = i.blkno
order by btpo desc, item;
"""
---
 contrib/pageinspect/btreefuncs.c              | 92 ++++++++++++++++---
 contrib/pageinspect/expected/btree.out        |  6 +-
 contrib/pageinspect/pageinspect--1.6--1.7.sql | 25 +++++
 3 files changed, 109 insertions(+), 14 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 78cdc69ec7..435e71ae22 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -27,6 +27,7 @@
 
 #include "postgres.h"
 
+#include "access/genam.h"
 #include "access/nbtree.h"
 #include "access/relation.h"
 #include "catalog/namespace.h"
@@ -241,6 +242,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
  */
 struct user_args
 {
+	Relation	rel;
 	Page		page;
 	OffsetNumber offset;
 };
@@ -252,9 +254,9 @@ struct user_args
  * ------------------------------------------------------
  */
 static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset, Relation rel)
 {
-	char	   *values[6];
+	char	   *values[10];
 	HeapTuple	tuple;
 	ItemId		id;
 	IndexTuple	itup;
@@ -263,6 +265,8 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	int			dlen;
 	char	   *dump;
 	char	   *ptr;
+	ItemPointer min_htid,
+				max_htid;
 
 	id = PageGetItemId(page, offset);
 
@@ -281,16 +285,77 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
-	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
-	dump = palloc0(dlen * 3 + 1);
-	values[j] = dump;
-	for (off = 0; off < dlen; off++)
+	if (rel)
 	{
-		if (off > 0)
-			*dump++ = ' ';
-		sprintf(dump, "%02x", *(ptr + off) & 0xff);
-		dump += 2;
+		TupleDesc	itupdesc = RelationGetDescr(rel);
+		Datum		datvalues[INDEX_MAX_KEYS];
+		bool		isnull[INDEX_MAX_KEYS];
+		int			natts;
+		int			indnkeyatts = rel->rd_index->indnkeyatts;
+
+		natts = BTreeTupleGetNAtts(itup, rel);
+
+		itupdesc->natts = Min(indnkeyatts, natts);
+		memset(&isnull, 0xFF, sizeof(isnull));
+		index_deform_tuple(itup, itupdesc, datvalues, isnull);
+		rel->rd_index->indnkeyatts = natts;
+		values[j++] = BuildIndexValueDescription(rel, datvalues, isnull);
+		itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+		rel->rd_index->indnkeyatts = indnkeyatts;
 	}
+	else
+	{
+		dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+		dump = palloc0(dlen * 3 + 1);
+		values[j++] = dump;
+		for (off = 0; off < dlen; off++)
+		{
+			if (off > 0)
+				*dump++ = ' ';
+			sprintf(dump, "%02x", *(ptr + off) & 0xff);
+			dump += 2;
+		}
+	}
+
+	if (rel && !_bt_heapkeyspace(rel))
+	{
+		min_htid = NULL;
+		max_htid = NULL;
+	}
+	else
+	{
+		min_htid = BTreeTupleGetHeapTID(itup);
+		if (BTreeTupleIsPosting(itup))
+			max_htid = BTreeTupleGetMaxHeapTID(itup);
+		else
+			max_htid = NULL;
+	}
+
+	if (min_htid)
+		values[j++] = psprintf("(%u,%u)",
+							   ItemPointerGetBlockNumberNoCheck(min_htid),
+							   ItemPointerGetOffsetNumberNoCheck(min_htid));
+	else
+		values[j++] = NULL;
+
+	if (max_htid)
+		values[j++] = psprintf("(%u,%u)",
+							   ItemPointerGetBlockNumberNoCheck(max_htid),
+							   ItemPointerGetOffsetNumberNoCheck(max_htid));
+	else
+		values[j++] = NULL;
+
+	if (min_htid == NULL)
+		values[j++] = psprintf("0");
+	else if (!BTreeTupleIsPosting(itup))
+		values[j++] = psprintf("1");
+	else
+		values[j++] = psprintf("%d", (int) BTreeTupleGetNPosting(itup));
+
+	if (!ItemIdIsDead(id))
+		values[j++] = psprintf("f");
+	else
+		values[j++] = psprintf("t");
 
 	tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
 
@@ -364,11 +429,11 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = rel;
 		uargs->page = palloc(BLCKSZ);
 		memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
 
 		UnlockReleaseBuffer(buffer);
-		relation_close(rel, AccessShareLock);
 
 		uargs->offset = FirstOffsetNumber;
 
@@ -395,12 +460,13 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, uargs->rel);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
 	else
 	{
+		relation_close(uargs->rel, AccessShareLock);
 		pfree(uargs->page);
 		pfree(uargs);
 		SRF_RETURN_DONE(fctx);
@@ -480,7 +546,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, NULL);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..0f6dccaadc 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -40,7 +40,11 @@ ctid       | (0,1)
 itemlen    | 16
 nulls      | f
 vars       | f
-data       | 01 00 00 00 00 00 00 01
+data       | (a)=(72057594037927937)
+htid       | (0,1)
+max_htid   | 
+nheap_tids | 1
+isdead     | f
 
 SELECT * FROM bt_page_items('test1_a_idx', 2);
 ERROR:  block number out of range
diff --git a/contrib/pageinspect/pageinspect--1.6--1.7.sql b/contrib/pageinspect/pageinspect--1.6--1.7.sql
index 2433a21af2..00473da938 100644
--- a/contrib/pageinspect/pageinspect--1.6--1.7.sql
+++ b/contrib/pageinspect/pageinspect--1.6--1.7.sql
@@ -24,3 +24,28 @@ CREATE FUNCTION bt_metap(IN relname text,
     OUT last_cleanup_num_tuples real)
 AS 'MODULE_PATHNAME', 'bt_metap'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items()
+--
+DROP FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT htid tid,
+    OUT max_htid tid,
+    OUT nheap_tids int4,
+    OUT isdead boolean)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
-- 
2.17.1

v21-0002-Add-deduplication-to-nbtree.patchapplication/x-patch; name=v21-0002-Add-deduplication-to-nbtree.patchDownload

From f241bf58420665adff152e3bc4389a119977b24f Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 25 Sep 2019 10:08:53 -0700
Subject: [PATCH v21 2/3] Add deduplication to nbtree

---
 src/include/access/nbtree.h             | 324 ++++++++++--
 src/include/access/nbtxlog.h            |  68 ++-
 src/include/access/rmgrlist.h           |   2 +-
 src/backend/access/common/reloptions.c  |  11 +-
 src/backend/access/index/genam.c        |   4 +
 src/backend/access/nbtree/Makefile      |   2 +-
 src/backend/access/nbtree/README        |  74 ++-
 src/backend/access/nbtree/nbtdedup.c    | 633 ++++++++++++++++++++++++
 src/backend/access/nbtree/nbtinsert.c   | 330 ++++++++++--
 src/backend/access/nbtree/nbtpage.c     | 209 +++++++-
 src/backend/access/nbtree/nbtree.c      | 174 ++++++-
 src/backend/access/nbtree/nbtsearch.c   | 244 ++++++++-
 src/backend/access/nbtree/nbtsort.c     | 145 +++++-
 src/backend/access/nbtree/nbtsplitloc.c |  49 +-
 src/backend/access/nbtree/nbtutils.c    | 217 ++++++--
 src/backend/access/nbtree/nbtxlog.c     | 218 +++++++-
 src/backend/access/rmgrdesc/nbtdesc.c   |  28 +-
 contrib/amcheck/verify_nbtree.c         | 177 +++++--
 18 files changed, 2705 insertions(+), 204 deletions(-)
 create mode 100644 src/backend/access/nbtree/nbtdedup.c

diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4a80e84aa7..56ab23ad79 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -107,11 +107,43 @@ typedef struct BTMetaPageData
 										 * pages */
 	float8		btm_last_cleanup_num_heap_tuples;	/* number of heap tuples
 													 * during last cleanup */
+	bool		btm_safededup;	/* deduplication safe for index? */
 } BTMetaPageData;
 
 #define BTPageGetMeta(p) \
 	((BTMetaPageData *) PageGetContents(p))
 
+/* Storage type for Btree's reloptions */
+typedef struct BtreeOptions
+{
+	int32		vl_len_;		/* varlena header (do not touch directly!) */
+	int			fillfactor;
+	double		vacuum_cleanup_index_scale_factor;
+	bool		dedup_enabled;	/* Use deduplication where safe? */
+} BtreeOptions;
+
+/*
+ * By default deduplication is enabled for non unique indexes
+ * and disabled for unique ones
+ *
+ * XXX: Actually, we use deduplication everywhere for now.  Re-review this
+ * decision later on.
+ */
+#define BtreeDefaultDoDedup(relation) \
+	(relation->rd_index->indisunique ? true : true)
+
+#define BtreeGetDoDedupOption(relation) \
+	((relation)->rd_options ? \
+	 ((BtreeOptions *) (relation)->rd_options)->dedup_enabled : \
+	 BtreeDefaultDoDedup(relation))
+
+#define BtreeGetFillFactor(relation, defaultff) \
+	((relation)->rd_options ? \
+	 ((BtreeOptions *) (relation)->rd_options)->fillfactor : (defaultff))
+
+#define BtreeGetTargetPageFreeSpace(relation, defaultff) \
+	(BLCKSZ * (100 - BtreeGetFillFactor(relation, defaultff)) / 100)
+
 /*
  * The current Btree version is 4.  That's what you'll get when you create
  * a new index.
@@ -234,8 +266,7 @@ typedef struct BTMetaPageData
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
  * All other types of index tuples ("pivot" tuples) only have key columns,
  * since pivot tuples only exist to represent how the key space is
@@ -282,20 +313,176 @@ typedef struct BTMetaPageData
  * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
  * number of columns/attributes <= INDEX_MAX_KEYS.
  *
+ * Sometimes non-pivot tuples also use a representation that repurposes
+ * t_tid to store metadata rather than a TID.  Postgres 13 introduced a new
+ * non-pivot tuple format in order to fold together multiple equal and
+ * equivalent non-pivot tuples into a single logically equivalent, space
+ * efficient representation - a posting list tuple.  A posting list is an
+ * array of ItemPointerData elements (there must be at least two elements
+ * when the posting list tuple format is used).  Posting list tuples are
+ * created dynamically by deduplication, at the point where we'd otherwise
+ * have to split a leaf page.
+ *
+ * Posting tuple format (alternative non-pivot tuple representation):
+ *
+ *  t_tid | t_info | key values | posting list (TID array)
+ *
+ * Posting list tuples are recognized as such by having the
+ * INDEX_ALT_TID_MASK status bit set in t_info and the BT_IS_POSTING status
+ * bit set in t_tid.  These flags redefine the content of the posting
+ * tuple's t_tid to store an offset to the posting list, as well as the
+ * total number of posting list array elements.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items present in the tuple, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.  Like any non-pivot tuple, the number of columns stored is
+ * always implicitly the total number in the index (in practice there can
+ * never be non-key columns stored, since deduplication is not supported
+ * with INCLUDE indexes).
+ *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_N_POSTING_OFFSET_MASK	0x0FFF
 #define BT_HEAP_TID_ATTR			0x1000
+#define BT_IS_POSTING				0x2000
+
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger then
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+	((int) ((BLCKSZ - SizeOfPageHeaderData - \
+			3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+			(sizeof(ItemPointerData)))
+
+/*
+ * State used to representing a pending posting list during deduplication.
+ *
+ * Each entry represents a group of consecutive items from the page, starting
+ * from page offset number 'baseoff', which is the offset number of the "base"
+ * tuple on the page undergoing deduplication.  'nitems' is the total number
+ * of items from the page that will be merged to make a new posting tuple.
+ *
+ * Note: 'nitems' means the number of physical index tuples/line pointers on
+ * the page, starting with and including the item at offset number 'baseoff'
+ * (so nitems should be at least 2 when interval is used).  These existing
+ * tuples may be posting list tuples or regular tuples.
+ */
+typedef struct BTDedupInterval
+{
+	OffsetNumber baseoff;
+	OffsetNumber nitems;
+} BTDedupInterval;
+
+/*
+ * Btree-private state needed to build posting tuples.  htids is an array of
+ * ItemPointers for pending posting list.
+ *
+ * Iterating over tuples during index build or applying deduplication to a
+ * single page, we remember a "base" tuple, then compare the next one with it.
+ * If tuples are equal, save their TIDs in the posting list.
+ */
+typedef struct BTDedupState
+{
+	Relation	rel;
+	/* Deduplication status info for entire page/operation */
+	Size		maxitemsize;	/* BTMaxItemSize() limit for page */
+	IndexTuple	newitem;
+	bool		checkingunique; /* Use unique index strategy? */
+	OffsetNumber skippedbase;	/* First offset skipped by checkingunique */
+
+	/* Metadata about current pending posting list */
+	ItemPointer htids;			/* Heap TIDs in pending posting list */
+	int			nhtids;			/* # heap TIDs in nhtids array */
+	int			nitems;			/* See BTDedupInterval definition */
+	Size		alltupsize;		/* Includes line pointer overhead */
+	bool		overlap;		/* Avoid overlapping posting lists? */
+
+	/* Metadata about base tuple of current pending posting list */
+	IndexTuple	base;			/* Use to form new posting list */
+	OffsetNumber baseoff;		/* page offset of base */
+	Size		basetupsize;	/* base size without posting list */
+
+	/*
+	 * Pending posting list.  Contains information about a group of
+	 * consecutive items that will be deduplicated by creating a new posting
+	 * list tuple.
+	 */
+	BTDedupInterval interval;
+} BTDedupState;
+
+/*
+ * N.B.: BTreeTupleIsPivot() should only be used in code that deals with
+ * heapkeyspace indexes specifically.  BTreeTupleIsPosting() works with all
+ * nbtree indexes, though.
+ */
+#define BTreeTupleIsPivot(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+	)
+#define BTreeTupleIsPosting(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+	)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleGetNPosting(itup)	\
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+	)
+#define BTreeTupleSetNPosting(itup, n) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+	} while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+	)
+#define BTreeSetPostingMeta(itup, nposting, off) \
+	do { \
+		BTreeTupleSetNPosting(itup, nposting); \
+		Assert(BTreeTupleIsPosting(itup)); \
+		ItemPointerSetBlockNumber(&((itup)->t_tid), (off)); \
+	} while(0)
+
+#define BTreeTupleGetPosting(itup) \
+	(ItemPointer) ((char*) (itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+	(BTreeTupleGetPosting(itup) + (n))
 
 /* Get/set downlink block number */
 #define BTreeInnerTupleGetDownLink(itup) \
@@ -326,40 +513,73 @@ typedef struct BTMetaPageData
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
 		( \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
 		IndexRelationGetNumberOfAttributes(rel) \
 	)
-#define BTreeTupleSetNAtts(itup, n) \
-	do { \
-		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
-	} while(0)
+
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int n)
+{
+	Assert(!BTreeTupleIsPosting(itup));
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	ItemPointerSetOffsetNumber(&itup->t_tid, n & BT_N_KEYS_OFFSET_MASK);
+}
 
 /*
- * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Get tiebreaker heap TID attribute, if any.  Works with both pivot and
+ * non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
  */
-#define BTreeTupleGetHeapTID(itup) \
-	( \
-	  (itup)->t_info & INDEX_ALT_TID_MASK && \
-	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
-	  ( \
-		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
-					   sizeof(ItemPointerData)) \
-	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
-	)
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+	if (BTreeTupleIsPivot(itup))
+	{
+		/* Pivot tuple heap TID representation? */
+		if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+			 BT_HEAP_TID_ATTR) != 0)
+			return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+								  sizeof(ItemPointerData));
+
+		/* Heap TID attribute was truncated */
+		return NULL;
+	}
+	else if (BTreeTupleIsPosting(itup))
+		return BTreeTupleGetPosting(itup);
+
+	return &(itup->t_tid);
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list tuple.  Works with
+ * non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxHeapTID(IndexTuple itup)
+{
+	Assert(!BTreeTupleIsPivot(itup));
+
+	if (BTreeTupleIsPosting(itup))
+		return (ItemPointer) (BTreeTupleGetPosting(itup) +
+							  (BTreeTupleGetNPosting(itup) - 1));
+
+	return &(itup->t_tid);
+}
+
 /*
  * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * representation
  */
 #define BTreeTupleSetAltHeapTID(itup) \
 	do { \
-		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(BTreeTupleIsPivot(itup)); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
 								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
 	} while(0)
@@ -434,6 +654,11 @@ typedef BTStackData *BTStack;
  * indexes whose version is >= version 4.  It's convenient to keep this close
  * by, rather than accessing the metapage repeatedly.
  *
+ * safededup is set to indicate that index may use dynamic deduplication
+ * safely (index storage parameter separately indicates if deduplication is
+ * currently in use).  This is also a property of the index relation rather
+ * than an indexscan that is kept around for convenience.
+ *
  * anynullkeys indicates if any of the keys had NULL value when scankey was
  * built from index tuple (note that already-truncated tuple key attributes
  * set NULL as a placeholder key value, which also affects value of
@@ -469,6 +694,7 @@ typedef BTStackData *BTStack;
 typedef struct BTScanInsertData
 {
 	bool		heapkeyspace;
+	bool		safededup;
 	bool		anynullkeys;
 	bool		nextkey;
 	bool		pivotsearch;
@@ -499,6 +725,13 @@ typedef struct BTInsertStateData
 	/* Buffer containing leaf page we're likely to insert itup on */
 	Buffer		buf;
 
+	/*
+	 * if _bt_binsrch_insert() found the location inside existing posting
+	 * list, save the position inside the list.  This will be -1 in rare cases
+	 * where the overlapping posting list is LP_DEAD.
+	 */
+	int			postingoff;
+
 	/*
 	 * Cache of bounds within the current buffer.  Only used for insertions
 	 * where _bt_check_unique is called.  See _bt_binsrch_insert and
@@ -534,7 +767,10 @@ typedef BTInsertStateData *BTInsertState;
  * If we are doing an index-only scan, we save the entire IndexTuple for each
  * matched item, otherwise only its heap TID and offset.  The IndexTuples go
  * into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array.  Posting list tuples store a "base" tuple once,
+ * allowing the same key to be returned for each logical tuple associated
+ * with the physical posting list tuple (i.e. for each TID from the posting
+ * list).
  */
 
 typedef struct BTScanPosItem	/* what we remember about each match */
@@ -563,9 +799,12 @@ typedef struct BTScanPosData
 
 	/*
 	 * If we are doing an index-only scan, nextTupleOffset is the first free
-	 * location in the associated tuple storage workspace.
+	 * location in the associated tuple storage workspace.  Posting list
+	 * tuples need postingTupleOffset to store the current location of the
+	 * tuple that is returned multiple times.
 	 */
 	int			nextTupleOffset;
+	int			postingTupleOffset;
 
 	/*
 	 * The items array is always ordered in index order (ie, increasing
@@ -578,7 +817,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxPostingIndexTuplesPerPage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -725,6 +964,22 @@ extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
 extern void _bt_parallel_done(IndexScanDesc scan);
 extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
 
+/*
+ * prototypes for functions in nbtdedup.c
+ */
+extern void _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+							   IndexTuple newitem, Size newitemsz,
+							   bool checkingunique);
+extern void _bt_dedup_start_pending(BTDedupState *state, IndexTuple base,
+									OffsetNumber base_off);
+extern bool _bt_dedup_save_htid(BTDedupState *state, IndexTuple itup);
+extern Size _bt_dedup_finish_pending(Buffer buffer, BTDedupState *state,
+									 bool need_wal);
+extern IndexTuple _bt_form_posting(IndexTuple tuple, ItemPointer htids,
+								   int nhtids);
+extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
+								   OffsetNumber postingoff);
+
 /*
  * prototypes for functions in nbtinsert.c
  */
@@ -743,7 +998,8 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
 /*
  * prototypes for functions in nbtpage.c
  */
-extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level);
+extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+							 bool safededup);
 extern void _bt_update_meta_cleanup_info(Relation rel,
 										 TransactionId oldestBtpoXact, float8 numHeapTuples);
 extern void _bt_upgrademetapage(Page page);
@@ -751,6 +1007,7 @@ extern Buffer _bt_getroot(Relation rel, int access);
 extern Buffer _bt_gettrueroot(Relation rel);
 extern int	_bt_getrootheight(Relation rel);
 extern bool _bt_heapkeyspace(Relation rel);
+extern bool _bt_safededup(Relation rel);
 extern void _bt_checkpage(Relation rel, Buffer buf);
 extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
 extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -762,6 +1019,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems, Relation heapRel);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems,
+								OffsetNumber *updateitemnos,
+								IndexTuple *updated, int nupdateable,
 								BlockNumber lastBlockVacuumed);
 extern int	_bt_pagedel(Relation rel, Buffer buf);
 
@@ -812,6 +1071,7 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern bool _bt_opclasses_support_dedup(Relation index);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 91b9ee00cf..b21e6f8082 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP_PAGE	0x50	/* deduplicate tuples on leaf page */
+/* 0x60 is unused */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -53,6 +54,7 @@ typedef struct xl_btree_metadata
 	uint32		fastlevel;
 	TransactionId oldest_btpo_xact;
 	float8		last_cleanup_num_heap_tuples;
+	bool		btm_safededup;
 } xl_btree_metadata;
 
 /*
@@ -61,16 +63,21 @@ typedef struct xl_btree_metadata
  * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
  * Note that INSERT_META implies it's not a leaf page.
  *
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page (data contains the inserted tuple);
+ *				 if postingoff is set, this started out as an insertion
+ *				 into an existing posting tuple at the offset before
+ *				 offnum (i.e. it's a posting list split).  (REDO will
+ *				 have to update split posting list, too.)
  * Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
  * Backup Blk 2: xl_btree_metadata, if INSERT_META
  */
 typedef struct xl_btree_insert
 {
 	OffsetNumber offnum;
+	OffsetNumber postingoff;
 } xl_btree_insert;
 
-#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, postingoff) + sizeof(OffsetNumber))
 
 /*
  * On insert with split, we save all the items going into the right sibling
@@ -91,9 +98,19 @@ typedef struct xl_btree_insert
  *
  * Backup Blk 0: original page / new left page
  *
- * The left page's data portion contains the new item, if it's the _L variant.
- * An IndexTuple representing the high key of the left page must follow with
- * either variant.
+ * The left page's data portion contains the new item, if it's the _L variant
+ * (though _R variant page split records with a posting list split sometimes
+ * need to include newitem).  An IndexTuple representing the high key of the
+ * left page must follow in all cases.
+ *
+ * The newitem is actually an "original" newitem when a posting list split
+ * occurs that requires than the original posting list be updated in passing.
+ * Recovery recognizes this case when postingoff is set, and must use the
+ * posting offset to do an in-place update of the existing posting list that
+ * was actually split, and change the newitem to the "final" newitem.  This
+ * corresponds to the xl_btree_insert postingoff-is-set case.  postingoff
+ * won't be set when a posting list split occurs where both original posting
+ * list and newitem go on the right page.
  *
  * Backup Blk 1: new right page
  *
@@ -111,10 +128,26 @@ typedef struct xl_btree_split
 {
 	uint32		level;			/* tree level of page being split */
 	OffsetNumber firstright;	/* first item moved to right page */
-	OffsetNumber newitemoff;	/* new item's offset (useful for _L variant) */
+	OffsetNumber newitemoff;	/* new item's offset */
+	OffsetNumber postingoff;	/* offset inside orig posting tuple */
 } xl_btree_split;
 
-#define SizeOfBtreeSplit	(offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit	(offsetof(xl_btree_split, postingoff) + sizeof(OffsetNumber))
+
+/*
+ * When page is deduplicated, consecutive groups of tuples with equal keys are
+ * merged together into posting list tuples.
+ *
+ * The WAL record represents the interval that describes the posing tuple
+ * that should be added to the page.
+ */
+typedef struct xl_btree_dedup
+{
+	OffsetNumber baseoff;
+	OffsetNumber nitems;
+} xl_btree_dedup;
+
+#define SizeOfBtreeDedup 	(offsetof(xl_btree_dedup, nitems) + sizeof(OffsetNumber))
 
 /*
  * This is what we need to know about delete of individual leaf index tuples.
@@ -166,16 +199,27 @@ typedef struct xl_btree_reuse_page
  * block numbers aren't given.
  *
  * Note that the *last* WAL record in any vacuum of an index is allowed to
- * have a zero length array of offsets. Earlier records must have at least one.
+ * have a zero length array of target offsets (i.e. no deletes or updates).
+ * Earlier records must have at least one.
  */
 typedef struct xl_btree_vacuum
 {
 	BlockNumber lastBlockVacuumed;
 
-	/* TARGET OFFSET NUMBERS FOLLOW */
+	/*
+	 * This field helps us to find beginning of the updated versions of tuples
+	 * which follow array of offset numbers, needed when a posting list is
+	 * vacuumed without killing all of its logical tuples.
+	 */
+	uint32		nupdated;
+	uint32		ndeleted;
+
+	/* UPDATED TARGET OFFSET NUMBERS FOLLOW (if any) */
+	/* UPDATED TUPLES TO ADD BACK FOLLOW (if any) */
+	/* DELETED TARGET OFFSET NUMBERS FOLLOW (if any) */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(BlockNumber))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.
@@ -256,6 +300,8 @@ typedef struct xl_btree_newroot
 extern void btree_redo(XLogReaderState *record);
 extern void btree_desc(StringInfo buf, XLogReaderState *record);
 extern const char *btree_identify(uint8 info);
+extern void btree_xlog_startup(void);
+extern void btree_xlog_cleanup(void);
 extern void btree_mask(char *pagedata, BlockNumber blkno);
 
 #endif							/* NBTXLOG_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 3c0db2ccf5..2b8c6c7fc8 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -36,7 +36,7 @@ PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL,
 PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
 PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
 PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask)
 PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask)
 PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
 PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index b5072c00fe..e6448e4a86 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -158,6 +158,15 @@ static relopt_bool boolRelOpts[] =
 		},
 		true
 	},
+	{
+		{
+			"deduplication",
+			"Enables deduplication on btree index leaf pages",
+			RELOPT_KIND_BTREE,
+			ShareUpdateExclusiveLock
+		},
+		true
+	},
 	/* list terminator */
 	{{NULL}}
 };
@@ -1513,8 +1522,6 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		offsetof(StdRdOptions, user_catalog_table)},
 		{"parallel_workers", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, parallel_workers)},
-		{"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
-		offsetof(StdRdOptions, vacuum_cleanup_index_scale_factor)},
 		{"vacuum_index_cleanup", RELOPT_TYPE_BOOL,
 		offsetof(StdRdOptions, vacuum_index_cleanup)},
 		{"vacuum_truncate", RELOPT_TYPE_BOOL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 2599b5d342..6e1dc596e1 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -276,6 +276,10 @@ BuildIndexValueDescription(Relation indexRelation,
 /*
  * Get the latestRemovedXid from the table entries pointed at by the index
  * tuples being deleted.
+ *
+ * Note: index access methods that don't consistently use the standard
+ * IndexTuple + heap TID item pointer representation will need to provide
+ * their own version of this function.
  */
 TransactionId
 index_compute_xid_horizon_for_tuples(Relation irel,
diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index 9aab9cf64a..8140b08777 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/access/nbtree
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = nbtcompare.o nbtinsert.o nbtpage.o nbtree.o nbtsearch.o \
+OBJS = nbtcompare.o nbtdedup.o nbtinsert.o nbtpage.o nbtree.o nbtsearch.o \
        nbtsplitloc.o nbtutils.o nbtsort.o nbtvalidate.o nbtxlog.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6db203e75c..54cb9db49d 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
 like a hint bit for a heap tuple), but physically removing tuples requires
 exclusive lock.  In the current code we try to remove LP_DEAD tuples when
 we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already).  Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)
 
 This leaves the index in a state where it has no entry for a dead tuple
 that still exists in the heap.  This is not a problem for the current
@@ -710,6 +713,75 @@ the fallback strategy assumes that duplicates are mostly inserted in
 ascending heap TID order.  The page is split in a way that leaves the left
 half of the page mostly full, and the right half of the page mostly empty.
 
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits.  Deduplication alters
+the physical representation of tuples without changing the logical contents
+of the index, and without adding overhead to read queries.  Non-pivot
+tuples are folded together into a single physical tuple with a posting list
+(a simple array of heap TIDs with the standard item pointer format).
+Deduplication is always applied lazily, at the point where it would
+otherwise be necessary to perform a page split.  It occurs only when
+LP_DEAD items have been removed, as our last line of defense against
+splitting a leaf page.  We can set the LP_DEAD bit with posting list
+tuples, though only when all table tuples are known dead. (Bitmap scans
+cannot perform LP_DEAD bit setting, and are the common case with indexes
+that contain lots of duplicates, so this downside is considered
+acceptable.)
+
+Large groups of logical duplicates tend to appear together on the same leaf
+page due to the special duplicate logic used when choosing a split point.
+This facilitates lazy/dynamic deduplication.  Deduplication can reliably
+deduplicate a large localized group of duplicates before it can span
+multiple leaf pages.  Posting list tuples are subject to the same 1/3 of a
+page restriction as any other tuple.
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists.  A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple.  An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication.  Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple.  When the incoming
+tuple really does overlap with an existing posting list, a posting list
+split is performed.  Posting list splits work in a way that more or less
+preserves the illusion that all incoming tuples do not need to be merged
+with any existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list.  The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list.  The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list.  We make
+space in the posting list by removing the heap TID that became the new
+item.  The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists.  Also, overall
+space utilization is improved by keeping existing posting lists large.
+
+The representation of posting lists is identical to the posting lists used
+by GIN, so it would be straightforward to apply GIN's varbyte encoding
+compression scheme to individual posting lists.  Posting list compression
+would break the assumptions made by posting list splits about page space
+accounting, though, so it's not clear how compression could be integrated
+with nbtree.  Besides, posting list compression does not offer a compelling
+trade-off for nbtree, since in general nbtree is optimized for consistent
+performance with many concurrent readers and writers.  A major goal of
+nbtree's lazy approach to deduplication is to limit the performance impact
+of deduplication with random updates.  Even concurrent append-only inserts
+of the same key value will tend to have inserts of individual index tuples
+in an order that doesn't quite match heap TID order.  In general, delaying
+deduplication avoids many unnecessary posting list splits, and minimizes
+page level fragmentation.
+
 Notes About Data Representation
 -------------------------------
 
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
new file mode 100644
index 0000000000..c8a63f9617
--- /dev/null
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -0,0 +1,633 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtdedup.c
+ *	  Deduplicate items in Lehman and Yao btrees for Postgres.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/nbtree/nbtdedup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/nbtxlog.h"
+#include "miscadmin.h"
+#include "utils/rel.h"
+
+
+/*
+ * Try to deduplicate items to free at least enough space to avoid a page
+ * split.  This function should be called during insertion, only after LP_DEAD
+ * items were removed by _bt_vacuum_one_page() to prevent a page split.
+ * (We'll have to kill LP_DEAD items here when the page's BTP_HAS_GARBAGE hint
+ * was not set, but that should be rare.)
+ *
+ * The strategy for !checkingunique callers is to perform as much
+ * deduplication as possible to free as much space as possible now, since
+ * making it harder to set LP_DEAD bits is considered an acceptable price for
+ * not having to deduplicate the same page many times.  It is unlikely that
+ * the items on the page will have their LP_DEAD bit set in the future, since
+ * that hasn't happened before now (besides, entire posting lists can still
+ * have their LP_DEAD bit set).
+ *
+ * The strategy for checkingunique callers is rather different, since the
+ * overall goal is different.  Deduplication cooperates with and enhances
+ * garbage collection, especially the LP_DEAD bit setting that takes place in
+ * _bt_check_unique().  Deduplication does as little as possible while still
+ * preventing a page split for caller, since it's less likely that posting
+ * lists will have their LP_DEAD bit set.  Deduplication avoids creating new
+ * posting lists with only two heap TIDs, and also avoids creating new posting
+ * lists from an existing posting list.  Deduplication is only useful when it
+ * delays a page split long enough for garbage collection to prevent the page
+ * split altogether.  checkingunique deduplication can make all the difference
+ * in cases where VACUUM keeps up with dead index tuples, but "recently dead"
+ * index tuples are still numerous enough to cause page splits that are truly
+ * unnecessary.
+ *
+ * Note: If newitem contains NULL values in key attributes, caller will be
+ * !checkingunique even when rel is a unique index.  The page in question will
+ * usually have many existing items with NULLs.
+ */
+void
+_bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+				   IndexTuple newitem, Size newitemsz, bool checkingunique)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buffer);
+	BTPageOpaque oopaque;
+	BTDedupState *state = NULL;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+	OffsetNumber deletable[MaxIndexTuplesPerPage];
+	bool		minimal = checkingunique;
+	int			ndeletable = 0;
+	Size		pagesaving = 0;
+
+	oopaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	/* init deduplication state needed to build posting tuples */
+	state = (BTDedupState *) palloc(sizeof(BTDedupState));
+	state->rel = rel;
+
+	state->maxitemsize = BTMaxItemSize(page);
+	state->newitem = newitem;
+	state->checkingunique = checkingunique;
+	state->skippedbase = InvalidOffsetNumber;
+	/* Metadata about current pending posting list */
+	state->htids = NULL;
+	state->nhtids = 0;
+	state->nitems = 0;
+	state->alltupsize = 0;
+	state->overlap = false;
+	/* Metadata about based tuple of current pending posting list */
+	state->base = NULL;
+	state->baseoff = InvalidOffsetNumber;
+	state->basetupsize = 0;
+
+	minoff = P_FIRSTDATAKEY(oopaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/*
+	 * Delete dead tuples if any. We cannot simply skip them in the cycle
+	 * below, because it's necessary to generate special Xlog record
+	 * containing such tuples to compute latestRemovedXid on a standby server
+	 * later.
+	 *
+	 * This should not affect performance, since it only can happen in a rare
+	 * situation when BTP_HAS_GARBAGE flag was not set and _bt_vacuum_one_page
+	 * was not called, or _bt_vacuum_one_page didn't remove all dead items.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+
+		if (ItemIdIsDead(itemid))
+			deletable[ndeletable++] = offnum;
+	}
+
+	if (ndeletable > 0)
+	{
+		/*
+		 * Skip duplication in rare cases where there were LP_DEAD items
+		 * encountered here when that frees sufficient space for caller to
+		 * avoid a page split
+		 */
+		_bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+		if (PageGetFreeSpace(page) >= newitemsz)
+		{
+			pfree(state);
+			return;
+		}
+
+		/* Continue with deduplication */
+		minoff = P_FIRSTDATAKEY(oopaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+	}
+
+	/* Make sure that new page won't have garbage flag set */
+	oopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+	newitemsz += sizeof(ItemIdData);
+	/* Conservatively size array */
+	state->htids = palloc(state->maxitemsize);
+
+	/*
+	 * Iterate over tuples on the page, try to deduplicate them into posting
+	 * lists and insert into new page.  NOTE: It's essential to reassess the
+	 * max offset on each iteration, since it will change as items are
+	 * deduplicated.
+	 */
+	offnum = minoff;
+retry:
+	while (offnum <= PageGetMaxOffsetNumber(page))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(!ItemIdIsDead(itemid));
+
+		if (state->nitems == 0)
+		{
+			/*
+			 * No previous/base tuple for the data item -- use the data item
+			 * as base tuple of pending posting list
+			 */
+			_bt_dedup_start_pending(state, itup, offnum);
+		}
+		else if (_bt_keep_natts_fast(rel, state->base, itup) > natts &&
+				 _bt_dedup_save_htid(state, itup))
+		{
+			/*
+			 * Tuple is equal to base tuple of pending posting list.  Heap
+			 * TID(s) for itup have been saved in state.  The next iteration
+			 * will also end up here if it's possible to merge the next tuple
+			 * into the same pending posting list.
+			 */
+		}
+		else
+		{
+			/*
+			 * Tuple is not equal to pending posting list tuple, or
+			 * _bt_dedup_save_htid() opted to not merge current item into
+			 * pending posting list for some other reason (e.g., adding more
+			 * TIDs would have caused posting list to exceed BTMaxItemSize()
+			 * limit).
+			 *
+			 * If state contains pending posting list with more than one item,
+			 * form new posting tuple, and update the page.  Otherwise, reset
+			 * the state and move on.
+			 */
+			pagesaving += _bt_dedup_finish_pending(buffer, state,
+												   RelationNeedsWAL(rel));
+
+			/*
+			 * When caller is a checkingunique caller and we have deduplicated
+			 * enough to avoid a page split, do minimal deduplication in case
+			 * the remaining items are about to be marked dead within
+			 * _bt_check_unique().
+			 */
+			if (minimal && pagesaving >= newitemsz)
+				break;
+
+			/*
+			 * Next iteration starts immediately after base tuple offset (this
+			 * will be the next offset on the page when we didn't modify the
+			 * page)
+			 */
+			offnum = state->baseoff;
+		}
+
+		offnum = OffsetNumberNext(offnum);
+	}
+
+	/* Handle the last item when pending posting list is not empty */
+	if (state->nitems != 0)
+		pagesaving += _bt_dedup_finish_pending(buffer, state,
+											   RelationNeedsWAL(rel));
+
+	if (pagesaving < newitemsz && state->skippedbase != InvalidOffsetNumber)
+	{
+		/*
+		 * Didn't free enough space for new item in first checkingunique pass.
+		 * Try making a second pass over the page, this time starting from the
+		 * first candidate posting list base offset that was skipped over in
+		 * the first pass (only do a second pass when this actually happened).
+		 *
+		 * The second pass over the page may deduplicate items that were
+		 * initially passed over due to concerns about limiting the
+		 * effectiveness of LP_DEAD bit setting within _bt_check_unique().
+		 * Note that the second pass will still stop deduplicating as soon as
+		 * enough space has been freed to avoid an immediate page split.
+		 */
+		Assert(state->checkingunique);
+		offnum = state->skippedbase;
+
+		state->checkingunique = false;
+		state->skippedbase = InvalidOffsetNumber;
+		state->alltupsize = 0;
+		state->nitems = 0;
+		state->base = NULL;
+		state->baseoff = InvalidOffsetNumber;
+		state->basetupsize = 0;
+		goto retry;
+	}
+
+	/* Local space accounting should agree with page accounting */
+	Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
+
+	/* be tidy */
+	pfree(state->htids);
+	pfree(state);
+}
+
+/*
+ * Create a new pending posting list tuple based on caller's tuple.
+ *
+ * Every tuple processed by the deduplication routines either becomes the base
+ * tuple for a posting list, or gets its heap TID(s) accepted into a pending
+ * posting list.  A tuple that starts out as the base tuple for a posting list
+ * will only actually be rewritten within _bt_dedup_finish_pending() when
+ * there was at least one successful call to _bt_dedup_save_htid().
+ */
+void
+_bt_dedup_start_pending(BTDedupState *state, IndexTuple base,
+						OffsetNumber baseoff)
+{
+	Assert(state->nhtids == 0);
+	Assert(state->nitems == 0);
+
+	/*
+	 * Copy heap TIDs from new base tuple for new candidate posting list into
+	 * ipd array.  Assume that we'll eventually create a new posting tuple by
+	 * merging later tuples with this existing one, though we may not.
+	 */
+	if (!BTreeTupleIsPosting(base))
+	{
+		memcpy(state->htids, base, sizeof(ItemPointerData));
+		state->nhtids = 1;
+		/* Save size of tuple without any posting list */
+		state->basetupsize = IndexTupleSize(base);
+	}
+	else
+	{
+		int			nposting;
+
+		nposting = BTreeTupleGetNPosting(base);
+		memcpy(state->htids, BTreeTupleGetPosting(base),
+			   sizeof(ItemPointerData) * nposting);
+		state->nhtids = nposting;
+		/* Save size of tuple without any posting list */
+		state->basetupsize = BTreeTupleGetPostingOffset(base);
+	}
+
+	/*
+	 * Save new base tuple itself -- it'll be needed if we actually create a
+	 * new posting list from new pending posting list.
+	 *
+	 * Must maintain size of all tuples (including line pointer overhead) to
+	 * calculate space savings on page within _bt_dedup_finish_pending().
+	 * Also, save number of base tuple logical tuples so that we can save
+	 * cycles in the common case where an existing posting list can't or won't
+	 * be merged with other tuples on the page.
+	 */
+	state->nitems = 1;
+	state->base = base;
+	state->baseoff = baseoff;
+	state->alltupsize = MAXALIGN(IndexTupleSize(base)) + sizeof(ItemIdData);
+	/* Also save baseoff in pending state for interval */
+	state->interval.baseoff = state->baseoff;
+	state->overlap = false;
+	if (state->newitem)
+	{
+		/* Might overlap with new item -- mark it as possible if it is */
+		if (BTreeTupleGetHeapTID(base) < BTreeTupleGetHeapTID(state->newitem))
+			state->overlap = true;
+	}
+}
+
+/*
+ * Save itup heap TID(s) into pending posting list where possible.
+ *
+ * Returns bool indicating if the pending posting list managed by state has
+ * itup's heap TID(s) saved.  When this is false, enlarging the pending
+ * posting list by the required amount would exceed the maxitemsize limit, so
+ * caller must finish the pending posting list tuple.  (Generally itup becomes
+ * the base tuple of caller's new pending posting list).
+ */
+bool
+_bt_dedup_save_htid(BTDedupState *state, IndexTuple itup)
+{
+	int			nhtids;
+	ItemPointer htids;
+	Size		mergedtupsz;
+
+	if (!BTreeTupleIsPosting(itup))
+	{
+		nhtids = 1;
+		htids = &itup->t_tid;
+	}
+	else
+	{
+		nhtids = BTreeTupleGetNPosting(itup);
+		htids = BTreeTupleGetPosting(itup);
+	}
+
+	/*
+	 * Don't append (have caller finish pending posting list as-is) if
+	 * appending heap TID(s) from itup would put us over limit
+	 */
+	mergedtupsz = MAXALIGN(state->basetupsize +
+						   (state->nhtids + nhtids) *
+						   sizeof(ItemPointerData));
+
+	if (mergedtupsz > state->maxitemsize)
+		return false;
+
+	/* Don't merge existing posting lists with checkingunique */
+	if (state->checkingunique &&
+		(BTreeTupleIsPosting(state->base) || nhtids > 1))
+	{
+		/* May begin here if second pass over page is required */
+		if (state->skippedbase == InvalidOffsetNumber)
+			state->skippedbase = state->baseoff;
+		return false;
+	}
+
+	if (state->overlap)
+	{
+		if (BTreeTupleGetMaxHeapTID(itup) > BTreeTupleGetHeapTID(state->newitem))
+		{
+			/*
+			 * newitem has heap TID in the range of the would-be new posting
+			 * list.  Avoid an immediate posting list split for caller.
+			 */
+			if (_bt_keep_natts_fast(state->rel, state->newitem, itup) >
+				IndexRelationGetNumberOfAttributes(state->rel))
+			{
+				state->newitem = NULL;	/* avoid unnecessary comparisons */
+				return false;
+			}
+		}
+	}
+
+	/*
+	 * Save heap TIDs to pending posting list tuple -- itup can be merged into
+	 * pending posting list
+	 */
+	state->nitems++;
+	memcpy(state->htids + state->nhtids, htids,
+		   sizeof(ItemPointerData) * nhtids);
+	state->nhtids += nhtids;
+	state->alltupsize += MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
+
+	return true;
+}
+
+/*
+ * Finalize pending posting list tuple, and add it to the page.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * Returns space saving from deduplicating to make a new posting list tuple.
+ * Note that this includes line pointer overhead.  This is zero in the case
+ * where no deduplication was possible.
+ */
+Size
+_bt_dedup_finish_pending(Buffer buffer, BTDedupState *state, bool need_wal)
+{
+	Size		spacesaving = 0;
+	Page		page = BufferGetPage(buffer);
+	int			minimum = 2;
+
+	Assert(state->nitems > 0);
+	Assert(state->nitems <= state->nhtids);
+	Assert(state->interval.baseoff == state->baseoff);
+
+	/*
+	 * Only create a posting list when at least 3 heap TIDs will appear in the
+	 * checkingunique case (checkingunique strategy won't merge existing
+	 * posting list tuples, so we know that the number of items here must also
+	 * be the total number of heap TIDs).  Creating a new posting lists with
+	 * only two heap TIDs won't even save enough space to fit another
+	 * duplicate with the same key as the posting list.  This is a bad
+	 * trade-off if there is a chance that the LP_DEAD bit can be set for
+	 * either existing tuple by putting off deduplication.
+	 *
+	 * (Note that a second pass over the page can deduplicate the item if that
+	 * is truly the only way to avoid a page split for checkingunique caller)
+	 */
+	Assert(!state->checkingunique || state->nitems == 1 ||
+		   state->nhtids == state->nitems);
+	if (state->checkingunique)
+	{
+		minimum = 3;
+		/* May begin here if second pass over page is required */
+		if (state->nitems == 2 && state->skippedbase == InvalidOffsetNumber)
+			state->skippedbase = state->baseoff;
+	}
+
+	if (state->nitems >= minimum)
+	{
+		IndexTuple	final;
+		Size		finalsz;
+		OffsetNumber offnum;
+		OffsetNumber deletable[MaxOffsetNumber];
+		int			ndeletable = 0;
+
+		/* find all tuples that will be replaced with this new posting tuple */
+		for (offnum = state->baseoff;
+			 offnum < state->baseoff + state->nitems;
+			 offnum = OffsetNumberNext(offnum))
+			deletable[ndeletable++] = offnum;
+
+		/* Form a tuple with a posting list */
+		final = _bt_form_posting(state->base, state->htids, state->nhtids);
+		finalsz = IndexTupleSize(final);
+		spacesaving = state->alltupsize - (finalsz + sizeof(ItemIdData));
+		/* Must have saved some space */
+		Assert(spacesaving > 0 && spacesaving < BLCKSZ);
+
+		/* Save final number of items for posting list */
+		state->interval.nitems = state->nitems;
+
+		Assert(finalsz <= state->maxitemsize);
+		Assert(finalsz == MAXALIGN(IndexTupleSize(final)));
+
+		START_CRIT_SECTION();
+
+		/* Delete items to replace */
+		PageIndexMultiDelete(page, deletable, ndeletable);
+		/* Insert posting tuple */
+		if (PageAddItem(page, (Item) final, finalsz, state->baseoff, false,
+						false) == InvalidOffsetNumber)
+			elog(ERROR, "deduplication failed to add tuple to page");
+
+		MarkBufferDirty(buffer);
+
+		/* Log deduplicated items */
+		if (need_wal)
+		{
+			XLogRecPtr	recptr;
+			xl_btree_dedup xlrec_dedup;
+
+			xlrec_dedup.baseoff = state->interval.baseoff;
+			xlrec_dedup.nitems = state->interval.nitems;
+
+			XLogBeginInsert();
+			XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+			XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+			recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP_PAGE);
+
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
+
+		pfree(final);
+	}
+
+	/* Reset state for next pending posting list */
+	state->nhtids = 0;
+	state->nitems = 0;
+	state->alltupsize = 0;
+
+	return spacesaving;
+}
+
+/*
+ * Build a posting list tuple from a "base" index tuple and a list of heap
+ * TIDs for posting list.
+ *
+ * Caller's "htids" array must be sorted in ascending order.  Any heap TIDs
+ * from caller's base tuple will not appear in returned posting list.
+ *
+ * If nhtids == 1, builds a non-posting tuple (posting list tuples can never
+ * have a single heap TID).
+ */
+IndexTuple
+_bt_form_posting(IndexTuple tuple, ItemPointer htids, int nhtids)
+{
+	uint32		keysize,
+				newsize = 0;
+	IndexTuple	itup;
+
+	/* We only need key part of the tuple */
+	if (BTreeTupleIsPosting(tuple))
+		keysize = BTreeTupleGetPostingOffset(tuple);
+	else
+		keysize = IndexTupleSize(tuple);
+
+	Assert(nhtids > 0);
+
+	/* Add space needed for posting list */
+	if (nhtids > 1)
+		newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nhtids;
+	else
+		newsize = keysize;
+
+	newsize = MAXALIGN(newsize);
+	itup = palloc0(newsize);
+	memcpy(itup, tuple, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+
+	if (nhtids > 1)
+	{
+		/* Form posting tuple, fill posting fields */
+
+		itup->t_info |= INDEX_ALT_TID_MASK;
+		BTreeSetPostingMeta(itup, nhtids, SHORTALIGN(keysize));
+		/* Copy posting list into the posting tuple */
+		memcpy(BTreeTupleGetPosting(itup), htids,
+			   sizeof(ItemPointerData) * nhtids);
+
+#ifdef USE_ASSERT_CHECKING
+		{
+			/* Assert that htid array is sorted and has unique TIDs */
+			ItemPointerData last;
+			ItemPointer current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+				current = BTreeTupleGetPostingN(itup, i);
+				Assert(ItemPointerCompare(current, &last) > 0);
+				ItemPointerCopy(current, &last);
+			}
+		}
+#endif
+	}
+	else
+	{
+		/* To finish building of a non-posting tuple, copy TID from htids */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		ItemPointerCopy(htids, &itup->t_tid);
+	}
+
+	return itup;
+}
+
+/*
+ * Prepare for a posting list split by swapping heap TID in newitem with heap
+ * TID from original posting list (the 'oposting' heap TID located at offset
+ * 'postingoff').
+ *
+ * Returns new posting list tuple, which is palloc()'d in caller's context.
+ * This is guaranteed to be the same size as 'oposting'.  Modified version of
+ * newitem is what caller actually inserts inside the critical section that
+ * also performs an in-place update of posting list.
+ *
+ * Explicit WAL-logging of newitem must use the original version of newitem in
+ * order to make it possible for our nbtxlog.c callers to correctly REDO
+ * original steps.  (This approach avoids any explicit WAL-logging of a
+ * posting list tuple.  This is important because posting lists are often much
+ * larger than plain tuples.)
+ */
+IndexTuple
+_bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
+				 OffsetNumber postingoff)
+{
+	int			nhtids;
+	char	   *replacepos;
+	char	   *rightpos;
+	Size		nbytes;
+	IndexTuple	nposting;
+
+	Assert(BTreeTupleIsPosting(oposting));
+	nhtids = BTreeTupleGetNPosting(oposting);
+	Assert(postingoff < nhtids);
+
+	nposting = CopyIndexTuple(oposting);
+	replacepos = (char *) BTreeTupleGetPostingN(nposting, postingoff);
+	rightpos = replacepos + sizeof(ItemPointerData);
+	nbytes = (nhtids - postingoff - 1) * sizeof(ItemPointerData);
+
+	/*
+	 * Move item pointers in posting list to make a gap for the new item's
+	 * heap TID (shift TIDs one place to the right, losing original rightmost
+	 * TID)
+	 */
+	memmove(rightpos, replacepos, nbytes);
+
+	/* Fill the gap with the TID of the new item */
+	ItemPointerCopy(&newitem->t_tid, (ItemPointer) replacepos);
+
+	/* Copy original posting list's rightmost TID into new item */
+	ItemPointerCopy(BTreeTupleGetPostingN(oposting, nhtids - 1),
+					&newitem->t_tid);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(nposting),
+							  BTreeTupleGetHeapTID(newitem)) < 0);
+	Assert(BTreeTupleGetNPosting(nposting) == BTreeTupleGetNPosting(oposting));
+
+	return nposting;
+}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b84bf1c3df..0a866b832e 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,10 +47,12 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   BTStack stack,
 						   IndexTuple itup,
 						   OffsetNumber newitemoff,
+						   int postingoff,
 						   bool split_only_page);
 static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 						Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
-						IndexTuple newitem);
+						IndexTuple newitem, IndexTuple orignewitem,
+						IndexTuple nposting, OffsetNumber postingoff);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 							  BTStack stack, bool is_root, bool is_only);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
@@ -61,7 +63,8 @@ static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
  *	_bt_doinsert() -- Handle insertion of a single index tuple in the tree.
  *
  *		This routine is called by the public interface routine, btinsert.
- *		By here, itup is filled in, including the TID.
+ *		By here, itup is filled in, including the TID.  Caller should be
+ *		prepared for us to scribble on 'itup'.
  *
  *		If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
  *		will allow duplicates.  Otherwise (UNIQUE_CHECK_YES or
@@ -123,6 +126,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	/* PageAddItem will MAXALIGN(), but be consistent */
 	insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 	insertstate.itup_key = itup_key;
+	insertstate.postingoff = 0;
 	insertstate.bounds_valid = false;
 	insertstate.buf = InvalidBuffer;
 
@@ -300,7 +304,7 @@ top:
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
 		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
-					   itup, newitemoff, false);
+					   itup, newitemoff, insertstate.postingoff, false);
 	}
 	else
 	{
@@ -353,6 +357,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 	BTPageOpaque opaque;
 	Buffer		nbuf = InvalidBuffer;
 	bool		found = false;
+	bool		inposting = false;
+	bool		prev_all_dead = true;
+	int			curposti = 0;
 
 	/* Assume unique until we find a duplicate */
 	*is_unique = true;
@@ -374,6 +381,11 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 
 	/*
 	 * Scan over all equal tuples, looking for live conflicts.
+	 *
+	 * Note that each iteration of the loop processes one heap TID, not one
+	 * index tuple.  The page offset number won't be advanced for iterations
+	 * which process heap TIDs from posting list tuples until the last such
+	 * heap TID for the posting list (curposti will be advanced instead).
 	 */
 	Assert(!insertstate->bounds_valid || insertstate->low == offset);
 	Assert(!itup_key->anynullkeys);
@@ -435,7 +447,27 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 
 				/* okay, we gotta fetch the heap tuple ... */
 				curitup = (IndexTuple) PageGetItem(page, curitemid);
-				htid = curitup->t_tid;
+
+				/*
+				 * decide if this is the first heap TID in tuple we'll
+				 * process, or if we should continue to process current
+				 * posting list
+				 */
+				if (!BTreeTupleIsPosting(curitup))
+				{
+					htid = curitup->t_tid;
+					inposting = false;
+				}
+				else if (!inposting)
+				{
+					/* First heap TID in posting list */
+					inposting = true;
+					prev_all_dead = true;
+					curposti = 0;
+				}
+
+				if (inposting)
+					htid = *BTreeTupleGetPostingN(curitup, curposti);
 
 				/*
 				 * If we are doing a recheck, we expect to find the tuple we
@@ -511,8 +543,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					 * not part of this chain because it had a different index
 					 * entry.
 					 */
-					htid = itup->t_tid;
-					if (table_index_fetch_tuple_check(heapRel, &htid,
+					if (table_index_fetch_tuple_check(heapRel, &itup->t_tid,
 													  SnapshotSelf, NULL))
 					{
 						/* Normal case --- it's still live */
@@ -570,12 +601,14 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 													RelationGetRelationName(rel))));
 					}
 				}
-				else if (all_dead)
+				else if (all_dead && (!inposting ||
+									  (prev_all_dead &&
+									   curposti == BTreeTupleGetNPosting(curitup) - 1)))
 				{
 					/*
-					 * The conflicting tuple (or whole HOT chain) is dead to
-					 * everyone, so we may as well mark the index entry
-					 * killed.
+					 * The conflicting tuple (or all HOT chains pointed to by
+					 * all posting list TIDs) is dead to everyone, so mark the
+					 * index entry killed.
 					 */
 					ItemIdMarkDead(curitemid);
 					opaque->btpo_flags |= BTP_HAS_GARBAGE;
@@ -589,14 +622,29 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					else
 						MarkBufferDirtyHint(insertstate->buf, true);
 				}
+
+				/*
+				 * Remember if posting list tuple has even a single HOT chain
+				 * whose members are not all dead
+				 */
+				if (!all_dead && inposting)
+					prev_all_dead = false;
 			}
 		}
 
-		/*
-		 * Advance to next tuple to continue checking.
-		 */
-		if (offset < maxoff)
+		if (inposting && curposti < BTreeTupleGetNPosting(curitup) - 1)
+		{
+			/* Advance to next TID in same posting list */
+			curposti++;
+			continue;
+		}
+		else if (offset < maxoff)
+		{
+			/* Advance to next tuple */
+			curposti = 0;
+			inposting = false;
 			offset = OffsetNumberNext(offset);
+		}
 		else
 		{
 			int			highkeycmp;
@@ -621,6 +669,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					elog(ERROR, "fell off the end of index \"%s\"",
 						 RelationGetRelationName(rel));
 			}
+			curposti = 0;
+			inposting = false;
 			maxoff = PageGetMaxOffsetNumber(page);
 			offset = P_FIRSTDATAKEY(opaque);
 			/* Don't invalidate binary search bounds */
@@ -689,6 +739,7 @@ _bt_findinsertloc(Relation rel,
 	BTScanInsert itup_key = insertstate->itup_key;
 	Page		page = BufferGetPage(insertstate->buf);
 	BTPageOpaque lpageop;
+	OffsetNumber location;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -751,13 +802,26 @@ _bt_findinsertloc(Relation rel,
 
 		/*
 		 * If the target page is full, see if we can obtain enough space by
-		 * erasing LP_DEAD items
+		 * erasing LP_DEAD items.  If that doesn't work out, and if the index
+		 * deduplication is both possible and enabled, try deduplication.
 		 */
-		if (PageGetFreeSpace(page) < insertstate->itemsz &&
-			P_HAS_GARBAGE(lpageop))
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
 		{
-			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
-			insertstate->bounds_valid = false;
+			if (P_HAS_GARBAGE(lpageop))
+			{
+				_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+				insertstate->bounds_valid = false;
+			}
+
+			if (insertstate->itup_key->safededup &&
+				BtreeGetDoDedupOption(rel) &&
+				PageGetFreeSpace(page) < insertstate->itemsz)
+			{
+				_bt_dedup_one_page(rel, insertstate->buf, heapRel,
+								   insertstate->itup, insertstate->itemsz,
+								   checkingunique);
+				insertstate->bounds_valid = false;
+			}
 		}
 	}
 	else
@@ -839,7 +903,38 @@ _bt_findinsertloc(Relation rel,
 	Assert(P_RIGHTMOST(lpageop) ||
 		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
 
-	return _bt_binsrch_insert(rel, insertstate);
+	location = _bt_binsrch_insert(rel, insertstate);
+
+	/*
+	 * Insertion is not prepared for the case where an LP_DEAD posting list
+	 * tuple must be split.  In the unlikely event that this happens, call
+	 * _bt_dedup_one_page() to force it to kill all LP_DEAD items.
+	 */
+	if (unlikely(insertstate->postingoff == -1))
+	{
+		Assert(insertstate->itup_key->safededup);
+
+		/*
+		 * Don't check if the option is enabled, since no actual deduplication
+		 * will be done, just cleanup.
+		 */
+		_bt_dedup_one_page(rel, insertstate->buf, heapRel, insertstate->itup,
+						   0, checkingunique);
+		Assert(!P_HAS_GARBAGE(lpageop));
+
+		/* Must reset insertstate ahead of new _bt_binsrch_insert() call */
+		insertstate->bounds_valid = false;
+		insertstate->postingoff = 0;
+		location = _bt_binsrch_insert(rel, insertstate);
+
+		/*
+		 * Might still have to split some other posting list now, but that
+		 * should never be LP_DEAD
+		 */
+		Assert(insertstate->postingoff >= 0);
+	}
+
+	return location;
 }
 
 /*
@@ -905,10 +1000,12 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		This recursive procedure does the following things:
  *
+ *			+  if necessary, splits an existing posting list on page.
+ *			   This is only needed when 'postingoff' is non-zero.
  *			+  if necessary, splits the target page, using 'itup_key' for
  *			   suffix truncation on leaf pages (caller passes NULL for
  *			   non-leaf pages).
- *			+  inserts the tuple.
+ *			+  inserts the new tuple (could be from split posting list).
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
  *			   right using information stored in the parent stack).
@@ -918,7 +1015,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		On entry, we must have the correct buffer in which to do the
  *		insertion, and the buffer must be pinned and write-locked.  On return,
- *		we will have dropped both the pin and the lock on the buffer.
+ *		we will have dropped both the pin and the lock on the buffer.  Caller
+ *		should be prepared for us to scribble on 'itup'.
  *
  *		This routine only performs retail tuple insertions.  'itup' should
  *		always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +1034,15 @@ _bt_insertonpg(Relation rel,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
+			   int postingoff,
 			   bool split_only_page)
 {
 	Page		page;
 	BTPageOpaque lpageop;
 	Size		itemsz;
+	IndexTuple	oposting;
+	IndexTuple	origitup = NULL;
+	IndexTuple	nposting = NULL;
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1056,8 @@ _bt_insertonpg(Relation rel,
 	Assert(P_ISLEAF(lpageop) ||
 		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
+	/* retail insertions of posting list tuples are disallowed */
+	Assert(!BTreeTupleIsPosting(itup));
 
 	/* The caller should've finished any incomplete splits already. */
 	if (P_INCOMPLETE_SPLIT(lpageop))
@@ -964,6 +1068,43 @@ _bt_insertonpg(Relation rel,
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
+	/*
+	 * Do we need to split an existing posting list item?
+	 */
+	if (postingoff != 0)
+	{
+		ItemId		itemid = PageGetItemId(page, newitemoff);
+
+		/*
+		 * The new tuple is a duplicate with a heap TID that falls inside the
+		 * range of an existing posting list tuple on a leaf page.  Prepare to
+		 * split an existing posting list by swapping new item's heap TID with
+		 * the rightmost heap TID from original posting list, and generating a
+		 * new version of the posting list that has new item's heap TID.
+		 *
+		 * Posting list splits work by modifying the overlapping posting list
+		 * as part of the same atomic operation that inserts the "new item".
+		 * The space accounting is kept simple, since it does not need to
+		 * consider posting list splits at all (this is particularly important
+		 * for the case where we also have to split the page).  Overwriting
+		 * the posting list with its post-split version is treated as an extra
+		 * step in either the insert or page split critical section.
+		 */
+		Assert(P_ISLEAF(lpageop));
+		Assert(!ItemIdIsDead(itemid));
+		Assert(postingoff > 0);
+		oposting = (IndexTuple) PageGetItem(page, itemid);
+
+		/* save a copy of itup with unchanged TID for xlog record */
+		origitup = CopyIndexTuple(itup);
+		nposting = _bt_swap_posting(itup, oposting, postingoff);
+
+		Assert(BTreeTupleGetNPosting(nposting) ==
+			   BTreeTupleGetNPosting(oposting));
+		/* Alter offset so that it goes after existing posting list */
+		newitemoff = OffsetNumberNext(newitemoff);
+	}
+
 	/*
 	 * Do we need to split the page to fit the item on it?
 	 *
@@ -996,7 +1137,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+						 origitup, nposting, postingoff);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1075,6 +1217,18 @@ _bt_insertonpg(Relation rel,
 			elog(PANIC, "failed to add new item to block %u in index \"%s\"",
 				 itup_blkno, RelationGetRelationName(rel));
 
+		if (nposting)
+		{
+			/*
+			 * Posting list split requires an in-place update of the existing
+			 * posting list
+			 */
+			Assert(P_ISLEAF(lpageop));
+			Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+				   MAXALIGN(IndexTupleSize(nposting)));
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+		}
+
 		MarkBufferDirty(buf);
 
 		if (BufferIsValid(metabuf))
@@ -1116,6 +1270,7 @@ _bt_insertonpg(Relation rel,
 			XLogRecPtr	recptr;
 
 			xlrec.offnum = itup_off;
+			xlrec.postingoff = postingoff;
 
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
@@ -1144,6 +1299,7 @@ _bt_insertonpg(Relation rel,
 				xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 				xlmeta.last_cleanup_num_heap_tuples =
 					metad->btm_last_cleanup_num_heap_tuples;
+				xlmeta.btm_safededup = metad->btm_safededup;
 
 				XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 				XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));
@@ -1152,7 +1308,19 @@ _bt_insertonpg(Relation rel,
 			}
 
 			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
-			XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+
+			/*
+			 * We always write newitem to the page, but when there is an
+			 * original newitem due to a posting list split then we log the
+			 * original item instead.  REDO routine must reconstruct the final
+			 * newitem at the same time it reconstructs nposting.
+			 */
+			if (postingoff == 0)
+				XLogRegisterBufData(0, (char *) itup,
+									IndexTupleSize(itup));
+			else
+				XLogRegisterBufData(0, (char *) origitup,
+									IndexTupleSize(origitup));
 
 			recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
@@ -1194,6 +1362,13 @@ _bt_insertonpg(Relation rel,
 			_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
 			RelationSetTargetBlock(rel, cachedBlock);
 	}
+
+	/* be tidy */
+	if (postingoff != 0)
+	{
+		pfree(nposting);
+		pfree(origitup);
+	}
 }
 
 /*
@@ -1209,12 +1384,25 @@ _bt_insertonpg(Relation rel,
  *		This function will clear the INCOMPLETE_SPLIT flag on it, and
  *		release the buffer.
  *
+ *		orignewitem, nposting, and postingoff are needed when an insert of
+ *		orignewitem results in both a posting list split and a page split.
+ *		newitem and nposting are replacements for orignewitem and the
+ *		existing posting list on the page respectively.  These extra
+ *		posting list split details are used here in the same way as they
+ *		are used in the more common case where a posting list split does
+ *		not coincide with a page split.  We need to deal with posting list
+ *		splits directly in order to ensure that everything that follows
+ *		from the insert of orignewitem is handled as a single atomic
+ *		operation (though caller's insert of a new pivot/downlink into
+ *		parent page will still be a separate operation).
+ *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
  */
 static Buffer
 _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+		  IndexTuple orignewitem, IndexTuple nposting, OffsetNumber postingoff)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1236,12 +1424,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	OffsetNumber firstright;
 	OffsetNumber maxoff;
 	OffsetNumber i;
+	OffsetNumber replacepostingoff = InvalidOffsetNumber;
 	bool		newitemonleft,
 				isleaf;
 	IndexTuple	lefthikey;
 	int			indnatts = IndexRelationGetNumberOfAttributes(rel);
 	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
+	/*
+	 * Determine offset number of existing posting list on page when a split
+	 * of a posting list needs to take place as the page is split
+	 */
+	if (nposting != NULL)
+	{
+		Assert(itup_key->heapkeyspace);
+		replacepostingoff = OffsetNumberPrev(newitemoff);
+	}
+
 	/*
 	 * origpage is the original page to be split.  leftpage is a temporary
 	 * buffer that receives the left-sibling data, which will be copied back
@@ -1273,6 +1472,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	 * newitemoff == firstright.  In all other cases it's clear which side of
 	 * the split every tuple goes on from context.  newitemonleft is usually
 	 * (but not always) redundant information.
+	 *
+	 * Note: In theory, the split point choice logic should operate against a
+	 * version of the page that already replaced the posting list at offset
+	 * replacepostingoff with nposting where applicable.  We don't bother with
+	 * that, though.  Both versions of the posting list must be the same size,
+	 * and both will have the same base tuple key values, so split point
+	 * choice is never affected.
 	 */
 	firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
 								  newitem, &newitemonleft);
@@ -1340,6 +1546,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, firstright);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		/* Behave as if origpage posting list has already been swapped */
+		if (firstright == replacepostingoff)
+			item = nposting;
 	}
 
 	/*
@@ -1373,6 +1582,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
 			itemid = PageGetItemId(origpage, lastleftoff);
 			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+			/* Behave as if origpage posting list has already been swapped */
+			if (lastleftoff == replacepostingoff)
+				lastleft = nposting;
 		}
 
 		Assert(lastleft != item);
@@ -1480,8 +1692,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 
+		/*
+		 * did caller pass new replacement posting list tuple due to posting
+		 * list split?
+		 */
+		if (i == replacepostingoff)
+		{
+			/*
+			 * swap origpage posting list with post-posting-list-split version
+			 * from caller
+			 */
+			Assert(isleaf);
+			Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+			item = nposting;
+		}
+
 		/* does new item belong before this one? */
-		if (i == newitemoff)
+		else if (i == newitemoff)
 		{
 			if (newitemonleft)
 			{
@@ -1650,8 +1877,12 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		XLogRecPtr	recptr;
 
 		xlrec.level = ropaque->btpo.level;
+		/* See comments below on newitem, orignewitem, and posting lists */
 		xlrec.firstright = firstright;
 		xlrec.newitemoff = newitemoff;
+		xlrec.postingoff = InvalidOffsetNumber;
+		if (replacepostingoff < firstright)
+			xlrec.postingoff = postingoff;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1670,11 +1901,45 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		 * because it's included with all the other items on the right page.)
 		 * Show the new item as belonging to the left page buffer, so that it
 		 * is not stored if XLogInsert decides it needs a full-page image of
-		 * the left page.  We store the offset anyway, though, to support
-		 * archive compression of these records.
+		 * the left page.  We always store newitemoff in record, though.
+		 *
+		 * The details are often slightly different for page splits that
+		 * coincide with a posting list split.  If both the replacement
+		 * posting list and newitem go on the right page, then we don't need
+		 * to log anything extra, just like the simple !newitemonleft
+		 * no-posting-split case (postingoff isn't set in the WAL record, so
+		 * recovery can't even tell the difference).  Otherwise, we set
+		 * postingoff and log orignewitem instead of newitem, despite having
+		 * actually inserted newitem.  Recovery must reconstruct nposting and
+		 * newitem by calling _bt_swap_posting().
+		 *
+		 * Note: It's possible that our page split point is the point that
+		 * makes the posting list lastleft and newitem firstright.  This is
+		 * the only case where we log orignewitem despite newitem going on the
+		 * right page.  If XLogInsert decides that it can omit orignewitem due
+		 * to logging a full-page image of the left page, everything still
+		 * works out, since recovery only needs to log orignewitem for items
+		 * on the left page (just like the regular newitem-logged case).
 		 */
-		if (newitemonleft)
-			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+		if (newitemonleft || xlrec.postingoff != InvalidOffsetNumber)
+		{
+			if (xlrec.postingoff == InvalidOffsetNumber)
+			{
+				/* Must WAL-log newitem, since it's on left page */
+				Assert(newitemonleft);
+				Assert(orignewitem == NULL && nposting == NULL);
+				XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+			}
+			else
+			{
+				/* Must WAL-log orignewitem following posting list split */
+				Assert(newitemonleft || firstright == newitemoff);
+				Assert(ItemPointerCompare(&orignewitem->t_tid,
+										  &newitem->t_tid) < 0);
+				XLogRegisterBufData(0, (char *) orignewitem,
+									MAXALIGN(IndexTupleSize(orignewitem)));
+			}
+		}
 
 		/* Log the left page's new high key */
 		itemid = PageGetItemId(origpage, P_HIKEY);
@@ -1834,7 +2099,7 @@ _bt_insert_parent(Relation rel,
 
 		/* Recursively insert into the parent */
 		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
-					   new_item, stack->bts_offset + 1,
+					   new_item, stack->bts_offset + 1, 0,
 					   is_only);
 
 		/* be tidy */
@@ -2190,6 +2455,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 		md.fastlevel = metad->btm_level;
 		md.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 		md.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+		md.btm_safededup = metad->btm_safededup;
 
 		XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -2304,6 +2570,6 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * Note: if we didn't find any LP_DEAD items, then the page's
 	 * BTP_HAS_GARBAGE hint bit is falsely set.  We do not bother expending a
 	 * separate write to clear it, however.  We will clear it when we split
-	 * the page.
+	 * the page (or when deduplication runs).
 	 */
 }
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 268f869a36..ca25e856e7 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -42,12 +43,18 @@ static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
 								   BlockNumber *target, BlockNumber *rightsib);
 static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
 							   TransactionId latestRemovedXid);
+static TransactionId _bt_compute_xid_horizon_for_tuples(Relation rel,
+														Relation heapRel,
+														Buffer buf,
+														OffsetNumber *itemnos,
+														int nitems);
 
 /*
  *	_bt_initmetapage() -- Fill a page buffer with a correct metapage image
  */
 void
-_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
+_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+				 bool safededup)
 {
 	BTMetaPageData *metad;
 	BTPageOpaque metaopaque;
@@ -63,6 +70,7 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
 	metad->btm_fastlevel = level;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+	metad->btm_safededup = safededup;
 
 	metaopaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	metaopaque->btpo_flags = BTP_META;
@@ -102,6 +110,7 @@ _bt_upgrademetapage(Page page)
 	metad->btm_version = BTREE_NOVAC_VERSION;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+	metad->btm_safededup = false;
 
 	/* Adjust pd_lower (see _bt_initmetapage() for details) */
 	((PageHeader) page)->pd_lower =
@@ -213,6 +222,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 		md.fastlevel = metad->btm_fastlevel;
 		md.oldest_btpo_xact = oldestBtpoXact;
 		md.last_cleanup_num_heap_tuples = numHeapTuples;
+		md.btm_safededup = metad->btm_safededup;
 
 		XLogRegisterBufData(0, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -394,6 +404,7 @@ _bt_getroot(Relation rel, int access)
 			md.fastlevel = 0;
 			md.oldest_btpo_xact = InvalidTransactionId;
 			md.last_cleanup_num_heap_tuples = -1.0;
+			md.btm_safededup = metad->btm_safededup;
 
 			XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -683,6 +694,59 @@ _bt_heapkeyspace(Relation rel)
 	return metad->btm_version > BTREE_NOVAC_VERSION;
 }
 
+/*
+ *	_bt_safededup() -- can deduplication safely be used by index?
+ *
+ * Uses field from index relation's metapage/cached metapage.
+ */
+bool
+_bt_safededup(Relation rel)
+{
+	BTMetaPageData *metad;
+
+	if (rel->rd_amcache == NULL)
+	{
+		Buffer		metabuf;
+
+		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+		metad = _bt_getmeta(rel, metabuf);
+
+		/*
+		 * If there's no root page yet, _bt_getroot() doesn't expect a cache
+		 * to be made, so just stop here.  (XXX perhaps _bt_getroot() should
+		 * be changed to allow this case.)
+		 *
+		 * FIXME: Think some more about pg_upgrade'd !heapkeyspace indexes
+		 * here, and the need for a version bump to go with new metapage
+		 * field.  I think that we may need to bump the major version because
+		 * even v4 indexes (those built on Postgres 12) will have garbage in
+		 * the new safedup field.  Creating a v5 would mean "new field can be
+		 * trusted to not be garbage".
+		 */
+		if (metad->btm_root == P_NONE)
+		{
+			_bt_relbuf(rel, metabuf);
+			return metad->btm_safededup;;
+		}
+
+		/* Cache the metapage data for next time */
+		rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
+											 sizeof(BTMetaPageData));
+		memcpy(rel->rd_amcache, metad, sizeof(BTMetaPageData));
+		_bt_relbuf(rel, metabuf);
+	}
+
+	/* Get cached page */
+	metad = (BTMetaPageData *) rel->rd_amcache;
+	/* We shouldn't have cached it if any of these fail */
+	Assert(metad->btm_magic == BTREE_MAGIC);
+	Assert(metad->btm_version >= BTREE_MIN_VERSION);
+	Assert(metad->btm_version <= BTREE_VERSION);
+	Assert(metad->btm_fastroot != P_NONE);
+
+	return metad->btm_safededup;
+}
+
 /*
  *	_bt_checkpage() -- Verify that a freshly-read page looks sane.
  */
@@ -983,14 +1047,52 @@ _bt_page_recyclable(Page page)
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
 					OffsetNumber *itemnos, int nitems,
+					OffsetNumber *updateitemnos,
+					IndexTuple *updated, int nupdatable,
 					BlockNumber lastBlockVacuumed)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	Size		itemsz;
+	Size		updated_sz = 0;
+	char	   *updated_buf = NULL;
+
+	/* XLOG stuff, buffer for updateds */
+	if (nupdatable > 0 && RelationNeedsWAL(rel))
+	{
+		Size		offset = 0;
+
+		for (int i = 0; i < nupdatable; i++)
+			updated_sz += MAXALIGN(IndexTupleSize(updated[i]));
+
+		updated_buf = palloc(updated_sz);
+		for (int i = 0; i < nupdatable; i++)
+		{
+			itemsz = IndexTupleSize(updated[i]);
+			memcpy(updated_buf + offset, (char *) updated[i], itemsz);
+			offset += MAXALIGN(itemsz);
+		}
+		Assert(offset == updated_sz);
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
+	/* Handle posting tuples here */
+	for (int i = 0; i < nupdatable; i++)
+	{
+		/* At first, delete the old tuple. */
+		PageIndexTupleDelete(page, updateitemnos[i]);
+
+		itemsz = IndexTupleSize(updated[i]);
+		itemsz = MAXALIGN(itemsz);
+
+		/* Add tuple with updated ItemPointers to the page. */
+		if (PageAddItem(page, (Item) updated[i], itemsz, updateitemnos[i],
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to rewrite posting list item in index while doing vacuum");
+	}
+
 	/* Fix the page */
 	if (nitems > 0)
 		PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1122,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+		xlrec_vacuum.nupdated = nupdatable;
+		xlrec_vacuum.ndeleted = nitems;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1137,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		if (nitems > 0)
 			XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
 
+		/*
+		 * Here we should save offnums and updated tuples themselves. It's
+		 * important to restore them in correct order. At first, we must
+		 * handle updated tuples and only after that other deleted items.
+		 */
+		if (nupdatable > 0)
+		{
+			Assert(updated_buf != NULL);
+			XLogRegisterBufData(0, (char *) updateitemnos,
+								nupdatable * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, updated_buf, updated_sz);
+		}
+
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
 		PageSetLSN(page, recptr);
@@ -1041,6 +1158,91 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	END_CRIT_SECTION();
 }
 
+/*
+ * Get the latestRemovedXid from the table entries pointed at by the index
+ * tuples being deleted.
+ *
+ * This is a version of index_compute_xid_horizon_for_tuples() specialized to
+ * nbtree, which can handle posting lists.
+ */
+static TransactionId
+_bt_compute_xid_horizon_for_tuples(Relation rel, Relation heapRel,
+								   Buffer buf, OffsetNumber *itemnos,
+								   int nitems)
+{
+	ItemPointer htids;
+	TransactionId latestRemovedXid = InvalidTransactionId;
+	Page		page = BufferGetPage(buf);
+	int			arraynitems;
+	int			finalnitems;
+
+	/*
+	 * Initial size of array can fit everything when it turns out that are no
+	 * posting lists
+	 */
+	arraynitems = nitems;
+	htids = (ItemPointer) palloc(sizeof(ItemPointerData) * arraynitems);
+
+	finalnitems = 0;
+	/* identify what the index tuples about to be deleted point to */
+	for (int i = 0; i < nitems; i++)
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+
+		itemid = PageGetItemId(page, itemnos[i]);
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(ItemIdIsDead(itemid));
+
+		if (!BTreeTupleIsPosting(itup))
+		{
+			/* Make sure that we have space for additional heap TID */
+			if (finalnitems + 1 > arraynitems)
+			{
+				arraynitems = arraynitems * 2;
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			Assert(ItemPointerIsValid(&itup->t_tid));
+			ItemPointerCopy(&itup->t_tid, &htids[finalnitems]);
+			finalnitems++;
+		}
+		else
+		{
+			int			nposting = BTreeTupleGetNPosting(itup);
+
+			/* Make sure that we have space for additional heap TIDs */
+			if (finalnitems + nposting > arraynitems)
+			{
+				arraynitems = Max(arraynitems * 2, finalnitems + nposting);
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			for (int j = 0; j < nposting; j++)
+			{
+				ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+				Assert(ItemPointerIsValid(htid));
+				ItemPointerCopy(htid, &htids[finalnitems]);
+				finalnitems++;
+			}
+		}
+	}
+
+	Assert(finalnitems >= nitems);
+
+	/* determine the actual xid horizon */
+	latestRemovedXid =
+		table_compute_xid_horizon_for_tuples(heapRel, htids, finalnitems);
+
+	pfree(htids);
+
+	return latestRemovedXid;
+}
+
 /*
  * Delete item(s) from a btree page during single-page cleanup.
  *
@@ -1067,8 +1269,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 
 	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
 		latestRemovedXid =
-			index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
-												 itemnos, nitems);
+			_bt_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+											   itemnos, nitems);
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
@@ -2066,6 +2268,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 			xlmeta.fastlevel = metad->btm_fastlevel;
 			xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 			xlmeta.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+			xlmeta.btm_safededup = metad->btm_safededup;
 
 			XLogRegisterBufData(4, (char *) &xlmeta, sizeof(xl_btree_metadata));
 			xlinfo = XLOG_BTREE_UNLINK_PAGE_META;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd5289ad..d3f1b4ad27 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static ItemPointer btreevacuumposting(BTVacState *vstate, IndexTuple itup,
+									  int *nremaining);
 
 
 /*
@@ -160,7 +162,7 @@ btbuildempty(Relation index)
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, P_NONE, 0);
+	_bt_initmetapage(metapage, P_NONE, 0, _bt_opclasses_support_dedup(index));
 
 	/*
 	 * Write the page and log it.  It might seem that an immediate sync would
@@ -263,8 +265,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 				 */
 				if (so->killedItems == NULL)
 					so->killedItems = (int *)
-						palloc(MaxIndexTuplesPerPage * sizeof(int));
-				if (so->numKilled < MaxIndexTuplesPerPage)
+						palloc(MaxPostingIndexTuplesPerPage * sizeof(int));
+				if (so->numKilled < MaxPostingIndexTuplesPerPage)
 					so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 			}
 
@@ -816,7 +818,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 	}
 	else
 	{
-		StdRdOptions *relopts;
+		BtreeOptions *relopts;
 		float8		cleanup_scale_factor;
 		float8		prev_num_heap_tuples;
 
@@ -827,7 +829,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 		 * tuples exceeds vacuum_cleanup_index_scale_factor fraction of
 		 * original tuples count.
 		 */
-		relopts = (StdRdOptions *) info->index->rd_options;
+		relopts = (BtreeOptions *) info->index->rd_options;
 		cleanup_scale_factor = (relopts &&
 								relopts->vacuum_cleanup_index_scale_factor >= 0)
 			? relopts->vacuum_cleanup_index_scale_factor
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 								 RBM_NORMAL, info->strategy);
 		LockBufferForCleanup(buf);
 		_bt_checkpage(rel, buf);
-		_bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+		_bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+							vstate.lastBlockVacuumed);
 		_bt_relbuf(rel, buf);
 	}
 
@@ -1188,8 +1191,17 @@ restart:
 	}
 	else if (P_ISLEAF(opaque))
 	{
+		/* Deletable item state */
 		OffsetNumber deletable[MaxOffsetNumber];
 		int			ndeletable;
+		int			nhtidsdead;
+		int			nhtidslive;
+
+		/* Updatable item state (for posting lists) */
+		IndexTuple	updated[MaxOffsetNumber];
+		OffsetNumber updatable[MaxOffsetNumber];
+		int			nupdatable;
+
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
@@ -1229,6 +1241,10 @@ restart:
 		 * callback function.
 		 */
 		ndeletable = 0;
+		nupdatable = 0;
+		/* Maintain stats counters for index tuple versions/heap TIDs */
+		nhtidsdead = 0;
+		nhtidslive = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
 		if (callback)
@@ -1238,11 +1254,9 @@ restart:
 				 offnum = OffsetNumberNext(offnum))
 			{
 				IndexTuple	itup;
-				ItemPointer htup;
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
 				/*
 				 * During Hot Standby we currently assume that
@@ -1265,8 +1279,71 @@ restart:
 				 * applies to *any* type of index that marks index tuples as
 				 * killed.
 				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Regular tuple, standard heap TID representation */
+					ItemPointer htid = &(itup->t_tid);
+
+					if (callback(htid, callback_state))
+					{
+						deletable[ndeletable++] = offnum;
+						nhtidsdead++;
+					}
+					else
+						nhtidslive++;
+				}
+				else
+				{
+					ItemPointer newhtids;
+					int			nremaining;
+
+					/*
+					 * Posting list tuple, a physical tuple that represents
+					 * two or more logical tuples, any of which could be an
+					 * index row version that must be removed
+					 */
+					newhtids = btreevacuumposting(vstate, itup, &nremaining);
+					if (newhtids == NULL)
+					{
+						/*
+						 * All TIDs/logical tuples from the posting tuple
+						 * remain, so no update or delete required
+						 */
+						Assert(nremaining == BTreeTupleGetNPosting(itup));
+					}
+					else if (nremaining > 0)
+					{
+						IndexTuple	updatedtuple;
+
+						/*
+						 * Form new tuple that contains only remaining TIDs.
+						 * Remember this tuple and the offset of the old tuple
+						 * for when we update it in place
+						 */
+						Assert(nremaining < BTreeTupleGetNPosting(itup));
+						updatedtuple = _bt_form_posting(itup, newhtids,
+														nremaining);
+						updated[nupdatable] = updatedtuple;
+						updatable[nupdatable++] = offnum;
+						nhtidsdead += BTreeTupleGetNPosting(itup) - nremaining;
+						pfree(newhtids);
+					}
+					else
+					{
+						/*
+						 * All TIDs/logical tuples from the posting list must
+						 * be deleted.  We'll delete the physical tuple
+						 * completely.
+						 */
+						deletable[ndeletable++] = offnum;
+						nhtidsdead += BTreeTupleGetNPosting(itup);
+
+						/* Free empty array of live items */
+						pfree(newhtids);
+					}
+
+					nhtidslive += nremaining;
+				}
 			}
 		}
 
@@ -1274,7 +1351,7 @@ restart:
 		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
 		 * call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nupdatable > 0)
 		{
 			/*
 			 * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1290,7 +1367,8 @@ restart:
 			 * doesn't seem worth the amount of bookkeeping it'd take to avoid
 			 * that.
 			 */
-			_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+			_bt_delitems_vacuum(rel, buf, deletable, ndeletable, updatable,
+								updated, nupdatable,
 								vstate->lastBlockVacuumed);
 
 			/*
@@ -1300,7 +1378,7 @@ restart:
 			if (blkno > vstate->lastBlockVacuumed)
 				vstate->lastBlockVacuumed = blkno;
 
-			stats->tuples_removed += ndeletable;
+			stats->tuples_removed += nhtidsdead;
 			/* must recompute maxoff */
 			maxoff = PageGetMaxOffsetNumber(page);
 		}
@@ -1315,6 +1393,7 @@ restart:
 			 * We treat this like a hint-bit update because there's no need to
 			 * WAL-log it.
 			 */
+			Assert(nhtidsdead == 0);
 			if (vstate->cycleid != 0 &&
 				opaque->btpo_cycleid == vstate->cycleid)
 			{
@@ -1324,15 +1403,16 @@ restart:
 		}
 
 		/*
-		 * If it's now empty, try to delete; else count the live tuples. We
-		 * don't delete when recursing, though, to avoid putting entries into
+		 * If it's now empty, try to delete; else count the live tuples (live
+		 * heap TIDs in posting lists are counted as live tuples).  We don't
+		 * delete when recursing, though, to avoid putting entries into
 		 * freePages out-of-order (doesn't seem worth any extra code to handle
 		 * the case).
 		 */
 		if (minoff > maxoff)
 			delete_now = (blkno == orig_blkno);
 		else
-			stats->num_index_tuples += maxoff - minoff + 1;
+			stats->num_index_tuples += nhtidslive;
 	}
 
 	if (delete_now)
@@ -1375,6 +1455,68 @@ restart:
 	}
 }
 
+/*
+ * btreevacuumposting() -- determines which logical tuples must remain when
+ * VACUUMing a posting list tuple.
+ *
+ * Returns new palloc'd array of item pointers needed to build replacement
+ * posting list without the index row versions that are to be deleted.
+ *
+ * Note that returned array is NULL in the common case where there is nothing
+ * to delete in caller's posting list tuple.  The number of TIDs that should
+ * remain in the posting list tuple is set for caller in *nremaining.  This is
+ * also the size of the returned array (though only when array isn't just
+ * NULL).
+ */
+static ItemPointer
+btreevacuumposting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+	int			live = 0;
+	int			nitem = BTreeTupleGetNPosting(itup);
+	ItemPointer tmpitems = NULL,
+				items = BTreeTupleGetPosting(itup);
+
+	Assert(BTreeTupleIsPosting(itup));
+
+	/*
+	 * Check each tuple in the posting list.  Save live tuples into tmpitems,
+	 * though try to avoid memory allocation as an optimization.
+	 */
+	for (int i = 0; i < nitem; i++)
+	{
+		if (!vstate->callback(items + i, vstate->callback_state))
+		{
+			/*
+			 * Live heap TID.
+			 *
+			 * Only save live TID when we know that we're going to have to
+			 * kill at least one TID, and have already allocated memory.
+			 */
+			if (tmpitems)
+				tmpitems[live] = items[i];
+			live++;
+		}
+
+		/* Dead heap TID */
+		else if (tmpitems == NULL)
+		{
+			/*
+			 * Turns out we need to delete one or more dead heap TIDs, so
+			 * start maintaining an array of live TIDs for caller to
+			 * reconstruct smaller replacement posting list tuple
+			 */
+			tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+			/* Copy live heap TIDs from previous loop iterations */
+			if (live > 0)
+				memcpy(tmpitems, items, sizeof(ItemPointerData) * live);
+		}
+	}
+
+	*nremaining = live;
+	return tmpitems;
+}
+
 /*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8e512461a0..561b642b1d 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int	_bt_binsrch_posting(BTScanInsert key, Page page,
+								OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static void _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+								  OffsetNumber offnum, ItemPointer heapTid,
+								  IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+									   OffsetNumber offnum,
+									   ItemPointer heapTid);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -434,7 +442,10 @@ _bt_binsrch(Relation rel,
  * low) makes bounds invalid.
  *
  * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
  */
 OffsetNumber
 _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +464,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 	Assert(P_ISLEAF(opaque));
 	Assert(!key->nextkey);
+	Assert(insertstate->postingoff == 0);
 
 	if (!insertstate->bounds_valid)
 	{
@@ -509,6 +521,16 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 			if (result != 0)
 				stricthigh = high;
 		}
+
+		/*
+		 * If tuple at offset located by binary search is a posting list whose
+		 * TID range overlaps with caller's scantid, perform posting list
+		 * binary search to set postingoff for caller.  Caller must split the
+		 * posting list when postingoff is set.  This should happen
+		 * infrequently.
+		 */
+		if (unlikely(result == 0 && key->scantid != NULL))
+			insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
 	}
 
 	/*
@@ -528,6 +550,68 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	return low;
 }
 
+/*----------
+ *	_bt_binsrch_posting() -- posting list binary search.
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+	IndexTuple	itup;
+	ItemId		itemid;
+	int			low,
+				high,
+				mid,
+				res;
+
+	/*
+	 * If this isn't a posting tuple, then the index must be corrupt (if it is
+	 * an ordinary non-pivot tuple then there must be an existing tuple with a
+	 * heap TID that equals inserter's new heap TID/scantid).  Defensively
+	 * check that tuple is a posting list tuple whose posting list range
+	 * includes caller's scantid.
+	 *
+	 * (This is also needed because contrib/amcheck's rootdescend option needs
+	 * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+	 */
+	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+	Assert(!key->nextkey);
+	Assert(key->scantid != NULL);
+	itemid = PageGetItemId(page, offnum);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+	if (!BTreeTupleIsPosting(itup))
+		return 0;
+
+	/*
+	 * In the unlikely event that posting list tuple has LP_DEAD bit set,
+	 * signal to caller that it should kill the item and restart its binary
+	 * search.
+	 */
+	if (ItemIdIsDead(itemid))
+		return -1;
+
+	/* "high" is past end of posting list for loop invariant */
+	low = 0;
+	high = BTreeTupleGetNPosting(itup);
+	Assert(high >= 2);
+
+	while (high > low)
+	{
+		mid = low + ((high - low) / 2);
+		res = ItemPointerCompare(key->scantid,
+								 BTreeTupleGetPostingN(itup, mid));
+
+		if (res >= 1)
+			low = mid + 1;
+		else
+			high = mid;
+	}
+
+	return low;
+}
+
 /*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
@@ -537,9 +621,18 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
  *			<0 if scankey < tuple at offnum;
  *			 0 if scankey == tuple at offnum;
  *			>0 if scankey > tuple at offnum.
- *		NULLs in the keys are treated as sortable values.  Therefore
- *		"equality" does not necessarily mean that the item should be
- *		returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values.  Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key.  Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid.  There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
+ *
+ * It is generally guaranteed that any possible scankey with scantid set
+ * will have zero or one tuples in the index that are considered equal
+ * here.
  *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
@@ -563,6 +656,7 @@ _bt_compare(Relation rel,
 	ScanKey		scankey;
 	int			ncmpkey;
 	int			ntupatts;
+	int32		result;
 
 	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +691,6 @@ _bt_compare(Relation rel,
 	{
 		Datum		datum;
 		bool		isNull;
-		int32		result;
 
 		datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
 
@@ -713,8 +806,25 @@ _bt_compare(Relation rel,
 	if (heapTid == NULL)
 		return 1;
 
+	/*
+	 * scankey must be treated as equal to a posting list tuple if its scantid
+	 * value falls within the range of the posting list.  In all other cases
+	 * there can only be a single heap TID value, which is compared directly
+	 * as a simple scalar value.
+	 */
 	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	return ItemPointerCompare(key->scantid, heapTid);
+	result = ItemPointerCompare(key->scantid, heapTid);
+	if (!BTreeTupleIsPosting(itup) || result <= 0)
+		return result;
+	else
+	{
+		result = ItemPointerCompare(key->scantid,
+									BTreeTupleGetMaxHeapTID(itup));
+		if (result > 0)
+			return 1;
+	}
+
+	return 0;
 }
 
 /*
@@ -1230,6 +1340,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 
 	/* Initialize remaining insertion scan key fields */
 	inskey.heapkeyspace = _bt_heapkeyspace(rel);
+	inskey.safededup = false;	/* unused */
 	inskey.anynullkeys = false; /* unused */
 	inskey.nextkey = nextkey;
 	inskey.pivotsearch = false;
@@ -1451,6 +1562,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	/* initialize tuple workspace to empty */
 	so->currPos.nextTupleOffset = 0;
+	so->currPos.postingTupleOffset = 0;
 
 	/*
 	 * Now that the current page has been made consistent, the macro should be
@@ -1485,8 +1597,29 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				if (!BTreeTupleIsPosting(itup))
+				{
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+				else
+				{
+					/*
+					 * Setup state to return posting list, and save first
+					 * "logical" tuple
+					 */
+					_bt_setuppostingitems(so, itemIndex, offnum,
+										  BTreeTupleGetPostingN(itup, 0),
+										  itup);
+					itemIndex++;
+					/* Save additional posting list "logical" tuples */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i));
+						itemIndex++;
+					}
+				}
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1519,7 +1652,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1527,7 +1660,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxPostingIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1569,8 +1702,36 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (passes_quals && tuple_alive)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				if (!BTreeTupleIsPosting(itup))
+				{
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+				else
+				{
+					int			i = BTreeTupleGetNPosting(itup) - 1;
+
+					/*
+					 * Setup state to return posting list, and save last
+					 * "logical" tuple from posting list (since it's the first
+					 * that will be returned to scan).
+					 */
+					itemIndex--;
+					_bt_setuppostingitems(so, itemIndex, offnum,
+										  BTreeTupleGetPostingN(itup, i--),
+										  itup);
+
+					/*
+					 * Return posting list "logical" tuples -- do this in
+					 * descending order, to match overall scan order
+					 */
+					for (; i >= 0; i--)
+					{
+						itemIndex--;
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i));
+					}
+				}
 			}
 			if (!continuescan)
 			{
@@ -1584,8 +1745,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1759,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1610,6 +1773,59 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 	}
 }
 
+/*
+ * Setup state to save posting items from a single posting list tuple.  Saves
+ * the logical tuple that will be returned to scan first in passing.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for logical tuple
+ * that is returned to scan first.  Second or subsequent heap TID for posting
+ * list should be saved by calling _bt_savepostingitem().
+ */
+static void
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					  ItemPointer heapTid, IndexTuple itup)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		/* Save a base version of the IndexTuple */
+		Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+		itupsz = MAXALIGN(itupsz);
+		currItem->tupleOffset = so->currPos.nextTupleOffset;
+		memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+		so->currPos.nextTupleOffset += itupsz;
+		so->currPos.postingTupleOffset = currItem->tupleOffset;
+	}
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer heapTid)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+
+	/*
+	 * Have index-only scans return the same base IndexTuple for every logical
+	 * tuple that originates from the same posting list
+	 */
+	if (so->currTuples)
+		currItem->tupleOffset = so->currPos.postingTupleOffset;
+}
+
 /*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index ab19692006..ddf4b164e1 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -287,6 +287,9 @@ static void _bt_sortaddtup(Page page, Size itemsize,
 						   IndexTuple itup, OffsetNumber itup_off);
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 						 IndexTuple itup);
+static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
+										  BTPageState *state,
+										  BTDedupState *dstate);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
@@ -725,8 +728,8 @@ _bt_pagestate(BTWriteState *wstate, uint32 level)
 	if (level > 0)
 		state->btps_full = (BLCKSZ * (100 - BTREE_NONLEAF_FILLFACTOR) / 100);
 	else
-		state->btps_full = RelationGetTargetPageFreeSpace(wstate->index,
-														  BTREE_DEFAULT_FILLFACTOR);
+		state->btps_full = BtreeGetTargetPageFreeSpace(wstate->index,
+													   BTREE_DEFAULT_FILLFACTOR);
 	/* no parent level, yet */
 	state->btps_next = NULL;
 
@@ -799,7 +802,8 @@ _bt_sortaddtup(Page page,
 }
 
 /*----------
- * Add an item to a disk page from the sort output.
+ * Add an item to a disk page from the sort output (or add a posting list
+ * item formed from the sort output).
  *
  * We must be careful to observe the page layout conventions of nbtsearch.c:
  * - rightmost pages start data items at P_HIKEY instead of at P_FIRSTKEY.
@@ -1002,6 +1006,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		 * the minimum key for the new page.
 		 */
 		state->btps_minkey = CopyIndexTuple(oitup);
+		Assert(BTreeTupleIsPivot(state->btps_minkey));
 
 		/*
 		 * Set the sibling links for both pages.
@@ -1043,6 +1048,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(state->btps_minkey == NULL);
 		state->btps_minkey = CopyIndexTuple(itup);
 		/* _bt_sortaddtup() will perform full truncation later */
+		BTreeTupleClearBtIsPosting(state->btps_minkey);
 		BTreeTupleSetNAtts(state->btps_minkey, 0);
 	}
 
@@ -1057,6 +1063,42 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	state->btps_lastoff = last_off;
 }
 
+/*
+ * Finalize pending posting list tuple, and add it to the index.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * This is almost like _bt_dedup_finish_pending(), but it adds a new tuple
+ * using _bt_buildadd() and does not maintain the intervals array.
+ */
+static void
+_bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
+							  BTDedupState *dstate)
+{
+	IndexTuple	final;
+
+	Assert(dstate->nitems > 0);
+	if (dstate->nitems == 1)
+		final = dstate->base;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = _bt_form_posting(dstate->base,
+										dstate->htids,
+										dstate->nhtids);
+		final = postingtuple;
+	}
+
+	_bt_buildadd(wstate, state, final);
+
+	if (dstate->nitems > 1)
+		pfree(final);
+	/* Don't maintain  dedup_intervals array, or alltupsize */
+	dstate->nhtids = 0;
+	dstate->nitems = 0;
+}
+
 /*
  * Finish writing out the completed btree.
  */
@@ -1123,7 +1165,8 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 	 * by filling in a valid magic number in the metapage.
 	 */
 	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, rootblkno, rootlevel);
+	_bt_initmetapage(metapage, rootblkno, rootlevel,
+					 wstate->inskey->safededup);
 	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
 }
 
@@ -1144,6 +1187,10 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		deduplicate;
+
+	deduplicate = wstate->inskey->safededup &&
+		BtreeGetDoDedupOption(wstate->index);
 
 	if (merge)
 	{
@@ -1255,9 +1302,97 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		}
 		pfree(sortKeys);
 	}
+	else if (deduplicate)
+	{
+		/* merge is unnecessary, deduplicate into posting lists */
+		BTDedupState *dstate;
+		IndexTuple	newbase;
+
+		dstate = (BTDedupState *) palloc(sizeof(BTDedupState));
+		dstate->maxitemsize = 0;	/* set later */
+		dstate->checkingunique = false; /* unused */
+		dstate->skippedbase = InvalidOffsetNumber;
+		dstate->newitem = NULL;
+		/* Metadata about current pending posting list */
+		dstate->htids = NULL;
+		dstate->nhtids = 0;
+		dstate->nitems = 0;
+		dstate->overlap = false;
+		dstate->alltupsize = 0; /* unused */
+		/* Metadata about based tuple of current pending posting list */
+		dstate->base = NULL;
+		dstate->baseoff = InvalidOffsetNumber;	/* unused */
+		dstate->basetupsize = 0;
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate,
+											   true)) != NULL)
+		{
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+			{
+				state = _bt_pagestate(wstate, 0);
+				dstate->maxitemsize = BTMaxItemSize(state->btps_page);
+				/* Conservatively size array */
+				dstate->htids = palloc(dstate->maxitemsize);
+
+				/*
+				 * No previous/base tuple, since itup is the  first item
+				 * returned by the tuplesort -- use itup as base tuple of
+				 * first pending posting list for entire index build
+				 */
+				newbase = CopyIndexTuple(itup);
+				_bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+			}
+			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
+										 itup) > keysz &&
+					 _bt_dedup_save_htid(dstate, itup))
+			{
+				/*
+				 * Tuple is equal to base tuple of pending posting list, and
+				 * merging itup into pending posting list won't exceed the
+				 * BTMaxItemSize() limit.  Heap TID(s) for itup have been
+				 * saved in state.  The next iteration will also end up here
+				 * if it's possible to merge the next tuple into the same
+				 * pending posting list.
+				 */
+			}
+			else
+			{
+				/*
+				 * Tuple is not equal to pending posting list tuple, or
+				 * BTMaxItemSize() limit was reached
+				 */
+				_bt_sort_dedup_finish_pending(wstate, state, dstate);
+				/* Base tuple is always a copy */
+				pfree(dstate->base);
+
+				/* itup starts new pending posting list */
+				newbase = CopyIndexTuple(itup);
+				_bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+			}
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+
+		/*
+		 * Handle the last item (there must be a last item when the tuplesort
+		 * returned one or more tuples)
+		 */
+		if (state)
+		{
+			_bt_sort_dedup_finish_pending(wstate, state, dstate);
+			/* Base tuple is always a copy */
+			pfree(dstate->base);
+			pfree(dstate->htids);
+		}
+
+		pfree(dstate);
+	}
 	else
 	{
-		/* merge is unnecessary */
+		/* merging and deduplication are both unnecessary */
 		while ((itup = tuplesort_getindextuple(btspool->sortstate,
 											   true)) != NULL)
 		{
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index a04d4e25d6..7758d74101 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -167,7 +167,7 @@ _bt_findsplitloc(Relation rel,
 
 	/* Count up total space in data items before actually scanning 'em */
 	olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page);
-	leaffillfactor = RelationGetFillFactor(rel, BTREE_DEFAULT_FILLFACTOR);
+	leaffillfactor = BtreeGetFillFactor(rel, BTREE_DEFAULT_FILLFACTOR);
 
 	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
 	newitemsz += sizeof(ItemIdData);
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
 	state.minfirstrightsz = SIZE_MAX;
 	state.newitemoff = newitemoff;
 
+	/* newitem cannot be a posting list item */
+	Assert(!BTreeTupleIsPosting(newitem));
+
 	/*
 	 * maxsplits should never exceed maxoff because there will be at most as
 	 * many candidate split points as there are points _between_ tuples, once
@@ -459,17 +462,52 @@ _bt_recsplitloc(FindSplitData *state,
 	int16		leftfree,
 				rightfree;
 	Size		firstrightitemsz;
+	Size		postingsubhikey = 0;
 	bool		newitemisfirstonright;
 
 	/* Is the new item going to be the first item on the right page? */
 	newitemisfirstonright = (firstoldonright == state->newitemoff
 							 && !newitemonleft);
 
+	/*
+	 * FIXME: Accessing every single tuple like this adds cycles to cases that
+	 * cannot possibly benefit (i.e. cases where we know that there cannot be
+	 * posting lists).  Maybe we should add a way to not bother when we are
+	 * certain that this is the case.
+	 *
+	 * We could either have _bt_split() pass us a flag, or invent a page flag
+	 * that indicates that the page might have posting lists, as an
+	 * optimization.  There is no shortage of btpo_flags bits for stuff like
+	 * this.
+	 */
 	if (newitemisfirstonright)
+	{
 		firstrightitemsz = state->newitemsz;
+
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+			postingsubhikey = IndexTupleSize(state->newitem) -
+				BTreeTupleGetPostingOffset(state->newitem);
+	}
 	else
+	{
 		firstrightitemsz = firstoldonrightsz;
 
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf)
+		{
+			ItemId		itemid;
+			IndexTuple	newhighkey;
+
+			itemid = PageGetItemId(state->page, firstoldonright);
+			newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+			if (BTreeTupleIsPosting(newhighkey))
+				postingsubhikey = IndexTupleSize(newhighkey) -
+					BTreeTupleGetPostingOffset(newhighkey);
+		}
+	}
+
 	/* Account for all the old tuples */
 	leftfree = state->leftspace - olddataitemstoleft;
 	rightfree = state->rightspace -
@@ -492,9 +530,13 @@ _bt_recsplitloc(FindSplitData *state,
 	 * adding a heap TID to the left half's new high key when splitting at the
 	 * leaf level.  In practice the new high key will often be smaller and
 	 * will rarely be larger, but conservatively assume the worst case.
+	 * Truncation always truncates away any posting list that appears in the
+	 * first right tuple, though, so it's safe to subtract that overhead
+	 * (while still conservatively assuming that truncation might have to add
+	 * back a single heap TID using the pivot tuple heap TID representation).
 	 */
 	if (state->is_leaf)
-		leftfree -= (int16) (firstrightitemsz +
+		leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
 							 MAXALIGN(sizeof(ItemPointerData)));
 	else
 		leftfree -= (int16) firstrightitemsz;
@@ -691,7 +733,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
 	tup = (IndexTuple) PageGetItem(state->page, itemid);
 	/* Do cheaper test first */
-	if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+	if (BTreeTupleIsPosting(tup) ||
+		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
 	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index bc855dd25d..92c1830d82 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -97,8 +97,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	indoption = rel->rd_indoption;
 	tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
 
-	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
 	/*
 	 * We'll execute search using scan key constructed on key columns.
 	 * Truncated attributes and non-key attributes are omitted from the final
@@ -107,12 +105,25 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	key = palloc(offsetof(BTScanInsertData, scankeys) +
 				 sizeof(ScanKeyData) * indnkeyatts);
 	key->heapkeyspace = itup == NULL || _bt_heapkeyspace(rel);
+	key->safededup = itup == NULL ? _bt_opclasses_support_dedup(rel) :
+		_bt_safededup(rel);
 	key->anynullkeys = false;	/* initial assumption */
 	key->nextkey = false;
 	key->pivotsearch = false;
+	key->scantid = NULL;
 	key->keysz = Min(indnkeyatts, tupnatts);
-	key->scantid = key->heapkeyspace && itup ?
-		BTreeTupleGetHeapTID(itup) : NULL;
+
+	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+	Assert(!itup || !BTreeTupleIsPosting(itup) || key->heapkeyspace);
+
+	/*
+	 * When caller passes a tuple with a heap TID, use it to set scantid. Note
+	 * that this handles posting list tuples by setting scantid to the lowest
+	 * heap TID in the posting list.
+	 */
+	if (itup && key->heapkeyspace)
+		key->scantid = BTreeTupleGetHeapTID(itup);
+
 	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
@@ -1386,6 +1397,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			continue;
 		}
 
@@ -1547,6 +1559,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			cmpresult = 0;
 			if (subkey->sk_flags & SK_ROW_END)
 				break;
@@ -1786,10 +1799,35 @@ _bt_killitems(IndexScanDesc scan)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
+			bool		killtuple = false;
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			if (BTreeTupleIsPosting(ituple))
 			{
-				/* found the item */
+				int			pi = i + 1;
+				int			nposting = BTreeTupleGetNPosting(ituple);
+				int			j;
+
+				for (j = 0; j < nposting; j++)
+				{
+					ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+					if (!ItemPointerEquals(item, &kitem->heapTid))
+						break;	/* out of posting list loop */
+
+					/* Read-ahead to later kitems */
+					if (pi < numKilled)
+						kitem = &so->currPos.items[so->killedItems[pi++]];
+				}
+
+				if (j == nposting)
+					killtuple = true;
+			}
+			else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+				killtuple = true;
+
+			if (killtuple)
+			{
+				/* found the item/all posting list items */
 				ItemIdMarkDead(iid);
 				killedsomething = true;
 				break;			/* out of inner search loop */
@@ -2027,7 +2065,30 @@ BTreeShmemInit(void)
 bytea *
 btoptions(Datum reloptions, bool validate)
 {
-	return default_reloptions(reloptions, validate, RELOPT_KIND_BTREE);
+	relopt_value *options;
+	BtreeOptions *rdopts;
+	int			numoptions;
+	static const relopt_parse_elt tab[] = {
+		{"fillfactor", RELOPT_TYPE_INT, offsetof(BtreeOptions, fillfactor)},
+		{"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
+		offsetof(BtreeOptions, vacuum_cleanup_index_scale_factor)},
+		{"deduplication", RELOPT_TYPE_BOOL, offsetof(BtreeOptions, dedup_enabled)}
+	};
+
+	options = parseRelOptions(reloptions, validate, RELOPT_KIND_BTREE,
+							  &numoptions);
+
+	/* if none set, we're done */
+	if (numoptions == 0)
+		return NULL;
+
+	rdopts = allocateReloptStruct(sizeof(BtreeOptions), options, numoptions);
+
+	fillRelOptions((void *) rdopts, sizeof(BtreeOptions), options, numoptions,
+				   validate, tab, lengthof(tab));
+
+	pfree(options);
+	return (bytea *) rdopts;
 }
 
 /*
@@ -2140,6 +2201,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(firstright))
+		{
+			BTreeTupleClearBtIsPosting(pivot);
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			if (keepnatts == natts)
+			{
+				/*
+				 * index_truncate_tuple() just returned a copy of the
+				 * original, so make sure that the size of the new pivot tuple
+				 * doesn't have posting list overhead
+				 */
+				pivot->t_info &= ~INDEX_SIZE_MASK;
+				pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+			}
+		}
+
+		Assert(!BTreeTupleIsPosting(pivot));
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2156,6 +2235,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * attribute to the new pivot tuple.
 		 */
 		Assert(natts != nkeyatts);
+		Assert(!BTreeTupleIsPosting(lastleft) &&
+			   !BTreeTupleIsPosting(firstright));
 		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
 		tidpivot = palloc0(newsize);
 		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2163,6 +2244,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		pfree(pivot);
 		pivot = tidpivot;
 	}
+	else if (BTreeTupleIsPosting(firstright))
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal.  We
+		 * can always truncate away a posting list, though.
+		 *
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+			MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+		pivot->t_info &= ~INDEX_SIZE_MASK;
+		pivot->t_info |= newsize;
+		BTreeTupleClearBtIsPosting(pivot);
+		BTreeTupleSetAltHeapTID(pivot);
+	}
 	else
 	{
 		/*
@@ -2170,7 +2269,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * It's necessary to add a heap TID attribute to the new pivot tuple.
 		 */
 		Assert(natts == nkeyatts);
-		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+		newsize = MAXALIGN(IndexTupleSize(firstright)) +
+			MAXALIGN(sizeof(ItemPointerData));
 		pivot = palloc0(newsize);
 		memcpy(pivot, firstright, IndexTupleSize(firstright));
 	}
@@ -2188,6 +2288,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * nbtree (e.g., there is no pg_attribute entry).
 	 */
 	Assert(itup_key->heapkeyspace);
+	Assert(!BTreeTupleIsPosting(pivot));
 	pivot->t_info &= ~INDEX_SIZE_MASK;
 	pivot->t_info |= newsize;
 
@@ -2200,7 +2301,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2211,9 +2312,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
+							  BTreeTupleGetHeapTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #else
 
 	/*
@@ -2226,7 +2330,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2235,7 +2339,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2316,15 +2421,22 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
  * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal (once detoasted).  Similarly, result may
- * differ from the _bt_keep_natts result when either tuple has TOASTed datums,
- * though this is barely possible in practice.
+ * unless they're bitwise equal after detoasting.
  *
  * These issues must be acceptable to callers, typically because they're only
  * concerned about making suffix truncation as effective as possible without
  * leaving excessive amounts of free space on either side of page split.
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts.
+ *
+ * When an index only uses opclasses where _bt_opclasses_support_dedup()
+ * report that deduplication is safe, this function is guaranteed to give the
+ * same result as _bt_keep_natts().
+ *
+ * FIXME: Actually invent the needed "equality-is-precise" opclass
+ * infrastructure.  See dedicated -hackers thread:
+ *
+ * https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2350,7 +2462,7 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 			break;
 
 		if (!isNull1 &&
-			!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+			!datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
 			break;
 
 		keepnatts++;
@@ -2402,22 +2514,30 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
+	/* !heapkeyspace indexes do not support deduplication */
+	if (!heapkeyspace && BTreeTupleIsPosting(itup))
+		return false;
+
+	/* INCLUDE indexes do not support deduplication */
+	if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+		return false;
+
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
 			/*
-			 * Non-pivot tuples currently never use alternative heap TID
-			 * representation -- even those within heapkeyspace indexes
+			 * Non-pivot tuple should never be explicitly marked as a pivot
+			 * tuple
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated.  (Note that tupnatts must have been
-			 * inferred, rather than coming from an explicit on-disk
-			 * representation.)
+			 * inferred, even with a posting list tuple, because only pivot
+			 * tuples store tupnatts directly.)
 			 */
 			return tupnatts == natts;
 		}
@@ -2461,12 +2581,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * non-zero, or when there is no explicit representation and the
 			 * tuple is evidently not a pre-pg_upgrade tuple.
 			 *
-			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
-			 * that to decide if the tuple is a pre-v11 tuple.
+			 * Prior to v11, downlinks always had P_HIKEY as their offset.
+			 * Accept that as an alternative indication of a valid
+			 * !heapkeyspace negative infinity tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
-				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+				ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
 		}
 		else
 		{
@@ -2492,7 +2612,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
+		return false;
+
+	/* Pivot tuple should not use posting list representation (redundant) */
+	if (BTreeTupleIsPosting(itup))
 		return false;
 
 	/*
@@ -2562,11 +2686,44 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					BTMaxItemSizeNoHeapTid(page),
 					RelationGetRelationName(rel)),
 			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
-					   ItemPointerGetBlockNumber(&newtup->t_tid),
-					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+					   ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
 					   RelationGetRelationName(heap)),
 			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
 					 "Consider a function index of an MD5 hash of the value, "
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Is it safe to perform deduplication for an index, given the opclasses and
+ * collations used?
+ *
+ * Returned value is stored in index metapage during index builds.
+ *
+ * Note: This does not account for pg_uggrade'd !heapkeyspace indexes
+ */
+bool
+_bt_opclasses_support_dedup(Relation index)
+{
+	/* INCLUDE indexes don't support deduplication */
+	if (IndexRelationGetNumberOfAttributes(index) !=
+		IndexRelationGetNumberOfKeyAttributes(index))
+		return false;
+
+	for (int i = 0; i < IndexRelationGetNumberOfKeyAttributes(index); i++)
+	{
+		Oid			opfamily = index->rd_opfamily[i];
+		Oid			collation = index->rd_indcollation[i];
+
+		/* TODO add adequate check of opclasses and collations */
+		elog(DEBUG4, "index %s column i %d opfamilyOid %u collationOid %u",
+			 RelationGetRelationName(index), i, opfamily, collation);
+
+		/* NUMERIC btree opfamily OID is 1988 */
+		if (opfamily == 1988)
+			return false;
+	}
+
+	return true;
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c1aa..27694246e2 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -21,8 +21,11 @@
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "storage/procarray.h"
+#include "utils/memutils.h"
 #include "miscadmin.h"
 
+static MemoryContext opCtx;		/* working memory for operations */
+
 /*
  * _bt_restore_page -- re-enter all the index tuples on a page
  *
@@ -111,6 +114,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
 	Assert(md->btm_version >= BTREE_NOVAC_VERSION);
 	md->btm_oldest_btpo_xact = xlrec->oldest_btpo_xact;
 	md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
+	md->btm_safededup = xlrec->btm_safededup;
 
 	pageop = (BTPageOpaque) PageGetSpecialPointer(metapg);
 	pageop->btpo_flags = BTP_META;
@@ -181,9 +185,45 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 
 		page = BufferGetPage(buffer);
 
-		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
-						false, false) == InvalidOffsetNumber)
-			elog(PANIC, "btree_xlog_insert: failed to add item");
+		if (xlrec->postingoff == InvalidOffsetNumber)
+		{
+			/* Simple retail insertion */
+			if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+		}
+		else
+		{
+			ItemId		itemid;
+			IndexTuple	oposting,
+						newitem,
+						nposting;
+
+			/*
+			 * A posting list split occurred during insertion.
+			 *
+			 * Use _bt_swap_posting() to repeat posting list split steps from
+			 * primary.  Note that newitem from WAL record is 'orignewitem',
+			 * not the final version of newitem that is actually inserted on
+			 * page.
+			 */
+			Assert(isleaf);
+			itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+			oposting = (IndexTuple) PageGetItem(page, itemid);
+
+			/* newitem must be mutable copy for _bt_swap_posting() */
+			newitem = CopyIndexTuple((IndexTuple) datapos);
+			nposting = _bt_swap_posting(newitem, oposting, xlrec->postingoff);
+
+			/* Replace existing posting list with post-split version */
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+			/* insert new item */
+			Assert(IndexTupleSize(newitem) == datalen);
+			if (PageAddItem(page, (Item) newitem, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add posting split new item");
+		}
 
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
@@ -265,20 +305,38 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 		BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
 		OffsetNumber off;
 		IndexTuple	newitem = NULL,
-					left_hikey = NULL;
+					left_hikey = NULL,
+					nposting = NULL;
 		Size		newitemsz = 0,
 					left_hikeysz = 0;
 		Page		newlpage;
-		OffsetNumber leftoff;
+		OffsetNumber leftoff,
+					replacepostingoff = InvalidOffsetNumber;
 
 		datapos = XLogRecGetBlockData(record, 0, &datalen);
 
-		if (onleft)
+		if (onleft || xlrec->postingoff != 0)
 		{
 			newitem = (IndexTuple) datapos;
 			newitemsz = MAXALIGN(IndexTupleSize(newitem));
 			datapos += newitemsz;
 			datalen -= newitemsz;
+
+			if (xlrec->postingoff != 0)
+			{
+				ItemId		itemid;
+				IndexTuple	oposting;
+
+				/* Posting list must be at offset number before new item's */
+				replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+				/* newitem must be mutable copy for _bt_swap_posting() */
+				newitem = CopyIndexTuple(newitem);
+				itemid = PageGetItemId(lpage, replacepostingoff);
+				oposting = (IndexTuple) PageGetItem(lpage, itemid);
+				nposting = _bt_swap_posting(newitem, oposting,
+											xlrec->postingoff);
+			}
 		}
 
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
@@ -304,8 +362,20 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 			Size		itemsz;
 			IndexTuple	item;
 
+			/* Add replacement posting list when required */
+			if (off == replacepostingoff)
+			{
+				Assert(onleft || xlrec->firstright == xlrec->newitemoff);
+				if (PageAddItem(newlpage, (Item) nposting,
+								MAXALIGN(IndexTupleSize(nposting)), leftoff,
+								false, false) == InvalidOffsetNumber)
+					elog(ERROR, "failed to add new posting list item to left page after split");
+				leftoff = OffsetNumberNext(leftoff);
+				continue;
+			}
+
 			/* add the new item if it was inserted on left page */
-			if (onleft && off == xlrec->newitemoff)
+			else if (onleft && off == xlrec->newitemoff)
 			{
 				if (PageAddItem(newlpage, (Item) newitem, newitemsz, leftoff,
 								false, false) == InvalidOffsetNumber)
@@ -379,6 +449,84 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 	}
 }
 
+static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		buf;
+	xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+
+	if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Initialize a temporary empty page and copy all the items to that in
+		 * item number order.
+		 */
+		Page		page = (Page) BufferGetPage(buf);
+		OffsetNumber offnum;
+		BTDedupState *state;
+
+		state = (BTDedupState *) palloc(sizeof(BTDedupState));
+
+		state->maxitemsize = BTMaxItemSize(page);
+		state->checkingunique = false;	/* unused */
+		state->skippedbase = InvalidOffsetNumber;
+		state->newitem = NULL;
+		/* Metadata about current pending posting list */
+		state->htids = NULL;
+		state->nhtids = 0;
+		state->nitems = 0;
+		state->alltupsize = 0;
+		state->overlap = false;
+		/* Metadata about based tuple of current pending posting list */
+		state->base = NULL;
+		state->baseoff = InvalidOffsetNumber;
+		state->basetupsize = 0;
+
+		/* Conservatively size array */
+		state->htids = palloc(state->maxitemsize);
+
+		/*
+		 * Iterate over tuples on the page belonging to the interval to
+		 * deduplicate them into a posting list.
+		 */
+		for (offnum = xlrec->baseoff;
+			 offnum < xlrec->baseoff + xlrec->nitems;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemid = PageGetItemId(page, offnum);
+			IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+			Assert(!ItemIdIsDead(itemid));
+
+			if (offnum == xlrec->baseoff)
+			{
+				/*
+				 * No previous/base tuple for first data item -- use first
+				 * data item as base tuple of first pending posting list
+				 */
+				_bt_dedup_start_pending(state, itup, offnum);
+			}
+			else
+			{
+				/* Heap TID(s) for itup will be saved in state */
+				if (!_bt_dedup_save_htid(state, itup))
+					elog(ERROR, "could not add heap tid to pending posting list");
+			}
+		}
+
+		Assert(state->nitems == xlrec->nitems);
+		/* Handle the last item */
+		_bt_dedup_finish_pending(buf, state, false);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buf);
+	}
+
+	if (BufferIsValid(buf))
+		UnlockReleaseBuffer(buf);
+}
+
 static void
 btree_xlog_vacuum(XLogReaderState *record)
 {
@@ -386,8 +534,8 @@ btree_xlog_vacuum(XLogReaderState *record)
 	Buffer		buffer;
 	Page		page;
 	BTPageOpaque opaque;
-#ifdef UNUSED
 	xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
 
 	/*
 	 * This section of code is thought to be no longer needed, after analysis
@@ -478,14 +626,34 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		if (len > 0)
 		{
-			OffsetNumber *unused;
-			OffsetNumber *unend;
+			if (xlrec->nupdated > 0)
+			{
+				OffsetNumber *updatedoffsets;
+				IndexTuple	updated;
+				Size		itemsz;
 
-			unused = (OffsetNumber *) ptr;
-			unend = (OffsetNumber *) ((char *) ptr + len);
+				updatedoffsets = (OffsetNumber *)
+					(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+				updated = (IndexTuple) ((char *) updatedoffsets +
+										xlrec->nupdated * sizeof(OffsetNumber));
 
-			if ((unend - unused) > 0)
-				PageIndexMultiDelete(page, unused, unend - unused);
+				/* Handle posting tuples */
+				for (int i = 0; i < xlrec->nupdated; i++)
+				{
+					PageIndexTupleDelete(page, updatedoffsets[i]);
+
+					itemsz = MAXALIGN(IndexTupleSize(updated));
+
+					if (PageAddItem(page, (Item) updated, itemsz, updatedoffsets[i],
+									false, false) == InvalidOffsetNumber)
+						elog(PANIC, "btree_xlog_vacuum: failed to add updated posting list item");
+
+					updated = (IndexTuple) ((char *) updated + itemsz);
+				}
+			}
+
+			if (xlrec->ndeleted)
+				PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 		}
 
 		/*
@@ -820,7 +988,9 @@ void
 btree_redo(XLogReaderState *record)
 {
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	MemoryContext oldCtx;
 
+	oldCtx = MemoryContextSwitchTo(opCtx);
 	switch (info)
 	{
 		case XLOG_BTREE_INSERT_LEAF:
@@ -838,6 +1008,9 @@ btree_redo(XLogReaderState *record)
 		case XLOG_BTREE_SPLIT_R:
 			btree_xlog_split(false, record);
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			btree_xlog_dedup(record);
+			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
 			break;
@@ -863,6 +1036,23 @@ btree_redo(XLogReaderState *record)
 		default:
 			elog(PANIC, "btree_redo: unknown op code %u", info);
 	}
+	MemoryContextSwitchTo(oldCtx);
+	MemoryContextReset(opCtx);
+}
+
+void
+btree_xlog_startup(void)
+{
+	opCtx = AllocSetContextCreate(CurrentMemoryContext,
+								  "Btree recovery temporary context",
+								  ALLOCSET_DEFAULT_SIZES);
+}
+
+void
+btree_xlog_cleanup(void)
+{
+	MemoryContextDelete(opCtx);
+	opCtx = NULL;
 }
 
 /*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 4ee6d04a68..1dde2da285 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -30,7 +30,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_insert *xlrec = (xl_btree_insert *) rec;
 
-				appendStringInfo(buf, "off %u", xlrec->offnum);
+				appendStringInfo(buf, "off %u; postingoff %u",
+								 xlrec->offnum, xlrec->postingoff);
 				break;
 			}
 		case XLOG_BTREE_SPLIT_L:
@@ -38,16 +39,30 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
-				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d",
-								 xlrec->level, xlrec->firstright, xlrec->newitemoff);
+				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, postingoff %d",
+								 xlrec->level,
+								 xlrec->firstright,
+								 xlrec->newitemoff,
+								 xlrec->postingoff);
+				break;
+			}
+		case XLOG_BTREE_DEDUP_PAGE:
+			{
+				xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+				appendStringInfo(buf, "baseoff %u; nitems %u",
+								 xlrec->baseoff,
+								 xlrec->nitems);
 				break;
 			}
 		case XLOG_BTREE_VACUUM:
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "lastBlockVacuumed %u",
-								 xlrec->lastBlockVacuumed);
+				appendStringInfo(buf, "lastBlockVacuumed %u; nupdated %u; ndeleted %u",
+								 xlrec->lastBlockVacuumed,
+								 xlrec->nupdated,
+								 xlrec->ndeleted);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
@@ -131,6 +146,9 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			id = "DEDUPLICATE";
+			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d678ed..4e76c39a6c 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -145,6 +145,7 @@ static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 									  bool tupleIsAlive, void *checkstate);
 static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 									 IndexTuple itup);
+static inline IndexTuple bt_posting_logical_tuple(IndexTuple itup, int n);
 static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 											   OffsetNumber offset);
@@ -419,12 +420,13 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		/*
 		 * Size Bloom filter based on estimated number of tuples in index,
 		 * while conservatively assuming that each block must contain at least
-		 * MaxIndexTuplesPerPage / 5 non-pivot tuples.  (Non-leaf pages cannot
-		 * contain non-pivot tuples.  That's okay because they generally make
-		 * up no more than about 1% of all pages in the index.)
+		 * MaxPostingIndexTuplesPerPage / 3 "logical" tuples.  heapallindexed
+		 * verification fingerprints posting list heap TIDs as plain non-pivot
+		 * tuples, complete with index keys.  This allows its heap scan to
+		 * behave as if posting lists do not exist.
 		 */
 		total_pages = RelationGetNumberOfBlocks(rel);
-		total_elems = Max(total_pages * (MaxIndexTuplesPerPage / 5),
+		total_elems = Max(total_pages * (MaxPostingIndexTuplesPerPage / 3),
 						  (int64) state->rel->rd_rel->reltuples);
 		/* Random seed relies on backend srandom() call to avoid repetition */
 		seed = random();
@@ -924,6 +926,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer	scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -994,29 +997,73 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
 			char	   *itid,
 					   *htid;
+			ItemPointer tid = BTreeTupleGetHeapTID(itup);
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+							ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("could not find tuple using search from root page in index \"%s\"",
 							RelationGetRelationName(state->rel)),
-					 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+					 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
 										itid, htid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is actually a posting list, make sure posting list TIDs
+		 * are in order.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer		current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+				current = BTreeTupleGetPostingN(itup, i);
+
+				if (ItemPointerCompare(current, &last) <= 0)
+				{
+					char	   *itid,
+							   *htid;
+
+					itid = psprintf("(%u,%u)", state->targetblock, offset);
+					htid = psprintf("(%u,%u)",
+									ItemPointerGetBlockNumberNoCheck(current),
+									ItemPointerGetOffsetNumberNoCheck(current));
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+												itid, htid,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1074,12 +1121,32 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				/* Fingerprint all elements as distinct "logical" tuples */
+				for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					IndexTuple	logtuple;
+
+					logtuple = bt_posting_logical_tuple(itup, i);
+					norm = bt_normalize_tuple(state, logtuple);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != logtuple)
+						pfree(norm);
+					pfree(logtuple);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
@@ -1087,7 +1154,8 @@ bt_target_page_check(BtreeCheckState *state)
 		 *
 		 * If there is a high key (if this is not the rightmost page on its
 		 * entire level), check that high key actually is upper bound on all
-		 * page items.
+		 * page items.  If this is a posting list tuple, we'll need to set
+		 * scantid to be highest TID in posting list.
 		 *
 		 * We prefer to check all items against high key rather than checking
 		 * just the last and trusting that the operator class obeys the
@@ -1127,6 +1195,9 @@ bt_target_page_check(BtreeCheckState *state)
 		 * tuple. (See also: "Notes About Data Representation" in the nbtree
 		 * README.)
 		 */
+		scantid = skey->scantid;
+		if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+			skey->scantid = BTreeTupleGetMaxHeapTID(itup);
 		if (!P_RIGHTMOST(topaque) &&
 			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
 			  invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1221,7 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
+		skey->scantid = scantid;
 
 		/*
 		 * * Item order check *
@@ -1164,11 +1236,13 @@ bt_target_page_check(BtreeCheckState *state)
 					   *htid,
 					   *nitid,
 					   *nhtid;
+			ItemPointer tid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetHeapTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 			nitid = psprintf("(%u,%u)", state->targetblock,
 							 OffsetNumberNext(offset));
 
@@ -1177,9 +1251,11 @@ bt_target_page_check(BtreeCheckState *state)
 										  state->target,
 										  OffsetNumberNext(offset));
 			itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+			tid = BTreeTupleGetHeapTID(itup);
 			nhtid = psprintf("(%u,%u)",
-							 ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							 ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							 ItemPointerGetBlockNumberNoCheck(tid),
+							 ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1265,10 @@ bt_target_page_check(BtreeCheckState *state)
 										"higher index tid=%s (points to %s tid=%s) "
 										"page lsn=%X/%X.",
 										itid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										htid,
 										nitid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										nhtid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
@@ -1953,10 +2029,10 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
  * verification.  In particular, it won't try to normalize opclass-equal
  * datums with potentially distinct representations (e.g., btree/numeric_ops
  * index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though.  For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation.  Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
  */
 static IndexTuple
 bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -1969,6 +2045,9 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	IndexTuple	reformed;
 	int			i;
 
+	/* Caller should only pass "logical" non-pivot tuples here */
+	Assert(!BTreeTupleIsPosting(itup) && !BTreeTupleIsPivot(itup));
+
 	/* Easy case: It's immediately clear that tuple has no varlena datums */
 	if (!IndexTupleHasVarwidths(itup))
 		return itup;
@@ -2031,6 +2110,30 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	return reformed;
 }
 
+/*
+ * Produce palloc()'d "logical" tuple for nth posting list entry.
+ *
+ * In general, deduplication is not supposed to change the logical contents of
+ * an index.  Multiple logical index tuples are folded together into one
+ * physical posting list index tuple when convenient.
+ *
+ * heapallindexed verification must normalize-away this variation in
+ * representation by converting posting list tuples into two or more "logical"
+ * tuples.  Each logical tuple must be fingerprinted separately -- there must
+ * be one logical tuple for each corresponding Bloom filter probe during the
+ * heap scan.
+ *
+ * Note: Caller needs to call bt_normalize_tuple() with returned tuple.
+ */
+static inline IndexTuple
+bt_posting_logical_tuple(IndexTuple itup, int n)
+{
+	Assert(BTreeTupleIsPosting(itup));
+
+	/* Returns non-posting-list tuple */
+	return _bt_form_posting(itup, BTreeTupleGetPostingN(itup, n), 1);
+}
+
 /*
  * Search for itup in index, starting from fast root page.  itup must be a
  * non-pivot tuple.  This is only supported with heapkeyspace indexes, since
@@ -2087,6 +2190,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		insertstate.itup = itup;
 		insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 		insertstate.itup_key = key;
+		insertstate.postingoff = 0;
 		insertstate.bounds_valid = false;
 		insertstate.buf = lbuf;
 
@@ -2094,7 +2198,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		offnum = _bt_binsrch_insert(state->rel, &insertstate);
 		/* Compare first >= matching item on leaf page, if any */
 		page = BufferGetPage(lbuf);
+		/* Should match on first heap TID when tuple has a posting list */
 		if (offnum <= PageGetMaxOffsetNumber(page) &&
+			insertstate.postingoff <= 0 &&
 			_bt_compare(state->rel, key, page, offnum) == 0)
 			exists = true;
 		_bt_relbuf(state->rel, lbuf);
@@ -2548,26 +2654,29 @@ PageGetItemIdCareful(BtreeCheckState *state, BlockNumber block, Page page,
 }
 
 /*
- * BTreeTupleGetHeapTID() wrapper that lets caller enforce that a heap TID must
- * be present in cases where that is mandatory.
- *
- * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
- * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
- * It may become more useful in the future, when non-pivot tuples support their
- * own alternative INDEX_ALT_TID_MASK representation.
+ * BTreeTupleGetHeapTID() wrapper that enforces that a heap TID is present in
+ * cases where that is mandatory (i.e. for non-pivot tuples).
  */
 static inline ItemPointer
 BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
 							bool nonpivot)
 {
-	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	ItemPointer result;
 	BlockNumber targetblock = state->targetblock;
 
-	if (result == NULL && nonpivot)
+	Assert(state->heapkeyspace);
+
+	/*
+	 * Make sure that tuple type (pivot vs non-pivot) matches caller's
+	 * expectation
+	 */
+	if (BTreeTupleIsPivot(itup) == nonpivot)
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
 						targetblock, RelationGetRelationName(state->rel))));
 
+	result = BTreeTupleGetHeapTID(itup);
+
 	return result;
 }
-- 
2.17.1

#102

Peter Geoghegan

pg@bowt.ie

about 6 years ago

In reply to: Peter Geoghegan (#101)

3 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Mon, Nov 4, 2019 at 11:52 AM Peter Geoghegan <pg@bowt.ie> wrote:

Attached is v21, which fixes some bitrot -- v20 of the patch was made
totally unusable by today's commit 8557a6f1. Other changes:

There is more bitrot, so I attach v22. This also has some new changes
centered around fixing particular issues with space utilization. These
changes are:

* nbtsort.c now intelligently considers the contribution of suffix
truncation of posting list tuples when considering whether or not a
leaf page is "full". I mean "full" in the sense that it has exceeded
the soft limit (fillfactor-wise limit) on space utilization for the
page (no change in how the hard limit in _bt_buildadd() works).

We don't usually bother predicting the space saving from suffix
truncation when considering split points, even in nbtsplitloc.c, but
it's worth making an exception for posting lists (actually, this is
the same exception that nbtsplitloc.c already had in much earlier
versions of the patch). Posting lists are very often large enough to
really make a big contribution to how balanced free space is. I now
observe that weird cases where CREATE INDEX packs leaf pages too empty
(or too full) are now all but eliminated. CREATE INDEX now does a
pretty good job of respecting leaf fillfactor, while also allowing
deduplication to be very effective (CREATE INDEX did neither of these
two things in earlier versions of the patch).

* Added "single value" strategy for retail insert deduplication --
this is closely related to nbtsplitloc.c's single value strategy.

The general idea is that _bt_dedup_one_page() anticipates that a
future "single value" page split is likely to occur, and therefore
limits deduplication after two "1/3 of a page"-wide posting lists at
the start of the page. It arranges for deduplication to leave a neat
split point for nbtsplitloc.c to use when the time comes. In other
words, the patch now allows "single value" page splits to leave leaf
pages BTREE_SINGLEVAL_FILLFACTOR% full, just like v12/master. Leaving
a small amount of free space on pages that are packed full of
duplicates is always a good idea. Also, we no longer force page splits
to leave pages 2/3 full (only two large posting lists plus a high
key), which sometimes happened with v21. On balance, this change seems
to slightly improve space utilization.

In general, it's now unusual for retail insertions to get better space
utilization than CREATE INDEX -- in that sense normality/balance has
been restored in v22. Actually, I wrote the v22 changes by working
through a list of weird space utilization issues from my personal
notes. I'm pretty sure I've fixed all of those -- only nbtsplitloc.c's
single value strategy wants to split at a point that leaves a heap TID
in the new high key for the page, so that's the only thing we need to
worry about within nbtdedup.c.

* "deduplication" storage parameter now has psql completion.

I intend to push the datum_image_eq() preparatory patch soon. I will
also push a commit that makes _bt_keep_natts_fast() use
datum_image_eq() separately. Anybody have an opinion on that?

--
Peter Geoghegan

Attachments:

v22-0003-DEBUG-Add-pageinspect-instrumentation.patchapplication/octet-stream; name=v22-0003-DEBUG-Add-pageinspect-instrumentation.patchDownload

From 1d1fb340bf57f2e515a08af8ca8f22ba82fc9af9 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v22 3/3] DEBUG: Add pageinspect instrumentation.

Have pageinspect display user-visible attribute values, heap TID, max
heap TID, and the number of TIDs in a tuple (can be > 1 in the case of
posting list tuples).  Also adds a column that shows whether or not the
LP_DEAD bit has been set.

This patch is not proposed for inclusion in PostgreSQL; it's included
for the convenience of reviewers.

The following query can be used with this hacked pageinspect, which
visualizes the internal pages:

"""

with recursive index_details as (
  select
    'my_test_index'::text idx
),
size_in_pages_index as (
  select
    (pg_relation_size(idx::regclass) / (2^13))::int4 size_pages
  from
    index_details
),
page_stats as (
  select
    index_details.*,
    stats.*
  from
    index_details,
    size_in_pages_index,
    lateral (select i from generate_series(1, size_pages - 1) i) series,
    lateral (select * from bt_page_stats(idx, i)) stats),
internal_page_stats as (
  select
    *
  from
    page_stats
  where
    type != 'l'),
meta_stats as (
  select
    *
  from
    index_details s,
    lateral (select * from bt_metap(s.idx)) meta),
internal_items as (
  select
    *
  from
    internal_page_stats
  order by
    btpo desc),
-- XXX: Note ordering dependency within this CTE, on internal_items
ordered_internal_items(item, blk, level) as (
  select
    1,
    blkno,
    btpo
  from
    internal_items
  where
    btpo_prev = 0
    and btpo = (select level from meta_stats)
  union
  select
    case when level = btpo then o.item + 1 else 1 end,
    blkno,
    btpo
  from
    internal_items i,
    ordered_internal_items o
  where
    i.btpo_prev = o.blk or (btpo_prev = 0 and btpo = o.level - 1)
)
select
  --idx,
  btpo as level,
  item as l_item,
  blkno,
  --btpo_prev,
  --btpo_next,
  btpo_flags,
  type,
  live_items,
  dead_items,
  avg_item_size,
  page_size,
  free_size,
  -- Only non-rightmost pages have high key.  Show heap TID for both pivot and non-pivot tuples here.
  case when btpo_next != 0 then (select data || coalesce(', (htid)=(''' || htid || ''')', '')
                                 from bt_page_items(idx, blkno) where itemoffset = 1) end as highkey
from
  ordered_internal_items o
  join internal_items i on o.blk = i.blkno
order by btpo desc, item;
"""
---
 contrib/pageinspect/btreefuncs.c              | 92 ++++++++++++++++---
 contrib/pageinspect/expected/btree.out        |  6 +-
 contrib/pageinspect/pageinspect--1.6--1.7.sql | 25 +++++
 3 files changed, 109 insertions(+), 14 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 78cdc69ec7..435e71ae22 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -27,6 +27,7 @@
 
 #include "postgres.h"
 
+#include "access/genam.h"
 #include "access/nbtree.h"
 #include "access/relation.h"
 #include "catalog/namespace.h"
@@ -241,6 +242,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
  */
 struct user_args
 {
+	Relation	rel;
 	Page		page;
 	OffsetNumber offset;
 };
@@ -252,9 +254,9 @@ struct user_args
  * ------------------------------------------------------
  */
 static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset, Relation rel)
 {
-	char	   *values[6];
+	char	   *values[10];
 	HeapTuple	tuple;
 	ItemId		id;
 	IndexTuple	itup;
@@ -263,6 +265,8 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	int			dlen;
 	char	   *dump;
 	char	   *ptr;
+	ItemPointer min_htid,
+				max_htid;
 
 	id = PageGetItemId(page, offset);
 
@@ -281,16 +285,77 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
-	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
-	dump = palloc0(dlen * 3 + 1);
-	values[j] = dump;
-	for (off = 0; off < dlen; off++)
+	if (rel)
 	{
-		if (off > 0)
-			*dump++ = ' ';
-		sprintf(dump, "%02x", *(ptr + off) & 0xff);
-		dump += 2;
+		TupleDesc	itupdesc = RelationGetDescr(rel);
+		Datum		datvalues[INDEX_MAX_KEYS];
+		bool		isnull[INDEX_MAX_KEYS];
+		int			natts;
+		int			indnkeyatts = rel->rd_index->indnkeyatts;
+
+		natts = BTreeTupleGetNAtts(itup, rel);
+
+		itupdesc->natts = Min(indnkeyatts, natts);
+		memset(&isnull, 0xFF, sizeof(isnull));
+		index_deform_tuple(itup, itupdesc, datvalues, isnull);
+		rel->rd_index->indnkeyatts = natts;
+		values[j++] = BuildIndexValueDescription(rel, datvalues, isnull);
+		itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+		rel->rd_index->indnkeyatts = indnkeyatts;
 	}
+	else
+	{
+		dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+		dump = palloc0(dlen * 3 + 1);
+		values[j++] = dump;
+		for (off = 0; off < dlen; off++)
+		{
+			if (off > 0)
+				*dump++ = ' ';
+			sprintf(dump, "%02x", *(ptr + off) & 0xff);
+			dump += 2;
+		}
+	}
+
+	if (rel && !_bt_heapkeyspace(rel))
+	{
+		min_htid = NULL;
+		max_htid = NULL;
+	}
+	else
+	{
+		min_htid = BTreeTupleGetHeapTID(itup);
+		if (BTreeTupleIsPosting(itup))
+			max_htid = BTreeTupleGetMaxHeapTID(itup);
+		else
+			max_htid = NULL;
+	}
+
+	if (min_htid)
+		values[j++] = psprintf("(%u,%u)",
+							   ItemPointerGetBlockNumberNoCheck(min_htid),
+							   ItemPointerGetOffsetNumberNoCheck(min_htid));
+	else
+		values[j++] = NULL;
+
+	if (max_htid)
+		values[j++] = psprintf("(%u,%u)",
+							   ItemPointerGetBlockNumberNoCheck(max_htid),
+							   ItemPointerGetOffsetNumberNoCheck(max_htid));
+	else
+		values[j++] = NULL;
+
+	if (min_htid == NULL)
+		values[j++] = psprintf("0");
+	else if (!BTreeTupleIsPosting(itup))
+		values[j++] = psprintf("1");
+	else
+		values[j++] = psprintf("%d", (int) BTreeTupleGetNPosting(itup));
+
+	if (!ItemIdIsDead(id))
+		values[j++] = psprintf("f");
+	else
+		values[j++] = psprintf("t");
 
 	tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
 
@@ -364,11 +429,11 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = rel;
 		uargs->page = palloc(BLCKSZ);
 		memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
 
 		UnlockReleaseBuffer(buffer);
-		relation_close(rel, AccessShareLock);
 
 		uargs->offset = FirstOffsetNumber;
 
@@ -395,12 +460,13 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, uargs->rel);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
 	else
 	{
+		relation_close(uargs->rel, AccessShareLock);
 		pfree(uargs->page);
 		pfree(uargs);
 		SRF_RETURN_DONE(fctx);
@@ -480,7 +546,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, NULL);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..0f6dccaadc 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -40,7 +40,11 @@ ctid       | (0,1)
 itemlen    | 16
 nulls      | f
 vars       | f
-data       | 01 00 00 00 00 00 00 01
+data       | (a)=(72057594037927937)
+htid       | (0,1)
+max_htid   | 
+nheap_tids | 1
+isdead     | f
 
 SELECT * FROM bt_page_items('test1_a_idx', 2);
 ERROR:  block number out of range
diff --git a/contrib/pageinspect/pageinspect--1.6--1.7.sql b/contrib/pageinspect/pageinspect--1.6--1.7.sql
index 2433a21af2..00473da938 100644
--- a/contrib/pageinspect/pageinspect--1.6--1.7.sql
+++ b/contrib/pageinspect/pageinspect--1.6--1.7.sql
@@ -24,3 +24,28 @@ CREATE FUNCTION bt_metap(IN relname text,
     OUT last_cleanup_num_tuples real)
 AS 'MODULE_PATHNAME', 'bt_metap'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items()
+--
+DROP FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT htid tid,
+    OUT max_htid tid,
+    OUT nheap_tids int4,
+    OUT isdead boolean)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
-- 
2.17.1

v22-0002-Add-deduplication-to-nbtree.patchapplication/octet-stream; name=v22-0002-Add-deduplication-to-nbtree.patchDownload

From d3cca4d4cca643f6754c710dee11f869d5edb200 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 25 Sep 2019 10:08:53 -0700
Subject: [PATCH v22 2/3] Add deduplication to nbtree

---
 src/include/access/nbtree.h             | 327 +++++++++--
 src/include/access/nbtxlog.h            |  68 ++-
 src/include/access/rmgrlist.h           |   2 +-
 src/backend/access/common/reloptions.c  |  11 +-
 src/backend/access/index/genam.c        |   4 +
 src/backend/access/nbtree/Makefile      |   1 +
 src/backend/access/nbtree/README        |  74 ++-
 src/backend/access/nbtree/nbtdedup.c    | 704 ++++++++++++++++++++++++
 src/backend/access/nbtree/nbtinsert.c   | 327 +++++++++--
 src/backend/access/nbtree/nbtpage.c     | 209 ++++++-
 src/backend/access/nbtree/nbtree.c      | 174 +++++-
 src/backend/access/nbtree/nbtsearch.c   | 249 ++++++++-
 src/backend/access/nbtree/nbtsort.c     | 209 ++++++-
 src/backend/access/nbtree/nbtsplitloc.c |  49 +-
 src/backend/access/nbtree/nbtutils.c    | 218 +++++++-
 src/backend/access/nbtree/nbtxlog.c     | 218 +++++++-
 src/backend/access/rmgrdesc/nbtdesc.c   |  28 +-
 src/bin/psql/tab-complete.c             |   4 +-
 contrib/amcheck/verify_nbtree.c         | 177 ++++--
 19 files changed, 2834 insertions(+), 219 deletions(-)
 create mode 100644 src/backend/access/nbtree/nbtdedup.c

diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4a80e84aa7..afaa6b4bd8 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -23,6 +23,39 @@
 #include "storage/bufmgr.h"
 #include "storage/shm_toc.h"
 
+/*
+ * Storage type for Btree's reloptions
+ */
+typedef struct BtreeOptions
+{
+	int32		vl_len_;		/* varlena header (do not touch directly!) */
+	int			fillfactor;
+	double		vacuum_cleanup_index_scale_factor;
+	bool		deduplication;	/* Use deduplication where safe? */
+} BtreeOptions;
+
+/*
+ * By default deduplication is enabled for non unique indexes
+ * and disabled for unique ones
+ *
+ * XXX: Actually, we use deduplication everywhere for now.  Re-review this
+ * decision later on.
+ */
+#define BtreeDefaultDoDedup(relation) \
+	(relation->rd_index->indisunique ? true : true)
+
+#define BtreeGetDoDedupOption(relation) \
+	((relation)->rd_options ? \
+	 ((BtreeOptions *) (relation)->rd_options)->deduplication : \
+	 BtreeDefaultDoDedup(relation))
+
+#define BtreeGetFillFactor(relation, defaultff) \
+	((relation)->rd_options ? \
+	 ((BtreeOptions *) (relation)->rd_options)->fillfactor : (defaultff))
+
+#define BtreeGetTargetPageFreeSpace(relation, defaultff) \
+	(BLCKSZ * (100 - BtreeGetFillFactor(relation, defaultff)) / 100)
+
 /* There's room for a 16-bit vacuum cycle ID in BTPageOpaqueData */
 typedef uint16 BTCycleId;
 
@@ -102,11 +135,13 @@ typedef struct BTMetaPageData
 	uint32		btm_level;		/* tree level of the root page */
 	BlockNumber btm_fastroot;	/* current "fast" root location */
 	uint32		btm_fastlevel;	/* tree level of the "fast" root page */
-	/* remaining fields only valid when btm_version >= BTREE_NOVAC_VERSION */
+	/* These fields only valid when btm_version >= BTREE_NOVAC_VERSION */
 	TransactionId btm_oldest_btpo_xact; /* oldest btpo_xact among all deleted
 										 * pages */
 	float8		btm_last_cleanup_num_heap_tuples;	/* number of heap tuples
 													 * during last cleanup */
+	/* This field only valid when btm_version >= FIXME */
+	bool		btm_safededup;	/* deduplication safe for index? */
 } BTMetaPageData;
 
 #define BTPageGetMeta(p) \
@@ -154,6 +189,26 @@ typedef struct BTMetaPageData
 	MAXALIGN_DOWN((PageGetPageSize(page) - \
 				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
 				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
+/*
+ * MaxBTreeIndexTuplesPerPage is an upper bound on the number of "logical"
+ * tuples that may be stored on a btree leaf page.  This is comparable to
+ * the generic/physical MaxIndexTuplesPerPage upper bound.  A separate
+ * upper bound is needed in certain contexts due to posting list tuples,
+ * which only use a single physical page entry to store many logical
+ * tuples.  (MaxBTreeIndexTuplesPerPage is used to size the per-page
+ * temporary buffers used by index scans.)
+ *
+ * Note: we don't bother considering per-physical-tuple overheads here to
+ * keep things simple (value is based on how many elements a single array
+ * of heap TIDs must have to fill the space between the page header and
+ * special area).  The value is slightly higher (i.e. more conservative)
+ * than necessary as a result, which is considered acceptable.  There will
+ * only be three (very large) physical posting list tuples in leaf pages
+ * that have the largest possible number of heap TIDs/logical tuples.
+ */
+#define MaxBTreeIndexTuplesPerPage \
+	(int) ((BLCKSZ - SizeOfPageHeaderData - sizeof(BTPageOpaqueData)) / \
+		   sizeof(ItemPointerData))
 
 /*
  * The leaf-page fillfactor defaults to 90% but is user-adjustable.
@@ -234,8 +289,7 @@ typedef struct BTMetaPageData
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
  * All other types of index tuples ("pivot" tuples) only have key columns,
  * since pivot tuples only exist to represent how the key space is
@@ -282,20 +336,104 @@ typedef struct BTMetaPageData
  * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
  * number of columns/attributes <= INDEX_MAX_KEYS.
  *
+ * Sometimes non-pivot tuples also use a representation that repurposes
+ * t_tid to store metadata rather than a TID.  Postgres 13 introduced a new
+ * non-pivot tuple format in order to fold together multiple equal and
+ * equivalent non-pivot tuples into a single logically equivalent, space
+ * efficient representation - a posting list tuple.  A posting list is an
+ * array of ItemPointerData elements (there must be at least two elements
+ * when the posting list tuple format is used).  Posting list tuples are
+ * created dynamically by deduplication, at the point where we'd otherwise
+ * have to split a leaf page.
+ *
+ * Posting tuple format (alternative non-pivot tuple representation):
+ *
+ *  t_tid | t_info | key values | posting list (TID array)
+ *
+ * Posting list tuples are recognized as such by having the
+ * INDEX_ALT_TID_MASK status bit set in t_info and the BT_IS_POSTING status
+ * bit set in t_tid.  These flags redefine the content of the posting
+ * tuple's t_tid to store an offset to the posting list, as well as the
+ * total number of posting list array elements.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items present in the tuple, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.  Like any non-pivot tuple, the number of columns stored is
+ * always implicitly the total number in the index (in practice there can
+ * never be non-key columns stored, since deduplication is not supported
+ * with INCLUDE indexes).
+ *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_N_POSTING_OFFSET_MASK	0x0FFF
 #define BT_HEAP_TID_ATTR			0x1000
+#define BT_IS_POSTING				0x2000
+
+/*
+ * N.B.: BTreeTupleIsPivot() should only be used in code that deals with
+ * heapkeyspace indexes specifically.  BTreeTupleIsPosting() works with all
+ * nbtree indexes, though.
+ */
+#define BTreeTupleIsPivot(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+	)
+#define BTreeTupleIsPosting(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+	)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleGetNPosting(itup)	\
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+	)
+#define BTreeTupleSetNPosting(itup, n) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+	} while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+	)
+#define BTreeSetPostingMeta(itup, nposting, off) \
+	do { \
+		BTreeTupleSetNPosting(itup, nposting); \
+		Assert(BTreeTupleIsPosting(itup)); \
+		ItemPointerSetBlockNumber(&((itup)->t_tid), (off)); \
+	} while(0)
+
+#define BTreeTupleGetPosting(itup) \
+	(ItemPointer) ((char*) (itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+	(BTreeTupleGetPosting(itup) + (n))
 
 /* Get/set downlink block number */
 #define BTreeInnerTupleGetDownLink(itup) \
@@ -326,40 +464,71 @@ typedef struct BTMetaPageData
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
 		( \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
 		IndexRelationGetNumberOfAttributes(rel) \
 	)
-#define BTreeTupleSetNAtts(itup, n) \
-	do { \
-		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
-	} while(0)
+
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int n)
+{
+	Assert(!BTreeTupleIsPosting(itup));
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	ItemPointerSetOffsetNumber(&itup->t_tid, n & BT_N_KEYS_OFFSET_MASK);
+}
 
 /*
- * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Get tiebreaker heap TID attribute, if any.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
  */
-#define BTreeTupleGetHeapTID(itup) \
-	( \
-	  (itup)->t_info & INDEX_ALT_TID_MASK && \
-	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
-	  ( \
-		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
-					   sizeof(ItemPointerData)) \
-	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
-	)
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+	if (BTreeTupleIsPivot(itup))
+	{
+		/* Pivot tuple heap TID representation? */
+		if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+			 BT_HEAP_TID_ATTR) != 0)
+			return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+								  sizeof(ItemPointerData));
+
+		/* Heap TID attribute was truncated */
+		return NULL;
+	}
+	else if (BTreeTupleIsPosting(itup))
+		return BTreeTupleGetPosting(itup);
+
+	return &(itup->t_tid);
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list tuple.  Works with
+ * non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxHeapTID(IndexTuple itup)
+{
+	Assert(!BTreeTupleIsPivot(itup));
+
+	if (BTreeTupleIsPosting(itup))
+		return BTreeTupleGetPosting(itup) + (BTreeTupleGetNPosting(itup) - 1);
+
+	return &(itup->t_tid);
+}
+
 /*
  * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * representation
  */
 #define BTreeTupleSetAltHeapTID(itup) \
 	do { \
-		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(BTreeTupleIsPivot(itup)); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
 								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
 	} while(0)
@@ -434,6 +603,11 @@ typedef BTStackData *BTStack;
  * indexes whose version is >= version 4.  It's convenient to keep this close
  * by, rather than accessing the metapage repeatedly.
  *
+ * safededup is set to indicate that index may use dynamic deduplication
+ * safely (index storage parameter separately indicates if deduplication is
+ * currently in use).  This is also a property of the index relation rather
+ * than an indexscan that is kept around for convenience.
+ *
  * anynullkeys indicates if any of the keys had NULL value when scankey was
  * built from index tuple (note that already-truncated tuple key attributes
  * set NULL as a placeholder key value, which also affects value of
@@ -469,6 +643,7 @@ typedef BTStackData *BTStack;
 typedef struct BTScanInsertData
 {
 	bool		heapkeyspace;
+	bool		safededup;
 	bool		anynullkeys;
 	bool		nextkey;
 	bool		pivotsearch;
@@ -507,6 +682,13 @@ typedef struct BTInsertStateData
 	bool		bounds_valid;
 	OffsetNumber low;
 	OffsetNumber stricthigh;
+
+	/*
+	 * if _bt_binsrch_insert() found the location inside existing posting
+	 * list, save the position inside the list.  This will be -1 in rare cases
+	 * where the overlapping posting list is LP_DEAD.
+	 */
+	int			postingoff;
 } BTInsertStateData;
 
 typedef BTInsertStateData *BTInsertState;
@@ -534,7 +716,10 @@ typedef BTInsertStateData *BTInsertState;
  * If we are doing an index-only scan, we save the entire IndexTuple for each
  * matched item, otherwise only its heap TID and offset.  The IndexTuples go
  * into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array.  Posting list tuples store a "base" tuple once,
+ * allowing the same key to be returned for each logical tuple associated
+ * with the physical posting list tuple (i.e. for each TID from the posting
+ * list).
  */
 
 typedef struct BTScanPosItem	/* what we remember about each match */
@@ -567,6 +752,12 @@ typedef struct BTScanPosData
 	 */
 	int			nextTupleOffset;
 
+	/*
+	 * Posting list tuples use postingTupleOffset to store the current
+	 * location of the tuple that is returned multiple times.
+	 */
+	int			postingTupleOffset;
+
 	/*
 	 * The items array is always ordered in index order (ie, increasing
 	 * indexoffset).  When scanning backwards it is convenient to fill the
@@ -578,7 +769,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxBTreeIndexTuplesPerPage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -680,6 +871,57 @@ typedef BTScanOpaqueData *BTScanOpaque;
 #define SK_BT_DESC			(INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
 #define SK_BT_NULLS_FIRST	(INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
 
+/*
+ * State used to representing a pending posting list during deduplication.
+ *
+ * Each entry represents a group of consecutive items from the page, starting
+ * from page offset number 'baseoff', which is the offset number of the "base"
+ * tuple on the page undergoing deduplication.  'nitems' is the total number
+ * of items from the page that will be merged to make a new posting tuple.
+ *
+ * Note: 'nitems' means the number of physical index tuples/line pointers on
+ * the page, starting with and including the item at offset number 'baseoff'
+ * (so nitems should be at least 2 when interval is used).  These existing
+ * tuples may be posting list tuples or regular tuples.
+ */
+typedef struct BTDedupInterval
+{
+	OffsetNumber baseoff;
+	OffsetNumber nitems;
+} BTDedupInterval;
+
+/*
+ * Btree-private state used to deduplicate items on a leaf page
+ */
+typedef struct BTDedupState
+{
+	Relation	rel;
+	/* Deduplication status info for entire page/operation */
+	Size		maxitemsize;	/* Limit on size of final tuple */
+	IndexTuple	newitem;
+	bool		checkingunique; /* Use unique index strategy? */
+	OffsetNumber skippedbase;	/* First offset skipped by checkingunique */
+
+	/* Metadata about current pending posting list */
+	ItemPointer htids;			/* Heap TIDs in pending posting list */
+	int			nhtids;			/* # heap TIDs in nhtids array */
+	int			nitems;			/* See BTDedupInterval definition */
+	Size		alltupsize;		/* Includes line pointer overhead */
+	bool		overlap;		/* Avoid overlapping posting lists? */
+
+	/* Metadata about base tuple of current pending posting list */
+	IndexTuple	base;			/* Use to form new posting list */
+	OffsetNumber baseoff;		/* page offset of base */
+	Size		basetupsize;	/* base size without posting list */
+
+	/*
+	 * Pending posting list.  Contains information about a group of
+	 * consecutive items that will be deduplicated by creating a new posting
+	 * list tuple.
+	 */
+	BTDedupInterval interval;
+} BTDedupState;
+
 /*
  * Constant definition for progress reporting.  Phase numbers must match
  * btbuildphasename.
@@ -725,6 +967,22 @@ extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
 extern void _bt_parallel_done(IndexScanDesc scan);
 extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
 
+/*
+ * prototypes for functions in nbtdedup.c
+ */
+extern void _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+							   IndexTuple newitem, Size newitemsz,
+							   bool checkingunique);
+extern void _bt_dedup_start_pending(BTDedupState *state, IndexTuple base,
+									OffsetNumber base_off);
+extern bool _bt_dedup_save_htid(BTDedupState *state, IndexTuple itup);
+extern Size _bt_dedup_finish_pending(Buffer buffer, BTDedupState *state,
+									 bool need_wal);
+extern IndexTuple _bt_form_posting(IndexTuple tuple, ItemPointer htids,
+								   int nhtids);
+extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
+								   OffsetNumber postingoff);
+
 /*
  * prototypes for functions in nbtinsert.c
  */
@@ -743,7 +1001,8 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
 /*
  * prototypes for functions in nbtpage.c
  */
-extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level);
+extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+							 bool safededup);
 extern void _bt_update_meta_cleanup_info(Relation rel,
 										 TransactionId oldestBtpoXact, float8 numHeapTuples);
 extern void _bt_upgrademetapage(Page page);
@@ -751,6 +1010,7 @@ extern Buffer _bt_getroot(Relation rel, int access);
 extern Buffer _bt_gettrueroot(Relation rel);
 extern int	_bt_getrootheight(Relation rel);
 extern bool _bt_heapkeyspace(Relation rel);
+extern bool _bt_safededup(Relation rel);
 extern void _bt_checkpage(Relation rel, Buffer buf);
 extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
 extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -762,6 +1022,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems, Relation heapRel);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems,
+								OffsetNumber *updateitemnos,
+								IndexTuple *updated, int nupdateable,
 								BlockNumber lastBlockVacuumed);
 extern int	_bt_pagedel(Relation rel, Buffer buf);
 
@@ -812,6 +1074,7 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern bool _bt_opclasses_support_dedup(Relation index);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 91b9ee00cf..b21e6f8082 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP_PAGE	0x50	/* deduplicate tuples on leaf page */
+/* 0x60 is unused */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -53,6 +54,7 @@ typedef struct xl_btree_metadata
 	uint32		fastlevel;
 	TransactionId oldest_btpo_xact;
 	float8		last_cleanup_num_heap_tuples;
+	bool		btm_safededup;
 } xl_btree_metadata;
 
 /*
@@ -61,16 +63,21 @@ typedef struct xl_btree_metadata
  * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
  * Note that INSERT_META implies it's not a leaf page.
  *
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page (data contains the inserted tuple);
+ *				 if postingoff is set, this started out as an insertion
+ *				 into an existing posting tuple at the offset before
+ *				 offnum (i.e. it's a posting list split).  (REDO will
+ *				 have to update split posting list, too.)
  * Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
  * Backup Blk 2: xl_btree_metadata, if INSERT_META
  */
 typedef struct xl_btree_insert
 {
 	OffsetNumber offnum;
+	OffsetNumber postingoff;
 } xl_btree_insert;
 
-#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, postingoff) + sizeof(OffsetNumber))
 
 /*
  * On insert with split, we save all the items going into the right sibling
@@ -91,9 +98,19 @@ typedef struct xl_btree_insert
  *
  * Backup Blk 0: original page / new left page
  *
- * The left page's data portion contains the new item, if it's the _L variant.
- * An IndexTuple representing the high key of the left page must follow with
- * either variant.
+ * The left page's data portion contains the new item, if it's the _L variant
+ * (though _R variant page split records with a posting list split sometimes
+ * need to include newitem).  An IndexTuple representing the high key of the
+ * left page must follow in all cases.
+ *
+ * The newitem is actually an "original" newitem when a posting list split
+ * occurs that requires than the original posting list be updated in passing.
+ * Recovery recognizes this case when postingoff is set, and must use the
+ * posting offset to do an in-place update of the existing posting list that
+ * was actually split, and change the newitem to the "final" newitem.  This
+ * corresponds to the xl_btree_insert postingoff-is-set case.  postingoff
+ * won't be set when a posting list split occurs where both original posting
+ * list and newitem go on the right page.
  *
  * Backup Blk 1: new right page
  *
@@ -111,10 +128,26 @@ typedef struct xl_btree_split
 {
 	uint32		level;			/* tree level of page being split */
 	OffsetNumber firstright;	/* first item moved to right page */
-	OffsetNumber newitemoff;	/* new item's offset (useful for _L variant) */
+	OffsetNumber newitemoff;	/* new item's offset */
+	OffsetNumber postingoff;	/* offset inside orig posting tuple */
 } xl_btree_split;
 
-#define SizeOfBtreeSplit	(offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit	(offsetof(xl_btree_split, postingoff) + sizeof(OffsetNumber))
+
+/*
+ * When page is deduplicated, consecutive groups of tuples with equal keys are
+ * merged together into posting list tuples.
+ *
+ * The WAL record represents the interval that describes the posing tuple
+ * that should be added to the page.
+ */
+typedef struct xl_btree_dedup
+{
+	OffsetNumber baseoff;
+	OffsetNumber nitems;
+} xl_btree_dedup;
+
+#define SizeOfBtreeDedup 	(offsetof(xl_btree_dedup, nitems) + sizeof(OffsetNumber))
 
 /*
  * This is what we need to know about delete of individual leaf index tuples.
@@ -166,16 +199,27 @@ typedef struct xl_btree_reuse_page
  * block numbers aren't given.
  *
  * Note that the *last* WAL record in any vacuum of an index is allowed to
- * have a zero length array of offsets. Earlier records must have at least one.
+ * have a zero length array of target offsets (i.e. no deletes or updates).
+ * Earlier records must have at least one.
  */
 typedef struct xl_btree_vacuum
 {
 	BlockNumber lastBlockVacuumed;
 
-	/* TARGET OFFSET NUMBERS FOLLOW */
+	/*
+	 * This field helps us to find beginning of the updated versions of tuples
+	 * which follow array of offset numbers, needed when a posting list is
+	 * vacuumed without killing all of its logical tuples.
+	 */
+	uint32		nupdated;
+	uint32		ndeleted;
+
+	/* UPDATED TARGET OFFSET NUMBERS FOLLOW (if any) */
+	/* UPDATED TUPLES TO ADD BACK FOLLOW (if any) */
+	/* DELETED TARGET OFFSET NUMBERS FOLLOW (if any) */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(BlockNumber))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.
@@ -256,6 +300,8 @@ typedef struct xl_btree_newroot
 extern void btree_redo(XLogReaderState *record);
 extern void btree_desc(StringInfo buf, XLogReaderState *record);
 extern const char *btree_identify(uint8 info);
+extern void btree_xlog_startup(void);
+extern void btree_xlog_cleanup(void);
 extern void btree_mask(char *pagedata, BlockNumber blkno);
 
 #endif							/* NBTXLOG_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 3c0db2ccf5..2b8c6c7fc8 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -36,7 +36,7 @@ PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL,
 PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
 PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
 PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask)
 PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask)
 PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
 PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index d8790ad7a3..d69402c08d 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -158,6 +158,15 @@ static relopt_bool boolRelOpts[] =
 		},
 		true
 	},
+	{
+		{
+			"deduplication",
+			"Enables deduplication on btree index leaf pages",
+			RELOPT_KIND_BTREE,
+			ShareUpdateExclusiveLock
+		},
+		true
+	},
 	/* list terminator */
 	{{NULL}}
 };
@@ -1510,8 +1519,6 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		offsetof(StdRdOptions, user_catalog_table)},
 		{"parallel_workers", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, parallel_workers)},
-		{"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
-		offsetof(StdRdOptions, vacuum_cleanup_index_scale_factor)},
 		{"vacuum_index_cleanup", RELOPT_TYPE_BOOL,
 		offsetof(StdRdOptions, vacuum_index_cleanup)},
 		{"vacuum_truncate", RELOPT_TYPE_BOOL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 2599b5d342..6e1dc596e1 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -276,6 +276,10 @@ BuildIndexValueDescription(Relation indexRelation,
 /*
  * Get the latestRemovedXid from the table entries pointed at by the index
  * tuples being deleted.
+ *
+ * Note: index access methods that don't consistently use the standard
+ * IndexTuple + heap TID item pointer representation will need to provide
+ * their own version of this function.
  */
 TransactionId
 index_compute_xid_horizon_for_tuples(Relation irel,
diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index bf245f5dab..d69808e78c 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = \
 	nbtcompare.o \
+	nbtdedup.o \
 	nbtinsert.o \
 	nbtpage.o \
 	nbtree.o \
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6db203e75c..54cb9db49d 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
 like a hint bit for a heap tuple), but physically removing tuples requires
 exclusive lock.  In the current code we try to remove LP_DEAD tuples when
 we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already).  Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)
 
 This leaves the index in a state where it has no entry for a dead tuple
 that still exists in the heap.  This is not a problem for the current
@@ -710,6 +713,75 @@ the fallback strategy assumes that duplicates are mostly inserted in
 ascending heap TID order.  The page is split in a way that leaves the left
 half of the page mostly full, and the right half of the page mostly empty.
 
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits.  Deduplication alters
+the physical representation of tuples without changing the logical contents
+of the index, and without adding overhead to read queries.  Non-pivot
+tuples are folded together into a single physical tuple with a posting list
+(a simple array of heap TIDs with the standard item pointer format).
+Deduplication is always applied lazily, at the point where it would
+otherwise be necessary to perform a page split.  It occurs only when
+LP_DEAD items have been removed, as our last line of defense against
+splitting a leaf page.  We can set the LP_DEAD bit with posting list
+tuples, though only when all table tuples are known dead. (Bitmap scans
+cannot perform LP_DEAD bit setting, and are the common case with indexes
+that contain lots of duplicates, so this downside is considered
+acceptable.)
+
+Large groups of logical duplicates tend to appear together on the same leaf
+page due to the special duplicate logic used when choosing a split point.
+This facilitates lazy/dynamic deduplication.  Deduplication can reliably
+deduplicate a large localized group of duplicates before it can span
+multiple leaf pages.  Posting list tuples are subject to the same 1/3 of a
+page restriction as any other tuple.
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists.  A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple.  An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication.  Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple.  When the incoming
+tuple really does overlap with an existing posting list, a posting list
+split is performed.  Posting list splits work in a way that more or less
+preserves the illusion that all incoming tuples do not need to be merged
+with any existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list.  The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list.  The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list.  We make
+space in the posting list by removing the heap TID that became the new
+item.  The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists.  Also, overall
+space utilization is improved by keeping existing posting lists large.
+
+The representation of posting lists is identical to the posting lists used
+by GIN, so it would be straightforward to apply GIN's varbyte encoding
+compression scheme to individual posting lists.  Posting list compression
+would break the assumptions made by posting list splits about page space
+accounting, though, so it's not clear how compression could be integrated
+with nbtree.  Besides, posting list compression does not offer a compelling
+trade-off for nbtree, since in general nbtree is optimized for consistent
+performance with many concurrent readers and writers.  A major goal of
+nbtree's lazy approach to deduplication is to limit the performance impact
+of deduplication with random updates.  Even concurrent append-only inserts
+of the same key value will tend to have inserts of individual index tuples
+in an order that doesn't quite match heap TID order.  In general, delaying
+deduplication avoids many unnecessary posting list splits, and minimizes
+page level fragmentation.
+
 Notes About Data Representation
 -------------------------------
 
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
new file mode 100644
index 0000000000..a9f9cd30f5
--- /dev/null
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -0,0 +1,704 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtdedup.c
+ *	  Deduplicate items in Lehman and Yao btrees for Postgres.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/nbtree/nbtdedup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/nbtxlog.h"
+#include "miscadmin.h"
+#include "utils/rel.h"
+
+
+/*
+ * Try to deduplicate items to free at least enough space to avoid a page
+ * split.  This function should be called during insertion, only after LP_DEAD
+ * items were removed by _bt_vacuum_one_page() to prevent a page split.
+ * (We'll have to kill LP_DEAD items here when the page's BTP_HAS_GARBAGE hint
+ * was not set, but that should be rare.)
+ *
+ * The strategy for !checkingunique callers is to perform as much
+ * deduplication as possible to free as much space as possible now, since
+ * making it harder to set LP_DEAD bits is considered an acceptable price for
+ * not having to deduplicate the same page many times.  It is unlikely that
+ * the items on the page will have their LP_DEAD bit set in the future, since
+ * that hasn't happened before now (besides, entire posting lists can still
+ * have their LP_DEAD bit set).
+ *
+ * The strategy for checkingunique callers is rather different, since the
+ * overall goal is different.  Deduplication cooperates with and enhances
+ * garbage collection, especially the LP_DEAD bit setting that takes place in
+ * _bt_check_unique().  Deduplication does as little as possible while still
+ * preventing a page split for caller, since it's less likely that posting
+ * lists will have their LP_DEAD bit set.  Deduplication avoids creating new
+ * posting lists with only two heap TIDs, and also avoids creating new posting
+ * lists from an existing posting list.  Deduplication is only useful when it
+ * delays a page split long enough for garbage collection to prevent the page
+ * split altogether.  checkingunique deduplication can make all the difference
+ * in cases where VACUUM keeps up with dead index tuples, but "recently dead"
+ * index tuples are still numerous enough to cause page splits that are truly
+ * unnecessary.
+ *
+ * Note: If newitem contains NULL values in key attributes, caller will be
+ * !checkingunique even when rel is a unique index.  The page in question will
+ * usually have many existing items with NULLs.
+ */
+void
+_bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+				   IndexTuple newitem, Size newitemsz, bool checkingunique)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buffer);
+	BTPageOpaque oopaque;
+	BTDedupState *state = NULL;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+	OffsetNumber deletable[MaxIndexTuplesPerPage];
+	bool		minimal = checkingunique;
+	int			ndeletable = 0;
+	Size		pagesaving = 0;
+	int			count = 0;
+	bool		singlevalue = false;
+
+	oopaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	/* init deduplication state needed to build posting tuples */
+	state = (BTDedupState *) palloc(sizeof(BTDedupState));
+	state->rel = rel;
+
+	state->maxitemsize = BTMaxItemSize(page);
+	state->newitem = newitem;
+	state->checkingunique = checkingunique;
+	state->skippedbase = InvalidOffsetNumber;
+	/* Metadata about current pending posting list */
+	state->htids = NULL;
+	state->nhtids = 0;
+	state->nitems = 0;
+	state->alltupsize = 0;
+	state->overlap = false;
+	/* Metadata about based tuple of current pending posting list */
+	state->base = NULL;
+	state->baseoff = InvalidOffsetNumber;
+	state->basetupsize = 0;
+
+	minoff = P_FIRSTDATAKEY(oopaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/*
+	 * Delete dead tuples if any. We cannot simply skip them in the cycle
+	 * below, because it's necessary to generate special Xlog record
+	 * containing such tuples to compute latestRemovedXid on a standby server
+	 * later.
+	 *
+	 * This should not affect performance, since it only can happen in a rare
+	 * situation when BTP_HAS_GARBAGE flag was not set and _bt_vacuum_one_page
+	 * was not called, or _bt_vacuum_one_page didn't remove all dead items.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+
+		if (ItemIdIsDead(itemid))
+			deletable[ndeletable++] = offnum;
+	}
+
+	if (ndeletable > 0)
+	{
+		/*
+		 * Skip duplication in rare cases where there were LP_DEAD items
+		 * encountered here when that frees sufficient space for caller to
+		 * avoid a page split
+		 */
+		_bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+		if (PageGetFreeSpace(page) >= newitemsz)
+		{
+			pfree(state);
+			return;
+		}
+
+		/* Continue with deduplication */
+		minoff = P_FIRSTDATAKEY(oopaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+	}
+
+	/* Make sure that new page won't have garbage flag set */
+	oopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+	newitemsz += sizeof(ItemIdData);
+	/* Conservatively size array */
+	state->htids = palloc(state->maxitemsize);
+
+	/*
+	 * Determine if a "single value" strategy page split is likely to occur
+	 * shortly after deduplication finishes.  It should be possible for the
+	 * single value split to find a split point that packs the left half of
+	 * the split BTREE_SINGLEVAL_FILLFACTOR% full.
+	 */
+	if (!checkingunique)
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+
+		itemid = PageGetItemId(page, minoff);
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		if (_bt_keep_natts_fast(rel, newitem, itup) > natts)
+		{
+			itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
+			itup = (IndexTuple) PageGetItem(page, itemid);
+
+			/*
+			 * Use different strategy if future page split likely to need to
+			 * use "single value" strategy
+			 */
+			if (_bt_keep_natts_fast(rel, newitem, itup) > natts)
+				singlevalue = true;
+		}
+	}
+
+	/*
+	 * Iterate over tuples on the page, try to deduplicate them into posting
+	 * lists and insert into new page.  NOTE: It's essential to reassess the
+	 * max offset on each iteration, since it will change as items are
+	 * deduplicated.
+	 */
+	offnum = minoff;
+retry:
+	while (offnum <= PageGetMaxOffsetNumber(page))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(!ItemIdIsDead(itemid));
+
+		if (state->nitems == 0)
+		{
+			/*
+			 * No previous/base tuple for the data item -- use the data item
+			 * as base tuple of pending posting list
+			 */
+			_bt_dedup_start_pending(state, itup, offnum);
+		}
+		else if (_bt_keep_natts_fast(rel, state->base, itup) > natts &&
+				 _bt_dedup_save_htid(state, itup))
+		{
+			/*
+			 * Tuple is equal to base tuple of pending posting list.  Heap
+			 * TID(s) for itup have been saved in state.  The next iteration
+			 * will also end up here if it's possible to merge the next tuple
+			 * into the same pending posting list.
+			 */
+		}
+		else
+		{
+			/*
+			 * Tuple is not equal to pending posting list tuple, or
+			 * _bt_dedup_save_htid() opted to not merge current item into
+			 * pending posting list for some other reason (e.g., adding more
+			 * TIDs would have caused posting list to exceed BTMaxItemSize()
+			 * limit).
+			 *
+			 * If state contains pending posting list with more than one item,
+			 * form new posting tuple, and update the page.  Otherwise, reset
+			 * the state and move on.
+			 */
+			pagesaving += _bt_dedup_finish_pending(buffer, state,
+												   RelationNeedsWAL(rel));
+
+			count++;
+
+			/*
+			 * When caller is a checkingunique caller and we have deduplicated
+			 * enough to avoid a page split, do minimal deduplication in case
+			 * the remaining items are about to be marked dead within
+			 * _bt_check_unique().
+			 */
+			if (minimal && pagesaving >= newitemsz)
+				break;
+
+			/*
+			 * Consider special steps when a future page split of the leaf
+			 * page is likely to occur using nbtsplitloc.c's "single value"
+			 * strategy
+			 */
+			if (singlevalue)
+			{
+				/*
+				 * Adjust maxitemsize so that there isn't a third and final
+				 * 1/3 of a page width tuple that fills the page to capacity.
+				 * The third tuple produced should be smaller than the first
+				 * two by an amount equal to the free space that nbtsplitloc.c
+				 * is likely to want to leave behind when the page it split.
+				 * When there are 3 posting lists on the page, then we end
+				 * deduplication.  Remaining tuples on the page can be
+				 * deduplicated later, when they're on the new right sibling
+				 * of this page, and the new sibling page needs to be split in
+				 * turn.
+				 *
+				 * Note that it doesn't matter if there are items on the page
+				 * that were already 1/3 of a page during current pass;
+				 * they'll still count as the first two posting list tuples.
+				 */
+				if (count == 2)
+				{
+					Size		totalspace;
+
+					totalspace = PageGetPageSize(page) - SizeOfPageHeaderData -
+						MAXALIGN(sizeof(BTPageOpaqueData));
+					state->maxitemsize -= totalspace *
+						((100 - BTREE_SINGLEVAL_FILLFACTOR) / 100.0);
+				}
+				else if (count == 3)
+					break;
+			}
+
+			/*
+			 * Next iteration starts immediately after base tuple offset (this
+			 * will be the next offset on the page when we didn't modify the
+			 * page)
+			 */
+			offnum = state->baseoff;
+		}
+
+		offnum = OffsetNumberNext(offnum);
+	}
+
+	/* Handle the last item when pending posting list is not empty */
+	if (state->nitems != 0)
+	{
+		pagesaving += _bt_dedup_finish_pending(buffer, state,
+											   RelationNeedsWAL(rel));
+		count++;
+	}
+
+	if (pagesaving < newitemsz && state->skippedbase != InvalidOffsetNumber)
+	{
+		/*
+		 * Didn't free enough space for new item in first checkingunique pass.
+		 * Try making a second pass over the page, this time starting from the
+		 * first candidate posting list base offset that was skipped over in
+		 * the first pass (only do a second pass when this actually happened).
+		 *
+		 * The second pass over the page may deduplicate items that were
+		 * initially passed over due to concerns about limiting the
+		 * effectiveness of LP_DEAD bit setting within _bt_check_unique().
+		 * Note that the second pass will still stop deduplicating as soon as
+		 * enough space has been freed to avoid an immediate page split.
+		 */
+		Assert(state->checkingunique);
+		offnum = state->skippedbase;
+
+		state->checkingunique = false;
+		state->skippedbase = InvalidOffsetNumber;
+		state->alltupsize = 0;
+		state->nitems = 0;
+		state->base = NULL;
+		state->baseoff = InvalidOffsetNumber;
+		state->basetupsize = 0;
+		goto retry;
+	}
+
+	/* Local space accounting should agree with page accounting */
+	Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
+
+	/* be tidy */
+	pfree(state->htids);
+	pfree(state);
+}
+
+/*
+ * Create a new pending posting list tuple based on caller's tuple.
+ *
+ * Every tuple processed by the deduplication routines either becomes the base
+ * tuple for a posting list, or gets its heap TID(s) accepted into a pending
+ * posting list.  A tuple that starts out as the base tuple for a posting list
+ * will only actually be rewritten within _bt_dedup_finish_pending() when
+ * there was at least one successful call to _bt_dedup_save_htid().
+ */
+void
+_bt_dedup_start_pending(BTDedupState *state, IndexTuple base,
+						OffsetNumber baseoff)
+{
+	Assert(state->nhtids == 0);
+	Assert(state->nitems == 0);
+
+	/*
+	 * Copy heap TIDs from new base tuple for new candidate posting list into
+	 * ipd array.  Assume that we'll eventually create a new posting tuple by
+	 * merging later tuples with this existing one, though we may not.
+	 */
+	if (!BTreeTupleIsPosting(base))
+	{
+		memcpy(state->htids, base, sizeof(ItemPointerData));
+		state->nhtids = 1;
+		/* Save size of tuple without any posting list */
+		state->basetupsize = IndexTupleSize(base);
+	}
+	else
+	{
+		int			nposting;
+
+		nposting = BTreeTupleGetNPosting(base);
+		memcpy(state->htids, BTreeTupleGetPosting(base),
+			   sizeof(ItemPointerData) * nposting);
+		state->nhtids = nposting;
+		/* Save size of tuple without any posting list */
+		state->basetupsize = BTreeTupleGetPostingOffset(base);
+	}
+
+	/*
+	 * Save new base tuple itself -- it'll be needed if we actually create a
+	 * new posting list from new pending posting list.
+	 *
+	 * Must maintain size of all tuples (including line pointer overhead) to
+	 * calculate space savings on page within _bt_dedup_finish_pending().
+	 * Also, save number of base tuple logical tuples so that we can save
+	 * cycles in the common case where an existing posting list can't or won't
+	 * be merged with other tuples on the page.
+	 */
+	state->nitems = 1;
+	state->base = base;
+	state->baseoff = baseoff;
+	state->alltupsize = MAXALIGN(IndexTupleSize(base)) + sizeof(ItemIdData);
+	/* Also save baseoff in pending state for interval */
+	state->interval.baseoff = state->baseoff;
+	state->overlap = false;
+	if (state->newitem)
+	{
+		/* Might overlap with new item -- mark it as possible if it is */
+		if (BTreeTupleGetHeapTID(base) < BTreeTupleGetHeapTID(state->newitem))
+			state->overlap = true;
+	}
+}
+
+/*
+ * Save itup heap TID(s) into pending posting list where possible.
+ *
+ * Returns bool indicating if the pending posting list managed by state has
+ * itup's heap TID(s) saved.  When this is false, enlarging the pending
+ * posting list by the required amount would exceed the maxitemsize limit, so
+ * caller must finish the pending posting list tuple.  (Generally itup becomes
+ * the base tuple of caller's new pending posting list).
+ */
+bool
+_bt_dedup_save_htid(BTDedupState *state, IndexTuple itup)
+{
+	int			nhtids;
+	ItemPointer htids;
+	Size		mergedtupsz;
+
+	if (!BTreeTupleIsPosting(itup))
+	{
+		nhtids = 1;
+		htids = &itup->t_tid;
+	}
+	else
+	{
+		nhtids = BTreeTupleGetNPosting(itup);
+		htids = BTreeTupleGetPosting(itup);
+	}
+
+	/*
+	 * Don't append (have caller finish pending posting list as-is) if
+	 * appending heap TID(s) from itup would put us over limit
+	 */
+	mergedtupsz = MAXALIGN(state->basetupsize +
+						   (state->nhtids + nhtids) *
+						   sizeof(ItemPointerData));
+
+	if (mergedtupsz > state->maxitemsize)
+		return false;
+
+	/* Don't merge existing posting lists with checkingunique */
+	if (state->checkingunique &&
+		(BTreeTupleIsPosting(state->base) || nhtids > 1))
+	{
+		/* May begin here if second pass over page is required */
+		if (state->skippedbase == InvalidOffsetNumber)
+			state->skippedbase = state->baseoff;
+		return false;
+	}
+
+	if (state->overlap)
+	{
+		if (BTreeTupleGetMaxHeapTID(itup) > BTreeTupleGetHeapTID(state->newitem))
+		{
+			/*
+			 * newitem has heap TID in the range of the would-be new posting
+			 * list.  Avoid an immediate posting list split for caller.
+			 */
+			if (_bt_keep_natts_fast(state->rel, state->newitem, itup) >
+				IndexRelationGetNumberOfAttributes(state->rel))
+			{
+				state->newitem = NULL;	/* avoid unnecessary comparisons */
+				return false;
+			}
+		}
+	}
+
+	/*
+	 * Save heap TIDs to pending posting list tuple -- itup can be merged into
+	 * pending posting list
+	 */
+	state->nitems++;
+	memcpy(state->htids + state->nhtids, htids,
+		   sizeof(ItemPointerData) * nhtids);
+	state->nhtids += nhtids;
+	state->alltupsize += MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
+
+	return true;
+}
+
+/*
+ * Finalize pending posting list tuple, and add it to the page.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * Returns space saving from deduplicating to make a new posting list tuple.
+ * Note that this includes line pointer overhead.  This is zero in the case
+ * where no deduplication was possible.
+ */
+Size
+_bt_dedup_finish_pending(Buffer buffer, BTDedupState *state, bool need_wal)
+{
+	Size		spacesaving = 0;
+	Page		page = BufferGetPage(buffer);
+	int			minimum = 2;
+
+	Assert(state->nitems > 0);
+	Assert(state->nitems <= state->nhtids);
+	Assert(state->interval.baseoff == state->baseoff);
+
+	/*
+	 * Only create a posting list when at least 3 heap TIDs will appear in the
+	 * checkingunique case (checkingunique strategy won't merge existing
+	 * posting list tuples, so we know that the number of items here must also
+	 * be the total number of heap TIDs).  Creating a new posting lists with
+	 * only two heap TIDs won't even save enough space to fit another
+	 * duplicate with the same key as the posting list.  This is a bad
+	 * trade-off if there is a chance that the LP_DEAD bit can be set for
+	 * either existing tuple by putting off deduplication.
+	 *
+	 * (Note that a second pass over the page can deduplicate the item if that
+	 * is truly the only way to avoid a page split for checkingunique caller)
+	 */
+	Assert(!state->checkingunique || state->nitems == 1 ||
+		   state->nhtids == state->nitems);
+	if (state->checkingunique)
+	{
+		minimum = 3;
+		/* May begin here if second pass over page is required */
+		if (state->nitems == 2 && state->skippedbase == InvalidOffsetNumber)
+			state->skippedbase = state->baseoff;
+	}
+
+	if (state->nitems >= minimum)
+	{
+		IndexTuple	final;
+		Size		finalsz;
+		OffsetNumber offnum;
+		OffsetNumber deletable[MaxOffsetNumber];
+		int			ndeletable = 0;
+
+		/* find all tuples that will be replaced with this new posting tuple */
+		for (offnum = state->baseoff;
+			 offnum < state->baseoff + state->nitems;
+			 offnum = OffsetNumberNext(offnum))
+			deletable[ndeletable++] = offnum;
+
+		/* Form a tuple with a posting list */
+		final = _bt_form_posting(state->base, state->htids, state->nhtids);
+		finalsz = IndexTupleSize(final);
+		spacesaving = state->alltupsize - (finalsz + sizeof(ItemIdData));
+		/* Must have saved some space */
+		Assert(spacesaving > 0 && spacesaving < BLCKSZ);
+
+		/* Save final number of items for posting list */
+		state->interval.nitems = state->nitems;
+
+		Assert(finalsz <= state->maxitemsize);
+		Assert(finalsz == MAXALIGN(IndexTupleSize(final)));
+
+		START_CRIT_SECTION();
+
+		/* Delete items to replace */
+		PageIndexMultiDelete(page, deletable, ndeletable);
+		/* Insert posting tuple */
+		if (PageAddItem(page, (Item) final, finalsz, state->baseoff, false,
+						false) == InvalidOffsetNumber)
+			elog(ERROR, "deduplication failed to add tuple to page");
+
+		MarkBufferDirty(buffer);
+
+		/* Log deduplicated items */
+		if (need_wal)
+		{
+			XLogRecPtr	recptr;
+			xl_btree_dedup xlrec_dedup;
+
+			xlrec_dedup.baseoff = state->interval.baseoff;
+			xlrec_dedup.nitems = state->interval.nitems;
+
+			XLogBeginInsert();
+			XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+			XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+			recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP_PAGE);
+
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
+
+		pfree(final);
+	}
+
+	/* Reset state for next pending posting list */
+	state->nhtids = 0;
+	state->nitems = 0;
+	state->alltupsize = 0;
+
+	return spacesaving;
+}
+
+/*
+ * Build a posting list tuple from a "base" index tuple and a list of heap
+ * TIDs for posting list.
+ *
+ * Caller's "htids" array must be sorted in ascending order.  Any heap TIDs
+ * from caller's base tuple will not appear in returned posting list.
+ *
+ * If nhtids == 1, builds a non-posting tuple (posting list tuples can never
+ * have a single heap TID).
+ */
+IndexTuple
+_bt_form_posting(IndexTuple tuple, ItemPointer htids, int nhtids)
+{
+	uint32		keysize,
+				newsize = 0;
+	IndexTuple	itup;
+
+	/* We only need key part of the tuple */
+	if (BTreeTupleIsPosting(tuple))
+		keysize = BTreeTupleGetPostingOffset(tuple);
+	else
+		keysize = IndexTupleSize(tuple);
+
+	Assert(nhtids > 0);
+
+	/* Add space needed for posting list */
+	if (nhtids > 1)
+		newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nhtids;
+	else
+		newsize = keysize;
+
+	newsize = MAXALIGN(newsize);
+	itup = palloc0(newsize);
+	memcpy(itup, tuple, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+
+	if (nhtids > 1)
+	{
+		/* Form posting tuple, fill posting fields */
+
+		itup->t_info |= INDEX_ALT_TID_MASK;
+		BTreeSetPostingMeta(itup, nhtids, SHORTALIGN(keysize));
+		/* Copy posting list into the posting tuple */
+		memcpy(BTreeTupleGetPosting(itup), htids,
+			   sizeof(ItemPointerData) * nhtids);
+
+#ifdef USE_ASSERT_CHECKING
+		{
+			/* Assert that htid array is sorted and has unique TIDs */
+			ItemPointerData last;
+			ItemPointer current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+				current = BTreeTupleGetPostingN(itup, i);
+				Assert(ItemPointerCompare(current, &last) > 0);
+				ItemPointerCopy(current, &last);
+			}
+		}
+#endif
+	}
+	else
+	{
+		/* To finish building of a non-posting tuple, copy TID from htids */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		ItemPointerCopy(htids, &itup->t_tid);
+	}
+
+	return itup;
+}
+
+/*
+ * Prepare for a posting list split by swapping heap TID in newitem with heap
+ * TID from original posting list (the 'oposting' heap TID located at offset
+ * 'postingoff').
+ *
+ * Returns new posting list tuple, which is palloc()'d in caller's context.
+ * This is guaranteed to be the same size as 'oposting'.  Modified version of
+ * newitem is what caller actually inserts inside the critical section that
+ * also performs an in-place update of posting list.
+ *
+ * Explicit WAL-logging of newitem must use the original version of newitem in
+ * order to make it possible for our nbtxlog.c callers to correctly REDO
+ * original steps.  (This approach avoids any explicit WAL-logging of a
+ * posting list tuple.  This is important because posting lists are often much
+ * larger than plain tuples.)
+ */
+IndexTuple
+_bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
+				 OffsetNumber postingoff)
+{
+	int			nhtids;
+	char	   *replacepos;
+	char	   *rightpos;
+	Size		nbytes;
+	IndexTuple	nposting;
+
+	Assert(BTreeTupleIsPosting(oposting));
+	nhtids = BTreeTupleGetNPosting(oposting);
+	Assert(postingoff < nhtids);
+
+	nposting = CopyIndexTuple(oposting);
+	replacepos = (char *) BTreeTupleGetPostingN(nposting, postingoff);
+	rightpos = replacepos + sizeof(ItemPointerData);
+	nbytes = (nhtids - postingoff - 1) * sizeof(ItemPointerData);
+
+	/*
+	 * Move item pointers in posting list to make a gap for the new item's
+	 * heap TID (shift TIDs one place to the right, losing original rightmost
+	 * TID)
+	 */
+	memmove(rightpos, replacepos, nbytes);
+
+	/* Fill the gap with the TID of the new item */
+	ItemPointerCopy(&newitem->t_tid, (ItemPointer) replacepos);
+
+	/* Copy original posting list's rightmost TID into new item */
+	ItemPointerCopy(BTreeTupleGetPostingN(oposting, nhtids - 1),
+					&newitem->t_tid);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(nposting),
+							  BTreeTupleGetHeapTID(newitem)) < 0);
+	Assert(BTreeTupleGetNPosting(nposting) == BTreeTupleGetNPosting(oposting));
+
+	return nposting;
+}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b84bf1c3df..3103d8eb56 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,10 +47,12 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   BTStack stack,
 						   IndexTuple itup,
 						   OffsetNumber newitemoff,
+						   int postingoff,
 						   bool split_only_page);
 static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 						Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
-						IndexTuple newitem);
+						IndexTuple newitem, IndexTuple orignewitem,
+						IndexTuple nposting, OffsetNumber postingoff);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 							  BTStack stack, bool is_root, bool is_only);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
@@ -61,7 +63,8 @@ static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
  *	_bt_doinsert() -- Handle insertion of a single index tuple in the tree.
  *
  *		This routine is called by the public interface routine, btinsert.
- *		By here, itup is filled in, including the TID.
+ *		By here, itup is filled in, including the TID.  Caller should be
+ *		prepared for us to scribble on 'itup'.
  *
  *		If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
  *		will allow duplicates.  Otherwise (UNIQUE_CHECK_YES or
@@ -125,6 +128,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	insertstate.itup_key = itup_key;
 	insertstate.bounds_valid = false;
 	insertstate.buf = InvalidBuffer;
+	insertstate.postingoff = 0;
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -300,7 +304,7 @@ top:
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
 		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
-					   itup, newitemoff, false);
+					   itup, newitemoff, insertstate.postingoff, false);
 	}
 	else
 	{
@@ -353,6 +357,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 	BTPageOpaque opaque;
 	Buffer		nbuf = InvalidBuffer;
 	bool		found = false;
+	bool		inposting = false;
+	bool		prev_all_dead = true;
+	int			curposti = 0;
 
 	/* Assume unique until we find a duplicate */
 	*is_unique = true;
@@ -374,6 +381,11 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 
 	/*
 	 * Scan over all equal tuples, looking for live conflicts.
+	 *
+	 * Note that each iteration of the loop processes one heap TID, not one
+	 * index tuple.  The page offset number won't be advanced for iterations
+	 * which process heap TIDs from posting list tuples until the last such
+	 * heap TID for the posting list (curposti will be advanced instead).
 	 */
 	Assert(!insertstate->bounds_valid || insertstate->low == offset);
 	Assert(!itup_key->anynullkeys);
@@ -435,7 +447,27 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 
 				/* okay, we gotta fetch the heap tuple ... */
 				curitup = (IndexTuple) PageGetItem(page, curitemid);
-				htid = curitup->t_tid;
+
+				/*
+				 * decide if this is the first heap TID in tuple we'll
+				 * process, or if we should continue to process current
+				 * posting list
+				 */
+				if (!BTreeTupleIsPosting(curitup))
+				{
+					htid = curitup->t_tid;
+					inposting = false;
+				}
+				else if (!inposting)
+				{
+					/* First heap TID in posting list */
+					inposting = true;
+					prev_all_dead = true;
+					curposti = 0;
+				}
+
+				if (inposting)
+					htid = *BTreeTupleGetPostingN(curitup, curposti);
 
 				/*
 				 * If we are doing a recheck, we expect to find the tuple we
@@ -511,8 +543,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					 * not part of this chain because it had a different index
 					 * entry.
 					 */
-					htid = itup->t_tid;
-					if (table_index_fetch_tuple_check(heapRel, &htid,
+					if (table_index_fetch_tuple_check(heapRel, &itup->t_tid,
 													  SnapshotSelf, NULL))
 					{
 						/* Normal case --- it's still live */
@@ -570,12 +601,14 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 													RelationGetRelationName(rel))));
 					}
 				}
-				else if (all_dead)
+				else if (all_dead && (!inposting ||
+									  (prev_all_dead &&
+									   curposti == BTreeTupleGetNPosting(curitup) - 1)))
 				{
 					/*
-					 * The conflicting tuple (or whole HOT chain) is dead to
-					 * everyone, so we may as well mark the index entry
-					 * killed.
+					 * The conflicting tuple (or all HOT chains pointed to by
+					 * all posting list TIDs) is dead to everyone, so mark the
+					 * index entry killed.
 					 */
 					ItemIdMarkDead(curitemid);
 					opaque->btpo_flags |= BTP_HAS_GARBAGE;
@@ -589,14 +622,29 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					else
 						MarkBufferDirtyHint(insertstate->buf, true);
 				}
+
+				/*
+				 * Remember if posting list tuple has even a single HOT chain
+				 * whose members are not all dead
+				 */
+				if (!all_dead && inposting)
+					prev_all_dead = false;
 			}
 		}
 
-		/*
-		 * Advance to next tuple to continue checking.
-		 */
-		if (offset < maxoff)
+		if (inposting && curposti < BTreeTupleGetNPosting(curitup) - 1)
+		{
+			/* Advance to next TID in same posting list */
+			curposti++;
+			continue;
+		}
+		else if (offset < maxoff)
+		{
+			/* Advance to next tuple */
+			curposti = 0;
+			inposting = false;
 			offset = OffsetNumberNext(offset);
+		}
 		else
 		{
 			int			highkeycmp;
@@ -621,6 +669,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					elog(ERROR, "fell off the end of index \"%s\"",
 						 RelationGetRelationName(rel));
 			}
+			curposti = 0;
+			inposting = false;
 			maxoff = PageGetMaxOffsetNumber(page);
 			offset = P_FIRSTDATAKEY(opaque);
 			/* Don't invalidate binary search bounds */
@@ -689,6 +739,7 @@ _bt_findinsertloc(Relation rel,
 	BTScanInsert itup_key = insertstate->itup_key;
 	Page		page = BufferGetPage(insertstate->buf);
 	BTPageOpaque lpageop;
+	OffsetNumber location;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -751,13 +802,26 @@ _bt_findinsertloc(Relation rel,
 
 		/*
 		 * If the target page is full, see if we can obtain enough space by
-		 * erasing LP_DEAD items
+		 * erasing LP_DEAD items.  If that doesn't work out, and if the index
+		 * deduplication is both possible and enabled, try deduplication.
 		 */
-		if (PageGetFreeSpace(page) < insertstate->itemsz &&
-			P_HAS_GARBAGE(lpageop))
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
 		{
-			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
-			insertstate->bounds_valid = false;
+			if (P_HAS_GARBAGE(lpageop))
+			{
+				_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+				insertstate->bounds_valid = false;
+			}
+
+			if (insertstate->itup_key->safededup &&
+				BtreeGetDoDedupOption(rel) &&
+				PageGetFreeSpace(page) < insertstate->itemsz)
+			{
+				_bt_dedup_one_page(rel, insertstate->buf, heapRel,
+								   insertstate->itup, insertstate->itemsz,
+								   checkingunique);
+				insertstate->bounds_valid = false;
+			}
 		}
 	}
 	else
@@ -839,7 +903,38 @@ _bt_findinsertloc(Relation rel,
 	Assert(P_RIGHTMOST(lpageop) ||
 		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
 
-	return _bt_binsrch_insert(rel, insertstate);
+	location = _bt_binsrch_insert(rel, insertstate);
+
+	/*
+	 * Insertion is not prepared for the case where an LP_DEAD posting list
+	 * tuple must be split.  In the unlikely event that this happens, call
+	 * _bt_dedup_one_page() to force it to kill all LP_DEAD items.
+	 */
+	if (unlikely(insertstate->postingoff == -1))
+	{
+		Assert(insertstate->itup_key->safededup);
+
+		/*
+		 * Don't check if the option is enabled, since no actual deduplication
+		 * will be done, just cleanup.
+		 */
+		_bt_dedup_one_page(rel, insertstate->buf, heapRel, insertstate->itup,
+						   0, checkingunique);
+		Assert(!P_HAS_GARBAGE(lpageop));
+
+		/* Must reset insertstate ahead of new _bt_binsrch_insert() call */
+		insertstate->bounds_valid = false;
+		insertstate->postingoff = 0;
+		location = _bt_binsrch_insert(rel, insertstate);
+
+		/*
+		 * Might still have to split some other posting list now, but that
+		 * should never be LP_DEAD
+		 */
+		Assert(insertstate->postingoff >= 0);
+	}
+
+	return location;
 }
 
 /*
@@ -905,10 +1000,12 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		This recursive procedure does the following things:
  *
+ *			+  if necessary, splits an existing posting list on page.
+ *			   This is only needed when 'postingoff' is non-zero.
  *			+  if necessary, splits the target page, using 'itup_key' for
  *			   suffix truncation on leaf pages (caller passes NULL for
  *			   non-leaf pages).
- *			+  inserts the tuple.
+ *			+  inserts the new tuple (could be from split posting list).
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
  *			   right using information stored in the parent stack).
@@ -918,7 +1015,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		On entry, we must have the correct buffer in which to do the
  *		insertion, and the buffer must be pinned and write-locked.  On return,
- *		we will have dropped both the pin and the lock on the buffer.
+ *		we will have dropped both the pin and the lock on the buffer.  Caller
+ *		should be prepared for us to scribble on 'itup'.
  *
  *		This routine only performs retail tuple insertions.  'itup' should
  *		always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +1034,15 @@ _bt_insertonpg(Relation rel,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
+			   int postingoff,
 			   bool split_only_page)
 {
 	Page		page;
 	BTPageOpaque lpageop;
 	Size		itemsz;
+	IndexTuple	oposting;
+	IndexTuple	origitup = NULL;
+	IndexTuple	nposting = NULL;
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1056,8 @@ _bt_insertonpg(Relation rel,
 	Assert(P_ISLEAF(lpageop) ||
 		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
+	/* retail insertions of posting list tuples are disallowed */
+	Assert(!BTreeTupleIsPosting(itup));
 
 	/* The caller should've finished any incomplete splits already. */
 	if (P_INCOMPLETE_SPLIT(lpageop))
@@ -964,6 +1068,43 @@ _bt_insertonpg(Relation rel,
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
+	/*
+	 * Do we need to split an existing posting list item?
+	 */
+	if (postingoff != 0)
+	{
+		ItemId		itemid = PageGetItemId(page, newitemoff);
+
+		/*
+		 * The new tuple is a duplicate with a heap TID that falls inside the
+		 * range of an existing posting list tuple on a leaf page.  Prepare to
+		 * split an existing posting list by swapping new item's heap TID with
+		 * the rightmost heap TID from original posting list, and generating a
+		 * new version of the posting list that has new item's heap TID.
+		 *
+		 * Posting list splits work by modifying the overlapping posting list
+		 * as part of the same atomic operation that inserts the "new item".
+		 * The space accounting is kept simple, since it does not need to
+		 * consider posting list splits at all (this is particularly important
+		 * for the case where we also have to split the page).  Overwriting
+		 * the posting list with its post-split version is treated as an extra
+		 * step in either the insert or page split critical section.
+		 */
+		Assert(P_ISLEAF(lpageop));
+		Assert(!ItemIdIsDead(itemid));
+		Assert(postingoff > 0);
+		oposting = (IndexTuple) PageGetItem(page, itemid);
+
+		/* save a copy of itup with unchanged TID for xlog record */
+		origitup = CopyIndexTuple(itup);
+		nposting = _bt_swap_posting(itup, oposting, postingoff);
+
+		Assert(BTreeTupleGetNPosting(nposting) ==
+			   BTreeTupleGetNPosting(oposting));
+		/* Alter offset so that it goes after existing posting list */
+		newitemoff = OffsetNumberNext(newitemoff);
+	}
+
 	/*
 	 * Do we need to split the page to fit the item on it?
 	 *
@@ -996,7 +1137,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+						 origitup, nposting, postingoff);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1075,6 +1217,15 @@ _bt_insertonpg(Relation rel,
 			elog(PANIC, "failed to add new item to block %u in index \"%s\"",
 				 itup_blkno, RelationGetRelationName(rel));
 
+		if (nposting)
+		{
+			/*
+			 * Posting list split requires an in-place update of the existing
+			 * posting list
+			 */
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+		}
+
 		MarkBufferDirty(buf);
 
 		if (BufferIsValid(metabuf))
@@ -1116,6 +1267,7 @@ _bt_insertonpg(Relation rel,
 			XLogRecPtr	recptr;
 
 			xlrec.offnum = itup_off;
+			xlrec.postingoff = postingoff;
 
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
@@ -1144,6 +1296,7 @@ _bt_insertonpg(Relation rel,
 				xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 				xlmeta.last_cleanup_num_heap_tuples =
 					metad->btm_last_cleanup_num_heap_tuples;
+				xlmeta.btm_safededup = metad->btm_safededup;
 
 				XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 				XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));
@@ -1152,7 +1305,19 @@ _bt_insertonpg(Relation rel,
 			}
 
 			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
-			XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+
+			/*
+			 * We always write newitem to the page, but when there is an
+			 * original newitem due to a posting list split then we log the
+			 * original item instead.  REDO routine must reconstruct the final
+			 * newitem at the same time it reconstructs nposting.
+			 */
+			if (postingoff == 0)
+				XLogRegisterBufData(0, (char *) itup,
+									IndexTupleSize(itup));
+			else
+				XLogRegisterBufData(0, (char *) origitup,
+									IndexTupleSize(origitup));
 
 			recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
@@ -1194,6 +1359,13 @@ _bt_insertonpg(Relation rel,
 			_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
 			RelationSetTargetBlock(rel, cachedBlock);
 	}
+
+	/* be tidy */
+	if (postingoff != 0)
+	{
+		pfree(nposting);
+		pfree(origitup);
+	}
 }
 
 /*
@@ -1209,12 +1381,25 @@ _bt_insertonpg(Relation rel,
  *		This function will clear the INCOMPLETE_SPLIT flag on it, and
  *		release the buffer.
  *
+ *		orignewitem, nposting, and postingoff are needed when an insert of
+ *		orignewitem results in both a posting list split and a page split.
+ *		newitem and nposting are replacements for orignewitem and the
+ *		existing posting list on the page respectively.  These extra
+ *		posting list split details are used here in the same way as they
+ *		are used in the more common case where a posting list split does
+ *		not coincide with a page split.  We need to deal with posting list
+ *		splits directly in order to ensure that everything that follows
+ *		from the insert of orignewitem is handled as a single atomic
+ *		operation (though caller's insert of a new pivot/downlink into
+ *		parent page will still be a separate operation).
+ *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
  */
 static Buffer
 _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+		  IndexTuple orignewitem, IndexTuple nposting, OffsetNumber postingoff)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1236,12 +1421,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	OffsetNumber firstright;
 	OffsetNumber maxoff;
 	OffsetNumber i;
+	OffsetNumber replacepostingoff = InvalidOffsetNumber;
 	bool		newitemonleft,
 				isleaf;
 	IndexTuple	lefthikey;
 	int			indnatts = IndexRelationGetNumberOfAttributes(rel);
 	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
+	/*
+	 * Determine offset number of existing posting list on page when a split
+	 * of a posting list needs to take place as the page is split
+	 */
+	if (nposting != NULL)
+	{
+		Assert(itup_key->heapkeyspace);
+		replacepostingoff = OffsetNumberPrev(newitemoff);
+	}
+
 	/*
 	 * origpage is the original page to be split.  leftpage is a temporary
 	 * buffer that receives the left-sibling data, which will be copied back
@@ -1273,6 +1469,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	 * newitemoff == firstright.  In all other cases it's clear which side of
 	 * the split every tuple goes on from context.  newitemonleft is usually
 	 * (but not always) redundant information.
+	 *
+	 * Note: In theory, the split point choice logic should operate against a
+	 * version of the page that already replaced the posting list at offset
+	 * replacepostingoff with nposting where applicable.  We don't bother with
+	 * that, though.  Both versions of the posting list must be the same size,
+	 * and both will have the same base tuple key values, so split point
+	 * choice is never affected.
 	 */
 	firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
 								  newitem, &newitemonleft);
@@ -1340,6 +1543,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, firstright);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		/* Behave as if origpage posting list has already been swapped */
+		if (firstright == replacepostingoff)
+			item = nposting;
 	}
 
 	/*
@@ -1373,6 +1579,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
 			itemid = PageGetItemId(origpage, lastleftoff);
 			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+			/* Behave as if origpage posting list has already been swapped */
+			if (lastleftoff == replacepostingoff)
+				lastleft = nposting;
 		}
 
 		Assert(lastleft != item);
@@ -1480,8 +1689,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 
+		/*
+		 * did caller pass new replacement posting list tuple due to posting
+		 * list split?
+		 */
+		if (i == replacepostingoff)
+		{
+			/*
+			 * swap origpage posting list with post-posting-list-split version
+			 * from caller
+			 */
+			Assert(isleaf);
+			Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+			item = nposting;
+		}
+
 		/* does new item belong before this one? */
-		if (i == newitemoff)
+		else if (i == newitemoff)
 		{
 			if (newitemonleft)
 			{
@@ -1650,8 +1874,12 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		XLogRecPtr	recptr;
 
 		xlrec.level = ropaque->btpo.level;
+		/* See comments below on newitem, orignewitem, and posting lists */
 		xlrec.firstright = firstright;
 		xlrec.newitemoff = newitemoff;
+		xlrec.postingoff = InvalidOffsetNumber;
+		if (replacepostingoff < firstright)
+			xlrec.postingoff = postingoff;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1670,11 +1898,45 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		 * because it's included with all the other items on the right page.)
 		 * Show the new item as belonging to the left page buffer, so that it
 		 * is not stored if XLogInsert decides it needs a full-page image of
-		 * the left page.  We store the offset anyway, though, to support
-		 * archive compression of these records.
+		 * the left page.  We always store newitemoff in the record, though.
+		 *
+		 * The details are sometimes slightly different for page splits that
+		 * coincide with a posting list split.  If both the replacement
+		 * posting list and newitem go on the right page, then we don't need
+		 * to log anything extra, just like the simple !newitemonleft
+		 * no-posting-split case (postingoff isn't set in the WAL record, so
+		 * recovery doesn't need to process a posting list split at all).
+		 * Otherwise, we set postingoff and log orignewitem instead of
+		 * newitem, despite having actually inserted newitem.  Recovery must
+		 * reconstruct nposting and newitem by calling _bt_swap_posting().
+		 *
+		 * Note: It's possible that our page split point is the point that
+		 * makes the posting list lastleft and newitem firstright.  This is
+		 * the only case where we log orignewitem despite newitem going on the
+		 * right page.  If XLogInsert decides that it can omit orignewitem due
+		 * to logging a full-page image of the left page, everything still
+		 * works out, since recovery only needs to log orignewitem for items
+		 * on the left page (just like the regular newitem-logged case).
 		 */
-		if (newitemonleft)
-			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+		if (newitemonleft || xlrec.postingoff != InvalidOffsetNumber)
+		{
+			if (xlrec.postingoff == InvalidOffsetNumber)
+			{
+				/* Must WAL-log newitem, since it's on left page */
+				Assert(newitemonleft);
+				Assert(orignewitem == NULL && nposting == NULL);
+				XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+			}
+			else
+			{
+				/* Must WAL-log orignewitem following posting list split */
+				Assert(newitemonleft || firstright == newitemoff);
+				Assert(ItemPointerCompare(&orignewitem->t_tid,
+										  &newitem->t_tid) < 0);
+				XLogRegisterBufData(0, (char *) orignewitem,
+									MAXALIGN(IndexTupleSize(orignewitem)));
+			}
+		}
 
 		/* Log the left page's new high key */
 		itemid = PageGetItemId(origpage, P_HIKEY);
@@ -1834,7 +2096,7 @@ _bt_insert_parent(Relation rel,
 
 		/* Recursively insert into the parent */
 		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
-					   new_item, stack->bts_offset + 1,
+					   new_item, stack->bts_offset + 1, 0,
 					   is_only);
 
 		/* be tidy */
@@ -2190,6 +2452,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 		md.fastlevel = metad->btm_level;
 		md.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 		md.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+		md.btm_safededup = metad->btm_safededup;
 
 		XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -2304,6 +2567,6 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * Note: if we didn't find any LP_DEAD items, then the page's
 	 * BTP_HAS_GARBAGE hint bit is falsely set.  We do not bother expending a
 	 * separate write to clear it, however.  We will clear it when we split
-	 * the page.
+	 * the page (or when deduplication runs).
 	 */
 }
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 268f869a36..ca25e856e7 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -42,12 +43,18 @@ static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
 								   BlockNumber *target, BlockNumber *rightsib);
 static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
 							   TransactionId latestRemovedXid);
+static TransactionId _bt_compute_xid_horizon_for_tuples(Relation rel,
+														Relation heapRel,
+														Buffer buf,
+														OffsetNumber *itemnos,
+														int nitems);
 
 /*
  *	_bt_initmetapage() -- Fill a page buffer with a correct metapage image
  */
 void
-_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
+_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+				 bool safededup)
 {
 	BTMetaPageData *metad;
 	BTPageOpaque metaopaque;
@@ -63,6 +70,7 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
 	metad->btm_fastlevel = level;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+	metad->btm_safededup = safededup;
 
 	metaopaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	metaopaque->btpo_flags = BTP_META;
@@ -102,6 +110,7 @@ _bt_upgrademetapage(Page page)
 	metad->btm_version = BTREE_NOVAC_VERSION;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+	metad->btm_safededup = false;
 
 	/* Adjust pd_lower (see _bt_initmetapage() for details) */
 	((PageHeader) page)->pd_lower =
@@ -213,6 +222,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 		md.fastlevel = metad->btm_fastlevel;
 		md.oldest_btpo_xact = oldestBtpoXact;
 		md.last_cleanup_num_heap_tuples = numHeapTuples;
+		md.btm_safededup = metad->btm_safededup;
 
 		XLogRegisterBufData(0, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -394,6 +404,7 @@ _bt_getroot(Relation rel, int access)
 			md.fastlevel = 0;
 			md.oldest_btpo_xact = InvalidTransactionId;
 			md.last_cleanup_num_heap_tuples = -1.0;
+			md.btm_safededup = metad->btm_safededup;
 
 			XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -683,6 +694,59 @@ _bt_heapkeyspace(Relation rel)
 	return metad->btm_version > BTREE_NOVAC_VERSION;
 }
 
+/*
+ *	_bt_safededup() -- can deduplication safely be used by index?
+ *
+ * Uses field from index relation's metapage/cached metapage.
+ */
+bool
+_bt_safededup(Relation rel)
+{
+	BTMetaPageData *metad;
+
+	if (rel->rd_amcache == NULL)
+	{
+		Buffer		metabuf;
+
+		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+		metad = _bt_getmeta(rel, metabuf);
+
+		/*
+		 * If there's no root page yet, _bt_getroot() doesn't expect a cache
+		 * to be made, so just stop here.  (XXX perhaps _bt_getroot() should
+		 * be changed to allow this case.)
+		 *
+		 * FIXME: Think some more about pg_upgrade'd !heapkeyspace indexes
+		 * here, and the need for a version bump to go with new metapage
+		 * field.  I think that we may need to bump the major version because
+		 * even v4 indexes (those built on Postgres 12) will have garbage in
+		 * the new safedup field.  Creating a v5 would mean "new field can be
+		 * trusted to not be garbage".
+		 */
+		if (metad->btm_root == P_NONE)
+		{
+			_bt_relbuf(rel, metabuf);
+			return metad->btm_safededup;;
+		}
+
+		/* Cache the metapage data for next time */
+		rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
+											 sizeof(BTMetaPageData));
+		memcpy(rel->rd_amcache, metad, sizeof(BTMetaPageData));
+		_bt_relbuf(rel, metabuf);
+	}
+
+	/* Get cached page */
+	metad = (BTMetaPageData *) rel->rd_amcache;
+	/* We shouldn't have cached it if any of these fail */
+	Assert(metad->btm_magic == BTREE_MAGIC);
+	Assert(metad->btm_version >= BTREE_MIN_VERSION);
+	Assert(metad->btm_version <= BTREE_VERSION);
+	Assert(metad->btm_fastroot != P_NONE);
+
+	return metad->btm_safededup;
+}
+
 /*
  *	_bt_checkpage() -- Verify that a freshly-read page looks sane.
  */
@@ -983,14 +1047,52 @@ _bt_page_recyclable(Page page)
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
 					OffsetNumber *itemnos, int nitems,
+					OffsetNumber *updateitemnos,
+					IndexTuple *updated, int nupdatable,
 					BlockNumber lastBlockVacuumed)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	Size		itemsz;
+	Size		updated_sz = 0;
+	char	   *updated_buf = NULL;
+
+	/* XLOG stuff, buffer for updateds */
+	if (nupdatable > 0 && RelationNeedsWAL(rel))
+	{
+		Size		offset = 0;
+
+		for (int i = 0; i < nupdatable; i++)
+			updated_sz += MAXALIGN(IndexTupleSize(updated[i]));
+
+		updated_buf = palloc(updated_sz);
+		for (int i = 0; i < nupdatable; i++)
+		{
+			itemsz = IndexTupleSize(updated[i]);
+			memcpy(updated_buf + offset, (char *) updated[i], itemsz);
+			offset += MAXALIGN(itemsz);
+		}
+		Assert(offset == updated_sz);
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
+	/* Handle posting tuples here */
+	for (int i = 0; i < nupdatable; i++)
+	{
+		/* At first, delete the old tuple. */
+		PageIndexTupleDelete(page, updateitemnos[i]);
+
+		itemsz = IndexTupleSize(updated[i]);
+		itemsz = MAXALIGN(itemsz);
+
+		/* Add tuple with updated ItemPointers to the page. */
+		if (PageAddItem(page, (Item) updated[i], itemsz, updateitemnos[i],
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to rewrite posting list item in index while doing vacuum");
+	}
+
 	/* Fix the page */
 	if (nitems > 0)
 		PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1122,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+		xlrec_vacuum.nupdated = nupdatable;
+		xlrec_vacuum.ndeleted = nitems;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1137,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		if (nitems > 0)
 			XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
 
+		/*
+		 * Here we should save offnums and updated tuples themselves. It's
+		 * important to restore them in correct order. At first, we must
+		 * handle updated tuples and only after that other deleted items.
+		 */
+		if (nupdatable > 0)
+		{
+			Assert(updated_buf != NULL);
+			XLogRegisterBufData(0, (char *) updateitemnos,
+								nupdatable * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, updated_buf, updated_sz);
+		}
+
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
 		PageSetLSN(page, recptr);
@@ -1041,6 +1158,91 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	END_CRIT_SECTION();
 }
 
+/*
+ * Get the latestRemovedXid from the table entries pointed at by the index
+ * tuples being deleted.
+ *
+ * This is a version of index_compute_xid_horizon_for_tuples() specialized to
+ * nbtree, which can handle posting lists.
+ */
+static TransactionId
+_bt_compute_xid_horizon_for_tuples(Relation rel, Relation heapRel,
+								   Buffer buf, OffsetNumber *itemnos,
+								   int nitems)
+{
+	ItemPointer htids;
+	TransactionId latestRemovedXid = InvalidTransactionId;
+	Page		page = BufferGetPage(buf);
+	int			arraynitems;
+	int			finalnitems;
+
+	/*
+	 * Initial size of array can fit everything when it turns out that are no
+	 * posting lists
+	 */
+	arraynitems = nitems;
+	htids = (ItemPointer) palloc(sizeof(ItemPointerData) * arraynitems);
+
+	finalnitems = 0;
+	/* identify what the index tuples about to be deleted point to */
+	for (int i = 0; i < nitems; i++)
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+
+		itemid = PageGetItemId(page, itemnos[i]);
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(ItemIdIsDead(itemid));
+
+		if (!BTreeTupleIsPosting(itup))
+		{
+			/* Make sure that we have space for additional heap TID */
+			if (finalnitems + 1 > arraynitems)
+			{
+				arraynitems = arraynitems * 2;
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			Assert(ItemPointerIsValid(&itup->t_tid));
+			ItemPointerCopy(&itup->t_tid, &htids[finalnitems]);
+			finalnitems++;
+		}
+		else
+		{
+			int			nposting = BTreeTupleGetNPosting(itup);
+
+			/* Make sure that we have space for additional heap TIDs */
+			if (finalnitems + nposting > arraynitems)
+			{
+				arraynitems = Max(arraynitems * 2, finalnitems + nposting);
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			for (int j = 0; j < nposting; j++)
+			{
+				ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+				Assert(ItemPointerIsValid(htid));
+				ItemPointerCopy(htid, &htids[finalnitems]);
+				finalnitems++;
+			}
+		}
+	}
+
+	Assert(finalnitems >= nitems);
+
+	/* determine the actual xid horizon */
+	latestRemovedXid =
+		table_compute_xid_horizon_for_tuples(heapRel, htids, finalnitems);
+
+	pfree(htids);
+
+	return latestRemovedXid;
+}
+
 /*
  * Delete item(s) from a btree page during single-page cleanup.
  *
@@ -1067,8 +1269,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 
 	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
 		latestRemovedXid =
-			index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
-												 itemnos, nitems);
+			_bt_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+											   itemnos, nitems);
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
@@ -2066,6 +2268,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 			xlmeta.fastlevel = metad->btm_fastlevel;
 			xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 			xlmeta.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+			xlmeta.btm_safededup = metad->btm_safededup;
 
 			XLogRegisterBufData(4, (char *) &xlmeta, sizeof(xl_btree_metadata));
 			xlinfo = XLOG_BTREE_UNLINK_PAGE_META;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd5289ad..2cdc3d499f 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static ItemPointer btreevacuumposting(BTVacState *vstate, IndexTuple itup,
+									  int *nremaining);
 
 
 /*
@@ -160,7 +162,7 @@ btbuildempty(Relation index)
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, P_NONE, 0);
+	_bt_initmetapage(metapage, P_NONE, 0, _bt_opclasses_support_dedup(index));
 
 	/*
 	 * Write the page and log it.  It might seem that an immediate sync would
@@ -263,8 +265,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 				 */
 				if (so->killedItems == NULL)
 					so->killedItems = (int *)
-						palloc(MaxIndexTuplesPerPage * sizeof(int));
-				if (so->numKilled < MaxIndexTuplesPerPage)
+						palloc(MaxBTreeIndexTuplesPerPage * sizeof(int));
+				if (so->numKilled < MaxBTreeIndexTuplesPerPage)
 					so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 			}
 
@@ -816,7 +818,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 	}
 	else
 	{
-		StdRdOptions *relopts;
+		BtreeOptions *relopts;
 		float8		cleanup_scale_factor;
 		float8		prev_num_heap_tuples;
 
@@ -827,7 +829,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 		 * tuples exceeds vacuum_cleanup_index_scale_factor fraction of
 		 * original tuples count.
 		 */
-		relopts = (StdRdOptions *) info->index->rd_options;
+		relopts = (BtreeOptions *) info->index->rd_options;
 		cleanup_scale_factor = (relopts &&
 								relopts->vacuum_cleanup_index_scale_factor >= 0)
 			? relopts->vacuum_cleanup_index_scale_factor
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 								 RBM_NORMAL, info->strategy);
 		LockBufferForCleanup(buf);
 		_bt_checkpage(rel, buf);
-		_bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+		_bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+							vstate.lastBlockVacuumed);
 		_bt_relbuf(rel, buf);
 	}
 
@@ -1188,8 +1191,17 @@ restart:
 	}
 	else if (P_ISLEAF(opaque))
 	{
+		/* Deletable item state */
 		OffsetNumber deletable[MaxOffsetNumber];
 		int			ndeletable;
+		int			nhtidsdead;
+		int			nhtidslive;
+
+		/* Updatable item state (for posting lists) */
+		IndexTuple	updated[MaxOffsetNumber];
+		OffsetNumber updatable[MaxOffsetNumber];
+		int			nupdatable;
+
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
@@ -1229,6 +1241,10 @@ restart:
 		 * callback function.
 		 */
 		ndeletable = 0;
+		nupdatable = 0;
+		/* Maintain stats counters for index tuple versions/heap TIDs */
+		nhtidsdead = 0;
+		nhtidslive = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
 		if (callback)
@@ -1238,11 +1254,9 @@ restart:
 				 offnum = OffsetNumberNext(offnum))
 			{
 				IndexTuple	itup;
-				ItemPointer htup;
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
 				/*
 				 * During Hot Standby we currently assume that
@@ -1265,8 +1279,71 @@ restart:
 				 * applies to *any* type of index that marks index tuples as
 				 * killed.
 				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Regular tuple, standard heap TID representation */
+					ItemPointer htid = &(itup->t_tid);
+
+					if (callback(htid, callback_state))
+					{
+						deletable[ndeletable++] = offnum;
+						nhtidsdead++;
+					}
+					else
+						nhtidslive++;
+				}
+				else
+				{
+					ItemPointer newhtids;
+					int			nremaining;
+
+					/*
+					 * Posting list tuple, a physical tuple that represents
+					 * two or more logical tuples, any of which could be an
+					 * index row version that must be removed
+					 */
+					newhtids = btreevacuumposting(vstate, itup, &nremaining);
+					if (newhtids == NULL)
+					{
+						/*
+						 * All TIDs/logical tuples from the posting tuple
+						 * remain, so no update or delete required
+						 */
+						Assert(nremaining == BTreeTupleGetNPosting(itup));
+					}
+					else if (nremaining > 0)
+					{
+						IndexTuple	updatedtuple;
+
+						/*
+						 * Form new tuple that contains only remaining TIDs.
+						 * Remember this tuple and the offset of the old tuple
+						 * for when we update it in place
+						 */
+						Assert(nremaining < BTreeTupleGetNPosting(itup));
+						updatedtuple = _bt_form_posting(itup, newhtids,
+														nremaining);
+						updated[nupdatable] = updatedtuple;
+						updatable[nupdatable++] = offnum;
+						nhtidsdead += BTreeTupleGetNPosting(itup) - nremaining;
+						pfree(newhtids);
+					}
+					else
+					{
+						/*
+						 * All TIDs/logical tuples from the posting list must
+						 * be deleted.  We'll delete the physical tuple
+						 * completely.
+						 */
+						deletable[ndeletable++] = offnum;
+						nhtidsdead += BTreeTupleGetNPosting(itup);
+
+						/* Free empty array of live items */
+						pfree(newhtids);
+					}
+
+					nhtidslive += nremaining;
+				}
 			}
 		}
 
@@ -1274,7 +1351,7 @@ restart:
 		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
 		 * call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nupdatable > 0)
 		{
 			/*
 			 * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1290,7 +1367,8 @@ restart:
 			 * doesn't seem worth the amount of bookkeeping it'd take to avoid
 			 * that.
 			 */
-			_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+			_bt_delitems_vacuum(rel, buf, deletable, ndeletable, updatable,
+								updated, nupdatable,
 								vstate->lastBlockVacuumed);
 
 			/*
@@ -1300,7 +1378,7 @@ restart:
 			if (blkno > vstate->lastBlockVacuumed)
 				vstate->lastBlockVacuumed = blkno;
 
-			stats->tuples_removed += ndeletable;
+			stats->tuples_removed += nhtidsdead;
 			/* must recompute maxoff */
 			maxoff = PageGetMaxOffsetNumber(page);
 		}
@@ -1315,6 +1393,7 @@ restart:
 			 * We treat this like a hint-bit update because there's no need to
 			 * WAL-log it.
 			 */
+			Assert(nhtidsdead == 0);
 			if (vstate->cycleid != 0 &&
 				opaque->btpo_cycleid == vstate->cycleid)
 			{
@@ -1324,15 +1403,16 @@ restart:
 		}
 
 		/*
-		 * If it's now empty, try to delete; else count the live tuples. We
-		 * don't delete when recursing, though, to avoid putting entries into
+		 * If it's now empty, try to delete; else count the live tuples (live
+		 * heap TIDs in posting lists are counted as live tuples).  We don't
+		 * delete when recursing, though, to avoid putting entries into
 		 * freePages out-of-order (doesn't seem worth any extra code to handle
 		 * the case).
 		 */
 		if (minoff > maxoff)
 			delete_now = (blkno == orig_blkno);
 		else
-			stats->num_index_tuples += maxoff - minoff + 1;
+			stats->num_index_tuples += nhtidslive;
 	}
 
 	if (delete_now)
@@ -1375,6 +1455,68 @@ restart:
 	}
 }
 
+/*
+ * btreevacuumposting() -- determines which logical tuples must remain when
+ * VACUUMing a posting list tuple.
+ *
+ * Returns new palloc'd array of item pointers needed to build replacement
+ * posting list without the index row versions that are to be deleted.
+ *
+ * Note that returned array is NULL in the common case where there is nothing
+ * to delete in caller's posting list tuple.  The number of TIDs that should
+ * remain in the posting list tuple is set for caller in *nremaining.  This is
+ * also the size of the returned array (though only when array isn't just
+ * NULL).
+ */
+static ItemPointer
+btreevacuumposting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+	int			live = 0;
+	int			nitem = BTreeTupleGetNPosting(itup);
+	ItemPointer tmpitems = NULL,
+				items = BTreeTupleGetPosting(itup);
+
+	Assert(BTreeTupleIsPosting(itup));
+
+	/*
+	 * Check each tuple in the posting list.  Save live tuples into tmpitems,
+	 * though try to avoid memory allocation as an optimization.
+	 */
+	for (int i = 0; i < nitem; i++)
+	{
+		if (!vstate->callback(items + i, vstate->callback_state))
+		{
+			/*
+			 * Live heap TID.
+			 *
+			 * Only save live TID when we know that we're going to have to
+			 * kill at least one TID, and have already allocated memory.
+			 */
+			if (tmpitems)
+				tmpitems[live] = items[i];
+			live++;
+		}
+
+		/* Dead heap TID */
+		else if (tmpitems == NULL)
+		{
+			/*
+			 * Turns out we need to delete one or more dead heap TIDs, so
+			 * start maintaining an array of live TIDs for caller to
+			 * reconstruct smaller replacement posting list tuple
+			 */
+			tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+			/* Copy live heap TIDs from previous loop iterations */
+			if (live > 0)
+				memcpy(tmpitems, items, sizeof(ItemPointerData) * live);
+		}
+	}
+
+	*nremaining = live;
+	return tmpitems;
+}
+
 /*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8e512461a0..23621cdd37 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int	_bt_binsrch_posting(BTScanInsert key, Page page,
+								OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static void _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+								  OffsetNumber offnum, ItemPointer heapTid,
+								  IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+									   OffsetNumber offnum,
+									   ItemPointer heapTid);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -434,7 +442,10 @@ _bt_binsrch(Relation rel,
  * low) makes bounds invalid.
  *
  * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
  */
 OffsetNumber
 _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +464,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 	Assert(P_ISLEAF(opaque));
 	Assert(!key->nextkey);
+	Assert(insertstate->postingoff == 0);
 
 	if (!insertstate->bounds_valid)
 	{
@@ -509,6 +521,16 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 			if (result != 0)
 				stricthigh = high;
 		}
+
+		/*
+		 * If tuple at offset located by binary search is a posting list whose
+		 * TID range overlaps with caller's scantid, perform posting list
+		 * binary search to set postingoff for caller.  Caller must split the
+		 * posting list when postingoff is set.  This should happen
+		 * infrequently.
+		 */
+		if (unlikely(result == 0 && key->scantid != NULL))
+			insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
 	}
 
 	/*
@@ -528,6 +550,68 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	return low;
 }
 
+/*----------
+ *	_bt_binsrch_posting() -- posting list binary search.
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+	IndexTuple	itup;
+	ItemId		itemid;
+	int			low,
+				high,
+				mid,
+				res;
+
+	/*
+	 * If this isn't a posting tuple, then the index must be corrupt (if it is
+	 * an ordinary non-pivot tuple then there must be an existing tuple with a
+	 * heap TID that equals inserter's new heap TID/scantid).  Defensively
+	 * check that tuple is a posting list tuple whose posting list range
+	 * includes caller's scantid.
+	 *
+	 * (This is also needed because contrib/amcheck's rootdescend option needs
+	 * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+	 */
+	itemid = PageGetItemId(page, offnum);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+	if (!BTreeTupleIsPosting(itup))
+		return 0;
+
+	/*
+	 * In the unlikely event that posting list tuple has LP_DEAD bit set,
+	 * signal to caller that it should kill the item and restart its binary
+	 * search.
+	 */
+	if (ItemIdIsDead(itemid))
+		return -1;
+
+	/* "high" is past end of posting list for loop invariant */
+	low = 0;
+	high = BTreeTupleGetNPosting(itup);
+	Assert(high >= 2);
+
+	while (high > low)
+	{
+		mid = low + ((high - low) / 2);
+		res = ItemPointerCompare(key->scantid,
+								 BTreeTupleGetPostingN(itup, mid));
+
+		if (res > 0)
+			low = mid + 1;
+		else if (res < 0)
+			high = mid;
+		else
+			return mid;
+	}
+
+	/* Exact match not found */
+	return low;
+}
+
 /*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
@@ -537,9 +621,18 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
  *			<0 if scankey < tuple at offnum;
  *			 0 if scankey == tuple at offnum;
  *			>0 if scankey > tuple at offnum.
- *		NULLs in the keys are treated as sortable values.  Therefore
- *		"equality" does not necessarily mean that the item should be
- *		returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values.  Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key.  Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid.  There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
+ *
+ * It is generally guaranteed that any possible scankey with scantid set
+ * will have zero or one tuples in the index that are considered equal
+ * here.
  *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
@@ -563,6 +656,7 @@ _bt_compare(Relation rel,
 	ScanKey		scankey;
 	int			ncmpkey;
 	int			ntupatts;
+	int32		result;
 
 	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +691,6 @@ _bt_compare(Relation rel,
 	{
 		Datum		datum;
 		bool		isNull;
-		int32		result;
 
 		datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
 
@@ -713,8 +806,25 @@ _bt_compare(Relation rel,
 	if (heapTid == NULL)
 		return 1;
 
+	/*
+	 * scankey must be treated as equal to a posting list tuple if its scantid
+	 * value falls within the range of the posting list.  In all other cases
+	 * there can only be a single heap TID value, which is compared directly
+	 * as a simple scalar value.
+	 */
 	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	return ItemPointerCompare(key->scantid, heapTid);
+	result = ItemPointerCompare(key->scantid, heapTid);
+	if (!BTreeTupleIsPosting(itup) || result <= 0)
+		return result;
+	else
+	{
+		result = ItemPointerCompare(key->scantid,
+									BTreeTupleGetMaxHeapTID(itup));
+		if (result > 0)
+			return 1;
+	}
+
+	return 0;
 }
 
 /*
@@ -1230,6 +1340,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 
 	/* Initialize remaining insertion scan key fields */
 	inskey.heapkeyspace = _bt_heapkeyspace(rel);
+	inskey.safededup = false;	/* unused */
 	inskey.anynullkeys = false; /* unused */
 	inskey.nextkey = nextkey;
 	inskey.pivotsearch = false;
@@ -1451,6 +1562,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	/* initialize tuple workspace to empty */
 	so->currPos.nextTupleOffset = 0;
+	so->currPos.postingTupleOffset = 0;
 
 	/*
 	 * Now that the current page has been made consistent, the macro should be
@@ -1485,8 +1597,29 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				if (!BTreeTupleIsPosting(itup))
+				{
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+				else
+				{
+					/*
+					 * Setup state to return posting list, and save first
+					 * "logical" tuple
+					 */
+					_bt_setuppostingitems(so, itemIndex, offnum,
+										  BTreeTupleGetPostingN(itup, 0),
+										  itup);
+					itemIndex++;
+					/* Save additional posting list "logical" tuples */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i));
+						itemIndex++;
+					}
+				}
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1519,7 +1652,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxBTreeIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1527,7 +1660,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxBTreeIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1569,8 +1702,36 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (passes_quals && tuple_alive)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				if (!BTreeTupleIsPosting(itup))
+				{
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+				else
+				{
+					int			i = BTreeTupleGetNPosting(itup) - 1;
+
+					/*
+					 * Setup state to return posting list, and save last
+					 * "logical" tuple from posting list (since it's the first
+					 * that will be returned to scan).
+					 */
+					itemIndex--;
+					_bt_setuppostingitems(so, itemIndex, offnum,
+										  BTreeTupleGetPostingN(itup, i--),
+										  itup);
+
+					/*
+					 * Return posting list "logical" tuples -- do this in
+					 * descending order, to match overall scan order
+					 */
+					for (; i >= 0; i--)
+					{
+						itemIndex--;
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i));
+					}
+				}
 			}
 			if (!continuescan)
 			{
@@ -1584,8 +1745,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxBTreeIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxBTreeIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1759,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1610,6 +1773,64 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 	}
 }
 
+/*
+ * Setup state to save posting items from a single posting list tuple.  Saves
+ * the logical tuple that will be returned to scan first in passing.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for logical tuple
+ * that is returned to scan first.  Second or subsequent heap TID for posting
+ * list should be saved by calling _bt_savepostingitem().
+ */
+static void
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					  ItemPointer heapTid, IndexTuple itup)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		/* Save base IndexTuple (truncate posting list) */
+		IndexTuple	base;
+		Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+		itupsz = MAXALIGN(itupsz);
+		currItem->tupleOffset = so->currPos.nextTupleOffset;
+		base = (IndexTuple) (so->currTuples + so->currPos.nextTupleOffset);
+		memcpy(base, itup, itupsz);
+		/* Defensively reduce work area index tuple header size */
+		base->t_info &= ~INDEX_SIZE_MASK;
+		base->t_info |= itupsz;
+		so->currPos.nextTupleOffset += itupsz;
+		so->currPos.postingTupleOffset = currItem->tupleOffset;
+	}
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer heapTid)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+
+	/*
+	 * Have index-only scans return the same base IndexTuple for every logical
+	 * tuple that originates from the same posting list
+	 */
+	if (so->currTuples)
+		currItem->tupleOffset = so->currPos.postingTupleOffset;
+}
+
 /*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index b5f0857598..29cc49e4b9 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -243,6 +243,7 @@ typedef struct BTPageState
 	BlockNumber btps_blkno;		/* block # to write this page at */
 	IndexTuple	btps_lowkey;	/* page's strict lower bound pivot tuple */
 	OffsetNumber btps_lastoff;	/* last item offset loaded */
+	Size		btps_lastextra; /* last item's extra posting list space */
 	uint32		btps_level;		/* tree level (0 = leaf) */
 	Size		btps_full;		/* "full" if less than this much free space */
 	struct BTPageState *btps_next;	/* link to parent level, if any */
@@ -277,7 +278,10 @@ static void _bt_slideleft(Page page);
 static void _bt_sortaddtup(Page page, Size itemsize,
 						   IndexTuple itup, OffsetNumber itup_off);
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
-						 IndexTuple itup);
+						 IndexTuple itup, Size truncextra);
+static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
+										  BTPageState *state,
+										  BTDedupState *dstate);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
@@ -711,13 +715,14 @@ _bt_pagestate(BTWriteState *wstate, uint32 level)
 	state->btps_lowkey = NULL;
 	/* initialize lastoff so first item goes into P_FIRSTKEY */
 	state->btps_lastoff = P_HIKEY;
+	state->btps_lastextra = 0;
 	state->btps_level = level;
 	/* set "full" threshold based on level.  See notes at head of file. */
 	if (level > 0)
 		state->btps_full = (BLCKSZ * (100 - BTREE_NONLEAF_FILLFACTOR) / 100);
 	else
-		state->btps_full = RelationGetTargetPageFreeSpace(wstate->index,
-														  BTREE_DEFAULT_FILLFACTOR);
+		state->btps_full = BtreeGetTargetPageFreeSpace(wstate->index,
+													   BTREE_DEFAULT_FILLFACTOR);
 	/* no parent level, yet */
 	state->btps_next = NULL;
 
@@ -790,7 +795,8 @@ _bt_sortaddtup(Page page,
 }
 
 /*----------
- * Add an item to a disk page from the sort output.
+ * Add an item to a disk page from the sort output (or add a posting list
+ * item formed from the sort output).
  *
  * We must be careful to observe the page layout conventions of nbtsearch.c:
  * - rightmost pages start data items at P_HIKEY instead of at P_FIRSTKEY.
@@ -822,14 +828,27 @@ _bt_sortaddtup(Page page,
  * the truncated high key at offset 1.
  *
  * 'last' pointer indicates the last offset added to the page.
+ *
+ * 'truncextra' is the size of the posting list in itup, if any.  This
+ * information is stashed for the next call here, when we may benefit
+ * from considering the impact of truncating away the posting list on
+ * the page before deciding to finish the page off.  Posting lists are
+ * often relatively large, so it is worth going to the trouble of
+ * accounting for the saving from truncating away the posting list of
+ * the tuple that becomes the high key (that may be the only way to
+ * get close to target free space on the page).  Note that this is
+ * only used for the soft fillfactor-wise limit, not the critical hard
+ * limit.
  *----------
  */
 static void
-_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
+_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
+			 Size truncextra)
 {
 	Page		npage;
 	BlockNumber nblkno;
 	OffsetNumber last_off;
+	Size		last_truncextra;
 	Size		pgspc;
 	Size		itupsz;
 	bool		isleaf;
@@ -843,6 +862,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	npage = state->btps_page;
 	nblkno = state->btps_blkno;
 	last_off = state->btps_lastoff;
+	last_truncextra = state->btps_lastextra;
+	state->btps_lastextra = truncextra;
 
 	pgspc = PageGetFreeSpace(npage);
 	itupsz = IndexTupleSize(itup);
@@ -884,10 +905,10 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	 * page.  Disregard fillfactor and insert on "full" current page if we
 	 * don't have the minimum number of items yet.  (Note that we deliberately
 	 * assume that suffix truncation neither enlarges nor shrinks new high key
-	 * when applying soft limit.)
+	 * when applying soft limit, except when last tuple had a posting list.)
 	 */
 	if (pgspc < itupsz + (isleaf ? MAXALIGN(sizeof(ItemPointerData)) : 0) ||
-		(pgspc < state->btps_full && last_off > P_FIRSTKEY))
+		(pgspc + last_truncextra < state->btps_full && last_off > P_FIRSTKEY))
 	{
 		/*
 		 * Finish off the page and write it out.
@@ -945,11 +966,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * We don't try to bias our choice of split point to make it more
 			 * likely that _bt_truncate() can truncate away more attributes,
 			 * whereas the split point used within _bt_split() is chosen much
-			 * more delicately.  Suffix truncation is mostly useful because it
-			 * improves space utilization for workloads with random
-			 * insertions.  It doesn't seem worthwhile to add logic for
-			 * choosing a split point here for a benefit that is bound to be
-			 * much smaller.
+			 * more delicately.  On the other hand, non-unique index builds
+			 * usually deduplicate, which often results in every "physical"
+			 * tuple on the page having distinct key values.  When that
+			 * happens, _bt_truncate() will never need to include a heap TID
+			 * in the new high key.
 			 *
 			 * Overwrite the old item with new truncated high key directly.
 			 * oitup is already located at the physical beginning of tuple
@@ -984,7 +1005,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(BTreeTupleGetNAtts(state->btps_lowkey, wstate->index) == 0 ||
 			   !P_LEFTMOST((BTPageOpaque) PageGetSpecialPointer(opage)));
 		BTreeInnerTupleSetDownLink(state->btps_lowkey, oblkno);
-		_bt_buildadd(wstate, state->btps_next, state->btps_lowkey);
+		_bt_buildadd(wstate, state->btps_next, state->btps_lowkey, 0);
 		pfree(state->btps_lowkey);
 
 		/*
@@ -1046,6 +1067,47 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	state->btps_lastoff = last_off;
 }
 
+/*
+ * Finalize pending posting list tuple, and add it to the index.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * This is almost like _bt_dedup_finish_pending(), but it adds a new tuple
+ * using _bt_buildadd() and does not maintain the intervals array.
+ */
+static void
+_bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
+							  BTDedupState *dstate)
+{
+	IndexTuple	final;
+	Size		truncextra;
+
+	Assert(dstate->nitems > 0);
+	truncextra = 0;
+	if (dstate->nitems == 1)
+		final = dstate->base;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = _bt_form_posting(dstate->base,
+										dstate->htids,
+										dstate->nhtids);
+		final = postingtuple;
+		/* Determine size of posting list */
+		truncextra = IndexTupleSize(final) -
+			BTreeTupleGetPostingOffset(final);
+	}
+
+	_bt_buildadd(wstate, state, final, truncextra);
+
+	if (dstate->nitems > 1)
+		pfree(final);
+	/* Don't maintain  dedup_intervals array, or alltupsize */
+	dstate->nhtids = 0;
+	dstate->nitems = 0;
+}
+
 /*
  * Finish writing out the completed btree.
  */
@@ -1091,7 +1153,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 			Assert(BTreeTupleGetNAtts(s->btps_lowkey, wstate->index) == 0 ||
 				   !P_LEFTMOST(opaque));
 			BTreeInnerTupleSetDownLink(s->btps_lowkey, blkno);
-			_bt_buildadd(wstate, s->btps_next, s->btps_lowkey);
+			_bt_buildadd(wstate, s->btps_next, s->btps_lowkey, 0);
 			pfree(s->btps_lowkey);
 			s->btps_lowkey = NULL;
 		}
@@ -1112,7 +1174,8 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 	 * by filling in a valid magic number in the metapage.
 	 */
 	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, rootblkno, rootlevel);
+	_bt_initmetapage(metapage, rootblkno, rootlevel,
+					 wstate->inskey->safededup);
 	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
 }
 
@@ -1133,6 +1196,10 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		deduplicate;
+
+	deduplicate = wstate->inskey->safededup &&
+		BtreeGetDoDedupOption(wstate->index);
 
 	if (merge)
 	{
@@ -1229,12 +1296,12 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 
 			if (load1)
 			{
-				_bt_buildadd(wstate, state, itup);
+				_bt_buildadd(wstate, state, itup, 0);
 				itup = tuplesort_getindextuple(btspool->sortstate, true);
 			}
 			else
 			{
-				_bt_buildadd(wstate, state, itup2);
+				_bt_buildadd(wstate, state, itup2, 0);
 				itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
 			}
 
@@ -1244,9 +1311,113 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		}
 		pfree(sortKeys);
 	}
+	else if (deduplicate)
+	{
+		/* merge is unnecessary, deduplicate into posting lists */
+		BTDedupState *dstate;
+		IndexTuple	newbase;
+
+		dstate = (BTDedupState *) palloc(sizeof(BTDedupState));
+		dstate->maxitemsize = 0;	/* set later */
+		dstate->checkingunique = false; /* unused */
+		dstate->skippedbase = InvalidOffsetNumber;
+		dstate->newitem = NULL;
+		/* Metadata about current pending posting list */
+		dstate->htids = NULL;
+		dstate->nhtids = 0;
+		dstate->nitems = 0;
+		dstate->overlap = false;
+		dstate->alltupsize = 0; /* unused */
+		/* Metadata about based tuple of current pending posting list */
+		dstate->base = NULL;
+		dstate->baseoff = InvalidOffsetNumber;	/* unused */
+		dstate->basetupsize = 0;
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate,
+											   true)) != NULL)
+		{
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+			{
+				state = _bt_pagestate(wstate, 0);
+
+				/*
+				 * Limit size of posting list tuples to the size of the free
+				 * space we want to leave behind on the page, plus space for
+				 * final item's line pointer (but make sure that posting list
+				 * tuple size won't exceed the generic 1/3 of a page limit).
+				 *
+				 * This is more conservative than the approach taken in the
+				 * retail insert path, but it allows us to get most of the
+				 * space savings deduplication provides without noticeably
+				 * impacting how much free space is left behind on each leaf
+				 * page.
+				 */
+				dstate->maxitemsize =
+					Min(BTMaxItemSize(state->btps_page),
+						MAXALIGN_DOWN(state->btps_full) - sizeof(ItemIdData));
+				/* Minimum posting tuple size used here is arbitrary: */
+				dstate->maxitemsize = Max(dstate->maxitemsize, 100);
+				dstate->htids = palloc(dstate->maxitemsize);
+
+				/*
+				 * No previous/base tuple, since itup is the  first item
+				 * returned by the tuplesort -- use itup as base tuple of
+				 * first pending posting list for entire index build
+				 */
+				newbase = CopyIndexTuple(itup);
+				_bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+			}
+			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
+										 itup) > keysz &&
+					 _bt_dedup_save_htid(dstate, itup))
+			{
+				/*
+				 * Tuple is equal to base tuple of pending posting list, and
+				 * merging itup into pending posting list won't exceed the
+				 * maxitemsize limit.  Heap TID(s) for itup have been saved in
+				 * state.  The next iteration will also end up here if it's
+				 * possible to merge the next tuple into the same pending
+				 * posting list.
+				 */
+			}
+			else
+			{
+				/*
+				 * Tuple is not equal to pending posting list tuple, or
+				 * maxitemsize limit was reached
+				 */
+				_bt_sort_dedup_finish_pending(wstate, state, dstate);
+				/* Base tuple is always a copy */
+				pfree(dstate->base);
+
+				/* itup starts new pending posting list */
+				newbase = CopyIndexTuple(itup);
+				_bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+			}
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+
+		/*
+		 * Handle the last item (there must be a last item when the tuplesort
+		 * returned one or more tuples)
+		 */
+		if (state)
+		{
+			_bt_sort_dedup_finish_pending(wstate, state, dstate);
+			/* Base tuple is always a copy */
+			pfree(dstate->base);
+			pfree(dstate->htids);
+		}
+
+		pfree(dstate);
+	}
 	else
 	{
-		/* merge is unnecessary */
+		/* merging and deduplication are both unnecessary */
 		while ((itup = tuplesort_getindextuple(btspool->sortstate,
 											   true)) != NULL)
 		{
@@ -1254,7 +1425,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			if (state == NULL)
 				state = _bt_pagestate(wstate, 0);
 
-			_bt_buildadd(wstate, state, itup);
+			_bt_buildadd(wstate, state, itup, 0);
 
 			/* Report progress */
 			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index a04d4e25d6..7758d74101 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -167,7 +167,7 @@ _bt_findsplitloc(Relation rel,
 
 	/* Count up total space in data items before actually scanning 'em */
 	olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page);
-	leaffillfactor = RelationGetFillFactor(rel, BTREE_DEFAULT_FILLFACTOR);
+	leaffillfactor = BtreeGetFillFactor(rel, BTREE_DEFAULT_FILLFACTOR);
 
 	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
 	newitemsz += sizeof(ItemIdData);
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
 	state.minfirstrightsz = SIZE_MAX;
 	state.newitemoff = newitemoff;
 
+	/* newitem cannot be a posting list item */
+	Assert(!BTreeTupleIsPosting(newitem));
+
 	/*
 	 * maxsplits should never exceed maxoff because there will be at most as
 	 * many candidate split points as there are points _between_ tuples, once
@@ -459,17 +462,52 @@ _bt_recsplitloc(FindSplitData *state,
 	int16		leftfree,
 				rightfree;
 	Size		firstrightitemsz;
+	Size		postingsubhikey = 0;
 	bool		newitemisfirstonright;
 
 	/* Is the new item going to be the first item on the right page? */
 	newitemisfirstonright = (firstoldonright == state->newitemoff
 							 && !newitemonleft);
 
+	/*
+	 * FIXME: Accessing every single tuple like this adds cycles to cases that
+	 * cannot possibly benefit (i.e. cases where we know that there cannot be
+	 * posting lists).  Maybe we should add a way to not bother when we are
+	 * certain that this is the case.
+	 *
+	 * We could either have _bt_split() pass us a flag, or invent a page flag
+	 * that indicates that the page might have posting lists, as an
+	 * optimization.  There is no shortage of btpo_flags bits for stuff like
+	 * this.
+	 */
 	if (newitemisfirstonright)
+	{
 		firstrightitemsz = state->newitemsz;
+
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+			postingsubhikey = IndexTupleSize(state->newitem) -
+				BTreeTupleGetPostingOffset(state->newitem);
+	}
 	else
+	{
 		firstrightitemsz = firstoldonrightsz;
 
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf)
+		{
+			ItemId		itemid;
+			IndexTuple	newhighkey;
+
+			itemid = PageGetItemId(state->page, firstoldonright);
+			newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+			if (BTreeTupleIsPosting(newhighkey))
+				postingsubhikey = IndexTupleSize(newhighkey) -
+					BTreeTupleGetPostingOffset(newhighkey);
+		}
+	}
+
 	/* Account for all the old tuples */
 	leftfree = state->leftspace - olddataitemstoleft;
 	rightfree = state->rightspace -
@@ -492,9 +530,13 @@ _bt_recsplitloc(FindSplitData *state,
 	 * adding a heap TID to the left half's new high key when splitting at the
 	 * leaf level.  In practice the new high key will often be smaller and
 	 * will rarely be larger, but conservatively assume the worst case.
+	 * Truncation always truncates away any posting list that appears in the
+	 * first right tuple, though, so it's safe to subtract that overhead
+	 * (while still conservatively assuming that truncation might have to add
+	 * back a single heap TID using the pivot tuple heap TID representation).
 	 */
 	if (state->is_leaf)
-		leftfree -= (int16) (firstrightitemsz +
+		leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
 							 MAXALIGN(sizeof(ItemPointerData)));
 	else
 		leftfree -= (int16) firstrightitemsz;
@@ -691,7 +733,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
 	tup = (IndexTuple) PageGetItem(state->page, itemid);
 	/* Do cheaper test first */
-	if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+	if (BTreeTupleIsPosting(tup) ||
+		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
 	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 6a3008dd48..6fec8cb745 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -98,8 +98,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	indoption = rel->rd_indoption;
 	tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
 
-	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
 	/*
 	 * We'll execute search using scan key constructed on key columns.
 	 * Truncated attributes and non-key attributes are omitted from the final
@@ -108,12 +106,25 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	key = palloc(offsetof(BTScanInsertData, scankeys) +
 				 sizeof(ScanKeyData) * indnkeyatts);
 	key->heapkeyspace = itup == NULL || _bt_heapkeyspace(rel);
+	key->safededup = itup == NULL ? _bt_opclasses_support_dedup(rel) :
+		_bt_safededup(rel);
 	key->anynullkeys = false;	/* initial assumption */
 	key->nextkey = false;
 	key->pivotsearch = false;
+	key->scantid = NULL;
 	key->keysz = Min(indnkeyatts, tupnatts);
-	key->scantid = key->heapkeyspace && itup ?
-		BTreeTupleGetHeapTID(itup) : NULL;
+
+	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+	Assert(!itup || !BTreeTupleIsPosting(itup) || key->heapkeyspace);
+
+	/*
+	 * When caller passes a tuple with a heap TID, use it to set scantid. Note
+	 * that this handles posting list tuples by setting scantid to the lowest
+	 * heap TID in the posting list.
+	 */
+	if (itup && key->heapkeyspace)
+		key->scantid = BTreeTupleGetHeapTID(itup);
+
 	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
@@ -1373,6 +1384,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			continue;
 		}
 
@@ -1534,6 +1546,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			cmpresult = 0;
 			if (subkey->sk_flags & SK_ROW_END)
 				break;
@@ -1773,10 +1786,35 @@ _bt_killitems(IndexScanDesc scan)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
+			bool		killtuple = false;
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			if (BTreeTupleIsPosting(ituple))
 			{
-				/* found the item */
+				int			pi = i + 1;
+				int			nposting = BTreeTupleGetNPosting(ituple);
+				int			j;
+
+				for (j = 0; j < nposting; j++)
+				{
+					ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+					if (!ItemPointerEquals(item, &kitem->heapTid))
+						break;	/* out of posting list loop */
+
+					/* Read-ahead to later kitems */
+					if (pi < numKilled)
+						kitem = &so->currPos.items[so->killedItems[pi++]];
+				}
+
+				if (j == nposting)
+					killtuple = true;
+			}
+			else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+				killtuple = true;
+
+			if (killtuple)
+			{
+				/* found the item/all posting list items */
 				ItemIdMarkDead(iid);
 				killedsomething = true;
 				break;			/* out of inner search loop */
@@ -2014,7 +2052,31 @@ BTreeShmemInit(void)
 bytea *
 btoptions(Datum reloptions, bool validate)
 {
-	return default_reloptions(reloptions, validate, RELOPT_KIND_BTREE);
+	relopt_value *options;
+	BtreeOptions *rdopts;
+	int			numoptions;
+	static const relopt_parse_elt tab[] = {
+		{"fillfactor", RELOPT_TYPE_INT, offsetof(BtreeOptions, fillfactor)},
+		{"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
+		offsetof(BtreeOptions, vacuum_cleanup_index_scale_factor)},
+		{"deduplication", RELOPT_TYPE_BOOL,
+		offsetof(BtreeOptions, deduplication)}
+	};
+
+	options = parseRelOptions(reloptions, validate, RELOPT_KIND_BTREE,
+							  &numoptions);
+
+	/* if none set, we're done */
+	if (numoptions == 0)
+		return NULL;
+
+	rdopts = allocateReloptStruct(sizeof(BtreeOptions), options, numoptions);
+
+	fillRelOptions((void *) rdopts, sizeof(BtreeOptions), options, numoptions,
+				   validate, tab, lengthof(tab));
+
+	pfree(options);
+	return (bytea *) rdopts;
 }
 
 /*
@@ -2127,6 +2189,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(firstright))
+		{
+			BTreeTupleClearBtIsPosting(pivot);
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			if (keepnatts == natts)
+			{
+				/*
+				 * index_truncate_tuple() just returned a copy of the
+				 * original, so make sure that the size of the new pivot tuple
+				 * doesn't have posting list overhead
+				 */
+				pivot->t_info &= ~INDEX_SIZE_MASK;
+				pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+			}
+		}
+
+		Assert(!BTreeTupleIsPosting(pivot));
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2143,6 +2223,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * attribute to the new pivot tuple.
 		 */
 		Assert(natts != nkeyatts);
+		Assert(!BTreeTupleIsPosting(lastleft) &&
+			   !BTreeTupleIsPosting(firstright));
 		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
 		tidpivot = palloc0(newsize);
 		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2150,6 +2232,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		pfree(pivot);
 		pivot = tidpivot;
 	}
+	else if (BTreeTupleIsPosting(firstright))
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal.  We
+		 * can always truncate away a posting list, though.
+		 *
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+			MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+		pivot->t_info &= ~INDEX_SIZE_MASK;
+		pivot->t_info |= newsize;
+		BTreeTupleClearBtIsPosting(pivot);
+		BTreeTupleSetAltHeapTID(pivot);
+	}
 	else
 	{
 		/*
@@ -2157,7 +2257,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * It's necessary to add a heap TID attribute to the new pivot tuple.
 		 */
 		Assert(natts == nkeyatts);
-		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+		newsize = MAXALIGN(IndexTupleSize(firstright)) +
+			MAXALIGN(sizeof(ItemPointerData));
 		pivot = palloc0(newsize);
 		memcpy(pivot, firstright, IndexTupleSize(firstright));
 	}
@@ -2175,6 +2276,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * nbtree (e.g., there is no pg_attribute entry).
 	 */
 	Assert(itup_key->heapkeyspace);
+	Assert(!BTreeTupleIsPosting(pivot));
 	pivot->t_info &= ~INDEX_SIZE_MASK;
 	pivot->t_info |= newsize;
 
@@ -2187,7 +2289,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2198,9 +2300,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
+							  BTreeTupleGetHeapTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #else
 
 	/*
@@ -2213,7 +2318,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2222,7 +2327,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2303,15 +2409,22 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
  * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal (once detoasted).  Similarly, result may
- * differ from the _bt_keep_natts result when either tuple has TOASTed datums,
- * though this is barely possible in practice.
+ * unless they're bitwise equal after detoasting.
  *
  * These issues must be acceptable to callers, typically because they're only
  * concerned about making suffix truncation as effective as possible without
  * leaving excessive amounts of free space on either side of page split.
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts.
+ *
+ * When an index only uses opclasses where _bt_opclasses_support_dedup()
+ * report that deduplication is safe, this function is guaranteed to give the
+ * same result as _bt_keep_natts().
+ *
+ * FIXME: Actually invent the needed "equality-is-precise" opclass
+ * infrastructure.  See dedicated -hackers thread:
+ *
+ * https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2337,7 +2450,7 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 			break;
 
 		if (!isNull1 &&
-			!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+			!datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
 			break;
 
 		keepnatts++;
@@ -2389,22 +2502,30 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
+	/* !heapkeyspace indexes do not support deduplication */
+	if (!heapkeyspace && BTreeTupleIsPosting(itup))
+		return false;
+
+	/* INCLUDE indexes do not support deduplication */
+	if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+		return false;
+
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
 			/*
-			 * Non-pivot tuples currently never use alternative heap TID
-			 * representation -- even those within heapkeyspace indexes
+			 * Non-pivot tuple should never be explicitly marked as a pivot
+			 * tuple
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated.  (Note that tupnatts must have been
-			 * inferred, rather than coming from an explicit on-disk
-			 * representation.)
+			 * inferred, even with a posting list tuple, because only pivot
+			 * tuples store tupnatts directly.)
 			 */
 			return tupnatts == natts;
 		}
@@ -2448,12 +2569,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * non-zero, or when there is no explicit representation and the
 			 * tuple is evidently not a pre-pg_upgrade tuple.
 			 *
-			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
-			 * that to decide if the tuple is a pre-v11 tuple.
+			 * Prior to v11, downlinks always had P_HIKEY as their offset.
+			 * Accept that as an alternative indication of a valid
+			 * !heapkeyspace negative infinity tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
-				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+				ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
 		}
 		else
 		{
@@ -2479,7 +2600,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
+		return false;
+
+	/* Pivot tuple should not use posting list representation (redundant) */
+	if (BTreeTupleIsPosting(itup))
 		return false;
 
 	/*
@@ -2549,11 +2674,44 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					BTMaxItemSizeNoHeapTid(page),
 					RelationGetRelationName(rel)),
 			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
-					   ItemPointerGetBlockNumber(&newtup->t_tid),
-					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+					   ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
 					   RelationGetRelationName(heap)),
 			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
 					 "Consider a function index of an MD5 hash of the value, "
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Is it safe to perform deduplication for an index, given the opclasses and
+ * collations used?
+ *
+ * Returned value is stored in index metapage during index builds.
+ *
+ * Note: This does not account for pg_uggrade'd !heapkeyspace indexes
+ */
+bool
+_bt_opclasses_support_dedup(Relation index)
+{
+	/* INCLUDE indexes don't support deduplication */
+	if (IndexRelationGetNumberOfAttributes(index) !=
+		IndexRelationGetNumberOfKeyAttributes(index))
+		return false;
+
+	for (int i = 0; i < IndexRelationGetNumberOfKeyAttributes(index); i++)
+	{
+		Oid			opfamily = index->rd_opfamily[i];
+		Oid			collation = index->rd_indcollation[i];
+
+		/* TODO add adequate check of opclasses and collations */
+		elog(DEBUG4, "index %s column i %d opfamilyOid %u collationOid %u",
+			 RelationGetRelationName(index), i, opfamily, collation);
+
+		/* NUMERIC btree opfamily OID is 1988 */
+		if (opfamily == 1988)
+			return false;
+	}
+
+	return true;
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c1aa..27694246e2 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -21,8 +21,11 @@
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "storage/procarray.h"
+#include "utils/memutils.h"
 #include "miscadmin.h"
 
+static MemoryContext opCtx;		/* working memory for operations */
+
 /*
  * _bt_restore_page -- re-enter all the index tuples on a page
  *
@@ -111,6 +114,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
 	Assert(md->btm_version >= BTREE_NOVAC_VERSION);
 	md->btm_oldest_btpo_xact = xlrec->oldest_btpo_xact;
 	md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
+	md->btm_safededup = xlrec->btm_safededup;
 
 	pageop = (BTPageOpaque) PageGetSpecialPointer(metapg);
 	pageop->btpo_flags = BTP_META;
@@ -181,9 +185,45 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 
 		page = BufferGetPage(buffer);
 
-		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
-						false, false) == InvalidOffsetNumber)
-			elog(PANIC, "btree_xlog_insert: failed to add item");
+		if (xlrec->postingoff == InvalidOffsetNumber)
+		{
+			/* Simple retail insertion */
+			if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+		}
+		else
+		{
+			ItemId		itemid;
+			IndexTuple	oposting,
+						newitem,
+						nposting;
+
+			/*
+			 * A posting list split occurred during insertion.
+			 *
+			 * Use _bt_swap_posting() to repeat posting list split steps from
+			 * primary.  Note that newitem from WAL record is 'orignewitem',
+			 * not the final version of newitem that is actually inserted on
+			 * page.
+			 */
+			Assert(isleaf);
+			itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+			oposting = (IndexTuple) PageGetItem(page, itemid);
+
+			/* newitem must be mutable copy for _bt_swap_posting() */
+			newitem = CopyIndexTuple((IndexTuple) datapos);
+			nposting = _bt_swap_posting(newitem, oposting, xlrec->postingoff);
+
+			/* Replace existing posting list with post-split version */
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+			/* insert new item */
+			Assert(IndexTupleSize(newitem) == datalen);
+			if (PageAddItem(page, (Item) newitem, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add posting split new item");
+		}
 
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
@@ -265,20 +305,38 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 		BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
 		OffsetNumber off;
 		IndexTuple	newitem = NULL,
-					left_hikey = NULL;
+					left_hikey = NULL,
+					nposting = NULL;
 		Size		newitemsz = 0,
 					left_hikeysz = 0;
 		Page		newlpage;
-		OffsetNumber leftoff;
+		OffsetNumber leftoff,
+					replacepostingoff = InvalidOffsetNumber;
 
 		datapos = XLogRecGetBlockData(record, 0, &datalen);
 
-		if (onleft)
+		if (onleft || xlrec->postingoff != 0)
 		{
 			newitem = (IndexTuple) datapos;
 			newitemsz = MAXALIGN(IndexTupleSize(newitem));
 			datapos += newitemsz;
 			datalen -= newitemsz;
+
+			if (xlrec->postingoff != 0)
+			{
+				ItemId		itemid;
+				IndexTuple	oposting;
+
+				/* Posting list must be at offset number before new item's */
+				replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+				/* newitem must be mutable copy for _bt_swap_posting() */
+				newitem = CopyIndexTuple(newitem);
+				itemid = PageGetItemId(lpage, replacepostingoff);
+				oposting = (IndexTuple) PageGetItem(lpage, itemid);
+				nposting = _bt_swap_posting(newitem, oposting,
+											xlrec->postingoff);
+			}
 		}
 
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
@@ -304,8 +362,20 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 			Size		itemsz;
 			IndexTuple	item;
 
+			/* Add replacement posting list when required */
+			if (off == replacepostingoff)
+			{
+				Assert(onleft || xlrec->firstright == xlrec->newitemoff);
+				if (PageAddItem(newlpage, (Item) nposting,
+								MAXALIGN(IndexTupleSize(nposting)), leftoff,
+								false, false) == InvalidOffsetNumber)
+					elog(ERROR, "failed to add new posting list item to left page after split");
+				leftoff = OffsetNumberNext(leftoff);
+				continue;
+			}
+
 			/* add the new item if it was inserted on left page */
-			if (onleft && off == xlrec->newitemoff)
+			else if (onleft && off == xlrec->newitemoff)
 			{
 				if (PageAddItem(newlpage, (Item) newitem, newitemsz, leftoff,
 								false, false) == InvalidOffsetNumber)
@@ -379,6 +449,84 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 	}
 }
 
+static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		buf;
+	xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+
+	if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Initialize a temporary empty page and copy all the items to that in
+		 * item number order.
+		 */
+		Page		page = (Page) BufferGetPage(buf);
+		OffsetNumber offnum;
+		BTDedupState *state;
+
+		state = (BTDedupState *) palloc(sizeof(BTDedupState));
+
+		state->maxitemsize = BTMaxItemSize(page);
+		state->checkingunique = false;	/* unused */
+		state->skippedbase = InvalidOffsetNumber;
+		state->newitem = NULL;
+		/* Metadata about current pending posting list */
+		state->htids = NULL;
+		state->nhtids = 0;
+		state->nitems = 0;
+		state->alltupsize = 0;
+		state->overlap = false;
+		/* Metadata about based tuple of current pending posting list */
+		state->base = NULL;
+		state->baseoff = InvalidOffsetNumber;
+		state->basetupsize = 0;
+
+		/* Conservatively size array */
+		state->htids = palloc(state->maxitemsize);
+
+		/*
+		 * Iterate over tuples on the page belonging to the interval to
+		 * deduplicate them into a posting list.
+		 */
+		for (offnum = xlrec->baseoff;
+			 offnum < xlrec->baseoff + xlrec->nitems;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemid = PageGetItemId(page, offnum);
+			IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+			Assert(!ItemIdIsDead(itemid));
+
+			if (offnum == xlrec->baseoff)
+			{
+				/*
+				 * No previous/base tuple for first data item -- use first
+				 * data item as base tuple of first pending posting list
+				 */
+				_bt_dedup_start_pending(state, itup, offnum);
+			}
+			else
+			{
+				/* Heap TID(s) for itup will be saved in state */
+				if (!_bt_dedup_save_htid(state, itup))
+					elog(ERROR, "could not add heap tid to pending posting list");
+			}
+		}
+
+		Assert(state->nitems == xlrec->nitems);
+		/* Handle the last item */
+		_bt_dedup_finish_pending(buf, state, false);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buf);
+	}
+
+	if (BufferIsValid(buf))
+		UnlockReleaseBuffer(buf);
+}
+
 static void
 btree_xlog_vacuum(XLogReaderState *record)
 {
@@ -386,8 +534,8 @@ btree_xlog_vacuum(XLogReaderState *record)
 	Buffer		buffer;
 	Page		page;
 	BTPageOpaque opaque;
-#ifdef UNUSED
 	xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
 
 	/*
 	 * This section of code is thought to be no longer needed, after analysis
@@ -478,14 +626,34 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		if (len > 0)
 		{
-			OffsetNumber *unused;
-			OffsetNumber *unend;
+			if (xlrec->nupdated > 0)
+			{
+				OffsetNumber *updatedoffsets;
+				IndexTuple	updated;
+				Size		itemsz;
 
-			unused = (OffsetNumber *) ptr;
-			unend = (OffsetNumber *) ((char *) ptr + len);
+				updatedoffsets = (OffsetNumber *)
+					(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+				updated = (IndexTuple) ((char *) updatedoffsets +
+										xlrec->nupdated * sizeof(OffsetNumber));
 
-			if ((unend - unused) > 0)
-				PageIndexMultiDelete(page, unused, unend - unused);
+				/* Handle posting tuples */
+				for (int i = 0; i < xlrec->nupdated; i++)
+				{
+					PageIndexTupleDelete(page, updatedoffsets[i]);
+
+					itemsz = MAXALIGN(IndexTupleSize(updated));
+
+					if (PageAddItem(page, (Item) updated, itemsz, updatedoffsets[i],
+									false, false) == InvalidOffsetNumber)
+						elog(PANIC, "btree_xlog_vacuum: failed to add updated posting list item");
+
+					updated = (IndexTuple) ((char *) updated + itemsz);
+				}
+			}
+
+			if (xlrec->ndeleted)
+				PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 		}
 
 		/*
@@ -820,7 +988,9 @@ void
 btree_redo(XLogReaderState *record)
 {
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	MemoryContext oldCtx;
 
+	oldCtx = MemoryContextSwitchTo(opCtx);
 	switch (info)
 	{
 		case XLOG_BTREE_INSERT_LEAF:
@@ -838,6 +1008,9 @@ btree_redo(XLogReaderState *record)
 		case XLOG_BTREE_SPLIT_R:
 			btree_xlog_split(false, record);
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			btree_xlog_dedup(record);
+			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
 			break;
@@ -863,6 +1036,23 @@ btree_redo(XLogReaderState *record)
 		default:
 			elog(PANIC, "btree_redo: unknown op code %u", info);
 	}
+	MemoryContextSwitchTo(oldCtx);
+	MemoryContextReset(opCtx);
+}
+
+void
+btree_xlog_startup(void)
+{
+	opCtx = AllocSetContextCreate(CurrentMemoryContext,
+								  "Btree recovery temporary context",
+								  ALLOCSET_DEFAULT_SIZES);
+}
+
+void
+btree_xlog_cleanup(void)
+{
+	MemoryContextDelete(opCtx);
+	opCtx = NULL;
 }
 
 /*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 4ee6d04a68..1dde2da285 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -30,7 +30,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_insert *xlrec = (xl_btree_insert *) rec;
 
-				appendStringInfo(buf, "off %u", xlrec->offnum);
+				appendStringInfo(buf, "off %u; postingoff %u",
+								 xlrec->offnum, xlrec->postingoff);
 				break;
 			}
 		case XLOG_BTREE_SPLIT_L:
@@ -38,16 +39,30 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
-				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d",
-								 xlrec->level, xlrec->firstright, xlrec->newitemoff);
+				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, postingoff %d",
+								 xlrec->level,
+								 xlrec->firstright,
+								 xlrec->newitemoff,
+								 xlrec->postingoff);
+				break;
+			}
+		case XLOG_BTREE_DEDUP_PAGE:
+			{
+				xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+				appendStringInfo(buf, "baseoff %u; nitems %u",
+								 xlrec->baseoff,
+								 xlrec->nitems);
 				break;
 			}
 		case XLOG_BTREE_VACUUM:
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "lastBlockVacuumed %u",
-								 xlrec->lastBlockVacuumed);
+				appendStringInfo(buf, "lastBlockVacuumed %u; nupdated %u; ndeleted %u",
+								 xlrec->lastBlockVacuumed,
+								 xlrec->nupdated,
+								 xlrec->ndeleted);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
@@ -131,6 +146,9 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			id = "DEDUPLICATE";
+			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 2b1e3cda4a..bf4a27ab75 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1677,14 +1677,14 @@ psql_completion(const char *text, int start, int end)
 	/* ALTER INDEX <foo> SET|RESET ( */
 	else if (Matches("ALTER", "INDEX", MatchAny, "RESET", "("))
 		COMPLETE_WITH("fillfactor",
-					  "vacuum_cleanup_index_scale_factor",	/* BTREE */
+					  "vacuum_cleanup_index_scale_factor", "deduplication",	/* BTREE */
 					  "fastupdate", "gin_pending_list_limit",	/* GIN */
 					  "buffering",	/* GiST */
 					  "pages_per_range", "autosummarize"	/* BRIN */
 			);
 	else if (Matches("ALTER", "INDEX", MatchAny, "SET", "("))
 		COMPLETE_WITH("fillfactor =",
-					  "vacuum_cleanup_index_scale_factor =",	/* BTREE */
+					  "vacuum_cleanup_index_scale_factor =", "deduplication =",	/* BTREE */
 					  "fastupdate =", "gin_pending_list_limit =",	/* GIN */
 					  "buffering =",	/* GiST */
 					  "pages_per_range =", "autosummarize ="	/* BRIN */
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d678ed..ebbbae137a 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -145,6 +145,7 @@ static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 									  bool tupleIsAlive, void *checkstate);
 static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 									 IndexTuple itup);
+static inline IndexTuple bt_posting_logical_tuple(IndexTuple itup, int n);
 static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 											   OffsetNumber offset);
@@ -419,12 +420,13 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		/*
 		 * Size Bloom filter based on estimated number of tuples in index,
 		 * while conservatively assuming that each block must contain at least
-		 * MaxIndexTuplesPerPage / 5 non-pivot tuples.  (Non-leaf pages cannot
-		 * contain non-pivot tuples.  That's okay because they generally make
-		 * up no more than about 1% of all pages in the index.)
+		 * MaxBTreeIndexTuplesPerPage / 3 "logical" tuples.  heapallindexed
+		 * verification fingerprints posting list heap TIDs as plain non-pivot
+		 * tuples, complete with index keys.  This allows its heap scan to
+		 * behave as if posting lists do not exist.
 		 */
 		total_pages = RelationGetNumberOfBlocks(rel);
-		total_elems = Max(total_pages * (MaxIndexTuplesPerPage / 5),
+		total_elems = Max(total_pages * (MaxBTreeIndexTuplesPerPage / 3),
 						  (int64) state->rel->rd_rel->reltuples);
 		/* Random seed relies on backend srandom() call to avoid repetition */
 		seed = random();
@@ -924,6 +926,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer	scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -994,29 +997,73 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
 			char	   *itid,
 					   *htid;
+			ItemPointer tid = BTreeTupleGetHeapTID(itup);
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+							ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("could not find tuple using search from root page in index \"%s\"",
 							RelationGetRelationName(state->rel)),
-					 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+					 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
 										itid, htid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is actually a posting list, make sure posting list TIDs
+		 * are in order.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer		current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+				current = BTreeTupleGetPostingN(itup, i);
+
+				if (ItemPointerCompare(current, &last) <= 0)
+				{
+					char	   *itid,
+							   *htid;
+
+					itid = psprintf("(%u,%u)", state->targetblock, offset);
+					htid = psprintf("(%u,%u)",
+									ItemPointerGetBlockNumberNoCheck(current),
+									ItemPointerGetOffsetNumberNoCheck(current));
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+												itid, htid,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1074,12 +1121,32 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				/* Fingerprint all elements as distinct "logical" tuples */
+				for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					IndexTuple	logtuple;
+
+					logtuple = bt_posting_logical_tuple(itup, i);
+					norm = bt_normalize_tuple(state, logtuple);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != logtuple)
+						pfree(norm);
+					pfree(logtuple);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
@@ -1087,7 +1154,8 @@ bt_target_page_check(BtreeCheckState *state)
 		 *
 		 * If there is a high key (if this is not the rightmost page on its
 		 * entire level), check that high key actually is upper bound on all
-		 * page items.
+		 * page items.  If this is a posting list tuple, we'll need to set
+		 * scantid to be highest TID in posting list.
 		 *
 		 * We prefer to check all items against high key rather than checking
 		 * just the last and trusting that the operator class obeys the
@@ -1127,6 +1195,9 @@ bt_target_page_check(BtreeCheckState *state)
 		 * tuple. (See also: "Notes About Data Representation" in the nbtree
 		 * README.)
 		 */
+		scantid = skey->scantid;
+		if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+			skey->scantid = BTreeTupleGetMaxHeapTID(itup);
 		if (!P_RIGHTMOST(topaque) &&
 			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
 			  invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1221,7 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
+		skey->scantid = scantid;
 
 		/*
 		 * * Item order check *
@@ -1164,11 +1236,13 @@ bt_target_page_check(BtreeCheckState *state)
 					   *htid,
 					   *nitid,
 					   *nhtid;
+			ItemPointer tid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetHeapTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 			nitid = psprintf("(%u,%u)", state->targetblock,
 							 OffsetNumberNext(offset));
 
@@ -1177,9 +1251,11 @@ bt_target_page_check(BtreeCheckState *state)
 										  state->target,
 										  OffsetNumberNext(offset));
 			itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+			tid = BTreeTupleGetHeapTID(itup);
 			nhtid = psprintf("(%u,%u)",
-							 ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							 ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							 ItemPointerGetBlockNumberNoCheck(tid),
+							 ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1265,10 @@ bt_target_page_check(BtreeCheckState *state)
 										"higher index tid=%s (points to %s tid=%s) "
 										"page lsn=%X/%X.",
 										itid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										htid,
 										nitid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										nhtid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
@@ -1953,10 +2029,10 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
  * verification.  In particular, it won't try to normalize opclass-equal
  * datums with potentially distinct representations (e.g., btree/numeric_ops
  * index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though.  For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation.  Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
  */
 static IndexTuple
 bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -1969,6 +2045,9 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	IndexTuple	reformed;
 	int			i;
 
+	/* Caller should only pass "logical" non-pivot tuples here */
+	Assert(!BTreeTupleIsPosting(itup) && !BTreeTupleIsPivot(itup));
+
 	/* Easy case: It's immediately clear that tuple has no varlena datums */
 	if (!IndexTupleHasVarwidths(itup))
 		return itup;
@@ -2031,6 +2110,30 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	return reformed;
 }
 
+/*
+ * Produce palloc()'d "logical" tuple for nth posting list entry.
+ *
+ * In general, deduplication is not supposed to change the logical contents of
+ * an index.  Multiple logical index tuples are folded together into one
+ * physical posting list index tuple when convenient.
+ *
+ * heapallindexed verification must normalize-away this variation in
+ * representation by converting posting list tuples into two or more "logical"
+ * tuples.  Each logical tuple must be fingerprinted separately -- there must
+ * be one logical tuple for each corresponding Bloom filter probe during the
+ * heap scan.
+ *
+ * Note: Caller needs to call bt_normalize_tuple() with returned tuple.
+ */
+static inline IndexTuple
+bt_posting_logical_tuple(IndexTuple itup, int n)
+{
+	Assert(BTreeTupleIsPosting(itup));
+
+	/* Returns non-posting-list tuple */
+	return _bt_form_posting(itup, BTreeTupleGetPostingN(itup, n), 1);
+}
+
 /*
  * Search for itup in index, starting from fast root page.  itup must be a
  * non-pivot tuple.  This is only supported with heapkeyspace indexes, since
@@ -2087,6 +2190,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		insertstate.itup = itup;
 		insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 		insertstate.itup_key = key;
+		insertstate.postingoff = 0;
 		insertstate.bounds_valid = false;
 		insertstate.buf = lbuf;
 
@@ -2094,7 +2198,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		offnum = _bt_binsrch_insert(state->rel, &insertstate);
 		/* Compare first >= matching item on leaf page, if any */
 		page = BufferGetPage(lbuf);
+		/* Should match on first heap TID when tuple has a posting list */
 		if (offnum <= PageGetMaxOffsetNumber(page) &&
+			insertstate.postingoff <= 0 &&
 			_bt_compare(state->rel, key, page, offnum) == 0)
 			exists = true;
 		_bt_relbuf(state->rel, lbuf);
@@ -2548,26 +2654,29 @@ PageGetItemIdCareful(BtreeCheckState *state, BlockNumber block, Page page,
 }
 
 /*
- * BTreeTupleGetHeapTID() wrapper that lets caller enforce that a heap TID must
- * be present in cases where that is mandatory.
- *
- * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
- * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
- * It may become more useful in the future, when non-pivot tuples support their
- * own alternative INDEX_ALT_TID_MASK representation.
+ * BTreeTupleGetHeapTID() wrapper that enforces that a heap TID is present in
+ * cases where that is mandatory (i.e. for non-pivot tuples).
  */
 static inline ItemPointer
 BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
 							bool nonpivot)
 {
-	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	ItemPointer result;
 	BlockNumber targetblock = state->targetblock;
 
-	if (result == NULL && nonpivot)
+	Assert(state->heapkeyspace);
+
+	/*
+	 * Make sure that tuple type (pivot vs non-pivot) matches caller's
+	 * expectation
+	 */
+	if (BTreeTupleIsPivot(itup) == nonpivot)
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
 						targetblock, RelationGetRelationName(state->rel))));
 
+	result = BTreeTupleGetHeapTID(itup);
+
 	return result;
 }
-- 
2.17.1

v22-0001-Teach-datum_image_eq-about-cstring-datums.patchapplication/octet-stream; name=v22-0001-Teach-datum_image_eq-about-cstring-datums.patchDownload

From 2697341e50a43ee544f496189a6180eff9713a78 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 4 Nov 2019 09:07:13 -0800
Subject: [PATCH v22 1/3] Teach datum_image_eq() about cstring datums.

An upcoming patch to add deduplication to nbtree indexes needs to be
able to use datum_image_eq() as a drop-in replacement for opclass
equality in certain contexts.  This includes comparisons of TOASTable
datatypes such as text (at least when deterministic collations are in
use), and cstring datums in system catalog indexes.  cstring is used as
the storage type of "name" columns when indexed by nbtree, despite the
fact that cstring is a pseudo-type.

Discussion: https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
---
 src/backend/utils/adt/datum.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/adt/datum.c b/src/backend/utils/adt/datum.c
index 73703efe05..b20d0640ea 100644
--- a/src/backend/utils/adt/datum.c
+++ b/src/backend/utils/adt/datum.c
@@ -263,6 +263,8 @@ datumIsEqual(Datum value1, Datum value2, bool typByVal, int typLen)
 bool
 datum_image_eq(Datum value1, Datum value2, bool typByVal, int typLen)
 {
+	Size		len1,
+				len2;
 	bool		result = true;
 
 	if (typByVal)
@@ -277,9 +279,6 @@ datum_image_eq(Datum value1, Datum value2, bool typByVal, int typLen)
 	}
 	else if (typLen == -1)
 	{
-		Size		len1,
-					len2;
-
 		len1 = toast_raw_datum_size(value1);
 		len2 = toast_raw_datum_size(value2);
 		/* No need to de-toast if lengths don't match. */
@@ -304,6 +303,20 @@ datum_image_eq(Datum value1, Datum value2, bool typByVal, int typLen)
 				pfree(arg2val);
 		}
 	}
+	else if (typLen == -2)
+	{
+		char	   *s1,
+				   *s2;
+
+		/* Compare cstring datums */
+		s1 = DatumGetCString(value1);
+		s2 = DatumGetCString(value2);
+		len1 = strlen(s1) + 1;
+		len2 = strlen(s2) + 1;
+		if (len1 != len2)
+			return false;
+		result = (memcmp(s1, s2, len1) == 0);
+	}
 	else
 		elog(ERROR, "unexpected typLen: %d", typLen);
 
-- 
2.17.1

#103

Peter Geoghegan

pg@bowt.ie

about 6 years ago

In reply to: Peter Geoghegan (#102)

2 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Fri, Nov 8, 2019 at 10:35 AM Peter Geoghegan <pg@bowt.ie> wrote:

There is more bitrot, so I attach v22.

The patch has stopped applying once again, so I attach v23.

One reason for the bitrot is that I pushed preparatory commits,
including today's "Make _bt_keep_natts_fast() use datum_image_eq()"
commit. Good to get that out of the way.

Other changes:

* Decided to go back to turning deduplication on by default with
non-unique indexes, and off by default using unique indexes.

The unique index stuff was regressed enough with INSERT-heavy
workloads that I was put off, despite my initial enthusiasm for
enabling deduplication everywhere.

* Disabled deduplication in system catalog indexes by deeming it
generally unsafe.

I realized that it would be impossible to provide a way to disable
deduplication in system catalog indexes if it was enabled at all. The
reason for this is simple: in general, it's not possible to set
storage parameters for system catalog indexes.

While I think that deduplication should work with system catalog
indexes on general principle, this is about an existing limitation.
Deduplication in catalog indexes can be revisited if and when somebody
figures out a way to make storage parameters work with system catalog
indexes.

* Basic user documentation -- this still needs work, but the basic
shape is now in place. I think that we should outline how the feature
works by describing the internals, including details of the data
structures. This provides guidance to users on when they should
disable or enable the feature.

This is discussed in the existing chapter on B-Tree internals. This
felt natural because it's similar to how GIN explains its compression
related features -- the discussion of the storage parameters in the
CREATE INDEX page of the docs links to a description of GIN internals
from "66.4. Implementation [of GIN]".

* nbtdedup.c "single value" strategy stuff now considers the
contribution of the page high key when considering how to deduplicate
such that nbtsplitloc.c's "single value" strategy has a usable split
point that helps it to hit its target free space. Not a very important
detail. It's nice to be consistent with the corresponding code within
nbtsplitloc.c.

* Worked through all remaining XXX/TODO/FIXME comments, except one:
The one that talks about the need for opclass infrastructure to deal
with cases like btree/numeric_ops, or text with a nondeterministic
collation.

The user docs now reference the BITWISE opclass stuff that we're
discussing over on the other thread. That's the only really notable
open item now IMV.

--
Peter Geoghegan

Attachments:

v23-0002-DEBUG-Add-pageinspect-instrumentation.patchapplication/octet-stream; name=v23-0002-DEBUG-Add-pageinspect-instrumentation.patchDownload

From f30417cc9d917d85d5a36d64984bab0096034dc9 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v23 2/2] DEBUG: Add pageinspect instrumentation.

Have pageinspect display user-visible attribute values, heap TID, max
heap TID, and the number of TIDs in a tuple (can be > 1 in the case of
posting list tuples).  Also adds a column that shows whether or not the
LP_DEAD bit has been set.

This patch is not proposed for inclusion in PostgreSQL; it's included
for the convenience of reviewers.

The following query can be used with this hacked pageinspect, which
visualizes the internal pages:

"""

with recursive index_details as (
  select
    'my_test_index'::text idx
),
size_in_pages_index as (
  select
    (pg_relation_size(idx::regclass) / (2^13))::int4 size_pages
  from
    index_details
),
page_stats as (
  select
    index_details.*,
    stats.*
  from
    index_details,
    size_in_pages_index,
    lateral (select i from generate_series(1, size_pages - 1) i) series,
    lateral (select * from bt_page_stats(idx, i)) stats),
internal_page_stats as (
  select
    *
  from
    page_stats
  where
    type != 'l'),
meta_stats as (
  select
    *
  from
    index_details s,
    lateral (select * from bt_metap(s.idx)) meta),
internal_items as (
  select
    *
  from
    internal_page_stats
  order by
    btpo desc),
-- XXX: Note ordering dependency within this CTE, on internal_items
ordered_internal_items(item, blk, level) as (
  select
    1,
    blkno,
    btpo
  from
    internal_items
  where
    btpo_prev = 0
    and btpo = (select level from meta_stats)
  union
  select
    case when level = btpo then o.item + 1 else 1 end,
    blkno,
    btpo
  from
    internal_items i,
    ordered_internal_items o
  where
    i.btpo_prev = o.blk or (btpo_prev = 0 and btpo = o.level - 1)
)
select
  --idx,
  btpo as level,
  item as l_item,
  blkno,
  --btpo_prev,
  --btpo_next,
  btpo_flags,
  type,
  live_items,
  dead_items,
  avg_item_size,
  page_size,
  free_size,
  -- Only non-rightmost pages have high key.  Show heap TID for both pivot and non-pivot tuples here.
  case when btpo_next != 0 then (select data || coalesce(', (htid)=(''' || htid || ''')', '')
                                 from bt_page_items(idx, blkno) where itemoffset = 1) end as highkey
from
  ordered_internal_items o
  join internal_items i on o.blk = i.blkno
order by btpo desc, item;
"""
---
 contrib/pageinspect/btreefuncs.c              | 92 ++++++++++++++++---
 contrib/pageinspect/expected/btree.out        |  6 +-
 contrib/pageinspect/pageinspect--1.6--1.7.sql | 25 +++++
 3 files changed, 109 insertions(+), 14 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 78cdc69ec7..435e71ae22 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -27,6 +27,7 @@
 
 #include "postgres.h"
 
+#include "access/genam.h"
 #include "access/nbtree.h"
 #include "access/relation.h"
 #include "catalog/namespace.h"
@@ -241,6 +242,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
  */
 struct user_args
 {
+	Relation	rel;
 	Page		page;
 	OffsetNumber offset;
 };
@@ -252,9 +254,9 @@ struct user_args
  * ------------------------------------------------------
  */
 static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset, Relation rel)
 {
-	char	   *values[6];
+	char	   *values[10];
 	HeapTuple	tuple;
 	ItemId		id;
 	IndexTuple	itup;
@@ -263,6 +265,8 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	int			dlen;
 	char	   *dump;
 	char	   *ptr;
+	ItemPointer min_htid,
+				max_htid;
 
 	id = PageGetItemId(page, offset);
 
@@ -281,16 +285,77 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
-	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
-	dump = palloc0(dlen * 3 + 1);
-	values[j] = dump;
-	for (off = 0; off < dlen; off++)
+	if (rel)
 	{
-		if (off > 0)
-			*dump++ = ' ';
-		sprintf(dump, "%02x", *(ptr + off) & 0xff);
-		dump += 2;
+		TupleDesc	itupdesc = RelationGetDescr(rel);
+		Datum		datvalues[INDEX_MAX_KEYS];
+		bool		isnull[INDEX_MAX_KEYS];
+		int			natts;
+		int			indnkeyatts = rel->rd_index->indnkeyatts;
+
+		natts = BTreeTupleGetNAtts(itup, rel);
+
+		itupdesc->natts = Min(indnkeyatts, natts);
+		memset(&isnull, 0xFF, sizeof(isnull));
+		index_deform_tuple(itup, itupdesc, datvalues, isnull);
+		rel->rd_index->indnkeyatts = natts;
+		values[j++] = BuildIndexValueDescription(rel, datvalues, isnull);
+		itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+		rel->rd_index->indnkeyatts = indnkeyatts;
 	}
+	else
+	{
+		dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+		dump = palloc0(dlen * 3 + 1);
+		values[j++] = dump;
+		for (off = 0; off < dlen; off++)
+		{
+			if (off > 0)
+				*dump++ = ' ';
+			sprintf(dump, "%02x", *(ptr + off) & 0xff);
+			dump += 2;
+		}
+	}
+
+	if (rel && !_bt_heapkeyspace(rel))
+	{
+		min_htid = NULL;
+		max_htid = NULL;
+	}
+	else
+	{
+		min_htid = BTreeTupleGetHeapTID(itup);
+		if (BTreeTupleIsPosting(itup))
+			max_htid = BTreeTupleGetMaxHeapTID(itup);
+		else
+			max_htid = NULL;
+	}
+
+	if (min_htid)
+		values[j++] = psprintf("(%u,%u)",
+							   ItemPointerGetBlockNumberNoCheck(min_htid),
+							   ItemPointerGetOffsetNumberNoCheck(min_htid));
+	else
+		values[j++] = NULL;
+
+	if (max_htid)
+		values[j++] = psprintf("(%u,%u)",
+							   ItemPointerGetBlockNumberNoCheck(max_htid),
+							   ItemPointerGetOffsetNumberNoCheck(max_htid));
+	else
+		values[j++] = NULL;
+
+	if (min_htid == NULL)
+		values[j++] = psprintf("0");
+	else if (!BTreeTupleIsPosting(itup))
+		values[j++] = psprintf("1");
+	else
+		values[j++] = psprintf("%d", (int) BTreeTupleGetNPosting(itup));
+
+	if (!ItemIdIsDead(id))
+		values[j++] = psprintf("f");
+	else
+		values[j++] = psprintf("t");
 
 	tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
 
@@ -364,11 +429,11 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = rel;
 		uargs->page = palloc(BLCKSZ);
 		memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
 
 		UnlockReleaseBuffer(buffer);
-		relation_close(rel, AccessShareLock);
 
 		uargs->offset = FirstOffsetNumber;
 
@@ -395,12 +460,13 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, uargs->rel);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
 	else
 	{
+		relation_close(uargs->rel, AccessShareLock);
 		pfree(uargs->page);
 		pfree(uargs);
 		SRF_RETURN_DONE(fctx);
@@ -480,7 +546,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, NULL);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..0f6dccaadc 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -40,7 +40,11 @@ ctid       | (0,1)
 itemlen    | 16
 nulls      | f
 vars       | f
-data       | 01 00 00 00 00 00 00 01
+data       | (a)=(72057594037927937)
+htid       | (0,1)
+max_htid   | 
+nheap_tids | 1
+isdead     | f
 
 SELECT * FROM bt_page_items('test1_a_idx', 2);
 ERROR:  block number out of range
diff --git a/contrib/pageinspect/pageinspect--1.6--1.7.sql b/contrib/pageinspect/pageinspect--1.6--1.7.sql
index 2433a21af2..00473da938 100644
--- a/contrib/pageinspect/pageinspect--1.6--1.7.sql
+++ b/contrib/pageinspect/pageinspect--1.6--1.7.sql
@@ -24,3 +24,28 @@ CREATE FUNCTION bt_metap(IN relname text,
     OUT last_cleanup_num_tuples real)
 AS 'MODULE_PATHNAME', 'bt_metap'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items()
+--
+DROP FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT htid tid,
+    OUT max_htid tid,
+    OUT nheap_tids int4,
+    OUT isdead boolean)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
-- 
2.17.1

v23-0001-Add-deduplication-to-nbtree.patchapplication/octet-stream; name=v23-0001-Add-deduplication-to-nbtree.patchDownload

From 8f059dc694460832417ed70e512bdef274ff84a7 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 25 Sep 2019 10:08:53 -0700
Subject: [PATCH v23 1/2] Add deduplication to nbtree

---
 src/include/access/nbtree.h             | 328 +++++++++--
 src/include/access/nbtxlog.h            |  68 ++-
 src/include/access/rmgrlist.h           |   2 +-
 src/backend/access/common/reloptions.c  |  11 +-
 src/backend/access/index/genam.c        |   4 +
 src/backend/access/nbtree/Makefile      |   1 +
 src/backend/access/nbtree/README        |  74 ++-
 src/backend/access/nbtree/nbtdedup.c    | 710 ++++++++++++++++++++++++
 src/backend/access/nbtree/nbtinsert.c   | 321 +++++++++--
 src/backend/access/nbtree/nbtpage.c     | 211 ++++++-
 src/backend/access/nbtree/nbtree.c      | 174 +++++-
 src/backend/access/nbtree/nbtsearch.c   | 249 ++++++++-
 src/backend/access/nbtree/nbtsort.c     | 209 ++++++-
 src/backend/access/nbtree/nbtsplitloc.c |  38 +-
 src/backend/access/nbtree/nbtutils.c    | 218 +++++++-
 src/backend/access/nbtree/nbtxlog.c     | 218 +++++++-
 src/backend/access/rmgrdesc/nbtdesc.c   |  28 +-
 src/bin/psql/tab-complete.c             |   4 +-
 contrib/amcheck/verify_nbtree.c         | 177 ++++--
 doc/src/sgml/btree.sgml                 |  48 +-
 doc/src/sgml/charset.sgml               |   9 +-
 doc/src/sgml/ref/create_index.sgml      |  43 +-
 doc/src/sgml/ref/reindex.sgml           |   5 +-
 23 files changed, 2921 insertions(+), 229 deletions(-)
 create mode 100644 src/backend/access/nbtree/nbtdedup.c

diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4a80e84aa7..d59d1dd574 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -23,6 +23,36 @@
 #include "storage/bufmgr.h"
 #include "storage/shm_toc.h"
 
+/*
+ * Storage type for Btree's reloptions
+ */
+typedef struct BtreeOptions
+{
+	int32		vl_len_;		/* varlena header (do not touch directly!) */
+	int			fillfactor;		/* leaf fillfactor */
+	double		vacuum_cleanup_index_scale_factor;
+	bool		deduplication;	/* Use deduplication where safe? */
+} BtreeOptions;
+
+/*
+ * Deduplication is enabled for non unique indexes and disabled for unique
+ * indexes by default
+ */
+#define BtreeDefaultDoDedup(relation) \
+	(relation->rd_index->indisunique ? false : true)
+
+#define BtreeGetDoDedupOption(relation) \
+	((relation)->rd_options ? \
+	 ((BtreeOptions *) (relation)->rd_options)->deduplication : \
+	 BtreeDefaultDoDedup(relation))
+
+#define BtreeGetFillFactor(relation, defaultff) \
+	((relation)->rd_options ? \
+	 ((BtreeOptions *) (relation)->rd_options)->fillfactor : (defaultff))
+
+#define BtreeGetTargetPageFreeSpace(relation, defaultff) \
+	(BLCKSZ * (100 - BtreeGetFillFactor(relation, defaultff)) / 100)
+
 /* There's room for a 16-bit vacuum cycle ID in BTPageOpaqueData */
 typedef uint16 BTCycleId;
 
@@ -107,6 +137,7 @@ typedef struct BTMetaPageData
 										 * pages */
 	float8		btm_last_cleanup_num_heap_tuples;	/* number of heap tuples
 													 * during last cleanup */
+	bool		btm_safededup;	/* deduplication known to be safe? */
 } BTMetaPageData;
 
 #define BTPageGetMeta(p) \
@@ -114,7 +145,8 @@ typedef struct BTMetaPageData
 
 /*
  * The current Btree version is 4.  That's what you'll get when you create
- * a new index.
+ * a new index.  The btm_safededup field can only be set if this happened
+ * on Postgres 13, but it's safe to read with version 3 indexes.
  *
  * Btree version 3 was used in PostgreSQL v11.  It is mostly the same as
  * version 4, but heap TIDs were not part of the keyspace.  Index tuples
@@ -131,8 +163,8 @@ typedef struct BTMetaPageData
 #define BTREE_METAPAGE	0		/* first page is meta */
 #define BTREE_MAGIC		0x053162	/* magic number in metapage */
 #define BTREE_VERSION	4		/* current version number */
-#define BTREE_MIN_VERSION	2	/* minimal supported version number */
-#define BTREE_NOVAC_VERSION	3	/* minimal version with all meta fields */
+#define BTREE_MIN_VERSION	2	/* minimum supported version */
+#define BTREE_NOVAC_VERSION	3	/* version with all meta fields set */
 
 /*
  * Maximum size of a btree index entry, including its tuple header.
@@ -154,6 +186,26 @@ typedef struct BTMetaPageData
 	MAXALIGN_DOWN((PageGetPageSize(page) - \
 				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
 				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
+/*
+ * MaxBTreeIndexTuplesPerPage is an upper bound on the number of "logical"
+ * tuples that may be stored on a btree leaf page.  This is comparable to
+ * the generic/physical MaxIndexTuplesPerPage upper bound.  A separate
+ * upper bound is needed in certain contexts due to posting list tuples,
+ * which only use a single physical page entry to store many logical
+ * tuples.  (MaxBTreeIndexTuplesPerPage is used to size the per-page
+ * temporary buffers used by index scans.)
+ *
+ * Note: we don't bother considering per-physical-tuple overheads here to
+ * keep things simple (value is based on how many elements a single array
+ * of heap TIDs must have to fill the space between the page header and
+ * special area).  The value is slightly higher (i.e. more conservative)
+ * than necessary as a result, which is considered acceptable.  There will
+ * only be three (very large) physical posting list tuples in leaf pages
+ * that have the largest possible number of heap TIDs/logical tuples.
+ */
+#define MaxBTreeIndexTuplesPerPage \
+	(int) ((BLCKSZ - SizeOfPageHeaderData - sizeof(BTPageOpaqueData)) / \
+		   sizeof(ItemPointerData))
 
 /*
  * The leaf-page fillfactor defaults to 90% but is user-adjustable.
@@ -234,8 +286,7 @@ typedef struct BTMetaPageData
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
  * All other types of index tuples ("pivot" tuples) only have key columns,
  * since pivot tuples only exist to represent how the key space is
@@ -282,20 +333,104 @@ typedef struct BTMetaPageData
  * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
  * number of columns/attributes <= INDEX_MAX_KEYS.
  *
+ * Sometimes non-pivot tuples also use a representation that repurposes
+ * t_tid to store metadata rather than a TID.  Postgres 13 introduced a new
+ * non-pivot tuple format in order to fold together multiple equal and
+ * equivalent non-pivot tuples into a single logically equivalent, space
+ * efficient representation - a posting list tuple.  A posting list is an
+ * array of ItemPointerData elements (there must be at least two elements
+ * when the posting list tuple format is used).  Posting list tuples are
+ * created dynamically by deduplication, at the point where we'd otherwise
+ * have to split a leaf page.
+ *
+ * Posting tuple format (alternative non-pivot tuple representation):
+ *
+ *  t_tid | t_info | key values | posting list (TID array)
+ *
+ * Posting list tuples are recognized as such by having the
+ * INDEX_ALT_TID_MASK status bit set in t_info and the BT_IS_POSTING status
+ * bit set in t_tid.  These flags redefine the content of the posting
+ * tuple's t_tid to store an offset to the posting list, as well as the
+ * total number of posting list array elements.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items present in the tuple, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.  Like any non-pivot tuple, the number of columns stored is
+ * always implicitly the total number in the index (in practice there can
+ * never be non-key columns stored, since deduplication is not supported
+ * with INCLUDE indexes).
+ *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_N_POSTING_OFFSET_MASK	0x0FFF
 #define BT_HEAP_TID_ATTR			0x1000
+#define BT_IS_POSTING				0x2000
+
+/*
+ * N.B.: BTreeTupleIsPivot() should only be used in code that deals with
+ * heapkeyspace indexes specifically.  BTreeTupleIsPosting() works with all
+ * nbtree indexes, though.
+ */
+#define BTreeTupleIsPivot(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+	)
+#define BTreeTupleIsPosting(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+	)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleGetNPosting(itup)	\
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+	)
+#define BTreeTupleSetNPosting(itup, n) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+	} while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+	)
+#define BTreeSetPostingMeta(itup, nposting, off) \
+	do { \
+		BTreeTupleSetNPosting(itup, nposting); \
+		Assert(BTreeTupleIsPosting(itup)); \
+		ItemPointerSetBlockNumber(&((itup)->t_tid), (off)); \
+	} while(0)
+
+#define BTreeTupleGetPosting(itup) \
+	(ItemPointer) ((char*) (itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+	(BTreeTupleGetPosting(itup) + (n))
 
 /* Get/set downlink block number */
 #define BTreeInnerTupleGetDownLink(itup) \
@@ -326,40 +461,71 @@ typedef struct BTMetaPageData
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
 		( \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
 		IndexRelationGetNumberOfAttributes(rel) \
 	)
-#define BTreeTupleSetNAtts(itup, n) \
-	do { \
-		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
-	} while(0)
+
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int n)
+{
+	Assert(!BTreeTupleIsPosting(itup));
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	ItemPointerSetOffsetNumber(&itup->t_tid, n & BT_N_KEYS_OFFSET_MASK);
+}
 
 /*
- * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Get tiebreaker heap TID attribute, if any.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
  */
-#define BTreeTupleGetHeapTID(itup) \
-	( \
-	  (itup)->t_info & INDEX_ALT_TID_MASK && \
-	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
-	  ( \
-		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
-					   sizeof(ItemPointerData)) \
-	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
-	)
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+	if (BTreeTupleIsPivot(itup))
+	{
+		/* Pivot tuple heap TID representation? */
+		if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+			 BT_HEAP_TID_ATTR) != 0)
+			return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+								  sizeof(ItemPointerData));
+
+		/* Heap TID attribute was truncated */
+		return NULL;
+	}
+	else if (BTreeTupleIsPosting(itup))
+		return BTreeTupleGetPosting(itup);
+
+	return &(itup->t_tid);
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list tuple.  Works with
+ * non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxHeapTID(IndexTuple itup)
+{
+	Assert(!BTreeTupleIsPivot(itup));
+
+	if (BTreeTupleIsPosting(itup))
+		return BTreeTupleGetPosting(itup) + (BTreeTupleGetNPosting(itup) - 1);
+
+	return &(itup->t_tid);
+}
+
 /*
  * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * representation
  */
 #define BTreeTupleSetAltHeapTID(itup) \
 	do { \
-		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(BTreeTupleIsPivot(itup)); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
 								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
 	} while(0)
@@ -434,6 +600,11 @@ typedef BTStackData *BTStack;
  * indexes whose version is >= version 4.  It's convenient to keep this close
  * by, rather than accessing the metapage repeatedly.
  *
+ * safededup is set to indicate that index may use dynamic deduplication
+ * safely (index storage parameter separately indicates if deduplication is
+ * currently in use).  This is also a property of the index relation rather
+ * than an indexscan that is kept around for convenience.
+ *
  * anynullkeys indicates if any of the keys had NULL value when scankey was
  * built from index tuple (note that already-truncated tuple key attributes
  * set NULL as a placeholder key value, which also affects value of
@@ -469,6 +640,7 @@ typedef BTStackData *BTStack;
 typedef struct BTScanInsertData
 {
 	bool		heapkeyspace;
+	bool		safededup;
 	bool		anynullkeys;
 	bool		nextkey;
 	bool		pivotsearch;
@@ -507,6 +679,13 @@ typedef struct BTInsertStateData
 	bool		bounds_valid;
 	OffsetNumber low;
 	OffsetNumber stricthigh;
+
+	/*
+	 * if _bt_binsrch_insert() found the location inside existing posting
+	 * list, save the position inside the list.  This will be -1 in rare cases
+	 * where the overlapping posting list is LP_DEAD.
+	 */
+	int			postingoff;
 } BTInsertStateData;
 
 typedef BTInsertStateData *BTInsertState;
@@ -534,7 +713,10 @@ typedef BTInsertStateData *BTInsertState;
  * If we are doing an index-only scan, we save the entire IndexTuple for each
  * matched item, otherwise only its heap TID and offset.  The IndexTuples go
  * into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array.  Posting list tuples store a "base" tuple once,
+ * allowing the same key to be returned for each logical tuple associated
+ * with the physical posting list tuple (i.e. for each TID from the posting
+ * list).
  */
 
 typedef struct BTScanPosItem	/* what we remember about each match */
@@ -567,6 +749,12 @@ typedef struct BTScanPosData
 	 */
 	int			nextTupleOffset;
 
+	/*
+	 * Posting list tuples use postingTupleOffset to store the current
+	 * location of the tuple that is returned multiple times.
+	 */
+	int			postingTupleOffset;
+
 	/*
 	 * The items array is always ordered in index order (ie, increasing
 	 * indexoffset).  When scanning backwards it is convenient to fill the
@@ -578,7 +766,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxBTreeIndexTuplesPerPage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -680,6 +868,57 @@ typedef BTScanOpaqueData *BTScanOpaque;
 #define SK_BT_DESC			(INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
 #define SK_BT_NULLS_FIRST	(INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
 
+/*
+ * State used to representing a pending posting list during deduplication.
+ *
+ * Each entry represents a group of consecutive items from the page, starting
+ * from page offset number 'baseoff', which is the offset number of the "base"
+ * tuple on the page undergoing deduplication.  'nitems' is the total number
+ * of items from the page that will be merged to make a new posting tuple.
+ *
+ * Note: 'nitems' means the number of physical index tuples/line pointers on
+ * the page, starting with and including the item at offset number 'baseoff'
+ * (so nitems should be at least 2 when interval is used).  These existing
+ * tuples may be posting list tuples or regular tuples.
+ */
+typedef struct BTDedupInterval
+{
+	OffsetNumber baseoff;
+	OffsetNumber nitems;
+} BTDedupInterval;
+
+/*
+ * Btree-private state used to deduplicate items on a leaf page
+ */
+typedef struct BTDedupState
+{
+	Relation	rel;
+	/* Deduplication status info for entire page/operation */
+	Size		maxitemsize;	/* Limit on size of final tuple */
+	IndexTuple	newitem;
+	bool		checkingunique; /* Use unique index strategy? */
+	OffsetNumber skippedbase;	/* First offset skipped by checkingunique */
+
+	/* Metadata about current pending posting list */
+	ItemPointer htids;			/* Heap TIDs in pending posting list */
+	int			nhtids;			/* # heap TIDs in nhtids array */
+	int			nitems;			/* See BTDedupInterval definition */
+	Size		alltupsize;		/* Includes line pointer overhead */
+	bool		overlap;		/* Avoid overlapping posting lists? */
+
+	/* Metadata about base tuple of current pending posting list */
+	IndexTuple	base;			/* Use to form new posting list */
+	OffsetNumber baseoff;		/* page offset of base */
+	Size		basetupsize;	/* base size without posting list */
+
+	/*
+	 * Pending posting list.  Contains information about a group of
+	 * consecutive items that will be deduplicated by creating a new posting
+	 * list tuple.
+	 */
+	BTDedupInterval interval;
+} BTDedupState;
+
 /*
  * Constant definition for progress reporting.  Phase numbers must match
  * btbuildphasename.
@@ -725,6 +964,22 @@ extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
 extern void _bt_parallel_done(IndexScanDesc scan);
 extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
 
+/*
+ * prototypes for functions in nbtdedup.c
+ */
+extern void _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+							   IndexTuple newitem, Size newitemsz,
+							   bool checkingunique);
+extern void _bt_dedup_start_pending(BTDedupState *state, IndexTuple base,
+									OffsetNumber base_off);
+extern bool _bt_dedup_save_htid(BTDedupState *state, IndexTuple itup);
+extern Size _bt_dedup_finish_pending(Buffer buffer, BTDedupState *state,
+									 bool need_wal);
+extern IndexTuple _bt_form_posting(IndexTuple tuple, ItemPointer htids,
+								   int nhtids);
+extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
+								   int postingoff);
+
 /*
  * prototypes for functions in nbtinsert.c
  */
@@ -743,7 +998,8 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
 /*
  * prototypes for functions in nbtpage.c
  */
-extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level);
+extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+							 bool safededup);
 extern void _bt_update_meta_cleanup_info(Relation rel,
 										 TransactionId oldestBtpoXact, float8 numHeapTuples);
 extern void _bt_upgrademetapage(Page page);
@@ -751,6 +1007,7 @@ extern Buffer _bt_getroot(Relation rel, int access);
 extern Buffer _bt_gettrueroot(Relation rel);
 extern int	_bt_getrootheight(Relation rel);
 extern bool _bt_heapkeyspace(Relation rel);
+extern bool _bt_safededup(Relation rel);
 extern void _bt_checkpage(Relation rel, Buffer buf);
 extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
 extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -762,6 +1019,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems, Relation heapRel);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems,
+								OffsetNumber *updateitemnos,
+								IndexTuple *updated, int nupdateable,
 								BlockNumber lastBlockVacuumed);
 extern int	_bt_pagedel(Relation rel, Buffer buf);
 
@@ -812,6 +1071,7 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern bool _bt_opclasses_support_dedup(Relation index);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 91b9ee00cf..b21e6f8082 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP_PAGE	0x50	/* deduplicate tuples on leaf page */
+/* 0x60 is unused */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -53,6 +54,7 @@ typedef struct xl_btree_metadata
 	uint32		fastlevel;
 	TransactionId oldest_btpo_xact;
 	float8		last_cleanup_num_heap_tuples;
+	bool		btm_safededup;
 } xl_btree_metadata;
 
 /*
@@ -61,16 +63,21 @@ typedef struct xl_btree_metadata
  * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
  * Note that INSERT_META implies it's not a leaf page.
  *
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page (data contains the inserted tuple);
+ *				 if postingoff is set, this started out as an insertion
+ *				 into an existing posting tuple at the offset before
+ *				 offnum (i.e. it's a posting list split).  (REDO will
+ *				 have to update split posting list, too.)
  * Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
  * Backup Blk 2: xl_btree_metadata, if INSERT_META
  */
 typedef struct xl_btree_insert
 {
 	OffsetNumber offnum;
+	OffsetNumber postingoff;
 } xl_btree_insert;
 
-#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, postingoff) + sizeof(OffsetNumber))
 
 /*
  * On insert with split, we save all the items going into the right sibling
@@ -91,9 +98,19 @@ typedef struct xl_btree_insert
  *
  * Backup Blk 0: original page / new left page
  *
- * The left page's data portion contains the new item, if it's the _L variant.
- * An IndexTuple representing the high key of the left page must follow with
- * either variant.
+ * The left page's data portion contains the new item, if it's the _L variant
+ * (though _R variant page split records with a posting list split sometimes
+ * need to include newitem).  An IndexTuple representing the high key of the
+ * left page must follow in all cases.
+ *
+ * The newitem is actually an "original" newitem when a posting list split
+ * occurs that requires than the original posting list be updated in passing.
+ * Recovery recognizes this case when postingoff is set, and must use the
+ * posting offset to do an in-place update of the existing posting list that
+ * was actually split, and change the newitem to the "final" newitem.  This
+ * corresponds to the xl_btree_insert postingoff-is-set case.  postingoff
+ * won't be set when a posting list split occurs where both original posting
+ * list and newitem go on the right page.
  *
  * Backup Blk 1: new right page
  *
@@ -111,10 +128,26 @@ typedef struct xl_btree_split
 {
 	uint32		level;			/* tree level of page being split */
 	OffsetNumber firstright;	/* first item moved to right page */
-	OffsetNumber newitemoff;	/* new item's offset (useful for _L variant) */
+	OffsetNumber newitemoff;	/* new item's offset */
+	OffsetNumber postingoff;	/* offset inside orig posting tuple */
 } xl_btree_split;
 
-#define SizeOfBtreeSplit	(offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit	(offsetof(xl_btree_split, postingoff) + sizeof(OffsetNumber))
+
+/*
+ * When page is deduplicated, consecutive groups of tuples with equal keys are
+ * merged together into posting list tuples.
+ *
+ * The WAL record represents the interval that describes the posing tuple
+ * that should be added to the page.
+ */
+typedef struct xl_btree_dedup
+{
+	OffsetNumber baseoff;
+	OffsetNumber nitems;
+} xl_btree_dedup;
+
+#define SizeOfBtreeDedup 	(offsetof(xl_btree_dedup, nitems) + sizeof(OffsetNumber))
 
 /*
  * This is what we need to know about delete of individual leaf index tuples.
@@ -166,16 +199,27 @@ typedef struct xl_btree_reuse_page
  * block numbers aren't given.
  *
  * Note that the *last* WAL record in any vacuum of an index is allowed to
- * have a zero length array of offsets. Earlier records must have at least one.
+ * have a zero length array of target offsets (i.e. no deletes or updates).
+ * Earlier records must have at least one.
  */
 typedef struct xl_btree_vacuum
 {
 	BlockNumber lastBlockVacuumed;
 
-	/* TARGET OFFSET NUMBERS FOLLOW */
+	/*
+	 * This field helps us to find beginning of the updated versions of tuples
+	 * which follow array of offset numbers, needed when a posting list is
+	 * vacuumed without killing all of its logical tuples.
+	 */
+	uint32		nupdated;
+	uint32		ndeleted;
+
+	/* UPDATED TARGET OFFSET NUMBERS FOLLOW (if any) */
+	/* UPDATED TUPLES TO ADD BACK FOLLOW (if any) */
+	/* DELETED TARGET OFFSET NUMBERS FOLLOW (if any) */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(BlockNumber))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.
@@ -256,6 +300,8 @@ typedef struct xl_btree_newroot
 extern void btree_redo(XLogReaderState *record);
 extern void btree_desc(StringInfo buf, XLogReaderState *record);
 extern const char *btree_identify(uint8 info);
+extern void btree_xlog_startup(void);
+extern void btree_xlog_cleanup(void);
 extern void btree_mask(char *pagedata, BlockNumber blkno);
 
 #endif							/* NBTXLOG_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 3c0db2ccf5..2b8c6c7fc8 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -36,7 +36,7 @@ PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL,
 PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
 PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
 PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask)
 PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask)
 PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
 PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index d8790ad7a3..d69402c08d 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -158,6 +158,15 @@ static relopt_bool boolRelOpts[] =
 		},
 		true
 	},
+	{
+		{
+			"deduplication",
+			"Enables deduplication on btree index leaf pages",
+			RELOPT_KIND_BTREE,
+			ShareUpdateExclusiveLock
+		},
+		true
+	},
 	/* list terminator */
 	{{NULL}}
 };
@@ -1510,8 +1519,6 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		offsetof(StdRdOptions, user_catalog_table)},
 		{"parallel_workers", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, parallel_workers)},
-		{"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
-		offsetof(StdRdOptions, vacuum_cleanup_index_scale_factor)},
 		{"vacuum_index_cleanup", RELOPT_TYPE_BOOL,
 		offsetof(StdRdOptions, vacuum_index_cleanup)},
 		{"vacuum_truncate", RELOPT_TYPE_BOOL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 2599b5d342..6e1dc596e1 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -276,6 +276,10 @@ BuildIndexValueDescription(Relation indexRelation,
 /*
  * Get the latestRemovedXid from the table entries pointed at by the index
  * tuples being deleted.
+ *
+ * Note: index access methods that don't consistently use the standard
+ * IndexTuple + heap TID item pointer representation will need to provide
+ * their own version of this function.
  */
 TransactionId
 index_compute_xid_horizon_for_tuples(Relation irel,
diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index bf245f5dab..d69808e78c 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = \
 	nbtcompare.o \
+	nbtdedup.o \
 	nbtinsert.o \
 	nbtpage.o \
 	nbtree.o \
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6db203e75c..54cb9db49d 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
 like a hint bit for a heap tuple), but physically removing tuples requires
 exclusive lock.  In the current code we try to remove LP_DEAD tuples when
 we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already).  Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)
 
 This leaves the index in a state where it has no entry for a dead tuple
 that still exists in the heap.  This is not a problem for the current
@@ -710,6 +713,75 @@ the fallback strategy assumes that duplicates are mostly inserted in
 ascending heap TID order.  The page is split in a way that leaves the left
 half of the page mostly full, and the right half of the page mostly empty.
 
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits.  Deduplication alters
+the physical representation of tuples without changing the logical contents
+of the index, and without adding overhead to read queries.  Non-pivot
+tuples are folded together into a single physical tuple with a posting list
+(a simple array of heap TIDs with the standard item pointer format).
+Deduplication is always applied lazily, at the point where it would
+otherwise be necessary to perform a page split.  It occurs only when
+LP_DEAD items have been removed, as our last line of defense against
+splitting a leaf page.  We can set the LP_DEAD bit with posting list
+tuples, though only when all table tuples are known dead. (Bitmap scans
+cannot perform LP_DEAD bit setting, and are the common case with indexes
+that contain lots of duplicates, so this downside is considered
+acceptable.)
+
+Large groups of logical duplicates tend to appear together on the same leaf
+page due to the special duplicate logic used when choosing a split point.
+This facilitates lazy/dynamic deduplication.  Deduplication can reliably
+deduplicate a large localized group of duplicates before it can span
+multiple leaf pages.  Posting list tuples are subject to the same 1/3 of a
+page restriction as any other tuple.
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists.  A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple.  An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication.  Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple.  When the incoming
+tuple really does overlap with an existing posting list, a posting list
+split is performed.  Posting list splits work in a way that more or less
+preserves the illusion that all incoming tuples do not need to be merged
+with any existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list.  The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list.  The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list.  We make
+space in the posting list by removing the heap TID that became the new
+item.  The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists.  Also, overall
+space utilization is improved by keeping existing posting lists large.
+
+The representation of posting lists is identical to the posting lists used
+by GIN, so it would be straightforward to apply GIN's varbyte encoding
+compression scheme to individual posting lists.  Posting list compression
+would break the assumptions made by posting list splits about page space
+accounting, though, so it's not clear how compression could be integrated
+with nbtree.  Besides, posting list compression does not offer a compelling
+trade-off for nbtree, since in general nbtree is optimized for consistent
+performance with many concurrent readers and writers.  A major goal of
+nbtree's lazy approach to deduplication is to limit the performance impact
+of deduplication with random updates.  Even concurrent append-only inserts
+of the same key value will tend to have inserts of individual index tuples
+in an order that doesn't quite match heap TID order.  In general, delaying
+deduplication avoids many unnecessary posting list splits, and minimizes
+page level fragmentation.
+
 Notes About Data Representation
 -------------------------------
 
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
new file mode 100644
index 0000000000..dde1d68d6f
--- /dev/null
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -0,0 +1,710 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtdedup.c
+ *	  Deduplicate items in Lehman and Yao btrees for Postgres.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/nbtree/nbtdedup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/nbtxlog.h"
+#include "miscadmin.h"
+#include "utils/rel.h"
+
+
+/*
+ * Try to deduplicate items to free at least enough space to avoid a page
+ * split.  This function should be called during insertion, only after LP_DEAD
+ * items were removed by _bt_vacuum_one_page() to prevent a page split.
+ * (We'll have to kill LP_DEAD items here when the page's BTP_HAS_GARBAGE hint
+ * was not set, but that should be rare.)
+ *
+ * The strategy for !checkingunique callers is to perform as much
+ * deduplication as possible to free as much space as possible now, since
+ * making it harder to set LP_DEAD bits is considered an acceptable price for
+ * not having to deduplicate the same page many times.  It is unlikely that
+ * the items on the page will have their LP_DEAD bit set in the future, since
+ * that hasn't happened before now (besides, entire posting lists can still
+ * have their LP_DEAD bit set).
+ *
+ * The strategy for checkingunique callers is rather different, since the
+ * overall goal is different.  Deduplication cooperates with and enhances
+ * garbage collection, especially the LP_DEAD bit setting that takes place in
+ * _bt_check_unique().  Deduplication does as little as possible while still
+ * preventing a page split for caller, since it's less likely that posting
+ * lists will have their LP_DEAD bit set.  Deduplication avoids creating new
+ * posting lists with only two heap TIDs, and also avoids creating new posting
+ * lists from an existing posting list.  Deduplication is only useful when it
+ * delays a page split long enough for garbage collection to prevent the page
+ * split altogether.  checkingunique deduplication can make all the difference
+ * in cases where VACUUM keeps up with dead index tuples, but "recently dead"
+ * index tuples are still numerous enough to cause page splits that are truly
+ * unnecessary.
+ *
+ * Note: If newitem contains NULL values in key attributes, caller will be
+ * !checkingunique even when rel is a unique index.  The page in question will
+ * usually have many existing items with NULLs.
+ */
+void
+_bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+				   IndexTuple newitem, Size newitemsz, bool checkingunique)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buffer);
+	BTPageOpaque oopaque;
+	BTDedupState *state = NULL;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+	OffsetNumber deletable[MaxIndexTuplesPerPage];
+	bool		minimal = checkingunique;
+	int			ndeletable = 0;
+	Size		pagesaving = 0;
+	int			count = 0;
+	bool		singlevalue = false;
+
+	oopaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	/* init deduplication state needed to build posting tuples */
+	state = (BTDedupState *) palloc(sizeof(BTDedupState));
+	state->rel = rel;
+
+	state->maxitemsize = BTMaxItemSize(page);
+	state->newitem = newitem;
+	state->checkingunique = checkingunique;
+	state->skippedbase = InvalidOffsetNumber;
+	/* Metadata about current pending posting list */
+	state->htids = NULL;
+	state->nhtids = 0;
+	state->nitems = 0;
+	state->alltupsize = 0;
+	state->overlap = false;
+	/* Metadata about based tuple of current pending posting list */
+	state->base = NULL;
+	state->baseoff = InvalidOffsetNumber;
+	state->basetupsize = 0;
+
+	minoff = P_FIRSTDATAKEY(oopaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/*
+	 * Delete dead tuples if any. We cannot simply skip them in the cycle
+	 * below, because it's necessary to generate special Xlog record
+	 * containing such tuples to compute latestRemovedXid on a standby server
+	 * later.
+	 *
+	 * This should not affect performance, since it only can happen in a rare
+	 * situation when BTP_HAS_GARBAGE flag was not set and _bt_vacuum_one_page
+	 * was not called, or _bt_vacuum_one_page didn't remove all dead items.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+
+		if (ItemIdIsDead(itemid))
+			deletable[ndeletable++] = offnum;
+	}
+
+	if (ndeletable > 0)
+	{
+		/*
+		 * Skip duplication in rare cases where there were LP_DEAD items
+		 * encountered here when that frees sufficient space for caller to
+		 * avoid a page split
+		 */
+		_bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+		if (PageGetFreeSpace(page) >= newitemsz)
+		{
+			pfree(state);
+			return;
+		}
+
+		/* Continue with deduplication */
+		minoff = P_FIRSTDATAKEY(oopaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+	}
+
+	/* Make sure that new page won't have garbage flag set */
+	oopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+	newitemsz += sizeof(ItemIdData);
+	/* Conservatively size array */
+	state->htids = palloc(state->maxitemsize);
+
+	/*
+	 * Determine if a "single value" strategy page split is likely to occur
+	 * shortly after deduplication finishes.  It should be possible for the
+	 * single value split to find a split point that packs the left half of
+	 * the split BTREE_SINGLEVAL_FILLFACTOR% full.
+	 */
+	if (!checkingunique)
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+
+		itemid = PageGetItemId(page, minoff);
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		if (_bt_keep_natts_fast(rel, newitem, itup) > natts)
+		{
+			itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
+			itup = (IndexTuple) PageGetItem(page, itemid);
+
+			/*
+			 * Use different strategy if future page split likely to need to
+			 * use "single value" strategy
+			 */
+			if (_bt_keep_natts_fast(rel, newitem, itup) > natts)
+				singlevalue = true;
+		}
+	}
+
+	/*
+	 * Iterate over tuples on the page, try to deduplicate them into posting
+	 * lists and insert into new page.  NOTE: It's essential to reassess the
+	 * max offset on each iteration, since it will change as items are
+	 * deduplicated.
+	 */
+	offnum = minoff;
+retry:
+	while (offnum <= PageGetMaxOffsetNumber(page))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(!ItemIdIsDead(itemid));
+
+		if (state->nitems == 0)
+		{
+			/*
+			 * No previous/base tuple for the data item -- use the data item
+			 * as base tuple of pending posting list
+			 */
+			_bt_dedup_start_pending(state, itup, offnum);
+		}
+		else if (_bt_keep_natts_fast(rel, state->base, itup) > natts &&
+				 _bt_dedup_save_htid(state, itup))
+		{
+			/*
+			 * Tuple is equal to base tuple of pending posting list.  Heap
+			 * TID(s) for itup have been saved in state.  The next iteration
+			 * will also end up here if it's possible to merge the next tuple
+			 * into the same pending posting list.
+			 */
+		}
+		else
+		{
+			/*
+			 * Tuple is not equal to pending posting list tuple, or
+			 * _bt_dedup_save_htid() opted to not merge current item into
+			 * pending posting list for some other reason (e.g., adding more
+			 * TIDs would have caused posting list to exceed BTMaxItemSize()
+			 * limit).
+			 *
+			 * If state contains pending posting list with more than one item,
+			 * form new posting tuple, and update the page.  Otherwise, reset
+			 * the state and move on.
+			 */
+			pagesaving += _bt_dedup_finish_pending(buffer, state,
+												   RelationNeedsWAL(rel));
+
+			count++;
+
+			/*
+			 * When caller is a checkingunique caller and we have deduplicated
+			 * enough to avoid a page split, do minimal deduplication in case
+			 * the remaining items are about to be marked dead within
+			 * _bt_check_unique().
+			 */
+			if (minimal && pagesaving >= newitemsz)
+				break;
+
+			/*
+			 * Consider special steps when a future page split of the leaf
+			 * page is likely to occur using nbtsplitloc.c's "single value"
+			 * strategy
+			 */
+			if (singlevalue)
+			{
+				/*
+				 * Adjust maxitemsize so that there isn't a third and final
+				 * 1/3 of a page width tuple that fills the page to capacity.
+				 * The third tuple produced should be smaller than the first
+				 * two by an amount equal to the free space that nbtsplitloc.c
+				 * is likely to want to leave behind when the page it split.
+				 * When there are 3 posting lists on the page, then we end
+				 * deduplication.  Remaining tuples on the page can be
+				 * deduplicated later, when they're on the new right sibling
+				 * of this page, and the new sibling page needs to be split in
+				 * turn.
+				 *
+				 * Note that it doesn't matter if there are items on the page
+				 * that were already 1/3 of a page during current pass;
+				 * they'll still count as the first two posting list tuples.
+				 */
+				if (count == 2)
+				{
+					Size		leftfree;
+
+					/* This calculation needs to match nbtsplitloc.c */
+					leftfree = PageGetPageSize(page) - SizeOfPageHeaderData -
+						MAXALIGN(sizeof(BTPageOpaqueData));
+					/* Subtract predicted size of new high key */
+					leftfree -= newitemsz + MAXALIGN(sizeof(ItemPointerData));
+
+					/*
+					 * Reduce maxitemsize by an amount equal to target free
+					 * space on left half of page
+					 */
+					state->maxitemsize -= leftfree *
+						((100 - BTREE_SINGLEVAL_FILLFACTOR) / 100.0);
+				}
+				else if (count == 3)
+					break;
+			}
+
+			/*
+			 * Next iteration starts immediately after base tuple offset (this
+			 * will be the next offset on the page when we didn't modify the
+			 * page)
+			 */
+			offnum = state->baseoff;
+		}
+
+		offnum = OffsetNumberNext(offnum);
+	}
+
+	/* Handle the last item when pending posting list is not empty */
+	if (state->nitems != 0)
+	{
+		pagesaving += _bt_dedup_finish_pending(buffer, state,
+											   RelationNeedsWAL(rel));
+		count++;
+	}
+
+	if (pagesaving < newitemsz && state->skippedbase != InvalidOffsetNumber)
+	{
+		/*
+		 * Didn't free enough space for new item in first checkingunique pass.
+		 * Try making a second pass over the page, this time starting from the
+		 * first candidate posting list base offset that was skipped over in
+		 * the first pass (only do a second pass when this actually happened).
+		 *
+		 * The second pass over the page may deduplicate items that were
+		 * initially passed over due to concerns about limiting the
+		 * effectiveness of LP_DEAD bit setting within _bt_check_unique().
+		 * Note that the second pass will still stop deduplicating as soon as
+		 * enough space has been freed to avoid an immediate page split.
+		 */
+		Assert(state->checkingunique);
+		offnum = state->skippedbase;
+
+		state->checkingunique = false;
+		state->skippedbase = InvalidOffsetNumber;
+		state->alltupsize = 0;
+		state->nitems = 0;
+		state->base = NULL;
+		state->baseoff = InvalidOffsetNumber;
+		state->basetupsize = 0;
+		goto retry;
+	}
+
+	/* Local space accounting should agree with page accounting */
+	Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
+
+	/* be tidy */
+	pfree(state->htids);
+	pfree(state);
+}
+
+/*
+ * Create a new pending posting list tuple based on caller's tuple.
+ *
+ * Every tuple processed by the deduplication routines either becomes the base
+ * tuple for a posting list, or gets its heap TID(s) accepted into a pending
+ * posting list.  A tuple that starts out as the base tuple for a posting list
+ * will only actually be rewritten within _bt_dedup_finish_pending() when
+ * there was at least one successful call to _bt_dedup_save_htid().
+ */
+void
+_bt_dedup_start_pending(BTDedupState *state, IndexTuple base,
+						OffsetNumber baseoff)
+{
+	Assert(state->nhtids == 0);
+	Assert(state->nitems == 0);
+
+	/*
+	 * Copy heap TIDs from new base tuple for new candidate posting list into
+	 * ipd array.  Assume that we'll eventually create a new posting tuple by
+	 * merging later tuples with this existing one, though we may not.
+	 */
+	if (!BTreeTupleIsPosting(base))
+	{
+		memcpy(state->htids, base, sizeof(ItemPointerData));
+		state->nhtids = 1;
+		/* Save size of tuple without any posting list */
+		state->basetupsize = IndexTupleSize(base);
+	}
+	else
+	{
+		int			nposting;
+
+		nposting = BTreeTupleGetNPosting(base);
+		memcpy(state->htids, BTreeTupleGetPosting(base),
+			   sizeof(ItemPointerData) * nposting);
+		state->nhtids = nposting;
+		/* Save size of tuple without any posting list */
+		state->basetupsize = BTreeTupleGetPostingOffset(base);
+	}
+
+	/*
+	 * Save new base tuple itself -- it'll be needed if we actually create a
+	 * new posting list from new pending posting list.
+	 *
+	 * Must maintain size of all tuples (including line pointer overhead) to
+	 * calculate space savings on page within _bt_dedup_finish_pending().
+	 * Also, save number of base tuple logical tuples so that we can save
+	 * cycles in the common case where an existing posting list can't or won't
+	 * be merged with other tuples on the page.
+	 */
+	state->nitems = 1;
+	state->base = base;
+	state->baseoff = baseoff;
+	state->alltupsize = MAXALIGN(IndexTupleSize(base)) + sizeof(ItemIdData);
+	/* Also save baseoff in pending state for interval */
+	state->interval.baseoff = state->baseoff;
+	state->overlap = false;
+	if (state->newitem)
+	{
+		/* Might overlap with new item -- mark it as possible if it is */
+		if (BTreeTupleGetHeapTID(base) < BTreeTupleGetHeapTID(state->newitem))
+			state->overlap = true;
+	}
+}
+
+/*
+ * Save itup heap TID(s) into pending posting list where possible.
+ *
+ * Returns bool indicating if the pending posting list managed by state has
+ * itup's heap TID(s) saved.  When this is false, enlarging the pending
+ * posting list by the required amount would exceed the maxitemsize limit, so
+ * caller must finish the pending posting list tuple.  (Generally itup becomes
+ * the base tuple of caller's new pending posting list).
+ */
+bool
+_bt_dedup_save_htid(BTDedupState *state, IndexTuple itup)
+{
+	int			nhtids;
+	ItemPointer htids;
+	Size		mergedtupsz;
+
+	if (!BTreeTupleIsPosting(itup))
+	{
+		nhtids = 1;
+		htids = &itup->t_tid;
+	}
+	else
+	{
+		nhtids = BTreeTupleGetNPosting(itup);
+		htids = BTreeTupleGetPosting(itup);
+	}
+
+	/*
+	 * Don't append (have caller finish pending posting list as-is) if
+	 * appending heap TID(s) from itup would put us over limit
+	 */
+	mergedtupsz = MAXALIGN(state->basetupsize +
+						   (state->nhtids + nhtids) *
+						   sizeof(ItemPointerData));
+
+	if (mergedtupsz > state->maxitemsize)
+		return false;
+
+	/* Don't merge existing posting lists with checkingunique */
+	if (state->checkingunique &&
+		(BTreeTupleIsPosting(state->base) || nhtids > 1))
+	{
+		/* May begin here if second pass over page is required */
+		if (state->skippedbase == InvalidOffsetNumber)
+			state->skippedbase = state->baseoff;
+		return false;
+	}
+
+	if (state->overlap)
+	{
+		if (BTreeTupleGetMaxHeapTID(itup) > BTreeTupleGetHeapTID(state->newitem))
+		{
+			/*
+			 * newitem has heap TID in the range of the would-be new posting
+			 * list.  Avoid an immediate posting list split for caller.
+			 */
+			if (_bt_keep_natts_fast(state->rel, state->newitem, itup) >
+				IndexRelationGetNumberOfAttributes(state->rel))
+			{
+				state->newitem = NULL;	/* avoid unnecessary comparisons */
+				return false;
+			}
+		}
+	}
+
+	/*
+	 * Save heap TIDs to pending posting list tuple -- itup can be merged into
+	 * pending posting list
+	 */
+	state->nitems++;
+	memcpy(state->htids + state->nhtids, htids,
+		   sizeof(ItemPointerData) * nhtids);
+	state->nhtids += nhtids;
+	state->alltupsize += MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
+
+	return true;
+}
+
+/*
+ * Finalize pending posting list tuple, and add it to the page.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * Returns space saving from deduplicating to make a new posting list tuple.
+ * Note that this includes line pointer overhead.  This is zero in the case
+ * where no deduplication was possible.
+ */
+Size
+_bt_dedup_finish_pending(Buffer buffer, BTDedupState *state, bool need_wal)
+{
+	Size		spacesaving = 0;
+	Page		page = BufferGetPage(buffer);
+	int			minimum = 2;
+
+	Assert(state->nitems > 0);
+	Assert(state->nitems <= state->nhtids);
+	Assert(state->interval.baseoff == state->baseoff);
+
+	/*
+	 * Only create a posting list when at least 3 heap TIDs will appear in the
+	 * checkingunique case (checkingunique strategy won't merge existing
+	 * posting list tuples, so we know that the number of items here must also
+	 * be the total number of heap TIDs).  Creating a new posting lists with
+	 * only two heap TIDs won't even save enough space to fit another
+	 * duplicate with the same key as the posting list.  This is a bad
+	 * trade-off if there is a chance that the LP_DEAD bit can be set for
+	 * either existing tuple by putting off deduplication.
+	 *
+	 * (Note that a second pass over the page can deduplicate the item if that
+	 * is truly the only way to avoid a page split for checkingunique caller)
+	 */
+	Assert(!state->checkingunique || state->nitems == 1 ||
+		   state->nhtids == state->nitems);
+	if (state->checkingunique)
+	{
+		minimum = 3;
+		/* May begin here if second pass over page is required */
+		if (state->nitems == 2 && state->skippedbase == InvalidOffsetNumber)
+			state->skippedbase = state->baseoff;
+	}
+
+	if (state->nitems >= minimum)
+	{
+		IndexTuple	final;
+		Size		finalsz;
+		OffsetNumber offnum;
+		OffsetNumber deletable[MaxOffsetNumber];
+		int			ndeletable = 0;
+
+		/* find all tuples that will be replaced with this new posting tuple */
+		for (offnum = state->baseoff;
+			 offnum < state->baseoff + state->nitems;
+			 offnum = OffsetNumberNext(offnum))
+			deletable[ndeletable++] = offnum;
+
+		/* Form a tuple with a posting list */
+		final = _bt_form_posting(state->base, state->htids, state->nhtids);
+		finalsz = IndexTupleSize(final);
+		spacesaving = state->alltupsize - (finalsz + sizeof(ItemIdData));
+		/* Must have saved some space */
+		Assert(spacesaving > 0 && spacesaving < BLCKSZ);
+
+		/* Save final number of items for posting list */
+		state->interval.nitems = state->nitems;
+
+		Assert(finalsz <= state->maxitemsize);
+		Assert(finalsz == MAXALIGN(IndexTupleSize(final)));
+
+		START_CRIT_SECTION();
+
+		/* Delete items to replace */
+		PageIndexMultiDelete(page, deletable, ndeletable);
+		/* Insert posting tuple */
+		if (PageAddItem(page, (Item) final, finalsz, state->baseoff, false,
+						false) == InvalidOffsetNumber)
+			elog(ERROR, "deduplication failed to add tuple to page");
+
+		MarkBufferDirty(buffer);
+
+		/* Log deduplicated items */
+		if (need_wal)
+		{
+			XLogRecPtr	recptr;
+			xl_btree_dedup xlrec_dedup;
+
+			xlrec_dedup.baseoff = state->interval.baseoff;
+			xlrec_dedup.nitems = state->interval.nitems;
+
+			XLogBeginInsert();
+			XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+			XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+			recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP_PAGE);
+
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
+
+		pfree(final);
+	}
+
+	/* Reset state for next pending posting list */
+	state->nhtids = 0;
+	state->nitems = 0;
+	state->alltupsize = 0;
+
+	return spacesaving;
+}
+
+/*
+ * Build a posting list tuple from a "base" index tuple and a list of heap
+ * TIDs for posting list.
+ *
+ * Caller's "htids" array must be sorted in ascending order.  Any heap TIDs
+ * from caller's base tuple will not appear in returned posting list.
+ *
+ * If nhtids == 1, builds a non-posting tuple (posting list tuples can never
+ * have a single heap TID).
+ */
+IndexTuple
+_bt_form_posting(IndexTuple tuple, ItemPointer htids, int nhtids)
+{
+	uint32		keysize,
+				newsize = 0;
+	IndexTuple	itup;
+
+	/* We only need key part of the tuple */
+	if (BTreeTupleIsPosting(tuple))
+		keysize = BTreeTupleGetPostingOffset(tuple);
+	else
+		keysize = IndexTupleSize(tuple);
+
+	Assert(nhtids > 0);
+
+	/* Add space needed for posting list */
+	if (nhtids > 1)
+		newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nhtids;
+	else
+		newsize = keysize;
+
+	newsize = MAXALIGN(newsize);
+	itup = palloc0(newsize);
+	memcpy(itup, tuple, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+
+	if (nhtids > 1)
+	{
+		/* Form posting tuple, fill posting fields */
+
+		itup->t_info |= INDEX_ALT_TID_MASK;
+		BTreeSetPostingMeta(itup, nhtids, SHORTALIGN(keysize));
+		/* Copy posting list into the posting tuple */
+		memcpy(BTreeTupleGetPosting(itup), htids,
+			   sizeof(ItemPointerData) * nhtids);
+
+#ifdef USE_ASSERT_CHECKING
+		{
+			/* Assert that htid array is sorted and has unique TIDs */
+			ItemPointerData last;
+			ItemPointer current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+				current = BTreeTupleGetPostingN(itup, i);
+				Assert(ItemPointerCompare(current, &last) > 0);
+				ItemPointerCopy(current, &last);
+			}
+		}
+#endif
+	}
+	else
+	{
+		/* To finish building of a non-posting tuple, copy TID from htids */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		ItemPointerCopy(htids, &itup->t_tid);
+	}
+
+	return itup;
+}
+
+/*
+ * Prepare for a posting list split by swapping heap TID in newitem with heap
+ * TID from original posting list (the 'oposting' heap TID located at offset
+ * 'postingoff').
+ *
+ * Returns new posting list tuple, which is palloc()'d in caller's context.
+ * This is guaranteed to be the same size as 'oposting'.  Modified version of
+ * newitem is what caller actually inserts inside the critical section that
+ * also performs an in-place update of posting list.
+ *
+ * Explicit WAL-logging of newitem must use the original version of newitem in
+ * order to make it possible for our nbtxlog.c callers to correctly REDO
+ * original steps.  (This approach avoids any explicit WAL-logging of a
+ * posting list tuple.  This is important because posting lists are often much
+ * larger than plain tuples.)
+ */
+IndexTuple
+_bt_swap_posting(IndexTuple newitem, IndexTuple oposting, int postingoff)
+{
+	int			nhtids;
+	char	   *replacepos;
+	char	   *rightpos;
+	Size		nbytes;
+	IndexTuple	nposting;
+
+	nhtids = BTreeTupleGetNPosting(oposting);
+	Assert(postingoff > 0 && postingoff < nhtids);
+
+	nposting = CopyIndexTuple(oposting);
+	replacepos = (char *) BTreeTupleGetPostingN(nposting, postingoff);
+	rightpos = replacepos + sizeof(ItemPointerData);
+	nbytes = (nhtids - postingoff - 1) * sizeof(ItemPointerData);
+
+	/*
+	 * Move item pointers in posting list to make a gap for the new item's
+	 * heap TID (shift TIDs one place to the right, losing original rightmost
+	 * TID)
+	 */
+	memmove(rightpos, replacepos, nbytes);
+
+	/* Fill the gap with the TID of the new item */
+	ItemPointerCopy(&newitem->t_tid, (ItemPointer) replacepos);
+
+	/* Copy original posting list's rightmost TID into new item */
+	ItemPointerCopy(BTreeTupleGetPostingN(oposting, nhtids - 1),
+					&newitem->t_tid);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(nposting),
+							  BTreeTupleGetHeapTID(newitem)) < 0);
+	Assert(BTreeTupleGetNPosting(oposting) == BTreeTupleGetNPosting(nposting));
+
+	return nposting;
+}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b84bf1c3df..e5f6023ad0 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,10 +47,12 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   BTStack stack,
 						   IndexTuple itup,
 						   OffsetNumber newitemoff,
+						   int postingoff,
 						   bool split_only_page);
 static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 						Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
-						IndexTuple newitem);
+						IndexTuple newitem, IndexTuple orignewitem,
+						IndexTuple nposting, OffsetNumber postingoff);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 							  BTStack stack, bool is_root, bool is_only);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
@@ -61,7 +63,8 @@ static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
  *	_bt_doinsert() -- Handle insertion of a single index tuple in the tree.
  *
  *		This routine is called by the public interface routine, btinsert.
- *		By here, itup is filled in, including the TID.
+ *		By here, itup is filled in, including the TID.  Caller should be
+ *		prepared for us to scribble on 'itup'.
  *
  *		If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
  *		will allow duplicates.  Otherwise (UNIQUE_CHECK_YES or
@@ -125,6 +128,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	insertstate.itup_key = itup_key;
 	insertstate.bounds_valid = false;
 	insertstate.buf = InvalidBuffer;
+	insertstate.postingoff = 0;
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -300,7 +304,7 @@ top:
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
 		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
-					   itup, newitemoff, false);
+					   itup, newitemoff, insertstate.postingoff, false);
 	}
 	else
 	{
@@ -353,6 +357,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 	BTPageOpaque opaque;
 	Buffer		nbuf = InvalidBuffer;
 	bool		found = false;
+	bool		inposting = false;
+	bool		prev_all_dead = true;
+	int			curposti = 0;
 
 	/* Assume unique until we find a duplicate */
 	*is_unique = true;
@@ -374,6 +381,11 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 
 	/*
 	 * Scan over all equal tuples, looking for live conflicts.
+	 *
+	 * Note that each iteration of the loop processes one heap TID, not one
+	 * index tuple.  The page offset number won't be advanced for iterations
+	 * which process heap TIDs from posting list tuples until the last such
+	 * heap TID for the posting list (curposti will be advanced instead).
 	 */
 	Assert(!insertstate->bounds_valid || insertstate->low == offset);
 	Assert(!itup_key->anynullkeys);
@@ -435,7 +447,27 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 
 				/* okay, we gotta fetch the heap tuple ... */
 				curitup = (IndexTuple) PageGetItem(page, curitemid);
-				htid = curitup->t_tid;
+
+				/*
+				 * decide if this is the first heap TID in tuple we'll
+				 * process, or if we should continue to process current
+				 * posting list
+				 */
+				if (!BTreeTupleIsPosting(curitup))
+				{
+					htid = curitup->t_tid;
+					inposting = false;
+				}
+				else if (!inposting)
+				{
+					/* First heap TID in posting list */
+					inposting = true;
+					prev_all_dead = true;
+					curposti = 0;
+				}
+
+				if (inposting)
+					htid = *BTreeTupleGetPostingN(curitup, curposti);
 
 				/*
 				 * If we are doing a recheck, we expect to find the tuple we
@@ -511,8 +543,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					 * not part of this chain because it had a different index
 					 * entry.
 					 */
-					htid = itup->t_tid;
-					if (table_index_fetch_tuple_check(heapRel, &htid,
+					if (table_index_fetch_tuple_check(heapRel, &itup->t_tid,
 													  SnapshotSelf, NULL))
 					{
 						/* Normal case --- it's still live */
@@ -570,12 +601,14 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 													RelationGetRelationName(rel))));
 					}
 				}
-				else if (all_dead)
+				else if (all_dead && (!inposting ||
+									  (prev_all_dead &&
+									   curposti == BTreeTupleGetNPosting(curitup) - 1)))
 				{
 					/*
-					 * The conflicting tuple (or whole HOT chain) is dead to
-					 * everyone, so we may as well mark the index entry
-					 * killed.
+					 * The conflicting tuple (or all HOT chains pointed to by
+					 * all posting list TIDs) is dead to everyone, so mark the
+					 * index entry killed.
 					 */
 					ItemIdMarkDead(curitemid);
 					opaque->btpo_flags |= BTP_HAS_GARBAGE;
@@ -589,14 +622,29 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					else
 						MarkBufferDirtyHint(insertstate->buf, true);
 				}
+
+				/*
+				 * Remember if posting list tuple has even a single HOT chain
+				 * whose members are not all dead
+				 */
+				if (!all_dead && inposting)
+					prev_all_dead = false;
 			}
 		}
 
-		/*
-		 * Advance to next tuple to continue checking.
-		 */
-		if (offset < maxoff)
+		if (inposting && curposti < BTreeTupleGetNPosting(curitup) - 1)
+		{
+			/* Advance to next TID in same posting list */
+			curposti++;
+			continue;
+		}
+		else if (offset < maxoff)
+		{
+			/* Advance to next tuple */
+			curposti = 0;
+			inposting = false;
 			offset = OffsetNumberNext(offset);
+		}
 		else
 		{
 			int			highkeycmp;
@@ -621,6 +669,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					elog(ERROR, "fell off the end of index \"%s\"",
 						 RelationGetRelationName(rel));
 			}
+			curposti = 0;
+			inposting = false;
 			maxoff = PageGetMaxOffsetNumber(page);
 			offset = P_FIRSTDATAKEY(opaque);
 			/* Don't invalidate binary search bounds */
@@ -689,6 +739,7 @@ _bt_findinsertloc(Relation rel,
 	BTScanInsert itup_key = insertstate->itup_key;
 	Page		page = BufferGetPage(insertstate->buf);
 	BTPageOpaque lpageop;
+	OffsetNumber location;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -751,13 +802,26 @@ _bt_findinsertloc(Relation rel,
 
 		/*
 		 * If the target page is full, see if we can obtain enough space by
-		 * erasing LP_DEAD items
+		 * erasing LP_DEAD items.  If that doesn't work out, and if the index
+		 * deduplication is both possible and enabled, try deduplication.
 		 */
-		if (PageGetFreeSpace(page) < insertstate->itemsz &&
-			P_HAS_GARBAGE(lpageop))
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
 		{
-			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
-			insertstate->bounds_valid = false;
+			if (P_HAS_GARBAGE(lpageop))
+			{
+				_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+				insertstate->bounds_valid = false;
+			}
+
+			if (insertstate->itup_key->safededup &&
+				BtreeGetDoDedupOption(rel) &&
+				PageGetFreeSpace(page) < insertstate->itemsz)
+			{
+				_bt_dedup_one_page(rel, insertstate->buf, heapRel,
+								   insertstate->itup, insertstate->itemsz,
+								   checkingunique);
+				insertstate->bounds_valid = false;
+			}
 		}
 	}
 	else
@@ -839,7 +903,38 @@ _bt_findinsertloc(Relation rel,
 	Assert(P_RIGHTMOST(lpageop) ||
 		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
 
-	return _bt_binsrch_insert(rel, insertstate);
+	location = _bt_binsrch_insert(rel, insertstate);
+
+	/*
+	 * Insertion is not prepared for the case where an LP_DEAD posting list
+	 * tuple must be split.  In the unlikely event that this happens, call
+	 * _bt_dedup_one_page() to force it to kill all LP_DEAD items.
+	 */
+	if (unlikely(insertstate->postingoff == -1))
+	{
+		Assert(insertstate->itup_key->safededup);
+
+		/*
+		 * Don't check if the option is enabled, since no actual deduplication
+		 * will be done, just cleanup.
+		 */
+		_bt_dedup_one_page(rel, insertstate->buf, heapRel, insertstate->itup,
+						   0, checkingunique);
+		Assert(!P_HAS_GARBAGE(lpageop));
+
+		/* Must reset insertstate ahead of new _bt_binsrch_insert() call */
+		insertstate->bounds_valid = false;
+		insertstate->postingoff = 0;
+		location = _bt_binsrch_insert(rel, insertstate);
+
+		/*
+		 * Might still have to split some other posting list now, but that
+		 * should never be LP_DEAD
+		 */
+		Assert(insertstate->postingoff >= 0);
+	}
+
+	return location;
 }
 
 /*
@@ -905,10 +1000,12 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		This recursive procedure does the following things:
  *
+ *			+  if necessary, splits an existing posting list on page.
+ *			   This is only needed when 'postingoff' is non-zero.
  *			+  if necessary, splits the target page, using 'itup_key' for
  *			   suffix truncation on leaf pages (caller passes NULL for
  *			   non-leaf pages).
- *			+  inserts the tuple.
+ *			+  inserts the new tuple (could be from split posting list).
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
  *			   right using information stored in the parent stack).
@@ -918,7 +1015,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		On entry, we must have the correct buffer in which to do the
  *		insertion, and the buffer must be pinned and write-locked.  On return,
- *		we will have dropped both the pin and the lock on the buffer.
+ *		we will have dropped both the pin and the lock on the buffer.  Caller
+ *		should be prepared for us to scribble on 'itup'.
  *
  *		This routine only performs retail tuple insertions.  'itup' should
  *		always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +1034,15 @@ _bt_insertonpg(Relation rel,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
+			   int postingoff,
 			   bool split_only_page)
 {
 	Page		page;
 	BTPageOpaque lpageop;
 	Size		itemsz;
+	IndexTuple	oposting;
+	IndexTuple	origitup = NULL;
+	IndexTuple	nposting = NULL;
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1056,8 @@ _bt_insertonpg(Relation rel,
 	Assert(P_ISLEAF(lpageop) ||
 		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
+	/* retail insertions of posting list tuples are disallowed */
+	Assert(!BTreeTupleIsPosting(itup));
 
 	/* The caller should've finished any incomplete splits already. */
 	if (P_INCOMPLETE_SPLIT(lpageop))
@@ -964,6 +1068,39 @@ _bt_insertonpg(Relation rel,
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
+	/*
+	 * Do we need to split an existing posting list item?
+	 */
+	if (postingoff != 0)
+	{
+		ItemId		itemid = PageGetItemId(page, newitemoff);
+
+		/*
+		 * The new tuple is a duplicate with a heap TID that falls inside the
+		 * range of an existing posting list tuple on a leaf page.  Prepare to
+		 * split an existing posting list by swapping new item's heap TID with
+		 * the rightmost heap TID from original posting list, and generating a
+		 * new version of the posting list that has new item's heap TID.
+		 *
+		 * Posting list splits work by modifying the overlapping posting list
+		 * as part of the same atomic operation that inserts the "new item".
+		 * The space accounting is kept simple, since it does not need to
+		 * consider posting list splits at all (this is particularly important
+		 * for the case where we also have to split the page).  Overwriting
+		 * the posting list with its post-split version is treated as an extra
+		 * step in either the insert or page split critical section.
+		 */
+		Assert(P_ISLEAF(lpageop) && !ItemIdIsDead(itemid));
+		oposting = (IndexTuple) PageGetItem(page, itemid);
+
+		/* save a copy of itup with unchanged TID for xlog record */
+		origitup = CopyIndexTuple(itup);
+		nposting = _bt_swap_posting(itup, oposting, postingoff);
+
+		/* Alter offset so that it goes after existing posting list */
+		newitemoff = OffsetNumberNext(newitemoff);
+	}
+
 	/*
 	 * Do we need to split the page to fit the item on it?
 	 *
@@ -996,7 +1133,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+						 origitup, nposting, postingoff);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1075,6 +1213,13 @@ _bt_insertonpg(Relation rel,
 			elog(PANIC, "failed to add new item to block %u in index \"%s\"",
 				 itup_blkno, RelationGetRelationName(rel));
 
+		/*
+		 * Posting list split requires an in-place update of the existing
+		 * posting list
+		 */
+		if (nposting)
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
 		MarkBufferDirty(buf);
 
 		if (BufferIsValid(metabuf))
@@ -1116,6 +1261,7 @@ _bt_insertonpg(Relation rel,
 			XLogRecPtr	recptr;
 
 			xlrec.offnum = itup_off;
+			xlrec.postingoff = postingoff;
 
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
@@ -1144,6 +1290,7 @@ _bt_insertonpg(Relation rel,
 				xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 				xlmeta.last_cleanup_num_heap_tuples =
 					metad->btm_last_cleanup_num_heap_tuples;
+				xlmeta.btm_safededup = metad->btm_safededup;
 
 				XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 				XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));
@@ -1152,7 +1299,19 @@ _bt_insertonpg(Relation rel,
 			}
 
 			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
-			XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+
+			/*
+			 * We always write newitem to the page, but when there is an
+			 * original newitem due to a posting list split then we log the
+			 * original item instead.  REDO routine must reconstruct the final
+			 * newitem at the same time it reconstructs nposting.
+			 */
+			if (postingoff == 0)
+				XLogRegisterBufData(0, (char *) itup,
+									IndexTupleSize(itup));
+			else
+				XLogRegisterBufData(0, (char *) origitup,
+									IndexTupleSize(origitup));
 
 			recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
@@ -1194,6 +1353,13 @@ _bt_insertonpg(Relation rel,
 			_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
 			RelationSetTargetBlock(rel, cachedBlock);
 	}
+
+	/* be tidy */
+	if (postingoff != 0)
+	{
+		pfree(nposting);
+		pfree(origitup);
+	}
 }
 
 /*
@@ -1209,12 +1375,25 @@ _bt_insertonpg(Relation rel,
  *		This function will clear the INCOMPLETE_SPLIT flag on it, and
  *		release the buffer.
  *
+ *		orignewitem, nposting, and postingoff are needed when an insert of
+ *		orignewitem results in both a posting list split and a page split.
+ *		newitem and nposting are replacements for orignewitem and the
+ *		existing posting list on the page respectively.  These extra
+ *		posting list split details are used here in the same way as they
+ *		are used in the more common case where a posting list split does
+ *		not coincide with a page split.  We need to deal with posting list
+ *		splits directly in order to ensure that everything that follows
+ *		from the insert of orignewitem is handled as a single atomic
+ *		operation (though caller's insert of a new pivot/downlink into
+ *		parent page will still be a separate operation).
+ *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
  */
 static Buffer
 _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+		  IndexTuple orignewitem, IndexTuple nposting, OffsetNumber postingoff)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1236,12 +1415,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	OffsetNumber firstright;
 	OffsetNumber maxoff;
 	OffsetNumber i;
+	OffsetNumber replacepostingoff = InvalidOffsetNumber;
 	bool		newitemonleft,
 				isleaf;
 	IndexTuple	lefthikey;
 	int			indnatts = IndexRelationGetNumberOfAttributes(rel);
 	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
+	/*
+	 * Determine offset number of existing posting list on page when a split
+	 * of a posting list needs to take place as the page is split
+	 */
+	if (nposting != NULL)
+	{
+		Assert(itup_key->heapkeyspace);
+		replacepostingoff = OffsetNumberPrev(newitemoff);
+	}
+
 	/*
 	 * origpage is the original page to be split.  leftpage is a temporary
 	 * buffer that receives the left-sibling data, which will be copied back
@@ -1273,6 +1463,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	 * newitemoff == firstright.  In all other cases it's clear which side of
 	 * the split every tuple goes on from context.  newitemonleft is usually
 	 * (but not always) redundant information.
+	 *
+	 * Note: In theory, the split point choice logic should operate against a
+	 * version of the page that already replaced the posting list at offset
+	 * replacepostingoff with nposting where applicable.  We don't bother with
+	 * that, though.  Both versions of the posting list must be the same size,
+	 * and both will have the same base tuple key values, so split point
+	 * choice is never affected.
 	 */
 	firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
 								  newitem, &newitemonleft);
@@ -1340,6 +1537,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, firstright);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		/* Behave as if origpage posting list has already been swapped */
+		if (firstright == replacepostingoff)
+			item = nposting;
 	}
 
 	/*
@@ -1373,6 +1573,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
 			itemid = PageGetItemId(origpage, lastleftoff);
 			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+			/* Behave as if origpage posting list has already been swapped */
+			if (lastleftoff == replacepostingoff)
+				lastleft = nposting;
 		}
 
 		Assert(lastleft != item);
@@ -1480,8 +1683,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 
+		/*
+		 * did caller pass new replacement posting list tuple due to posting
+		 * list split?
+		 */
+		if (i == replacepostingoff)
+		{
+			/*
+			 * swap origpage posting list with post-posting-list-split version
+			 * from caller
+			 */
+			Assert(isleaf);
+			Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+			item = nposting;
+		}
+
 		/* does new item belong before this one? */
-		if (i == newitemoff)
+		else if (i == newitemoff)
 		{
 			if (newitemonleft)
 			{
@@ -1650,8 +1868,12 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		XLogRecPtr	recptr;
 
 		xlrec.level = ropaque->btpo.level;
+		/* See comments below on newitem, orignewitem, and posting lists */
 		xlrec.firstright = firstright;
 		xlrec.newitemoff = newitemoff;
+		xlrec.postingoff = InvalidOffsetNumber;
+		if (replacepostingoff < firstright)
+			xlrec.postingoff = postingoff;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1670,11 +1892,45 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		 * because it's included with all the other items on the right page.)
 		 * Show the new item as belonging to the left page buffer, so that it
 		 * is not stored if XLogInsert decides it needs a full-page image of
-		 * the left page.  We store the offset anyway, though, to support
-		 * archive compression of these records.
+		 * the left page.  We always store newitemoff in the record, though.
+		 *
+		 * The details are sometimes slightly different for page splits that
+		 * coincide with a posting list split.  If both the replacement
+		 * posting list and newitem go on the right page, then we don't need
+		 * to log anything extra, just like the simple !newitemonleft
+		 * no-posting-split case (postingoff isn't set in the WAL record, so
+		 * recovery doesn't need to process a posting list split at all).
+		 * Otherwise, we set postingoff and log orignewitem instead of
+		 * newitem, despite having actually inserted newitem.  Recovery must
+		 * reconstruct nposting and newitem by calling _bt_swap_posting().
+		 *
+		 * Note: It's possible that our page split point is the point that
+		 * makes the posting list lastleft and newitem firstright.  This is
+		 * the only case where we log orignewitem despite newitem going on the
+		 * right page.  If XLogInsert decides that it can omit orignewitem due
+		 * to logging a full-page image of the left page, everything still
+		 * works out, since recovery only needs to log orignewitem for items
+		 * on the left page (just like the regular newitem-logged case).
 		 */
-		if (newitemonleft)
-			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+		if (newitemonleft || xlrec.postingoff != InvalidOffsetNumber)
+		{
+			if (xlrec.postingoff == InvalidOffsetNumber)
+			{
+				/* Must WAL-log newitem, since it's on left page */
+				Assert(newitemonleft);
+				Assert(orignewitem == NULL && nposting == NULL);
+				XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+			}
+			else
+			{
+				/* Must WAL-log orignewitem following posting list split */
+				Assert(newitemonleft || firstright == newitemoff);
+				Assert(ItemPointerCompare(&orignewitem->t_tid,
+										  &newitem->t_tid) < 0);
+				XLogRegisterBufData(0, (char *) orignewitem,
+									MAXALIGN(IndexTupleSize(orignewitem)));
+			}
+		}
 
 		/* Log the left page's new high key */
 		itemid = PageGetItemId(origpage, P_HIKEY);
@@ -1834,7 +2090,7 @@ _bt_insert_parent(Relation rel,
 
 		/* Recursively insert into the parent */
 		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
-					   new_item, stack->bts_offset + 1,
+					   new_item, stack->bts_offset + 1, 0,
 					   is_only);
 
 		/* be tidy */
@@ -2190,6 +2446,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 		md.fastlevel = metad->btm_level;
 		md.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 		md.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+		md.btm_safededup = metad->btm_safededup;
 
 		XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -2304,6 +2561,6 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * Note: if we didn't find any LP_DEAD items, then the page's
 	 * BTP_HAS_GARBAGE hint bit is falsely set.  We do not bother expending a
 	 * separate write to clear it, however.  We will clear it when we split
-	 * the page.
+	 * the page (or when deduplication runs).
 	 */
 }
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 268f869a36..77f443f7a9 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -42,12 +43,18 @@ static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
 								   BlockNumber *target, BlockNumber *rightsib);
 static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
 							   TransactionId latestRemovedXid);
+static TransactionId _bt_compute_xid_horizon_for_tuples(Relation rel,
+														Relation heapRel,
+														Buffer buf,
+														OffsetNumber *itemnos,
+														int nitems);
 
 /*
  *	_bt_initmetapage() -- Fill a page buffer with a correct metapage image
  */
 void
-_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
+_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+				 bool safededup)
 {
 	BTMetaPageData *metad;
 	BTPageOpaque metaopaque;
@@ -63,6 +70,7 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
 	metad->btm_fastlevel = level;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+	metad->btm_safededup = safededup;
 
 	metaopaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	metaopaque->btpo_flags = BTP_META;
@@ -102,6 +110,9 @@ _bt_upgrademetapage(Page page)
 	metad->btm_version = BTREE_NOVAC_VERSION;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+	/* Only a REINDEX can set this field */
+	Assert(!metad->btm_safededup);
+	metad->btm_safededup = false;
 
 	/* Adjust pd_lower (see _bt_initmetapage() for details) */
 	((PageHeader) page)->pd_lower =
@@ -213,6 +224,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 		md.fastlevel = metad->btm_fastlevel;
 		md.oldest_btpo_xact = oldestBtpoXact;
 		md.last_cleanup_num_heap_tuples = numHeapTuples;
+		md.btm_safededup = metad->btm_safededup;
 
 		XLogRegisterBufData(0, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -274,6 +286,8 @@ _bt_getroot(Relation rel, int access)
 		Assert(metad->btm_magic == BTREE_MAGIC);
 		Assert(metad->btm_version >= BTREE_MIN_VERSION);
 		Assert(metad->btm_version <= BTREE_VERSION);
+		Assert(!metad->btm_safededup ||
+			   metad->btm_version > BTREE_NOVAC_VERSION);
 		Assert(metad->btm_root != P_NONE);
 
 		rootblkno = metad->btm_fastroot;
@@ -394,6 +408,7 @@ _bt_getroot(Relation rel, int access)
 			md.fastlevel = 0;
 			md.oldest_btpo_xact = InvalidTransactionId;
 			md.last_cleanup_num_heap_tuples = -1.0;
+			md.btm_safededup = metad->btm_safededup;
 
 			XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -618,6 +633,7 @@ _bt_getrootheight(Relation rel)
 	Assert(metad->btm_magic == BTREE_MAGIC);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
 	Assert(metad->btm_version <= BTREE_VERSION);
+	Assert(!metad->btm_safededup || metad->btm_version > BTREE_NOVAC_VERSION);
 	Assert(metad->btm_fastroot != P_NONE);
 
 	return metad->btm_fastlevel;
@@ -683,6 +699,56 @@ _bt_heapkeyspace(Relation rel)
 	return metad->btm_version > BTREE_NOVAC_VERSION;
 }
 
+/*
+ *	_bt_safededup() -- can deduplication safely be used by index?
+ *
+ * Uses field from index relation's metapage/cached metapage.
+ */
+bool
+_bt_safededup(Relation rel)
+{
+	BTMetaPageData *metad;
+
+	if (rel->rd_amcache == NULL)
+	{
+		Buffer		metabuf;
+
+		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+		metad = _bt_getmeta(rel, metabuf);
+
+		/*
+		 * If there's no root page yet, _bt_getroot() doesn't expect a cache
+		 * to be made, so just stop here.  (XXX perhaps _bt_getroot() should
+		 * be changed to allow this case.)
+		 *
+		 * Note that we rely on the assumption that this field will be zero'ed
+		 * on indexes that were pg_upgrade'd.
+		 */
+		if (metad->btm_root == P_NONE)
+		{
+			_bt_relbuf(rel, metabuf);
+			return metad->btm_safededup;;
+		}
+
+		/* Cache the metapage data for next time */
+		rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
+											 sizeof(BTMetaPageData));
+		memcpy(rel->rd_amcache, metad, sizeof(BTMetaPageData));
+		_bt_relbuf(rel, metabuf);
+	}
+
+	/* Get cached page */
+	metad = (BTMetaPageData *) rel->rd_amcache;
+	/* We shouldn't have cached it if any of these fail */
+	Assert(metad->btm_magic == BTREE_MAGIC);
+	Assert(metad->btm_version >= BTREE_MIN_VERSION);
+	Assert(metad->btm_version <= BTREE_VERSION);
+	Assert(!metad->btm_safededup || metad->btm_version > BTREE_NOVAC_VERSION);
+	Assert(metad->btm_fastroot != P_NONE);
+
+	return metad->btm_safededup;
+}
+
 /*
  *	_bt_checkpage() -- Verify that a freshly-read page looks sane.
  */
@@ -983,14 +1049,52 @@ _bt_page_recyclable(Page page)
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
 					OffsetNumber *itemnos, int nitems,
+					OffsetNumber *updateitemnos,
+					IndexTuple *updated, int nupdatable,
 					BlockNumber lastBlockVacuumed)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	Size		itemsz;
+	Size		updated_sz = 0;
+	char	   *updated_buf = NULL;
+
+	/* XLOG stuff, buffer for updateds */
+	if (nupdatable > 0 && RelationNeedsWAL(rel))
+	{
+		Size		offset = 0;
+
+		for (int i = 0; i < nupdatable; i++)
+			updated_sz += MAXALIGN(IndexTupleSize(updated[i]));
+
+		updated_buf = palloc(updated_sz);
+		for (int i = 0; i < nupdatable; i++)
+		{
+			itemsz = IndexTupleSize(updated[i]);
+			memcpy(updated_buf + offset, (char *) updated[i], itemsz);
+			offset += MAXALIGN(itemsz);
+		}
+		Assert(offset == updated_sz);
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
+	/* Handle posting tuples here */
+	for (int i = 0; i < nupdatable; i++)
+	{
+		/* At first, delete the old tuple. */
+		PageIndexTupleDelete(page, updateitemnos[i]);
+
+		itemsz = IndexTupleSize(updated[i]);
+		itemsz = MAXALIGN(itemsz);
+
+		/* Add tuple with updated ItemPointers to the page. */
+		if (PageAddItem(page, (Item) updated[i], itemsz, updateitemnos[i],
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to rewrite posting list item in index while doing vacuum");
+	}
+
 	/* Fix the page */
 	if (nitems > 0)
 		PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1124,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+		xlrec_vacuum.nupdated = nupdatable;
+		xlrec_vacuum.ndeleted = nitems;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1139,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		if (nitems > 0)
 			XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
 
+		/*
+		 * Here we should save offnums and updated tuples themselves. It's
+		 * important to restore them in correct order. At first, we must
+		 * handle updated tuples and only after that other deleted items.
+		 */
+		if (nupdatable > 0)
+		{
+			Assert(updated_buf != NULL);
+			XLogRegisterBufData(0, (char *) updateitemnos,
+								nupdatable * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, updated_buf, updated_sz);
+		}
+
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
 		PageSetLSN(page, recptr);
@@ -1041,6 +1160,91 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	END_CRIT_SECTION();
 }
 
+/*
+ * Get the latestRemovedXid from the table entries pointed at by the index
+ * tuples being deleted.
+ *
+ * This is a version of index_compute_xid_horizon_for_tuples() specialized to
+ * nbtree, which can handle posting lists.
+ */
+static TransactionId
+_bt_compute_xid_horizon_for_tuples(Relation rel, Relation heapRel,
+								   Buffer buf, OffsetNumber *itemnos,
+								   int nitems)
+{
+	ItemPointer htids;
+	TransactionId latestRemovedXid = InvalidTransactionId;
+	Page		page = BufferGetPage(buf);
+	int			arraynitems;
+	int			finalnitems;
+
+	/*
+	 * Initial size of array can fit everything when it turns out that are no
+	 * posting lists
+	 */
+	arraynitems = nitems;
+	htids = (ItemPointer) palloc(sizeof(ItemPointerData) * arraynitems);
+
+	finalnitems = 0;
+	/* identify what the index tuples about to be deleted point to */
+	for (int i = 0; i < nitems; i++)
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+
+		itemid = PageGetItemId(page, itemnos[i]);
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(ItemIdIsDead(itemid));
+
+		if (!BTreeTupleIsPosting(itup))
+		{
+			/* Make sure that we have space for additional heap TID */
+			if (finalnitems + 1 > arraynitems)
+			{
+				arraynitems = arraynitems * 2;
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			Assert(ItemPointerIsValid(&itup->t_tid));
+			ItemPointerCopy(&itup->t_tid, &htids[finalnitems]);
+			finalnitems++;
+		}
+		else
+		{
+			int			nposting = BTreeTupleGetNPosting(itup);
+
+			/* Make sure that we have space for additional heap TIDs */
+			if (finalnitems + nposting > arraynitems)
+			{
+				arraynitems = Max(arraynitems * 2, finalnitems + nposting);
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			for (int j = 0; j < nposting; j++)
+			{
+				ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+				Assert(ItemPointerIsValid(htid));
+				ItemPointerCopy(htid, &htids[finalnitems]);
+				finalnitems++;
+			}
+		}
+	}
+
+	Assert(finalnitems >= nitems);
+
+	/* determine the actual xid horizon */
+	latestRemovedXid =
+		table_compute_xid_horizon_for_tuples(heapRel, htids, finalnitems);
+
+	pfree(htids);
+
+	return latestRemovedXid;
+}
+
 /*
  * Delete item(s) from a btree page during single-page cleanup.
  *
@@ -1067,8 +1271,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 
 	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
 		latestRemovedXid =
-			index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
-												 itemnos, nitems);
+			_bt_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+											   itemnos, nitems);
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
@@ -2066,6 +2270,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 			xlmeta.fastlevel = metad->btm_fastlevel;
 			xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 			xlmeta.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+			xlmeta.btm_safededup = metad->btm_safededup;
 
 			XLogRegisterBufData(4, (char *) &xlmeta, sizeof(xl_btree_metadata));
 			xlinfo = XLOG_BTREE_UNLINK_PAGE_META;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd5289ad..2cdc3d499f 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static ItemPointer btreevacuumposting(BTVacState *vstate, IndexTuple itup,
+									  int *nremaining);
 
 
 /*
@@ -160,7 +162,7 @@ btbuildempty(Relation index)
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, P_NONE, 0);
+	_bt_initmetapage(metapage, P_NONE, 0, _bt_opclasses_support_dedup(index));
 
 	/*
 	 * Write the page and log it.  It might seem that an immediate sync would
@@ -263,8 +265,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 				 */
 				if (so->killedItems == NULL)
 					so->killedItems = (int *)
-						palloc(MaxIndexTuplesPerPage * sizeof(int));
-				if (so->numKilled < MaxIndexTuplesPerPage)
+						palloc(MaxBTreeIndexTuplesPerPage * sizeof(int));
+				if (so->numKilled < MaxBTreeIndexTuplesPerPage)
 					so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 			}
 
@@ -816,7 +818,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 	}
 	else
 	{
-		StdRdOptions *relopts;
+		BtreeOptions *relopts;
 		float8		cleanup_scale_factor;
 		float8		prev_num_heap_tuples;
 
@@ -827,7 +829,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 		 * tuples exceeds vacuum_cleanup_index_scale_factor fraction of
 		 * original tuples count.
 		 */
-		relopts = (StdRdOptions *) info->index->rd_options;
+		relopts = (BtreeOptions *) info->index->rd_options;
 		cleanup_scale_factor = (relopts &&
 								relopts->vacuum_cleanup_index_scale_factor >= 0)
 			? relopts->vacuum_cleanup_index_scale_factor
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 								 RBM_NORMAL, info->strategy);
 		LockBufferForCleanup(buf);
 		_bt_checkpage(rel, buf);
-		_bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+		_bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+							vstate.lastBlockVacuumed);
 		_bt_relbuf(rel, buf);
 	}
 
@@ -1188,8 +1191,17 @@ restart:
 	}
 	else if (P_ISLEAF(opaque))
 	{
+		/* Deletable item state */
 		OffsetNumber deletable[MaxOffsetNumber];
 		int			ndeletable;
+		int			nhtidsdead;
+		int			nhtidslive;
+
+		/* Updatable item state (for posting lists) */
+		IndexTuple	updated[MaxOffsetNumber];
+		OffsetNumber updatable[MaxOffsetNumber];
+		int			nupdatable;
+
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
@@ -1229,6 +1241,10 @@ restart:
 		 * callback function.
 		 */
 		ndeletable = 0;
+		nupdatable = 0;
+		/* Maintain stats counters for index tuple versions/heap TIDs */
+		nhtidsdead = 0;
+		nhtidslive = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
 		if (callback)
@@ -1238,11 +1254,9 @@ restart:
 				 offnum = OffsetNumberNext(offnum))
 			{
 				IndexTuple	itup;
-				ItemPointer htup;
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
 				/*
 				 * During Hot Standby we currently assume that
@@ -1265,8 +1279,71 @@ restart:
 				 * applies to *any* type of index that marks index tuples as
 				 * killed.
 				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Regular tuple, standard heap TID representation */
+					ItemPointer htid = &(itup->t_tid);
+
+					if (callback(htid, callback_state))
+					{
+						deletable[ndeletable++] = offnum;
+						nhtidsdead++;
+					}
+					else
+						nhtidslive++;
+				}
+				else
+				{
+					ItemPointer newhtids;
+					int			nremaining;
+
+					/*
+					 * Posting list tuple, a physical tuple that represents
+					 * two or more logical tuples, any of which could be an
+					 * index row version that must be removed
+					 */
+					newhtids = btreevacuumposting(vstate, itup, &nremaining);
+					if (newhtids == NULL)
+					{
+						/*
+						 * All TIDs/logical tuples from the posting tuple
+						 * remain, so no update or delete required
+						 */
+						Assert(nremaining == BTreeTupleGetNPosting(itup));
+					}
+					else if (nremaining > 0)
+					{
+						IndexTuple	updatedtuple;
+
+						/*
+						 * Form new tuple that contains only remaining TIDs.
+						 * Remember this tuple and the offset of the old tuple
+						 * for when we update it in place
+						 */
+						Assert(nremaining < BTreeTupleGetNPosting(itup));
+						updatedtuple = _bt_form_posting(itup, newhtids,
+														nremaining);
+						updated[nupdatable] = updatedtuple;
+						updatable[nupdatable++] = offnum;
+						nhtidsdead += BTreeTupleGetNPosting(itup) - nremaining;
+						pfree(newhtids);
+					}
+					else
+					{
+						/*
+						 * All TIDs/logical tuples from the posting list must
+						 * be deleted.  We'll delete the physical tuple
+						 * completely.
+						 */
+						deletable[ndeletable++] = offnum;
+						nhtidsdead += BTreeTupleGetNPosting(itup);
+
+						/* Free empty array of live items */
+						pfree(newhtids);
+					}
+
+					nhtidslive += nremaining;
+				}
 			}
 		}
 
@@ -1274,7 +1351,7 @@ restart:
 		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
 		 * call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nupdatable > 0)
 		{
 			/*
 			 * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1290,7 +1367,8 @@ restart:
 			 * doesn't seem worth the amount of bookkeeping it'd take to avoid
 			 * that.
 			 */
-			_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+			_bt_delitems_vacuum(rel, buf, deletable, ndeletable, updatable,
+								updated, nupdatable,
 								vstate->lastBlockVacuumed);
 
 			/*
@@ -1300,7 +1378,7 @@ restart:
 			if (blkno > vstate->lastBlockVacuumed)
 				vstate->lastBlockVacuumed = blkno;
 
-			stats->tuples_removed += ndeletable;
+			stats->tuples_removed += nhtidsdead;
 			/* must recompute maxoff */
 			maxoff = PageGetMaxOffsetNumber(page);
 		}
@@ -1315,6 +1393,7 @@ restart:
 			 * We treat this like a hint-bit update because there's no need to
 			 * WAL-log it.
 			 */
+			Assert(nhtidsdead == 0);
 			if (vstate->cycleid != 0 &&
 				opaque->btpo_cycleid == vstate->cycleid)
 			{
@@ -1324,15 +1403,16 @@ restart:
 		}
 
 		/*
-		 * If it's now empty, try to delete; else count the live tuples. We
-		 * don't delete when recursing, though, to avoid putting entries into
+		 * If it's now empty, try to delete; else count the live tuples (live
+		 * heap TIDs in posting lists are counted as live tuples).  We don't
+		 * delete when recursing, though, to avoid putting entries into
 		 * freePages out-of-order (doesn't seem worth any extra code to handle
 		 * the case).
 		 */
 		if (minoff > maxoff)
 			delete_now = (blkno == orig_blkno);
 		else
-			stats->num_index_tuples += maxoff - minoff + 1;
+			stats->num_index_tuples += nhtidslive;
 	}
 
 	if (delete_now)
@@ -1375,6 +1455,68 @@ restart:
 	}
 }
 
+/*
+ * btreevacuumposting() -- determines which logical tuples must remain when
+ * VACUUMing a posting list tuple.
+ *
+ * Returns new palloc'd array of item pointers needed to build replacement
+ * posting list without the index row versions that are to be deleted.
+ *
+ * Note that returned array is NULL in the common case where there is nothing
+ * to delete in caller's posting list tuple.  The number of TIDs that should
+ * remain in the posting list tuple is set for caller in *nremaining.  This is
+ * also the size of the returned array (though only when array isn't just
+ * NULL).
+ */
+static ItemPointer
+btreevacuumposting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+	int			live = 0;
+	int			nitem = BTreeTupleGetNPosting(itup);
+	ItemPointer tmpitems = NULL,
+				items = BTreeTupleGetPosting(itup);
+
+	Assert(BTreeTupleIsPosting(itup));
+
+	/*
+	 * Check each tuple in the posting list.  Save live tuples into tmpitems,
+	 * though try to avoid memory allocation as an optimization.
+	 */
+	for (int i = 0; i < nitem; i++)
+	{
+		if (!vstate->callback(items + i, vstate->callback_state))
+		{
+			/*
+			 * Live heap TID.
+			 *
+			 * Only save live TID when we know that we're going to have to
+			 * kill at least one TID, and have already allocated memory.
+			 */
+			if (tmpitems)
+				tmpitems[live] = items[i];
+			live++;
+		}
+
+		/* Dead heap TID */
+		else if (tmpitems == NULL)
+		{
+			/*
+			 * Turns out we need to delete one or more dead heap TIDs, so
+			 * start maintaining an array of live TIDs for caller to
+			 * reconstruct smaller replacement posting list tuple
+			 */
+			tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+			/* Copy live heap TIDs from previous loop iterations */
+			if (live > 0)
+				memcpy(tmpitems, items, sizeof(ItemPointerData) * live);
+		}
+	}
+
+	*nremaining = live;
+	return tmpitems;
+}
+
 /*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8e512461a0..23621cdd37 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int	_bt_binsrch_posting(BTScanInsert key, Page page,
+								OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static void _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+								  OffsetNumber offnum, ItemPointer heapTid,
+								  IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+									   OffsetNumber offnum,
+									   ItemPointer heapTid);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -434,7 +442,10 @@ _bt_binsrch(Relation rel,
  * low) makes bounds invalid.
  *
  * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
  */
 OffsetNumber
 _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +464,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 	Assert(P_ISLEAF(opaque));
 	Assert(!key->nextkey);
+	Assert(insertstate->postingoff == 0);
 
 	if (!insertstate->bounds_valid)
 	{
@@ -509,6 +521,16 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 			if (result != 0)
 				stricthigh = high;
 		}
+
+		/*
+		 * If tuple at offset located by binary search is a posting list whose
+		 * TID range overlaps with caller's scantid, perform posting list
+		 * binary search to set postingoff for caller.  Caller must split the
+		 * posting list when postingoff is set.  This should happen
+		 * infrequently.
+		 */
+		if (unlikely(result == 0 && key->scantid != NULL))
+			insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
 	}
 
 	/*
@@ -528,6 +550,68 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	return low;
 }
 
+/*----------
+ *	_bt_binsrch_posting() -- posting list binary search.
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+	IndexTuple	itup;
+	ItemId		itemid;
+	int			low,
+				high,
+				mid,
+				res;
+
+	/*
+	 * If this isn't a posting tuple, then the index must be corrupt (if it is
+	 * an ordinary non-pivot tuple then there must be an existing tuple with a
+	 * heap TID that equals inserter's new heap TID/scantid).  Defensively
+	 * check that tuple is a posting list tuple whose posting list range
+	 * includes caller's scantid.
+	 *
+	 * (This is also needed because contrib/amcheck's rootdescend option needs
+	 * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+	 */
+	itemid = PageGetItemId(page, offnum);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+	if (!BTreeTupleIsPosting(itup))
+		return 0;
+
+	/*
+	 * In the unlikely event that posting list tuple has LP_DEAD bit set,
+	 * signal to caller that it should kill the item and restart its binary
+	 * search.
+	 */
+	if (ItemIdIsDead(itemid))
+		return -1;
+
+	/* "high" is past end of posting list for loop invariant */
+	low = 0;
+	high = BTreeTupleGetNPosting(itup);
+	Assert(high >= 2);
+
+	while (high > low)
+	{
+		mid = low + ((high - low) / 2);
+		res = ItemPointerCompare(key->scantid,
+								 BTreeTupleGetPostingN(itup, mid));
+
+		if (res > 0)
+			low = mid + 1;
+		else if (res < 0)
+			high = mid;
+		else
+			return mid;
+	}
+
+	/* Exact match not found */
+	return low;
+}
+
 /*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
@@ -537,9 +621,18 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
  *			<0 if scankey < tuple at offnum;
  *			 0 if scankey == tuple at offnum;
  *			>0 if scankey > tuple at offnum.
- *		NULLs in the keys are treated as sortable values.  Therefore
- *		"equality" does not necessarily mean that the item should be
- *		returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values.  Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key.  Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid.  There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
+ *
+ * It is generally guaranteed that any possible scankey with scantid set
+ * will have zero or one tuples in the index that are considered equal
+ * here.
  *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
@@ -563,6 +656,7 @@ _bt_compare(Relation rel,
 	ScanKey		scankey;
 	int			ncmpkey;
 	int			ntupatts;
+	int32		result;
 
 	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +691,6 @@ _bt_compare(Relation rel,
 	{
 		Datum		datum;
 		bool		isNull;
-		int32		result;
 
 		datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
 
@@ -713,8 +806,25 @@ _bt_compare(Relation rel,
 	if (heapTid == NULL)
 		return 1;
 
+	/*
+	 * scankey must be treated as equal to a posting list tuple if its scantid
+	 * value falls within the range of the posting list.  In all other cases
+	 * there can only be a single heap TID value, which is compared directly
+	 * as a simple scalar value.
+	 */
 	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	return ItemPointerCompare(key->scantid, heapTid);
+	result = ItemPointerCompare(key->scantid, heapTid);
+	if (!BTreeTupleIsPosting(itup) || result <= 0)
+		return result;
+	else
+	{
+		result = ItemPointerCompare(key->scantid,
+									BTreeTupleGetMaxHeapTID(itup));
+		if (result > 0)
+			return 1;
+	}
+
+	return 0;
 }
 
 /*
@@ -1230,6 +1340,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 
 	/* Initialize remaining insertion scan key fields */
 	inskey.heapkeyspace = _bt_heapkeyspace(rel);
+	inskey.safededup = false;	/* unused */
 	inskey.anynullkeys = false; /* unused */
 	inskey.nextkey = nextkey;
 	inskey.pivotsearch = false;
@@ -1451,6 +1562,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	/* initialize tuple workspace to empty */
 	so->currPos.nextTupleOffset = 0;
+	so->currPos.postingTupleOffset = 0;
 
 	/*
 	 * Now that the current page has been made consistent, the macro should be
@@ -1485,8 +1597,29 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				if (!BTreeTupleIsPosting(itup))
+				{
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+				else
+				{
+					/*
+					 * Setup state to return posting list, and save first
+					 * "logical" tuple
+					 */
+					_bt_setuppostingitems(so, itemIndex, offnum,
+										  BTreeTupleGetPostingN(itup, 0),
+										  itup);
+					itemIndex++;
+					/* Save additional posting list "logical" tuples */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i));
+						itemIndex++;
+					}
+				}
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1519,7 +1652,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxBTreeIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1527,7 +1660,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxBTreeIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1569,8 +1702,36 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (passes_quals && tuple_alive)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				if (!BTreeTupleIsPosting(itup))
+				{
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+				else
+				{
+					int			i = BTreeTupleGetNPosting(itup) - 1;
+
+					/*
+					 * Setup state to return posting list, and save last
+					 * "logical" tuple from posting list (since it's the first
+					 * that will be returned to scan).
+					 */
+					itemIndex--;
+					_bt_setuppostingitems(so, itemIndex, offnum,
+										  BTreeTupleGetPostingN(itup, i--),
+										  itup);
+
+					/*
+					 * Return posting list "logical" tuples -- do this in
+					 * descending order, to match overall scan order
+					 */
+					for (; i >= 0; i--)
+					{
+						itemIndex--;
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i));
+					}
+				}
 			}
 			if (!continuescan)
 			{
@@ -1584,8 +1745,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxBTreeIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxBTreeIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1759,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1610,6 +1773,64 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 	}
 }
 
+/*
+ * Setup state to save posting items from a single posting list tuple.  Saves
+ * the logical tuple that will be returned to scan first in passing.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for logical tuple
+ * that is returned to scan first.  Second or subsequent heap TID for posting
+ * list should be saved by calling _bt_savepostingitem().
+ */
+static void
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					  ItemPointer heapTid, IndexTuple itup)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		/* Save base IndexTuple (truncate posting list) */
+		IndexTuple	base;
+		Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+		itupsz = MAXALIGN(itupsz);
+		currItem->tupleOffset = so->currPos.nextTupleOffset;
+		base = (IndexTuple) (so->currTuples + so->currPos.nextTupleOffset);
+		memcpy(base, itup, itupsz);
+		/* Defensively reduce work area index tuple header size */
+		base->t_info &= ~INDEX_SIZE_MASK;
+		base->t_info |= itupsz;
+		so->currPos.nextTupleOffset += itupsz;
+		so->currPos.postingTupleOffset = currItem->tupleOffset;
+	}
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer heapTid)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+
+	/*
+	 * Have index-only scans return the same base IndexTuple for every logical
+	 * tuple that originates from the same posting list
+	 */
+	if (so->currTuples)
+		currItem->tupleOffset = so->currPos.postingTupleOffset;
+}
+
 /*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index c11a3fb570..84bee940b3 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -243,6 +243,7 @@ typedef struct BTPageState
 	BlockNumber btps_blkno;		/* block # to write this page at */
 	IndexTuple	btps_lowkey;	/* page's strict lower bound pivot tuple */
 	OffsetNumber btps_lastoff;	/* last item offset loaded */
+	Size		btps_lastextra; /* last item's extra posting list space */
 	uint32		btps_level;		/* tree level (0 = leaf) */
 	Size		btps_full;		/* "full" if less than this much free space */
 	struct BTPageState *btps_next;	/* link to parent level, if any */
@@ -277,7 +278,10 @@ static void _bt_slideleft(Page page);
 static void _bt_sortaddtup(Page page, Size itemsize,
 						   IndexTuple itup, OffsetNumber itup_off);
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
-						 IndexTuple itup);
+						 IndexTuple itup, Size truncextra);
+static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
+										  BTPageState *state,
+										  BTDedupState *dstate);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
@@ -711,13 +715,14 @@ _bt_pagestate(BTWriteState *wstate, uint32 level)
 	state->btps_lowkey = NULL;
 	/* initialize lastoff so first item goes into P_FIRSTKEY */
 	state->btps_lastoff = P_HIKEY;
+	state->btps_lastextra = 0;
 	state->btps_level = level;
 	/* set "full" threshold based on level.  See notes at head of file. */
 	if (level > 0)
 		state->btps_full = (BLCKSZ * (100 - BTREE_NONLEAF_FILLFACTOR) / 100);
 	else
-		state->btps_full = RelationGetTargetPageFreeSpace(wstate->index,
-														  BTREE_DEFAULT_FILLFACTOR);
+		state->btps_full = BtreeGetTargetPageFreeSpace(wstate->index,
+													   BTREE_DEFAULT_FILLFACTOR);
 	/* no parent level, yet */
 	state->btps_next = NULL;
 
@@ -790,7 +795,8 @@ _bt_sortaddtup(Page page,
 }
 
 /*----------
- * Add an item to a disk page from the sort output.
+ * Add an item to a disk page from the sort output (or add a posting list
+ * item formed from the sort output).
  *
  * We must be careful to observe the page layout conventions of nbtsearch.c:
  * - rightmost pages start data items at P_HIKEY instead of at P_FIRSTKEY.
@@ -822,14 +828,27 @@ _bt_sortaddtup(Page page,
  * the truncated high key at offset 1.
  *
  * 'last' pointer indicates the last offset added to the page.
+ *
+ * 'truncextra' is the size of the posting list in itup, if any.  This
+ * information is stashed for the next call here, when we may benefit
+ * from considering the impact of truncating away the posting list on
+ * the page before deciding to finish the page off.  Posting lists are
+ * often relatively large, so it is worth going to the trouble of
+ * accounting for the saving from truncating away the posting list of
+ * the tuple that becomes the high key (that may be the only way to
+ * get close to target free space on the page).  Note that this is
+ * only used for the soft fillfactor-wise limit, not the critical hard
+ * limit.
  *----------
  */
 static void
-_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
+_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
+			 Size truncextra)
 {
 	Page		npage;
 	BlockNumber nblkno;
 	OffsetNumber last_off;
+	Size		last_truncextra;
 	Size		pgspc;
 	Size		itupsz;
 	bool		isleaf;
@@ -843,6 +862,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	npage = state->btps_page;
 	nblkno = state->btps_blkno;
 	last_off = state->btps_lastoff;
+	last_truncextra = state->btps_lastextra;
+	state->btps_lastextra = truncextra;
 
 	pgspc = PageGetFreeSpace(npage);
 	itupsz = IndexTupleSize(itup);
@@ -884,10 +905,10 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	 * page.  Disregard fillfactor and insert on "full" current page if we
 	 * don't have the minimum number of items yet.  (Note that we deliberately
 	 * assume that suffix truncation neither enlarges nor shrinks new high key
-	 * when applying soft limit.)
+	 * when applying soft limit, except when last tuple had a posting list.)
 	 */
 	if (pgspc < itupsz + (isleaf ? MAXALIGN(sizeof(ItemPointerData)) : 0) ||
-		(pgspc < state->btps_full && last_off > P_FIRSTKEY))
+		(pgspc + last_truncextra < state->btps_full && last_off > P_FIRSTKEY))
 	{
 		/*
 		 * Finish off the page and write it out.
@@ -945,11 +966,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * We don't try to bias our choice of split point to make it more
 			 * likely that _bt_truncate() can truncate away more attributes,
 			 * whereas the split point used within _bt_split() is chosen much
-			 * more delicately.  Suffix truncation is mostly useful because it
-			 * improves space utilization for workloads with random
-			 * insertions.  It doesn't seem worthwhile to add logic for
-			 * choosing a split point here for a benefit that is bound to be
-			 * much smaller.
+			 * more delicately.  On the other hand, non-unique index builds
+			 * usually deduplicate, which often results in every "physical"
+			 * tuple on the page having distinct key values.  When that
+			 * happens, _bt_truncate() will never need to include a heap TID
+			 * in the new high key.
 			 *
 			 * Overwrite the old item with new truncated high key directly.
 			 * oitup is already located at the physical beginning of tuple
@@ -984,7 +1005,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(BTreeTupleGetNAtts(state->btps_lowkey, wstate->index) == 0 ||
 			   !P_LEFTMOST((BTPageOpaque) PageGetSpecialPointer(opage)));
 		BTreeInnerTupleSetDownLink(state->btps_lowkey, oblkno);
-		_bt_buildadd(wstate, state->btps_next, state->btps_lowkey);
+		_bt_buildadd(wstate, state->btps_next, state->btps_lowkey, 0);
 		pfree(state->btps_lowkey);
 
 		/*
@@ -1046,6 +1067,47 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	state->btps_lastoff = last_off;
 }
 
+/*
+ * Finalize pending posting list tuple, and add it to the index.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * This is almost like _bt_dedup_finish_pending(), but it adds a new tuple
+ * using _bt_buildadd() and does not maintain the intervals array.
+ */
+static void
+_bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
+							  BTDedupState *dstate)
+{
+	IndexTuple	final;
+	Size		truncextra;
+
+	Assert(dstate->nitems > 0);
+	truncextra = 0;
+	if (dstate->nitems == 1)
+		final = dstate->base;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = _bt_form_posting(dstate->base,
+										dstate->htids,
+										dstate->nhtids);
+		final = postingtuple;
+		/* Determine size of posting list */
+		truncextra = IndexTupleSize(final) -
+			BTreeTupleGetPostingOffset(final);
+	}
+
+	_bt_buildadd(wstate, state, final, truncextra);
+
+	if (dstate->nitems > 1)
+		pfree(final);
+	/* Don't maintain  dedup_intervals array, or alltupsize */
+	dstate->nhtids = 0;
+	dstate->nitems = 0;
+}
+
 /*
  * Finish writing out the completed btree.
  */
@@ -1091,7 +1153,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 			Assert(BTreeTupleGetNAtts(s->btps_lowkey, wstate->index) == 0 ||
 				   !P_LEFTMOST(opaque));
 			BTreeInnerTupleSetDownLink(s->btps_lowkey, blkno);
-			_bt_buildadd(wstate, s->btps_next, s->btps_lowkey);
+			_bt_buildadd(wstate, s->btps_next, s->btps_lowkey, 0);
 			pfree(s->btps_lowkey);
 			s->btps_lowkey = NULL;
 		}
@@ -1112,7 +1174,8 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 	 * by filling in a valid magic number in the metapage.
 	 */
 	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, rootblkno, rootlevel);
+	_bt_initmetapage(metapage, rootblkno, rootlevel,
+					 wstate->inskey->safededup);
 	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
 }
 
@@ -1133,6 +1196,10 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		deduplicate;
+
+	deduplicate = wstate->inskey->safededup &&
+		BtreeGetDoDedupOption(wstate->index);
 
 	if (merge)
 	{
@@ -1229,12 +1296,12 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 
 			if (load1)
 			{
-				_bt_buildadd(wstate, state, itup);
+				_bt_buildadd(wstate, state, itup, 0);
 				itup = tuplesort_getindextuple(btspool->sortstate, true);
 			}
 			else
 			{
-				_bt_buildadd(wstate, state, itup2);
+				_bt_buildadd(wstate, state, itup2, 0);
 				itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
 			}
 
@@ -1244,9 +1311,113 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		}
 		pfree(sortKeys);
 	}
+	else if (deduplicate)
+	{
+		/* merge is unnecessary, deduplicate into posting lists */
+		BTDedupState *dstate;
+		IndexTuple	newbase;
+
+		dstate = (BTDedupState *) palloc(sizeof(BTDedupState));
+		dstate->maxitemsize = 0;	/* set later */
+		dstate->checkingunique = false; /* unused */
+		dstate->skippedbase = InvalidOffsetNumber;
+		dstate->newitem = NULL;
+		/* Metadata about current pending posting list */
+		dstate->htids = NULL;
+		dstate->nhtids = 0;
+		dstate->nitems = 0;
+		dstate->overlap = false;
+		dstate->alltupsize = 0; /* unused */
+		/* Metadata about based tuple of current pending posting list */
+		dstate->base = NULL;
+		dstate->baseoff = InvalidOffsetNumber;	/* unused */
+		dstate->basetupsize = 0;
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate,
+											   true)) != NULL)
+		{
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+			{
+				state = _bt_pagestate(wstate, 0);
+
+				/*
+				 * Limit size of posting list tuples to the size of the free
+				 * space we want to leave behind on the page, plus space for
+				 * final item's line pointer (but make sure that posting list
+				 * tuple size won't exceed the generic 1/3 of a page limit).
+				 *
+				 * This is more conservative than the approach taken in the
+				 * retail insert path, but it allows us to get most of the
+				 * space savings deduplication provides without noticeably
+				 * impacting how much free space is left behind on each leaf
+				 * page.
+				 */
+				dstate->maxitemsize =
+					Min(BTMaxItemSize(state->btps_page),
+						MAXALIGN_DOWN(state->btps_full) - sizeof(ItemIdData));
+				/* Minimum posting tuple size used here is arbitrary: */
+				dstate->maxitemsize = Max(dstate->maxitemsize, 100);
+				dstate->htids = palloc(dstate->maxitemsize);
+
+				/*
+				 * No previous/base tuple, since itup is the  first item
+				 * returned by the tuplesort -- use itup as base tuple of
+				 * first pending posting list for entire index build
+				 */
+				newbase = CopyIndexTuple(itup);
+				_bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+			}
+			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
+										 itup) > keysz &&
+					 _bt_dedup_save_htid(dstate, itup))
+			{
+				/*
+				 * Tuple is equal to base tuple of pending posting list, and
+				 * merging itup into pending posting list won't exceed the
+				 * maxitemsize limit.  Heap TID(s) for itup have been saved in
+				 * state.  The next iteration will also end up here if it's
+				 * possible to merge the next tuple into the same pending
+				 * posting list.
+				 */
+			}
+			else
+			{
+				/*
+				 * Tuple is not equal to pending posting list tuple, or
+				 * maxitemsize limit was reached
+				 */
+				_bt_sort_dedup_finish_pending(wstate, state, dstate);
+				/* Base tuple is always a copy */
+				pfree(dstate->base);
+
+				/* itup starts new pending posting list */
+				newbase = CopyIndexTuple(itup);
+				_bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+			}
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+
+		/*
+		 * Handle the last item (there must be a last item when the tuplesort
+		 * returned one or more tuples)
+		 */
+		if (state)
+		{
+			_bt_sort_dedup_finish_pending(wstate, state, dstate);
+			/* Base tuple is always a copy */
+			pfree(dstate->base);
+			pfree(dstate->htids);
+		}
+
+		pfree(dstate);
+	}
 	else
 	{
-		/* merge is unnecessary */
+		/* merging and deduplication are both unnecessary */
 		while ((itup = tuplesort_getindextuple(btspool->sortstate,
 											   true)) != NULL)
 		{
@@ -1254,7 +1425,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			if (state == NULL)
 				state = _bt_pagestate(wstate, 0);
 
-			_bt_buildadd(wstate, state, itup);
+			_bt_buildadd(wstate, state, itup, 0);
 
 			/* Report progress */
 			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index a04d4e25d6..8078522b5c 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -51,6 +51,7 @@ typedef struct
 	Size		newitemsz;		/* size of newitem (includes line pointer) */
 	bool		is_leaf;		/* T if splitting a leaf page */
 	bool		is_rightmost;	/* T if splitting rightmost page on level */
+	bool		is_deduped;		/* T if posting list truncation expected */
 	OffsetNumber newitemoff;	/* where the new item is to be inserted */
 	int			leftspace;		/* space available for items on left page */
 	int			rightspace;		/* space available for items on right page */
@@ -167,7 +168,7 @@ _bt_findsplitloc(Relation rel,
 
 	/* Count up total space in data items before actually scanning 'em */
 	olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page);
-	leaffillfactor = RelationGetFillFactor(rel, BTREE_DEFAULT_FILLFACTOR);
+	leaffillfactor = BtreeGetFillFactor(rel, BTREE_DEFAULT_FILLFACTOR);
 
 	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
 	newitemsz += sizeof(ItemIdData);
@@ -177,12 +178,16 @@ _bt_findsplitloc(Relation rel,
 	state.newitemsz = newitemsz;
 	state.is_leaf = P_ISLEAF(opaque);
 	state.is_rightmost = P_RIGHTMOST(opaque);
+	state.is_deduped = state.is_leaf && BtreeGetDoDedupOption(rel);
 	state.leftspace = leftspace;
 	state.rightspace = rightspace;
 	state.olddataitemstotal = olddataitemstotal;
 	state.minfirstrightsz = SIZE_MAX;
 	state.newitemoff = newitemoff;
 
+	/* newitem cannot be a posting list item */
+	Assert(!BTreeTupleIsPosting(newitem));
+
 	/*
 	 * maxsplits should never exceed maxoff because there will be at most as
 	 * many candidate split points as there are points _between_ tuples, once
@@ -459,6 +464,7 @@ _bt_recsplitloc(FindSplitData *state,
 	int16		leftfree,
 				rightfree;
 	Size		firstrightitemsz;
+	Size		postingsz = 0;
 	bool		newitemisfirstonright;
 
 	/* Is the new item going to be the first item on the right page? */
@@ -468,8 +474,31 @@ _bt_recsplitloc(FindSplitData *state,
 	if (newitemisfirstonright)
 		firstrightitemsz = state->newitemsz;
 	else
+	{
 		firstrightitemsz = firstoldonrightsz;
 
+		/*
+		 * Calculate suffix truncation space saving when firstright is a
+		 * posting list tuple.
+		 *
+		 * Individual posting lists often take up a significant fraction of
+		 * all space on a page.  Failing to consider that the new high key
+		 * won't need to store the posting list a second time really matters.
+		 */
+		if (state->is_leaf && state->is_deduped)
+		{
+			ItemId		itemid;
+			IndexTuple	newhighkey;
+
+			itemid = PageGetItemId(state->page, firstoldonright);
+			newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+			if (BTreeTupleIsPosting(newhighkey))
+				postingsz = IndexTupleSize(newhighkey) -
+					BTreeTupleGetPostingOffset(newhighkey);
+		}
+	}
+
 	/* Account for all the old tuples */
 	leftfree = state->leftspace - olddataitemstoleft;
 	rightfree = state->rightspace -
@@ -492,9 +521,11 @@ _bt_recsplitloc(FindSplitData *state,
 	 * adding a heap TID to the left half's new high key when splitting at the
 	 * leaf level.  In practice the new high key will often be smaller and
 	 * will rarely be larger, but conservatively assume the worst case.
+	 * Truncation always truncates away any posting list that appears in the
+	 * first right tuple, though, so it's safe to subtract that overhead.
 	 */
 	if (state->is_leaf)
-		leftfree -= (int16) (firstrightitemsz +
+		leftfree -= (int16) ((firstrightitemsz - postingsz) +
 							 MAXALIGN(sizeof(ItemPointerData)));
 	else
 		leftfree -= (int16) firstrightitemsz;
@@ -691,7 +722,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
 	tup = (IndexTuple) PageGetItem(state->page, itemid);
 	/* Do cheaper test first */
-	if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+	if (BTreeTupleIsPosting(tup) ||
+		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
 	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 7669a1a66f..2601b59f29 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -20,6 +20,7 @@
 #include "access/nbtree.h"
 #include "access/reloptions.h"
 #include "access/relscan.h"
+#include "catalog/catalog.h"
 #include "commands/progress.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -98,8 +99,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	indoption = rel->rd_indoption;
 	tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
 
-	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
 	/*
 	 * We'll execute search using scan key constructed on key columns.
 	 * Truncated attributes and non-key attributes are omitted from the final
@@ -108,12 +107,25 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	key = palloc(offsetof(BTScanInsertData, scankeys) +
 				 sizeof(ScanKeyData) * indnkeyatts);
 	key->heapkeyspace = itup == NULL || _bt_heapkeyspace(rel);
+	key->safededup = itup == NULL ? _bt_opclasses_support_dedup(rel) :
+		_bt_safededup(rel);
 	key->anynullkeys = false;	/* initial assumption */
 	key->nextkey = false;
 	key->pivotsearch = false;
+	key->scantid = NULL;
 	key->keysz = Min(indnkeyatts, tupnatts);
-	key->scantid = key->heapkeyspace && itup ?
-		BTreeTupleGetHeapTID(itup) : NULL;
+
+	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+	Assert(!itup || !BTreeTupleIsPosting(itup) || key->heapkeyspace);
+
+	/*
+	 * When caller passes a tuple with a heap TID, use it to set scantid. Note
+	 * that this handles posting list tuples by setting scantid to the lowest
+	 * heap TID in the posting list.
+	 */
+	if (itup && key->heapkeyspace)
+		key->scantid = BTreeTupleGetHeapTID(itup);
+
 	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
@@ -1373,6 +1385,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			continue;
 		}
 
@@ -1534,6 +1547,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			cmpresult = 0;
 			if (subkey->sk_flags & SK_ROW_END)
 				break;
@@ -1773,10 +1787,35 @@ _bt_killitems(IndexScanDesc scan)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
+			bool		killtuple = false;
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			if (BTreeTupleIsPosting(ituple))
 			{
-				/* found the item */
+				int			pi = i + 1;
+				int			nposting = BTreeTupleGetNPosting(ituple);
+				int			j;
+
+				for (j = 0; j < nposting; j++)
+				{
+					ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+					if (!ItemPointerEquals(item, &kitem->heapTid))
+						break;	/* out of posting list loop */
+
+					/* Read-ahead to later kitems */
+					if (pi < numKilled)
+						kitem = &so->currPos.items[so->killedItems[pi++]];
+				}
+
+				if (j == nposting)
+					killtuple = true;
+			}
+			else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+				killtuple = true;
+
+			if (killtuple)
+			{
+				/* found the item/all posting list items */
 				ItemIdMarkDead(iid);
 				killedsomething = true;
 				break;			/* out of inner search loop */
@@ -2014,7 +2053,31 @@ BTreeShmemInit(void)
 bytea *
 btoptions(Datum reloptions, bool validate)
 {
-	return default_reloptions(reloptions, validate, RELOPT_KIND_BTREE);
+	relopt_value *options;
+	BtreeOptions *rdopts;
+	int			numoptions;
+	static const relopt_parse_elt tab[] = {
+		{"fillfactor", RELOPT_TYPE_INT, offsetof(BtreeOptions, fillfactor)},
+		{"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
+		offsetof(BtreeOptions, vacuum_cleanup_index_scale_factor)},
+		{"deduplication", RELOPT_TYPE_BOOL,
+		offsetof(BtreeOptions, deduplication)}
+	};
+
+	options = parseRelOptions(reloptions, validate, RELOPT_KIND_BTREE,
+							  &numoptions);
+
+	/* if none set, we're done */
+	if (numoptions == 0)
+		return NULL;
+
+	rdopts = allocateReloptStruct(sizeof(BtreeOptions), options, numoptions);
+
+	fillRelOptions((void *) rdopts, sizeof(BtreeOptions), options, numoptions,
+				   validate, tab, lengthof(tab));
+
+	pfree(options);
+	return (bytea *) rdopts;
 }
 
 /*
@@ -2127,6 +2190,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(firstright))
+		{
+			BTreeTupleClearBtIsPosting(pivot);
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			if (keepnatts == natts)
+			{
+				/*
+				 * index_truncate_tuple() just returned a copy of the
+				 * original, so make sure that the size of the new pivot tuple
+				 * doesn't have posting list overhead
+				 */
+				pivot->t_info &= ~INDEX_SIZE_MASK;
+				pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+			}
+		}
+
+		Assert(!BTreeTupleIsPosting(pivot));
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2143,6 +2224,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * attribute to the new pivot tuple.
 		 */
 		Assert(natts != nkeyatts);
+		Assert(!BTreeTupleIsPosting(lastleft) &&
+			   !BTreeTupleIsPosting(firstright));
 		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
 		tidpivot = palloc0(newsize);
 		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2150,6 +2233,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		pfree(pivot);
 		pivot = tidpivot;
 	}
+	else if (BTreeTupleIsPosting(firstright))
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal.  We
+		 * can always truncate away a posting list, though.
+		 *
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+			MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+		pivot->t_info &= ~INDEX_SIZE_MASK;
+		pivot->t_info |= newsize;
+		BTreeTupleClearBtIsPosting(pivot);
+		BTreeTupleSetAltHeapTID(pivot);
+	}
 	else
 	{
 		/*
@@ -2157,7 +2258,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * It's necessary to add a heap TID attribute to the new pivot tuple.
 		 */
 		Assert(natts == nkeyatts);
-		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+		newsize = MAXALIGN(IndexTupleSize(firstright)) +
+			MAXALIGN(sizeof(ItemPointerData));
 		pivot = palloc0(newsize);
 		memcpy(pivot, firstright, IndexTupleSize(firstright));
 	}
@@ -2175,6 +2277,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * nbtree (e.g., there is no pg_attribute entry).
 	 */
 	Assert(itup_key->heapkeyspace);
+	Assert(!BTreeTupleIsPosting(pivot));
 	pivot->t_info &= ~INDEX_SIZE_MASK;
 	pivot->t_info |= newsize;
 
@@ -2187,7 +2290,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2198,9 +2301,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
+							  BTreeTupleGetHeapTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #else
 
 	/*
@@ -2213,7 +2319,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2222,7 +2328,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2310,6 +2417,10 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * leaving excessive amounts of free space on either side of page split.
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts.
+ *
+ * When an index only uses opclasses where _bt_opclasses_support_dedup()
+ * report that deduplication is safe, this function is guaranteed to give the
+ * same result as _bt_keep_natts().
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2387,22 +2498,30 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
+	/* !heapkeyspace indexes do not support deduplication */
+	if (!heapkeyspace && BTreeTupleIsPosting(itup))
+		return false;
+
+	/* INCLUDE indexes do not support deduplication */
+	if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+		return false;
+
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
 			/*
-			 * Non-pivot tuples currently never use alternative heap TID
-			 * representation -- even those within heapkeyspace indexes
+			 * Non-pivot tuple should never be explicitly marked as a pivot
+			 * tuple
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated.  (Note that tupnatts must have been
-			 * inferred, rather than coming from an explicit on-disk
-			 * representation.)
+			 * inferred, even with a posting list tuple, because only pivot
+			 * tuples store tupnatts directly.)
 			 */
 			return tupnatts == natts;
 		}
@@ -2446,12 +2565,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * non-zero, or when there is no explicit representation and the
 			 * tuple is evidently not a pre-pg_upgrade tuple.
 			 *
-			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
-			 * that to decide if the tuple is a pre-v11 tuple.
+			 * Prior to v11, downlinks always had P_HIKEY as their offset.
+			 * Accept that as an alternative indication of a valid
+			 * !heapkeyspace negative infinity tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
-				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+				ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
 		}
 		else
 		{
@@ -2477,7 +2596,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
+		return false;
+
+	/* Pivot tuple should not use posting list representation (redundant) */
+	if (BTreeTupleIsPosting(itup))
 		return false;
 
 	/*
@@ -2547,11 +2670,54 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					BTMaxItemSizeNoHeapTid(page),
 					RelationGetRelationName(rel)),
 			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
-					   ItemPointerGetBlockNumber(&newtup->t_tid),
-					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+					   ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
 					   RelationGetRelationName(heap)),
 			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
 					 "Consider a function index of an MD5 hash of the value, "
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Is it safe to perform deduplication for an index, given the opclasses and
+ * collations used?
+ *
+ * Returned value is stored in index metapage during index builds.  Function
+ * does not account for incompatibilities caused by index being on an earlier
+ * nbtree version.
+ */
+bool
+_bt_opclasses_support_dedup(Relation index)
+{
+	/* INCLUDE indexes don't support deduplication */
+	if (IndexRelationGetNumberOfAttributes(index) !=
+		IndexRelationGetNumberOfKeyAttributes(index))
+		return false;
+
+	/*
+	 * There is no reason why deduplication cannot be used with system catalog
+	 * indexes.  However, we deem it generally unsafe because it's not clear
+	 * how it could be disabled.  (ALTER INDEX is not supported with system
+	 * catalog indexes, so users have no way to set the "deduplicate" storage
+	 * parameter.)
+	 */
+	if (IsCatalogRelation(index))
+		return false;
+
+	for (int i = 0; i < IndexRelationGetNumberOfKeyAttributes(index); i++)
+	{
+		Oid			opfamily = index->rd_opfamily[i];
+		Oid			collation = index->rd_indcollation[i];
+
+		/* TODO add adequate check of opclasses and collations */
+		elog(DEBUG4, "index %s column i %d opfamilyOid %u collationOid %u",
+			 RelationGetRelationName(index), i, opfamily, collation);
+
+		/* NUMERIC btree opfamily OID is 1988 */
+		if (opfamily == 1988)
+			return false;
+	}
+
+	return true;
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 44f6283950..d36d31c758 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -22,6 +22,9 @@
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/procarray.h"
+#include "utils/memutils.h"
+
+static MemoryContext opCtx;		/* working memory for operations */
 
 /*
  * _bt_restore_page -- re-enter all the index tuples on a page
@@ -111,6 +114,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
 	Assert(md->btm_version >= BTREE_NOVAC_VERSION);
 	md->btm_oldest_btpo_xact = xlrec->oldest_btpo_xact;
 	md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
+	md->btm_safededup = xlrec->btm_safededup;
 
 	pageop = (BTPageOpaque) PageGetSpecialPointer(metapg);
 	pageop->btpo_flags = BTP_META;
@@ -181,9 +185,45 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 
 		page = BufferGetPage(buffer);
 
-		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
-						false, false) == InvalidOffsetNumber)
-			elog(PANIC, "btree_xlog_insert: failed to add item");
+		if (xlrec->postingoff == InvalidOffsetNumber)
+		{
+			/* Simple retail insertion */
+			if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+		}
+		else
+		{
+			ItemId		itemid;
+			IndexTuple	oposting,
+						newitem,
+						nposting;
+
+			/*
+			 * A posting list split occurred during insertion.
+			 *
+			 * Use _bt_swap_posting() to repeat posting list split steps from
+			 * primary.  Note that newitem from WAL record is 'orignewitem',
+			 * not the final version of newitem that is actually inserted on
+			 * page.
+			 */
+			Assert(isleaf);
+			itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+			oposting = (IndexTuple) PageGetItem(page, itemid);
+
+			/* newitem must be mutable copy for _bt_swap_posting() */
+			newitem = CopyIndexTuple((IndexTuple) datapos);
+			nposting = _bt_swap_posting(newitem, oposting, xlrec->postingoff);
+
+			/* Replace existing posting list with post-split version */
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+			/* insert new item */
+			Assert(IndexTupleSize(newitem) == datalen);
+			if (PageAddItem(page, (Item) newitem, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add posting split new item");
+		}
 
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
@@ -265,20 +305,38 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 		BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
 		OffsetNumber off;
 		IndexTuple	newitem = NULL,
-					left_hikey = NULL;
+					left_hikey = NULL,
+					nposting = NULL;
 		Size		newitemsz = 0,
 					left_hikeysz = 0;
 		Page		newlpage;
-		OffsetNumber leftoff;
+		OffsetNumber leftoff,
+					replacepostingoff = InvalidOffsetNumber;
 
 		datapos = XLogRecGetBlockData(record, 0, &datalen);
 
-		if (onleft)
+		if (onleft || xlrec->postingoff != 0)
 		{
 			newitem = (IndexTuple) datapos;
 			newitemsz = MAXALIGN(IndexTupleSize(newitem));
 			datapos += newitemsz;
 			datalen -= newitemsz;
+
+			if (xlrec->postingoff != 0)
+			{
+				ItemId		itemid;
+				IndexTuple	oposting;
+
+				/* Posting list must be at offset number before new item's */
+				replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+				/* newitem must be mutable copy for _bt_swap_posting() */
+				newitem = CopyIndexTuple(newitem);
+				itemid = PageGetItemId(lpage, replacepostingoff);
+				oposting = (IndexTuple) PageGetItem(lpage, itemid);
+				nposting = _bt_swap_posting(newitem, oposting,
+											xlrec->postingoff);
+			}
 		}
 
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
@@ -304,8 +362,20 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 			Size		itemsz;
 			IndexTuple	item;
 
+			/* Add replacement posting list when required */
+			if (off == replacepostingoff)
+			{
+				Assert(onleft || xlrec->firstright == xlrec->newitemoff);
+				if (PageAddItem(newlpage, (Item) nposting,
+								MAXALIGN(IndexTupleSize(nposting)), leftoff,
+								false, false) == InvalidOffsetNumber)
+					elog(ERROR, "failed to add new posting list item to left page after split");
+				leftoff = OffsetNumberNext(leftoff);
+				continue;
+			}
+
 			/* add the new item if it was inserted on left page */
-			if (onleft && off == xlrec->newitemoff)
+			else if (onleft && off == xlrec->newitemoff)
 			{
 				if (PageAddItem(newlpage, (Item) newitem, newitemsz, leftoff,
 								false, false) == InvalidOffsetNumber)
@@ -379,6 +449,84 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 	}
 }
 
+static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		buf;
+	xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+
+	if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Initialize a temporary empty page and copy all the items to that in
+		 * item number order.
+		 */
+		Page		page = (Page) BufferGetPage(buf);
+		OffsetNumber offnum;
+		BTDedupState *state;
+
+		state = (BTDedupState *) palloc(sizeof(BTDedupState));
+
+		state->maxitemsize = BTMaxItemSize(page);
+		state->checkingunique = false;	/* unused */
+		state->skippedbase = InvalidOffsetNumber;
+		state->newitem = NULL;
+		/* Metadata about current pending posting list */
+		state->htids = NULL;
+		state->nhtids = 0;
+		state->nitems = 0;
+		state->alltupsize = 0;
+		state->overlap = false;
+		/* Metadata about based tuple of current pending posting list */
+		state->base = NULL;
+		state->baseoff = InvalidOffsetNumber;
+		state->basetupsize = 0;
+
+		/* Conservatively size array */
+		state->htids = palloc(state->maxitemsize);
+
+		/*
+		 * Iterate over tuples on the page belonging to the interval to
+		 * deduplicate them into a posting list.
+		 */
+		for (offnum = xlrec->baseoff;
+			 offnum < xlrec->baseoff + xlrec->nitems;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemid = PageGetItemId(page, offnum);
+			IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+			Assert(!ItemIdIsDead(itemid));
+
+			if (offnum == xlrec->baseoff)
+			{
+				/*
+				 * No previous/base tuple for first data item -- use first
+				 * data item as base tuple of first pending posting list
+				 */
+				_bt_dedup_start_pending(state, itup, offnum);
+			}
+			else
+			{
+				/* Heap TID(s) for itup will be saved in state */
+				if (!_bt_dedup_save_htid(state, itup))
+					elog(ERROR, "could not add heap tid to pending posting list");
+			}
+		}
+
+		Assert(state->nitems == xlrec->nitems);
+		/* Handle the last item */
+		_bt_dedup_finish_pending(buf, state, false);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buf);
+	}
+
+	if (BufferIsValid(buf))
+		UnlockReleaseBuffer(buf);
+}
+
 static void
 btree_xlog_vacuum(XLogReaderState *record)
 {
@@ -386,8 +534,8 @@ btree_xlog_vacuum(XLogReaderState *record)
 	Buffer		buffer;
 	Page		page;
 	BTPageOpaque opaque;
-#ifdef UNUSED
 	xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
 
 	/*
 	 * This section of code is thought to be no longer needed, after analysis
@@ -478,14 +626,34 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		if (len > 0)
 		{
-			OffsetNumber *unused;
-			OffsetNumber *unend;
+			if (xlrec->nupdated > 0)
+			{
+				OffsetNumber *updatedoffsets;
+				IndexTuple	updated;
+				Size		itemsz;
 
-			unused = (OffsetNumber *) ptr;
-			unend = (OffsetNumber *) ((char *) ptr + len);
+				updatedoffsets = (OffsetNumber *)
+					(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+				updated = (IndexTuple) ((char *) updatedoffsets +
+										xlrec->nupdated * sizeof(OffsetNumber));
 
-			if ((unend - unused) > 0)
-				PageIndexMultiDelete(page, unused, unend - unused);
+				/* Handle posting tuples */
+				for (int i = 0; i < xlrec->nupdated; i++)
+				{
+					PageIndexTupleDelete(page, updatedoffsets[i]);
+
+					itemsz = MAXALIGN(IndexTupleSize(updated));
+
+					if (PageAddItem(page, (Item) updated, itemsz, updatedoffsets[i],
+									false, false) == InvalidOffsetNumber)
+						elog(PANIC, "btree_xlog_vacuum: failed to add updated posting list item");
+
+					updated = (IndexTuple) ((char *) updated + itemsz);
+				}
+			}
+
+			if (xlrec->ndeleted)
+				PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 		}
 
 		/*
@@ -820,7 +988,9 @@ void
 btree_redo(XLogReaderState *record)
 {
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	MemoryContext oldCtx;
 
+	oldCtx = MemoryContextSwitchTo(opCtx);
 	switch (info)
 	{
 		case XLOG_BTREE_INSERT_LEAF:
@@ -838,6 +1008,9 @@ btree_redo(XLogReaderState *record)
 		case XLOG_BTREE_SPLIT_R:
 			btree_xlog_split(false, record);
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			btree_xlog_dedup(record);
+			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
 			break;
@@ -863,6 +1036,23 @@ btree_redo(XLogReaderState *record)
 		default:
 			elog(PANIC, "btree_redo: unknown op code %u", info);
 	}
+	MemoryContextSwitchTo(oldCtx);
+	MemoryContextReset(opCtx);
+}
+
+void
+btree_xlog_startup(void)
+{
+	opCtx = AllocSetContextCreate(CurrentMemoryContext,
+								  "Btree recovery temporary context",
+								  ALLOCSET_DEFAULT_SIZES);
+}
+
+void
+btree_xlog_cleanup(void)
+{
+	MemoryContextDelete(opCtx);
+	opCtx = NULL;
 }
 
 /*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 4ee6d04a68..1dde2da285 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -30,7 +30,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_insert *xlrec = (xl_btree_insert *) rec;
 
-				appendStringInfo(buf, "off %u", xlrec->offnum);
+				appendStringInfo(buf, "off %u; postingoff %u",
+								 xlrec->offnum, xlrec->postingoff);
 				break;
 			}
 		case XLOG_BTREE_SPLIT_L:
@@ -38,16 +39,30 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
-				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d",
-								 xlrec->level, xlrec->firstright, xlrec->newitemoff);
+				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, postingoff %d",
+								 xlrec->level,
+								 xlrec->firstright,
+								 xlrec->newitemoff,
+								 xlrec->postingoff);
+				break;
+			}
+		case XLOG_BTREE_DEDUP_PAGE:
+			{
+				xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+				appendStringInfo(buf, "baseoff %u; nitems %u",
+								 xlrec->baseoff,
+								 xlrec->nitems);
 				break;
 			}
 		case XLOG_BTREE_VACUUM:
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "lastBlockVacuumed %u",
-								 xlrec->lastBlockVacuumed);
+				appendStringInfo(buf, "lastBlockVacuumed %u; nupdated %u; ndeleted %u",
+								 xlrec->lastBlockVacuumed,
+								 xlrec->nupdated,
+								 xlrec->ndeleted);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
@@ -131,6 +146,9 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			id = "DEDUPLICATE";
+			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 2b1e3cda4a..bf4a27ab75 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1677,14 +1677,14 @@ psql_completion(const char *text, int start, int end)
 	/* ALTER INDEX <foo> SET|RESET ( */
 	else if (Matches("ALTER", "INDEX", MatchAny, "RESET", "("))
 		COMPLETE_WITH("fillfactor",
-					  "vacuum_cleanup_index_scale_factor",	/* BTREE */
+					  "vacuum_cleanup_index_scale_factor", "deduplication",	/* BTREE */
 					  "fastupdate", "gin_pending_list_limit",	/* GIN */
 					  "buffering",	/* GiST */
 					  "pages_per_range", "autosummarize"	/* BRIN */
 			);
 	else if (Matches("ALTER", "INDEX", MatchAny, "SET", "("))
 		COMPLETE_WITH("fillfactor =",
-					  "vacuum_cleanup_index_scale_factor =",	/* BTREE */
+					  "vacuum_cleanup_index_scale_factor =", "deduplication =",	/* BTREE */
 					  "fastupdate =", "gin_pending_list_limit =",	/* GIN */
 					  "buffering =",	/* GiST */
 					  "pages_per_range =", "autosummarize ="	/* BRIN */
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 3542545de5..cfdc968c6d 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -145,6 +145,7 @@ static void bt_tuple_present_callback(Relation index, ItemPointer tid,
 									  bool tupleIsAlive, void *checkstate);
 static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 									 IndexTuple itup);
+static inline IndexTuple bt_posting_logical_tuple(IndexTuple itup, int n);
 static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 											   OffsetNumber offset);
@@ -419,12 +420,13 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		/*
 		 * Size Bloom filter based on estimated number of tuples in index,
 		 * while conservatively assuming that each block must contain at least
-		 * MaxIndexTuplesPerPage / 5 non-pivot tuples.  (Non-leaf pages cannot
-		 * contain non-pivot tuples.  That's okay because they generally make
-		 * up no more than about 1% of all pages in the index.)
+		 * MaxBTreeIndexTuplesPerPage / 3 "logical" tuples.  heapallindexed
+		 * verification fingerprints posting list heap TIDs as plain non-pivot
+		 * tuples, complete with index keys.  This allows its heap scan to
+		 * behave as if posting lists do not exist.
 		 */
 		total_pages = RelationGetNumberOfBlocks(rel);
-		total_elems = Max(total_pages * (MaxIndexTuplesPerPage / 5),
+		total_elems = Max(total_pages * (MaxBTreeIndexTuplesPerPage / 3),
 						  (int64) state->rel->rd_rel->reltuples);
 		/* Random seed relies on backend srandom() call to avoid repetition */
 		seed = random();
@@ -924,6 +926,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer	scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -994,29 +997,73 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
 			char	   *itid,
 					   *htid;
+			ItemPointer tid = BTreeTupleGetHeapTID(itup);
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+							ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("could not find tuple using search from root page in index \"%s\"",
 							RelationGetRelationName(state->rel)),
-					 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+					 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
 										itid, htid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is actually a posting list, make sure posting list TIDs
+		 * are in order.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer		current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+				current = BTreeTupleGetPostingN(itup, i);
+
+				if (ItemPointerCompare(current, &last) <= 0)
+				{
+					char	   *itid,
+							   *htid;
+
+					itid = psprintf("(%u,%u)", state->targetblock, offset);
+					htid = psprintf("(%u,%u)",
+									ItemPointerGetBlockNumberNoCheck(current),
+									ItemPointerGetOffsetNumberNoCheck(current));
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+												itid, htid,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1074,12 +1121,32 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				/* Fingerprint all elements as distinct "logical" tuples */
+				for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					IndexTuple	logtuple;
+
+					logtuple = bt_posting_logical_tuple(itup, i);
+					norm = bt_normalize_tuple(state, logtuple);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != logtuple)
+						pfree(norm);
+					pfree(logtuple);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
@@ -1087,7 +1154,8 @@ bt_target_page_check(BtreeCheckState *state)
 		 *
 		 * If there is a high key (if this is not the rightmost page on its
 		 * entire level), check that high key actually is upper bound on all
-		 * page items.
+		 * page items.  If this is a posting list tuple, we'll need to set
+		 * scantid to be highest TID in posting list.
 		 *
 		 * We prefer to check all items against high key rather than checking
 		 * just the last and trusting that the operator class obeys the
@@ -1127,6 +1195,9 @@ bt_target_page_check(BtreeCheckState *state)
 		 * tuple. (See also: "Notes About Data Representation" in the nbtree
 		 * README.)
 		 */
+		scantid = skey->scantid;
+		if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+			skey->scantid = BTreeTupleGetMaxHeapTID(itup);
 		if (!P_RIGHTMOST(topaque) &&
 			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
 			  invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1221,7 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
+		skey->scantid = scantid;
 
 		/*
 		 * * Item order check *
@@ -1164,11 +1236,13 @@ bt_target_page_check(BtreeCheckState *state)
 					   *htid,
 					   *nitid,
 					   *nhtid;
+			ItemPointer tid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetHeapTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 			nitid = psprintf("(%u,%u)", state->targetblock,
 							 OffsetNumberNext(offset));
 
@@ -1177,9 +1251,11 @@ bt_target_page_check(BtreeCheckState *state)
 										  state->target,
 										  OffsetNumberNext(offset));
 			itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+			tid = BTreeTupleGetHeapTID(itup);
 			nhtid = psprintf("(%u,%u)",
-							 ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							 ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							 ItemPointerGetBlockNumberNoCheck(tid),
+							 ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1265,10 @@ bt_target_page_check(BtreeCheckState *state)
 										"higher index tid=%s (points to %s tid=%s) "
 										"page lsn=%X/%X.",
 										itid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										htid,
 										nitid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										nhtid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
@@ -1953,10 +2029,10 @@ bt_tuple_present_callback(Relation index, ItemPointer tid, Datum *values,
  * verification.  In particular, it won't try to normalize opclass-equal
  * datums with potentially distinct representations (e.g., btree/numeric_ops
  * index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though.  For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation.  Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
  */
 static IndexTuple
 bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -1969,6 +2045,9 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	IndexTuple	reformed;
 	int			i;
 
+	/* Caller should only pass "logical" non-pivot tuples here */
+	Assert(!BTreeTupleIsPosting(itup) && !BTreeTupleIsPivot(itup));
+
 	/* Easy case: It's immediately clear that tuple has no varlena datums */
 	if (!IndexTupleHasVarwidths(itup))
 		return itup;
@@ -2031,6 +2110,30 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	return reformed;
 }
 
+/*
+ * Produce palloc()'d "logical" tuple for nth posting list entry.
+ *
+ * In general, deduplication is not supposed to change the logical contents of
+ * an index.  Multiple logical index tuples are folded together into one
+ * physical posting list index tuple when convenient.
+ *
+ * heapallindexed verification must normalize-away this variation in
+ * representation by converting posting list tuples into two or more "logical"
+ * tuples.  Each logical tuple must be fingerprinted separately -- there must
+ * be one logical tuple for each corresponding Bloom filter probe during the
+ * heap scan.
+ *
+ * Note: Caller needs to call bt_normalize_tuple() with returned tuple.
+ */
+static inline IndexTuple
+bt_posting_logical_tuple(IndexTuple itup, int n)
+{
+	Assert(BTreeTupleIsPosting(itup));
+
+	/* Returns non-posting-list tuple */
+	return _bt_form_posting(itup, BTreeTupleGetPostingN(itup, n), 1);
+}
+
 /*
  * Search for itup in index, starting from fast root page.  itup must be a
  * non-pivot tuple.  This is only supported with heapkeyspace indexes, since
@@ -2087,6 +2190,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		insertstate.itup = itup;
 		insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 		insertstate.itup_key = key;
+		insertstate.postingoff = 0;
 		insertstate.bounds_valid = false;
 		insertstate.buf = lbuf;
 
@@ -2094,7 +2198,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		offnum = _bt_binsrch_insert(state->rel, &insertstate);
 		/* Compare first >= matching item on leaf page, if any */
 		page = BufferGetPage(lbuf);
+		/* Should match on first heap TID when tuple has a posting list */
 		if (offnum <= PageGetMaxOffsetNumber(page) &&
+			insertstate.postingoff <= 0 &&
 			_bt_compare(state->rel, key, page, offnum) == 0)
 			exists = true;
 		_bt_relbuf(state->rel, lbuf);
@@ -2548,26 +2654,29 @@ PageGetItemIdCareful(BtreeCheckState *state, BlockNumber block, Page page,
 }
 
 /*
- * BTreeTupleGetHeapTID() wrapper that lets caller enforce that a heap TID must
- * be present in cases where that is mandatory.
- *
- * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
- * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
- * It may become more useful in the future, when non-pivot tuples support their
- * own alternative INDEX_ALT_TID_MASK representation.
+ * BTreeTupleGetHeapTID() wrapper that enforces that a heap TID is present in
+ * cases where that is mandatory (i.e. for non-pivot tuples).
  */
 static inline ItemPointer
 BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
 							bool nonpivot)
 {
-	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	ItemPointer result;
 	BlockNumber targetblock = state->targetblock;
 
-	if (result == NULL && nonpivot)
+	Assert(state->heapkeyspace);
+
+	/*
+	 * Make sure that tuple type (pivot vs non-pivot) matches caller's
+	 * expectation
+	 */
+	if (BTreeTupleIsPivot(itup) == nonpivot)
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
 						targetblock, RelationGetRelationName(state->rel))));
 
+	result = BTreeTupleGetHeapTID(itup);
+
 	return result;
 }
diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
index 5881ea5dd6..a231bbe1f2 100644
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@@ -433,11 +433,55 @@ returns bool
 <sect1 id="btree-implementation">
  <title>Implementation</title>
 
+ <para>
+  Internally, a B-tree index consists of a tree structure with leaf
+  pages.  Each leaf page contains tuples that point to table entries
+  using a heap item pointer. Each tuple's key is unique, since the
+  item pointer is treated as part of the key.
+ </para>
+ <para>
+  An introduction to the btree index implementation can be found in
+  <filename>src/backend/access/nbtree/README</filename>.
+ </para>
+
+ <sect2 id="btree-deduplication">
+  <title>Deduplication</title>
   <para>
-   An introduction to the btree index implementation can be found in
-   <filename>src/backend/access/nbtree/README</filename>.
+   B-Tree supports <firstterm>deduplication</firstterm>.  Existing
+   leaf page tuples with fully equal keys prior to the heap item
+   pointer are folded together into a compressed representation called
+   a <quote>posting list</quote>. The user-visible keys appear only
+   once, followed by a simple list of heap item pointers.  Posting
+   lists are formed at the point where an insertion would otherwise
+   have to split the page.  This can greatly increase index space
+   efficiency with data sets where each distinct key appears a few
+   times on average.  Cases that don't benefit will incur a small
+   performance penalty.
+  </para>
+  <para>
+   Deduplication can only be used with indexes that use B-Tree
+   operator classes that were declared <literal>BITWISE</literal>.
+   Deduplication is not supported with nondeterministic collations,
+   nor is it supported with <literal>INCLUDE</literal> indexes.  The
+   deduplication storage parameter must be set to
+   <literal>ON</literal> for new posting lists to be formed
+   (deduplication is enabled by default in the case of non-unique
+   indexes).
+  </para>
+ </sect2>
+
+ <sect2 id="btree-deduplication-unique">
+  <title>Unique indexes and deduplication</title>
+
+  <para>
+   Unique indexes can also use deduplication.  This can be useful with
+   unique indexes that are prone to becoming bloated despite
+   aggressive vacuuming.  Deduplication may delay leaf page splits for
+   long enough that vacuuming can prevent unnecesary page splits
+   altogether.
   </para>
 
+ </sect2>
 </sect1>
 
 </chapter>
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 55669b5cad..9f371d3e3a 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -928,10 +928,11 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
      nondeterministic collations give a more <quote>correct</quote> behavior,
      especially when considering the full power of Unicode and its many
      special cases, they also have some drawbacks.  Foremost, their use leads
-     to a performance penalty.  Also, certain operations are not possible with
-     nondeterministic collations, such as pattern matching operations.
-     Therefore, they should be used only in cases where they are specifically
-     wanted.
+     to a performance penalty.  Note, in particular, that B-tree cannot use
+     deduplication with indexes that use a nondeterministic collation.  Also,
+     certain operations are not possible with nondeterministic collations,
+     such as pattern matching operations.  Therefore, they should be used
+     only in cases where they are specifically wanted.
     </para>
    </sect3>
   </sect2>
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 629a31ef79..2261226965 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -166,6 +166,8 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
         maximum size allowed for the index type, data insertion will fail.
         In any case, non-key columns duplicate data from the index's table
         and bloat the size of the index, thus potentially slowing searches.
+        Moreover, B-tree deduplication is never used with indexes that
+        have a non-key column.
        </para>
 
        <para>
@@ -388,10 +390,38 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    </variablelist>
 
    <para>
-    B-tree indexes additionally accept this parameter:
+    B-tree indexes also accept these parameters:
    </para>
 
    <variablelist>
+   <varlistentry id="index-reloption-deduplication" xreflabel="deduplication">
+    <term><literal>deduplication</literal>
+     <indexterm>
+      <primary><varname>deduplication</varname></primary>
+      <secondary>storage parameter</secondary>
+     </indexterm>
+    </term>
+    <listitem>
+    <para>
+      This setting controls usage of the B-tree deduplication
+      technique described in <xref linkend="btree-deduplication"/>.
+      Defaults to <literal>ON</literal> for non-unique indexes, and
+      <literal>OFF</literal> for unique indexes.  (Alternative
+      spellings of <literal>ON</literal> and <literal>OFF</literal>
+      are allowed as described in <xref linkend="config-setting"/>.)
+    </para>
+
+    <note>
+     <para>
+      Turning <literal>deduplication</literal> off via <command>ALTER
+      INDEX</command> prevents future insertions from triggering
+      deduplication, but does not in itself make existing posting list
+      tuples use the standard tuple representation.
+     </para>
+    </note>
+    </listitem>
+   </varlistentry>
+
    <varlistentry id="index-reloption-vacuum-cleanup-index-scale-factor" xreflabel="vacuum_cleanup_index_scale_factor">
     <term><literal>vacuum_cleanup_index_scale_factor</literal>
      <indexterm>
@@ -446,9 +476,7 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
      This setting controls usage of the fast update technique described in
      <xref linkend="gin-fast-update"/>.  It is a Boolean parameter:
      <literal>ON</literal> enables fast update, <literal>OFF</literal> disables it.
-     (Alternative spellings of <literal>ON</literal> and <literal>OFF</literal> are
-     allowed as described in <xref linkend="config-setting"/>.)  The
-     default is <literal>ON</literal>.
+     The default is <literal>ON</literal>.
     </para>
 
     <note>
@@ -831,6 +859,13 @@ CREATE UNIQUE INDEX title_idx ON films (title) WITH (fillfactor = 70);
 </programlisting>
   </para>
 
+  <para>
+   To create a unique index with deduplication enabled:
+<programlisting>
+CREATE UNIQUE INDEX title_idx ON films (title) WITH (deduplication = on);
+</programlisting>
+  </para>
+
   <para>
    To create a <acronym>GIN</acronym> index with fast updates disabled:
 <programlisting>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 10881ab03a..c9a5349019 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -58,8 +58,9 @@ REINDEX [ ( VERBOSE ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } [ CONCURR
 
     <listitem>
      <para>
-      You have altered a storage parameter (such as fillfactor)
-      for an index, and wish to ensure that the change has taken full effect.
+      You have altered a storage parameter (such as fillfactor or
+      deduplication) for an index, and wish to ensure that the change has
+      taken full effect.
      </para>
     </listitem>
 
-- 
2.17.1

#104

Robert Haas

robertmhaas@gmail.com

about 6 years ago

In reply to: Peter Geoghegan (#103)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Tue, Nov 12, 2019 at 6:22 PM Peter Geoghegan <pg@bowt.ie> wrote:

* Disabled deduplication in system catalog indexes by deeming it
generally unsafe.

I (continue to) think that deduplication is a terrible name, because
you're not getting rid of the duplicates. You are using a compressed
representation of the duplicates.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#105

Peter Geoghegan

pg@bowt.ie

about 6 years ago

In reply to: Robert Haas (#104)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Wed, Nov 13, 2019 at 11:33 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Nov 12, 2019 at 6:22 PM Peter Geoghegan <pg@bowt.ie> wrote:

* Disabled deduplication in system catalog indexes by deeming it
generally unsafe.

I (continue to) think that deduplication is a terrible name, because
you're not getting rid of the duplicates. You are using a compressed
representation of the duplicates.

"Deduplication" never means that you get rid of duplicates. According
to Wikipedia's deduplication article: "Whereas compression algorithms
identify redundant data inside individual files and encodes this
redundant data more efficiently, the intent of deduplication is to
inspect large volumes of data and identify large sections – such as
entire files or large sections of files – that are identical, and
replace them with a shared copy".

This seemed like it fit what this patch does. We're concerned with a
specific, simple kind of redundancy. Also:

* From the user's point of view, we're merging together what they'd
call duplicates. They don't really think of the heap TID as part of
the key.

* The term "compression" suggests a decompression penalty when
reading, which is not the case here.

* The term "compression" confuses the feature added by the patch with
TOAST compression. Now we may have two very different varieties of
compression in the same index.

Can you suggest an alternative?

--
Peter Geoghegan

#106

Robert Haas

robertmhaas@gmail.com

about 6 years ago

In reply to: Peter Geoghegan (#105)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Wed, Nov 13, 2019 at 2:51 PM Peter Geoghegan <pg@bowt.ie> wrote:

"Deduplication" never means that you get rid of duplicates. According
to Wikipedia's deduplication article: "Whereas compression algorithms
identify redundant data inside individual files and encodes this
redundant data more efficiently, the intent of deduplication is to
inspect large volumes of data and identify large sections – such as
entire files or large sections of files – that are identical, and
replace them with a shared copy".

Hmm. Well, maybe I'm just behind the times. But that same wikipedia
article also says that deduplication works on large chunks "such as
entire files or large sections of files" thus differentiating it from
compression algorithms which work on the byte level, so it seems to me
that what you are doing still sounds more like ad-hoc compression.

Can you suggest an alternative?

My instinct is to pick a name that somehow involves compression and
just put enough other words in there to make it clear e.g. duplicate
value compression, or something of that sort.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#107

Peter Geoghegan

pg@bowt.ie

about 6 years ago

In reply to: Robert Haas (#106)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Fri, Nov 15, 2019 at 5:16 AM Robert Haas <robertmhaas@gmail.com> wrote:

Hmm. Well, maybe I'm just behind the times. But that same wikipedia
article also says that deduplication works on large chunks "such as
entire files or large sections of files" thus differentiating it from
compression algorithms which work on the byte level, so it seems to me
that what you are doing still sounds more like ad-hoc compression.

I see your point.

One reason for my avoiding the word "compression" is that other DB
systems that have something similar don't use the word compression
either. Actually, they don't really call it *anything*. Posting lists
are simply the way that secondary indexes work. The "Modern B-Tree
techniques" book/survey paper mentions the idea of using a TID list in
its "3.7 Duplicate Key Values" section, not in the two related
sections that follow ("Bitmap Indexes", and "Data Compression").

That doesn't seem like a very good argument, now that I've typed it
out. The patch applies deduplication/compression/whatever at the point
where we'd otherwise have to split the page, unlike GIN. GIN eagerly
maintains posting lists (doing in-place updates for most insertions
seems pretty bad to me). My argument could reasonably be made about
GIN, which really does consider posting lists the natural way to store
duplicate tuples. I cannot really make that argument about nbtree with
this patch, though -- delaying a page split by re-encoding tuples
(changing their physical representation without changing their logical
contents) justifies using the word "compression" in the name.

Can you suggest an alternative?

My instinct is to pick a name that somehow involves compression and
just put enough other words in there to make it clear e.g. duplicate
value compression, or something of that sort.

Does anyone else want to weigh in on this? Anastasia?

I will go along with whatever the consensus is. I'm very close to the
problem we're trying to solve, which probably isn't helping me here.

--
Peter Geoghegan

#108

Peter Geoghegan

pg@bowt.ie

about 6 years ago

In reply to: Peter Geoghegan (#81)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Wed, Sep 11, 2019 at 2:04 PM Peter Geoghegan <pg@bowt.ie> wrote:

I haven't measured how these changes affect WAL size yet.
Do you have any suggestions on how to automate testing of new WAL records?
Is there any suitable place in regression tests?

I don't know about the regression tests (I doubt that there is a
natural place for such a test), but I came up with a rough test case.
I more or less copied the approach that you took with the index build
WAL reduction patches, though I also figured out a way of subtracting
heapam WAL overhead to get a real figure. I attach the test case --
note that you'll need to use the "land" database with this. (This test
case might need to be improved, but it's a good start.)

I used a test script similar to the "nbtree_wal_test.sql" test script
I posted on September 11th today. I am concerned about the WAL
overhead for cases that don't benefit from the patch (usually because
they turn off deduplication altogether). The details of the index
tested were different this time, though. I used an index that had the
smallest possible tuple size: 16 bytes (this is the smallest possible
size on 64-bit systems, but that's what almost everybody uses these
days). So any index with one or two int4 columns (or one int8 column)
will generally have 16 byte IndexTuples, at least when there are no
NULLs in the index. In general, 16 byte wide tuples are very, very
common.

What I saw suggests that we will need to remove the new "postingoff"
field from xl_btree_insert. (We can create a new XLog record for leaf
page inserts that also need to split a posting list, without changing
much else.)

The way that *alignment* of WAL records affects these common 16 byte
IndexTuple cases is the real problem. Adding "postingoff" to
xl_btree_insert increases the WAL required for INSERT_LEAF records by
two bytes (sizeof(OffsetNumber)), as you'd expect -- pg_waldump output
shows that they're 66 bytes, whereas they're only 64 bytes on the
master branch. That doesn't sound that bad, but once you consider the
alignment of whole records, it's really an extra 8 bytes. That is
totally unacceptable. The vast majority of nbtree WAL records are
bound to be INSERT_LEAF records, so as things stand we have added
(almost) 12.5% space overhead to nbtree for these common cases, that
don't benefit.

I haven't really looked into other types of WAL record just yet. The
real world overhead that we're adding to xl_btree_vacuum records is
something that I will have to look into separately. I'm already pretty
sure that adding two bytes to xl_btree_split is okay, though, because
they're far less numerous than xl_btree_insert records, and aren't
affected by alignment in the same way (they're already several hundred
bytes in almost all cases).

I also noticed something positive: The overhead of xl_btree_dedup WAL
records seems to be very low with indexes that have hundreds of
logical tuples for each distinct integer value. We don't seem to have
a problem with "deduplication thrashing".

--
Peter Geoghegan

#109

Mark Dilger

hornschnorter@gmail.com

about 6 years ago

In reply to: Peter Geoghegan (#105)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On 11/13/19 11:51 AM, Peter Geoghegan wrote:

Can you suggest an alternative?

Dupression

--
Mark Dilger

#110

Peter Geoghegan

pg@bowt.ie

about 6 years ago

In reply to: Mark Dilger (#109)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Fri, Nov 15, 2019 at 5:43 PM Mark Dilger <hornschnorter@gmail.com> wrote:

On 11/13/19 11:51 AM, Peter Geoghegan wrote:

Can you suggest an alternative?

Dupression

This suggestion makes me feel better about "deduplication".

--
Peter Geoghegan

#111

Peter Geoghegan

pg@bowt.ie

about 6 years ago

In reply to: Oleg Bartunov (#85)

Re: [HACKERS] [PROPOSAL] Effective storage of duplicates in B-tree index.

On Sun, Sep 15, 2019 at 3:47 AM Oleg Bartunov <obartunov@postgrespro.ru> wrote:

Is it worth to make a provision to add an ability to control how
duplicates are sorted ?

Duplicates will continue to be sorted based on TID, in effect. We want
to preserve the ability to perform retail index tuple deletion. I
believe that that will become important in the future.

If we speak about GIN, why not take into
account our experiments with RUM (https://github.com/postgrespro/rum)
?

FWIW, I think that it's confusing that RUM almost shares its name with
the "RUM conjecture":

http://daslab.seas.harvard.edu/rum-conjecture/

--
Peter Geoghegan

#112

nospam-abuse@bloodgate.com

about 6 years ago

In reply to: Peter Geoghegan (#107)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

Moin,

On 2019-11-16 01:04, Peter Geoghegan wrote:

On Fri, Nov 15, 2019 at 5:16 AM Robert Haas <robertmhaas@gmail.com>
wrote:

Hmm. Well, maybe I'm just behind the times. But that same wikipedia
article also says that deduplication works on large chunks "such as
entire files or large sections of files" thus differentiating it from
compression algorithms which work on the byte level, so it seems to me
that what you are doing still sounds more like ad-hoc compression.

I see your point.

One reason for my avoiding the word "compression" is that other DB
systems that have something similar don't use the word compression
either. Actually, they don't really call it *anything*. Posting lists
are simply the way that secondary indexes work. The "Modern B-Tree
techniques" book/survey paper mentions the idea of using a TID list in
its "3.7 Duplicate Key Values" section, not in the two related
sections that follow ("Bitmap Indexes", and "Data Compression").

That doesn't seem like a very good argument, now that I've typed it
out. The patch applies deduplication/compression/whatever at the point
where we'd otherwise have to split the page, unlike GIN. GIN eagerly
maintains posting lists (doing in-place updates for most insertions
seems pretty bad to me). My argument could reasonably be made about
GIN, which really does consider posting lists the natural way to store
duplicate tuples. I cannot really make that argument about nbtree with
this patch, though -- delaying a page split by re-encoding tuples
(changing their physical representation without changing their logical
contents) justifies using the word "compression" in the name.

Can you suggest an alternative?

My instinct is to pick a name that somehow involves compression and
just put enough other words in there to make it clear e.g. duplicate
value compression, or something of that sort.

Does anyone else want to weigh in on this? Anastasia?

I will go along with whatever the consensus is. I'm very close to the
problem we're trying to solve, which probably isn't helping me here.

I'm in favor of deduplication and not compression. Compression is a more
generic term and can involve deduplication, but it hasn't to do so. (It
could for instance just encode things in a more compact form). While
deduplication does not involve compression, it just means store multiple
things once, which by coincidence also amounts to using less space like
compression can do.

ZFS also follows this by having both deduplication (store the same
blocks only once with references) and compression (compress block
contents, regardless wether they are stored once or many times).

So my vote is for deduplication (if I understand the thread correctly
this is what the code no does, by storing the exact same key not that
many times but only once with references or a count?).

best regards,

Tels

#113

Peter Geoghegan

pg@bowt.ie

about 6 years ago

In reply to: Peter Geoghegan (#108)

2 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Fri, Nov 15, 2019 at 5:02 PM Peter Geoghegan <pg@bowt.ie> wrote:

What I saw suggests that we will need to remove the new "postingoff"
field from xl_btree_insert. (We can create a new XLog record for leaf
page inserts that also need to split a posting list, without changing
much else.)

Attached is v24. This revision doesn't fix the problem with
xl_btree_insert record bloat, but it does fix the bitrot against the
master branch that was caused by commit 50d22de9. (This patch has had
a surprisingly large number of conflicts against the master branch
recently.)

Other changes:

* The pageinspect patch has been cleaned up. I now propose that it be
committed alongside the main patch.

The big change here is that posting lists are represented as an array
of TIDs within bt_page_items(), much like gin_leafpage_items(). Also
added documentation that goes into the ways in which ctid can be used
to encode information (arguably some of this should have been included
with the Postgres 12 B-Tree work).

* Basic tests that cover deduplication within unique indexes. We ought
to have code coverage of the case where _bt_check_unique() has to step
right (actually, we don't have that on the master branch either).

--
Peter Geoghegan

Attachments:

v24-0002-Teach-pageinspect-about-nbtree-posting-lists.patchapplication/octet-stream; name=v24-0002-Teach-pageinspect-about-nbtree-posting-lists.patchDownload

From b9835f1bf8426b50cbd0fc0b0804101f91efc9a6 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v24 2/2] Teach pageinspect about nbtree posting lists.

Add a column for posting list TIDs to bt_page_items().  Also add a
column that displays a single heap TID value for each tuple, regardless
of whether or not "ctid" is used for heap TID.  In the case of posting
list tuples, the value is the lowest heap TID in the posting list.
Arguably I should have done this when commit dd299df8 went in, since
that added a pivot tuple representation that could have a heap TID but
didn't use ctid for that purpose.

Also add a boolean column that displays the LP_DEAD bit value for each
non-pivot tuple.

No version bump for the pageinspect extension, since there hasn't been a
stable release since the last version bump (see commit 58b4cb30).
---
 contrib/pageinspect/btreefuncs.c              | 110 +++++++++++++++---
 contrib/pageinspect/expected/btree.out        |   6 +
 contrib/pageinspect/pageinspect--1.7--1.8.sql |  36 ++++++
 doc/src/sgml/pageinspect.sgml                 |  80 +++++++------
 4 files changed, 180 insertions(+), 52 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 78cdc69ec7..418eef032d 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -31,9 +31,11 @@
 #include "access/relation.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_am.h"
+#include "catalog/pg_type.h"
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "pageinspect.h"
+#include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
 #include "utils/varlena.h"
@@ -45,6 +47,8 @@ PG_FUNCTION_INFO_V1(bt_page_stats);
 
 #define IS_INDEX(r) ((r)->rd_rel->relkind == RELKIND_INDEX)
 #define IS_BTREE(r) ((r)->rd_rel->relam == BTREE_AM_OID)
+#define DatumGetItemPointer(X)	 ((ItemPointer) DatumGetPointer(X))
+#define ItemPointerGetDatum(X)	 PointerGetDatum(X)
 
 /* note: BlockNumber is unsigned, hence can't be negative */
 #define CHECK_RELATION_BLOCK_RANGE(rel, blkno) { \
@@ -243,6 +247,9 @@ struct user_args
 {
 	Page		page;
 	OffsetNumber offset;
+	bool		leafpage;
+	bool		rightmost;
+	TupleDesc	tupd;
 };
 
 /*-------------------------------------------------------
@@ -252,17 +259,24 @@ struct user_args
  * ------------------------------------------------------
  */
 static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, struct user_args *uargs)
 {
-	char	   *values[6];
+	Page		page = uargs->page;
+	OffsetNumber offset = uargs->offset;
+	bool		leafpage = uargs->leafpage;
+	bool		rightmost = uargs->rightmost;
+	bool		pivotoffset;
+	Datum		values[9];
+	bool		nulls[9];
 	HeapTuple	tuple;
 	ItemId		id;
 	IndexTuple	itup;
 	int			j;
 	int			off;
 	int			dlen;
-	char	   *dump;
+	char	   *dump, *datacstring;
 	char	   *ptr;
+	ItemPointer htid;
 
 	id = PageGetItemId(page, offset);
 
@@ -272,18 +286,27 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	itup = (IndexTuple) PageGetItem(page, id);
 
 	j = 0;
-	values[j++] = psprintf("%d", offset);
-	values[j++] = psprintf("(%u,%u)",
-						   ItemPointerGetBlockNumberNoCheck(&itup->t_tid),
-						   ItemPointerGetOffsetNumberNoCheck(&itup->t_tid));
-	values[j++] = psprintf("%d", (int) IndexTupleSize(itup));
-	values[j++] = psprintf("%c", IndexTupleHasNulls(itup) ? 't' : 'f');
-	values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
+	memset(nulls, 0, sizeof(nulls));
+	values[j++] = DatumGetInt16(offset);
+	values[j++] = ItemPointerGetDatum(&itup->t_tid);
+	values[j++] = Int32GetDatum((int) IndexTupleSize(itup));
+	values[j++] = BoolGetDatum(IndexTupleHasNulls(itup));
+	values[j++] = BoolGetDatum(IndexTupleHasVarwidths(itup));
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
 	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+
+	/*
+	 * Make sure that "data" column does not include posting list or pivot
+	 * heap tuple representation
+	 */
+	if (BTreeTupleIsPosting(itup))
+		dlen -= IndexTupleSize(itup) - BTreeTupleGetPostingOffset(itup);
+	else if (BTreeTupleIsPivot(itup) && BTreeTupleGetHeapTID(itup) != NULL)
+		dlen -= MAXALIGN(sizeof(ItemPointerData));
+
 	dump = palloc0(dlen * 3 + 1);
-	values[j] = dump;
+	datacstring = dump;
 	for (off = 0; off < dlen; off++)
 	{
 		if (off > 0)
@@ -291,8 +314,57 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 		sprintf(dump, "%02x", *(ptr + off) & 0xff);
 		dump += 2;
 	}
+	values[j++] = CStringGetTextDatum(datacstring);
+	pfree(datacstring);
 
-	tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
+	/*
+	 * Avoid indicating that pivot tuple from !heapkeyspace index (which won't
+	 * have v4+ status bit set) is dead or has a heap TID -- that can only
+	 * happen with non-pivot tuples.  (Most backend code can use the
+	 * heapkeyspace field from the metapage to figure out which representation
+	 * to use, but we have to be a bit creative here.)
+	 */
+	pivotoffset = (!leafpage || (!rightmost && offset == P_HIKEY));
+
+	/* LP_DEAD status bit */
+	if (!pivotoffset)
+		values[j++] = BoolGetDatum(ItemIdIsDead(id));
+	else
+		nulls[j++] = true;
+
+	htid = BTreeTupleGetHeapTID(itup);
+	if (pivotoffset && !BTreeTupleIsPivot(itup))
+		htid = NULL;
+
+	if (htid)
+		values[j++] = ItemPointerGetDatum(htid);
+	else
+		nulls[j++] = true;
+
+	if (BTreeTupleIsPosting(itup))
+	{
+		/* build an array of item pointers */
+		ItemPointer tids;
+		Datum	   *tids_datum;
+		int			nposting;
+
+		tids = BTreeTupleGetPosting(itup);
+		nposting = BTreeTupleGetNPosting(itup);
+		tids_datum = (Datum *) palloc(nposting * sizeof(Datum));
+		for (int i = 0; i < nposting; i++)
+			tids_datum[i] = ItemPointerGetDatum(&tids[i]);
+		values[j++] = PointerGetDatum(construct_array(tids_datum,
+													  nposting,
+													  TIDOID,
+													  sizeof(ItemPointerData),
+													  false, 's'));
+		pfree(tids_datum);
+	}
+	else
+		nulls[j++] = true;
+
+	/* Build and return the result tuple */
+	tuple = heap_form_tuple(uargs->tupd, values, nulls);
 
 	return HeapTupleGetDatum(tuple);
 }
@@ -378,12 +450,13 @@ bt_page_items(PG_FUNCTION_ARGS)
 			elog(NOTICE, "page is deleted");
 
 		fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+		uargs->leafpage = P_ISLEAF(opaque);
+		uargs->rightmost = P_RIGHTMOST(opaque);
 
 		/* Build a tuple descriptor for our result type */
 		if (get_call_result_type(fcinfo, NULL, &tupleDesc) != TYPEFUNC_COMPOSITE)
 			elog(ERROR, "return type must be a row type");
-
-		fctx->attinmeta = TupleDescGetAttInMetadata(tupleDesc);
+		uargs->tupd = tupleDesc;
 
 		fctx->user_fctx = uargs;
 
@@ -395,7 +468,7 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
@@ -463,12 +536,13 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 			elog(NOTICE, "page is deleted");
 
 		fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+		uargs->leafpage = P_ISLEAF(opaque);
+		uargs->rightmost = P_RIGHTMOST(opaque);
 
 		/* Build a tuple descriptor for our result type */
 		if (get_call_result_type(fcinfo, NULL, &tupleDesc) != TYPEFUNC_COMPOSITE)
 			elog(ERROR, "return type must be a row type");
-
-		fctx->attinmeta = TupleDescGetAttInMetadata(tupleDesc);
+		uargs->tupd = tupleDesc;
 
 		fctx->user_fctx = uargs;
 
@@ -480,7 +554,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..1d45cd5c1e 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -41,6 +41,9 @@ itemlen    | 16
 nulls      | f
 vars       | f
 data       | 01 00 00 00 00 00 00 01
+dead       | f
+htid       | (0,1)
+tids       | 
 
 SELECT * FROM bt_page_items('test1_a_idx', 2);
 ERROR:  block number out of range
@@ -54,6 +57,9 @@ itemlen    | 16
 nulls      | f
 vars       | f
 data       | 01 00 00 00 00 00 00 01
+dead       | f
+htid       | (0,1)
+tids       | 
 
 SELECT * FROM bt_page_items(get_raw_page('test1_a_idx', 2));
 ERROR:  block number 2 is out of range for relation "test1_a_idx"
diff --git a/contrib/pageinspect/pageinspect--1.7--1.8.sql b/contrib/pageinspect/pageinspect--1.7--1.8.sql
index 2a7c4b3516..70f1ab0467 100644
--- a/contrib/pageinspect/pageinspect--1.7--1.8.sql
+++ b/contrib/pageinspect/pageinspect--1.7--1.8.sql
@@ -14,3 +14,39 @@ CREATE FUNCTION heap_tuple_infomask_flags(
 RETURNS record
 AS 'MODULE_PATHNAME', 'heap_tuple_infomask_flags'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items(text, int4)
+--
+DROP FUNCTION bt_page_items(text, int4);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT dead boolean,
+    OUT htid tid,
+    OUT tids tid[])
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items(bytea)
+--
+DROP FUNCTION bt_page_items(bytea);
+CREATE FUNCTION bt_page_items(IN page bytea,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT dead boolean,
+    OUT htid tid,
+    OUT tids tid[])
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items_bytea'
+LANGUAGE C STRICT PARALLEL SAFE;
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 7e2e1487d7..1763e9c6f0 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -329,11 +329,11 @@ test=# SELECT * FROM bt_page_stats('pg_cast_oid_index', 1);
 -[ RECORD 1 ]-+-----
 blkno         | 1
 type          | l
-live_items    | 256
+live_items    | 224
 dead_items    | 0
-avg_item_size | 12
+avg_item_size | 16
 page_size     | 8192
-free_size     | 4056
+free_size     | 3668
 btpo_prev     | 0
 btpo_next     | 0
 btpo          | 0
@@ -356,33 +356,45 @@ btpo_flags    | 3
       <function>bt_page_items</function> returns detailed information about
       all of the items on a B-tree index page.  For example:
 <screen>
-test=# SELECT * FROM bt_page_items('pg_cast_oid_index', 1);
- itemoffset |  ctid   | itemlen | nulls | vars |    data
-------------+---------+---------+-------+------+-------------
-          1 | (0,1)   |      12 | f     | f    | 23 27 00 00
-          2 | (0,2)   |      12 | f     | f    | 24 27 00 00
-          3 | (0,3)   |      12 | f     | f    | 25 27 00 00
-          4 | (0,4)   |      12 | f     | f    | 26 27 00 00
-          5 | (0,5)   |      12 | f     | f    | 27 27 00 00
-          6 | (0,6)   |      12 | f     | f    | 28 27 00 00
-          7 | (0,7)   |      12 | f     | f    | 29 27 00 00
-          8 | (0,8)   |      12 | f     | f    | 2a 27 00 00
+regression=# SELECT * FROM bt_page_items('tenk2_unique1', 5);
+ itemoffset |   ctid   | itemlen | nulls | vars |          data           | dead |   htid   | tids
+------------+----------+---------+-------+------+-------------------------+------+----------+------
+          1 | (40,1)   |      16 | f     | f    | b8 05 00 00 00 00 00 00 |      |          |
+          2 | (58,11)  |      16 | f     | f    | 4a 04 00 00 00 00 00 00 | f    | (58,11)  |
+          3 | (266,4)  |      16 | f     | f    | 4b 04 00 00 00 00 00 00 | f    | (266,4)  |
+          4 | (279,25) |      16 | f     | f    | 4c 04 00 00 00 00 00 00 | f    | (279,25) |
+          5 | (333,11) |      16 | f     | f    | 4d 04 00 00 00 00 00 00 | f    | (333,11) |
+          6 | (87,24)  |      16 | f     | f    | 4e 04 00 00 00 00 00 00 | f    | (87,24)  |
+          7 | (38,22)  |      16 | f     | f    | 4f 04 00 00 00 00 00 00 | f    | (38,22)  |
+          8 | (272,17) |      16 | f     | f    | 50 04 00 00 00 00 00 00 | f    | (272,17) |
 </screen>
-      In a B-tree leaf page, <structfield>ctid</structfield> points to a heap tuple.
-      In an internal page, the block number part of <structfield>ctid</structfield>
-      points to another page in the index itself, while the offset part
-      (the second number) is ignored and is usually 1.
+      In a B-tree leaf page, <structfield>ctid</structfield> usually
+      points to a heap tuple, and <structfield>dead</structfield> may
+      indicate that the item has its <literal>LP_DEAD</literal> bit
+      set.  In an internal page, the block number part of
+      <structfield>ctid</structfield> points to another page in the
+      index itself, while the offset part (the second number) encodes
+      metadata about the tuple.  Posting list tuples on leaf pages
+      also use <structfield>ctid</structfield> for metadata.
+      <structfield>htid</structfield> always shows a single heap TID
+      for the tuple, regardless of how it is represented (internal
+      page tuples may need to store a heap TID when there are many
+      duplicate tuples on descendent leaf pages).
+      <structfield>tids</structfield> is a list of TIDs that is stored
+      within posting list tuples (tuples created by deduplication).
      </para>
      <para>
       Note that the first item on any non-rightmost page (any page with
       a non-zero value in the <structfield>btpo_next</structfield> field) is the
       page's <quote>high key</quote>, meaning its <structfield>data</structfield>
       serves as an upper bound on all items appearing on the page, while
-      its <structfield>ctid</structfield> field is meaningless.  Also, on non-leaf
-      pages, the first real data item (the first item that is not a high
-      key) is a <quote>minus infinity</quote> item, with no actual value
-      in its <structfield>data</structfield> field.  Such an item does have a valid
-      downlink in its <structfield>ctid</structfield> field, however.
+      its <structfield>ctid</structfield> field does not point to
+      another block.  Also, on non-leaf pages, the first real data item
+      (the first item that is not a high key) is a <quote>minus
+      infinity</quote> item, with no actual value in its
+      <structfield>data</structfield> field.  Such an item does have a
+      valid downlink in its <structfield>ctid</structfield> field,
+      however.
      </para>
     </listitem>
    </varlistentry>
@@ -402,17 +414,17 @@ test=# SELECT * FROM bt_page_items('pg_cast_oid_index', 1);
       with <function>get_raw_page</function> should be passed as argument.  So
       the last example could also be rewritten like this:
 <screen>
-test=# SELECT * FROM bt_page_items(get_raw_page('pg_cast_oid_index', 1));
- itemoffset |  ctid   | itemlen | nulls | vars |    data
-------------+---------+---------+-------+------+-------------
-          1 | (0,1)   |      12 | f     | f    | 23 27 00 00
-          2 | (0,2)   |      12 | f     | f    | 24 27 00 00
-          3 | (0,3)   |      12 | f     | f    | 25 27 00 00
-          4 | (0,4)   |      12 | f     | f    | 26 27 00 00
-          5 | (0,5)   |      12 | f     | f    | 27 27 00 00
-          6 | (0,6)   |      12 | f     | f    | 28 27 00 00
-          7 | (0,7)   |      12 | f     | f    | 29 27 00 00
-          8 | (0,8)   |      12 | f     | f    | 2a 27 00 00
+regression=# SELECT * FROM bt_page_items(get_raw_page('tenk2_unique1', 5));
+ itemoffset |   ctid   | itemlen | nulls | vars |          data           | dead |   htid   | tids
+------------+----------+---------+-------+------+-------------------------+------+----------+------
+          1 | (40,1)   |      16 | f     | f    | b8 05 00 00 00 00 00 00 |      |          |
+          2 | (58,11)  |      16 | f     | f    | 4a 04 00 00 00 00 00 00 | f    | (58,11)  |
+          3 | (266,4)  |      16 | f     | f    | 4b 04 00 00 00 00 00 00 | f    | (266,4)  |
+          4 | (279,25) |      16 | f     | f    | 4c 04 00 00 00 00 00 00 | f    | (279,25) |
+          5 | (333,11) |      16 | f     | f    | 4d 04 00 00 00 00 00 00 | f    | (333,11) |
+          6 | (87,24)  |      16 | f     | f    | 4e 04 00 00 00 00 00 00 | f    | (87,24)  |
+          7 | (38,22)  |      16 | f     | f    | 4f 04 00 00 00 00 00 00 | f    | (38,22)  |
+          8 | (272,17) |      16 | f     | f    | 50 04 00 00 00 00 00 00 | f    | (272,17) |
 </screen>
       All the other details are the same as explained in the previous item.
      </para>
-- 
2.17.1

v24-0001-Add-deduplication-to-nbtree.patchapplication/octet-stream; name=v24-0001-Add-deduplication-to-nbtree.patchDownload

From 7c77d41afd91d2021948fd03be82129b9452b9a5 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 25 Sep 2019 10:08:53 -0700
Subject: [PATCH v24 1/2] Add deduplication to nbtree

---
 src/include/access/nbtree.h               | 333 ++++++++--
 src/include/access/nbtxlog.h              |  68 ++-
 src/include/access/rmgrlist.h             |   2 +-
 src/backend/access/common/reloptions.c    |  11 +-
 src/backend/access/index/genam.c          |   4 +
 src/backend/access/nbtree/Makefile        |   1 +
 src/backend/access/nbtree/README          |  74 ++-
 src/backend/access/nbtree/nbtdedup.c      | 710 ++++++++++++++++++++++
 src/backend/access/nbtree/nbtinsert.c     | 321 +++++++++-
 src/backend/access/nbtree/nbtpage.c       | 211 ++++++-
 src/backend/access/nbtree/nbtree.c        | 174 +++++-
 src/backend/access/nbtree/nbtsearch.c     | 250 +++++++-
 src/backend/access/nbtree/nbtsort.c       | 209 ++++++-
 src/backend/access/nbtree/nbtsplitloc.c   |  38 +-
 src/backend/access/nbtree/nbtutils.c      | 216 ++++++-
 src/backend/access/nbtree/nbtxlog.c       | 218 ++++++-
 src/backend/access/rmgrdesc/nbtdesc.c     |  28 +-
 src/bin/psql/tab-complete.c               |   4 +-
 contrib/amcheck/verify_nbtree.c           | 180 ++++--
 doc/src/sgml/btree.sgml                   |  48 +-
 doc/src/sgml/charset.sgml                 |   9 +-
 doc/src/sgml/ref/create_index.sgml        |  43 +-
 doc/src/sgml/ref/reindex.sgml             |   5 +-
 src/test/regress/expected/btree_index.out |  16 +
 src/test/regress/sql/btree_index.sql      |  17 +
 25 files changed, 2945 insertions(+), 245 deletions(-)
 create mode 100644 src/backend/access/nbtree/nbtdedup.c

diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4a80e84aa7..1c82357e0d 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -23,6 +23,36 @@
 #include "storage/bufmgr.h"
 #include "storage/shm_toc.h"
 
+/*
+ * Storage type for Btree's reloptions
+ */
+typedef struct BtreeOptions
+{
+	int32		vl_len_;		/* varlena header (do not touch directly!) */
+	int			fillfactor;		/* leaf fillfactor */
+	double		vacuum_cleanup_index_scale_factor;
+	bool		deduplication;	/* Use deduplication where safe? */
+} BtreeOptions;
+
+/*
+ * Deduplication is enabled for non unique indexes and disabled for unique
+ * indexes by default
+ */
+#define BtreeDefaultDoDedup(relation) \
+	(relation->rd_index->indisunique ? false : true)
+
+#define BtreeGetDoDedupOption(relation) \
+	((relation)->rd_options ? \
+	 ((BtreeOptions *) (relation)->rd_options)->deduplication : \
+	 BtreeDefaultDoDedup(relation))
+
+#define BtreeGetFillFactor(relation, defaultff) \
+	((relation)->rd_options ? \
+	 ((BtreeOptions *) (relation)->rd_options)->fillfactor : (defaultff))
+
+#define BtreeGetTargetPageFreeSpace(relation, defaultff) \
+	(BLCKSZ * (100 - BtreeGetFillFactor(relation, defaultff)) / 100)
+
 /* There's room for a 16-bit vacuum cycle ID in BTPageOpaqueData */
 typedef uint16 BTCycleId;
 
@@ -107,6 +137,7 @@ typedef struct BTMetaPageData
 										 * pages */
 	float8		btm_last_cleanup_num_heap_tuples;	/* number of heap tuples
 													 * during last cleanup */
+	bool		btm_safededup;	/* deduplication known to be safe? */
 } BTMetaPageData;
 
 #define BTPageGetMeta(p) \
@@ -114,7 +145,8 @@ typedef struct BTMetaPageData
 
 /*
  * The current Btree version is 4.  That's what you'll get when you create
- * a new index.
+ * a new index.  The btm_safededup field can only be set if this happened
+ * on Postgres 13, but it's safe to read with version 3 indexes.
  *
  * Btree version 3 was used in PostgreSQL v11.  It is mostly the same as
  * version 4, but heap TIDs were not part of the keyspace.  Index tuples
@@ -131,8 +163,8 @@ typedef struct BTMetaPageData
 #define BTREE_METAPAGE	0		/* first page is meta */
 #define BTREE_MAGIC		0x053162	/* magic number in metapage */
 #define BTREE_VERSION	4		/* current version number */
-#define BTREE_MIN_VERSION	2	/* minimal supported version number */
-#define BTREE_NOVAC_VERSION	3	/* minimal version with all meta fields */
+#define BTREE_MIN_VERSION	2	/* minimum supported version */
+#define BTREE_NOVAC_VERSION	3	/* version with all meta fields set */
 
 /*
  * Maximum size of a btree index entry, including its tuple header.
@@ -154,6 +186,26 @@ typedef struct BTMetaPageData
 	MAXALIGN_DOWN((PageGetPageSize(page) - \
 				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
 				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
+/*
+ * MaxBTreeIndexTuplesPerPage is an upper bound on the number of "logical"
+ * tuples that may be stored on a btree leaf page.  This is comparable to
+ * the generic/physical MaxIndexTuplesPerPage upper bound.  A separate
+ * upper bound is needed in certain contexts due to posting list tuples,
+ * which only use a single physical page entry to store many logical
+ * tuples.  (MaxBTreeIndexTuplesPerPage is used to size the per-page
+ * temporary buffers used by index scans.)
+ *
+ * Note: we don't bother considering per-physical-tuple overheads here to
+ * keep things simple (value is based on how many elements a single array
+ * of heap TIDs must have to fill the space between the page header and
+ * special area).  The value is slightly higher (i.e. more conservative)
+ * than necessary as a result, which is considered acceptable.  There will
+ * only be three (very large) physical posting list tuples in leaf pages
+ * that have the largest possible number of heap TIDs/logical tuples.
+ */
+#define MaxBTreeIndexTuplesPerPage \
+	(int) ((BLCKSZ - SizeOfPageHeaderData - sizeof(BTPageOpaqueData)) / \
+		   sizeof(ItemPointerData))
 
 /*
  * The leaf-page fillfactor defaults to 90% but is user-adjustable.
@@ -229,16 +281,15 @@ typedef struct BTMetaPageData
  * tuples (non-pivot tuples).  _bt_check_natts() enforces the rules
  * described here.
  *
- * Non-pivot tuple format:
+ * Non-pivot tuple format (plain/non-posting variant):
  *
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
- * All other types of index tuples ("pivot" tuples) only have key columns,
- * since pivot tuples only exist to represent how the key space is
+ * Non-pivot tuples complement pivot tuples, which only have key columns.
+ * The sole purpose of pivot tuples is to represent how the key space is
  * separated.  In general, any B-Tree index that has more than one level
  * (i.e. any index that does not just consist of a metapage and a single
  * leaf root page) must have some number of pivot tuples, since pivot
@@ -282,20 +333,103 @@ typedef struct BTMetaPageData
  * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
  * number of columns/attributes <= INDEX_MAX_KEYS.
  *
+ * Sometimes non-pivot tuples also use a representation that repurposes
+ * t_tid to store metadata rather than a TID.  Postgres 13 introduced a new
+ * non-pivot tuple format to support deduplication: posting list tuples.
+ * Deduplication folds together multiple equal non-pivot tuples into a
+ * logically equivalent, space efficient representation.  A posting list is
+ * an array of ItemPointerData elements.  Regular non-pivot tuples are
+ * merged together to form posting list tuples lazily, at the point where
+ * we'd otherwise have to split a leaf page.
+ *
+ * Posting tuple format (alternative non-pivot tuple representation):
+ *
+ *  t_tid | t_info | key values | posting list (TID array)
+ *
+ * Posting list tuples are recognized as such by having the
+ * INDEX_ALT_TID_MASK status bit set in t_info and the BT_IS_POSTING status
+ * bit set in t_tid.  These flags redefine the content of the posting
+ * tuple's t_tid to store an offset to the posting list, as well as the
+ * total number of posting list array elements.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items present in the tuple, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.  Like any non-pivot tuple, the number of columns stored is
+ * always implicitly the total number in the index (in practice there can
+ * never be non-key columns stored, since deduplication is not supported
+ * with INCLUDE indexes).
+ *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_N_POSTING_OFFSET_MASK	0x0FFF
 #define BT_HEAP_TID_ATTR			0x1000
+#define BT_IS_POSTING				0x2000
+
+/*
+ * N.B.: BTreeTupleIsPivot() should only be used in code that deals with
+ * heapkeyspace indexes specifically.  BTreeTupleIsPosting() works with all
+ * nbtree indexes, though.
+ */
+#define BTreeTupleIsPivot(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+	)
+#define BTreeTupleIsPosting(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+	)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleGetNPosting(itup)	\
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+	)
+#define BTreeTupleSetNPosting(itup, n) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+	} while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+	)
+#define BTreeSetPostingMeta(itup, nposting, off) \
+	do { \
+		BTreeTupleSetNPosting(itup, nposting); \
+		Assert(BTreeTupleIsPosting(itup)); \
+		ItemPointerSetBlockNumber(&((itup)->t_tid), (off)); \
+	} while(0)
+
+#define BTreeTupleGetPosting(itup) \
+	(ItemPointer) ((char*) (itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+	(BTreeTupleGetPosting(itup) + (n))
 
 /* Get/set downlink block number */
 #define BTreeInnerTupleGetDownLink(itup) \
@@ -326,40 +460,69 @@ typedef struct BTMetaPageData
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		(BTreeTupleIsPivot(itup)) ? \
 		( \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
 		IndexRelationGetNumberOfAttributes(rel) \
 	)
-#define BTreeTupleSetNAtts(itup, n) \
-	do { \
-		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
-	} while(0)
+
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int n)
+{
+	Assert(!BTreeTupleIsPosting(itup));
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	ItemPointerSetOffsetNumber(&itup->t_tid, n & BT_N_KEYS_OFFSET_MASK);
+}
 
 /*
- * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Get tiebreaker heap TID attribute, if any.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
  */
-#define BTreeTupleGetHeapTID(itup) \
-	( \
-	  (itup)->t_info & INDEX_ALT_TID_MASK && \
-	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
-	  ( \
-		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
-					   sizeof(ItemPointerData)) \
-	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
-	)
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+	if (BTreeTupleIsPivot(itup))
+	{
+		/* Pivot tuple heap TID representation? */
+		if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+			 BT_HEAP_TID_ATTR) != 0)
+			return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+								  sizeof(ItemPointerData));
+
+		/* Heap TID attribute was truncated */
+		return NULL;
+	}
+	else if (BTreeTupleIsPosting(itup))
+		return BTreeTupleGetPosting(itup);
+
+	return &(itup->t_tid);
+}
+
 /*
- * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list tuple.  Works with
+ * non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxHeapTID(IndexTuple itup)
+{
+	Assert(!BTreeTupleIsPivot(itup));
+
+	if (BTreeTupleIsPosting(itup))
+		return BTreeTupleGetPosting(itup) + (BTreeTupleGetNPosting(itup) - 1);
+
+	return &(itup->t_tid);
+}
+
+/*
+ * Set the heap TID attribute for a pivot tuple
  */
 #define BTreeTupleSetAltHeapTID(itup) \
 	do { \
-		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(BTreeTupleIsPivot(itup)); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
 								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
 	} while(0)
@@ -434,6 +597,11 @@ typedef BTStackData *BTStack;
  * indexes whose version is >= version 4.  It's convenient to keep this close
  * by, rather than accessing the metapage repeatedly.
  *
+ * safededup is set to indicate that index may use dynamic deduplication
+ * safely (index storage parameter separately indicates if deduplication is
+ * currently in use).  This is also a property of the index relation rather
+ * than an indexscan that is kept around for convenience.
+ *
  * anynullkeys indicates if any of the keys had NULL value when scankey was
  * built from index tuple (note that already-truncated tuple key attributes
  * set NULL as a placeholder key value, which also affects value of
@@ -469,6 +637,7 @@ typedef BTStackData *BTStack;
 typedef struct BTScanInsertData
 {
 	bool		heapkeyspace;
+	bool		safededup;
 	bool		anynullkeys;
 	bool		nextkey;
 	bool		pivotsearch;
@@ -507,6 +676,13 @@ typedef struct BTInsertStateData
 	bool		bounds_valid;
 	OffsetNumber low;
 	OffsetNumber stricthigh;
+
+	/*
+	 * if _bt_binsrch_insert found the location inside existing posting list,
+	 * save the position inside the list.  This will be -1 in rare cases
+	 * where the overlapping posting list is LP_DEAD.
+	 */
+	int			postingoff;
 } BTInsertStateData;
 
 typedef BTInsertStateData *BTInsertState;
@@ -534,7 +710,10 @@ typedef BTInsertStateData *BTInsertState;
  * If we are doing an index-only scan, we save the entire IndexTuple for each
  * matched item, otherwise only its heap TID and offset.  The IndexTuples go
  * into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array.  Posting list tuples store a "base" tuple once,
+ * allowing the same key to be returned for each logical tuple associated
+ * with the physical posting list tuple (i.e. for each TID from the posting
+ * list).
  */
 
 typedef struct BTScanPosItem	/* what we remember about each match */
@@ -567,6 +746,12 @@ typedef struct BTScanPosData
 	 */
 	int			nextTupleOffset;
 
+	/*
+	 * Posting list tuples use postingTupleOffset to store the current
+	 * location of the tuple that is returned multiple times.
+	 */
+	int			postingTupleOffset;
+
 	/*
 	 * The items array is always ordered in index order (ie, increasing
 	 * indexoffset).  When scanning backwards it is convenient to fill the
@@ -578,7 +763,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxBTreeIndexTuplesPerPage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -680,6 +865,57 @@ typedef BTScanOpaqueData *BTScanOpaque;
 #define SK_BT_DESC			(INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
 #define SK_BT_NULLS_FIRST	(INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
 
+/*
+ * State used to representing a pending posting list during deduplication.
+ *
+ * Each entry represents a group of consecutive items from the page, starting
+ * from page offset number 'baseoff', which is the offset number of the "base"
+ * tuple on the page undergoing deduplication.  'nitems' is the total number
+ * of items from the page that will be merged to make a new posting tuple.
+ *
+ * Note: 'nitems' means the number of physical index tuples/line pointers on
+ * the page, starting with and including the item at offset number 'baseoff'
+ * (so nitems should be at least 2 when interval is used).  These existing
+ * tuples may be posting list tuples or regular tuples.
+ */
+typedef struct BTDedupInterval
+{
+	OffsetNumber baseoff;
+	OffsetNumber nitems;
+} BTDedupInterval;
+
+/*
+ * Btree-private state used to deduplicate items on a leaf page
+ */
+typedef struct BTDedupState
+{
+	Relation	rel;
+	/* Deduplication status info for entire page/operation */
+	Size		maxitemsize;	/* Limit on size of final tuple */
+	IndexTuple	newitem;
+	bool		checkingunique; /* Use unique index strategy? */
+	OffsetNumber skippedbase;	/* First offset skipped by checkingunique */
+
+	/* Metadata about current pending posting list */
+	ItemPointer htids;			/* Heap TIDs in pending posting list */
+	int			nhtids;			/* # heap TIDs in nhtids array */
+	int			nitems;			/* See BTDedupInterval definition */
+	Size		alltupsize;		/* Includes line pointer overhead */
+	bool		overlap;		/* Avoid overlapping posting lists? */
+
+	/* Metadata about base tuple of current pending posting list */
+	IndexTuple	base;			/* Use to form new posting list */
+	OffsetNumber baseoff;		/* page offset of base */
+	Size		basetupsize;	/* base size without posting list */
+
+	/*
+	 * Pending posting list.  Contains information about a group of
+	 * consecutive items that will be deduplicated by creating a new posting
+	 * list tuple.
+	 */
+	BTDedupInterval interval;
+} BTDedupState;
+
 /*
  * Constant definition for progress reporting.  Phase numbers must match
  * btbuildphasename.
@@ -725,6 +961,22 @@ extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
 extern void _bt_parallel_done(IndexScanDesc scan);
 extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
 
+/*
+ * prototypes for functions in nbtdedup.c
+ */
+extern void _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+							   IndexTuple newitem, Size newitemsz,
+							   bool checkingunique);
+extern void _bt_dedup_start_pending(BTDedupState *state, IndexTuple base,
+									OffsetNumber base_off);
+extern bool _bt_dedup_save_htid(BTDedupState *state, IndexTuple itup);
+extern Size _bt_dedup_finish_pending(Buffer buffer, BTDedupState *state,
+									 bool need_wal);
+extern IndexTuple _bt_form_posting(IndexTuple tuple, ItemPointer htids,
+								   int nhtids);
+extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
+								   int postingoff);
+
 /*
  * prototypes for functions in nbtinsert.c
  */
@@ -743,7 +995,8 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
 /*
  * prototypes for functions in nbtpage.c
  */
-extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level);
+extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+							 bool safededup);
 extern void _bt_update_meta_cleanup_info(Relation rel,
 										 TransactionId oldestBtpoXact, float8 numHeapTuples);
 extern void _bt_upgrademetapage(Page page);
@@ -751,6 +1004,7 @@ extern Buffer _bt_getroot(Relation rel, int access);
 extern Buffer _bt_gettrueroot(Relation rel);
 extern int	_bt_getrootheight(Relation rel);
 extern bool _bt_heapkeyspace(Relation rel);
+extern bool _bt_safededup(Relation rel);
 extern void _bt_checkpage(Relation rel, Buffer buf);
 extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
 extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -762,6 +1016,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems, Relation heapRel);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems,
+								OffsetNumber *updateitemnos,
+								IndexTuple *updated, int nupdateable,
 								BlockNumber lastBlockVacuumed);
 extern int	_bt_pagedel(Relation rel, Buffer buf);
 
@@ -812,6 +1068,7 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern bool _bt_opclasses_support_dedup(Relation index);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 91b9ee00cf..b21e6f8082 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP_PAGE	0x50	/* deduplicate tuples on leaf page */
+/* 0x60 is unused */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -53,6 +54,7 @@ typedef struct xl_btree_metadata
 	uint32		fastlevel;
 	TransactionId oldest_btpo_xact;
 	float8		last_cleanup_num_heap_tuples;
+	bool		btm_safededup;
 } xl_btree_metadata;
 
 /*
@@ -61,16 +63,21 @@ typedef struct xl_btree_metadata
  * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
  * Note that INSERT_META implies it's not a leaf page.
  *
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page (data contains the inserted tuple);
+ *				 if postingoff is set, this started out as an insertion
+ *				 into an existing posting tuple at the offset before
+ *				 offnum (i.e. it's a posting list split).  (REDO will
+ *				 have to update split posting list, too.)
  * Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
  * Backup Blk 2: xl_btree_metadata, if INSERT_META
  */
 typedef struct xl_btree_insert
 {
 	OffsetNumber offnum;
+	OffsetNumber postingoff;
 } xl_btree_insert;
 
-#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, postingoff) + sizeof(OffsetNumber))
 
 /*
  * On insert with split, we save all the items going into the right sibling
@@ -91,9 +98,19 @@ typedef struct xl_btree_insert
  *
  * Backup Blk 0: original page / new left page
  *
- * The left page's data portion contains the new item, if it's the _L variant.
- * An IndexTuple representing the high key of the left page must follow with
- * either variant.
+ * The left page's data portion contains the new item, if it's the _L variant
+ * (though _R variant page split records with a posting list split sometimes
+ * need to include newitem).  An IndexTuple representing the high key of the
+ * left page must follow in all cases.
+ *
+ * The newitem is actually an "original" newitem when a posting list split
+ * occurs that requires than the original posting list be updated in passing.
+ * Recovery recognizes this case when postingoff is set, and must use the
+ * posting offset to do an in-place update of the existing posting list that
+ * was actually split, and change the newitem to the "final" newitem.  This
+ * corresponds to the xl_btree_insert postingoff-is-set case.  postingoff
+ * won't be set when a posting list split occurs where both original posting
+ * list and newitem go on the right page.
  *
  * Backup Blk 1: new right page
  *
@@ -111,10 +128,26 @@ typedef struct xl_btree_split
 {
 	uint32		level;			/* tree level of page being split */
 	OffsetNumber firstright;	/* first item moved to right page */
-	OffsetNumber newitemoff;	/* new item's offset (useful for _L variant) */
+	OffsetNumber newitemoff;	/* new item's offset */
+	OffsetNumber postingoff;	/* offset inside orig posting tuple */
 } xl_btree_split;
 
-#define SizeOfBtreeSplit	(offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit	(offsetof(xl_btree_split, postingoff) + sizeof(OffsetNumber))
+
+/*
+ * When page is deduplicated, consecutive groups of tuples with equal keys are
+ * merged together into posting list tuples.
+ *
+ * The WAL record represents the interval that describes the posing tuple
+ * that should be added to the page.
+ */
+typedef struct xl_btree_dedup
+{
+	OffsetNumber baseoff;
+	OffsetNumber nitems;
+} xl_btree_dedup;
+
+#define SizeOfBtreeDedup 	(offsetof(xl_btree_dedup, nitems) + sizeof(OffsetNumber))
 
 /*
  * This is what we need to know about delete of individual leaf index tuples.
@@ -166,16 +199,27 @@ typedef struct xl_btree_reuse_page
  * block numbers aren't given.
  *
  * Note that the *last* WAL record in any vacuum of an index is allowed to
- * have a zero length array of offsets. Earlier records must have at least one.
+ * have a zero length array of target offsets (i.e. no deletes or updates).
+ * Earlier records must have at least one.
  */
 typedef struct xl_btree_vacuum
 {
 	BlockNumber lastBlockVacuumed;
 
-	/* TARGET OFFSET NUMBERS FOLLOW */
+	/*
+	 * This field helps us to find beginning of the updated versions of tuples
+	 * which follow array of offset numbers, needed when a posting list is
+	 * vacuumed without killing all of its logical tuples.
+	 */
+	uint32		nupdated;
+	uint32		ndeleted;
+
+	/* UPDATED TARGET OFFSET NUMBERS FOLLOW (if any) */
+	/* UPDATED TUPLES TO ADD BACK FOLLOW (if any) */
+	/* DELETED TARGET OFFSET NUMBERS FOLLOW (if any) */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(BlockNumber))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.
@@ -256,6 +300,8 @@ typedef struct xl_btree_newroot
 extern void btree_redo(XLogReaderState *record);
 extern void btree_desc(StringInfo buf, XLogReaderState *record);
 extern const char *btree_identify(uint8 info);
+extern void btree_xlog_startup(void);
+extern void btree_xlog_cleanup(void);
 extern void btree_mask(char *pagedata, BlockNumber blkno);
 
 #endif							/* NBTXLOG_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 3c0db2ccf5..2b8c6c7fc8 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -36,7 +36,7 @@ PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL,
 PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
 PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
 PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask)
 PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask)
 PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
 PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 3f22a6c354..8535e4210b 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -158,6 +158,15 @@ static relopt_bool boolRelOpts[] =
 		},
 		true
 	},
+	{
+		{
+			"deduplication",
+			"Enables deduplication on btree index leaf pages",
+			RELOPT_KIND_BTREE,
+			ShareUpdateExclusiveLock
+		},
+		true
+	},
 	/* list terminator */
 	{{NULL}}
 };
@@ -1521,8 +1530,6 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		offsetof(StdRdOptions, user_catalog_table)},
 		{"parallel_workers", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, parallel_workers)},
-		{"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
-		offsetof(StdRdOptions, vacuum_cleanup_index_scale_factor)},
 		{"vacuum_index_cleanup", RELOPT_TYPE_BOOL,
 		offsetof(StdRdOptions, vacuum_index_cleanup)},
 		{"vacuum_truncate", RELOPT_TYPE_BOOL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 2599b5d342..6e1dc596e1 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -276,6 +276,10 @@ BuildIndexValueDescription(Relation indexRelation,
 /*
  * Get the latestRemovedXid from the table entries pointed at by the index
  * tuples being deleted.
+ *
+ * Note: index access methods that don't consistently use the standard
+ * IndexTuple + heap TID item pointer representation will need to provide
+ * their own version of this function.
  */
 TransactionId
 index_compute_xid_horizon_for_tuples(Relation irel,
diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index bf245f5dab..d69808e78c 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = \
 	nbtcompare.o \
+	nbtdedup.o \
 	nbtinsert.o \
 	nbtpage.o \
 	nbtree.o \
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6db203e75c..54cb9db49d 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
 like a hint bit for a heap tuple), but physically removing tuples requires
 exclusive lock.  In the current code we try to remove LP_DEAD tuples when
 we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already).  Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)
 
 This leaves the index in a state where it has no entry for a dead tuple
 that still exists in the heap.  This is not a problem for the current
@@ -710,6 +713,75 @@ the fallback strategy assumes that duplicates are mostly inserted in
 ascending heap TID order.  The page is split in a way that leaves the left
 half of the page mostly full, and the right half of the page mostly empty.
 
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits.  Deduplication alters
+the physical representation of tuples without changing the logical contents
+of the index, and without adding overhead to read queries.  Non-pivot
+tuples are folded together into a single physical tuple with a posting list
+(a simple array of heap TIDs with the standard item pointer format).
+Deduplication is always applied lazily, at the point where it would
+otherwise be necessary to perform a page split.  It occurs only when
+LP_DEAD items have been removed, as our last line of defense against
+splitting a leaf page.  We can set the LP_DEAD bit with posting list
+tuples, though only when all table tuples are known dead. (Bitmap scans
+cannot perform LP_DEAD bit setting, and are the common case with indexes
+that contain lots of duplicates, so this downside is considered
+acceptable.)
+
+Large groups of logical duplicates tend to appear together on the same leaf
+page due to the special duplicate logic used when choosing a split point.
+This facilitates lazy/dynamic deduplication.  Deduplication can reliably
+deduplicate a large localized group of duplicates before it can span
+multiple leaf pages.  Posting list tuples are subject to the same 1/3 of a
+page restriction as any other tuple.
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists.  A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple.  An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication.  Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple.  When the incoming
+tuple really does overlap with an existing posting list, a posting list
+split is performed.  Posting list splits work in a way that more or less
+preserves the illusion that all incoming tuples do not need to be merged
+with any existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list.  The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list.  The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list.  We make
+space in the posting list by removing the heap TID that became the new
+item.  The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists.  Also, overall
+space utilization is improved by keeping existing posting lists large.
+
+The representation of posting lists is identical to the posting lists used
+by GIN, so it would be straightforward to apply GIN's varbyte encoding
+compression scheme to individual posting lists.  Posting list compression
+would break the assumptions made by posting list splits about page space
+accounting, though, so it's not clear how compression could be integrated
+with nbtree.  Besides, posting list compression does not offer a compelling
+trade-off for nbtree, since in general nbtree is optimized for consistent
+performance with many concurrent readers and writers.  A major goal of
+nbtree's lazy approach to deduplication is to limit the performance impact
+of deduplication with random updates.  Even concurrent append-only inserts
+of the same key value will tend to have inserts of individual index tuples
+in an order that doesn't quite match heap TID order.  In general, delaying
+deduplication avoids many unnecessary posting list splits, and minimizes
+page level fragmentation.
+
 Notes About Data Representation
 -------------------------------
 
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
new file mode 100644
index 0000000000..dde1d68d6f
--- /dev/null
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -0,0 +1,710 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtdedup.c
+ *	  Deduplicate items in Lehman and Yao btrees for Postgres.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/nbtree/nbtdedup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/nbtxlog.h"
+#include "miscadmin.h"
+#include "utils/rel.h"
+
+
+/*
+ * Try to deduplicate items to free at least enough space to avoid a page
+ * split.  This function should be called during insertion, only after LP_DEAD
+ * items were removed by _bt_vacuum_one_page() to prevent a page split.
+ * (We'll have to kill LP_DEAD items here when the page's BTP_HAS_GARBAGE hint
+ * was not set, but that should be rare.)
+ *
+ * The strategy for !checkingunique callers is to perform as much
+ * deduplication as possible to free as much space as possible now, since
+ * making it harder to set LP_DEAD bits is considered an acceptable price for
+ * not having to deduplicate the same page many times.  It is unlikely that
+ * the items on the page will have their LP_DEAD bit set in the future, since
+ * that hasn't happened before now (besides, entire posting lists can still
+ * have their LP_DEAD bit set).
+ *
+ * The strategy for checkingunique callers is rather different, since the
+ * overall goal is different.  Deduplication cooperates with and enhances
+ * garbage collection, especially the LP_DEAD bit setting that takes place in
+ * _bt_check_unique().  Deduplication does as little as possible while still
+ * preventing a page split for caller, since it's less likely that posting
+ * lists will have their LP_DEAD bit set.  Deduplication avoids creating new
+ * posting lists with only two heap TIDs, and also avoids creating new posting
+ * lists from an existing posting list.  Deduplication is only useful when it
+ * delays a page split long enough for garbage collection to prevent the page
+ * split altogether.  checkingunique deduplication can make all the difference
+ * in cases where VACUUM keeps up with dead index tuples, but "recently dead"
+ * index tuples are still numerous enough to cause page splits that are truly
+ * unnecessary.
+ *
+ * Note: If newitem contains NULL values in key attributes, caller will be
+ * !checkingunique even when rel is a unique index.  The page in question will
+ * usually have many existing items with NULLs.
+ */
+void
+_bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+				   IndexTuple newitem, Size newitemsz, bool checkingunique)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buffer);
+	BTPageOpaque oopaque;
+	BTDedupState *state = NULL;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+	OffsetNumber deletable[MaxIndexTuplesPerPage];
+	bool		minimal = checkingunique;
+	int			ndeletable = 0;
+	Size		pagesaving = 0;
+	int			count = 0;
+	bool		singlevalue = false;
+
+	oopaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	/* init deduplication state needed to build posting tuples */
+	state = (BTDedupState *) palloc(sizeof(BTDedupState));
+	state->rel = rel;
+
+	state->maxitemsize = BTMaxItemSize(page);
+	state->newitem = newitem;
+	state->checkingunique = checkingunique;
+	state->skippedbase = InvalidOffsetNumber;
+	/* Metadata about current pending posting list */
+	state->htids = NULL;
+	state->nhtids = 0;
+	state->nitems = 0;
+	state->alltupsize = 0;
+	state->overlap = false;
+	/* Metadata about based tuple of current pending posting list */
+	state->base = NULL;
+	state->baseoff = InvalidOffsetNumber;
+	state->basetupsize = 0;
+
+	minoff = P_FIRSTDATAKEY(oopaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/*
+	 * Delete dead tuples if any. We cannot simply skip them in the cycle
+	 * below, because it's necessary to generate special Xlog record
+	 * containing such tuples to compute latestRemovedXid on a standby server
+	 * later.
+	 *
+	 * This should not affect performance, since it only can happen in a rare
+	 * situation when BTP_HAS_GARBAGE flag was not set and _bt_vacuum_one_page
+	 * was not called, or _bt_vacuum_one_page didn't remove all dead items.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+
+		if (ItemIdIsDead(itemid))
+			deletable[ndeletable++] = offnum;
+	}
+
+	if (ndeletable > 0)
+	{
+		/*
+		 * Skip duplication in rare cases where there were LP_DEAD items
+		 * encountered here when that frees sufficient space for caller to
+		 * avoid a page split
+		 */
+		_bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+		if (PageGetFreeSpace(page) >= newitemsz)
+		{
+			pfree(state);
+			return;
+		}
+
+		/* Continue with deduplication */
+		minoff = P_FIRSTDATAKEY(oopaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+	}
+
+	/* Make sure that new page won't have garbage flag set */
+	oopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+	newitemsz += sizeof(ItemIdData);
+	/* Conservatively size array */
+	state->htids = palloc(state->maxitemsize);
+
+	/*
+	 * Determine if a "single value" strategy page split is likely to occur
+	 * shortly after deduplication finishes.  It should be possible for the
+	 * single value split to find a split point that packs the left half of
+	 * the split BTREE_SINGLEVAL_FILLFACTOR% full.
+	 */
+	if (!checkingunique)
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+
+		itemid = PageGetItemId(page, minoff);
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		if (_bt_keep_natts_fast(rel, newitem, itup) > natts)
+		{
+			itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
+			itup = (IndexTuple) PageGetItem(page, itemid);
+
+			/*
+			 * Use different strategy if future page split likely to need to
+			 * use "single value" strategy
+			 */
+			if (_bt_keep_natts_fast(rel, newitem, itup) > natts)
+				singlevalue = true;
+		}
+	}
+
+	/*
+	 * Iterate over tuples on the page, try to deduplicate them into posting
+	 * lists and insert into new page.  NOTE: It's essential to reassess the
+	 * max offset on each iteration, since it will change as items are
+	 * deduplicated.
+	 */
+	offnum = minoff;
+retry:
+	while (offnum <= PageGetMaxOffsetNumber(page))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(!ItemIdIsDead(itemid));
+
+		if (state->nitems == 0)
+		{
+			/*
+			 * No previous/base tuple for the data item -- use the data item
+			 * as base tuple of pending posting list
+			 */
+			_bt_dedup_start_pending(state, itup, offnum);
+		}
+		else if (_bt_keep_natts_fast(rel, state->base, itup) > natts &&
+				 _bt_dedup_save_htid(state, itup))
+		{
+			/*
+			 * Tuple is equal to base tuple of pending posting list.  Heap
+			 * TID(s) for itup have been saved in state.  The next iteration
+			 * will also end up here if it's possible to merge the next tuple
+			 * into the same pending posting list.
+			 */
+		}
+		else
+		{
+			/*
+			 * Tuple is not equal to pending posting list tuple, or
+			 * _bt_dedup_save_htid() opted to not merge current item into
+			 * pending posting list for some other reason (e.g., adding more
+			 * TIDs would have caused posting list to exceed BTMaxItemSize()
+			 * limit).
+			 *
+			 * If state contains pending posting list with more than one item,
+			 * form new posting tuple, and update the page.  Otherwise, reset
+			 * the state and move on.
+			 */
+			pagesaving += _bt_dedup_finish_pending(buffer, state,
+												   RelationNeedsWAL(rel));
+
+			count++;
+
+			/*
+			 * When caller is a checkingunique caller and we have deduplicated
+			 * enough to avoid a page split, do minimal deduplication in case
+			 * the remaining items are about to be marked dead within
+			 * _bt_check_unique().
+			 */
+			if (minimal && pagesaving >= newitemsz)
+				break;
+
+			/*
+			 * Consider special steps when a future page split of the leaf
+			 * page is likely to occur using nbtsplitloc.c's "single value"
+			 * strategy
+			 */
+			if (singlevalue)
+			{
+				/*
+				 * Adjust maxitemsize so that there isn't a third and final
+				 * 1/3 of a page width tuple that fills the page to capacity.
+				 * The third tuple produced should be smaller than the first
+				 * two by an amount equal to the free space that nbtsplitloc.c
+				 * is likely to want to leave behind when the page it split.
+				 * When there are 3 posting lists on the page, then we end
+				 * deduplication.  Remaining tuples on the page can be
+				 * deduplicated later, when they're on the new right sibling
+				 * of this page, and the new sibling page needs to be split in
+				 * turn.
+				 *
+				 * Note that it doesn't matter if there are items on the page
+				 * that were already 1/3 of a page during current pass;
+				 * they'll still count as the first two posting list tuples.
+				 */
+				if (count == 2)
+				{
+					Size		leftfree;
+
+					/* This calculation needs to match nbtsplitloc.c */
+					leftfree = PageGetPageSize(page) - SizeOfPageHeaderData -
+						MAXALIGN(sizeof(BTPageOpaqueData));
+					/* Subtract predicted size of new high key */
+					leftfree -= newitemsz + MAXALIGN(sizeof(ItemPointerData));
+
+					/*
+					 * Reduce maxitemsize by an amount equal to target free
+					 * space on left half of page
+					 */
+					state->maxitemsize -= leftfree *
+						((100 - BTREE_SINGLEVAL_FILLFACTOR) / 100.0);
+				}
+				else if (count == 3)
+					break;
+			}
+
+			/*
+			 * Next iteration starts immediately after base tuple offset (this
+			 * will be the next offset on the page when we didn't modify the
+			 * page)
+			 */
+			offnum = state->baseoff;
+		}
+
+		offnum = OffsetNumberNext(offnum);
+	}
+
+	/* Handle the last item when pending posting list is not empty */
+	if (state->nitems != 0)
+	{
+		pagesaving += _bt_dedup_finish_pending(buffer, state,
+											   RelationNeedsWAL(rel));
+		count++;
+	}
+
+	if (pagesaving < newitemsz && state->skippedbase != InvalidOffsetNumber)
+	{
+		/*
+		 * Didn't free enough space for new item in first checkingunique pass.
+		 * Try making a second pass over the page, this time starting from the
+		 * first candidate posting list base offset that was skipped over in
+		 * the first pass (only do a second pass when this actually happened).
+		 *
+		 * The second pass over the page may deduplicate items that were
+		 * initially passed over due to concerns about limiting the
+		 * effectiveness of LP_DEAD bit setting within _bt_check_unique().
+		 * Note that the second pass will still stop deduplicating as soon as
+		 * enough space has been freed to avoid an immediate page split.
+		 */
+		Assert(state->checkingunique);
+		offnum = state->skippedbase;
+
+		state->checkingunique = false;
+		state->skippedbase = InvalidOffsetNumber;
+		state->alltupsize = 0;
+		state->nitems = 0;
+		state->base = NULL;
+		state->baseoff = InvalidOffsetNumber;
+		state->basetupsize = 0;
+		goto retry;
+	}
+
+	/* Local space accounting should agree with page accounting */
+	Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
+
+	/* be tidy */
+	pfree(state->htids);
+	pfree(state);
+}
+
+/*
+ * Create a new pending posting list tuple based on caller's tuple.
+ *
+ * Every tuple processed by the deduplication routines either becomes the base
+ * tuple for a posting list, or gets its heap TID(s) accepted into a pending
+ * posting list.  A tuple that starts out as the base tuple for a posting list
+ * will only actually be rewritten within _bt_dedup_finish_pending() when
+ * there was at least one successful call to _bt_dedup_save_htid().
+ */
+void
+_bt_dedup_start_pending(BTDedupState *state, IndexTuple base,
+						OffsetNumber baseoff)
+{
+	Assert(state->nhtids == 0);
+	Assert(state->nitems == 0);
+
+	/*
+	 * Copy heap TIDs from new base tuple for new candidate posting list into
+	 * ipd array.  Assume that we'll eventually create a new posting tuple by
+	 * merging later tuples with this existing one, though we may not.
+	 */
+	if (!BTreeTupleIsPosting(base))
+	{
+		memcpy(state->htids, base, sizeof(ItemPointerData));
+		state->nhtids = 1;
+		/* Save size of tuple without any posting list */
+		state->basetupsize = IndexTupleSize(base);
+	}
+	else
+	{
+		int			nposting;
+
+		nposting = BTreeTupleGetNPosting(base);
+		memcpy(state->htids, BTreeTupleGetPosting(base),
+			   sizeof(ItemPointerData) * nposting);
+		state->nhtids = nposting;
+		/* Save size of tuple without any posting list */
+		state->basetupsize = BTreeTupleGetPostingOffset(base);
+	}
+
+	/*
+	 * Save new base tuple itself -- it'll be needed if we actually create a
+	 * new posting list from new pending posting list.
+	 *
+	 * Must maintain size of all tuples (including line pointer overhead) to
+	 * calculate space savings on page within _bt_dedup_finish_pending().
+	 * Also, save number of base tuple logical tuples so that we can save
+	 * cycles in the common case where an existing posting list can't or won't
+	 * be merged with other tuples on the page.
+	 */
+	state->nitems = 1;
+	state->base = base;
+	state->baseoff = baseoff;
+	state->alltupsize = MAXALIGN(IndexTupleSize(base)) + sizeof(ItemIdData);
+	/* Also save baseoff in pending state for interval */
+	state->interval.baseoff = state->baseoff;
+	state->overlap = false;
+	if (state->newitem)
+	{
+		/* Might overlap with new item -- mark it as possible if it is */
+		if (BTreeTupleGetHeapTID(base) < BTreeTupleGetHeapTID(state->newitem))
+			state->overlap = true;
+	}
+}
+
+/*
+ * Save itup heap TID(s) into pending posting list where possible.
+ *
+ * Returns bool indicating if the pending posting list managed by state has
+ * itup's heap TID(s) saved.  When this is false, enlarging the pending
+ * posting list by the required amount would exceed the maxitemsize limit, so
+ * caller must finish the pending posting list tuple.  (Generally itup becomes
+ * the base tuple of caller's new pending posting list).
+ */
+bool
+_bt_dedup_save_htid(BTDedupState *state, IndexTuple itup)
+{
+	int			nhtids;
+	ItemPointer htids;
+	Size		mergedtupsz;
+
+	if (!BTreeTupleIsPosting(itup))
+	{
+		nhtids = 1;
+		htids = &itup->t_tid;
+	}
+	else
+	{
+		nhtids = BTreeTupleGetNPosting(itup);
+		htids = BTreeTupleGetPosting(itup);
+	}
+
+	/*
+	 * Don't append (have caller finish pending posting list as-is) if
+	 * appending heap TID(s) from itup would put us over limit
+	 */
+	mergedtupsz = MAXALIGN(state->basetupsize +
+						   (state->nhtids + nhtids) *
+						   sizeof(ItemPointerData));
+
+	if (mergedtupsz > state->maxitemsize)
+		return false;
+
+	/* Don't merge existing posting lists with checkingunique */
+	if (state->checkingunique &&
+		(BTreeTupleIsPosting(state->base) || nhtids > 1))
+	{
+		/* May begin here if second pass over page is required */
+		if (state->skippedbase == InvalidOffsetNumber)
+			state->skippedbase = state->baseoff;
+		return false;
+	}
+
+	if (state->overlap)
+	{
+		if (BTreeTupleGetMaxHeapTID(itup) > BTreeTupleGetHeapTID(state->newitem))
+		{
+			/*
+			 * newitem has heap TID in the range of the would-be new posting
+			 * list.  Avoid an immediate posting list split for caller.
+			 */
+			if (_bt_keep_natts_fast(state->rel, state->newitem, itup) >
+				IndexRelationGetNumberOfAttributes(state->rel))
+			{
+				state->newitem = NULL;	/* avoid unnecessary comparisons */
+				return false;
+			}
+		}
+	}
+
+	/*
+	 * Save heap TIDs to pending posting list tuple -- itup can be merged into
+	 * pending posting list
+	 */
+	state->nitems++;
+	memcpy(state->htids + state->nhtids, htids,
+		   sizeof(ItemPointerData) * nhtids);
+	state->nhtids += nhtids;
+	state->alltupsize += MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
+
+	return true;
+}
+
+/*
+ * Finalize pending posting list tuple, and add it to the page.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * Returns space saving from deduplicating to make a new posting list tuple.
+ * Note that this includes line pointer overhead.  This is zero in the case
+ * where no deduplication was possible.
+ */
+Size
+_bt_dedup_finish_pending(Buffer buffer, BTDedupState *state, bool need_wal)
+{
+	Size		spacesaving = 0;
+	Page		page = BufferGetPage(buffer);
+	int			minimum = 2;
+
+	Assert(state->nitems > 0);
+	Assert(state->nitems <= state->nhtids);
+	Assert(state->interval.baseoff == state->baseoff);
+
+	/*
+	 * Only create a posting list when at least 3 heap TIDs will appear in the
+	 * checkingunique case (checkingunique strategy won't merge existing
+	 * posting list tuples, so we know that the number of items here must also
+	 * be the total number of heap TIDs).  Creating a new posting lists with
+	 * only two heap TIDs won't even save enough space to fit another
+	 * duplicate with the same key as the posting list.  This is a bad
+	 * trade-off if there is a chance that the LP_DEAD bit can be set for
+	 * either existing tuple by putting off deduplication.
+	 *
+	 * (Note that a second pass over the page can deduplicate the item if that
+	 * is truly the only way to avoid a page split for checkingunique caller)
+	 */
+	Assert(!state->checkingunique || state->nitems == 1 ||
+		   state->nhtids == state->nitems);
+	if (state->checkingunique)
+	{
+		minimum = 3;
+		/* May begin here if second pass over page is required */
+		if (state->nitems == 2 && state->skippedbase == InvalidOffsetNumber)
+			state->skippedbase = state->baseoff;
+	}
+
+	if (state->nitems >= minimum)
+	{
+		IndexTuple	final;
+		Size		finalsz;
+		OffsetNumber offnum;
+		OffsetNumber deletable[MaxOffsetNumber];
+		int			ndeletable = 0;
+
+		/* find all tuples that will be replaced with this new posting tuple */
+		for (offnum = state->baseoff;
+			 offnum < state->baseoff + state->nitems;
+			 offnum = OffsetNumberNext(offnum))
+			deletable[ndeletable++] = offnum;
+
+		/* Form a tuple with a posting list */
+		final = _bt_form_posting(state->base, state->htids, state->nhtids);
+		finalsz = IndexTupleSize(final);
+		spacesaving = state->alltupsize - (finalsz + sizeof(ItemIdData));
+		/* Must have saved some space */
+		Assert(spacesaving > 0 && spacesaving < BLCKSZ);
+
+		/* Save final number of items for posting list */
+		state->interval.nitems = state->nitems;
+
+		Assert(finalsz <= state->maxitemsize);
+		Assert(finalsz == MAXALIGN(IndexTupleSize(final)));
+
+		START_CRIT_SECTION();
+
+		/* Delete items to replace */
+		PageIndexMultiDelete(page, deletable, ndeletable);
+		/* Insert posting tuple */
+		if (PageAddItem(page, (Item) final, finalsz, state->baseoff, false,
+						false) == InvalidOffsetNumber)
+			elog(ERROR, "deduplication failed to add tuple to page");
+
+		MarkBufferDirty(buffer);
+
+		/* Log deduplicated items */
+		if (need_wal)
+		{
+			XLogRecPtr	recptr;
+			xl_btree_dedup xlrec_dedup;
+
+			xlrec_dedup.baseoff = state->interval.baseoff;
+			xlrec_dedup.nitems = state->interval.nitems;
+
+			XLogBeginInsert();
+			XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+			XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+			recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP_PAGE);
+
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
+
+		pfree(final);
+	}
+
+	/* Reset state for next pending posting list */
+	state->nhtids = 0;
+	state->nitems = 0;
+	state->alltupsize = 0;
+
+	return spacesaving;
+}
+
+/*
+ * Build a posting list tuple from a "base" index tuple and a list of heap
+ * TIDs for posting list.
+ *
+ * Caller's "htids" array must be sorted in ascending order.  Any heap TIDs
+ * from caller's base tuple will not appear in returned posting list.
+ *
+ * If nhtids == 1, builds a non-posting tuple (posting list tuples can never
+ * have a single heap TID).
+ */
+IndexTuple
+_bt_form_posting(IndexTuple tuple, ItemPointer htids, int nhtids)
+{
+	uint32		keysize,
+				newsize = 0;
+	IndexTuple	itup;
+
+	/* We only need key part of the tuple */
+	if (BTreeTupleIsPosting(tuple))
+		keysize = BTreeTupleGetPostingOffset(tuple);
+	else
+		keysize = IndexTupleSize(tuple);
+
+	Assert(nhtids > 0);
+
+	/* Add space needed for posting list */
+	if (nhtids > 1)
+		newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nhtids;
+	else
+		newsize = keysize;
+
+	newsize = MAXALIGN(newsize);
+	itup = palloc0(newsize);
+	memcpy(itup, tuple, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+
+	if (nhtids > 1)
+	{
+		/* Form posting tuple, fill posting fields */
+
+		itup->t_info |= INDEX_ALT_TID_MASK;
+		BTreeSetPostingMeta(itup, nhtids, SHORTALIGN(keysize));
+		/* Copy posting list into the posting tuple */
+		memcpy(BTreeTupleGetPosting(itup), htids,
+			   sizeof(ItemPointerData) * nhtids);
+
+#ifdef USE_ASSERT_CHECKING
+		{
+			/* Assert that htid array is sorted and has unique TIDs */
+			ItemPointerData last;
+			ItemPointer current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+				current = BTreeTupleGetPostingN(itup, i);
+				Assert(ItemPointerCompare(current, &last) > 0);
+				ItemPointerCopy(current, &last);
+			}
+		}
+#endif
+	}
+	else
+	{
+		/* To finish building of a non-posting tuple, copy TID from htids */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		ItemPointerCopy(htids, &itup->t_tid);
+	}
+
+	return itup;
+}
+
+/*
+ * Prepare for a posting list split by swapping heap TID in newitem with heap
+ * TID from original posting list (the 'oposting' heap TID located at offset
+ * 'postingoff').
+ *
+ * Returns new posting list tuple, which is palloc()'d in caller's context.
+ * This is guaranteed to be the same size as 'oposting'.  Modified version of
+ * newitem is what caller actually inserts inside the critical section that
+ * also performs an in-place update of posting list.
+ *
+ * Explicit WAL-logging of newitem must use the original version of newitem in
+ * order to make it possible for our nbtxlog.c callers to correctly REDO
+ * original steps.  (This approach avoids any explicit WAL-logging of a
+ * posting list tuple.  This is important because posting lists are often much
+ * larger than plain tuples.)
+ */
+IndexTuple
+_bt_swap_posting(IndexTuple newitem, IndexTuple oposting, int postingoff)
+{
+	int			nhtids;
+	char	   *replacepos;
+	char	   *rightpos;
+	Size		nbytes;
+	IndexTuple	nposting;
+
+	nhtids = BTreeTupleGetNPosting(oposting);
+	Assert(postingoff > 0 && postingoff < nhtids);
+
+	nposting = CopyIndexTuple(oposting);
+	replacepos = (char *) BTreeTupleGetPostingN(nposting, postingoff);
+	rightpos = replacepos + sizeof(ItemPointerData);
+	nbytes = (nhtids - postingoff - 1) * sizeof(ItemPointerData);
+
+	/*
+	 * Move item pointers in posting list to make a gap for the new item's
+	 * heap TID (shift TIDs one place to the right, losing original rightmost
+	 * TID)
+	 */
+	memmove(rightpos, replacepos, nbytes);
+
+	/* Fill the gap with the TID of the new item */
+	ItemPointerCopy(&newitem->t_tid, (ItemPointer) replacepos);
+
+	/* Copy original posting list's rightmost TID into new item */
+	ItemPointerCopy(BTreeTupleGetPostingN(oposting, nhtids - 1),
+					&newitem->t_tid);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(nposting),
+							  BTreeTupleGetHeapTID(newitem)) < 0);
+	Assert(BTreeTupleGetNPosting(oposting) == BTreeTupleGetNPosting(nposting));
+
+	return nposting;
+}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b93b2a0ffd..0bfe9cdb7e 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,10 +47,12 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   BTStack stack,
 						   IndexTuple itup,
 						   OffsetNumber newitemoff,
+						   int postingoff,
 						   bool split_only_page);
 static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 						Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
-						IndexTuple newitem);
+						IndexTuple newitem, IndexTuple orignewitem,
+						IndexTuple nposting, OffsetNumber postingoff);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 							  BTStack stack, bool is_root, bool is_only);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
@@ -61,7 +63,8 @@ static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
  *	_bt_doinsert() -- Handle insertion of a single index tuple in the tree.
  *
  *		This routine is called by the public interface routine, btinsert.
- *		By here, itup is filled in, including the TID.
+ *		By here, itup is filled in, including the TID.  Caller should be
+ *		prepared for us to scribble on 'itup'.
  *
  *		If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
  *		will allow duplicates.  Otherwise (UNIQUE_CHECK_YES or
@@ -125,6 +128,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	insertstate.itup_key = itup_key;
 	insertstate.bounds_valid = false;
 	insertstate.buf = InvalidBuffer;
+	insertstate.postingoff = 0;
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -300,7 +304,7 @@ top:
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
 		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
-					   itup, newitemoff, false);
+					   itup, newitemoff, insertstate.postingoff, false);
 	}
 	else
 	{
@@ -353,6 +357,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 	BTPageOpaque opaque;
 	Buffer		nbuf = InvalidBuffer;
 	bool		found = false;
+	bool		inposting = false;
+	bool		prev_all_dead = true;
+	int			curposti = 0;
 
 	/* Assume unique until we find a duplicate */
 	*is_unique = true;
@@ -374,6 +381,11 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 
 	/*
 	 * Scan over all equal tuples, looking for live conflicts.
+	 *
+	 * Note that each iteration of the loop processes one heap TID, not one
+	 * index tuple.  The page offset number won't be advanced for iterations
+	 * which process heap TIDs from posting list tuples until the last such
+	 * heap TID for the posting list (curposti will be advanced instead).
 	 */
 	Assert(!insertstate->bounds_valid || insertstate->low == offset);
 	Assert(!itup_key->anynullkeys);
@@ -435,7 +447,27 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 
 				/* okay, we gotta fetch the heap tuple ... */
 				curitup = (IndexTuple) PageGetItem(page, curitemid);
-				htid = curitup->t_tid;
+
+				/*
+				 * decide if this is the first heap TID in tuple we'll
+				 * process, or if we should continue to process current
+				 * posting list
+				 */
+				if (!BTreeTupleIsPosting(curitup))
+				{
+					htid = curitup->t_tid;
+					inposting = false;
+				}
+				else if (!inposting)
+				{
+					/* First heap TID in posting list */
+					inposting = true;
+					prev_all_dead = true;
+					curposti = 0;
+				}
+
+				if (inposting)
+					htid = *BTreeTupleGetPostingN(curitup, curposti);
 
 				/*
 				 * If we are doing a recheck, we expect to find the tuple we
@@ -511,8 +543,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					 * not part of this chain because it had a different index
 					 * entry.
 					 */
-					htid = itup->t_tid;
-					if (table_index_fetch_tuple_check(heapRel, &htid,
+					if (table_index_fetch_tuple_check(heapRel, &itup->t_tid,
 													  SnapshotSelf, NULL))
 					{
 						/* Normal case --- it's still live */
@@ -570,12 +601,14 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 													RelationGetRelationName(rel))));
 					}
 				}
-				else if (all_dead)
+				else if (all_dead && (!inposting ||
+									  (prev_all_dead &&
+									   curposti == BTreeTupleGetNPosting(curitup) - 1)))
 				{
 					/*
-					 * The conflicting tuple (or whole HOT chain) is dead to
-					 * everyone, so we may as well mark the index entry
-					 * killed.
+					 * The conflicting tuple (or all HOT chains pointed to by
+					 * all posting list TIDs) is dead to everyone, so mark the
+					 * index entry killed.
 					 */
 					ItemIdMarkDead(curitemid);
 					opaque->btpo_flags |= BTP_HAS_GARBAGE;
@@ -589,14 +622,29 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					else
 						MarkBufferDirtyHint(insertstate->buf, true);
 				}
+
+				/*
+				 * Remember if posting list tuple has even a single HOT chain
+				 * whose members are not all dead
+				 */
+				if (!all_dead && inposting)
+					prev_all_dead = false;
 			}
 		}
 
-		/*
-		 * Advance to next tuple to continue checking.
-		 */
-		if (offset < maxoff)
+		if (inposting && curposti < BTreeTupleGetNPosting(curitup) - 1)
+		{
+			/* Advance to next TID in same posting list */
+			curposti++;
+			continue;
+		}
+		else if (offset < maxoff)
+		{
+			/* Advance to next tuple */
+			curposti = 0;
+			inposting = false;
 			offset = OffsetNumberNext(offset);
+		}
 		else
 		{
 			int			highkeycmp;
@@ -621,6 +669,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					elog(ERROR, "fell off the end of index \"%s\"",
 						 RelationGetRelationName(rel));
 			}
+			curposti = 0;
+			inposting = false;
 			maxoff = PageGetMaxOffsetNumber(page);
 			offset = P_FIRSTDATAKEY(opaque);
 			/* Don't invalidate binary search bounds */
@@ -689,6 +739,7 @@ _bt_findinsertloc(Relation rel,
 	BTScanInsert itup_key = insertstate->itup_key;
 	Page		page = BufferGetPage(insertstate->buf);
 	BTPageOpaque lpageop;
+	OffsetNumber location;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -751,13 +802,26 @@ _bt_findinsertloc(Relation rel,
 
 		/*
 		 * If the target page is full, see if we can obtain enough space by
-		 * erasing LP_DEAD items
+		 * erasing LP_DEAD items.  If that doesn't work out, and if the index
+		 * deduplication is both possible and enabled, try deduplication.
 		 */
-		if (PageGetFreeSpace(page) < insertstate->itemsz &&
-			P_HAS_GARBAGE(lpageop))
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
 		{
-			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
-			insertstate->bounds_valid = false;
+			if (P_HAS_GARBAGE(lpageop))
+			{
+				_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+				insertstate->bounds_valid = false;
+			}
+
+			if (insertstate->itup_key->safededup &&
+				BtreeGetDoDedupOption(rel) &&
+				PageGetFreeSpace(page) < insertstate->itemsz)
+			{
+				_bt_dedup_one_page(rel, insertstate->buf, heapRel,
+								   insertstate->itup, insertstate->itemsz,
+								   checkingunique);
+				insertstate->bounds_valid = false;
+			}
 		}
 	}
 	else
@@ -839,7 +903,38 @@ _bt_findinsertloc(Relation rel,
 	Assert(P_RIGHTMOST(lpageop) ||
 		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
 
-	return _bt_binsrch_insert(rel, insertstate);
+	location = _bt_binsrch_insert(rel, insertstate);
+
+	/*
+	 * Insertion is not prepared for the case where an LP_DEAD posting list
+	 * tuple must be split.  In the unlikely event that this happens, call
+	 * _bt_dedup_one_page() to force it to kill all LP_DEAD items.
+	 */
+	if (unlikely(insertstate->postingoff == -1))
+	{
+		Assert(insertstate->itup_key->safededup);
+
+		/*
+		 * Don't check if the option is enabled, since no actual deduplication
+		 * will be done, just cleanup.
+		 */
+		_bt_dedup_one_page(rel, insertstate->buf, heapRel, insertstate->itup,
+						   0, checkingunique);
+		Assert(!P_HAS_GARBAGE(lpageop));
+
+		/* Must reset insertstate ahead of new _bt_binsrch_insert() call */
+		insertstate->bounds_valid = false;
+		insertstate->postingoff = 0;
+		location = _bt_binsrch_insert(rel, insertstate);
+
+		/*
+		 * Might still have to split some other posting list now, but that
+		 * should never be LP_DEAD
+		 */
+		Assert(insertstate->postingoff >= 0);
+	}
+
+	return location;
 }
 
 /*
@@ -905,10 +1000,12 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		This recursive procedure does the following things:
  *
+ *			+  if necessary, splits an existing posting list on page.
+ *			   This is only needed when 'postingoff' is non-zero.
  *			+  if necessary, splits the target page, using 'itup_key' for
  *			   suffix truncation on leaf pages (caller passes NULL for
  *			   non-leaf pages).
- *			+  inserts the tuple.
+ *			+  inserts the new tuple (could be from split posting list).
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
  *			   right using information stored in the parent stack).
@@ -918,7 +1015,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		On entry, we must have the correct buffer in which to do the
  *		insertion, and the buffer must be pinned and write-locked.  On return,
- *		we will have dropped both the pin and the lock on the buffer.
+ *		we will have dropped both the pin and the lock on the buffer.  Caller
+ *		should be prepared for us to scribble on 'itup'.
  *
  *		This routine only performs retail tuple insertions.  'itup' should
  *		always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +1034,15 @@ _bt_insertonpg(Relation rel,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
+			   int postingoff,
 			   bool split_only_page)
 {
 	Page		page;
 	BTPageOpaque lpageop;
 	Size		itemsz;
+	IndexTuple	oposting;
+	IndexTuple	origitup = NULL;
+	IndexTuple	nposting = NULL;
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1056,8 @@ _bt_insertonpg(Relation rel,
 	Assert(P_ISLEAF(lpageop) ||
 		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
+	/* retail insertions of posting list tuples are disallowed */
+	Assert(!BTreeTupleIsPosting(itup));
 
 	/* The caller should've finished any incomplete splits already. */
 	if (P_INCOMPLETE_SPLIT(lpageop))
@@ -964,6 +1068,39 @@ _bt_insertonpg(Relation rel,
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
+	/*
+	 * Do we need to split an existing posting list item?
+	 */
+	if (postingoff != 0)
+	{
+		ItemId		itemid = PageGetItemId(page, newitemoff);
+
+		/*
+		 * The new tuple is a duplicate with a heap TID that falls inside the
+		 * range of an existing posting list tuple on a leaf page.  Prepare to
+		 * split an existing posting list by swapping new item's heap TID with
+		 * the rightmost heap TID from original posting list, and generating a
+		 * new version of the posting list that has new item's heap TID.
+		 *
+		 * Posting list splits work by modifying the overlapping posting list
+		 * as part of the same atomic operation that inserts the "new item".
+		 * The space accounting is kept simple, since it does not need to
+		 * consider posting list splits at all (this is particularly important
+		 * for the case where we also have to split the page).  Overwriting
+		 * the posting list with its post-split version is treated as an extra
+		 * step in either the insert or page split critical section.
+		 */
+		Assert(P_ISLEAF(lpageop) && !ItemIdIsDead(itemid));
+		oposting = (IndexTuple) PageGetItem(page, itemid);
+
+		/* save a copy of itup with unchanged TID for xlog record */
+		origitup = CopyIndexTuple(itup);
+		nposting = _bt_swap_posting(itup, oposting, postingoff);
+
+		/* Alter offset so that it goes after existing posting list */
+		newitemoff = OffsetNumberNext(newitemoff);
+	}
+
 	/*
 	 * Do we need to split the page to fit the item on it?
 	 *
@@ -996,7 +1133,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+						 origitup, nposting, postingoff);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1075,6 +1213,13 @@ _bt_insertonpg(Relation rel,
 			elog(PANIC, "failed to add new item to block %u in index \"%s\"",
 				 itup_blkno, RelationGetRelationName(rel));
 
+		/*
+		 * Posting list split requires an in-place update of the existing
+		 * posting list
+		 */
+		if (nposting)
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
 		MarkBufferDirty(buf);
 
 		if (BufferIsValid(metabuf))
@@ -1116,6 +1261,7 @@ _bt_insertonpg(Relation rel,
 			XLogRecPtr	recptr;
 
 			xlrec.offnum = itup_off;
+			xlrec.postingoff = postingoff;
 
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
@@ -1144,6 +1290,7 @@ _bt_insertonpg(Relation rel,
 				xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 				xlmeta.last_cleanup_num_heap_tuples =
 					metad->btm_last_cleanup_num_heap_tuples;
+				xlmeta.btm_safededup = metad->btm_safededup;
 
 				XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 				XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));
@@ -1152,7 +1299,19 @@ _bt_insertonpg(Relation rel,
 			}
 
 			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
-			XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+
+			/*
+			 * We always write newitem to the page, but when there is an
+			 * original newitem due to a posting list split then we log the
+			 * original item instead.  REDO routine must reconstruct the final
+			 * newitem at the same time it reconstructs nposting.
+			 */
+			if (postingoff == 0)
+				XLogRegisterBufData(0, (char *) itup,
+									IndexTupleSize(itup));
+			else
+				XLogRegisterBufData(0, (char *) origitup,
+									IndexTupleSize(origitup));
 
 			recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
@@ -1194,6 +1353,13 @@ _bt_insertonpg(Relation rel,
 			_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
 			RelationSetTargetBlock(rel, cachedBlock);
 	}
+
+	/* be tidy */
+	if (postingoff != 0)
+	{
+		pfree(nposting);
+		pfree(origitup);
+	}
 }
 
 /*
@@ -1209,12 +1375,25 @@ _bt_insertonpg(Relation rel,
  *		This function will clear the INCOMPLETE_SPLIT flag on it, and
  *		release the buffer.
  *
+ *		orignewitem, nposting, and postingoff are needed when an insert of
+ *		orignewitem results in both a posting list split and a page split.
+ *		newitem and nposting are replacements for orignewitem and the
+ *		existing posting list on the page respectively.  These extra
+ *		posting list split details are used here in the same way as they
+ *		are used in the more common case where a posting list split does
+ *		not coincide with a page split.  We need to deal with posting list
+ *		splits directly in order to ensure that everything that follows
+ *		from the insert of orignewitem is handled as a single atomic
+ *		operation (though caller's insert of a new pivot/downlink into
+ *		parent page will still be a separate operation).
+ *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
  */
 static Buffer
 _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+		  IndexTuple orignewitem, IndexTuple nposting, OffsetNumber postingoff)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1236,12 +1415,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	OffsetNumber firstright;
 	OffsetNumber maxoff;
 	OffsetNumber i;
+	OffsetNumber replacepostingoff = InvalidOffsetNumber;
 	bool		newitemonleft,
 				isleaf;
 	IndexTuple	lefthikey;
 	int			indnatts = IndexRelationGetNumberOfAttributes(rel);
 	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
+	/*
+	 * Determine offset number of existing posting list on page when a split
+	 * of a posting list needs to take place as the page is split
+	 */
+	if (nposting != NULL)
+	{
+		Assert(itup_key->heapkeyspace);
+		replacepostingoff = OffsetNumberPrev(newitemoff);
+	}
+
 	/*
 	 * origpage is the original page to be split.  leftpage is a temporary
 	 * buffer that receives the left-sibling data, which will be copied back
@@ -1273,6 +1463,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	 * newitemoff == firstright.  In all other cases it's clear which side of
 	 * the split every tuple goes on from context.  newitemonleft is usually
 	 * (but not always) redundant information.
+	 *
+	 * Note: In theory, the split point choice logic should operate against a
+	 * version of the page that already replaced the posting list at offset
+	 * replacepostingoff with nposting where applicable.  We don't bother with
+	 * that, though.  Both versions of the posting list must be the same size,
+	 * and both will have the same base tuple key values, so split point
+	 * choice is never affected.
 	 */
 	firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
 								  newitem, &newitemonleft);
@@ -1340,6 +1537,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, firstright);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		/* Behave as if origpage posting list has already been swapped */
+		if (firstright == replacepostingoff)
+			item = nposting;
 	}
 
 	/*
@@ -1373,6 +1573,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
 			itemid = PageGetItemId(origpage, lastleftoff);
 			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+			/* Behave as if origpage posting list has already been swapped */
+			if (lastleftoff == replacepostingoff)
+				lastleft = nposting;
 		}
 
 		Assert(lastleft != item);
@@ -1480,8 +1683,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 
+		/*
+		 * did caller pass new replacement posting list tuple due to posting
+		 * list split?
+		 */
+		if (i == replacepostingoff)
+		{
+			/*
+			 * swap origpage posting list with post-posting-list-split version
+			 * from caller
+			 */
+			Assert(isleaf);
+			Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+			item = nposting;
+		}
+
 		/* does new item belong before this one? */
-		if (i == newitemoff)
+		else if (i == newitemoff)
 		{
 			if (newitemonleft)
 			{
@@ -1650,8 +1868,12 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		XLogRecPtr	recptr;
 
 		xlrec.level = ropaque->btpo.level;
+		/* See comments below on newitem, orignewitem, and posting lists */
 		xlrec.firstright = firstright;
 		xlrec.newitemoff = newitemoff;
+		xlrec.postingoff = InvalidOffsetNumber;
+		if (replacepostingoff < firstright)
+			xlrec.postingoff = postingoff;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1670,11 +1892,45 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		 * because it's included with all the other items on the right page.)
 		 * Show the new item as belonging to the left page buffer, so that it
 		 * is not stored if XLogInsert decides it needs a full-page image of
-		 * the left page.  We store the offset anyway, though, to support
-		 * archive compression of these records.
+		 * the left page.  We always store newitemoff in the record, though.
+		 *
+		 * The details are sometimes slightly different for page splits that
+		 * coincide with a posting list split.  If both the replacement
+		 * posting list and newitem go on the right page, then we don't need
+		 * to log anything extra, just like the simple !newitemonleft
+		 * no-posting-split case (postingoff isn't set in the WAL record, so
+		 * recovery doesn't need to process a posting list split at all).
+		 * Otherwise, we set postingoff and log orignewitem instead of
+		 * newitem, despite having actually inserted newitem.  Recovery must
+		 * reconstruct nposting and newitem by calling _bt_swap_posting().
+		 *
+		 * Note: It's possible that our page split point is the point that
+		 * makes the posting list lastleft and newitem firstright.  This is
+		 * the only case where we log orignewitem despite newitem going on the
+		 * right page.  If XLogInsert decides that it can omit orignewitem due
+		 * to logging a full-page image of the left page, everything still
+		 * works out, since recovery only needs to log orignewitem for items
+		 * on the left page (just like the regular newitem-logged case).
 		 */
-		if (newitemonleft)
-			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+		if (newitemonleft || xlrec.postingoff != InvalidOffsetNumber)
+		{
+			if (xlrec.postingoff == InvalidOffsetNumber)
+			{
+				/* Must WAL-log newitem, since it's on left page */
+				Assert(newitemonleft);
+				Assert(orignewitem == NULL && nposting == NULL);
+				XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+			}
+			else
+			{
+				/* Must WAL-log orignewitem following posting list split */
+				Assert(newitemonleft || firstright == newitemoff);
+				Assert(ItemPointerCompare(&orignewitem->t_tid,
+										  &newitem->t_tid) < 0);
+				XLogRegisterBufData(0, (char *) orignewitem,
+									MAXALIGN(IndexTupleSize(orignewitem)));
+			}
+		}
 
 		/* Log the left page's new high key */
 		itemid = PageGetItemId(origpage, P_HIKEY);
@@ -1834,7 +2090,7 @@ _bt_insert_parent(Relation rel,
 
 		/* Recursively insert into the parent */
 		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
-					   new_item, stack->bts_offset + 1,
+					   new_item, stack->bts_offset + 1, 0,
 					   is_only);
 
 		/* be tidy */
@@ -2190,6 +2446,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 		md.fastlevel = metad->btm_level;
 		md.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 		md.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+		md.btm_safededup = metad->btm_safededup;
 
 		XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -2303,6 +2560,6 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * Note: if we didn't find any LP_DEAD items, then the page's
 	 * BTP_HAS_GARBAGE hint bit is falsely set.  We do not bother expending a
 	 * separate write to clear it, however.  We will clear it when we split
-	 * the page.
+	 * the page (or when deduplication runs).
 	 */
 }
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 268f869a36..77f443f7a9 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -42,12 +43,18 @@ static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
 								   BlockNumber *target, BlockNumber *rightsib);
 static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
 							   TransactionId latestRemovedXid);
+static TransactionId _bt_compute_xid_horizon_for_tuples(Relation rel,
+														Relation heapRel,
+														Buffer buf,
+														OffsetNumber *itemnos,
+														int nitems);
 
 /*
  *	_bt_initmetapage() -- Fill a page buffer with a correct metapage image
  */
 void
-_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
+_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+				 bool safededup)
 {
 	BTMetaPageData *metad;
 	BTPageOpaque metaopaque;
@@ -63,6 +70,7 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
 	metad->btm_fastlevel = level;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+	metad->btm_safededup = safededup;
 
 	metaopaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	metaopaque->btpo_flags = BTP_META;
@@ -102,6 +110,9 @@ _bt_upgrademetapage(Page page)
 	metad->btm_version = BTREE_NOVAC_VERSION;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+	/* Only a REINDEX can set this field */
+	Assert(!metad->btm_safededup);
+	metad->btm_safededup = false;
 
 	/* Adjust pd_lower (see _bt_initmetapage() for details) */
 	((PageHeader) page)->pd_lower =
@@ -213,6 +224,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 		md.fastlevel = metad->btm_fastlevel;
 		md.oldest_btpo_xact = oldestBtpoXact;
 		md.last_cleanup_num_heap_tuples = numHeapTuples;
+		md.btm_safededup = metad->btm_safededup;
 
 		XLogRegisterBufData(0, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -274,6 +286,8 @@ _bt_getroot(Relation rel, int access)
 		Assert(metad->btm_magic == BTREE_MAGIC);
 		Assert(metad->btm_version >= BTREE_MIN_VERSION);
 		Assert(metad->btm_version <= BTREE_VERSION);
+		Assert(!metad->btm_safededup ||
+			   metad->btm_version > BTREE_NOVAC_VERSION);
 		Assert(metad->btm_root != P_NONE);
 
 		rootblkno = metad->btm_fastroot;
@@ -394,6 +408,7 @@ _bt_getroot(Relation rel, int access)
 			md.fastlevel = 0;
 			md.oldest_btpo_xact = InvalidTransactionId;
 			md.last_cleanup_num_heap_tuples = -1.0;
+			md.btm_safededup = metad->btm_safededup;
 
 			XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -618,6 +633,7 @@ _bt_getrootheight(Relation rel)
 	Assert(metad->btm_magic == BTREE_MAGIC);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
 	Assert(metad->btm_version <= BTREE_VERSION);
+	Assert(!metad->btm_safededup || metad->btm_version > BTREE_NOVAC_VERSION);
 	Assert(metad->btm_fastroot != P_NONE);
 
 	return metad->btm_fastlevel;
@@ -683,6 +699,56 @@ _bt_heapkeyspace(Relation rel)
 	return metad->btm_version > BTREE_NOVAC_VERSION;
 }
 
+/*
+ *	_bt_safededup() -- can deduplication safely be used by index?
+ *
+ * Uses field from index relation's metapage/cached metapage.
+ */
+bool
+_bt_safededup(Relation rel)
+{
+	BTMetaPageData *metad;
+
+	if (rel->rd_amcache == NULL)
+	{
+		Buffer		metabuf;
+
+		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+		metad = _bt_getmeta(rel, metabuf);
+
+		/*
+		 * If there's no root page yet, _bt_getroot() doesn't expect a cache
+		 * to be made, so just stop here.  (XXX perhaps _bt_getroot() should
+		 * be changed to allow this case.)
+		 *
+		 * Note that we rely on the assumption that this field will be zero'ed
+		 * on indexes that were pg_upgrade'd.
+		 */
+		if (metad->btm_root == P_NONE)
+		{
+			_bt_relbuf(rel, metabuf);
+			return metad->btm_safededup;;
+		}
+
+		/* Cache the metapage data for next time */
+		rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
+											 sizeof(BTMetaPageData));
+		memcpy(rel->rd_amcache, metad, sizeof(BTMetaPageData));
+		_bt_relbuf(rel, metabuf);
+	}
+
+	/* Get cached page */
+	metad = (BTMetaPageData *) rel->rd_amcache;
+	/* We shouldn't have cached it if any of these fail */
+	Assert(metad->btm_magic == BTREE_MAGIC);
+	Assert(metad->btm_version >= BTREE_MIN_VERSION);
+	Assert(metad->btm_version <= BTREE_VERSION);
+	Assert(!metad->btm_safededup || metad->btm_version > BTREE_NOVAC_VERSION);
+	Assert(metad->btm_fastroot != P_NONE);
+
+	return metad->btm_safededup;
+}
+
 /*
  *	_bt_checkpage() -- Verify that a freshly-read page looks sane.
  */
@@ -983,14 +1049,52 @@ _bt_page_recyclable(Page page)
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
 					OffsetNumber *itemnos, int nitems,
+					OffsetNumber *updateitemnos,
+					IndexTuple *updated, int nupdatable,
 					BlockNumber lastBlockVacuumed)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	Size		itemsz;
+	Size		updated_sz = 0;
+	char	   *updated_buf = NULL;
+
+	/* XLOG stuff, buffer for updateds */
+	if (nupdatable > 0 && RelationNeedsWAL(rel))
+	{
+		Size		offset = 0;
+
+		for (int i = 0; i < nupdatable; i++)
+			updated_sz += MAXALIGN(IndexTupleSize(updated[i]));
+
+		updated_buf = palloc(updated_sz);
+		for (int i = 0; i < nupdatable; i++)
+		{
+			itemsz = IndexTupleSize(updated[i]);
+			memcpy(updated_buf + offset, (char *) updated[i], itemsz);
+			offset += MAXALIGN(itemsz);
+		}
+		Assert(offset == updated_sz);
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
+	/* Handle posting tuples here */
+	for (int i = 0; i < nupdatable; i++)
+	{
+		/* At first, delete the old tuple. */
+		PageIndexTupleDelete(page, updateitemnos[i]);
+
+		itemsz = IndexTupleSize(updated[i]);
+		itemsz = MAXALIGN(itemsz);
+
+		/* Add tuple with updated ItemPointers to the page. */
+		if (PageAddItem(page, (Item) updated[i], itemsz, updateitemnos[i],
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to rewrite posting list item in index while doing vacuum");
+	}
+
 	/* Fix the page */
 	if (nitems > 0)
 		PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1124,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+		xlrec_vacuum.nupdated = nupdatable;
+		xlrec_vacuum.ndeleted = nitems;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1139,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		if (nitems > 0)
 			XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
 
+		/*
+		 * Here we should save offnums and updated tuples themselves. It's
+		 * important to restore them in correct order. At first, we must
+		 * handle updated tuples and only after that other deleted items.
+		 */
+		if (nupdatable > 0)
+		{
+			Assert(updated_buf != NULL);
+			XLogRegisterBufData(0, (char *) updateitemnos,
+								nupdatable * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, updated_buf, updated_sz);
+		}
+
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
 		PageSetLSN(page, recptr);
@@ -1041,6 +1160,91 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	END_CRIT_SECTION();
 }
 
+/*
+ * Get the latestRemovedXid from the table entries pointed at by the index
+ * tuples being deleted.
+ *
+ * This is a version of index_compute_xid_horizon_for_tuples() specialized to
+ * nbtree, which can handle posting lists.
+ */
+static TransactionId
+_bt_compute_xid_horizon_for_tuples(Relation rel, Relation heapRel,
+								   Buffer buf, OffsetNumber *itemnos,
+								   int nitems)
+{
+	ItemPointer htids;
+	TransactionId latestRemovedXid = InvalidTransactionId;
+	Page		page = BufferGetPage(buf);
+	int			arraynitems;
+	int			finalnitems;
+
+	/*
+	 * Initial size of array can fit everything when it turns out that are no
+	 * posting lists
+	 */
+	arraynitems = nitems;
+	htids = (ItemPointer) palloc(sizeof(ItemPointerData) * arraynitems);
+
+	finalnitems = 0;
+	/* identify what the index tuples about to be deleted point to */
+	for (int i = 0; i < nitems; i++)
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+
+		itemid = PageGetItemId(page, itemnos[i]);
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(ItemIdIsDead(itemid));
+
+		if (!BTreeTupleIsPosting(itup))
+		{
+			/* Make sure that we have space for additional heap TID */
+			if (finalnitems + 1 > arraynitems)
+			{
+				arraynitems = arraynitems * 2;
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			Assert(ItemPointerIsValid(&itup->t_tid));
+			ItemPointerCopy(&itup->t_tid, &htids[finalnitems]);
+			finalnitems++;
+		}
+		else
+		{
+			int			nposting = BTreeTupleGetNPosting(itup);
+
+			/* Make sure that we have space for additional heap TIDs */
+			if (finalnitems + nposting > arraynitems)
+			{
+				arraynitems = Max(arraynitems * 2, finalnitems + nposting);
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			for (int j = 0; j < nposting; j++)
+			{
+				ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+				Assert(ItemPointerIsValid(htid));
+				ItemPointerCopy(htid, &htids[finalnitems]);
+				finalnitems++;
+			}
+		}
+	}
+
+	Assert(finalnitems >= nitems);
+
+	/* determine the actual xid horizon */
+	latestRemovedXid =
+		table_compute_xid_horizon_for_tuples(heapRel, htids, finalnitems);
+
+	pfree(htids);
+
+	return latestRemovedXid;
+}
+
 /*
  * Delete item(s) from a btree page during single-page cleanup.
  *
@@ -1067,8 +1271,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 
 	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
 		latestRemovedXid =
-			index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
-												 itemnos, nitems);
+			_bt_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+											   itemnos, nitems);
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
@@ -2066,6 +2270,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 			xlmeta.fastlevel = metad->btm_fastlevel;
 			xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 			xlmeta.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+			xlmeta.btm_safededup = metad->btm_safededup;
 
 			XLogRegisterBufData(4, (char *) &xlmeta, sizeof(xl_btree_metadata));
 			xlinfo = XLOG_BTREE_UNLINK_PAGE_META;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd5289ad..2cdc3d499f 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static ItemPointer btreevacuumposting(BTVacState *vstate, IndexTuple itup,
+									  int *nremaining);
 
 
 /*
@@ -160,7 +162,7 @@ btbuildempty(Relation index)
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, P_NONE, 0);
+	_bt_initmetapage(metapage, P_NONE, 0, _bt_opclasses_support_dedup(index));
 
 	/*
 	 * Write the page and log it.  It might seem that an immediate sync would
@@ -263,8 +265,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 				 */
 				if (so->killedItems == NULL)
 					so->killedItems = (int *)
-						palloc(MaxIndexTuplesPerPage * sizeof(int));
-				if (so->numKilled < MaxIndexTuplesPerPage)
+						palloc(MaxBTreeIndexTuplesPerPage * sizeof(int));
+				if (so->numKilled < MaxBTreeIndexTuplesPerPage)
 					so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 			}
 
@@ -816,7 +818,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 	}
 	else
 	{
-		StdRdOptions *relopts;
+		BtreeOptions *relopts;
 		float8		cleanup_scale_factor;
 		float8		prev_num_heap_tuples;
 
@@ -827,7 +829,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 		 * tuples exceeds vacuum_cleanup_index_scale_factor fraction of
 		 * original tuples count.
 		 */
-		relopts = (StdRdOptions *) info->index->rd_options;
+		relopts = (BtreeOptions *) info->index->rd_options;
 		cleanup_scale_factor = (relopts &&
 								relopts->vacuum_cleanup_index_scale_factor >= 0)
 			? relopts->vacuum_cleanup_index_scale_factor
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 								 RBM_NORMAL, info->strategy);
 		LockBufferForCleanup(buf);
 		_bt_checkpage(rel, buf);
-		_bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+		_bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+							vstate.lastBlockVacuumed);
 		_bt_relbuf(rel, buf);
 	}
 
@@ -1188,8 +1191,17 @@ restart:
 	}
 	else if (P_ISLEAF(opaque))
 	{
+		/* Deletable item state */
 		OffsetNumber deletable[MaxOffsetNumber];
 		int			ndeletable;
+		int			nhtidsdead;
+		int			nhtidslive;
+
+		/* Updatable item state (for posting lists) */
+		IndexTuple	updated[MaxOffsetNumber];
+		OffsetNumber updatable[MaxOffsetNumber];
+		int			nupdatable;
+
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
@@ -1229,6 +1241,10 @@ restart:
 		 * callback function.
 		 */
 		ndeletable = 0;
+		nupdatable = 0;
+		/* Maintain stats counters for index tuple versions/heap TIDs */
+		nhtidsdead = 0;
+		nhtidslive = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
 		if (callback)
@@ -1238,11 +1254,9 @@ restart:
 				 offnum = OffsetNumberNext(offnum))
 			{
 				IndexTuple	itup;
-				ItemPointer htup;
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
 				/*
 				 * During Hot Standby we currently assume that
@@ -1265,8 +1279,71 @@ restart:
 				 * applies to *any* type of index that marks index tuples as
 				 * killed.
 				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Regular tuple, standard heap TID representation */
+					ItemPointer htid = &(itup->t_tid);
+
+					if (callback(htid, callback_state))
+					{
+						deletable[ndeletable++] = offnum;
+						nhtidsdead++;
+					}
+					else
+						nhtidslive++;
+				}
+				else
+				{
+					ItemPointer newhtids;
+					int			nremaining;
+
+					/*
+					 * Posting list tuple, a physical tuple that represents
+					 * two or more logical tuples, any of which could be an
+					 * index row version that must be removed
+					 */
+					newhtids = btreevacuumposting(vstate, itup, &nremaining);
+					if (newhtids == NULL)
+					{
+						/*
+						 * All TIDs/logical tuples from the posting tuple
+						 * remain, so no update or delete required
+						 */
+						Assert(nremaining == BTreeTupleGetNPosting(itup));
+					}
+					else if (nremaining > 0)
+					{
+						IndexTuple	updatedtuple;
+
+						/*
+						 * Form new tuple that contains only remaining TIDs.
+						 * Remember this tuple and the offset of the old tuple
+						 * for when we update it in place
+						 */
+						Assert(nremaining < BTreeTupleGetNPosting(itup));
+						updatedtuple = _bt_form_posting(itup, newhtids,
+														nremaining);
+						updated[nupdatable] = updatedtuple;
+						updatable[nupdatable++] = offnum;
+						nhtidsdead += BTreeTupleGetNPosting(itup) - nremaining;
+						pfree(newhtids);
+					}
+					else
+					{
+						/*
+						 * All TIDs/logical tuples from the posting list must
+						 * be deleted.  We'll delete the physical tuple
+						 * completely.
+						 */
+						deletable[ndeletable++] = offnum;
+						nhtidsdead += BTreeTupleGetNPosting(itup);
+
+						/* Free empty array of live items */
+						pfree(newhtids);
+					}
+
+					nhtidslive += nremaining;
+				}
 			}
 		}
 
@@ -1274,7 +1351,7 @@ restart:
 		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
 		 * call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nupdatable > 0)
 		{
 			/*
 			 * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1290,7 +1367,8 @@ restart:
 			 * doesn't seem worth the amount of bookkeeping it'd take to avoid
 			 * that.
 			 */
-			_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+			_bt_delitems_vacuum(rel, buf, deletable, ndeletable, updatable,
+								updated, nupdatable,
 								vstate->lastBlockVacuumed);
 
 			/*
@@ -1300,7 +1378,7 @@ restart:
 			if (blkno > vstate->lastBlockVacuumed)
 				vstate->lastBlockVacuumed = blkno;
 
-			stats->tuples_removed += ndeletable;
+			stats->tuples_removed += nhtidsdead;
 			/* must recompute maxoff */
 			maxoff = PageGetMaxOffsetNumber(page);
 		}
@@ -1315,6 +1393,7 @@ restart:
 			 * We treat this like a hint-bit update because there's no need to
 			 * WAL-log it.
 			 */
+			Assert(nhtidsdead == 0);
 			if (vstate->cycleid != 0 &&
 				opaque->btpo_cycleid == vstate->cycleid)
 			{
@@ -1324,15 +1403,16 @@ restart:
 		}
 
 		/*
-		 * If it's now empty, try to delete; else count the live tuples. We
-		 * don't delete when recursing, though, to avoid putting entries into
+		 * If it's now empty, try to delete; else count the live tuples (live
+		 * heap TIDs in posting lists are counted as live tuples).  We don't
+		 * delete when recursing, though, to avoid putting entries into
 		 * freePages out-of-order (doesn't seem worth any extra code to handle
 		 * the case).
 		 */
 		if (minoff > maxoff)
 			delete_now = (blkno == orig_blkno);
 		else
-			stats->num_index_tuples += maxoff - minoff + 1;
+			stats->num_index_tuples += nhtidslive;
 	}
 
 	if (delete_now)
@@ -1375,6 +1455,68 @@ restart:
 	}
 }
 
+/*
+ * btreevacuumposting() -- determines which logical tuples must remain when
+ * VACUUMing a posting list tuple.
+ *
+ * Returns new palloc'd array of item pointers needed to build replacement
+ * posting list without the index row versions that are to be deleted.
+ *
+ * Note that returned array is NULL in the common case where there is nothing
+ * to delete in caller's posting list tuple.  The number of TIDs that should
+ * remain in the posting list tuple is set for caller in *nremaining.  This is
+ * also the size of the returned array (though only when array isn't just
+ * NULL).
+ */
+static ItemPointer
+btreevacuumposting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+	int			live = 0;
+	int			nitem = BTreeTupleGetNPosting(itup);
+	ItemPointer tmpitems = NULL,
+				items = BTreeTupleGetPosting(itup);
+
+	Assert(BTreeTupleIsPosting(itup));
+
+	/*
+	 * Check each tuple in the posting list.  Save live tuples into tmpitems,
+	 * though try to avoid memory allocation as an optimization.
+	 */
+	for (int i = 0; i < nitem; i++)
+	{
+		if (!vstate->callback(items + i, vstate->callback_state))
+		{
+			/*
+			 * Live heap TID.
+			 *
+			 * Only save live TID when we know that we're going to have to
+			 * kill at least one TID, and have already allocated memory.
+			 */
+			if (tmpitems)
+				tmpitems[live] = items[i];
+			live++;
+		}
+
+		/* Dead heap TID */
+		else if (tmpitems == NULL)
+		{
+			/*
+			 * Turns out we need to delete one or more dead heap TIDs, so
+			 * start maintaining an array of live TIDs for caller to
+			 * reconstruct smaller replacement posting list tuple
+			 */
+			tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+			/* Copy live heap TIDs from previous loop iterations */
+			if (live > 0)
+				memcpy(tmpitems, items, sizeof(ItemPointerData) * live);
+		}
+	}
+
+	*nremaining = live;
+	return tmpitems;
+}
+
 /*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8e512461a0..c954926f2d 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int	_bt_binsrch_posting(BTScanInsert key, Page page,
+								OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static void _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+								  OffsetNumber offnum, ItemPointer heapTid,
+								  IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+									   OffsetNumber offnum,
+									   ItemPointer heapTid);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -434,7 +442,10 @@ _bt_binsrch(Relation rel,
  * low) makes bounds invalid.
  *
  * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
  */
 OffsetNumber
 _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +464,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 	Assert(P_ISLEAF(opaque));
 	Assert(!key->nextkey);
+	Assert(insertstate->postingoff == 0);
 
 	if (!insertstate->bounds_valid)
 	{
@@ -509,6 +521,16 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 			if (result != 0)
 				stricthigh = high;
 		}
+
+		/*
+		 * If tuple at offset located by binary search is a posting list whose
+		 * TID range overlaps with caller's scantid, perform posting list
+		 * binary search to set postingoff for caller.  Caller must split the
+		 * posting list when postingoff is set.  This should happen
+		 * infrequently.
+		 */
+		if (unlikely(result == 0 && key->scantid != NULL))
+			insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
 	}
 
 	/*
@@ -528,6 +550,68 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	return low;
 }
 
+/*----------
+ *	_bt_binsrch_posting() -- posting list binary search.
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+	IndexTuple	itup;
+	ItemId		itemid;
+	int			low,
+				high,
+				mid,
+				res;
+
+	/*
+	 * If this isn't a posting tuple, then the index must be corrupt (if it is
+	 * an ordinary non-pivot tuple then there must be an existing tuple with a
+	 * heap TID that equals inserter's new heap TID/scantid).  Defensively
+	 * check that tuple is a posting list tuple whose posting list range
+	 * includes caller's scantid.
+	 *
+	 * (This is also needed because contrib/amcheck's rootdescend option needs
+	 * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+	 */
+	itemid = PageGetItemId(page, offnum);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+	if (!BTreeTupleIsPosting(itup))
+		return 0;
+
+	/*
+	 * In the unlikely event that posting list tuple has LP_DEAD bit set,
+	 * signal to caller that it should kill the item and restart its binary
+	 * search.
+	 */
+	if (ItemIdIsDead(itemid))
+		return -1;
+
+	/* "high" is past end of posting list for loop invariant */
+	low = 0;
+	high = BTreeTupleGetNPosting(itup);
+	Assert(high >= 2);
+
+	while (high > low)
+	{
+		mid = low + ((high - low) / 2);
+		res = ItemPointerCompare(key->scantid,
+								 BTreeTupleGetPostingN(itup, mid));
+
+		if (res > 0)
+			low = mid + 1;
+		else if (res < 0)
+			high = mid;
+		else
+			return mid;
+	}
+
+	/* Exact match not found */
+	return low;
+}
+
 /*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
@@ -537,9 +621,14 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
  *			<0 if scankey < tuple at offnum;
  *			 0 if scankey == tuple at offnum;
  *			>0 if scankey > tuple at offnum.
- *		NULLs in the keys are treated as sortable values.  Therefore
- *		"equality" does not necessarily mean that the item should be
- *		returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values.  Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key.  Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid.  There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
  *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
@@ -563,6 +652,7 @@ _bt_compare(Relation rel,
 	ScanKey		scankey;
 	int			ncmpkey;
 	int			ntupatts;
+	int32		result;
 
 	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +687,6 @@ _bt_compare(Relation rel,
 	{
 		Datum		datum;
 		bool		isNull;
-		int32		result;
 
 		datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
 
@@ -713,8 +802,25 @@ _bt_compare(Relation rel,
 	if (heapTid == NULL)
 		return 1;
 
+	/*
+	 * scankey must be treated as equal to a posting list tuple if its scantid
+	 * value falls within the range of the posting list.  In all other cases
+	 * there can only be a single heap TID value, which is compared directly
+	 * as a simple scalar value.
+	 */
 	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	return ItemPointerCompare(key->scantid, heapTid);
+	result = ItemPointerCompare(key->scantid, heapTid);
+	if (result <= 0 || !BTreeTupleIsPosting(itup))
+		return result;
+	else
+	{
+		result = ItemPointerCompare(key->scantid,
+									BTreeTupleGetMaxHeapTID(itup));
+		if (result > 0)
+			return 1;
+	}
+
+	return 0;
 }
 
 /*
@@ -1230,6 +1336,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 
 	/* Initialize remaining insertion scan key fields */
 	inskey.heapkeyspace = _bt_heapkeyspace(rel);
+	inskey.safededup = false;	/* unused */
 	inskey.anynullkeys = false; /* unused */
 	inskey.nextkey = nextkey;
 	inskey.pivotsearch = false;
@@ -1451,6 +1558,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	/* initialize tuple workspace to empty */
 	so->currPos.nextTupleOffset = 0;
+	so->currPos.postingTupleOffset = 0;
 
 	/*
 	 * Now that the current page has been made consistent, the macro should be
@@ -1484,9 +1592,31 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
-				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				/* tuple passes all scan key conditions */
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Remember it */
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+				else
+				{
+					/*
+					 * Set up state to return posting list, and remember first
+					 * "logical" tuple
+					 */
+					_bt_setuppostingitems(so, itemIndex, offnum,
+										  BTreeTupleGetPostingN(itup, 0),
+										  itup);
+					itemIndex++;
+					/* Remember additional logical tuples */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i));
+						itemIndex++;
+					}
+				}
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1519,7 +1649,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxBTreeIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1527,7 +1657,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxBTreeIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1568,9 +1698,37 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 										 &continuescan);
 			if (passes_quals && tuple_alive)
 			{
-				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				/* tuple passes all scan key conditions */
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Remember it */
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+				else
+				{
+					int			i = BTreeTupleGetNPosting(itup) - 1;
+
+					/*
+					 * Set up state to return posting list, and remember last
+					 * "logical" tuple (since we'll return it first)
+					 */
+					itemIndex--;
+					_bt_setuppostingitems(so, itemIndex, offnum,
+										  BTreeTupleGetPostingN(itup, i--),
+										  itup);
+
+					/*
+					 * Remember additional logical tuples (use desc order to
+					 * be consistent with order of entire scan)
+					 */
+					for (; i >= 0; i--)
+					{
+						itemIndex--;
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i));
+					}
+				}
 			}
 			if (!continuescan)
 			{
@@ -1584,8 +1742,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxBTreeIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxBTreeIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1756,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1610,6 +1770,64 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 	}
 }
 
+/*
+ * Setup state to save posting items from a single posting list tuple.  Saves
+ * the logical tuple that will be returned to scan first in passing.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for logical tuple
+ * that is returned to scan first.  Second or subsequent heap TID for posting
+ * list should be saved by calling _bt_savepostingitem().
+ */
+static void
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					  ItemPointer heapTid, IndexTuple itup)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		/* Save base IndexTuple (truncate posting list) */
+		IndexTuple	base;
+		Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+		itupsz = MAXALIGN(itupsz);
+		currItem->tupleOffset = so->currPos.nextTupleOffset;
+		base = (IndexTuple) (so->currTuples + so->currPos.nextTupleOffset);
+		memcpy(base, itup, itupsz);
+		/* Defensively reduce work area index tuple header size */
+		base->t_info &= ~INDEX_SIZE_MASK;
+		base->t_info |= itupsz;
+		so->currPos.nextTupleOffset += itupsz;
+		so->currPos.postingTupleOffset = currItem->tupleOffset;
+	}
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer heapTid)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+
+	/*
+	 * Have index-only scans return the same base IndexTuple for every logical
+	 * tuple that originates from the same posting list
+	 */
+	if (so->currTuples)
+		currItem->tupleOffset = so->currPos.postingTupleOffset;
+}
+
 /*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index fc7d43a0f3..ad961c305f 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -243,6 +243,7 @@ typedef struct BTPageState
 	BlockNumber btps_blkno;		/* block # to write this page at */
 	IndexTuple	btps_lowkey;	/* page's strict lower bound pivot tuple */
 	OffsetNumber btps_lastoff;	/* last item offset loaded */
+	Size		btps_lastextra; /* last item's extra posting list space */
 	uint32		btps_level;		/* tree level (0 = leaf) */
 	Size		btps_full;		/* "full" if less than this much free space */
 	struct BTPageState *btps_next;	/* link to parent level, if any */
@@ -277,7 +278,10 @@ static void _bt_slideleft(Page page);
 static void _bt_sortaddtup(Page page, Size itemsize,
 						   IndexTuple itup, OffsetNumber itup_off);
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
-						 IndexTuple itup);
+						 IndexTuple itup, Size truncextra);
+static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
+										  BTPageState *state,
+										  BTDedupState *dstate);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
@@ -711,13 +715,14 @@ _bt_pagestate(BTWriteState *wstate, uint32 level)
 	state->btps_lowkey = NULL;
 	/* initialize lastoff so first item goes into P_FIRSTKEY */
 	state->btps_lastoff = P_HIKEY;
+	state->btps_lastextra = 0;
 	state->btps_level = level;
 	/* set "full" threshold based on level.  See notes at head of file. */
 	if (level > 0)
 		state->btps_full = (BLCKSZ * (100 - BTREE_NONLEAF_FILLFACTOR) / 100);
 	else
-		state->btps_full = RelationGetTargetPageFreeSpace(wstate->index,
-														  BTREE_DEFAULT_FILLFACTOR);
+		state->btps_full = BtreeGetTargetPageFreeSpace(wstate->index,
+													   BTREE_DEFAULT_FILLFACTOR);
 	/* no parent level, yet */
 	state->btps_next = NULL;
 
@@ -789,7 +794,8 @@ _bt_sortaddtup(Page page,
 }
 
 /*----------
- * Add an item to a disk page from the sort output.
+ * Add an item to a disk page from the sort output (or add a posting list
+ * item formed from the sort output).
  *
  * We must be careful to observe the page layout conventions of nbtsearch.c:
  * - rightmost pages start data items at P_HIKEY instead of at P_FIRSTKEY.
@@ -821,14 +827,27 @@ _bt_sortaddtup(Page page,
  * the truncated high key at offset 1.
  *
  * 'last' pointer indicates the last offset added to the page.
+ *
+ * 'truncextra' is the size of the posting list in itup, if any.  This
+ * information is stashed for the next call here, when we may benefit
+ * from considering the impact of truncating away the posting list on
+ * the page before deciding to finish the page off.  Posting lists are
+ * often relatively large, so it is worth going to the trouble of
+ * accounting for the saving from truncating away the posting list of
+ * the tuple that becomes the high key (that may be the only way to
+ * get close to target free space on the page).  Note that this is
+ * only used for the soft fillfactor-wise limit, not the critical hard
+ * limit.
  *----------
  */
 static void
-_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
+_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
+			 Size truncextra)
 {
 	Page		npage;
 	BlockNumber nblkno;
 	OffsetNumber last_off;
+	Size		last_truncextra;
 	Size		pgspc;
 	Size		itupsz;
 	bool		isleaf;
@@ -842,6 +861,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	npage = state->btps_page;
 	nblkno = state->btps_blkno;
 	last_off = state->btps_lastoff;
+	last_truncextra = state->btps_lastextra;
+	state->btps_lastextra = truncextra;
 
 	pgspc = PageGetFreeSpace(npage);
 	itupsz = IndexTupleSize(itup);
@@ -883,10 +904,10 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	 * page.  Disregard fillfactor and insert on "full" current page if we
 	 * don't have the minimum number of items yet.  (Note that we deliberately
 	 * assume that suffix truncation neither enlarges nor shrinks new high key
-	 * when applying soft limit.)
+	 * when applying soft limit, except when last tuple had a posting list.)
 	 */
 	if (pgspc < itupsz + (isleaf ? MAXALIGN(sizeof(ItemPointerData)) : 0) ||
-		(pgspc < state->btps_full && last_off > P_FIRSTKEY))
+		(pgspc + last_truncextra < state->btps_full && last_off > P_FIRSTKEY))
 	{
 		/*
 		 * Finish off the page and write it out.
@@ -944,11 +965,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * We don't try to bias our choice of split point to make it more
 			 * likely that _bt_truncate() can truncate away more attributes,
 			 * whereas the split point used within _bt_split() is chosen much
-			 * more delicately.  Suffix truncation is mostly useful because it
-			 * improves space utilization for workloads with random
-			 * insertions.  It doesn't seem worthwhile to add logic for
-			 * choosing a split point here for a benefit that is bound to be
-			 * much smaller.
+			 * more delicately.  On the other hand, non-unique index builds
+			 * usually deduplicate, which often results in every "physical"
+			 * tuple on the page having distinct key values.  When that
+			 * happens, _bt_truncate() will never need to include a heap TID
+			 * in the new high key.
 			 *
 			 * Overwrite the old item with new truncated high key directly.
 			 * oitup is already located at the physical beginning of tuple
@@ -983,7 +1004,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(BTreeTupleGetNAtts(state->btps_lowkey, wstate->index) == 0 ||
 			   !P_LEFTMOST((BTPageOpaque) PageGetSpecialPointer(opage)));
 		BTreeInnerTupleSetDownLink(state->btps_lowkey, oblkno);
-		_bt_buildadd(wstate, state->btps_next, state->btps_lowkey);
+		_bt_buildadd(wstate, state->btps_next, state->btps_lowkey, 0);
 		pfree(state->btps_lowkey);
 
 		/*
@@ -1045,6 +1066,47 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	state->btps_lastoff = last_off;
 }
 
+/*
+ * Finalize pending posting list tuple, and add it to the index.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * This is almost like _bt_dedup_finish_pending(), but it adds a new tuple
+ * using _bt_buildadd() and does not maintain the intervals array.
+ */
+static void
+_bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
+							  BTDedupState *dstate)
+{
+	IndexTuple	final;
+	Size		truncextra;
+
+	Assert(dstate->nitems > 0);
+	truncextra = 0;
+	if (dstate->nitems == 1)
+		final = dstate->base;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = _bt_form_posting(dstate->base,
+										dstate->htids,
+										dstate->nhtids);
+		final = postingtuple;
+		/* Determine size of posting list */
+		truncextra = IndexTupleSize(final) -
+			BTreeTupleGetPostingOffset(final);
+	}
+
+	_bt_buildadd(wstate, state, final, truncextra);
+
+	if (dstate->nitems > 1)
+		pfree(final);
+	/* Don't maintain  dedup_intervals array, or alltupsize */
+	dstate->nhtids = 0;
+	dstate->nitems = 0;
+}
+
 /*
  * Finish writing out the completed btree.
  */
@@ -1090,7 +1152,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 			Assert(BTreeTupleGetNAtts(s->btps_lowkey, wstate->index) == 0 ||
 				   !P_LEFTMOST(opaque));
 			BTreeInnerTupleSetDownLink(s->btps_lowkey, blkno);
-			_bt_buildadd(wstate, s->btps_next, s->btps_lowkey);
+			_bt_buildadd(wstate, s->btps_next, s->btps_lowkey, 0);
 			pfree(s->btps_lowkey);
 			s->btps_lowkey = NULL;
 		}
@@ -1111,7 +1173,8 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 	 * by filling in a valid magic number in the metapage.
 	 */
 	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, rootblkno, rootlevel);
+	_bt_initmetapage(metapage, rootblkno, rootlevel,
+					 wstate->inskey->safededup);
 	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
 }
 
@@ -1132,6 +1195,10 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		deduplicate;
+
+	deduplicate = wstate->inskey->safededup &&
+		BtreeGetDoDedupOption(wstate->index);
 
 	if (merge)
 	{
@@ -1228,12 +1295,12 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 
 			if (load1)
 			{
-				_bt_buildadd(wstate, state, itup);
+				_bt_buildadd(wstate, state, itup, 0);
 				itup = tuplesort_getindextuple(btspool->sortstate, true);
 			}
 			else
 			{
-				_bt_buildadd(wstate, state, itup2);
+				_bt_buildadd(wstate, state, itup2, 0);
 				itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
 			}
 
@@ -1243,9 +1310,113 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		}
 		pfree(sortKeys);
 	}
+	else if (deduplicate)
+	{
+		/* merge is unnecessary, deduplicate into posting lists */
+		BTDedupState *dstate;
+		IndexTuple	newbase;
+
+		dstate = (BTDedupState *) palloc(sizeof(BTDedupState));
+		dstate->maxitemsize = 0;	/* set later */
+		dstate->checkingunique = false; /* unused */
+		dstate->skippedbase = InvalidOffsetNumber;
+		dstate->newitem = NULL;
+		/* Metadata about current pending posting list */
+		dstate->htids = NULL;
+		dstate->nhtids = 0;
+		dstate->nitems = 0;
+		dstate->overlap = false;
+		dstate->alltupsize = 0; /* unused */
+		/* Metadata about based tuple of current pending posting list */
+		dstate->base = NULL;
+		dstate->baseoff = InvalidOffsetNumber;	/* unused */
+		dstate->basetupsize = 0;
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate,
+											   true)) != NULL)
+		{
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+			{
+				state = _bt_pagestate(wstate, 0);
+
+				/*
+				 * Limit size of posting list tuples to the size of the free
+				 * space we want to leave behind on the page, plus space for
+				 * final item's line pointer (but make sure that posting list
+				 * tuple size won't exceed the generic 1/3 of a page limit).
+				 *
+				 * This is more conservative than the approach taken in the
+				 * retail insert path, but it allows us to get most of the
+				 * space savings deduplication provides without noticeably
+				 * impacting how much free space is left behind on each leaf
+				 * page.
+				 */
+				dstate->maxitemsize =
+					Min(BTMaxItemSize(state->btps_page),
+						MAXALIGN_DOWN(state->btps_full) - sizeof(ItemIdData));
+				/* Minimum posting tuple size used here is arbitrary: */
+				dstate->maxitemsize = Max(dstate->maxitemsize, 100);
+				dstate->htids = palloc(dstate->maxitemsize);
+
+				/*
+				 * No previous/base tuple, since itup is the  first item
+				 * returned by the tuplesort -- use itup as base tuple of
+				 * first pending posting list for entire index build
+				 */
+				newbase = CopyIndexTuple(itup);
+				_bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+			}
+			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
+										 itup) > keysz &&
+					 _bt_dedup_save_htid(dstate, itup))
+			{
+				/*
+				 * Tuple is equal to base tuple of pending posting list, and
+				 * merging itup into pending posting list won't exceed the
+				 * maxitemsize limit.  Heap TID(s) for itup have been saved in
+				 * state.  The next iteration will also end up here if it's
+				 * possible to merge the next tuple into the same pending
+				 * posting list.
+				 */
+			}
+			else
+			{
+				/*
+				 * Tuple is not equal to pending posting list tuple, or
+				 * maxitemsize limit was reached
+				 */
+				_bt_sort_dedup_finish_pending(wstate, state, dstate);
+				/* Base tuple is always a copy */
+				pfree(dstate->base);
+
+				/* itup starts new pending posting list */
+				newbase = CopyIndexTuple(itup);
+				_bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+			}
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+
+		/*
+		 * Handle the last item (there must be a last item when the tuplesort
+		 * returned one or more tuples)
+		 */
+		if (state)
+		{
+			_bt_sort_dedup_finish_pending(wstate, state, dstate);
+			/* Base tuple is always a copy */
+			pfree(dstate->base);
+			pfree(dstate->htids);
+		}
+
+		pfree(dstate);
+	}
 	else
 	{
-		/* merge is unnecessary */
+		/* merging and deduplication are both unnecessary */
 		while ((itup = tuplesort_getindextuple(btspool->sortstate,
 											   true)) != NULL)
 		{
@@ -1253,7 +1424,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			if (state == NULL)
 				state = _bt_pagestate(wstate, 0);
 
-			_bt_buildadd(wstate, state, itup);
+			_bt_buildadd(wstate, state, itup, 0);
 
 			/* Report progress */
 			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index a04d4e25d6..8078522b5c 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -51,6 +51,7 @@ typedef struct
 	Size		newitemsz;		/* size of newitem (includes line pointer) */
 	bool		is_leaf;		/* T if splitting a leaf page */
 	bool		is_rightmost;	/* T if splitting rightmost page on level */
+	bool		is_deduped;		/* T if posting list truncation expected */
 	OffsetNumber newitemoff;	/* where the new item is to be inserted */
 	int			leftspace;		/* space available for items on left page */
 	int			rightspace;		/* space available for items on right page */
@@ -167,7 +168,7 @@ _bt_findsplitloc(Relation rel,
 
 	/* Count up total space in data items before actually scanning 'em */
 	olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page);
-	leaffillfactor = RelationGetFillFactor(rel, BTREE_DEFAULT_FILLFACTOR);
+	leaffillfactor = BtreeGetFillFactor(rel, BTREE_DEFAULT_FILLFACTOR);
 
 	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
 	newitemsz += sizeof(ItemIdData);
@@ -177,12 +178,16 @@ _bt_findsplitloc(Relation rel,
 	state.newitemsz = newitemsz;
 	state.is_leaf = P_ISLEAF(opaque);
 	state.is_rightmost = P_RIGHTMOST(opaque);
+	state.is_deduped = state.is_leaf && BtreeGetDoDedupOption(rel);
 	state.leftspace = leftspace;
 	state.rightspace = rightspace;
 	state.olddataitemstotal = olddataitemstotal;
 	state.minfirstrightsz = SIZE_MAX;
 	state.newitemoff = newitemoff;
 
+	/* newitem cannot be a posting list item */
+	Assert(!BTreeTupleIsPosting(newitem));
+
 	/*
 	 * maxsplits should never exceed maxoff because there will be at most as
 	 * many candidate split points as there are points _between_ tuples, once
@@ -459,6 +464,7 @@ _bt_recsplitloc(FindSplitData *state,
 	int16		leftfree,
 				rightfree;
 	Size		firstrightitemsz;
+	Size		postingsz = 0;
 	bool		newitemisfirstonright;
 
 	/* Is the new item going to be the first item on the right page? */
@@ -468,8 +474,31 @@ _bt_recsplitloc(FindSplitData *state,
 	if (newitemisfirstonright)
 		firstrightitemsz = state->newitemsz;
 	else
+	{
 		firstrightitemsz = firstoldonrightsz;
 
+		/*
+		 * Calculate suffix truncation space saving when firstright is a
+		 * posting list tuple.
+		 *
+		 * Individual posting lists often take up a significant fraction of
+		 * all space on a page.  Failing to consider that the new high key
+		 * won't need to store the posting list a second time really matters.
+		 */
+		if (state->is_leaf && state->is_deduped)
+		{
+			ItemId		itemid;
+			IndexTuple	newhighkey;
+
+			itemid = PageGetItemId(state->page, firstoldonright);
+			newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+			if (BTreeTupleIsPosting(newhighkey))
+				postingsz = IndexTupleSize(newhighkey) -
+					BTreeTupleGetPostingOffset(newhighkey);
+		}
+	}
+
 	/* Account for all the old tuples */
 	leftfree = state->leftspace - olddataitemstoleft;
 	rightfree = state->rightspace -
@@ -492,9 +521,11 @@ _bt_recsplitloc(FindSplitData *state,
 	 * adding a heap TID to the left half's new high key when splitting at the
 	 * leaf level.  In practice the new high key will often be smaller and
 	 * will rarely be larger, but conservatively assume the worst case.
+	 * Truncation always truncates away any posting list that appears in the
+	 * first right tuple, though, so it's safe to subtract that overhead.
 	 */
 	if (state->is_leaf)
-		leftfree -= (int16) (firstrightitemsz +
+		leftfree -= (int16) ((firstrightitemsz - postingsz) +
 							 MAXALIGN(sizeof(ItemPointerData)));
 	else
 		leftfree -= (int16) firstrightitemsz;
@@ -691,7 +722,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
 	tup = (IndexTuple) PageGetItem(state->page, itemid);
 	/* Do cheaper test first */
-	if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+	if (BTreeTupleIsPosting(tup) ||
+		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
 	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 7669a1a66f..ac8e403635 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -20,6 +20,7 @@
 #include "access/nbtree.h"
 #include "access/reloptions.h"
 #include "access/relscan.h"
+#include "catalog/catalog.h"
 #include "commands/progress.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -98,8 +99,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	indoption = rel->rd_indoption;
 	tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
 
-	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
 	/*
 	 * We'll execute search using scan key constructed on key columns.
 	 * Truncated attributes and non-key attributes are omitted from the final
@@ -108,12 +107,25 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	key = palloc(offsetof(BTScanInsertData, scankeys) +
 				 sizeof(ScanKeyData) * indnkeyatts);
 	key->heapkeyspace = itup == NULL || _bt_heapkeyspace(rel);
+	key->safededup = itup == NULL ? _bt_opclasses_support_dedup(rel) :
+		_bt_safededup(rel);
 	key->anynullkeys = false;	/* initial assumption */
 	key->nextkey = false;
 	key->pivotsearch = false;
+	key->scantid = NULL;
 	key->keysz = Min(indnkeyatts, tupnatts);
-	key->scantid = key->heapkeyspace && itup ?
-		BTreeTupleGetHeapTID(itup) : NULL;
+
+	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+	Assert(!itup || !BTreeTupleIsPosting(itup) || key->heapkeyspace);
+
+	/*
+	 * When caller passes a tuple with a heap TID, use it to set scantid. Note
+	 * that this handles posting list tuples by setting scantid to the lowest
+	 * heap TID in the posting list.
+	 */
+	if (itup && key->heapkeyspace)
+		key->scantid = BTreeTupleGetHeapTID(itup);
+
 	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
@@ -1373,6 +1385,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			continue;
 		}
 
@@ -1534,6 +1547,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			cmpresult = 0;
 			if (subkey->sk_flags & SK_ROW_END)
 				break;
@@ -1773,10 +1787,35 @@ _bt_killitems(IndexScanDesc scan)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
+			bool		killtuple = false;
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			if (BTreeTupleIsPosting(ituple))
 			{
-				/* found the item */
+				int			pi = i + 1;
+				int			nposting = BTreeTupleGetNPosting(ituple);
+				int			j;
+
+				for (j = 0; j < nposting; j++)
+				{
+					ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+					if (!ItemPointerEquals(item, &kitem->heapTid))
+						break;	/* out of posting list loop */
+
+					/* Read-ahead to later kitems */
+					if (pi < numKilled)
+						kitem = &so->currPos.items[so->killedItems[pi++]];
+				}
+
+				if (j == nposting)
+					killtuple = true;
+			}
+			else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+				killtuple = true;
+
+			if (killtuple)
+			{
+				/* found the item/all posting list items */
 				ItemIdMarkDead(iid);
 				killedsomething = true;
 				break;			/* out of inner search loop */
@@ -2014,7 +2053,18 @@ BTreeShmemInit(void)
 bytea *
 btoptions(Datum reloptions, bool validate)
 {
-	return default_reloptions(reloptions, validate, RELOPT_KIND_BTREE);
+	static const relopt_parse_elt tab[] = {
+		{"fillfactor", RELOPT_TYPE_INT, offsetof(BtreeOptions, fillfactor)},
+		{"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
+		offsetof(BtreeOptions, vacuum_cleanup_index_scale_factor)},
+		{"deduplication", RELOPT_TYPE_BOOL,
+		offsetof(BtreeOptions, deduplication)}
+	};
+
+	return (bytea *) build_reloptions(reloptions, validate,
+									  RELOPT_KIND_BTREE,
+									  sizeof(BtreeOptions),
+									  tab, lengthof(tab));
 }
 
 /*
@@ -2127,6 +2177,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(firstright))
+		{
+			BTreeTupleClearBtIsPosting(pivot);
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			if (keepnatts == natts)
+			{
+				/*
+				 * index_truncate_tuple() just returned a copy of the
+				 * original, so make sure that the size of the new pivot tuple
+				 * doesn't have posting list overhead
+				 */
+				pivot->t_info &= ~INDEX_SIZE_MASK;
+				pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+			}
+		}
+
+		Assert(!BTreeTupleIsPosting(pivot));
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2143,6 +2211,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * attribute to the new pivot tuple.
 		 */
 		Assert(natts != nkeyatts);
+		Assert(!BTreeTupleIsPosting(lastleft) &&
+			   !BTreeTupleIsPosting(firstright));
 		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
 		tidpivot = palloc0(newsize);
 		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2150,6 +2220,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		pfree(pivot);
 		pivot = tidpivot;
 	}
+	else if (BTreeTupleIsPosting(firstright))
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal.  We
+		 * can always truncate away a posting list, though.
+		 *
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+			MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+		pivot->t_info &= ~INDEX_SIZE_MASK;
+		pivot->t_info |= newsize;
+		BTreeTupleClearBtIsPosting(pivot);
+		BTreeTupleSetAltHeapTID(pivot);
+	}
 	else
 	{
 		/*
@@ -2157,7 +2245,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * It's necessary to add a heap TID attribute to the new pivot tuple.
 		 */
 		Assert(natts == nkeyatts);
-		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+		newsize = MAXALIGN(IndexTupleSize(firstright)) +
+			MAXALIGN(sizeof(ItemPointerData));
 		pivot = palloc0(newsize);
 		memcpy(pivot, firstright, IndexTupleSize(firstright));
 	}
@@ -2175,6 +2264,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * nbtree (e.g., there is no pg_attribute entry).
 	 */
 	Assert(itup_key->heapkeyspace);
+	Assert(!BTreeTupleIsPosting(pivot));
 	pivot->t_info &= ~INDEX_SIZE_MASK;
 	pivot->t_info |= newsize;
 
@@ -2187,7 +2277,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2198,9 +2288,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
+							  BTreeTupleGetHeapTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #else
 
 	/*
@@ -2213,7 +2306,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2222,7 +2315,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2303,13 +2397,16 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
  * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal after detoasting.
+ * unless they're bitwise equal after detoasting.  When an index is considered
+ * deduplication-safe by _bt_opclasses_support_dedup, routine is guaranteed to
+ * give the same result as _bt_keep_natts would.
  *
- * These issues must be acceptable to callers, typically because they're only
- * concerned about making suffix truncation as effective as possible without
- * leaving excessive amounts of free space on either side of page split.
- * Callers can rely on the fact that attributes considered equal here are
- * definitely also equal according to _bt_keep_natts.
+ * Suffix truncation callers can rely on the fact that attributes considered
+ * equal here are definitely also equal according to _bt_keep_natts, even when
+ * the index uses an opclass or collation that is not deduplication-safe.
+ * This weaker guarantee is good enough for these callers, since false
+ * negatives generally only have the effect of making leaf page splits use a
+ * more balanced split point.
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2387,22 +2484,30 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
+	/* !heapkeyspace indexes do not support deduplication */
+	if (!heapkeyspace && BTreeTupleIsPosting(itup))
+		return false;
+
+	/* INCLUDE indexes do not support deduplication */
+	if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+		return false;
+
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
 			/*
-			 * Non-pivot tuples currently never use alternative heap TID
-			 * representation -- even those within heapkeyspace indexes
+			 * Non-pivot tuple should never be explicitly marked as a pivot
+			 * tuple
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated.  (Note that tupnatts must have been
-			 * inferred, rather than coming from an explicit on-disk
-			 * representation.)
+			 * inferred, even with a posting list tuple, because only pivot
+			 * tuples store tupnatts directly.)
 			 */
 			return tupnatts == natts;
 		}
@@ -2446,12 +2551,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * non-zero, or when there is no explicit representation and the
 			 * tuple is evidently not a pre-pg_upgrade tuple.
 			 *
-			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
-			 * that to decide if the tuple is a pre-v11 tuple.
+			 * Prior to v11, downlinks always had P_HIKEY as their offset.
+			 * Accept that as an alternative indication of a valid
+			 * !heapkeyspace negative infinity tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
-				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+				ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
 		}
 		else
 		{
@@ -2477,7 +2582,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
+		return false;
+
+	/* Pivot tuple should not use posting list representation (redundant) */
+	if (BTreeTupleIsPosting(itup))
 		return false;
 
 	/*
@@ -2547,11 +2656,54 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					BTMaxItemSizeNoHeapTid(page),
 					RelationGetRelationName(rel)),
 			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
-					   ItemPointerGetBlockNumber(&newtup->t_tid),
-					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+					   ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
 					   RelationGetRelationName(heap)),
 			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
 					 "Consider a function index of an MD5 hash of the value, "
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Is it safe to perform deduplication for an index, given the opclasses and
+ * collations used?
+ *
+ * Returned value is stored in index metapage during index builds.  Function
+ * does not account for incompatibilities caused by index being on an earlier
+ * nbtree version.
+ */
+bool
+_bt_opclasses_support_dedup(Relation index)
+{
+	/* INCLUDE indexes don't support deduplication */
+	if (IndexRelationGetNumberOfAttributes(index) !=
+		IndexRelationGetNumberOfKeyAttributes(index))
+		return false;
+
+	/*
+	 * There is no reason why deduplication cannot be used with system catalog
+	 * indexes.  However, we deem it generally unsafe because it's not clear
+	 * how it could be disabled.  (ALTER INDEX is not supported with system
+	 * catalog indexes, so users have no way to set the "deduplicate" storage
+	 * parameter.)
+	 */
+	if (IsCatalogRelation(index))
+		return false;
+
+	for (int i = 0; i < IndexRelationGetNumberOfKeyAttributes(index); i++)
+	{
+		Oid			opfamily = index->rd_opfamily[i];
+		Oid			collation = index->rd_indcollation[i];
+
+		/* TODO add adequate check of opclasses and collations */
+		elog(DEBUG4, "index %s column i %d opfamilyOid %u collationOid %u",
+			 RelationGetRelationName(index), i, opfamily, collation);
+
+		/* NUMERIC btree opfamily OID is 1988 */
+		if (opfamily == 1988)
+			return false;
+	}
+
+	return true;
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 44f6283950..d36d31c758 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -22,6 +22,9 @@
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/procarray.h"
+#include "utils/memutils.h"
+
+static MemoryContext opCtx;		/* working memory for operations */
 
 /*
  * _bt_restore_page -- re-enter all the index tuples on a page
@@ -111,6 +114,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
 	Assert(md->btm_version >= BTREE_NOVAC_VERSION);
 	md->btm_oldest_btpo_xact = xlrec->oldest_btpo_xact;
 	md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
+	md->btm_safededup = xlrec->btm_safededup;
 
 	pageop = (BTPageOpaque) PageGetSpecialPointer(metapg);
 	pageop->btpo_flags = BTP_META;
@@ -181,9 +185,45 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 
 		page = BufferGetPage(buffer);
 
-		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
-						false, false) == InvalidOffsetNumber)
-			elog(PANIC, "btree_xlog_insert: failed to add item");
+		if (xlrec->postingoff == InvalidOffsetNumber)
+		{
+			/* Simple retail insertion */
+			if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+		}
+		else
+		{
+			ItemId		itemid;
+			IndexTuple	oposting,
+						newitem,
+						nposting;
+
+			/*
+			 * A posting list split occurred during insertion.
+			 *
+			 * Use _bt_swap_posting() to repeat posting list split steps from
+			 * primary.  Note that newitem from WAL record is 'orignewitem',
+			 * not the final version of newitem that is actually inserted on
+			 * page.
+			 */
+			Assert(isleaf);
+			itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+			oposting = (IndexTuple) PageGetItem(page, itemid);
+
+			/* newitem must be mutable copy for _bt_swap_posting() */
+			newitem = CopyIndexTuple((IndexTuple) datapos);
+			nposting = _bt_swap_posting(newitem, oposting, xlrec->postingoff);
+
+			/* Replace existing posting list with post-split version */
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+			/* insert new item */
+			Assert(IndexTupleSize(newitem) == datalen);
+			if (PageAddItem(page, (Item) newitem, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add posting split new item");
+		}
 
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
@@ -265,20 +305,38 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 		BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
 		OffsetNumber off;
 		IndexTuple	newitem = NULL,
-					left_hikey = NULL;
+					left_hikey = NULL,
+					nposting = NULL;
 		Size		newitemsz = 0,
 					left_hikeysz = 0;
 		Page		newlpage;
-		OffsetNumber leftoff;
+		OffsetNumber leftoff,
+					replacepostingoff = InvalidOffsetNumber;
 
 		datapos = XLogRecGetBlockData(record, 0, &datalen);
 
-		if (onleft)
+		if (onleft || xlrec->postingoff != 0)
 		{
 			newitem = (IndexTuple) datapos;
 			newitemsz = MAXALIGN(IndexTupleSize(newitem));
 			datapos += newitemsz;
 			datalen -= newitemsz;
+
+			if (xlrec->postingoff != 0)
+			{
+				ItemId		itemid;
+				IndexTuple	oposting;
+
+				/* Posting list must be at offset number before new item's */
+				replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+				/* newitem must be mutable copy for _bt_swap_posting() */
+				newitem = CopyIndexTuple(newitem);
+				itemid = PageGetItemId(lpage, replacepostingoff);
+				oposting = (IndexTuple) PageGetItem(lpage, itemid);
+				nposting = _bt_swap_posting(newitem, oposting,
+											xlrec->postingoff);
+			}
 		}
 
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
@@ -304,8 +362,20 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 			Size		itemsz;
 			IndexTuple	item;
 
+			/* Add replacement posting list when required */
+			if (off == replacepostingoff)
+			{
+				Assert(onleft || xlrec->firstright == xlrec->newitemoff);
+				if (PageAddItem(newlpage, (Item) nposting,
+								MAXALIGN(IndexTupleSize(nposting)), leftoff,
+								false, false) == InvalidOffsetNumber)
+					elog(ERROR, "failed to add new posting list item to left page after split");
+				leftoff = OffsetNumberNext(leftoff);
+				continue;
+			}
+
 			/* add the new item if it was inserted on left page */
-			if (onleft && off == xlrec->newitemoff)
+			else if (onleft && off == xlrec->newitemoff)
 			{
 				if (PageAddItem(newlpage, (Item) newitem, newitemsz, leftoff,
 								false, false) == InvalidOffsetNumber)
@@ -379,6 +449,84 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 	}
 }
 
+static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		buf;
+	xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+
+	if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Initialize a temporary empty page and copy all the items to that in
+		 * item number order.
+		 */
+		Page		page = (Page) BufferGetPage(buf);
+		OffsetNumber offnum;
+		BTDedupState *state;
+
+		state = (BTDedupState *) palloc(sizeof(BTDedupState));
+
+		state->maxitemsize = BTMaxItemSize(page);
+		state->checkingunique = false;	/* unused */
+		state->skippedbase = InvalidOffsetNumber;
+		state->newitem = NULL;
+		/* Metadata about current pending posting list */
+		state->htids = NULL;
+		state->nhtids = 0;
+		state->nitems = 0;
+		state->alltupsize = 0;
+		state->overlap = false;
+		/* Metadata about based tuple of current pending posting list */
+		state->base = NULL;
+		state->baseoff = InvalidOffsetNumber;
+		state->basetupsize = 0;
+
+		/* Conservatively size array */
+		state->htids = palloc(state->maxitemsize);
+
+		/*
+		 * Iterate over tuples on the page belonging to the interval to
+		 * deduplicate them into a posting list.
+		 */
+		for (offnum = xlrec->baseoff;
+			 offnum < xlrec->baseoff + xlrec->nitems;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemid = PageGetItemId(page, offnum);
+			IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+			Assert(!ItemIdIsDead(itemid));
+
+			if (offnum == xlrec->baseoff)
+			{
+				/*
+				 * No previous/base tuple for first data item -- use first
+				 * data item as base tuple of first pending posting list
+				 */
+				_bt_dedup_start_pending(state, itup, offnum);
+			}
+			else
+			{
+				/* Heap TID(s) for itup will be saved in state */
+				if (!_bt_dedup_save_htid(state, itup))
+					elog(ERROR, "could not add heap tid to pending posting list");
+			}
+		}
+
+		Assert(state->nitems == xlrec->nitems);
+		/* Handle the last item */
+		_bt_dedup_finish_pending(buf, state, false);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buf);
+	}
+
+	if (BufferIsValid(buf))
+		UnlockReleaseBuffer(buf);
+}
+
 static void
 btree_xlog_vacuum(XLogReaderState *record)
 {
@@ -386,8 +534,8 @@ btree_xlog_vacuum(XLogReaderState *record)
 	Buffer		buffer;
 	Page		page;
 	BTPageOpaque opaque;
-#ifdef UNUSED
 	xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
 
 	/*
 	 * This section of code is thought to be no longer needed, after analysis
@@ -478,14 +626,34 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		if (len > 0)
 		{
-			OffsetNumber *unused;
-			OffsetNumber *unend;
+			if (xlrec->nupdated > 0)
+			{
+				OffsetNumber *updatedoffsets;
+				IndexTuple	updated;
+				Size		itemsz;
 
-			unused = (OffsetNumber *) ptr;
-			unend = (OffsetNumber *) ((char *) ptr + len);
+				updatedoffsets = (OffsetNumber *)
+					(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+				updated = (IndexTuple) ((char *) updatedoffsets +
+										xlrec->nupdated * sizeof(OffsetNumber));
 
-			if ((unend - unused) > 0)
-				PageIndexMultiDelete(page, unused, unend - unused);
+				/* Handle posting tuples */
+				for (int i = 0; i < xlrec->nupdated; i++)
+				{
+					PageIndexTupleDelete(page, updatedoffsets[i]);
+
+					itemsz = MAXALIGN(IndexTupleSize(updated));
+
+					if (PageAddItem(page, (Item) updated, itemsz, updatedoffsets[i],
+									false, false) == InvalidOffsetNumber)
+						elog(PANIC, "btree_xlog_vacuum: failed to add updated posting list item");
+
+					updated = (IndexTuple) ((char *) updated + itemsz);
+				}
+			}
+
+			if (xlrec->ndeleted)
+				PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 		}
 
 		/*
@@ -820,7 +988,9 @@ void
 btree_redo(XLogReaderState *record)
 {
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	MemoryContext oldCtx;
 
+	oldCtx = MemoryContextSwitchTo(opCtx);
 	switch (info)
 	{
 		case XLOG_BTREE_INSERT_LEAF:
@@ -838,6 +1008,9 @@ btree_redo(XLogReaderState *record)
 		case XLOG_BTREE_SPLIT_R:
 			btree_xlog_split(false, record);
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			btree_xlog_dedup(record);
+			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
 			break;
@@ -863,6 +1036,23 @@ btree_redo(XLogReaderState *record)
 		default:
 			elog(PANIC, "btree_redo: unknown op code %u", info);
 	}
+	MemoryContextSwitchTo(oldCtx);
+	MemoryContextReset(opCtx);
+}
+
+void
+btree_xlog_startup(void)
+{
+	opCtx = AllocSetContextCreate(CurrentMemoryContext,
+								  "Btree recovery temporary context",
+								  ALLOCSET_DEFAULT_SIZES);
+}
+
+void
+btree_xlog_cleanup(void)
+{
+	MemoryContextDelete(opCtx);
+	opCtx = NULL;
 }
 
 /*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 4ee6d04a68..1dde2da285 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -30,7 +30,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_insert *xlrec = (xl_btree_insert *) rec;
 
-				appendStringInfo(buf, "off %u", xlrec->offnum);
+				appendStringInfo(buf, "off %u; postingoff %u",
+								 xlrec->offnum, xlrec->postingoff);
 				break;
 			}
 		case XLOG_BTREE_SPLIT_L:
@@ -38,16 +39,30 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
-				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d",
-								 xlrec->level, xlrec->firstright, xlrec->newitemoff);
+				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, postingoff %d",
+								 xlrec->level,
+								 xlrec->firstright,
+								 xlrec->newitemoff,
+								 xlrec->postingoff);
+				break;
+			}
+		case XLOG_BTREE_DEDUP_PAGE:
+			{
+				xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+				appendStringInfo(buf, "baseoff %u; nitems %u",
+								 xlrec->baseoff,
+								 xlrec->nitems);
 				break;
 			}
 		case XLOG_BTREE_VACUUM:
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "lastBlockVacuumed %u",
-								 xlrec->lastBlockVacuumed);
+				appendStringInfo(buf, "lastBlockVacuumed %u; nupdated %u; ndeleted %u",
+								 xlrec->lastBlockVacuumed,
+								 xlrec->nupdated,
+								 xlrec->ndeleted);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
@@ -131,6 +146,9 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			id = "DEDUPLICATE";
+			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 98c917bf7a..b2b29a1ae2 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1677,14 +1677,14 @@ psql_completion(const char *text, int start, int end)
 	/* ALTER INDEX <foo> SET|RESET ( */
 	else if (Matches("ALTER", "INDEX", MatchAny, "RESET", "("))
 		COMPLETE_WITH("fillfactor",
-					  "vacuum_cleanup_index_scale_factor",	/* BTREE */
+					  "vacuum_cleanup_index_scale_factor", "deduplication",	/* BTREE */
 					  "fastupdate", "gin_pending_list_limit",	/* GIN */
 					  "buffering",	/* GiST */
 					  "pages_per_range", "autosummarize"	/* BRIN */
 			);
 	else if (Matches("ALTER", "INDEX", MatchAny, "SET", "("))
 		COMPLETE_WITH("fillfactor =",
-					  "vacuum_cleanup_index_scale_factor =",	/* BTREE */
+					  "vacuum_cleanup_index_scale_factor =", "deduplication =",	/* BTREE */
 					  "fastupdate =", "gin_pending_list_limit =",	/* GIN */
 					  "buffering =",	/* GiST */
 					  "pages_per_range =", "autosummarize ="	/* BRIN */
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 3542545de5..8b1223a817 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -145,6 +145,7 @@ static void bt_tuple_present_callback(Relation index, ItemPointer tid,
 									  bool tupleIsAlive, void *checkstate);
 static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 									 IndexTuple itup);
+static inline IndexTuple bt_posting_logical_tuple(IndexTuple itup, int n);
 static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 											   OffsetNumber offset);
@@ -419,12 +420,13 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		/*
 		 * Size Bloom filter based on estimated number of tuples in index,
 		 * while conservatively assuming that each block must contain at least
-		 * MaxIndexTuplesPerPage / 5 non-pivot tuples.  (Non-leaf pages cannot
-		 * contain non-pivot tuples.  That's okay because they generally make
-		 * up no more than about 1% of all pages in the index.)
+		 * MaxBTreeIndexTuplesPerPage / 3 "logical" tuples.  heapallindexed
+		 * verification fingerprints posting list heap TIDs as plain non-pivot
+		 * tuples, complete with index keys.  This allows its heap scan to
+		 * behave as if posting lists do not exist.
 		 */
 		total_pages = RelationGetNumberOfBlocks(rel);
-		total_elems = Max(total_pages * (MaxIndexTuplesPerPage / 5),
+		total_elems = Max(total_pages * (MaxBTreeIndexTuplesPerPage / 3),
 						  (int64) state->rel->rd_rel->reltuples);
 		/* Random seed relies on backend srandom() call to avoid repetition */
 		seed = random();
@@ -924,6 +926,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer	scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -994,29 +997,72 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
+			ItemPointer tid = BTreeTupleGetHeapTID(itup);
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
-			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+			htid = psprintf("(%u,%u)", ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("could not find tuple using search from root page in index \"%s\"",
 							RelationGetRelationName(state->rel)),
-					 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+					 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
 										itid, htid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is actually a posting list, make sure posting list TIDs
+		 * are in order.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer		current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+				current = BTreeTupleGetPostingN(itup, i);
+
+				if (ItemPointerCompare(current, &last) <= 0)
+				{
+					char	   *itid,
+							   *htid;
+
+					itid = psprintf("(%u,%u)", state->targetblock, offset);
+					htid = psprintf("(%u,%u)",
+									ItemPointerGetBlockNumberNoCheck(current),
+									ItemPointerGetOffsetNumberNoCheck(current));
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+												itid, htid,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1074,12 +1120,32 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				/* Fingerprint all elements as distinct "logical" tuples */
+				for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					IndexTuple	logtuple;
+
+					logtuple = bt_posting_logical_tuple(itup, i);
+					norm = bt_normalize_tuple(state, logtuple);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != logtuple)
+						pfree(norm);
+					pfree(logtuple);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
@@ -1087,7 +1153,8 @@ bt_target_page_check(BtreeCheckState *state)
 		 *
 		 * If there is a high key (if this is not the rightmost page on its
 		 * entire level), check that high key actually is upper bound on all
-		 * page items.
+		 * page items.  If this is a posting list tuple, we'll need to set
+		 * scantid to be highest TID in posting list.
 		 *
 		 * We prefer to check all items against high key rather than checking
 		 * just the last and trusting that the operator class obeys the
@@ -1127,6 +1194,9 @@ bt_target_page_check(BtreeCheckState *state)
 		 * tuple. (See also: "Notes About Data Representation" in the nbtree
 		 * README.)
 		 */
+		scantid = skey->scantid;
+		if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+			skey->scantid = BTreeTupleGetMaxHeapTID(itup);
 		if (!P_RIGHTMOST(topaque) &&
 			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
 			  invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1220,7 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
+		skey->scantid = scantid;
 
 		/*
 		 * * Item order check *
@@ -1160,15 +1231,17 @@ bt_target_page_check(BtreeCheckState *state)
 		if (OffsetNumberNext(offset) <= max &&
 			!invariant_l_offset(state, skey, OffsetNumberNext(offset)))
 		{
+			ItemPointer tid;
 			char	   *itid,
 					   *htid,
 					   *nitid,
 					   *nhtid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetHeapTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 			nitid = psprintf("(%u,%u)", state->targetblock,
 							 OffsetNumberNext(offset));
 
@@ -1177,9 +1250,11 @@ bt_target_page_check(BtreeCheckState *state)
 										  state->target,
 										  OffsetNumberNext(offset));
 			itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+			tid = BTreeTupleGetHeapTID(itup);
 			nhtid = psprintf("(%u,%u)",
-							 ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							 ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							 ItemPointerGetBlockNumberNoCheck(tid),
+							 ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1264,10 @@ bt_target_page_check(BtreeCheckState *state)
 										"higher index tid=%s (points to %s tid=%s) "
 										"page lsn=%X/%X.",
 										itid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										htid,
 										nitid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										nhtid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
@@ -1953,10 +2028,10 @@ bt_tuple_present_callback(Relation index, ItemPointer tid, Datum *values,
  * verification.  In particular, it won't try to normalize opclass-equal
  * datums with potentially distinct representations (e.g., btree/numeric_ops
  * index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though.  For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation.  Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
  */
 static IndexTuple
 bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -1969,6 +2044,9 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	IndexTuple	reformed;
 	int			i;
 
+	/* Caller should only pass "logical" non-pivot tuples here */
+	Assert(!BTreeTupleIsPosting(itup) && !BTreeTupleIsPivot(itup));
+
 	/* Easy case: It's immediately clear that tuple has no varlena datums */
 	if (!IndexTupleHasVarwidths(itup))
 		return itup;
@@ -2031,6 +2109,30 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	return reformed;
 }
 
+/*
+ * Produce palloc()'d "logical" tuple for nth posting list entry.
+ *
+ * In general, deduplication is not supposed to change the logical contents of
+ * an index.  Multiple logical index tuples are folded together into one
+ * physical posting list index tuple when convenient.
+ *
+ * heapallindexed verification must normalize-away this variation in
+ * representation by converting posting list tuples into two or more "logical"
+ * tuples.  Each logical tuple must be fingerprinted separately -- there must
+ * be one logical tuple for each corresponding Bloom filter probe during the
+ * heap scan.
+ *
+ * Note: Caller needs to call bt_normalize_tuple() with returned tuple.
+ */
+static inline IndexTuple
+bt_posting_logical_tuple(IndexTuple itup, int n)
+{
+	Assert(BTreeTupleIsPosting(itup));
+
+	/* Returns non-posting-list tuple */
+	return _bt_form_posting(itup, BTreeTupleGetPostingN(itup, n), 1);
+}
+
 /*
  * Search for itup in index, starting from fast root page.  itup must be a
  * non-pivot tuple.  This is only supported with heapkeyspace indexes, since
@@ -2087,6 +2189,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		insertstate.itup = itup;
 		insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 		insertstate.itup_key = key;
+		insertstate.postingoff = 0;
 		insertstate.bounds_valid = false;
 		insertstate.buf = lbuf;
 
@@ -2094,7 +2197,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		offnum = _bt_binsrch_insert(state->rel, &insertstate);
 		/* Compare first >= matching item on leaf page, if any */
 		page = BufferGetPage(lbuf);
+		/* Should match on first heap TID when tuple has a posting list */
 		if (offnum <= PageGetMaxOffsetNumber(page) &&
+			insertstate.postingoff <= 0 &&
 			_bt_compare(state->rel, key, page, offnum) == 0)
 			exists = true;
 		_bt_relbuf(state->rel, lbuf);
@@ -2548,26 +2653,25 @@ PageGetItemIdCareful(BtreeCheckState *state, BlockNumber block, Page page,
 }
 
 /*
- * BTreeTupleGetHeapTID() wrapper that lets caller enforce that a heap TID must
- * be present in cases where that is mandatory.
- *
- * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
- * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
- * It may become more useful in the future, when non-pivot tuples support their
- * own alternative INDEX_ALT_TID_MASK representation.
+ * BTreeTupleGetHeapTID() wrapper that enforces that a heap TID is present in
+ * cases where that is mandatory (i.e. for non-pivot tuples).
  */
 static inline ItemPointer
 BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
 							bool nonpivot)
 {
-	ItemPointer result = BTreeTupleGetHeapTID(itup);
-	BlockNumber targetblock = state->targetblock;
+	Assert(state->heapkeyspace);
 
-	if (result == NULL && nonpivot)
+	/*
+	 * Make sure that tuple type (pivot vs non-pivot) matches caller's
+	 * expectation
+	 */
+	if (BTreeTupleIsPivot(itup) == nonpivot)
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
-						targetblock, RelationGetRelationName(state->rel))));
+						state->targetblock,
+						RelationGetRelationName(state->rel))));
 
-	return result;
+	return BTreeTupleGetHeapTID(itup);
 }
diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
index 5881ea5dd6..a231bbe1f2 100644
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@@ -433,11 +433,55 @@ returns bool
 <sect1 id="btree-implementation">
  <title>Implementation</title>
 
+ <para>
+  Internally, a B-tree index consists of a tree structure with leaf
+  pages.  Each leaf page contains tuples that point to table entries
+  using a heap item pointer. Each tuple's key is unique, since the
+  item pointer is treated as part of the key.
+ </para>
+ <para>
+  An introduction to the btree index implementation can be found in
+  <filename>src/backend/access/nbtree/README</filename>.
+ </para>
+
+ <sect2 id="btree-deduplication">
+  <title>Deduplication</title>
   <para>
-   An introduction to the btree index implementation can be found in
-   <filename>src/backend/access/nbtree/README</filename>.
+   B-Tree supports <firstterm>deduplication</firstterm>.  Existing
+   leaf page tuples with fully equal keys prior to the heap item
+   pointer are folded together into a compressed representation called
+   a <quote>posting list</quote>. The user-visible keys appear only
+   once, followed by a simple list of heap item pointers.  Posting
+   lists are formed at the point where an insertion would otherwise
+   have to split the page.  This can greatly increase index space
+   efficiency with data sets where each distinct key appears a few
+   times on average.  Cases that don't benefit will incur a small
+   performance penalty.
+  </para>
+  <para>
+   Deduplication can only be used with indexes that use B-Tree
+   operator classes that were declared <literal>BITWISE</literal>.
+   Deduplication is not supported with nondeterministic collations,
+   nor is it supported with <literal>INCLUDE</literal> indexes.  The
+   deduplication storage parameter must be set to
+   <literal>ON</literal> for new posting lists to be formed
+   (deduplication is enabled by default in the case of non-unique
+   indexes).
+  </para>
+ </sect2>
+
+ <sect2 id="btree-deduplication-unique">
+  <title>Unique indexes and deduplication</title>
+
+  <para>
+   Unique indexes can also use deduplication.  This can be useful with
+   unique indexes that are prone to becoming bloated despite
+   aggressive vacuuming.  Deduplication may delay leaf page splits for
+   long enough that vacuuming can prevent unnecesary page splits
+   altogether.
   </para>
 
+ </sect2>
 </sect1>
 
 </chapter>
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 55669b5cad..9f371d3e3a 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -928,10 +928,11 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
      nondeterministic collations give a more <quote>correct</quote> behavior,
      especially when considering the full power of Unicode and its many
      special cases, they also have some drawbacks.  Foremost, their use leads
-     to a performance penalty.  Also, certain operations are not possible with
-     nondeterministic collations, such as pattern matching operations.
-     Therefore, they should be used only in cases where they are specifically
-     wanted.
+     to a performance penalty.  Note, in particular, that B-tree cannot use
+     deduplication with indexes that use a nondeterministic collation.  Also,
+     certain operations are not possible with nondeterministic collations,
+     such as pattern matching operations.  Therefore, they should be used
+     only in cases where they are specifically wanted.
     </para>
    </sect3>
   </sect2>
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 629a31ef79..2261226965 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -166,6 +166,8 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
         maximum size allowed for the index type, data insertion will fail.
         In any case, non-key columns duplicate data from the index's table
         and bloat the size of the index, thus potentially slowing searches.
+        Moreover, B-tree deduplication is never used with indexes that
+        have a non-key column.
        </para>
 
        <para>
@@ -388,10 +390,38 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    </variablelist>
 
    <para>
-    B-tree indexes additionally accept this parameter:
+    B-tree indexes also accept these parameters:
    </para>
 
    <variablelist>
+   <varlistentry id="index-reloption-deduplication" xreflabel="deduplication">
+    <term><literal>deduplication</literal>
+     <indexterm>
+      <primary><varname>deduplication</varname></primary>
+      <secondary>storage parameter</secondary>
+     </indexterm>
+    </term>
+    <listitem>
+    <para>
+      This setting controls usage of the B-tree deduplication
+      technique described in <xref linkend="btree-deduplication"/>.
+      Defaults to <literal>ON</literal> for non-unique indexes, and
+      <literal>OFF</literal> for unique indexes.  (Alternative
+      spellings of <literal>ON</literal> and <literal>OFF</literal>
+      are allowed as described in <xref linkend="config-setting"/>.)
+    </para>
+
+    <note>
+     <para>
+      Turning <literal>deduplication</literal> off via <command>ALTER
+      INDEX</command> prevents future insertions from triggering
+      deduplication, but does not in itself make existing posting list
+      tuples use the standard tuple representation.
+     </para>
+    </note>
+    </listitem>
+   </varlistentry>
+
    <varlistentry id="index-reloption-vacuum-cleanup-index-scale-factor" xreflabel="vacuum_cleanup_index_scale_factor">
     <term><literal>vacuum_cleanup_index_scale_factor</literal>
      <indexterm>
@@ -446,9 +476,7 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
      This setting controls usage of the fast update technique described in
      <xref linkend="gin-fast-update"/>.  It is a Boolean parameter:
      <literal>ON</literal> enables fast update, <literal>OFF</literal> disables it.
-     (Alternative spellings of <literal>ON</literal> and <literal>OFF</literal> are
-     allowed as described in <xref linkend="config-setting"/>.)  The
-     default is <literal>ON</literal>.
+     The default is <literal>ON</literal>.
     </para>
 
     <note>
@@ -831,6 +859,13 @@ CREATE UNIQUE INDEX title_idx ON films (title) WITH (fillfactor = 70);
 </programlisting>
   </para>
 
+  <para>
+   To create a unique index with deduplication enabled:
+<programlisting>
+CREATE UNIQUE INDEX title_idx ON films (title) WITH (deduplication = on);
+</programlisting>
+  </para>
+
   <para>
    To create a <acronym>GIN</acronym> index with fast updates disabled:
 <programlisting>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 10881ab03a..c9a5349019 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -58,8 +58,9 @@ REINDEX [ ( VERBOSE ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } [ CONCURR
 
     <listitem>
      <para>
-      You have altered a storage parameter (such as fillfactor)
-      for an index, and wish to ensure that the change has taken full effect.
+      You have altered a storage parameter (such as fillfactor or
+      deduplication) for an index, and wish to ensure that the change has
+      taken full effect.
      </para>
     </listitem>
 
diff --git a/src/test/regress/expected/btree_index.out b/src/test/regress/expected/btree_index.out
index acab8e0b11..de55d3cc7c 100644
--- a/src/test/regress/expected/btree_index.out
+++ b/src/test/regress/expected/btree_index.out
@@ -199,6 +199,22 @@ reset enable_seqscan;
 reset enable_indexscan;
 reset enable_bitmapscan;
 --
+-- Test deduplication within a unique index
+--
+CREATE TABLE dedup_unique_test_table (a int) WITH (autovacuum_enabled=false);
+CREATE UNIQUE INDEX dedup_unique ON dedup_unique_test_table (a) WITH (deduplication=on);
+CREATE UNIQUE INDEX plain_unique ON dedup_unique_test_table (a) WITH (deduplication=off);
+-- Generate enough garbage tuples in index to ensure that even the unique index
+-- with deduplication enabled has to check multiple leaf pages during unique
+-- checking (at least with a BLCKSZ of 8192 or less)
+DO $$
+BEGIN
+    FOR r IN 1..1350 LOOP
+        DELETE FROM dedup_unique_test_table;
+        INSERT INTO dedup_unique_test_table SELECT 1;
+    END LOOP;
+END$$;
+--
 -- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
 -- First create a tree that's at least three levels deep (i.e. has one level
diff --git a/src/test/regress/sql/btree_index.sql b/src/test/regress/sql/btree_index.sql
index 48eaf4fe42..d175a19bf5 100644
--- a/src/test/regress/sql/btree_index.sql
+++ b/src/test/regress/sql/btree_index.sql
@@ -83,6 +83,23 @@ reset enable_seqscan;
 reset enable_indexscan;
 reset enable_bitmapscan;
 
+--
+-- Test deduplication within a unique index
+--
+CREATE TABLE dedup_unique_test_table (a int) WITH (autovacuum_enabled=false);
+CREATE UNIQUE INDEX dedup_unique ON dedup_unique_test_table (a) WITH (deduplication=on);
+CREATE UNIQUE INDEX plain_unique ON dedup_unique_test_table (a) WITH (deduplication=off);
+-- Generate enough garbage tuples in index to ensure that even the unique index
+-- with deduplication enabled has to check multiple leaf pages during unique
+-- checking (at least with a BLCKSZ of 8192 or less)
+DO $$
+BEGIN
+    FOR r IN 1..1350 LOOP
+        DELETE FROM dedup_unique_test_table;
+        INSERT INTO dedup_unique_test_table SELECT 1;
+    END LOOP;
+END$$;
+
 --
 -- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
-- 
2.17.1

#114

Michael Paquier

michael@paquier.xyz

about 6 years ago

In reply to: Peter Geoghegan (#113)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Mon, Nov 18, 2019 at 05:26:37PM -0800, Peter Geoghegan wrote:

Attached is v24. This revision doesn't fix the problem with
xl_btree_insert record bloat, but it does fix the bitrot against the
master branch that was caused by commit 50d22de9. (This patch has had
a surprisingly large number of conflicts against the master branch
recently.)

Please note that I have moved this patch to next CF per this last
update. Anastasia, the ball is waiting on your side of the field, as
the CF entry is marked as waiting on author for some time now.
--
Michael

#115

Peter Geoghegan

pg@bowt.ie

about 6 years ago

In reply to: Peter Geoghegan (#113)

4 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Mon, Nov 18, 2019 at 5:26 PM Peter Geoghegan <pg@bowt.ie> wrote:

Attached is v24. This revision doesn't fix the problem with
xl_btree_insert record bloat

Attached is v25. This version:

* Adds more documentation.

* Adds a new GUC -- bree_deduplication.

A new GUC seems necessary. Users will want to be able to configure the
feature system-wide. A storage parameter won't let them do that --
only a GUC will. This also makes it easy to enable the feature with
unique indexes.

* Fixes the xl_btree_insert record bloat issue.

* Fixes a smaller issue with VACUUM/xl_btree_vacuum record bloat.

We shouldn't be using noticeably more WAL than before, at least in
cases that don't use deduplication. These two items fix cases where
that was possible.

There is a new refactoring patch including with v25 that helps with
the xl_btree_vacuum issue. This new patch removes unnecessary "pin
scan" code used by B-Tree VACUUMs, which was effectively disabled by
commit 3e4b7d87 without being removed. This is independently useful
work that I planned on doing already, that also cleans things up for
VACUUM with posting list tuples. It reclaims some space within the
xl_btree_vacuum record type that was wasted (we don't even use the
lastBlockVacuumed field anymore), allowing us to use that space for
new deduplication-related fields without increasing total WAL space.

Anastasia: I hope to be able to commit the first patch before too
long. It would be great if you could review that.

--
Peter Geoghegan

Attachments:

v25-0004-DEBUG-Show-index-values-in-pageinspect.patchapplication/x-patch; name=v25-0004-DEBUG-Show-index-values-in-pageinspect.patchDownload

From fa83d38bfd1ad868b22ad5fc390447be81c1c704 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Nov 2019 19:35:30 -0800
Subject: [PATCH v25 4/4] DEBUG: Show index values in pageinspect

This is not intended for commit.  It is included as a convenience for
reviewers.
---
 contrib/pageinspect/btreefuncs.c       | 65 ++++++++++++++++++--------
 contrib/pageinspect/expected/btree.out |  2 +-
 2 files changed, 47 insertions(+), 20 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 17f7ad186e..4eab8df098 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -27,6 +27,7 @@
 
 #include "postgres.h"
 
+#include "access/genam.h"
 #include "access/nbtree.h"
 #include "access/relation.h"
 #include "catalog/namespace.h"
@@ -245,6 +246,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
  */
 struct user_args
 {
+	Relation	rel;
 	Page		page;
 	OffsetNumber offset;
 	bool		leafpage;
@@ -261,6 +263,7 @@ struct user_args
 static Datum
 bt_page_print_tuples(FuncCallContext *fctx, struct user_args *uargs)
 {
+	Relation	rel = uargs->rel;
 	Page		page = uargs->page;
 	OffsetNumber offset = uargs->offset;
 	bool		leafpage = uargs->leafpage;
@@ -295,26 +298,48 @@ bt_page_print_tuples(FuncCallContext *fctx, struct user_args *uargs)
 	values[j++] = BoolGetDatum(IndexTupleHasVarwidths(itup));
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
-	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
-
-	/*
-	 * Make sure that "data" column does not include posting list or pivot
-	 * tuple representation of heap TID
-	 */
-	if (BTreeTupleIsPosting(itup))
-		dlen -= IndexTupleSize(itup) - BTreeTupleGetPostingOffset(itup);
-	else if (BTreeTupleIsPivot(itup) && BTreeTupleGetHeapTID(itup) != NULL)
-		dlen -= MAXALIGN(sizeof(ItemPointerData));
-
-	dump = palloc0(dlen * 3 + 1);
-	datacstring = dump;
-	for (off = 0; off < dlen; off++)
+	if (rel)
 	{
-		if (off > 0)
-			*dump++ = ' ';
-		sprintf(dump, "%02x", *(ptr + off) & 0xff);
-		dump += 2;
+		TupleDesc	itupdesc = RelationGetDescr(rel);
+		Datum		datvalues[INDEX_MAX_KEYS];
+		bool		isnull[INDEX_MAX_KEYS];
+		int			natts;
+		int			indnkeyatts = rel->rd_index->indnkeyatts;
+
+		natts = BTreeTupleGetNAtts(itup, rel);
+
+		itupdesc->natts = Min(indnkeyatts, natts);
+		memset(&isnull, 0xFF, sizeof(isnull));
+		index_deform_tuple(itup, itupdesc, datvalues, isnull);
+		rel->rd_index->indnkeyatts = natts;
+		datacstring = BuildIndexValueDescription(rel, datvalues, isnull);
+		itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+		rel->rd_index->indnkeyatts = indnkeyatts;
 	}
+	else
+	{
+		dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+
+		/*
+		 * Make sure that "data" column does not include posting list or pivot
+		 * tuple representation of heap TID
+		 */
+		if (BTreeTupleIsPosting(itup))
+			dlen -= IndexTupleSize(itup) - BTreeTupleGetPostingOffset(itup);
+		else if (BTreeTupleIsPivot(itup) && BTreeTupleGetHeapTID(itup) != NULL)
+			dlen -= MAXALIGN(sizeof(ItemPointerData));
+
+		dump = palloc0(dlen * 3 + 1);
+		datacstring = dump;
+		for (off = 0; off < dlen; off++)
+		{
+			if (off > 0)
+				*dump++ = ' ';
+			sprintf(dump, "%02x", *(ptr + off) & 0xff);
+			dump += 2;
+		}
+	}
+
 	values[j++] = CStringGetTextDatum(datacstring);
 	pfree(datacstring);
 
@@ -437,11 +462,11 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = rel;
 		uargs->page = palloc(BLCKSZ);
 		memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
 
 		UnlockReleaseBuffer(buffer);
-		relation_close(rel, AccessShareLock);
 
 		uargs->offset = FirstOffsetNumber;
 
@@ -475,6 +500,7 @@ bt_page_items(PG_FUNCTION_ARGS)
 	}
 	else
 	{
+		relation_close(uargs->rel, AccessShareLock);
 		pfree(uargs->page);
 		pfree(uargs);
 		SRF_RETURN_DONE(fctx);
@@ -522,6 +548,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = NULL;
 		uargs->page = VARDATA(raw_page);
 
 		uargs->offset = FirstOffsetNumber;
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 1d45cd5c1e..3da5f37c3e 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -40,7 +40,7 @@ ctid       | (0,1)
 itemlen    | 16
 nulls      | f
 vars       | f
-data       | 01 00 00 00 00 00 00 01
+data       | (a)=(72057594037927937)
 dead       | f
 htid       | (0,1)
 tids       | 
-- 
2.17.1

v25-0003-Teach-pageinspect-about-nbtree-posting-lists.patchapplication/x-patch; name=v25-0003-Teach-pageinspect-about-nbtree-posting-lists.patchDownload

From 826ac5d3ffc05285a7549d64e47472f73231fe40 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v25 3/4] Teach pageinspect about nbtree posting lists.

Add a column for posting list TIDs to bt_page_items().  Also add a
column that displays a single heap TID value for each tuple, regardless
of whether or not "ctid" is used for heap TID.  In the case of posting
list tuples, the value is the lowest heap TID in the posting list.
Arguably I should have done this when commit dd299df8 went in, since
that added a pivot tuple representation that could have a heap TID but
didn't use ctid for that purpose.

Also add a boolean column that displays the LP_DEAD bit value for each
non-pivot tuple.

No version bump for the pageinspect extension, since there hasn't been a
stable release since the last version bump (see commit 58b4cb30).
---
 contrib/pageinspect/btreefuncs.c              | 111 +++++++++++++++---
 contrib/pageinspect/expected/btree.out        |   6 +
 contrib/pageinspect/pageinspect--1.7--1.8.sql |  36 ++++++
 doc/src/sgml/pageinspect.sgml                 |  80 +++++++------
 4 files changed, 181 insertions(+), 52 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 78cdc69ec7..17f7ad186e 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -31,9 +31,11 @@
 #include "access/relation.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_am.h"
+#include "catalog/pg_type.h"
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "pageinspect.h"
+#include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
 #include "utils/varlena.h"
@@ -45,6 +47,8 @@ PG_FUNCTION_INFO_V1(bt_page_stats);
 
 #define IS_INDEX(r) ((r)->rd_rel->relkind == RELKIND_INDEX)
 #define IS_BTREE(r) ((r)->rd_rel->relam == BTREE_AM_OID)
+#define DatumGetItemPointer(X)	 ((ItemPointer) DatumGetPointer(X))
+#define ItemPointerGetDatum(X)	 PointerGetDatum(X)
 
 /* note: BlockNumber is unsigned, hence can't be negative */
 #define CHECK_RELATION_BLOCK_RANGE(rel, blkno) { \
@@ -243,6 +247,9 @@ struct user_args
 {
 	Page		page;
 	OffsetNumber offset;
+	bool		leafpage;
+	bool		rightmost;
+	TupleDesc	tupd;
 };
 
 /*-------------------------------------------------------
@@ -252,17 +259,25 @@ struct user_args
  * ------------------------------------------------------
  */
 static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, struct user_args *uargs)
 {
-	char	   *values[6];
+	Page		page = uargs->page;
+	OffsetNumber offset = uargs->offset;
+	bool		leafpage = uargs->leafpage;
+	bool		rightmost = uargs->rightmost;
+	bool		pivotoffset;
+	Datum		values[9];
+	bool		nulls[9];
 	HeapTuple	tuple;
 	ItemId		id;
 	IndexTuple	itup;
 	int			j;
 	int			off;
 	int			dlen;
-	char	   *dump;
+	char	   *dump,
+			   *datacstring;
 	char	   *ptr;
+	ItemPointer htid;
 
 	id = PageGetItemId(page, offset);
 
@@ -272,18 +287,27 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	itup = (IndexTuple) PageGetItem(page, id);
 
 	j = 0;
-	values[j++] = psprintf("%d", offset);
-	values[j++] = psprintf("(%u,%u)",
-						   ItemPointerGetBlockNumberNoCheck(&itup->t_tid),
-						   ItemPointerGetOffsetNumberNoCheck(&itup->t_tid));
-	values[j++] = psprintf("%d", (int) IndexTupleSize(itup));
-	values[j++] = psprintf("%c", IndexTupleHasNulls(itup) ? 't' : 'f');
-	values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
+	memset(nulls, 0, sizeof(nulls));
+	values[j++] = DatumGetInt16(offset);
+	values[j++] = ItemPointerGetDatum(&itup->t_tid);
+	values[j++] = Int32GetDatum((int) IndexTupleSize(itup));
+	values[j++] = BoolGetDatum(IndexTupleHasNulls(itup));
+	values[j++] = BoolGetDatum(IndexTupleHasVarwidths(itup));
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
 	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+
+	/*
+	 * Make sure that "data" column does not include posting list or pivot
+	 * tuple representation of heap TID
+	 */
+	if (BTreeTupleIsPosting(itup))
+		dlen -= IndexTupleSize(itup) - BTreeTupleGetPostingOffset(itup);
+	else if (BTreeTupleIsPivot(itup) && BTreeTupleGetHeapTID(itup) != NULL)
+		dlen -= MAXALIGN(sizeof(ItemPointerData));
+
 	dump = palloc0(dlen * 3 + 1);
-	values[j] = dump;
+	datacstring = dump;
 	for (off = 0; off < dlen; off++)
 	{
 		if (off > 0)
@@ -291,8 +315,57 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 		sprintf(dump, "%02x", *(ptr + off) & 0xff);
 		dump += 2;
 	}
+	values[j++] = CStringGetTextDatum(datacstring);
+	pfree(datacstring);
 
-	tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
+	/*
+	 * Avoid indicating that pivot tuple from !heapkeyspace index (which won't
+	 * have v4+ status bit set) is dead or has a heap TID -- that can only
+	 * happen with non-pivot tuples.  (Most backend code can use the
+	 * heapkeyspace field from the metapage to figure out which representation
+	 * to expect, but we have to be a bit creative here.)
+	 */
+	pivotoffset = (!leafpage || (!rightmost && offset == P_HIKEY));
+
+	/* LP_DEAD status bit */
+	if (!pivotoffset)
+		values[j++] = BoolGetDatum(ItemIdIsDead(id));
+	else
+		nulls[j++] = true;
+
+	htid = BTreeTupleGetHeapTID(itup);
+	if (pivotoffset && !BTreeTupleIsPivot(itup))
+		htid = NULL;
+
+	if (htid)
+		values[j++] = ItemPointerGetDatum(htid);
+	else
+		nulls[j++] = true;
+
+	if (BTreeTupleIsPosting(itup))
+	{
+		/* build an array of item pointers */
+		ItemPointer tids;
+		Datum	   *tids_datum;
+		int			nposting;
+
+		tids = BTreeTupleGetPosting(itup);
+		nposting = BTreeTupleGetNPosting(itup);
+		tids_datum = (Datum *) palloc(nposting * sizeof(Datum));
+		for (int i = 0; i < nposting; i++)
+			tids_datum[i] = ItemPointerGetDatum(&tids[i]);
+		values[j++] = PointerGetDatum(construct_array(tids_datum,
+													  nposting,
+													  TIDOID,
+													  sizeof(ItemPointerData),
+													  false, 's'));
+		pfree(tids_datum);
+	}
+	else
+		nulls[j++] = true;
+
+	/* Build and return the result tuple */
+	tuple = heap_form_tuple(uargs->tupd, values, nulls);
 
 	return HeapTupleGetDatum(tuple);
 }
@@ -378,12 +451,13 @@ bt_page_items(PG_FUNCTION_ARGS)
 			elog(NOTICE, "page is deleted");
 
 		fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+		uargs->leafpage = P_ISLEAF(opaque);
+		uargs->rightmost = P_RIGHTMOST(opaque);
 
 		/* Build a tuple descriptor for our result type */
 		if (get_call_result_type(fcinfo, NULL, &tupleDesc) != TYPEFUNC_COMPOSITE)
 			elog(ERROR, "return type must be a row type");
-
-		fctx->attinmeta = TupleDescGetAttInMetadata(tupleDesc);
+		uargs->tupd = tupleDesc;
 
 		fctx->user_fctx = uargs;
 
@@ -395,7 +469,7 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
@@ -463,12 +537,13 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 			elog(NOTICE, "page is deleted");
 
 		fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+		uargs->leafpage = P_ISLEAF(opaque);
+		uargs->rightmost = P_RIGHTMOST(opaque);
 
 		/* Build a tuple descriptor for our result type */
 		if (get_call_result_type(fcinfo, NULL, &tupleDesc) != TYPEFUNC_COMPOSITE)
 			elog(ERROR, "return type must be a row type");
-
-		fctx->attinmeta = TupleDescGetAttInMetadata(tupleDesc);
+		uargs->tupd = tupleDesc;
 
 		fctx->user_fctx = uargs;
 
@@ -480,7 +555,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..1d45cd5c1e 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -41,6 +41,9 @@ itemlen    | 16
 nulls      | f
 vars       | f
 data       | 01 00 00 00 00 00 00 01
+dead       | f
+htid       | (0,1)
+tids       | 
 
 SELECT * FROM bt_page_items('test1_a_idx', 2);
 ERROR:  block number out of range
@@ -54,6 +57,9 @@ itemlen    | 16
 nulls      | f
 vars       | f
 data       | 01 00 00 00 00 00 00 01
+dead       | f
+htid       | (0,1)
+tids       | 
 
 SELECT * FROM bt_page_items(get_raw_page('test1_a_idx', 2));
 ERROR:  block number 2 is out of range for relation "test1_a_idx"
diff --git a/contrib/pageinspect/pageinspect--1.7--1.8.sql b/contrib/pageinspect/pageinspect--1.7--1.8.sql
index 2a7c4b3516..70f1ab0467 100644
--- a/contrib/pageinspect/pageinspect--1.7--1.8.sql
+++ b/contrib/pageinspect/pageinspect--1.7--1.8.sql
@@ -14,3 +14,39 @@ CREATE FUNCTION heap_tuple_infomask_flags(
 RETURNS record
 AS 'MODULE_PATHNAME', 'heap_tuple_infomask_flags'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items(text, int4)
+--
+DROP FUNCTION bt_page_items(text, int4);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT dead boolean,
+    OUT htid tid,
+    OUT tids tid[])
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items(bytea)
+--
+DROP FUNCTION bt_page_items(bytea);
+CREATE FUNCTION bt_page_items(IN page bytea,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT dead boolean,
+    OUT htid tid,
+    OUT tids tid[])
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items_bytea'
+LANGUAGE C STRICT PARALLEL SAFE;
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 7e2e1487d7..1763e9c6f0 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -329,11 +329,11 @@ test=# SELECT * FROM bt_page_stats('pg_cast_oid_index', 1);
 -[ RECORD 1 ]-+-----
 blkno         | 1
 type          | l
-live_items    | 256
+live_items    | 224
 dead_items    | 0
-avg_item_size | 12
+avg_item_size | 16
 page_size     | 8192
-free_size     | 4056
+free_size     | 3668
 btpo_prev     | 0
 btpo_next     | 0
 btpo          | 0
@@ -356,33 +356,45 @@ btpo_flags    | 3
       <function>bt_page_items</function> returns detailed information about
       all of the items on a B-tree index page.  For example:
 <screen>
-test=# SELECT * FROM bt_page_items('pg_cast_oid_index', 1);
- itemoffset |  ctid   | itemlen | nulls | vars |    data
-------------+---------+---------+-------+------+-------------
-          1 | (0,1)   |      12 | f     | f    | 23 27 00 00
-          2 | (0,2)   |      12 | f     | f    | 24 27 00 00
-          3 | (0,3)   |      12 | f     | f    | 25 27 00 00
-          4 | (0,4)   |      12 | f     | f    | 26 27 00 00
-          5 | (0,5)   |      12 | f     | f    | 27 27 00 00
-          6 | (0,6)   |      12 | f     | f    | 28 27 00 00
-          7 | (0,7)   |      12 | f     | f    | 29 27 00 00
-          8 | (0,8)   |      12 | f     | f    | 2a 27 00 00
+regression=# SELECT * FROM bt_page_items('tenk2_unique1', 5);
+ itemoffset |   ctid   | itemlen | nulls | vars |          data           | dead |   htid   | tids
+------------+----------+---------+-------+------+-------------------------+------+----------+------
+          1 | (40,1)   |      16 | f     | f    | b8 05 00 00 00 00 00 00 |      |          |
+          2 | (58,11)  |      16 | f     | f    | 4a 04 00 00 00 00 00 00 | f    | (58,11)  |
+          3 | (266,4)  |      16 | f     | f    | 4b 04 00 00 00 00 00 00 | f    | (266,4)  |
+          4 | (279,25) |      16 | f     | f    | 4c 04 00 00 00 00 00 00 | f    | (279,25) |
+          5 | (333,11) |      16 | f     | f    | 4d 04 00 00 00 00 00 00 | f    | (333,11) |
+          6 | (87,24)  |      16 | f     | f    | 4e 04 00 00 00 00 00 00 | f    | (87,24)  |
+          7 | (38,22)  |      16 | f     | f    | 4f 04 00 00 00 00 00 00 | f    | (38,22)  |
+          8 | (272,17) |      16 | f     | f    | 50 04 00 00 00 00 00 00 | f    | (272,17) |
 </screen>
-      In a B-tree leaf page, <structfield>ctid</structfield> points to a heap tuple.
-      In an internal page, the block number part of <structfield>ctid</structfield>
-      points to another page in the index itself, while the offset part
-      (the second number) is ignored and is usually 1.
+      In a B-tree leaf page, <structfield>ctid</structfield> usually
+      points to a heap tuple, and <structfield>dead</structfield> may
+      indicate that the item has its <literal>LP_DEAD</literal> bit
+      set.  In an internal page, the block number part of
+      <structfield>ctid</structfield> points to another page in the
+      index itself, while the offset part (the second number) encodes
+      metadata about the tuple.  Posting list tuples on leaf pages
+      also use <structfield>ctid</structfield> for metadata.
+      <structfield>htid</structfield> always shows a single heap TID
+      for the tuple, regardless of how it is represented (internal
+      page tuples may need to store a heap TID when there are many
+      duplicate tuples on descendent leaf pages).
+      <structfield>tids</structfield> is a list of TIDs that is stored
+      within posting list tuples (tuples created by deduplication).
      </para>
      <para>
       Note that the first item on any non-rightmost page (any page with
       a non-zero value in the <structfield>btpo_next</structfield> field) is the
       page's <quote>high key</quote>, meaning its <structfield>data</structfield>
       serves as an upper bound on all items appearing on the page, while
-      its <structfield>ctid</structfield> field is meaningless.  Also, on non-leaf
-      pages, the first real data item (the first item that is not a high
-      key) is a <quote>minus infinity</quote> item, with no actual value
-      in its <structfield>data</structfield> field.  Such an item does have a valid
-      downlink in its <structfield>ctid</structfield> field, however.
+      its <structfield>ctid</structfield> field does not point to
+      another block.  Also, on non-leaf pages, the first real data item
+      (the first item that is not a high key) is a <quote>minus
+      infinity</quote> item, with no actual value in its
+      <structfield>data</structfield> field.  Such an item does have a
+      valid downlink in its <structfield>ctid</structfield> field,
+      however.
      </para>
     </listitem>
    </varlistentry>
@@ -402,17 +414,17 @@ test=# SELECT * FROM bt_page_items('pg_cast_oid_index', 1);
       with <function>get_raw_page</function> should be passed as argument.  So
       the last example could also be rewritten like this:
 <screen>
-test=# SELECT * FROM bt_page_items(get_raw_page('pg_cast_oid_index', 1));
- itemoffset |  ctid   | itemlen | nulls | vars |    data
-------------+---------+---------+-------+------+-------------
-          1 | (0,1)   |      12 | f     | f    | 23 27 00 00
-          2 | (0,2)   |      12 | f     | f    | 24 27 00 00
-          3 | (0,3)   |      12 | f     | f    | 25 27 00 00
-          4 | (0,4)   |      12 | f     | f    | 26 27 00 00
-          5 | (0,5)   |      12 | f     | f    | 27 27 00 00
-          6 | (0,6)   |      12 | f     | f    | 28 27 00 00
-          7 | (0,7)   |      12 | f     | f    | 29 27 00 00
-          8 | (0,8)   |      12 | f     | f    | 2a 27 00 00
+regression=# SELECT * FROM bt_page_items(get_raw_page('tenk2_unique1', 5));
+ itemoffset |   ctid   | itemlen | nulls | vars |          data           | dead |   htid   | tids
+------------+----------+---------+-------+------+-------------------------+------+----------+------
+          1 | (40,1)   |      16 | f     | f    | b8 05 00 00 00 00 00 00 |      |          |
+          2 | (58,11)  |      16 | f     | f    | 4a 04 00 00 00 00 00 00 | f    | (58,11)  |
+          3 | (266,4)  |      16 | f     | f    | 4b 04 00 00 00 00 00 00 | f    | (266,4)  |
+          4 | (279,25) |      16 | f     | f    | 4c 04 00 00 00 00 00 00 | f    | (279,25) |
+          5 | (333,11) |      16 | f     | f    | 4d 04 00 00 00 00 00 00 | f    | (333,11) |
+          6 | (87,24)  |      16 | f     | f    | 4e 04 00 00 00 00 00 00 | f    | (87,24)  |
+          7 | (38,22)  |      16 | f     | f    | 4f 04 00 00 00 00 00 00 | f    | (38,22)  |
+          8 | (272,17) |      16 | f     | f    | 50 04 00 00 00 00 00 00 | f    | (272,17) |
 </screen>
       All the other details are the same as explained in the previous item.
      </para>
-- 
2.17.1

v25-0002-Add-deduplication-to-nbtree.patchapplication/x-patch; name=v25-0002-Add-deduplication-to-nbtree.patchDownload

From 0efdbf168fc324b0173cbc1a2019c4748d5f312a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 25 Sep 2019 10:08:53 -0700
Subject: [PATCH v25 2/4] Add deduplication to nbtree

---
 src/include/access/nbtree.h                   | 329 +++++++-
 src/include/access/nbtxlog.h                  |  71 +-
 src/include/access/rmgrlist.h                 |   2 +-
 src/backend/access/common/reloptions.c        |   9 +
 src/backend/access/index/genam.c              |   4 +
 src/backend/access/nbtree/Makefile            |   1 +
 src/backend/access/nbtree/README              |  74 +-
 src/backend/access/nbtree/nbtdedup.c          | 715 ++++++++++++++++++
 src/backend/access/nbtree/nbtinsert.c         | 343 ++++++++-
 src/backend/access/nbtree/nbtpage.c           | 238 +++++-
 src/backend/access/nbtree/nbtree.c            | 167 +++-
 src/backend/access/nbtree/nbtsearch.c         | 250 +++++-
 src/backend/access/nbtree/nbtsort.c           | 204 ++++-
 src/backend/access/nbtree/nbtsplitloc.c       |  36 +-
 src/backend/access/nbtree/nbtutils.c          | 204 ++++-
 src/backend/access/nbtree/nbtxlog.c           | 236 +++++-
 src/backend/access/rmgrdesc/nbtdesc.c         |  25 +-
 src/backend/utils/misc/guc.c                  |  28 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/bin/psql/tab-complete.c                   |   4 +-
 contrib/amcheck/verify_nbtree.c               | 180 ++++-
 doc/src/sgml/btree.sgml                       | 123 ++-
 doc/src/sgml/charset.sgml                     |   9 +-
 doc/src/sgml/config.sgml                      |  33 +
 doc/src/sgml/maintenance.sgml                 |   8 +
 doc/src/sgml/ref/create_index.sgml            |  44 +-
 doc/src/sgml/ref/reindex.sgml                 |   5 +-
 src/test/regress/expected/btree_index.out     |  16 +
 src/test/regress/sql/btree_index.sql          |  17 +
 29 files changed, 3131 insertions(+), 245 deletions(-)
 create mode 100644 src/backend/access/nbtree/nbtdedup.c

diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 9833cc10bd..1482d5ab1a 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -24,6 +24,17 @@
 #include "storage/bufmgr.h"
 #include "storage/shm_toc.h"
 
+/* deduplication GUC modes */
+typedef enum DeduplicationMode
+{
+	DEDUP_OFF = 0,		/* disabled */
+	DEDUP_ON,			/* enabled generally */
+	DEDUP_NONUNIQUE		/* enabled with non-unique indexes only (default) */
+} DeduplicationMode;
+
+/* GUC parameter */
+extern int	btree_deduplication;
+
 /* There's room for a 16-bit vacuum cycle ID in BTPageOpaqueData */
 typedef uint16 BTCycleId;
 
@@ -108,6 +119,7 @@ typedef struct BTMetaPageData
 										 * pages */
 	float8		btm_last_cleanup_num_heap_tuples;	/* number of heap tuples
 													 * during last cleanup */
+	bool		btm_safededup;	/* deduplication known to be safe? */
 } BTMetaPageData;
 
 #define BTPageGetMeta(p) \
@@ -115,7 +127,8 @@ typedef struct BTMetaPageData
 
 /*
  * The current Btree version is 4.  That's what you'll get when you create
- * a new index.
+ * a new index.  The btm_safededup field can only be set if this happened
+ * on Postgres 13, but it's safe to read with version 3 indexes.
  *
  * Btree version 3 was used in PostgreSQL v11.  It is mostly the same as
  * version 4, but heap TIDs were not part of the keyspace.  Index tuples
@@ -132,8 +145,8 @@ typedef struct BTMetaPageData
 #define BTREE_METAPAGE	0		/* first page is meta */
 #define BTREE_MAGIC		0x053162	/* magic number in metapage */
 #define BTREE_VERSION	4		/* current version number */
-#define BTREE_MIN_VERSION	2	/* minimal supported version number */
-#define BTREE_NOVAC_VERSION	3	/* minimal version with all meta fields */
+#define BTREE_MIN_VERSION	2	/* minimum supported version */
+#define BTREE_NOVAC_VERSION	3	/* version with all meta fields set */
 
 /*
  * Maximum size of a btree index entry, including its tuple header.
@@ -155,6 +168,26 @@ typedef struct BTMetaPageData
 	MAXALIGN_DOWN((PageGetPageSize(page) - \
 				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
 				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
+/*
+ * MaxBTreeIndexTuplesPerPage is an upper bound on the number of "logical"
+ * tuples that may be stored on a btree leaf page.  This is comparable to
+ * the generic/physical MaxIndexTuplesPerPage upper bound.  A separate
+ * upper bound is needed in certain contexts due to posting list tuples,
+ * which only use a single physical page entry to store many logical
+ * tuples.  (MaxBTreeIndexTuplesPerPage is used to size the per-page
+ * temporary buffers used by index scans.)
+ *
+ * Note: we don't bother considering per-physical-tuple overheads here to
+ * keep things simple (value is based on how many elements a single array
+ * of heap TIDs must have to fill the space between the page header and
+ * special area).  The value is slightly higher (i.e. more conservative)
+ * than necessary as a result, which is considered acceptable.  There will
+ * only be three (very large) physical posting list tuples in leaf pages
+ * that have the largest possible number of heap TIDs/logical tuples.
+ */
+#define MaxBTreeIndexTuplesPerPage \
+	(int) ((BLCKSZ - SizeOfPageHeaderData - sizeof(BTPageOpaqueData)) / \
+		   sizeof(ItemPointerData))
 
 /*
  * The leaf-page fillfactor defaults to 90% but is user-adjustable.
@@ -230,16 +263,15 @@ typedef struct BTMetaPageData
  * tuples (non-pivot tuples).  _bt_check_natts() enforces the rules
  * described here.
  *
- * Non-pivot tuple format:
+ * Non-pivot tuple format (plain/non-posting variant):
  *
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
- * All other types of index tuples ("pivot" tuples) only have key columns,
- * since pivot tuples only exist to represent how the key space is
+ * Non-pivot tuples complement pivot tuples, which only have key columns.
+ * The sole purpose of pivot tuples is to represent how the key space is
  * separated.  In general, any B-Tree index that has more than one level
  * (i.e. any index that does not just consist of a metapage and a single
  * leaf root page) must have some number of pivot tuples, since pivot
@@ -283,20 +315,103 @@ typedef struct BTMetaPageData
  * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
  * number of columns/attributes <= INDEX_MAX_KEYS.
  *
+ * Sometimes non-pivot tuples also use a representation that repurposes
+ * t_tid to store metadata rather than a TID.  Postgres 13 introduced a new
+ * non-pivot tuple format to support deduplication: posting list tuples.
+ * Deduplication folds together multiple equal non-pivot tuples into a
+ * logically equivalent, space efficient representation.  A posting list is
+ * an array of ItemPointerData elements.  Regular non-pivot tuples are
+ * merged together to form posting list tuples lazily, at the point where
+ * we'd otherwise have to split a leaf page.
+ *
+ * Posting tuple format (alternative non-pivot tuple representation):
+ *
+ *  t_tid | t_info | key values | posting list (TID array)
+ *
+ * Posting list tuples are recognized as such by having the
+ * INDEX_ALT_TID_MASK status bit set in t_info and the BT_IS_POSTING status
+ * bit set in t_tid.  These flags redefine the content of the posting
+ * tuple's t_tid to store an offset to the posting list, as well as the
+ * total number of posting list array elements.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items present in the tuple, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.  Like any non-pivot tuple, the number of columns stored is
+ * always implicitly the total number in the index (in practice there can
+ * never be non-key columns stored, since deduplication is not supported
+ * with INCLUDE indexes).
+ *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_N_POSTING_OFFSET_MASK	0x0FFF
 #define BT_HEAP_TID_ATTR			0x1000
+#define BT_IS_POSTING				0x2000
+
+/*
+ * N.B.: BTreeTupleIsPivot() should only be used in code that deals with
+ * heapkeyspace indexes specifically.  BTreeTupleIsPosting() works with all
+ * nbtree indexes, though.
+ */
+#define BTreeTupleIsPivot(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+	)
+#define BTreeTupleIsPosting(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+	)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleGetNPosting(itup)	\
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+	)
+#define BTreeTupleSetNPosting(itup, n) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+	} while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+	)
+#define BTreeSetPostingMeta(itup, nposting, off) \
+	do { \
+		BTreeTupleSetNPosting(itup, nposting); \
+		Assert(BTreeTupleIsPosting(itup)); \
+		ItemPointerSetBlockNumber(&((itup)->t_tid), (off)); \
+	} while(0)
+
+#define BTreeTupleGetPosting(itup) \
+	(ItemPointer) ((char*) (itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+	(BTreeTupleGetPosting(itup) + (n))
 
 /* Get/set downlink block number */
 #define BTreeInnerTupleGetDownLink(itup) \
@@ -327,40 +442,69 @@ typedef struct BTMetaPageData
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		(BTreeTupleIsPivot(itup)) ? \
 		( \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
 		IndexRelationGetNumberOfAttributes(rel) \
 	)
-#define BTreeTupleSetNAtts(itup, n) \
-	do { \
-		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
-	} while(0)
+
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int n)
+{
+	Assert(!BTreeTupleIsPosting(itup));
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	ItemPointerSetOffsetNumber(&itup->t_tid, n & BT_N_KEYS_OFFSET_MASK);
+}
 
 /*
- * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Get tiebreaker heap TID attribute, if any.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
  */
-#define BTreeTupleGetHeapTID(itup) \
-	( \
-	  (itup)->t_info & INDEX_ALT_TID_MASK && \
-	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
-	  ( \
-		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
-					   sizeof(ItemPointerData)) \
-	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
-	)
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+	if (BTreeTupleIsPivot(itup))
+	{
+		/* Pivot tuple heap TID representation? */
+		if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+			 BT_HEAP_TID_ATTR) != 0)
+			return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+								  sizeof(ItemPointerData));
+
+		/* Heap TID attribute was truncated */
+		return NULL;
+	}
+	else if (BTreeTupleIsPosting(itup))
+		return BTreeTupleGetPosting(itup);
+
+	return &(itup->t_tid);
+}
+
 /*
- * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list tuple.  Works with
+ * non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxHeapTID(IndexTuple itup)
+{
+	Assert(!BTreeTupleIsPivot(itup));
+
+	if (BTreeTupleIsPosting(itup))
+		return BTreeTupleGetPosting(itup) + (BTreeTupleGetNPosting(itup) - 1);
+
+	return &(itup->t_tid);
+}
+
+/*
+ * Set the heap TID attribute for a pivot tuple
  */
 #define BTreeTupleSetAltHeapTID(itup) \
 	do { \
-		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(BTreeTupleIsPivot(itup)); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
 								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
 	} while(0)
@@ -435,6 +579,11 @@ typedef BTStackData *BTStack;
  * indexes whose version is >= version 4.  It's convenient to keep this close
  * by, rather than accessing the metapage repeatedly.
  *
+ * safededup is set to indicate that index may use dynamic deduplication
+ * safely (index storage parameter separately indicates if deduplication is
+ * currently in use).  This is also a property of the index relation rather
+ * than an indexscan that is kept around for convenience.
+ *
  * anynullkeys indicates if any of the keys had NULL value when scankey was
  * built from index tuple (note that already-truncated tuple key attributes
  * set NULL as a placeholder key value, which also affects value of
@@ -470,6 +619,7 @@ typedef BTStackData *BTStack;
 typedef struct BTScanInsertData
 {
 	bool		heapkeyspace;
+	bool		safededup;
 	bool		anynullkeys;
 	bool		nextkey;
 	bool		pivotsearch;
@@ -508,10 +658,70 @@ typedef struct BTInsertStateData
 	bool		bounds_valid;
 	OffsetNumber low;
 	OffsetNumber stricthigh;
+
+	/*
+	 * if _bt_binsrch_insert found the location inside existing posting list,
+	 * save the position inside the list.  This will be -1 in rare cases where
+	 * the overlapping posting list is LP_DEAD.
+	 */
+	int			postingoff;
 } BTInsertStateData;
 
 typedef BTInsertStateData *BTInsertState;
 
+/*
+ * State used to representing a pending posting list during deduplication.
+ *
+ * Each entry represents a group of consecutive items from the page, starting
+ * from page offset number 'baseoff', which is the offset number of the "base"
+ * tuple on the page undergoing deduplication.  'nitems' is the total number
+ * of items from the page that will be merged to make a new posting tuple.
+ *
+ * Note: 'nitems' means the number of physical index tuples/line pointers on
+ * the page, starting with and including the item at offset number 'baseoff'
+ * (so nitems should be at least 2 when interval is used).  These existing
+ * tuples may be posting list tuples or regular tuples.
+ */
+typedef struct BTDedupInterval
+{
+	OffsetNumber baseoff;
+	uint16		nitems;
+} BTDedupInterval;
+
+/*
+ * Btree-private state used to deduplicate items on a leaf page
+ */
+typedef struct BTDedupStateData
+{
+	Relation	rel;
+	/* Deduplication status info for entire page/operation */
+	Size		maxitemsize;	/* Limit on size of final tuple */
+	IndexTuple	newitem;
+	bool		checkingunique; /* Use unique index strategy? */
+	OffsetNumber skippedbase;	/* First offset skipped by checkingunique */
+
+	/* Metadata about current pending posting list */
+	ItemPointer htids;			/* Heap TIDs in pending posting list */
+	int			nhtids;			/* # heap TIDs in nhtids array */
+	int			nitems;			/* See BTDedupInterval definition */
+	Size		alltupsize;		/* Includes line pointer overhead */
+	bool		overlap;		/* Avoid overlapping posting lists? */
+
+	/* Metadata about base tuple of current pending posting list */
+	IndexTuple	base;			/* Use to form new posting list */
+	OffsetNumber baseoff;		/* page offset of base */
+	Size		basetupsize;	/* base size without posting list */
+
+	/*
+	 * Pending posting list.  Contains information about a group of
+	 * consecutive items that will be deduplicated by creating a new posting
+	 * list tuple.
+	 */
+	BTDedupInterval interval;
+} BTDedupStateData;
+
+typedef BTDedupStateData *BTDedupState;
+
 /*
  * BTScanOpaqueData is the btree-private state needed for an indexscan.
  * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
@@ -535,7 +745,10 @@ typedef BTInsertStateData *BTInsertState;
  * If we are doing an index-only scan, we save the entire IndexTuple for each
  * matched item, otherwise only its heap TID and offset.  The IndexTuples go
  * into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array.  Posting list tuples store a "base" tuple once,
+ * allowing the same key to be returned for each logical tuple associated
+ * with the physical posting list tuple (i.e. for each TID from the posting
+ * list).
  */
 
 typedef struct BTScanPosItem	/* what we remember about each match */
@@ -568,6 +781,12 @@ typedef struct BTScanPosData
 	 */
 	int			nextTupleOffset;
 
+	/*
+	 * Posting list tuples use postingTupleOffset to store the current
+	 * location of the tuple that is returned multiple times.
+	 */
+	int			postingTupleOffset;
+
 	/*
 	 * The items array is always ordered in index order (ie, increasing
 	 * indexoffset).  When scanning backwards it is convenient to fill the
@@ -579,7 +798,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxBTreeIndexTuplesPerPage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -687,6 +906,7 @@ typedef struct BTOptions
 	int			fillfactor;		/* page fill factor in percent (0..100) */
 	/* fraction of newly inserted tuples prior to trigger index cleanup */
 	float8		vacuum_cleanup_index_scale_factor;
+	bool		deduplication;	/* Use deduplication where safe? */
 } BTOptions;
 
 #define BTGetFillFactor(relation) \
@@ -695,8 +915,18 @@ typedef struct BTOptions
 	 (relation)->rd_options ? \
 	 ((BTOptions *) (relation)->rd_options)->fillfactor : \
 	 BTREE_DEFAULT_FILLFACTOR)
+#define BTGetUseDedup(relation) \
+	(AssertMacro(relation->rd_rel->relkind == RELKIND_INDEX && \
+				 relation->rd_rel->relam == BTREE_AM_OID), \
+	((relation)->rd_options ? \
+	 ((BTOptions *) (relation)->rd_options)->deduplication : \
+	 BTGetUseDedupGUC(relation)))
 #define BTGetTargetPageFreeSpace(relation) \
 	(BLCKSZ * (100 - BTGetFillFactor(relation)) / 100)
+#define BTGetUseDedupGUC(relation) \
+	(relation->rd_index->indisunique ? \
+	 btree_deduplication == DEDUP_ON : \
+	 btree_deduplication != DEDUP_OFF)
 
 /*
  * Constant definition for progress reporting.  Phase numbers must match
@@ -743,6 +973,22 @@ extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
 extern void _bt_parallel_done(IndexScanDesc scan);
 extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
 
+/*
+ * prototypes for functions in nbtdedup.c
+ */
+extern void _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+							   IndexTuple newitem, Size newitemsz,
+							   bool checkingunique);
+extern void _bt_dedup_start_pending(BTDedupState state, IndexTuple base,
+									OffsetNumber base_off);
+extern bool _bt_dedup_save_htid(BTDedupState state, IndexTuple itup);
+extern Size _bt_dedup_finish_pending(Buffer buffer, BTDedupState state,
+									 bool need_wal);
+extern IndexTuple _bt_form_posting(IndexTuple tuple, ItemPointer htids,
+								   int nhtids);
+extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
+								   int postingoff);
+
 /*
  * prototypes for functions in nbtinsert.c
  */
@@ -761,7 +1007,8 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
 /*
  * prototypes for functions in nbtpage.c
  */
-extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level);
+extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+							 bool safededup);
 extern void _bt_update_meta_cleanup_info(Relation rel,
 										 TransactionId oldestBtpoXact, float8 numHeapTuples);
 extern void _bt_upgrademetapage(Page page);
@@ -769,6 +1016,7 @@ extern Buffer _bt_getroot(Relation rel, int access);
 extern Buffer _bt_gettrueroot(Relation rel);
 extern int	_bt_getrootheight(Relation rel);
 extern bool _bt_heapkeyspace(Relation rel);
+extern bool _bt_safededup(Relation rel);
 extern void _bt_checkpage(Relation rel, Buffer buf);
 extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
 extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -779,7 +1027,9 @@ extern bool _bt_page_recyclable(Page page);
 extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems, Relation heapRel);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
-								OffsetNumber *deletable, int ndeletable);
+								OffsetNumber *deletable, int ndeletable,
+								OffsetNumber *updateitemnos,
+								IndexTuple *updated, int nupdateable);
 extern int	_bt_pagedel(Relation rel, Buffer buf);
 
 /*
@@ -829,6 +1079,7 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern bool _bt_opclasses_support_dedup(Relation index);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 71435a13b3..d387905cc0 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP_PAGE	0x50	/* deduplicate tuples on leaf page */
+#define XLOG_BTREE_INSERT_POST	0x60	/* add index tuple with posting split */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -53,21 +54,32 @@ typedef struct xl_btree_metadata
 	uint32		fastlevel;
 	TransactionId oldest_btpo_xact;
 	float8		last_cleanup_num_heap_tuples;
+	bool		btm_safededup;
 } xl_btree_metadata;
 
 /*
  * This is what we need to know about simple (without split) insert.
  *
- * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
- * Note that INSERT_META implies it's not a leaf page.
+ * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META, and
+ * INSERT_POST.  Note that INSERT_META and INSERT_UPPER implies it's not a
+ * leaf page, while INSERT_POST and INSERT_LEAF imply that it is.
  *
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page
  * Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
  * Backup Blk 2: xl_btree_metadata, if INSERT_META
+ *
+ * Note: The new tuple is actually the "original" new item in the posting
+ * list split insert case (i.e. the INSERT_POST case).  A split offset for
+ * the posting list is logged before the original new item.  Recovery needs
+ * both, since it must do an in-place update of the existing posting list
+ * that was split as an extra step.  Also, recovery generates a "final"
+ * newitem.  See _bt_swap_posting().
  */
 typedef struct xl_btree_insert
 {
 	OffsetNumber offnum;
+	/* posting split offset (INSERT_POST only) */
+	/* new tuple that was inserted (or orignewitem in INSERT_POST case) */
 } xl_btree_insert;
 
 #define SizeOfBtreeInsert	(offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
@@ -91,9 +103,18 @@ typedef struct xl_btree_insert
  *
  * Backup Blk 0: original page / new left page
  *
- * The left page's data portion contains the new item, if it's the _L variant.
- * An IndexTuple representing the high key of the left page must follow with
- * either variant.
+ * The left page's data portion contains the new item, if it's the _L variant
+ * (though _R variant page split records with a posting list split sometimes
+ * need to include newitem).  An IndexTuple representing the high key of the
+ * left page must follow in all cases.
+ *
+ * The newitem is actually an "original" newitem when a posting list split
+ * occurs that requires than the original posting list be updated in passing.
+ * Recovery recognizes this case when postingoff is set.  This corresponds to
+ * the xl_btree_insert INSERT_POST case.  Note that postingoff will be set to
+ * zero (no posting split) when a posting list split occurs where both
+ * original posting list and newitem go on the right page, since recovery
+ * doesn't need to consider the posting list split at all.
  *
  * Backup Blk 1: new right page
  *
@@ -111,10 +132,26 @@ typedef struct xl_btree_split
 {
 	uint32		level;			/* tree level of page being split */
 	OffsetNumber firstright;	/* first item moved to right page */
-	OffsetNumber newitemoff;	/* new item's offset (useful for _L variant) */
+	OffsetNumber newitemoff;	/* new item's offset */
+	uint16		postingoff;		/* offset inside orig posting tuple */
 } xl_btree_split;
 
-#define SizeOfBtreeSplit	(offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit	(offsetof(xl_btree_split, postingoff) + sizeof(uint16))
+
+/*
+ * When page is deduplicated, consecutive groups of tuples with equal keys are
+ * merged together into posting list tuples.
+ *
+ * The WAL record represents the interval that describes the posing tuple
+ * that should be added to the page.
+ */
+typedef struct xl_btree_dedup
+{
+	OffsetNumber baseoff;
+	uint16		nitems;
+} xl_btree_dedup;
+
+#define SizeOfBtreeDedup 	(offsetof(xl_btree_dedup, nitems) + sizeof(uint16))
 
 /*
  * This is what we need to know about delete of individual leaf index tuples.
@@ -148,19 +185,25 @@ typedef struct xl_btree_reuse_page
 /*
  * This is what we need to know about vacuum of individual leaf index tuples.
  * The WAL record can represent deletion of any number of index tuples on a
- * single index page when executed by VACUUM.
+ * single index page when executed by VACUUM.  It can also support "updates"
+ * of index tuples, which are actually deletions of "logical" tuples contained
+ * in an existing posting list tuple that will still have some remaining
+ * logical tuples once VACUUM finishes.
  *
  * Note that the WAL record in any vacuum of an index must have at least one
- * item to delete.
+ * item to delete or update.
  */
 typedef struct xl_btree_vacuum
 {
-	uint32		ndeleted;
+	uint16		ndeleted;
+	uint16		nupdated;
 
 	/* DELETED TARGET OFFSET NUMBERS FOLLOW */
+	/* UPDATED TARGET OFFSET NUMBERS FOLLOW */
+	/* UPDATED TUPLES TO ADD BACK FOLLOW */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, nupdated) + sizeof(uint16))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.
@@ -241,6 +284,8 @@ typedef struct xl_btree_newroot
 extern void btree_redo(XLogReaderState *record);
 extern void btree_desc(StringInfo buf, XLogReaderState *record);
 extern const char *btree_identify(uint8 info);
+extern void btree_xlog_startup(void);
+extern void btree_xlog_cleanup(void);
 extern void btree_mask(char *pagedata, BlockNumber blkno);
 
 #endif							/* NBTXLOG_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 3c0db2ccf5..2b8c6c7fc8 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -36,7 +36,7 @@ PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL,
 PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
 PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
 PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask)
 PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask)
 PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
 PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 48377ace24..2b37afd9e5 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -158,6 +158,15 @@ static relopt_bool boolRelOpts[] =
 		},
 		true
 	},
+	{
+		{
+			"deduplication",
+			"Enables deduplication on btree index leaf pages",
+			RELOPT_KIND_BTREE,
+			ShareUpdateExclusiveLock
+		},
+		true
+	},
 	/* list terminator */
 	{{NULL}}
 };
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 2599b5d342..6e1dc596e1 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -276,6 +276,10 @@ BuildIndexValueDescription(Relation indexRelation,
 /*
  * Get the latestRemovedXid from the table entries pointed at by the index
  * tuples being deleted.
+ *
+ * Note: index access methods that don't consistently use the standard
+ * IndexTuple + heap TID item pointer representation will need to provide
+ * their own version of this function.
  */
 TransactionId
 index_compute_xid_horizon_for_tuples(Relation irel,
diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index bf245f5dab..d69808e78c 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = \
 	nbtcompare.o \
+	nbtdedup.o \
 	nbtinsert.o \
 	nbtpage.o \
 	nbtree.o \
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6db203e75c..54cb9db49d 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
 like a hint bit for a heap tuple), but physically removing tuples requires
 exclusive lock.  In the current code we try to remove LP_DEAD tuples when
 we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already).  Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)
 
 This leaves the index in a state where it has no entry for a dead tuple
 that still exists in the heap.  This is not a problem for the current
@@ -710,6 +713,75 @@ the fallback strategy assumes that duplicates are mostly inserted in
 ascending heap TID order.  The page is split in a way that leaves the left
 half of the page mostly full, and the right half of the page mostly empty.
 
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits.  Deduplication alters
+the physical representation of tuples without changing the logical contents
+of the index, and without adding overhead to read queries.  Non-pivot
+tuples are folded together into a single physical tuple with a posting list
+(a simple array of heap TIDs with the standard item pointer format).
+Deduplication is always applied lazily, at the point where it would
+otherwise be necessary to perform a page split.  It occurs only when
+LP_DEAD items have been removed, as our last line of defense against
+splitting a leaf page.  We can set the LP_DEAD bit with posting list
+tuples, though only when all table tuples are known dead. (Bitmap scans
+cannot perform LP_DEAD bit setting, and are the common case with indexes
+that contain lots of duplicates, so this downside is considered
+acceptable.)
+
+Large groups of logical duplicates tend to appear together on the same leaf
+page due to the special duplicate logic used when choosing a split point.
+This facilitates lazy/dynamic deduplication.  Deduplication can reliably
+deduplicate a large localized group of duplicates before it can span
+multiple leaf pages.  Posting list tuples are subject to the same 1/3 of a
+page restriction as any other tuple.
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists.  A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple.  An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication.  Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple.  When the incoming
+tuple really does overlap with an existing posting list, a posting list
+split is performed.  Posting list splits work in a way that more or less
+preserves the illusion that all incoming tuples do not need to be merged
+with any existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list.  The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list.  The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list.  We make
+space in the posting list by removing the heap TID that became the new
+item.  The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists.  Also, overall
+space utilization is improved by keeping existing posting lists large.
+
+The representation of posting lists is identical to the posting lists used
+by GIN, so it would be straightforward to apply GIN's varbyte encoding
+compression scheme to individual posting lists.  Posting list compression
+would break the assumptions made by posting list splits about page space
+accounting, though, so it's not clear how compression could be integrated
+with nbtree.  Besides, posting list compression does not offer a compelling
+trade-off for nbtree, since in general nbtree is optimized for consistent
+performance with many concurrent readers and writers.  A major goal of
+nbtree's lazy approach to deduplication is to limit the performance impact
+of deduplication with random updates.  Even concurrent append-only inserts
+of the same key value will tend to have inserts of individual index tuples
+in an order that doesn't quite match heap TID order.  In general, delaying
+deduplication avoids many unnecessary posting list splits, and minimizes
+page level fragmentation.
+
 Notes About Data Representation
 -------------------------------
 
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
new file mode 100644
index 0000000000..1dbc32b70a
--- /dev/null
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -0,0 +1,715 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtdedup.c
+ *	  Deduplicate items in Lehman and Yao btrees for Postgres.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/nbtree/nbtdedup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/nbtxlog.h"
+#include "miscadmin.h"
+#include "utils/rel.h"
+
+
+/*
+ * Try to deduplicate items to free at least enough space to avoid a page
+ * split.  This function should be called during insertion, only after LP_DEAD
+ * items were removed by _bt_vacuum_one_page() to prevent a page split.
+ * (We'll have to kill LP_DEAD items here when the page's BTP_HAS_GARBAGE hint
+ * was not set, but that should be rare.)
+ *
+ * The strategy for !checkingunique callers is to perform as much
+ * deduplication as possible to free as much space as possible now, since
+ * making it harder to set LP_DEAD bits is considered an acceptable price for
+ * not having to deduplicate the same page many times.  It is unlikely that
+ * the items on the page will have their LP_DEAD bit set in the future, since
+ * that hasn't happened before now (besides, entire posting lists can still
+ * have their LP_DEAD bit set).
+ *
+ * The strategy for checkingunique callers is rather different, since the
+ * overall goal is different.  Deduplication cooperates with and enhances
+ * garbage collection, especially the LP_DEAD bit setting that takes place in
+ * _bt_check_unique().  Deduplication does as little as possible while still
+ * preventing a page split for caller, since it's less likely that posting
+ * lists will have their LP_DEAD bit set.  Deduplication avoids creating new
+ * posting lists with only two heap TIDs, and also avoids creating new posting
+ * lists from an existing posting list.  Deduplication is only useful when it
+ * delays a page split long enough for garbage collection to prevent the page
+ * split altogether.  checkingunique deduplication can make all the difference
+ * in cases where VACUUM keeps up with dead index tuples, but "recently dead"
+ * index tuples are still numerous enough to cause page splits that are truly
+ * unnecessary.
+ *
+ * Note: If newitem contains NULL values in key attributes, caller will be
+ * !checkingunique even when rel is a unique index.  The page in question will
+ * usually have many existing items with NULLs.
+ */
+void
+_bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+				   IndexTuple newitem, Size newitemsz, bool checkingunique)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buffer);
+	BTPageOpaque oopaque;
+	BTDedupState state = NULL;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+	OffsetNumber deletable[MaxIndexTuplesPerPage];
+	bool		minimal = checkingunique;
+	int			ndeletable = 0;
+	Size		pagesaving = 0;
+	int			count = 0;
+	bool		singlevalue = false;
+
+	oopaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	/* init deduplication state needed to build posting tuples */
+	state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+	state->rel = rel;
+
+	state->maxitemsize = BTMaxItemSize(page);
+	state->newitem = newitem;
+	state->checkingunique = checkingunique;
+	state->skippedbase = InvalidOffsetNumber;
+	/* Metadata about current pending posting list */
+	state->htids = NULL;
+	state->nhtids = 0;
+	state->nitems = 0;
+	state->alltupsize = 0;
+	state->overlap = false;
+	/* Metadata about based tuple of current pending posting list */
+	state->base = NULL;
+	state->baseoff = InvalidOffsetNumber;
+	state->basetupsize = 0;
+
+	minoff = P_FIRSTDATAKEY(oopaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/*
+	 * Delete dead tuples if any. We cannot simply skip them in the cycle
+	 * below, because it's necessary to generate special Xlog record
+	 * containing such tuples to compute latestRemovedXid on a standby server
+	 * later.
+	 *
+	 * This should not affect performance, since it only can happen in a rare
+	 * situation when BTP_HAS_GARBAGE flag was not set and _bt_vacuum_one_page
+	 * was not called, or _bt_vacuum_one_page didn't remove all dead items.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+
+		if (ItemIdIsDead(itemid))
+			deletable[ndeletable++] = offnum;
+	}
+
+	if (ndeletable > 0)
+	{
+		/*
+		 * Skip duplication in rare cases where there were LP_DEAD items
+		 * encountered here when that frees sufficient space for caller to
+		 * avoid a page split
+		 */
+		_bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+		if (PageGetFreeSpace(page) >= newitemsz)
+		{
+			pfree(state);
+			return;
+		}
+
+		/* Continue with deduplication */
+		minoff = P_FIRSTDATAKEY(oopaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+	}
+
+	/* Make sure that new page won't have garbage flag set */
+	oopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+	newitemsz += sizeof(ItemIdData);
+	/* Conservatively size array */
+	state->htids = palloc(state->maxitemsize);
+
+	/*
+	 * Determine if a "single value" strategy page split is likely to occur
+	 * shortly after deduplication finishes.  It should be possible for the
+	 * single value split to find a split point that packs the left half of
+	 * the split BTREE_SINGLEVAL_FILLFACTOR% full.
+	 */
+	if (!checkingunique)
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+
+		itemid = PageGetItemId(page, minoff);
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		if (_bt_keep_natts_fast(rel, newitem, itup) > natts)
+		{
+			itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
+			itup = (IndexTuple) PageGetItem(page, itemid);
+
+			/*
+			 * Use different strategy if future page split likely to need to
+			 * use "single value" strategy
+			 */
+			if (_bt_keep_natts_fast(rel, newitem, itup) > natts)
+				singlevalue = true;
+		}
+	}
+
+	/*
+	 * Iterate over tuples on the page, try to deduplicate them into posting
+	 * lists and insert into new page.  NOTE: It's essential to reassess the
+	 * max offset on each iteration, since it will change as items are
+	 * deduplicated.
+	 */
+	offnum = minoff;
+retry:
+	while (offnum <= PageGetMaxOffsetNumber(page))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(!ItemIdIsDead(itemid));
+
+		if (state->nitems == 0)
+		{
+			/*
+			 * No previous/base tuple for the data item -- use the data item
+			 * as base tuple of pending posting list
+			 */
+			_bt_dedup_start_pending(state, itup, offnum);
+		}
+		else if (_bt_keep_natts_fast(rel, state->base, itup) > natts &&
+				 _bt_dedup_save_htid(state, itup))
+		{
+			/*
+			 * Tuple is equal to base tuple of pending posting list.  Heap
+			 * TID(s) for itup have been saved in state.  The next iteration
+			 * will also end up here if it's possible to merge the next tuple
+			 * into the same pending posting list.
+			 */
+		}
+		else
+		{
+			/*
+			 * Tuple is not equal to pending posting list tuple, or
+			 * _bt_dedup_save_htid() opted to not merge current item into
+			 * pending posting list for some other reason (e.g., adding more
+			 * TIDs would have caused posting list to exceed BTMaxItemSize()
+			 * limit).
+			 *
+			 * If state contains pending posting list with more than one item,
+			 * form new posting tuple, and update the page.  Otherwise, reset
+			 * the state and move on.
+			 */
+			pagesaving += _bt_dedup_finish_pending(buffer, state,
+												   RelationNeedsWAL(rel));
+
+			count++;
+
+			/*
+			 * When caller is a checkingunique caller and we have deduplicated
+			 * enough to avoid a page split, do minimal deduplication in case
+			 * the remaining items are about to be marked dead within
+			 * _bt_check_unique().
+			 */
+			if (minimal && pagesaving >= newitemsz)
+				break;
+
+			/*
+			 * Consider special steps when a future page split of the leaf
+			 * page is likely to occur using nbtsplitloc.c's "single value"
+			 * strategy
+			 */
+			if (singlevalue)
+			{
+				/*
+				 * Adjust maxitemsize so that there isn't a third and final
+				 * 1/3 of a page width tuple that fills the page to capacity.
+				 * The third tuple produced should be smaller than the first
+				 * two by an amount equal to the free space that nbtsplitloc.c
+				 * is likely to want to leave behind when the page it split.
+				 * When there are 3 posting lists on the page, then we end
+				 * deduplication.  Remaining tuples on the page can be
+				 * deduplicated later, when they're on the new right sibling
+				 * of this page, and the new sibling page needs to be split in
+				 * turn.
+				 *
+				 * Note that it doesn't matter if there are items on the page
+				 * that were already 1/3 of a page during current pass;
+				 * they'll still count as the first two posting list tuples.
+				 */
+				if (count == 2)
+				{
+					Size		leftfree;
+
+					/* This calculation needs to match nbtsplitloc.c */
+					leftfree = PageGetPageSize(page) - SizeOfPageHeaderData -
+						MAXALIGN(sizeof(BTPageOpaqueData));
+					/* Subtract predicted size of new high key */
+					leftfree -= newitemsz + MAXALIGN(sizeof(ItemPointerData));
+
+					/*
+					 * Reduce maxitemsize by an amount equal to target free
+					 * space on left half of page
+					 */
+					state->maxitemsize -= leftfree *
+						((100 - BTREE_SINGLEVAL_FILLFACTOR) / 100.0);
+				}
+				else if (count == 3)
+					break;
+			}
+
+			/*
+			 * Next iteration starts immediately after base tuple offset (this
+			 * will be the next offset on the page when we didn't modify the
+			 * page)
+			 */
+			offnum = state->baseoff;
+		}
+
+		offnum = OffsetNumberNext(offnum);
+	}
+
+	/* Handle the last item when pending posting list is not empty */
+	if (state->nitems != 0)
+	{
+		pagesaving += _bt_dedup_finish_pending(buffer, state,
+											   RelationNeedsWAL(rel));
+		count++;
+	}
+
+	if (pagesaving < newitemsz && state->skippedbase != InvalidOffsetNumber)
+	{
+		/*
+		 * Didn't free enough space for new item in first checkingunique pass.
+		 * Try making a second pass over the page, this time starting from the
+		 * first candidate posting list base offset that was skipped over in
+		 * the first pass (only do a second pass when this actually happened).
+		 *
+		 * The second pass over the page may deduplicate items that were
+		 * initially passed over due to concerns about limiting the
+		 * effectiveness of LP_DEAD bit setting within _bt_check_unique().
+		 * Note that the second pass will still stop deduplicating as soon as
+		 * enough space has been freed to avoid an immediate page split.
+		 */
+		Assert(state->checkingunique);
+		offnum = state->skippedbase;
+
+		state->checkingunique = false;
+		state->skippedbase = InvalidOffsetNumber;
+		state->alltupsize = 0;
+		state->nitems = 0;
+		state->base = NULL;
+		state->baseoff = InvalidOffsetNumber;
+		state->basetupsize = 0;
+		goto retry;
+	}
+
+	/* Local space accounting should agree with page accounting */
+	Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
+
+	/* be tidy */
+	pfree(state->htids);
+	pfree(state);
+}
+
+/*
+ * Create a new pending posting list tuple based on caller's tuple.
+ *
+ * Every tuple processed by the deduplication routines either becomes the base
+ * tuple for a posting list, or gets its heap TID(s) accepted into a pending
+ * posting list.  A tuple that starts out as the base tuple for a posting list
+ * will only actually be rewritten within _bt_dedup_finish_pending() when
+ * there was at least one successful call to _bt_dedup_save_htid().
+ */
+void
+_bt_dedup_start_pending(BTDedupState state, IndexTuple base,
+						OffsetNumber baseoff)
+{
+	Assert(state->nhtids == 0);
+	Assert(state->nitems == 0);
+
+	/*
+	 * Copy heap TIDs from new base tuple for new candidate posting list into
+	 * ipd array.  Assume that we'll eventually create a new posting tuple by
+	 * merging later tuples with this existing one, though we may not.
+	 */
+	if (!BTreeTupleIsPosting(base))
+	{
+		memcpy(state->htids, base, sizeof(ItemPointerData));
+		state->nhtids = 1;
+		/* Save size of tuple without any posting list */
+		state->basetupsize = IndexTupleSize(base);
+	}
+	else
+	{
+		int			nposting;
+
+		nposting = BTreeTupleGetNPosting(base);
+		memcpy(state->htids, BTreeTupleGetPosting(base),
+			   sizeof(ItemPointerData) * nposting);
+		state->nhtids = nposting;
+		/* Save size of tuple without any posting list */
+		state->basetupsize = BTreeTupleGetPostingOffset(base);
+	}
+
+	/*
+	 * Save new base tuple itself -- it'll be needed if we actually create a
+	 * new posting list from new pending posting list.
+	 *
+	 * Must maintain size of all tuples (including line pointer overhead) to
+	 * calculate space savings on page within _bt_dedup_finish_pending().
+	 * Also, save number of base tuple logical tuples so that we can save
+	 * cycles in the common case where an existing posting list can't or won't
+	 * be merged with other tuples on the page.
+	 */
+	state->nitems = 1;
+	state->base = base;
+	state->baseoff = baseoff;
+	state->alltupsize = MAXALIGN(IndexTupleSize(base)) + sizeof(ItemIdData);
+	/* Also save baseoff in pending state for interval */
+	state->interval.baseoff = state->baseoff;
+	state->overlap = false;
+	if (state->newitem)
+	{
+		/* Might overlap with new item -- mark it as possible if it is */
+		if (BTreeTupleGetHeapTID(base) < BTreeTupleGetHeapTID(state->newitem))
+			state->overlap = true;
+	}
+}
+
+/*
+ * Save itup heap TID(s) into pending posting list where possible.
+ *
+ * Returns bool indicating if the pending posting list managed by state has
+ * itup's heap TID(s) saved.  When this is false, enlarging the pending
+ * posting list by the required amount would exceed the maxitemsize limit, so
+ * caller must finish the pending posting list tuple.  (Generally itup becomes
+ * the base tuple of caller's new pending posting list).
+ */
+bool
+_bt_dedup_save_htid(BTDedupState state, IndexTuple itup)
+{
+	int			nhtids;
+	ItemPointer htids;
+	Size		mergedtupsz;
+
+	if (!BTreeTupleIsPosting(itup))
+	{
+		nhtids = 1;
+		htids = &itup->t_tid;
+	}
+	else
+	{
+		nhtids = BTreeTupleGetNPosting(itup);
+		htids = BTreeTupleGetPosting(itup);
+	}
+
+	/*
+	 * Don't append (have caller finish pending posting list as-is) if
+	 * appending heap TID(s) from itup would put us over limit
+	 */
+	mergedtupsz = MAXALIGN(state->basetupsize +
+						   (state->nhtids + nhtids) *
+						   sizeof(ItemPointerData));
+
+	if (mergedtupsz > state->maxitemsize)
+		return false;
+
+	/* Don't merge existing posting lists with checkingunique */
+	if (state->checkingunique &&
+		(BTreeTupleIsPosting(state->base) || nhtids > 1))
+	{
+		/* May begin here if second pass over page is required */
+		if (state->skippedbase == InvalidOffsetNumber)
+			state->skippedbase = state->baseoff;
+		return false;
+	}
+
+	if (state->overlap)
+	{
+		if (BTreeTupleGetMaxHeapTID(itup) > BTreeTupleGetHeapTID(state->newitem))
+		{
+			/*
+			 * newitem has heap TID in the range of the would-be new posting
+			 * list.  Avoid an immediate posting list split for caller.
+			 */
+			if (_bt_keep_natts_fast(state->rel, state->newitem, itup) >
+				IndexRelationGetNumberOfAttributes(state->rel))
+			{
+				state->newitem = NULL;	/* avoid unnecessary comparisons */
+				return false;
+			}
+		}
+	}
+
+	/*
+	 * Save heap TIDs to pending posting list tuple -- itup can be merged into
+	 * pending posting list
+	 */
+	state->nitems++;
+	memcpy(state->htids + state->nhtids, htids,
+		   sizeof(ItemPointerData) * nhtids);
+	state->nhtids += nhtids;
+	state->alltupsize += MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
+
+	return true;
+}
+
+/*
+ * Finalize pending posting list tuple, and add it to the page.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * Returns space saving from deduplicating to make a new posting list tuple.
+ * Note that this includes line pointer overhead.  This is zero in the case
+ * where no deduplication was possible.
+ */
+Size
+_bt_dedup_finish_pending(Buffer buffer, BTDedupState state, bool need_wal)
+{
+	Size		spacesaving = 0;
+	Page		page = BufferGetPage(buffer);
+	int			minimum = 2;
+
+	Assert(state->nitems > 0);
+	Assert(state->nitems <= state->nhtids);
+	Assert(state->interval.baseoff == state->baseoff);
+
+	/*
+	 * Only create a posting list when at least 3 heap TIDs will appear in the
+	 * checkingunique case (checkingunique strategy won't merge existing
+	 * posting list tuples, so we know that the number of items here must also
+	 * be the total number of heap TIDs).  Creating a new posting lists with
+	 * only two heap TIDs won't even save enough space to fit another
+	 * duplicate with the same key as the posting list.  This is a bad
+	 * trade-off if there is a chance that the LP_DEAD bit can be set for
+	 * either existing tuple by putting off deduplication.
+	 *
+	 * (Note that a second pass over the page can deduplicate the item if that
+	 * is truly the only way to avoid a page split for checkingunique caller)
+	 */
+	Assert(!state->checkingunique || state->nitems == 1 ||
+		   state->nhtids == state->nitems);
+	if (state->checkingunique)
+	{
+		minimum = 3;
+		/* May begin here if second pass over page is required */
+		if (state->nitems == 2 && state->skippedbase == InvalidOffsetNumber)
+			state->skippedbase = state->baseoff;
+	}
+
+	if (state->nitems >= minimum)
+	{
+		IndexTuple	final;
+		Size		finalsz;
+		OffsetNumber offnum;
+		OffsetNumber deletable[MaxOffsetNumber];
+		int			ndeletable = 0;
+
+		/* find all tuples that will be replaced with this new posting tuple */
+		for (offnum = state->baseoff;
+			 offnum < state->baseoff + state->nitems;
+			 offnum = OffsetNumberNext(offnum))
+			deletable[ndeletable++] = offnum;
+
+		/* Form a tuple with a posting list */
+		final = _bt_form_posting(state->base, state->htids, state->nhtids);
+		finalsz = IndexTupleSize(final);
+		spacesaving = state->alltupsize - (finalsz + sizeof(ItemIdData));
+		/* Must have saved some space */
+		Assert(spacesaving > 0 && spacesaving < BLCKSZ);
+
+		/* Save final number of items for posting list */
+		state->interval.nitems = state->nitems;
+
+		Assert(finalsz <= state->maxitemsize);
+		Assert(finalsz == MAXALIGN(IndexTupleSize(final)));
+
+		START_CRIT_SECTION();
+
+		/* Delete items to replace */
+		PageIndexMultiDelete(page, deletable, ndeletable);
+		/* Insert posting tuple */
+		if (PageAddItem(page, (Item) final, finalsz, state->baseoff, false,
+						false) == InvalidOffsetNumber)
+			elog(ERROR, "deduplication failed to add tuple to page");
+
+		MarkBufferDirty(buffer);
+
+		/* Log deduplicated items */
+		if (need_wal)
+		{
+			XLogRecPtr	recptr;
+			xl_btree_dedup xlrec_dedup;
+
+			xlrec_dedup.baseoff = state->interval.baseoff;
+			xlrec_dedup.nitems = state->interval.nitems;
+
+			XLogBeginInsert();
+			XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+			XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+			recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP_PAGE);
+
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
+
+		pfree(final);
+	}
+
+	/* Reset state for next pending posting list */
+	state->nhtids = 0;
+	state->nitems = 0;
+	state->alltupsize = 0;
+
+	return spacesaving;
+}
+
+/*
+ * Build a posting list tuple from a "base" index tuple and a list of heap
+ * TIDs for posting list.
+ *
+ * Caller's "htids" array must be sorted in ascending order.  Any heap TIDs
+ * from caller's base tuple will not appear in returned posting list.
+ *
+ * If nhtids == 1, builds a non-posting tuple (posting list tuples can never
+ * have a single heap TID).
+ */
+IndexTuple
+_bt_form_posting(IndexTuple tuple, ItemPointer htids, int nhtids)
+{
+	uint32		keysize,
+				newsize = 0;
+	IndexTuple	itup;
+
+	/* We only need key part of the tuple */
+	if (BTreeTupleIsPosting(tuple))
+		keysize = BTreeTupleGetPostingOffset(tuple);
+	else
+		keysize = IndexTupleSize(tuple);
+
+	Assert(nhtids > 0 && nhtids <= PG_UINT16_MAX);
+
+	/* Add space needed for posting list */
+	if (nhtids > 1)
+		newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nhtids;
+	else
+		newsize = keysize;
+
+	newsize = MAXALIGN(newsize);
+	itup = palloc0(newsize);
+	memcpy(itup, tuple, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+
+	if (nhtids > 1)
+	{
+		/* Form posting tuple, fill posting fields */
+
+		itup->t_info |= INDEX_ALT_TID_MASK;
+		BTreeSetPostingMeta(itup, nhtids, SHORTALIGN(keysize));
+		/* Copy posting list into the posting tuple */
+		memcpy(BTreeTupleGetPosting(itup), htids,
+			   sizeof(ItemPointerData) * nhtids);
+
+#ifdef USE_ASSERT_CHECKING
+		{
+			/* Assert that htid array is sorted and has unique TIDs */
+			ItemPointerData last;
+			ItemPointer current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+				current = BTreeTupleGetPostingN(itup, i);
+				Assert(ItemPointerCompare(current, &last) > 0);
+				ItemPointerCopy(current, &last);
+			}
+		}
+#endif
+	}
+	else
+	{
+		/* To finish building of a non-posting tuple, copy TID from htids */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		ItemPointerCopy(htids, &itup->t_tid);
+	}
+
+	return itup;
+}
+
+/*
+ * Prepare for a posting list split by swapping heap TID in newitem with heap
+ * TID from original posting list (the 'oposting' heap TID located at offset
+ * 'postingoff').
+ *
+ * Returns new posting list tuple, which is palloc()'d in caller's context.
+ * This is guaranteed to be the same size as 'oposting'.  Modified version of
+ * newitem is what caller actually inserts inside the critical section that
+ * also performs an in-place update of posting list.
+ *
+ * Explicit WAL-logging of newitem must use the original version of newitem in
+ * order to make it possible for our nbtxlog.c callers to correctly REDO
+ * original steps.  This approach avoids any explicit WAL-logging of a posting
+ * list tuple.  This is important because posting lists are often much larger
+ * than plain tuples.
+ *
+ * Caller should avoid assuming that the IndexTuple-wise key representation in
+ * newitem is bitwise equal to the representation used within oposting.  Note,
+ * in particular, that one may even be larger than the other.  This could
+ * occur due to differences in TOAST input state, for example.
+ */
+IndexTuple
+_bt_swap_posting(IndexTuple newitem, IndexTuple oposting, int postingoff)
+{
+	int			nhtids;
+	char	   *replacepos;
+	char	   *rightpos;
+	Size		nbytes;
+	IndexTuple	nposting;
+
+	nhtids = BTreeTupleGetNPosting(oposting);
+	Assert(postingoff > 0 && postingoff < nhtids);
+
+	nposting = CopyIndexTuple(oposting);
+	replacepos = (char *) BTreeTupleGetPostingN(nposting, postingoff);
+	rightpos = replacepos + sizeof(ItemPointerData);
+	nbytes = (nhtids - postingoff - 1) * sizeof(ItemPointerData);
+
+	/*
+	 * Move item pointers in posting list to make a gap for the new item's
+	 * heap TID (shift TIDs one place to the right, losing original rightmost
+	 * TID)
+	 */
+	memmove(rightpos, replacepos, nbytes);
+
+	/* Fill the gap with the TID of the new item */
+	ItemPointerCopy(&newitem->t_tid, (ItemPointer) replacepos);
+
+	/* Copy original posting list's rightmost TID into new item */
+	ItemPointerCopy(BTreeTupleGetPostingN(oposting, nhtids - 1),
+					&newitem->t_tid);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(nposting),
+							  BTreeTupleGetHeapTID(newitem)) < 0);
+	Assert(BTreeTupleGetNPosting(oposting) == BTreeTupleGetNPosting(nposting));
+
+	return nposting;
+}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b93b2a0ffd..d816c45f2c 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -28,6 +28,8 @@
 /* Minimum tree height for application of fastpath optimization */
 #define BTREE_FASTPATH_MIN_LEVEL	2
 
+/* GUC parameter */
+int			btree_deduplication = DEDUP_NONUNIQUE;
 
 static Buffer _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
 
@@ -47,10 +49,12 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   BTStack stack,
 						   IndexTuple itup,
 						   OffsetNumber newitemoff,
+						   int postingoff,
 						   bool split_only_page);
 static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 						Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
-						IndexTuple newitem);
+						IndexTuple newitem, IndexTuple orignewitem,
+						IndexTuple nposting, uint16 postingoff);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 							  BTStack stack, bool is_root, bool is_only);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
@@ -61,7 +65,8 @@ static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
  *	_bt_doinsert() -- Handle insertion of a single index tuple in the tree.
  *
  *		This routine is called by the public interface routine, btinsert.
- *		By here, itup is filled in, including the TID.
+ *		By here, itup is filled in, including the TID.  Caller should be
+ *		prepared for us to scribble on 'itup'.
  *
  *		If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
  *		will allow duplicates.  Otherwise (UNIQUE_CHECK_YES or
@@ -125,6 +130,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	insertstate.itup_key = itup_key;
 	insertstate.bounds_valid = false;
 	insertstate.buf = InvalidBuffer;
+	insertstate.postingoff = 0;
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -300,7 +306,7 @@ top:
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
 		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
-					   itup, newitemoff, false);
+					   itup, newitemoff, insertstate.postingoff, false);
 	}
 	else
 	{
@@ -353,6 +359,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 	BTPageOpaque opaque;
 	Buffer		nbuf = InvalidBuffer;
 	bool		found = false;
+	bool		inposting = false;
+	bool		prev_all_dead = true;
+	int			curposti = 0;
 
 	/* Assume unique until we find a duplicate */
 	*is_unique = true;
@@ -374,6 +383,11 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 
 	/*
 	 * Scan over all equal tuples, looking for live conflicts.
+	 *
+	 * Note that each iteration of the loop processes one heap TID, not one
+	 * index tuple.  The page offset number won't be advanced for iterations
+	 * which process heap TIDs from posting list tuples until the last such
+	 * heap TID for the posting list (curposti will be advanced instead).
 	 */
 	Assert(!insertstate->bounds_valid || insertstate->low == offset);
 	Assert(!itup_key->anynullkeys);
@@ -435,7 +449,27 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 
 				/* okay, we gotta fetch the heap tuple ... */
 				curitup = (IndexTuple) PageGetItem(page, curitemid);
-				htid = curitup->t_tid;
+
+				/*
+				 * decide if this is the first heap TID in tuple we'll
+				 * process, or if we should continue to process current
+				 * posting list
+				 */
+				if (!BTreeTupleIsPosting(curitup))
+				{
+					htid = curitup->t_tid;
+					inposting = false;
+				}
+				else if (!inposting)
+				{
+					/* First heap TID in posting list */
+					inposting = true;
+					prev_all_dead = true;
+					curposti = 0;
+				}
+
+				if (inposting)
+					htid = *BTreeTupleGetPostingN(curitup, curposti);
 
 				/*
 				 * If we are doing a recheck, we expect to find the tuple we
@@ -511,8 +545,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					 * not part of this chain because it had a different index
 					 * entry.
 					 */
-					htid = itup->t_tid;
-					if (table_index_fetch_tuple_check(heapRel, &htid,
+					if (table_index_fetch_tuple_check(heapRel, &itup->t_tid,
 													  SnapshotSelf, NULL))
 					{
 						/* Normal case --- it's still live */
@@ -570,12 +603,14 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 													RelationGetRelationName(rel))));
 					}
 				}
-				else if (all_dead)
+				else if (all_dead && (!inposting ||
+									  (prev_all_dead &&
+									   curposti == BTreeTupleGetNPosting(curitup) - 1)))
 				{
 					/*
-					 * The conflicting tuple (or whole HOT chain) is dead to
-					 * everyone, so we may as well mark the index entry
-					 * killed.
+					 * The conflicting tuple (or all HOT chains pointed to by
+					 * all posting list TIDs) is dead to everyone, so mark the
+					 * index entry killed.
 					 */
 					ItemIdMarkDead(curitemid);
 					opaque->btpo_flags |= BTP_HAS_GARBAGE;
@@ -589,14 +624,29 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					else
 						MarkBufferDirtyHint(insertstate->buf, true);
 				}
+
+				/*
+				 * Remember if posting list tuple has even a single HOT chain
+				 * whose members are not all dead
+				 */
+				if (!all_dead && inposting)
+					prev_all_dead = false;
 			}
 		}
 
-		/*
-		 * Advance to next tuple to continue checking.
-		 */
-		if (offset < maxoff)
+		if (inposting && curposti < BTreeTupleGetNPosting(curitup) - 1)
+		{
+			/* Advance to next TID in same posting list */
+			curposti++;
+			continue;
+		}
+		else if (offset < maxoff)
+		{
+			/* Advance to next tuple */
+			curposti = 0;
+			inposting = false;
 			offset = OffsetNumberNext(offset);
+		}
 		else
 		{
 			int			highkeycmp;
@@ -621,6 +671,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					elog(ERROR, "fell off the end of index \"%s\"",
 						 RelationGetRelationName(rel));
 			}
+			curposti = 0;
+			inposting = false;
 			maxoff = PageGetMaxOffsetNumber(page);
 			offset = P_FIRSTDATAKEY(opaque);
 			/* Don't invalidate binary search bounds */
@@ -689,6 +741,7 @@ _bt_findinsertloc(Relation rel,
 	BTScanInsert itup_key = insertstate->itup_key;
 	Page		page = BufferGetPage(insertstate->buf);
 	BTPageOpaque lpageop;
+	OffsetNumber location;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -751,13 +804,25 @@ _bt_findinsertloc(Relation rel,
 
 		/*
 		 * If the target page is full, see if we can obtain enough space by
-		 * erasing LP_DEAD items
+		 * erasing LP_DEAD items.  If that doesn't work out, and if the index
+		 * deduplication is both possible and enabled, try deduplication.
 		 */
-		if (PageGetFreeSpace(page) < insertstate->itemsz &&
-			P_HAS_GARBAGE(lpageop))
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
 		{
-			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
-			insertstate->bounds_valid = false;
+			if (P_HAS_GARBAGE(lpageop))
+			{
+				_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+				insertstate->bounds_valid = false;
+			}
+
+			if (insertstate->itup_key->safededup && BTGetUseDedup(rel) &&
+				PageGetFreeSpace(page) < insertstate->itemsz)
+			{
+				_bt_dedup_one_page(rel, insertstate->buf, heapRel,
+								   insertstate->itup, insertstate->itemsz,
+								   checkingunique);
+				insertstate->bounds_valid = false;
+			}
 		}
 	}
 	else
@@ -839,7 +904,38 @@ _bt_findinsertloc(Relation rel,
 	Assert(P_RIGHTMOST(lpageop) ||
 		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
 
-	return _bt_binsrch_insert(rel, insertstate);
+	location = _bt_binsrch_insert(rel, insertstate);
+
+	/*
+	 * Insertion is not prepared for the case where an LP_DEAD posting list
+	 * tuple must be split.  In the unlikely event that this happens, call
+	 * _bt_dedup_one_page() to force it to kill all LP_DEAD items.
+	 */
+	if (unlikely(insertstate->postingoff == -1))
+	{
+		Assert(insertstate->itup_key->safededup);
+
+		/*
+		 * Don't check if the option is enabled, since no actual deduplication
+		 * will be done, just cleanup.
+		 */
+		_bt_dedup_one_page(rel, insertstate->buf, heapRel, insertstate->itup,
+						   0, checkingunique);
+		Assert(!P_HAS_GARBAGE(lpageop));
+
+		/* Must reset insertstate ahead of new _bt_binsrch_insert() call */
+		insertstate->bounds_valid = false;
+		insertstate->postingoff = 0;
+		location = _bt_binsrch_insert(rel, insertstate);
+
+		/*
+		 * Might still have to split some other posting list now, but that
+		 * should never be LP_DEAD
+		 */
+		Assert(insertstate->postingoff >= 0);
+	}
+
+	return location;
 }
 
 /*
@@ -905,10 +1001,12 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		This recursive procedure does the following things:
  *
+ *			+  if necessary, splits an existing posting list on page.
+ *			   This is only needed when 'postingoff' is non-zero.
  *			+  if necessary, splits the target page, using 'itup_key' for
  *			   suffix truncation on leaf pages (caller passes NULL for
  *			   non-leaf pages).
- *			+  inserts the tuple.
+ *			+  inserts the new tuple (could be from split posting list).
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
  *			   right using information stored in the parent stack).
@@ -918,7 +1016,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		On entry, we must have the correct buffer in which to do the
  *		insertion, and the buffer must be pinned and write-locked.  On return,
- *		we will have dropped both the pin and the lock on the buffer.
+ *		we will have dropped both the pin and the lock on the buffer.  Caller
+ *		should be prepared for us to scribble on 'itup'.
  *
  *		This routine only performs retail tuple insertions.  'itup' should
  *		always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +1035,15 @@ _bt_insertonpg(Relation rel,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
+			   int postingoff,
 			   bool split_only_page)
 {
 	Page		page;
 	BTPageOpaque lpageop;
 	Size		itemsz;
+	IndexTuple	oposting;
+	IndexTuple	origitup = NULL;
+	IndexTuple	nposting = NULL;
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1057,8 @@ _bt_insertonpg(Relation rel,
 	Assert(P_ISLEAF(lpageop) ||
 		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
+	/* retail insertions of posting list tuples are disallowed */
+	Assert(!BTreeTupleIsPosting(itup));
 
 	/* The caller should've finished any incomplete splits already. */
 	if (P_INCOMPLETE_SPLIT(lpageop))
@@ -964,6 +1069,39 @@ _bt_insertonpg(Relation rel,
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
+	/*
+	 * Do we need to split an existing posting list item?
+	 */
+	if (postingoff != 0)
+	{
+		ItemId		itemid = PageGetItemId(page, newitemoff);
+
+		/*
+		 * The new tuple is a duplicate with a heap TID that falls inside the
+		 * range of an existing posting list tuple on a leaf page.  Prepare to
+		 * split an existing posting list by swapping new item's heap TID with
+		 * the rightmost heap TID from original posting list, and generating a
+		 * new version of the posting list that has new item's heap TID.
+		 *
+		 * Posting list splits work by modifying the overlapping posting list
+		 * as part of the same atomic operation that inserts the "new item".
+		 * The space accounting is kept simple, since it does not need to
+		 * consider posting list splits at all (this is particularly important
+		 * for the case where we also have to split the page).  Overwriting
+		 * the posting list with its post-split version is treated as an extra
+		 * step in either the insert or page split critical section.
+		 */
+		Assert(P_ISLEAF(lpageop) && !ItemIdIsDead(itemid));
+		oposting = (IndexTuple) PageGetItem(page, itemid);
+
+		/* save a copy of itup with unchanged TID for xlog record */
+		origitup = CopyIndexTuple(itup);
+		nposting = _bt_swap_posting(itup, oposting, postingoff);
+
+		/* Alter offset so that it goes after existing posting list */
+		newitemoff = OffsetNumberNext(newitemoff);
+	}
+
 	/*
 	 * Do we need to split the page to fit the item on it?
 	 *
@@ -996,7 +1134,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+						 origitup, nposting, postingoff);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1075,6 +1214,13 @@ _bt_insertonpg(Relation rel,
 			elog(PANIC, "failed to add new item to block %u in index \"%s\"",
 				 itup_blkno, RelationGetRelationName(rel));
 
+		/*
+		 * Posting list split requires an in-place update of the existing
+		 * posting list
+		 */
+		if (nposting)
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
 		MarkBufferDirty(buf);
 
 		if (BufferIsValid(metabuf))
@@ -1120,8 +1266,19 @@ _bt_insertonpg(Relation rel,
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
 
-			if (P_ISLEAF(lpageop))
+			if (P_ISLEAF(lpageop) && postingoff == 0)
+			{
+				/* Simple leaf insert */
 				xlinfo = XLOG_BTREE_INSERT_LEAF;
+			}
+			else if (postingoff != 0)
+			{
+				/*
+				 * Leaf insert with posting list split.  Must include
+				 * postingoff field before newitem/orignewitem.
+				 */
+				xlinfo = XLOG_BTREE_INSERT_POST;
+			}
 			else
 			{
 				/*
@@ -1144,6 +1301,7 @@ _bt_insertonpg(Relation rel,
 				xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 				xlmeta.last_cleanup_num_heap_tuples =
 					metad->btm_last_cleanup_num_heap_tuples;
+				xlmeta.btm_safededup = metad->btm_safededup;
 
 				XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 				XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));
@@ -1152,7 +1310,28 @@ _bt_insertonpg(Relation rel,
 			}
 
 			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
-			XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+
+			/*
+			 * We always write newitem to the page, but when there is an
+			 * original newitem due to a posting list split then we log the
+			 * original item instead.  REDO routine must reconstruct the final
+			 * newitem at the same time it reconstructs nposting.
+			 */
+			if (postingoff == 0)
+				XLogRegisterBufData(0, (char *) itup,
+									IndexTupleSize(itup));
+			else
+			{
+				/*
+				 * Must explicitly log posting off before newitem in case of
+				 * posting list split.
+				 */
+				uint16		upostingoff = postingoff;
+
+				XLogRegisterBufData(0, (char *) &upostingoff, sizeof(uint16));
+				XLogRegisterBufData(0, (char *) origitup,
+									IndexTupleSize(origitup));
+			}
 
 			recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
@@ -1194,6 +1373,13 @@ _bt_insertonpg(Relation rel,
 			_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
 			RelationSetTargetBlock(rel, cachedBlock);
 	}
+
+	/* be tidy */
+	if (postingoff != 0)
+	{
+		pfree(nposting);
+		pfree(origitup);
+	}
 }
 
 /*
@@ -1209,12 +1395,25 @@ _bt_insertonpg(Relation rel,
  *		This function will clear the INCOMPLETE_SPLIT flag on it, and
  *		release the buffer.
  *
+ *		orignewitem, nposting, and postingoff are needed when an insert of
+ *		orignewitem results in both a posting list split and a page split.
+ *		newitem and nposting are replacements for orignewitem and the
+ *		existing posting list on the page respectively.  These extra
+ *		posting list split details are used here in the same way as they
+ *		are used in the more common case where a posting list split does
+ *		not coincide with a page split.  We need to deal with posting list
+ *		splits directly in order to ensure that everything that follows
+ *		from the insert of orignewitem is handled as a single atomic
+ *		operation (though caller's insert of a new pivot/downlink into
+ *		parent page will still be a separate operation).
+ *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
  */
 static Buffer
 _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+		  IndexTuple orignewitem, IndexTuple nposting, uint16 postingoff)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1236,12 +1435,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	OffsetNumber firstright;
 	OffsetNumber maxoff;
 	OffsetNumber i;
+	OffsetNumber replacepostingoff = InvalidOffsetNumber;
 	bool		newitemonleft,
 				isleaf;
 	IndexTuple	lefthikey;
 	int			indnatts = IndexRelationGetNumberOfAttributes(rel);
 	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
+	/*
+	 * Determine offset number of existing posting list on page when a split
+	 * of a posting list needs to take place as the page is split
+	 */
+	if (nposting != NULL)
+	{
+		Assert(itup_key->heapkeyspace);
+		replacepostingoff = OffsetNumberPrev(newitemoff);
+	}
+
 	/*
 	 * origpage is the original page to be split.  leftpage is a temporary
 	 * buffer that receives the left-sibling data, which will be copied back
@@ -1273,6 +1483,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	 * newitemoff == firstright.  In all other cases it's clear which side of
 	 * the split every tuple goes on from context.  newitemonleft is usually
 	 * (but not always) redundant information.
+	 *
+	 * Note: In theory, the split point choice logic should operate against a
+	 * version of the page that already replaced the posting list at offset
+	 * replacepostingoff with nposting where applicable.  We don't bother with
+	 * that, though.  Both versions of the posting list must be the same size,
+	 * and both will have the same base tuple key values, so split point
+	 * choice is never affected.
 	 */
 	firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
 								  newitem, &newitemonleft);
@@ -1340,6 +1557,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, firstright);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		/* Behave as if origpage posting list has already been swapped */
+		if (firstright == replacepostingoff)
+			item = nposting;
 	}
 
 	/*
@@ -1373,6 +1593,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
 			itemid = PageGetItemId(origpage, lastleftoff);
 			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+			/* Behave as if origpage posting list has already been swapped */
+			if (lastleftoff == replacepostingoff)
+				lastleft = nposting;
 		}
 
 		Assert(lastleft != item);
@@ -1480,8 +1703,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 
+		/*
+		 * did caller pass new replacement posting list tuple due to posting
+		 * list split?
+		 */
+		if (i == replacepostingoff)
+		{
+			/*
+			 * swap origpage posting list with post-posting-list-split version
+			 * from caller
+			 */
+			Assert(isleaf);
+			Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+			item = nposting;
+		}
+
 		/* does new item belong before this one? */
-		if (i == newitemoff)
+		else if (i == newitemoff)
 		{
 			if (newitemonleft)
 			{
@@ -1650,8 +1888,12 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		XLogRecPtr	recptr;
 
 		xlrec.level = ropaque->btpo.level;
+		/* See comments below on newitem, orignewitem, and posting lists */
 		xlrec.firstright = firstright;
 		xlrec.newitemoff = newitemoff;
+		xlrec.postingoff = 0;
+		if (replacepostingoff < firstright)
+			xlrec.postingoff = postingoff;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1670,11 +1912,45 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		 * because it's included with all the other items on the right page.)
 		 * Show the new item as belonging to the left page buffer, so that it
 		 * is not stored if XLogInsert decides it needs a full-page image of
-		 * the left page.  We store the offset anyway, though, to support
-		 * archive compression of these records.
+		 * the left page.  We always store newitemoff in the record, though.
+		 *
+		 * The details are sometimes slightly different for page splits that
+		 * coincide with a posting list split.  If both the replacement
+		 * posting list and newitem go on the right page, then we don't need
+		 * to log anything extra, just like the simple !newitemonleft
+		 * no-posting-split case (postingoff is set to zero in the WAL record,
+		 * so recovery doesn't need to process a posting list split at all).
+		 * Otherwise, we set postingoff and log orignewitem instead of
+		 * newitem, despite having actually inserted newitem.  Recovery must
+		 * reconstruct nposting and newitem using _bt_swap_posting().
+		 *
+		 * Note: It's possible that our page split point is the point that
+		 * makes the posting list lastleft and newitem firstright.  This is
+		 * the only case where we log orignewitem despite newitem going on the
+		 * right page.  If XLogInsert decides that it can omit orignewitem due
+		 * to logging a full-page image of the left page, everything still
+		 * works out, since recovery only needs to log orignewitem for items
+		 * on the left page (just like the regular newitem-logged case).
 		 */
-		if (newitemonleft)
-			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+		if (newitemonleft || xlrec.postingoff != 0)
+		{
+			if (xlrec.postingoff == 0)
+			{
+				/* Must WAL-log newitem, since it's on left page */
+				Assert(newitemonleft);
+				Assert(orignewitem == NULL && nposting == NULL);
+				XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+			}
+			else
+			{
+				/* Must WAL-log orignewitem following posting list split */
+				Assert(newitemonleft || firstright == newitemoff);
+				Assert(ItemPointerCompare(&orignewitem->t_tid,
+										  &newitem->t_tid) < 0);
+				XLogRegisterBufData(0, (char *) orignewitem,
+									MAXALIGN(IndexTupleSize(orignewitem)));
+			}
+		}
 
 		/* Log the left page's new high key */
 		itemid = PageGetItemId(origpage, P_HIKEY);
@@ -1834,7 +2110,7 @@ _bt_insert_parent(Relation rel,
 
 		/* Recursively insert into the parent */
 		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
-					   new_item, stack->bts_offset + 1,
+					   new_item, stack->bts_offset + 1, 0,
 					   is_only);
 
 		/* be tidy */
@@ -2190,6 +2466,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 		md.fastlevel = metad->btm_level;
 		md.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 		md.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+		md.btm_safededup = metad->btm_safededup;
 
 		XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -2303,6 +2580,6 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * Note: if we didn't find any LP_DEAD items, then the page's
 	 * BTP_HAS_GARBAGE hint bit is falsely set.  We do not bother expending a
 	 * separate write to clear it, however.  We will clear it when we split
-	 * the page.
+	 * the page (or when deduplication runs).
 	 */
 }
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 66c79623cf..3b49eb0762 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -42,12 +43,18 @@ static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
 								   BlockNumber *target, BlockNumber *rightsib);
 static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
 							   TransactionId latestRemovedXid);
+static TransactionId _bt_compute_xid_horizon_for_tuples(Relation rel,
+														Relation heapRel,
+														Buffer buf,
+														OffsetNumber *itemnos,
+														int nitems);
 
 /*
  *	_bt_initmetapage() -- Fill a page buffer with a correct metapage image
  */
 void
-_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
+_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+				 bool safededup)
 {
 	BTMetaPageData *metad;
 	BTPageOpaque metaopaque;
@@ -63,6 +70,7 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
 	metad->btm_fastlevel = level;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+	metad->btm_safededup = safededup;
 
 	metaopaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	metaopaque->btpo_flags = BTP_META;
@@ -102,6 +110,9 @@ _bt_upgrademetapage(Page page)
 	metad->btm_version = BTREE_NOVAC_VERSION;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+	/* Only a REINDEX can set this field */
+	Assert(!metad->btm_safededup);
+	metad->btm_safededup = false;
 
 	/* Adjust pd_lower (see _bt_initmetapage() for details) */
 	((PageHeader) page)->pd_lower =
@@ -213,6 +224,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 		md.fastlevel = metad->btm_fastlevel;
 		md.oldest_btpo_xact = oldestBtpoXact;
 		md.last_cleanup_num_heap_tuples = numHeapTuples;
+		md.btm_safededup = metad->btm_safededup;
 
 		XLogRegisterBufData(0, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -274,6 +286,8 @@ _bt_getroot(Relation rel, int access)
 		Assert(metad->btm_magic == BTREE_MAGIC);
 		Assert(metad->btm_version >= BTREE_MIN_VERSION);
 		Assert(metad->btm_version <= BTREE_VERSION);
+		Assert(!metad->btm_safededup ||
+			   metad->btm_version > BTREE_NOVAC_VERSION);
 		Assert(metad->btm_root != P_NONE);
 
 		rootblkno = metad->btm_fastroot;
@@ -394,6 +408,7 @@ _bt_getroot(Relation rel, int access)
 			md.fastlevel = 0;
 			md.oldest_btpo_xact = InvalidTransactionId;
 			md.last_cleanup_num_heap_tuples = -1.0;
+			md.btm_safededup = metad->btm_safededup;
 
 			XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -618,6 +633,7 @@ _bt_getrootheight(Relation rel)
 	Assert(metad->btm_magic == BTREE_MAGIC);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
 	Assert(metad->btm_version <= BTREE_VERSION);
+	Assert(!metad->btm_safededup || metad->btm_version > BTREE_NOVAC_VERSION);
 	Assert(metad->btm_fastroot != P_NONE);
 
 	return metad->btm_fastlevel;
@@ -683,6 +699,56 @@ _bt_heapkeyspace(Relation rel)
 	return metad->btm_version > BTREE_NOVAC_VERSION;
 }
 
+/*
+ *	_bt_safededup() -- can deduplication safely be used by index?
+ *
+ * Uses field from index relation's metapage/cached metapage.
+ */
+bool
+_bt_safededup(Relation rel)
+{
+	BTMetaPageData *metad;
+
+	if (rel->rd_amcache == NULL)
+	{
+		Buffer		metabuf;
+
+		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+		metad = _bt_getmeta(rel, metabuf);
+
+		/*
+		 * If there's no root page yet, _bt_getroot() doesn't expect a cache
+		 * to be made, so just stop here.  (XXX perhaps _bt_getroot() should
+		 * be changed to allow this case.)
+		 *
+		 * Note that we rely on the assumption that this field will be zero'ed
+		 * on indexes that were pg_upgrade'd.
+		 */
+		if (metad->btm_root == P_NONE)
+		{
+			_bt_relbuf(rel, metabuf);
+			return metad->btm_safededup;;
+		}
+
+		/* Cache the metapage data for next time */
+		rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
+											 sizeof(BTMetaPageData));
+		memcpy(rel->rd_amcache, metad, sizeof(BTMetaPageData));
+		_bt_relbuf(rel, metabuf);
+	}
+
+	/* Get cached page */
+	metad = (BTMetaPageData *) rel->rd_amcache;
+	/* We shouldn't have cached it if any of these fail */
+	Assert(metad->btm_magic == BTREE_MAGIC);
+	Assert(metad->btm_version >= BTREE_MIN_VERSION);
+	Assert(metad->btm_version <= BTREE_VERSION);
+	Assert(!metad->btm_safededup || metad->btm_version > BTREE_NOVAC_VERSION);
+	Assert(metad->btm_fastroot != P_NONE);
+
+	return metad->btm_safededup;
+}
+
 /*
  *	_bt_checkpage() -- Verify that a freshly-read page looks sane.
  */
@@ -968,27 +1034,73 @@ _bt_page_recyclable(Page page)
  * deleting the page it points to.
  *
  * This routine assumes that the caller has pinned and locked the buffer.
- * Also, the given deletable array *must* be sorted in ascending order.
+ * Also, the given deletable and updateitemnos arrays *must* be sorted in
+ * ascending order.
  *
  * We record VACUUMs and b-tree deletes differently in WAL.  Deletes must
  * generate recovery conflicts by accessing the heap inline, whereas VACUUMs
  * can rely on the initial heap scan taking care of the problem (pruning would
- * have generated the conflicts needed for hot standby already).
+ * have generated the conflicts needed for hot standby already).  Also,
+ * VACUUMs must deal with the case where posting list tuples have some dead
+ * TIDs, and some remaining TIDs that must not be killed.
  */
 void
-_bt_delitems_vacuum(Relation rel, Buffer buf, OffsetNumber *deletable,
-					int ndeletable)
+_bt_delitems_vacuum(Relation rel, Buffer buf,
+					OffsetNumber *deletable, int ndeletable,
+					OffsetNumber *updateitemnos,
+					IndexTuple *updated, int nupdatable)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	Size		itemsz;
+	Size		updated_sz = 0;
+	char	   *updated_buf = NULL;
 
-	Assert(ndeletable > 0);
+	Assert(ndeletable > 0 || nupdatable > 0);
+
+	/* XLOG stuff, buffer for updated */
+	if (nupdatable > 0 && RelationNeedsWAL(rel))
+	{
+		Size		offset = 0;
+
+		for (int i = 0; i < nupdatable; i++)
+			updated_sz += MAXALIGN(IndexTupleSize(updated[i]));
+
+		updated_buf = palloc(updated_sz);
+		for (int i = 0; i < nupdatable; i++)
+		{
+			itemsz = IndexTupleSize(updated[i]);
+			memcpy(updated_buf + offset, (char *) updated[i], itemsz);
+			offset += MAXALIGN(itemsz);
+		}
+		Assert(offset == updated_sz);
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
+	/* Handle posting tuple updates */
+	for (int i = 0; i < nupdatable; i++)
+	{
+		/*
+		 * Delete the old posting tuple first.  This will also clear the
+		 * LP_DEAD bit. (It would be correct to leave it set, but we're going
+		 * to unset the BTP_HAS_GARBAGE bit anyway.)
+		 */
+		PageIndexTupleDelete(page, updateitemnos[i]);
+
+		itemsz = IndexTupleSize(updated[i]);
+		itemsz = MAXALIGN(itemsz);
+
+		/* Add tuple with updated ItemPointers to the page */
+		if (PageAddItem(page, (Item) updated[i], itemsz, updateitemnos[i],
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to rewrite posting list item in index while doing vacuum");
+	}
+
 	/* Fix the page */
-	PageIndexMultiDelete(page, deletable, ndeletable);
+	if (ndeletable > 0)
+		PageIndexMultiDelete(page, deletable, ndeletable);
 
 	/*
 	 * We can clear the vacuum cycle ID since this page has certainly been
@@ -1015,6 +1127,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf, OffsetNumber *deletable,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.ndeleted = ndeletable;
+		xlrec_vacuum.nupdated = nupdatable;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1025,8 +1138,22 @@ _bt_delitems_vacuum(Relation rel, Buffer buf, OffsetNumber *deletable,
 		 * is.  When XLogInsert stores the whole buffer, the offsets array
 		 * need not be stored too.
 		 */
-		XLogRegisterBufData(0, (char *) deletable, ndeletable *
-							sizeof(OffsetNumber));
+		if (ndeletable > 0)
+			XLogRegisterBufData(0, (char *) deletable,
+								ndeletable * sizeof(OffsetNumber));
+
+		/*
+		 * Here we should save offnums and updated tuples themselves. It's
+		 * important to restore them in correct order. At first, we must
+		 * handle updated tuples and only after that other deleted items.
+		 */
+		if (nupdatable > 0)
+		{
+			Assert(updated_buf != NULL);
+			XLogRegisterBufData(0, (char *) updateitemnos,
+								nupdatable * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, updated_buf, updated_sz);
+		}
 
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
@@ -1036,6 +1163,91 @@ _bt_delitems_vacuum(Relation rel, Buffer buf, OffsetNumber *deletable,
 	END_CRIT_SECTION();
 }
 
+/*
+ * Get the latestRemovedXid from the table entries pointed at by the index
+ * tuples being deleted.
+ *
+ * This is a version of index_compute_xid_horizon_for_tuples() specialized to
+ * nbtree, which can handle posting lists.
+ */
+static TransactionId
+_bt_compute_xid_horizon_for_tuples(Relation rel, Relation heapRel,
+								   Buffer buf, OffsetNumber *itemnos,
+								   int nitems)
+{
+	ItemPointer htids;
+	TransactionId latestRemovedXid = InvalidTransactionId;
+	Page		page = BufferGetPage(buf);
+	int			arraynitems;
+	int			finalnitems;
+
+	/*
+	 * Initial size of array can fit everything when it turns out that are no
+	 * posting lists
+	 */
+	arraynitems = nitems;
+	htids = (ItemPointer) palloc(sizeof(ItemPointerData) * arraynitems);
+
+	finalnitems = 0;
+	/* identify what the index tuples about to be deleted point to */
+	for (int i = 0; i < nitems; i++)
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+
+		itemid = PageGetItemId(page, itemnos[i]);
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(ItemIdIsDead(itemid));
+
+		if (!BTreeTupleIsPosting(itup))
+		{
+			/* Make sure that we have space for additional heap TID */
+			if (finalnitems + 1 > arraynitems)
+			{
+				arraynitems = arraynitems * 2;
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			Assert(ItemPointerIsValid(&itup->t_tid));
+			ItemPointerCopy(&itup->t_tid, &htids[finalnitems]);
+			finalnitems++;
+		}
+		else
+		{
+			int			nposting = BTreeTupleGetNPosting(itup);
+
+			/* Make sure that we have space for additional heap TIDs */
+			if (finalnitems + nposting > arraynitems)
+			{
+				arraynitems = Max(arraynitems * 2, finalnitems + nposting);
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			for (int j = 0; j < nposting; j++)
+			{
+				ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+				Assert(ItemPointerIsValid(htid));
+				ItemPointerCopy(htid, &htids[finalnitems]);
+				finalnitems++;
+			}
+		}
+	}
+
+	Assert(finalnitems >= nitems);
+
+	/* determine the actual xid horizon */
+	latestRemovedXid =
+		table_compute_xid_horizon_for_tuples(heapRel, htids, finalnitems);
+
+	pfree(htids);
+
+	return latestRemovedXid;
+}
+
 /*
  * Delete item(s) from a btree page during single-page cleanup.
  *
@@ -1046,7 +1258,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf, OffsetNumber *deletable,
  *
  * This is nearly the same as _bt_delitems_vacuum as far as what it does to
  * the page, but it needs to generate its own recovery conflicts by accessing
- * the heap.  See comments for _bt_delitems_vacuum.
+ * the heap, and doesn't handle updating posting list tuples.  See comments
+ * for _bt_delitems_vacuum.
  */
 void
 _bt_delitems_delete(Relation rel, Buffer buf,
@@ -1062,8 +1275,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 
 	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
 		latestRemovedXid =
-			index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
-												 itemnos, nitems);
+			_bt_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+											   itemnos, nitems);
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
@@ -2061,6 +2274,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 			xlmeta.fastlevel = metad->btm_fastlevel;
 			xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 			xlmeta.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+			xlmeta.btm_safededup = metad->btm_safededup;
 
 			XLogRegisterBufData(4, (char *) &xlmeta, sizeof(xl_btree_metadata));
 			xlinfo = XLOG_BTREE_UNLINK_PAGE_META;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index bbc1376b0a..8a67193152 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -95,6 +95,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static ItemPointer btreevacuumposting(BTVacState *vstate, IndexTuple itup,
+									  int *nremaining);
 
 
 /*
@@ -158,7 +160,7 @@ btbuildempty(Relation index)
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, P_NONE, 0);
+	_bt_initmetapage(metapage, P_NONE, 0, _bt_opclasses_support_dedup(index));
 
 	/*
 	 * Write the page and log it.  It might seem that an immediate sync would
@@ -261,8 +263,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 				 */
 				if (so->killedItems == NULL)
 					so->killedItems = (int *)
-						palloc(MaxIndexTuplesPerPage * sizeof(int));
-				if (so->numKilled < MaxIndexTuplesPerPage)
+						palloc(MaxBTreeIndexTuplesPerPage * sizeof(int));
+				if (so->numKilled < MaxBTreeIndexTuplesPerPage)
 					so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 			}
 
@@ -1151,8 +1153,17 @@ restart:
 	}
 	else if (P_ISLEAF(opaque))
 	{
+		/* Deletable item state */
 		OffsetNumber deletable[MaxOffsetNumber];
 		int			ndeletable;
+		int			nhtidsdead;
+		int			nhtidslive;
+
+		/* Updatable item state (for posting lists) */
+		IndexTuple	updated[MaxOffsetNumber];
+		OffsetNumber updatable[MaxOffsetNumber];
+		int			nupdatable;
+
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
@@ -1185,6 +1196,10 @@ restart:
 		 * callback function.
 		 */
 		ndeletable = 0;
+		nupdatable = 0;
+		/* Maintain stats counters for index tuple versions/heap TIDs */
+		nhtidsdead = 0;
+		nhtidslive = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
 		if (callback)
@@ -1194,11 +1209,9 @@ restart:
 				 offnum = OffsetNumberNext(offnum))
 			{
 				IndexTuple	itup;
-				ItemPointer htup;
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
 				/*
 				 * During Hot Standby we currently assume that it's okay that
@@ -1221,8 +1234,71 @@ restart:
 				 * applies to *any* type of index that marks index tuples as
 				 * killed.
 				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Regular tuple, standard heap TID representation */
+					ItemPointer htid = &(itup->t_tid);
+
+					if (callback(htid, callback_state))
+					{
+						deletable[ndeletable++] = offnum;
+						nhtidsdead++;
+					}
+					else
+						nhtidslive++;
+				}
+				else
+				{
+					ItemPointer newhtids;
+					int			nremaining;
+
+					/*
+					 * Posting list tuple, a physical tuple that represents
+					 * two or more logical tuples, any of which could be an
+					 * index row version that must be removed
+					 */
+					newhtids = btreevacuumposting(vstate, itup, &nremaining);
+					if (newhtids == NULL)
+					{
+						/*
+						 * All TIDs/logical tuples from the posting tuple
+						 * remain, so no update or delete required
+						 */
+						Assert(nremaining == BTreeTupleGetNPosting(itup));
+					}
+					else if (nremaining > 0)
+					{
+						IndexTuple	updatedtuple;
+
+						/*
+						 * Form new tuple that contains only remaining TIDs.
+						 * Remember this tuple and the offset of the old tuple
+						 * for when we update it in place
+						 */
+						Assert(nremaining < BTreeTupleGetNPosting(itup));
+						updatedtuple = _bt_form_posting(itup, newhtids,
+														nremaining);
+						updated[nupdatable] = updatedtuple;
+						updatable[nupdatable++] = offnum;
+						nhtidsdead += BTreeTupleGetNPosting(itup) - nremaining;
+						pfree(newhtids);
+					}
+					else
+					{
+						/*
+						 * All TIDs/logical tuples from the posting list must
+						 * be deleted.  We'll delete the physical tuple
+						 * completely.
+						 */
+						deletable[ndeletable++] = offnum;
+						nhtidsdead += BTreeTupleGetNPosting(itup);
+
+						/* Free empty array of live items */
+						pfree(newhtids);
+					}
+
+					nhtidslive += nremaining;
+				}
 			}
 		}
 
@@ -1230,11 +1306,12 @@ restart:
 		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
 		 * call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nupdatable > 0)
 		{
-			_bt_delitems_vacuum(rel, buf, deletable, ndeletable);
+			_bt_delitems_vacuum(rel, buf, deletable, ndeletable, updatable,
+								updated, nupdatable);
 
-			stats->tuples_removed += ndeletable;
+			stats->tuples_removed += nhtidsdead;
 			/* must recompute maxoff */
 			maxoff = PageGetMaxOffsetNumber(page);
 		}
@@ -1249,6 +1326,7 @@ restart:
 			 * We treat this like a hint-bit update because there's no need to
 			 * WAL-log it.
 			 */
+			Assert(nhtidsdead == 0);
 			if (vstate->cycleid != 0 &&
 				opaque->btpo_cycleid == vstate->cycleid)
 			{
@@ -1258,15 +1336,16 @@ restart:
 		}
 
 		/*
-		 * If it's now empty, try to delete; else count the live tuples. We
-		 * don't delete when recursing, though, to avoid putting entries into
+		 * If it's now empty, try to delete; else count the live tuples (live
+		 * heap TIDs in posting lists are counted as live tuples).  We don't
+		 * delete when recursing, though, to avoid putting entries into
 		 * freePages out-of-order (doesn't seem worth any extra code to handle
 		 * the case).
 		 */
 		if (minoff > maxoff)
 			delete_now = (blkno == orig_blkno);
 		else
-			stats->num_index_tuples += maxoff - minoff + 1;
+			stats->num_index_tuples += nhtidslive;
 	}
 
 	if (delete_now)
@@ -1309,6 +1388,68 @@ restart:
 	}
 }
 
+/*
+ * btreevacuumposting() -- determines which logical tuples must remain when
+ * VACUUMing a posting list tuple.
+ *
+ * Returns new palloc'd array of item pointers needed to build replacement
+ * posting list without the index row versions that are to be deleted.
+ *
+ * Note that returned array is NULL in the common case where there is nothing
+ * to delete in caller's posting list tuple.  The number of TIDs that should
+ * remain in the posting list tuple is set for caller in *nremaining.  This is
+ * also the size of the returned array (though only when array isn't just
+ * NULL).
+ */
+static ItemPointer
+btreevacuumposting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+	int			live = 0;
+	int			nitem = BTreeTupleGetNPosting(itup);
+	ItemPointer tmpitems = NULL,
+				items = BTreeTupleGetPosting(itup);
+
+	Assert(BTreeTupleIsPosting(itup));
+
+	/*
+	 * Check each tuple in the posting list.  Save live tuples into tmpitems,
+	 * though try to avoid memory allocation as an optimization.
+	 */
+	for (int i = 0; i < nitem; i++)
+	{
+		if (!vstate->callback(items + i, vstate->callback_state))
+		{
+			/*
+			 * Live heap TID.
+			 *
+			 * Only save live TID when we know that we're going to have to
+			 * kill at least one TID, and have already allocated memory.
+			 */
+			if (tmpitems)
+				tmpitems[live] = items[i];
+			live++;
+		}
+
+		/* Dead heap TID */
+		else if (tmpitems == NULL)
+		{
+			/*
+			 * Turns out we need to delete one or more dead heap TIDs, so
+			 * start maintaining an array of live TIDs for caller to
+			 * reconstruct smaller replacement posting list tuple
+			 */
+			tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+			/* Copy live heap TIDs from previous loop iterations */
+			if (live > 0)
+				memcpy(tmpitems, items, sizeof(ItemPointerData) * live);
+		}
+	}
+
+	*nremaining = live;
+	return tmpitems;
+}
+
 /*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8e512461a0..c954926f2d 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int	_bt_binsrch_posting(BTScanInsert key, Page page,
+								OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static void _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+								  OffsetNumber offnum, ItemPointer heapTid,
+								  IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+									   OffsetNumber offnum,
+									   ItemPointer heapTid);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -434,7 +442,10 @@ _bt_binsrch(Relation rel,
  * low) makes bounds invalid.
  *
  * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
  */
 OffsetNumber
 _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +464,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 	Assert(P_ISLEAF(opaque));
 	Assert(!key->nextkey);
+	Assert(insertstate->postingoff == 0);
 
 	if (!insertstate->bounds_valid)
 	{
@@ -509,6 +521,16 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 			if (result != 0)
 				stricthigh = high;
 		}
+
+		/*
+		 * If tuple at offset located by binary search is a posting list whose
+		 * TID range overlaps with caller's scantid, perform posting list
+		 * binary search to set postingoff for caller.  Caller must split the
+		 * posting list when postingoff is set.  This should happen
+		 * infrequently.
+		 */
+		if (unlikely(result == 0 && key->scantid != NULL))
+			insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
 	}
 
 	/*
@@ -528,6 +550,68 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	return low;
 }
 
+/*----------
+ *	_bt_binsrch_posting() -- posting list binary search.
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+	IndexTuple	itup;
+	ItemId		itemid;
+	int			low,
+				high,
+				mid,
+				res;
+
+	/*
+	 * If this isn't a posting tuple, then the index must be corrupt (if it is
+	 * an ordinary non-pivot tuple then there must be an existing tuple with a
+	 * heap TID that equals inserter's new heap TID/scantid).  Defensively
+	 * check that tuple is a posting list tuple whose posting list range
+	 * includes caller's scantid.
+	 *
+	 * (This is also needed because contrib/amcheck's rootdescend option needs
+	 * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+	 */
+	itemid = PageGetItemId(page, offnum);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+	if (!BTreeTupleIsPosting(itup))
+		return 0;
+
+	/*
+	 * In the unlikely event that posting list tuple has LP_DEAD bit set,
+	 * signal to caller that it should kill the item and restart its binary
+	 * search.
+	 */
+	if (ItemIdIsDead(itemid))
+		return -1;
+
+	/* "high" is past end of posting list for loop invariant */
+	low = 0;
+	high = BTreeTupleGetNPosting(itup);
+	Assert(high >= 2);
+
+	while (high > low)
+	{
+		mid = low + ((high - low) / 2);
+		res = ItemPointerCompare(key->scantid,
+								 BTreeTupleGetPostingN(itup, mid));
+
+		if (res > 0)
+			low = mid + 1;
+		else if (res < 0)
+			high = mid;
+		else
+			return mid;
+	}
+
+	/* Exact match not found */
+	return low;
+}
+
 /*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
@@ -537,9 +621,14 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
  *			<0 if scankey < tuple at offnum;
  *			 0 if scankey == tuple at offnum;
  *			>0 if scankey > tuple at offnum.
- *		NULLs in the keys are treated as sortable values.  Therefore
- *		"equality" does not necessarily mean that the item should be
- *		returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values.  Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key.  Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid.  There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
  *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
@@ -563,6 +652,7 @@ _bt_compare(Relation rel,
 	ScanKey		scankey;
 	int			ncmpkey;
 	int			ntupatts;
+	int32		result;
 
 	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +687,6 @@ _bt_compare(Relation rel,
 	{
 		Datum		datum;
 		bool		isNull;
-		int32		result;
 
 		datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
 
@@ -713,8 +802,25 @@ _bt_compare(Relation rel,
 	if (heapTid == NULL)
 		return 1;
 
+	/*
+	 * scankey must be treated as equal to a posting list tuple if its scantid
+	 * value falls within the range of the posting list.  In all other cases
+	 * there can only be a single heap TID value, which is compared directly
+	 * as a simple scalar value.
+	 */
 	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	return ItemPointerCompare(key->scantid, heapTid);
+	result = ItemPointerCompare(key->scantid, heapTid);
+	if (result <= 0 || !BTreeTupleIsPosting(itup))
+		return result;
+	else
+	{
+		result = ItemPointerCompare(key->scantid,
+									BTreeTupleGetMaxHeapTID(itup));
+		if (result > 0)
+			return 1;
+	}
+
+	return 0;
 }
 
 /*
@@ -1230,6 +1336,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 
 	/* Initialize remaining insertion scan key fields */
 	inskey.heapkeyspace = _bt_heapkeyspace(rel);
+	inskey.safededup = false;	/* unused */
 	inskey.anynullkeys = false; /* unused */
 	inskey.nextkey = nextkey;
 	inskey.pivotsearch = false;
@@ -1451,6 +1558,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	/* initialize tuple workspace to empty */
 	so->currPos.nextTupleOffset = 0;
+	so->currPos.postingTupleOffset = 0;
 
 	/*
 	 * Now that the current page has been made consistent, the macro should be
@@ -1484,9 +1592,31 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
-				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				/* tuple passes all scan key conditions */
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Remember it */
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+				else
+				{
+					/*
+					 * Set up state to return posting list, and remember first
+					 * "logical" tuple
+					 */
+					_bt_setuppostingitems(so, itemIndex, offnum,
+										  BTreeTupleGetPostingN(itup, 0),
+										  itup);
+					itemIndex++;
+					/* Remember additional logical tuples */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i));
+						itemIndex++;
+					}
+				}
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1519,7 +1649,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxBTreeIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1527,7 +1657,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxBTreeIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1568,9 +1698,37 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 										 &continuescan);
 			if (passes_quals && tuple_alive)
 			{
-				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				/* tuple passes all scan key conditions */
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Remember it */
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+				else
+				{
+					int			i = BTreeTupleGetNPosting(itup) - 1;
+
+					/*
+					 * Set up state to return posting list, and remember last
+					 * "logical" tuple (since we'll return it first)
+					 */
+					itemIndex--;
+					_bt_setuppostingitems(so, itemIndex, offnum,
+										  BTreeTupleGetPostingN(itup, i--),
+										  itup);
+
+					/*
+					 * Remember additional logical tuples (use desc order to
+					 * be consistent with order of entire scan)
+					 */
+					for (; i >= 0; i--)
+					{
+						itemIndex--;
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i));
+					}
+				}
 			}
 			if (!continuescan)
 			{
@@ -1584,8 +1742,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxBTreeIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxBTreeIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1756,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1610,6 +1770,64 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 	}
 }
 
+/*
+ * Setup state to save posting items from a single posting list tuple.  Saves
+ * the logical tuple that will be returned to scan first in passing.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for logical tuple
+ * that is returned to scan first.  Second or subsequent heap TID for posting
+ * list should be saved by calling _bt_savepostingitem().
+ */
+static void
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					  ItemPointer heapTid, IndexTuple itup)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		/* Save base IndexTuple (truncate posting list) */
+		IndexTuple	base;
+		Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+		itupsz = MAXALIGN(itupsz);
+		currItem->tupleOffset = so->currPos.nextTupleOffset;
+		base = (IndexTuple) (so->currTuples + so->currPos.nextTupleOffset);
+		memcpy(base, itup, itupsz);
+		/* Defensively reduce work area index tuple header size */
+		base->t_info &= ~INDEX_SIZE_MASK;
+		base->t_info |= itupsz;
+		so->currPos.nextTupleOffset += itupsz;
+		so->currPos.postingTupleOffset = currItem->tupleOffset;
+	}
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer heapTid)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+
+	/*
+	 * Have index-only scans return the same base IndexTuple for every logical
+	 * tuple that originates from the same posting list
+	 */
+	if (so->currTuples)
+		currItem->tupleOffset = so->currPos.postingTupleOffset;
+}
+
 /*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 1dd39a9535..b40559d45f 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -243,6 +243,7 @@ typedef struct BTPageState
 	BlockNumber btps_blkno;		/* block # to write this page at */
 	IndexTuple	btps_lowkey;	/* page's strict lower bound pivot tuple */
 	OffsetNumber btps_lastoff;	/* last item offset loaded */
+	Size		btps_lastextra; /* last item's extra posting list space */
 	uint32		btps_level;		/* tree level (0 = leaf) */
 	Size		btps_full;		/* "full" if less than this much free space */
 	struct BTPageState *btps_next;	/* link to parent level, if any */
@@ -277,7 +278,10 @@ static void _bt_slideleft(Page page);
 static void _bt_sortaddtup(Page page, Size itemsize,
 						   IndexTuple itup, OffsetNumber itup_off);
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
-						 IndexTuple itup);
+						 IndexTuple itup, Size truncextra);
+static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
+										  BTPageState *state,
+										  BTDedupState dstate);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
@@ -711,6 +715,7 @@ _bt_pagestate(BTWriteState *wstate, uint32 level)
 	state->btps_lowkey = NULL;
 	/* initialize lastoff so first item goes into P_FIRSTKEY */
 	state->btps_lastoff = P_HIKEY;
+	state->btps_lastextra = 0;
 	state->btps_level = level;
 	/* set "full" threshold based on level.  See notes at head of file. */
 	if (level > 0)
@@ -789,7 +794,8 @@ _bt_sortaddtup(Page page,
 }
 
 /*----------
- * Add an item to a disk page from the sort output.
+ * Add an item to a disk page from the sort output (or add a posting list
+ * item formed from the sort output).
  *
  * We must be careful to observe the page layout conventions of nbtsearch.c:
  * - rightmost pages start data items at P_HIKEY instead of at P_FIRSTKEY.
@@ -821,14 +827,27 @@ _bt_sortaddtup(Page page,
  * the truncated high key at offset 1.
  *
  * 'last' pointer indicates the last offset added to the page.
+ *
+ * 'truncextra' is the size of the posting list in itup, if any.  This
+ * information is stashed for the next call here, when we may benefit
+ * from considering the impact of truncating away the posting list on
+ * the page before deciding to finish the page off.  Posting lists are
+ * often relatively large, so it is worth going to the trouble of
+ * accounting for the saving from truncating away the posting list of
+ * the tuple that becomes the high key (that may be the only way to
+ * get close to target free space on the page).  Note that this is
+ * only used for the soft fillfactor-wise limit, not the critical hard
+ * limit.
  *----------
  */
 static void
-_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
+_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
+			 Size truncextra)
 {
 	Page		npage;
 	BlockNumber nblkno;
 	OffsetNumber last_off;
+	Size		last_truncextra;
 	Size		pgspc;
 	Size		itupsz;
 	bool		isleaf;
@@ -842,6 +861,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	npage = state->btps_page;
 	nblkno = state->btps_blkno;
 	last_off = state->btps_lastoff;
+	last_truncextra = state->btps_lastextra;
+	state->btps_lastextra = truncextra;
 
 	pgspc = PageGetFreeSpace(npage);
 	itupsz = IndexTupleSize(itup);
@@ -883,10 +904,10 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	 * page.  Disregard fillfactor and insert on "full" current page if we
 	 * don't have the minimum number of items yet.  (Note that we deliberately
 	 * assume that suffix truncation neither enlarges nor shrinks new high key
-	 * when applying soft limit.)
+	 * when applying soft limit, except when last tuple had a posting list.)
 	 */
 	if (pgspc < itupsz + (isleaf ? MAXALIGN(sizeof(ItemPointerData)) : 0) ||
-		(pgspc < state->btps_full && last_off > P_FIRSTKEY))
+		(pgspc + last_truncextra < state->btps_full && last_off > P_FIRSTKEY))
 	{
 		/*
 		 * Finish off the page and write it out.
@@ -944,11 +965,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * We don't try to bias our choice of split point to make it more
 			 * likely that _bt_truncate() can truncate away more attributes,
 			 * whereas the split point used within _bt_split() is chosen much
-			 * more delicately.  Suffix truncation is mostly useful because it
-			 * improves space utilization for workloads with random
-			 * insertions.  It doesn't seem worthwhile to add logic for
-			 * choosing a split point here for a benefit that is bound to be
-			 * much smaller.
+			 * more delicately.  On the other hand, non-unique index builds
+			 * usually deduplicate, which often results in every "physical"
+			 * tuple on the page having distinct key values.  When that
+			 * happens, _bt_truncate() will never need to include a heap TID
+			 * in the new high key.
 			 *
 			 * Overwrite the old item with new truncated high key directly.
 			 * oitup is already located at the physical beginning of tuple
@@ -983,7 +1004,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(BTreeTupleGetNAtts(state->btps_lowkey, wstate->index) == 0 ||
 			   !P_LEFTMOST((BTPageOpaque) PageGetSpecialPointer(opage)));
 		BTreeInnerTupleSetDownLink(state->btps_lowkey, oblkno);
-		_bt_buildadd(wstate, state->btps_next, state->btps_lowkey);
+		_bt_buildadd(wstate, state->btps_next, state->btps_lowkey, 0);
 		pfree(state->btps_lowkey);
 
 		/*
@@ -1045,6 +1066,47 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	state->btps_lastoff = last_off;
 }
 
+/*
+ * Finalize pending posting list tuple, and add it to the index.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * This is almost like _bt_dedup_finish_pending(), but it adds a new tuple
+ * using _bt_buildadd() and does not maintain the intervals array.
+ */
+static void
+_bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
+							  BTDedupState dstate)
+{
+	IndexTuple	final;
+	Size		truncextra;
+
+	Assert(dstate->nitems > 0);
+	truncextra = 0;
+	if (dstate->nitems == 1)
+		final = dstate->base;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = _bt_form_posting(dstate->base,
+										dstate->htids,
+										dstate->nhtids);
+		final = postingtuple;
+		/* Determine size of posting list */
+		truncextra = IndexTupleSize(final) -
+			BTreeTupleGetPostingOffset(final);
+	}
+
+	_bt_buildadd(wstate, state, final, truncextra);
+
+	if (dstate->nitems > 1)
+		pfree(final);
+	/* Don't maintain  dedup_intervals array, or alltupsize */
+	dstate->nhtids = 0;
+	dstate->nitems = 0;
+}
+
 /*
  * Finish writing out the completed btree.
  */
@@ -1090,7 +1152,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 			Assert(BTreeTupleGetNAtts(s->btps_lowkey, wstate->index) == 0 ||
 				   !P_LEFTMOST(opaque));
 			BTreeInnerTupleSetDownLink(s->btps_lowkey, blkno);
-			_bt_buildadd(wstate, s->btps_next, s->btps_lowkey);
+			_bt_buildadd(wstate, s->btps_next, s->btps_lowkey, 0);
 			pfree(s->btps_lowkey);
 			s->btps_lowkey = NULL;
 		}
@@ -1111,7 +1173,8 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 	 * by filling in a valid magic number in the metapage.
 	 */
 	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, rootblkno, rootlevel);
+	_bt_initmetapage(metapage, rootblkno, rootlevel,
+					 wstate->inskey->safededup);
 	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
 }
 
@@ -1132,6 +1195,9 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		deduplicate;
+
+	deduplicate = wstate->inskey->safededup && BTGetUseDedup(wstate->index);
 
 	if (merge)
 	{
@@ -1228,12 +1294,12 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 
 			if (load1)
 			{
-				_bt_buildadd(wstate, state, itup);
+				_bt_buildadd(wstate, state, itup, 0);
 				itup = tuplesort_getindextuple(btspool->sortstate, true);
 			}
 			else
 			{
-				_bt_buildadd(wstate, state, itup2);
+				_bt_buildadd(wstate, state, itup2, 0);
 				itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
 			}
 
@@ -1243,9 +1309,113 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		}
 		pfree(sortKeys);
 	}
+	else if (deduplicate)
+	{
+		/* merge is unnecessary, deduplicate into posting lists */
+		BTDedupState dstate;
+		IndexTuple	newbase;
+
+		dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
+		dstate->maxitemsize = 0;	/* set later */
+		dstate->checkingunique = false; /* unused */
+		dstate->skippedbase = InvalidOffsetNumber;
+		dstate->newitem = NULL;
+		/* Metadata about current pending posting list */
+		dstate->htids = NULL;
+		dstate->nhtids = 0;
+		dstate->nitems = 0;
+		dstate->overlap = false;
+		dstate->alltupsize = 0; /* unused */
+		/* Metadata about based tuple of current pending posting list */
+		dstate->base = NULL;
+		dstate->baseoff = InvalidOffsetNumber;	/* unused */
+		dstate->basetupsize = 0;
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate,
+											   true)) != NULL)
+		{
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+			{
+				state = _bt_pagestate(wstate, 0);
+
+				/*
+				 * Limit size of posting list tuples to the size of the free
+				 * space we want to leave behind on the page, plus space for
+				 * final item's line pointer (but make sure that posting list
+				 * tuple size won't exceed the generic 1/3 of a page limit).
+				 *
+				 * This is more conservative than the approach taken in the
+				 * retail insert path, but it allows us to get most of the
+				 * space savings deduplication provides without noticeably
+				 * impacting how much free space is left behind on each leaf
+				 * page.
+				 */
+				dstate->maxitemsize =
+					Min(BTMaxItemSize(state->btps_page),
+						MAXALIGN_DOWN(state->btps_full) - sizeof(ItemIdData));
+				/* Minimum posting tuple size used here is arbitrary: */
+				dstate->maxitemsize = Max(dstate->maxitemsize, 100);
+				dstate->htids = palloc(dstate->maxitemsize);
+
+				/*
+				 * No previous/base tuple, since itup is the  first item
+				 * returned by the tuplesort -- use itup as base tuple of
+				 * first pending posting list for entire index build
+				 */
+				newbase = CopyIndexTuple(itup);
+				_bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+			}
+			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
+										 itup) > keysz &&
+					 _bt_dedup_save_htid(dstate, itup))
+			{
+				/*
+				 * Tuple is equal to base tuple of pending posting list, and
+				 * merging itup into pending posting list won't exceed the
+				 * maxitemsize limit.  Heap TID(s) for itup have been saved in
+				 * state.  The next iteration will also end up here if it's
+				 * possible to merge the next tuple into the same pending
+				 * posting list.
+				 */
+			}
+			else
+			{
+				/*
+				 * Tuple is not equal to pending posting list tuple, or
+				 * maxitemsize limit was reached
+				 */
+				_bt_sort_dedup_finish_pending(wstate, state, dstate);
+				/* Base tuple is always a copy */
+				pfree(dstate->base);
+
+				/* itup starts new pending posting list */
+				newbase = CopyIndexTuple(itup);
+				_bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+			}
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+
+		/*
+		 * Handle the last item (there must be a last item when the tuplesort
+		 * returned one or more tuples)
+		 */
+		if (state)
+		{
+			_bt_sort_dedup_finish_pending(wstate, state, dstate);
+			/* Base tuple is always a copy */
+			pfree(dstate->base);
+			pfree(dstate->htids);
+		}
+
+		pfree(dstate);
+	}
 	else
 	{
-		/* merge is unnecessary */
+		/* merging and deduplication are both unnecessary */
 		while ((itup = tuplesort_getindextuple(btspool->sortstate,
 											   true)) != NULL)
 		{
@@ -1253,7 +1423,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			if (state == NULL)
 				state = _bt_pagestate(wstate, 0);
 
-			_bt_buildadd(wstate, state, itup);
+			_bt_buildadd(wstate, state, itup, 0);
 
 			/* Report progress */
 			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 29167f1ef5..ffec42e78a 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -51,6 +51,7 @@ typedef struct
 	Size		newitemsz;		/* size of newitem (includes line pointer) */
 	bool		is_leaf;		/* T if splitting a leaf page */
 	bool		is_rightmost;	/* T if splitting rightmost page on level */
+	bool		is_deduped;		/* T if posting list truncation expected */
 	OffsetNumber newitemoff;	/* where the new item is to be inserted */
 	int			leftspace;		/* space available for items on left page */
 	int			rightspace;		/* space available for items on right page */
@@ -177,12 +178,16 @@ _bt_findsplitloc(Relation rel,
 	state.newitemsz = newitemsz;
 	state.is_leaf = P_ISLEAF(opaque);
 	state.is_rightmost = P_RIGHTMOST(opaque);
+	state.is_deduped = state.is_leaf && BTGetUseDedup(rel);
 	state.leftspace = leftspace;
 	state.rightspace = rightspace;
 	state.olddataitemstotal = olddataitemstotal;
 	state.minfirstrightsz = SIZE_MAX;
 	state.newitemoff = newitemoff;
 
+	/* newitem cannot be a posting list item */
+	Assert(!BTreeTupleIsPosting(newitem));
+
 	/*
 	 * maxsplits should never exceed maxoff because there will be at most as
 	 * many candidate split points as there are points _between_ tuples, once
@@ -459,6 +464,7 @@ _bt_recsplitloc(FindSplitData *state,
 	int16		leftfree,
 				rightfree;
 	Size		firstrightitemsz;
+	Size		postingsz = 0;
 	bool		newitemisfirstonright;
 
 	/* Is the new item going to be the first item on the right page? */
@@ -468,8 +474,31 @@ _bt_recsplitloc(FindSplitData *state,
 	if (newitemisfirstonright)
 		firstrightitemsz = state->newitemsz;
 	else
+	{
 		firstrightitemsz = firstoldonrightsz;
 
+		/*
+		 * Calculate suffix truncation space saving when firstright is a
+		 * posting list tuple.
+		 *
+		 * Individual posting lists often take up a significant fraction of
+		 * all space on a page.  Failing to consider that the new high key
+		 * won't need to store the posting list a second time really matters.
+		 */
+		if (state->is_leaf && state->is_deduped)
+		{
+			ItemId		itemid;
+			IndexTuple	newhighkey;
+
+			itemid = PageGetItemId(state->page, firstoldonright);
+			newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+			if (BTreeTupleIsPosting(newhighkey))
+				postingsz = IndexTupleSize(newhighkey) -
+					BTreeTupleGetPostingOffset(newhighkey);
+		}
+	}
+
 	/* Account for all the old tuples */
 	leftfree = state->leftspace - olddataitemstoleft;
 	rightfree = state->rightspace -
@@ -492,9 +521,11 @@ _bt_recsplitloc(FindSplitData *state,
 	 * adding a heap TID to the left half's new high key when splitting at the
 	 * leaf level.  In practice the new high key will often be smaller and
 	 * will rarely be larger, but conservatively assume the worst case.
+	 * Truncation always truncates away any posting list that appears in the
+	 * first right tuple, though, so it's safe to subtract that overhead.
 	 */
 	if (state->is_leaf)
-		leftfree -= (int16) (firstrightitemsz +
+		leftfree -= (int16) ((firstrightitemsz - postingsz) +
 							 MAXALIGN(sizeof(ItemPointerData)));
 	else
 		leftfree -= (int16) firstrightitemsz;
@@ -691,7 +722,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
 	tup = (IndexTuple) PageGetItem(state->page, itemid);
 	/* Do cheaper test first */
-	if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+	if (BTreeTupleIsPosting(tup) ||
+		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
 	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index ee972a1465..cb6a5b9335 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -20,6 +20,7 @@
 #include "access/nbtree.h"
 #include "access/reloptions.h"
 #include "access/relscan.h"
+#include "catalog/catalog.h"
 #include "commands/progress.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -98,8 +99,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	indoption = rel->rd_indoption;
 	tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
 
-	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
 	/*
 	 * We'll execute search using scan key constructed on key columns.
 	 * Truncated attributes and non-key attributes are omitted from the final
@@ -108,12 +107,25 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	key = palloc(offsetof(BTScanInsertData, scankeys) +
 				 sizeof(ScanKeyData) * indnkeyatts);
 	key->heapkeyspace = itup == NULL || _bt_heapkeyspace(rel);
+	key->safededup = itup == NULL ? _bt_opclasses_support_dedup(rel) :
+		_bt_safededup(rel);
 	key->anynullkeys = false;	/* initial assumption */
 	key->nextkey = false;
 	key->pivotsearch = false;
+	key->scantid = NULL;
 	key->keysz = Min(indnkeyatts, tupnatts);
-	key->scantid = key->heapkeyspace && itup ?
-		BTreeTupleGetHeapTID(itup) : NULL;
+
+	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+	Assert(!itup || !BTreeTupleIsPosting(itup) || key->heapkeyspace);
+
+	/*
+	 * When caller passes a tuple with a heap TID, use it to set scantid. Note
+	 * that this handles posting list tuples by setting scantid to the lowest
+	 * heap TID in the posting list.
+	 */
+	if (itup && key->heapkeyspace)
+		key->scantid = BTreeTupleGetHeapTID(itup);
+
 	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
@@ -1373,6 +1385,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			continue;
 		}
 
@@ -1534,6 +1547,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			cmpresult = 0;
 			if (subkey->sk_flags & SK_ROW_END)
 				break;
@@ -1773,10 +1787,35 @@ _bt_killitems(IndexScanDesc scan)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
+			bool		killtuple = false;
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			if (BTreeTupleIsPosting(ituple))
 			{
-				/* found the item */
+				int			pi = i + 1;
+				int			nposting = BTreeTupleGetNPosting(ituple);
+				int			j;
+
+				for (j = 0; j < nposting; j++)
+				{
+					ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+					if (!ItemPointerEquals(item, &kitem->heapTid))
+						break;	/* out of posting list loop */
+
+					/* Read-ahead to later kitems */
+					if (pi < numKilled)
+						kitem = &so->currPos.items[so->killedItems[pi++]];
+				}
+
+				if (j == nposting)
+					killtuple = true;
+			}
+			else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+				killtuple = true;
+
+			if (killtuple)
+			{
+				/* found the item/all posting list items */
 				ItemIdMarkDead(iid);
 				killedsomething = true;
 				break;			/* out of inner search loop */
@@ -2017,7 +2056,9 @@ btoptions(Datum reloptions, bool validate)
 	static const relopt_parse_elt tab[] = {
 		{"fillfactor", RELOPT_TYPE_INT, offsetof(BTOptions, fillfactor)},
 		{"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
-		offsetof(BTOptions, vacuum_cleanup_index_scale_factor)}
+		offsetof(BTOptions, vacuum_cleanup_index_scale_factor)},
+		{"deduplication", RELOPT_TYPE_BOOL,
+		offsetof(BTOptions, deduplication)}
 
 	};
 
@@ -2138,6 +2179,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(firstright))
+		{
+			BTreeTupleClearBtIsPosting(pivot);
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			if (keepnatts == natts)
+			{
+				/*
+				 * index_truncate_tuple() just returned a copy of the
+				 * original, so make sure that the size of the new pivot tuple
+				 * doesn't have posting list overhead
+				 */
+				pivot->t_info &= ~INDEX_SIZE_MASK;
+				pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+			}
+		}
+
+		Assert(!BTreeTupleIsPosting(pivot));
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2154,6 +2213,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * attribute to the new pivot tuple.
 		 */
 		Assert(natts != nkeyatts);
+		Assert(!BTreeTupleIsPosting(lastleft) &&
+			   !BTreeTupleIsPosting(firstright));
 		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
 		tidpivot = palloc0(newsize);
 		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2161,6 +2222,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		pfree(pivot);
 		pivot = tidpivot;
 	}
+	else if (BTreeTupleIsPosting(firstright))
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal.  We
+		 * can always truncate away a posting list, though.
+		 *
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+			MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+		pivot->t_info &= ~INDEX_SIZE_MASK;
+		pivot->t_info |= newsize;
+		BTreeTupleClearBtIsPosting(pivot);
+		BTreeTupleSetAltHeapTID(pivot);
+	}
 	else
 	{
 		/*
@@ -2186,6 +2265,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * nbtree (e.g., there is no pg_attribute entry).
 	 */
 	Assert(itup_key->heapkeyspace);
+	Assert(!BTreeTupleIsPosting(pivot));
 	pivot->t_info &= ~INDEX_SIZE_MASK;
 	pivot->t_info |= newsize;
 
@@ -2198,7 +2278,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2209,9 +2289,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
+							  BTreeTupleGetHeapTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #else
 
 	/*
@@ -2224,7 +2307,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2233,7 +2316,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2314,13 +2398,16 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
  * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal after detoasting.
+ * unless they're bitwise equal after detoasting.  When an index is considered
+ * deduplication-safe by _bt_opclasses_support_dedup, routine is guaranteed to
+ * give the same result as _bt_keep_natts would.
  *
- * These issues must be acceptable to callers, typically because they're only
- * concerned about making suffix truncation as effective as possible without
- * leaving excessive amounts of free space on either side of page split.
- * Callers can rely on the fact that attributes considered equal here are
- * definitely also equal according to _bt_keep_natts.
+ * Suffix truncation callers can rely on the fact that attributes considered
+ * equal here are definitely also equal according to _bt_keep_natts, even when
+ * the index uses an opclass or collation that is not deduplication-safe.
+ * This weaker guarantee is good enough for these callers, since false
+ * negatives generally only have the effect of making leaf page splits use a
+ * more balanced split point.
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2398,22 +2485,30 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
+	/* !heapkeyspace indexes do not support deduplication */
+	if (!heapkeyspace && BTreeTupleIsPosting(itup))
+		return false;
+
+	/* INCLUDE indexes do not support deduplication */
+	if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+		return false;
+
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
 			/*
-			 * Non-pivot tuples currently never use alternative heap TID
-			 * representation -- even those within heapkeyspace indexes
+			 * Non-pivot tuple should never be explicitly marked as a pivot
+			 * tuple
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated.  (Note that tupnatts must have been
-			 * inferred, rather than coming from an explicit on-disk
-			 * representation.)
+			 * inferred, even with a posting list tuple, because only pivot
+			 * tuples store tupnatts directly.)
 			 */
 			return tupnatts == natts;
 		}
@@ -2457,12 +2552,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * non-zero, or when there is no explicit representation and the
 			 * tuple is evidently not a pre-pg_upgrade tuple.
 			 *
-			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
-			 * that to decide if the tuple is a pre-v11 tuple.
+			 * Prior to v11, downlinks always had P_HIKEY as their offset.
+			 * Accept that as an alternative indication of a valid
+			 * !heapkeyspace negative infinity tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
-				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+				ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
 		}
 		else
 		{
@@ -2488,7 +2583,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
+		return false;
+
+	/* Pivot tuple should not use posting list representation (redundant) */
+	if (BTreeTupleIsPosting(itup))
 		return false;
 
 	/*
@@ -2558,11 +2657,54 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					BTMaxItemSizeNoHeapTid(page),
 					RelationGetRelationName(rel)),
 			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
-					   ItemPointerGetBlockNumber(&newtup->t_tid),
-					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+					   ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
 					   RelationGetRelationName(heap)),
 			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
 					 "Consider a function index of an MD5 hash of the value, "
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Is it safe to perform deduplication for an index, given the opclasses and
+ * collations used?
+ *
+ * Returned value is stored in index metapage during index builds.  Function
+ * does not account for incompatibilities caused by index being on an earlier
+ * nbtree version.
+ */
+bool
+_bt_opclasses_support_dedup(Relation index)
+{
+	/* INCLUDE indexes don't support deduplication */
+	if (IndexRelationGetNumberOfAttributes(index) !=
+		IndexRelationGetNumberOfKeyAttributes(index))
+		return false;
+
+	/*
+	 * There is no reason why deduplication cannot be used with system catalog
+	 * indexes.  However, we deem it generally unsafe because it's not clear
+	 * how it could be disabled.  (ALTER INDEX is not supported with system
+	 * catalog indexes, so users have no way to set the "deduplicate" storage
+	 * parameter.)
+	 */
+	if (IsCatalogRelation(index))
+		return false;
+
+	for (int i = 0; i < IndexRelationGetNumberOfKeyAttributes(index); i++)
+	{
+		Oid			opfamily = index->rd_opfamily[i];
+		Oid			collation = index->rd_indcollation[i];
+
+		/* TODO add adequate check of opclasses and collations */
+		elog(DEBUG4, "index %s column i %d opfamilyOid %u collationOid %u",
+			 RelationGetRelationName(index), i, opfamily, collation);
+
+		/* NUMERIC btree opfamily OID is 1988 */
+		if (opfamily == 1988)
+			return false;
+	}
+
+	return true;
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 72a601bb22..191ab63a9b 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -22,6 +22,9 @@
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/procarray.h"
+#include "utils/memutils.h"
+
+static MemoryContext opCtx;		/* working memory for operations */
 
 /*
  * _bt_restore_page -- re-enter all the index tuples on a page
@@ -111,6 +114,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
 	Assert(md->btm_version >= BTREE_NOVAC_VERSION);
 	md->btm_oldest_btpo_xact = xlrec->oldest_btpo_xact;
 	md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
+	md->btm_safededup = xlrec->btm_safededup;
 
 	pageop = (BTPageOpaque) PageGetSpecialPointer(metapg);
 	pageop->btpo_flags = BTP_META;
@@ -156,7 +160,8 @@ _bt_clear_incomplete_split(XLogReaderState *record, uint8 block_id)
 }
 
 static void
-btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
+btree_xlog_insert(bool isleaf, bool ismeta, bool posting,
+				  XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	xl_btree_insert *xlrec = (xl_btree_insert *) XLogRecGetData(record);
@@ -181,9 +186,52 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 
 		page = BufferGetPage(buffer);
 
-		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
-						false, false) == InvalidOffsetNumber)
-			elog(PANIC, "btree_xlog_insert: failed to add item");
+		if (likely(!posting))
+		{
+			/* Simple retail insertion */
+			if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+		}
+		else
+		{
+			ItemId		itemid;
+			IndexTuple	oposting,
+						newitem,
+						nposting;
+			uint16		postingoff;
+
+			/*
+			 * A posting list split occurred during leaf page insertion.  WAL
+			 * record data will start with an offset number representing the
+			 * point in an existing posting list that a split occurs at.
+			 *
+			 * Use _bt_swap_posting() to repeat posting list split steps from
+			 * primary.  Note that newitem from WAL record is 'orignewitem',
+			 * not the final version of newitem that is actually inserted on
+			 * page.
+			 */
+			postingoff = *((uint16 *) datapos);
+			datapos += sizeof(uint16);
+			datalen -= sizeof(uint16);
+
+			itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+			oposting = (IndexTuple) PageGetItem(page, itemid);
+
+			/* newitem must be mutable copy for _bt_swap_posting() */
+			Assert(isleaf && postingoff > 0);
+			newitem = CopyIndexTuple((IndexTuple) datapos);
+			nposting = _bt_swap_posting(newitem, oposting, postingoff);
+
+			/* Replace existing posting list with post-split version */
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+			/* insert "final" new item (not orignewitem from WAL stream) */
+			Assert(IndexTupleSize(newitem) == datalen);
+			if (PageAddItem(page, (Item) newitem, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add posting split new item");
+		}
 
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
@@ -265,20 +313,38 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 		BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
 		OffsetNumber off;
 		IndexTuple	newitem = NULL,
-					left_hikey = NULL;
+					left_hikey = NULL,
+					nposting = NULL;
 		Size		newitemsz = 0,
 					left_hikeysz = 0;
 		Page		newlpage;
-		OffsetNumber leftoff;
+		OffsetNumber leftoff,
+					replacepostingoff = InvalidOffsetNumber;
 
 		datapos = XLogRecGetBlockData(record, 0, &datalen);
 
-		if (onleft)
+		if (onleft || xlrec->postingoff != 0)
 		{
 			newitem = (IndexTuple) datapos;
 			newitemsz = MAXALIGN(IndexTupleSize(newitem));
 			datapos += newitemsz;
 			datalen -= newitemsz;
+
+			if (xlrec->postingoff != 0)
+			{
+				ItemId		itemid;
+				IndexTuple	oposting;
+
+				/* Posting list must be at offset number before new item's */
+				replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+				/* newitem must be mutable copy for _bt_swap_posting() */
+				newitem = CopyIndexTuple(newitem);
+				itemid = PageGetItemId(lpage, replacepostingoff);
+				oposting = (IndexTuple) PageGetItem(lpage, itemid);
+				nposting = _bt_swap_posting(newitem, oposting,
+											xlrec->postingoff);
+			}
 		}
 
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
@@ -304,8 +370,20 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 			Size		itemsz;
 			IndexTuple	item;
 
+			/* Add replacement posting list when required */
+			if (off == replacepostingoff)
+			{
+				Assert(onleft || xlrec->firstright == xlrec->newitemoff);
+				if (PageAddItem(newlpage, (Item) nposting,
+								MAXALIGN(IndexTupleSize(nposting)), leftoff,
+								false, false) == InvalidOffsetNumber)
+					elog(ERROR, "failed to add new posting list item to left page after split");
+				leftoff = OffsetNumberNext(leftoff);
+				continue;
+			}
+
 			/* add the new item if it was inserted on left page */
-			if (onleft && off == xlrec->newitemoff)
+			else if (onleft && off == xlrec->newitemoff)
 			{
 				if (PageAddItem(newlpage, (Item) newitem, newitemsz, leftoff,
 								false, false) == InvalidOffsetNumber)
@@ -379,6 +457,84 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 	}
 }
 
+static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		buf;
+	xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+
+	if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Initialize a temporary empty page and copy all the items to that in
+		 * item number order.
+		 */
+		Page		page = (Page) BufferGetPage(buf);
+		OffsetNumber offnum;
+		BTDedupState state;
+
+		state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+
+		state->maxitemsize = BTMaxItemSize(page);
+		state->checkingunique = false;	/* unused */
+		state->skippedbase = InvalidOffsetNumber;
+		state->newitem = NULL;
+		/* Metadata about current pending posting list */
+		state->htids = NULL;
+		state->nhtids = 0;
+		state->nitems = 0;
+		state->alltupsize = 0;
+		state->overlap = false;
+		/* Metadata about based tuple of current pending posting list */
+		state->base = NULL;
+		state->baseoff = InvalidOffsetNumber;
+		state->basetupsize = 0;
+
+		/* Conservatively size array */
+		state->htids = palloc(state->maxitemsize);
+
+		/*
+		 * Iterate over tuples on the page belonging to the interval to
+		 * deduplicate them into a posting list.
+		 */
+		for (offnum = xlrec->baseoff;
+			 offnum < xlrec->baseoff + xlrec->nitems;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemid = PageGetItemId(page, offnum);
+			IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+			Assert(!ItemIdIsDead(itemid));
+
+			if (offnum == xlrec->baseoff)
+			{
+				/*
+				 * No previous/base tuple for first data item -- use first
+				 * data item as base tuple of first pending posting list
+				 */
+				_bt_dedup_start_pending(state, itup, offnum);
+			}
+			else
+			{
+				/* Heap TID(s) for itup will be saved in state */
+				if (!_bt_dedup_save_htid(state, itup))
+					elog(ERROR, "could not add heap tid to pending posting list");
+			}
+		}
+
+		Assert(state->nitems == xlrec->nitems);
+		/* Handle the last item */
+		_bt_dedup_finish_pending(buf, state, false);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buf);
+	}
+
+	if (BufferIsValid(buf))
+		UnlockReleaseBuffer(buf);
+}
+
 static void
 btree_xlog_vacuum(XLogReaderState *record)
 {
@@ -395,7 +551,38 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		page = (Page) BufferGetPage(buffer);
 
-		PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
+		/*
+		 * Must update posting list tuples before deleting whole items, since
+		 * offset numbers are based on original page contents
+		 */
+		if (xlrec->nupdated > 0)
+		{
+			OffsetNumber *updatedoffsets;
+			IndexTuple	updated;
+			Size		itemsz;
+
+			updatedoffsets = (OffsetNumber *)
+				(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+			updated = (IndexTuple) ((char *) updatedoffsets +
+									xlrec->nupdated * sizeof(OffsetNumber));
+
+			/* Handle posting tuples */
+			for (int i = 0; i < xlrec->nupdated; i++)
+			{
+				PageIndexTupleDelete(page, updatedoffsets[i]);
+
+				itemsz = MAXALIGN(IndexTupleSize(updated));
+
+				if (PageAddItem(page, (Item) updated, itemsz, updatedoffsets[i],
+								false, false) == InvalidOffsetNumber)
+					elog(PANIC, "failed to add updated posting list item");
+
+				updated = (IndexTuple) ((char *) updated + itemsz);
+			}
+		}
+
+		if (xlrec->ndeleted)
+			PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 
 		/*
 		 * Mark the page as not containing any LP_DEAD items --- see comments
@@ -729,17 +916,22 @@ void
 btree_redo(XLogReaderState *record)
 {
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	MemoryContext oldCtx;
 
+	oldCtx = MemoryContextSwitchTo(opCtx);
 	switch (info)
 	{
 		case XLOG_BTREE_INSERT_LEAF:
-			btree_xlog_insert(true, false, record);
+			btree_xlog_insert(true, false, false, record);
 			break;
 		case XLOG_BTREE_INSERT_UPPER:
-			btree_xlog_insert(false, false, record);
+			btree_xlog_insert(false, false, false, record);
 			break;
 		case XLOG_BTREE_INSERT_META:
-			btree_xlog_insert(false, true, record);
+			btree_xlog_insert(false, true, false, record);
+			break;
+		case XLOG_BTREE_INSERT_POST:
+			btree_xlog_insert(true, false, true, record);
 			break;
 		case XLOG_BTREE_SPLIT_L:
 			btree_xlog_split(true, record);
@@ -747,6 +939,9 @@ btree_redo(XLogReaderState *record)
 		case XLOG_BTREE_SPLIT_R:
 			btree_xlog_split(false, record);
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			btree_xlog_dedup(record);
+			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
 			break;
@@ -772,6 +967,23 @@ btree_redo(XLogReaderState *record)
 		default:
 			elog(PANIC, "btree_redo: unknown op code %u", info);
 	}
+	MemoryContextSwitchTo(oldCtx);
+	MemoryContextReset(opCtx);
+}
+
+void
+btree_xlog_startup(void)
+{
+	opCtx = AllocSetContextCreate(CurrentMemoryContext,
+								  "Btree recovery temporary context",
+								  ALLOCSET_DEFAULT_SIZES);
+}
+
+void
+btree_xlog_cleanup(void)
+{
+	MemoryContextDelete(opCtx);
+	opCtx = NULL;
 }
 
 /*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 497f8dc77e..23e951aa9e 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -27,6 +27,7 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 		case XLOG_BTREE_INSERT_LEAF:
 		case XLOG_BTREE_INSERT_UPPER:
 		case XLOG_BTREE_INSERT_META:
+		case XLOG_BTREE_INSERT_POST:
 			{
 				xl_btree_insert *xlrec = (xl_btree_insert *) rec;
 
@@ -38,15 +39,27 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
-				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d",
-								 xlrec->level, xlrec->firstright, xlrec->newitemoff);
+				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, postingoff %d",
+								 xlrec->level,
+								 xlrec->firstright,
+								 xlrec->newitemoff,
+								 xlrec->postingoff);
+				break;
+			}
+		case XLOG_BTREE_DEDUP_PAGE:
+			{
+				xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+				appendStringInfo(buf, "baseoff %u; nitems %u",
+								 xlrec->baseoff, xlrec->nitems);
 				break;
 			}
 		case XLOG_BTREE_VACUUM:
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "ndeleted %u", xlrec->ndeleted);
+				appendStringInfo(buf, "ndeleted %u; nupdated %u",
+								 xlrec->ndeleted, xlrec->nupdated);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
@@ -130,6 +143,12 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			id = "DEDUPLICATE";
+			break;
+		case XLOG_BTREE_INSERT_POST:
+			id = "INSERT_POST";
+			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ba4edde71a..6b5d36de57 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -28,6 +28,7 @@
 
 #include "access/commit_ts.h"
 #include "access/gin.h"
+#include "access/nbtree.h"
 #include "access/rmgr.h"
 #include "access/tableam.h"
 #include "access/transam.h"
@@ -363,6 +364,23 @@ static const struct config_enum_entry backslash_quote_options[] = {
 	{NULL, 0, false}
 };
 
+/*
+ * Although only "on", "off", and "nonunique" are documented, we accept all
+ * the likely variants of "on" and "off".
+ */
+static const struct config_enum_entry btree_deduplication_options[] = {
+	{"off", DEDUP_OFF, false},
+	{"on", DEDUP_ON, false},
+	{"nonunique", DEDUP_NONUNIQUE, false},
+	{"false", DEDUP_OFF, true},
+	{"true", DEDUP_ON, true},
+	{"no", DEDUP_OFF, true},
+	{"yes", DEDUP_ON, true},
+	{"0", DEDUP_OFF, true},
+	{"1", DEDUP_ON, true},
+	{NULL, 0, false}
+};
+
 /*
  * Although only "on", "off", and "partition" are documented, we
  * accept all the likely variants of "on" and "off".
@@ -4271,6 +4289,16 @@ static struct config_enum ConfigureNamesEnum[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"btree_deduplication", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Enables B-tree index deduplication optimization."),
+			NULL
+		},
+		&btree_deduplication,
+		DEDUP_NONUNIQUE, btree_deduplication_options,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"bytea_output", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Sets the output format for bytea."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 46a06ffacd..0b8aa56b3a 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -650,6 +650,7 @@
 #vacuum_cleanup_index_scale_factor = 0.1	# fraction of total number of tuples
 						# before index cleanup, 0 always performs
 						# index cleanup
+#btree_deduplication = 'nonunique'	# off, on, or nonunique
 #bytea_output = 'hex'			# hex, escape
 #xmlbinary = 'base64'
 #xmloption = 'content'
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index df26826993..7e55c0ff90 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1677,14 +1677,14 @@ psql_completion(const char *text, int start, int end)
 	/* ALTER INDEX <foo> SET|RESET ( */
 	else if (Matches("ALTER", "INDEX", MatchAny, "RESET", "("))
 		COMPLETE_WITH("fillfactor",
-					  "vacuum_cleanup_index_scale_factor",	/* BTREE */
+					  "vacuum_cleanup_index_scale_factor", "deduplication",	/* BTREE */
 					  "fastupdate", "gin_pending_list_limit",	/* GIN */
 					  "buffering",	/* GiST */
 					  "pages_per_range", "autosummarize"	/* BRIN */
 			);
 	else if (Matches("ALTER", "INDEX", MatchAny, "SET", "("))
 		COMPLETE_WITH("fillfactor =",
-					  "vacuum_cleanup_index_scale_factor =",	/* BTREE */
+					  "vacuum_cleanup_index_scale_factor =", "deduplication =",	/* BTREE */
 					  "fastupdate =", "gin_pending_list_limit =",	/* GIN */
 					  "buffering =",	/* GiST */
 					  "pages_per_range =", "autosummarize ="	/* BRIN */
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 3542545de5..8b1223a817 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -145,6 +145,7 @@ static void bt_tuple_present_callback(Relation index, ItemPointer tid,
 									  bool tupleIsAlive, void *checkstate);
 static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 									 IndexTuple itup);
+static inline IndexTuple bt_posting_logical_tuple(IndexTuple itup, int n);
 static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 											   OffsetNumber offset);
@@ -419,12 +420,13 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		/*
 		 * Size Bloom filter based on estimated number of tuples in index,
 		 * while conservatively assuming that each block must contain at least
-		 * MaxIndexTuplesPerPage / 5 non-pivot tuples.  (Non-leaf pages cannot
-		 * contain non-pivot tuples.  That's okay because they generally make
-		 * up no more than about 1% of all pages in the index.)
+		 * MaxBTreeIndexTuplesPerPage / 3 "logical" tuples.  heapallindexed
+		 * verification fingerprints posting list heap TIDs as plain non-pivot
+		 * tuples, complete with index keys.  This allows its heap scan to
+		 * behave as if posting lists do not exist.
 		 */
 		total_pages = RelationGetNumberOfBlocks(rel);
-		total_elems = Max(total_pages * (MaxIndexTuplesPerPage / 5),
+		total_elems = Max(total_pages * (MaxBTreeIndexTuplesPerPage / 3),
 						  (int64) state->rel->rd_rel->reltuples);
 		/* Random seed relies on backend srandom() call to avoid repetition */
 		seed = random();
@@ -924,6 +926,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer	scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -994,29 +997,72 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
+			ItemPointer tid = BTreeTupleGetHeapTID(itup);
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
-			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+			htid = psprintf("(%u,%u)", ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("could not find tuple using search from root page in index \"%s\"",
 							RelationGetRelationName(state->rel)),
-					 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+					 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
 										itid, htid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is actually a posting list, make sure posting list TIDs
+		 * are in order.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer		current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+				current = BTreeTupleGetPostingN(itup, i);
+
+				if (ItemPointerCompare(current, &last) <= 0)
+				{
+					char	   *itid,
+							   *htid;
+
+					itid = psprintf("(%u,%u)", state->targetblock, offset);
+					htid = psprintf("(%u,%u)",
+									ItemPointerGetBlockNumberNoCheck(current),
+									ItemPointerGetOffsetNumberNoCheck(current));
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+												itid, htid,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1074,12 +1120,32 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				/* Fingerprint all elements as distinct "logical" tuples */
+				for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					IndexTuple	logtuple;
+
+					logtuple = bt_posting_logical_tuple(itup, i);
+					norm = bt_normalize_tuple(state, logtuple);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != logtuple)
+						pfree(norm);
+					pfree(logtuple);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
@@ -1087,7 +1153,8 @@ bt_target_page_check(BtreeCheckState *state)
 		 *
 		 * If there is a high key (if this is not the rightmost page on its
 		 * entire level), check that high key actually is upper bound on all
-		 * page items.
+		 * page items.  If this is a posting list tuple, we'll need to set
+		 * scantid to be highest TID in posting list.
 		 *
 		 * We prefer to check all items against high key rather than checking
 		 * just the last and trusting that the operator class obeys the
@@ -1127,6 +1194,9 @@ bt_target_page_check(BtreeCheckState *state)
 		 * tuple. (See also: "Notes About Data Representation" in the nbtree
 		 * README.)
 		 */
+		scantid = skey->scantid;
+		if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+			skey->scantid = BTreeTupleGetMaxHeapTID(itup);
 		if (!P_RIGHTMOST(topaque) &&
 			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
 			  invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1220,7 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
+		skey->scantid = scantid;
 
 		/*
 		 * * Item order check *
@@ -1160,15 +1231,17 @@ bt_target_page_check(BtreeCheckState *state)
 		if (OffsetNumberNext(offset) <= max &&
 			!invariant_l_offset(state, skey, OffsetNumberNext(offset)))
 		{
+			ItemPointer tid;
 			char	   *itid,
 					   *htid,
 					   *nitid,
 					   *nhtid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetHeapTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 			nitid = psprintf("(%u,%u)", state->targetblock,
 							 OffsetNumberNext(offset));
 
@@ -1177,9 +1250,11 @@ bt_target_page_check(BtreeCheckState *state)
 										  state->target,
 										  OffsetNumberNext(offset));
 			itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+			tid = BTreeTupleGetHeapTID(itup);
 			nhtid = psprintf("(%u,%u)",
-							 ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							 ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							 ItemPointerGetBlockNumberNoCheck(tid),
+							 ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1264,10 @@ bt_target_page_check(BtreeCheckState *state)
 										"higher index tid=%s (points to %s tid=%s) "
 										"page lsn=%X/%X.",
 										itid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										htid,
 										nitid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										nhtid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
@@ -1953,10 +2028,10 @@ bt_tuple_present_callback(Relation index, ItemPointer tid, Datum *values,
  * verification.  In particular, it won't try to normalize opclass-equal
  * datums with potentially distinct representations (e.g., btree/numeric_ops
  * index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though.  For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation.  Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
  */
 static IndexTuple
 bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -1969,6 +2044,9 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	IndexTuple	reformed;
 	int			i;
 
+	/* Caller should only pass "logical" non-pivot tuples here */
+	Assert(!BTreeTupleIsPosting(itup) && !BTreeTupleIsPivot(itup));
+
 	/* Easy case: It's immediately clear that tuple has no varlena datums */
 	if (!IndexTupleHasVarwidths(itup))
 		return itup;
@@ -2031,6 +2109,30 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	return reformed;
 }
 
+/*
+ * Produce palloc()'d "logical" tuple for nth posting list entry.
+ *
+ * In general, deduplication is not supposed to change the logical contents of
+ * an index.  Multiple logical index tuples are folded together into one
+ * physical posting list index tuple when convenient.
+ *
+ * heapallindexed verification must normalize-away this variation in
+ * representation by converting posting list tuples into two or more "logical"
+ * tuples.  Each logical tuple must be fingerprinted separately -- there must
+ * be one logical tuple for each corresponding Bloom filter probe during the
+ * heap scan.
+ *
+ * Note: Caller needs to call bt_normalize_tuple() with returned tuple.
+ */
+static inline IndexTuple
+bt_posting_logical_tuple(IndexTuple itup, int n)
+{
+	Assert(BTreeTupleIsPosting(itup));
+
+	/* Returns non-posting-list tuple */
+	return _bt_form_posting(itup, BTreeTupleGetPostingN(itup, n), 1);
+}
+
 /*
  * Search for itup in index, starting from fast root page.  itup must be a
  * non-pivot tuple.  This is only supported with heapkeyspace indexes, since
@@ -2087,6 +2189,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		insertstate.itup = itup;
 		insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 		insertstate.itup_key = key;
+		insertstate.postingoff = 0;
 		insertstate.bounds_valid = false;
 		insertstate.buf = lbuf;
 
@@ -2094,7 +2197,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		offnum = _bt_binsrch_insert(state->rel, &insertstate);
 		/* Compare first >= matching item on leaf page, if any */
 		page = BufferGetPage(lbuf);
+		/* Should match on first heap TID when tuple has a posting list */
 		if (offnum <= PageGetMaxOffsetNumber(page) &&
+			insertstate.postingoff <= 0 &&
 			_bt_compare(state->rel, key, page, offnum) == 0)
 			exists = true;
 		_bt_relbuf(state->rel, lbuf);
@@ -2548,26 +2653,25 @@ PageGetItemIdCareful(BtreeCheckState *state, BlockNumber block, Page page,
 }
 
 /*
- * BTreeTupleGetHeapTID() wrapper that lets caller enforce that a heap TID must
- * be present in cases where that is mandatory.
- *
- * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
- * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
- * It may become more useful in the future, when non-pivot tuples support their
- * own alternative INDEX_ALT_TID_MASK representation.
+ * BTreeTupleGetHeapTID() wrapper that enforces that a heap TID is present in
+ * cases where that is mandatory (i.e. for non-pivot tuples).
  */
 static inline ItemPointer
 BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
 							bool nonpivot)
 {
-	ItemPointer result = BTreeTupleGetHeapTID(itup);
-	BlockNumber targetblock = state->targetblock;
+	Assert(state->heapkeyspace);
 
-	if (result == NULL && nonpivot)
+	/*
+	 * Make sure that tuple type (pivot vs non-pivot) matches caller's
+	 * expectation
+	 */
+	if (BTreeTupleIsPivot(itup) == nonpivot)
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
-						targetblock, RelationGetRelationName(state->rel))));
+						state->targetblock,
+						RelationGetRelationName(state->rel))));
 
-	return result;
+	return BTreeTupleGetHeapTID(itup);
 }
diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
index 5881ea5dd6..13d9b2ff96 100644
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@@ -433,11 +433,130 @@ returns bool
 <sect1 id="btree-implementation">
  <title>Implementation</title>
 
+ <para>
+  Internally, a B-tree index consists of a tree structure with leaf
+  pages.  Each leaf page contains tuples that point to table entries
+  using a heap item pointer. Each tuple's key is considered unique
+  internally, since the item pointer is treated as part of the key.
+ </para>
+ <para>
+  An introduction to the btree index implementation can be found in
+  <filename>src/backend/access/nbtree/README</filename>.
+ </para>
+
+ <sect2 id="btree-deduplication">
+  <title>Deduplication</title>
   <para>
-   An introduction to the btree index implementation can be found in
-   <filename>src/backend/access/nbtree/README</filename>.
+   B-Tree supports <firstterm>deduplication</firstterm>.  Existing
+   leaf page tuples with fully equal keys (equal prior to the heap
+   item pointer) are merged together into a single <quote>posting
+   list</quote> tuple.  The keys appear only once in this
+   representation.  A simple array of heap item pointers follows.
+   Posting lists are formed <quote>lazily</quote>, when a new item is
+   inserted that cannot fit on an existing leaf page.  The immediate
+   goal of the deduplication process is to at least free enough space
+   to fit the new item; otherwise a leaf page split occurs, which
+   allocates a new leaf page.  The <firstterm>key space</firstterm>
+   covered by the original leaf page is shared among the original page,
+   and its new right sibling page.
+  </para>
+  <para>
+   Deduplication can greatly increase index space efficiency with data
+   sets where each distinct key appears at least a few times on
+   average.  It can also reduce the cost of subsequent index scans,
+   especially when many leaf pages must be accessed.  For example, an
+   index on a simple <type>integer</type> column that uses
+   deduplication will have a storage size that is only about 65% of an
+   equivalent unoptimized index when each distinct
+   <type>integer</type> value appears three times.  If each distinct
+   <type>integer</type> value appears six times, the storage overhead
+   can be as low as 50% of baseline.  With hundreds of duplicates per
+   distinct value (or with larger <quote>base</quote> key values) a
+   storage size of about <emphasis>one third</emphasis> of the
+   unoptimized case is expected.  There is often a direct benefit for
+   queries, as well as an indirect benefit due to reduced I/O during
+   routine vacuuming.
+  </para>
+  <para>
+   Cases that don't benefit due to having no duplicate values will
+   incur a small performance penalty with mixed read-write workloads.
+   There is no performance penalty with read-only workloads, since
+   reading from posting lists is at least as efficient as reading the
+   standard index tuple representation.
+  </para>
+ </sect2>
+
+ <sect2 id="btree-deduplication-configure">
+  <title>Configuring Deduplication</title>
+
+  <para>
+   The <xref linkend="guc-btree-deduplication"/> configuration
+   parameter controls deduplication.  By default, deduplication is
+   only used with non-unique indexes.  The
+   <literal>deduplication</literal> storage parameter can be used to
+   override the configuration paramater for individual indexes.  See
+   <xref linkend="sql-createindex-storage-parameters"/> from the
+   <command>CREATE INDEX</command> documentation for details.
+  </para>
+ </sect2>
+
+ <sect2 id="btree-deduplication-unique">
+  <title>Unique indexes and Deduplication</title>
+
+  <para>
+   Unique indexes can also use deduplication, despite the fact that
+   unique indexes do not <emphasis>logically</emphasis> contain
+   duplicates; implementation-level <emphasis>physical</emphasis>
+   duplicates may still be present.  Unique indexes that are prone to
+   becoming bloated due to a short term burst in updates are good
+   candidates.  <command>VACUUM</command> will eventually remove dead
+   versions of tuples from unique indexes, but it may not be possible
+   for it to do so before some number of <quote>unnecessary</quote>
+   page splits have taken place.  Deduplication can prevent these page
+   splits from happening.  Note that page splits can only be reversed
+   by <command>VACUUM</command> when the page is
+   <emphasis>completely</emphasis> empty, which isn't expected in this
+   scenario.
+  </para>
+  <para>
+   In other cases, deduplication can be effective with unique indexes
+   just because of the presence of many <literal>NULL</literal> values
+   in the unique index.  The influence of <xref
+   linkend="guc-vacuum-cleanup-index-scale-factor"/> must also be
+   considered.
+  </para>
+  <para>
+   For more information about automatic and manual vacuuming, see
+   <xref linkend="routine-vacuuming"/>.  Note that the heap-only tuple
+   (<acronym>HOT</acronym>) optimization can also prevent page splits
+   caused only by versioned tuples rather than by insertions of new
+   values.
   </para>
 
+ </sect2>
+
+ <sect2 id="btree-deduplication-restrictions">
+  <title>Restrictions</title>
+
+  <para>
+   Deduplication can only be used with indexes that use B-Tree
+   operator classes that were declared <literal>BITWISE</literal>.  In
+   practice almost all datatypes support deduplication, though
+   <type>numeric</type> is a notable exception (the <quote>display
+   scale</quote> feature makes it impossible to enable deduplication
+   without losing useful information about equal <type>numeric</type>
+   datums).  Deduplication is not supported with nondeterministic
+   collations, nor is it supported with <literal>INCLUDE</literal>
+   indexes.
+  </para>
+  <para>
+   Note that a multicolumn index is only considered to have duplicates
+   when there are index entries that repeat entire
+   <emphasis>combinations</emphasis> of values (the values stored in
+   each and every column must be equal).
+  </para>
+
+ </sect2>
 </sect1>
 
 </chapter>
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 55669b5cad..9f371d3e3a 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -928,10 +928,11 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
      nondeterministic collations give a more <quote>correct</quote> behavior,
      especially when considering the full power of Unicode and its many
      special cases, they also have some drawbacks.  Foremost, their use leads
-     to a performance penalty.  Also, certain operations are not possible with
-     nondeterministic collations, such as pattern matching operations.
-     Therefore, they should be used only in cases where they are specifically
-     wanted.
+     to a performance penalty.  Note, in particular, that B-tree cannot use
+     deduplication with indexes that use a nondeterministic collation.  Also,
+     certain operations are not possible with nondeterministic collations,
+     such as pattern matching operations.  Therefore, they should be used
+     only in cases where they are specifically wanted.
     </para>
    </sect3>
   </sect2>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d4d1fe45cc..6f89e4a51f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8000,6 +8000,39 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-btree-deduplication" xreflabel="btree_deduplication">
+      <term><varname>btree_deduplication</varname> (<type>enum</type>)
+      <indexterm>
+       <primary><varname>btree_deduplication</varname></primary>
+       <secondary>configuration parameter</secondary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Controls the use of deduplication within B-Tree indexes.
+        Deduplication is an optimization that reduces the storage size
+        of indexes by storing equal index keys only once.  See <xref
+        linkend="btree-deduplication"/> for more information.
+       </para>
+
+       <para>
+        In addition to <literal>off</literal>, to disable, there are
+        two modes: <literal>on</literal>, and
+        <literal>nonunique</literal>.  When
+        <varname>btree_deduplication</varname> is set to
+        <literal>nonunique</literal>, the default, deduplication is
+        only used for non-unique B-Tree indexes.
+       </para>
+
+       <para>
+        This setting can be overridden for individual B-Tree indexes
+        by changing index storage parameters.  See <xref
+        linkend="sql-createindex-storage-parameters"/> from the
+        <command>CREATE INDEX</command> documentation for details.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-bytea-output" xreflabel="bytea_output">
       <term><varname>bytea_output</varname> (<type>enum</type>)
       <indexterm>
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index ec8bdcd7a4..695aa9123d 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -887,6 +887,14 @@ analyze threshold = analyze base threshold + analyze scale factor * number of tu
    might be worthwhile to reindex periodically just to improve access speed.
   </para>
 
+  <tip>
+  <para>
+   Enabling B-tree deduplication in unique indexes can be an effective
+   way to control index bloat in extreme cases.  See <xref
+   linkend="btree-deduplication-unique"/> for details.
+  </para>
+  </tip>
+
   <para>
    <xref linkend="sql-reindex"/> can be used safely and easily in all cases.
    This command requires an <literal>ACCESS EXCLUSIVE</literal> lock by
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 629a31ef79..abc7db4820 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -166,6 +166,8 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
         maximum size allowed for the index type, data insertion will fail.
         In any case, non-key columns duplicate data from the index's table
         and bloat the size of the index, thus potentially slowing searches.
+        Moreover, B-tree deduplication is never used with indexes that
+        have a non-key column.
        </para>
 
        <para>
@@ -388,10 +390,39 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    </variablelist>
 
    <para>
-    B-tree indexes additionally accept this parameter:
+    B-tree indexes also accept these parameters:
    </para>
 
    <variablelist>
+   <varlistentry id="index-reloption-deduplication" xreflabel="deduplication">
+    <term><literal>deduplication</literal>
+     <indexterm>
+      <primary><varname>deduplication</varname></primary>
+      <secondary>storage parameter</secondary>
+     </indexterm>
+    </term>
+    <listitem>
+    <para>
+      Per-index value for <xref linkend="guc-btree-deduplication"/>.
+      Controls usage of the B-tree deduplication technique described
+      in <xref linkend="btree-deduplication"/>.  Set to
+      <literal>ON</literal> or <literal>OFF</literal> to override GUC.
+      (Alternative spellings of <literal>ON</literal> and
+      <literal>OFF</literal> are allowed as described in <xref
+      linkend="config-setting"/>.)
+    </para>
+
+    <note>
+     <para>
+      Turning <literal>deduplication</literal> off via <command>ALTER
+      INDEX</command> prevents future insertions from triggering
+      deduplication, but does not in itself make existing posting list
+      tuples use the standard tuple representation.
+     </para>
+    </note>
+    </listitem>
+   </varlistentry>
+
    <varlistentry id="index-reloption-vacuum-cleanup-index-scale-factor" xreflabel="vacuum_cleanup_index_scale_factor">
     <term><literal>vacuum_cleanup_index_scale_factor</literal>
      <indexterm>
@@ -446,9 +477,7 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
      This setting controls usage of the fast update technique described in
      <xref linkend="gin-fast-update"/>.  It is a Boolean parameter:
      <literal>ON</literal> enables fast update, <literal>OFF</literal> disables it.
-     (Alternative spellings of <literal>ON</literal> and <literal>OFF</literal> are
-     allowed as described in <xref linkend="config-setting"/>.)  The
-     default is <literal>ON</literal>.
+     The default is <literal>ON</literal>.
     </para>
 
     <note>
@@ -831,6 +860,13 @@ CREATE UNIQUE INDEX title_idx ON films (title) WITH (fillfactor = 70);
 </programlisting>
   </para>
 
+  <para>
+   To create a unique index with deduplication enabled:
+<programlisting>
+CREATE UNIQUE INDEX title_idx ON films (title) WITH (deduplication = on);
+</programlisting>
+  </para>
+
   <para>
    To create a <acronym>GIN</acronym> index with fast updates disabled:
 <programlisting>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 10881ab03a..c9a5349019 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -58,8 +58,9 @@ REINDEX [ ( VERBOSE ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } [ CONCURR
 
     <listitem>
      <para>
-      You have altered a storage parameter (such as fillfactor)
-      for an index, and wish to ensure that the change has taken full effect.
+      You have altered a storage parameter (such as fillfactor or
+      deduplication) for an index, and wish to ensure that the change has
+      taken full effect.
      </para>
     </listitem>
 
diff --git a/src/test/regress/expected/btree_index.out b/src/test/regress/expected/btree_index.out
index f567117a46..53bcd1f30a 100644
--- a/src/test/regress/expected/btree_index.out
+++ b/src/test/regress/expected/btree_index.out
@@ -266,6 +266,22 @@ select * from btree_bpchar where f1::bpchar like 'foo%';
  fool
 (2 rows)
 
+--
+-- Test deduplication within a unique index
+--
+CREATE TABLE dedup_unique_test_table (a int) WITH (autovacuum_enabled=false);
+CREATE UNIQUE INDEX dedup_unique ON dedup_unique_test_table (a) WITH (deduplication=on);
+CREATE UNIQUE INDEX plain_unique ON dedup_unique_test_table (a) WITH (deduplication=off);
+-- Generate enough garbage tuples in index to ensure that even the unique index
+-- with deduplication enabled has to check multiple leaf pages during unique
+-- checking (at least with a BLCKSZ of 8192 or less)
+DO $$
+BEGIN
+    FOR r IN 1..1350 LOOP
+        DELETE FROM dedup_unique_test_table;
+        INSERT INTO dedup_unique_test_table SELECT 1;
+    END LOOP;
+END$$;
 --
 -- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
diff --git a/src/test/regress/sql/btree_index.sql b/src/test/regress/sql/btree_index.sql
index 558dcae0ec..f008a5a55f 100644
--- a/src/test/regress/sql/btree_index.sql
+++ b/src/test/regress/sql/btree_index.sql
@@ -103,6 +103,23 @@ explain (costs off)
 select * from btree_bpchar where f1::bpchar like 'foo%';
 select * from btree_bpchar where f1::bpchar like 'foo%';
 
+--
+-- Test deduplication within a unique index
+--
+CREATE TABLE dedup_unique_test_table (a int) WITH (autovacuum_enabled=false);
+CREATE UNIQUE INDEX dedup_unique ON dedup_unique_test_table (a) WITH (deduplication=on);
+CREATE UNIQUE INDEX plain_unique ON dedup_unique_test_table (a) WITH (deduplication=off);
+-- Generate enough garbage tuples in index to ensure that even the unique index
+-- with deduplication enabled has to check multiple leaf pages during unique
+-- checking (at least with a BLCKSZ of 8192 or less)
+DO $$
+BEGIN
+    FOR r IN 1..1350 LOOP
+        DELETE FROM dedup_unique_test_table;
+        INSERT INTO dedup_unique_test_table SELECT 1;
+    END LOOP;
+END$$;
+
 --
 -- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
-- 
2.17.1

v25-0001-Remove-dead-pin-scan-code-from-nbtree-VACUUM.patchapplication/x-patch; name=v25-0001-Remove-dead-pin-scan-code-from-nbtree-VACUUM.patchDownload

From a5c2da1fb4c9b528bc2ea5563cc74b65a5fcc8c5 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 20 Nov 2019 16:21:47 -0800
Subject: [PATCH v25 1/4] Remove dead "pin scan" code from nbtree VACUUM.

Finish off the work of commit 3e4b7d87 by completely removing the "pin
scan" code previously used by nbtree VACUUM:

* Don't track lastBlockVacuumed within nbtree.c VACUUM code anymore.

* Remove the lastBlockVacuumed field from xl_btree_vacuum WAL records
(nbtree leaf page VACUUM records).

* Remove the unnecessary extra call to _bt_delitems_vacuum() made
against the last block.  This occurred when VACUUM didn't have index
tuples to kill on the final block in the index, based on the assumption
that a final "pin scan" was still needed.   (Clearly a final pin scan
can never take place here, since the entire pin scan mechanism was
totally disabled by commit 3e4b7d87.)

Also, add a new ndeleted metadata field to xl_btree_vacuum, to replace
the unneeded lastBlockVacuumed field.  This isn't really needed either,
since we could continue to infer the array length in nbtxlog.c by using
the overall record length.  However, it will become useful when the
upcoming deduplication patch needs to add an "items updated" field to go
alongside it (besides, it doesn't seem like a good idea to leave the
xl_btree_vacuum struct without any fields; the C standard says that
that's undefined).

Discussion: https://postgr.es/m/CAH2-Wzn2pSqEOcBDAA40CnO82oEy-EOpE2bNh_XL_cfFoA86jw@mail.gmail.com
---
 src/include/access/nbtree.h           |  3 +-
 src/include/access/nbtxlog.h          | 25 ++-----
 src/backend/access/nbtree/nbtpage.c   | 35 +++++-----
 src/backend/access/nbtree/nbtree.c    | 74 ++-------------------
 src/backend/access/nbtree/nbtxlog.c   | 95 +--------------------------
 src/backend/access/rmgrdesc/nbtdesc.c |  3 +-
 6 files changed, 28 insertions(+), 207 deletions(-)

diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 18a2a3e71c..9833cc10bd 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -779,8 +779,7 @@ extern bool _bt_page_recyclable(Page page);
 extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems, Relation heapRel);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
-								OffsetNumber *itemnos, int nitems,
-								BlockNumber lastBlockVacuumed);
+								OffsetNumber *deletable, int ndeletable);
 extern int	_bt_pagedel(Relation rel, Buffer buf);
 
 /*
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 91b9ee00cf..71435a13b3 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -150,32 +150,17 @@ typedef struct xl_btree_reuse_page
  * The WAL record can represent deletion of any number of index tuples on a
  * single index page when executed by VACUUM.
  *
- * For MVCC scans, lastBlockVacuumed will be set to InvalidBlockNumber.
- * For a non-MVCC index scans there is an additional correctness requirement
- * for applying these changes during recovery, which is that we must do one
- * of these two things for every block in the index:
- *		* lock the block for cleanup and apply any required changes
- *		* EnsureBlockUnpinned()
- * The purpose of this is to ensure that no index scans started before we
- * finish scanning the index are still running by the time we begin to remove
- * heap tuples.
- *
- * Any changes to any one block are registered on just one WAL record. All
- * blocks that we need to run EnsureBlockUnpinned() are listed as a block range
- * starting from the last block vacuumed through until this one. Individual
- * block numbers aren't given.
- *
- * Note that the *last* WAL record in any vacuum of an index is allowed to
- * have a zero length array of offsets. Earlier records must have at least one.
+ * Note that the WAL record in any vacuum of an index must have at least one
+ * item to delete.
  */
 typedef struct xl_btree_vacuum
 {
-	BlockNumber lastBlockVacuumed;
+	uint32		ndeleted;
 
-	/* TARGET OFFSET NUMBERS FOLLOW */
+	/* DELETED TARGET OFFSET NUMBERS FOLLOW */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 268f869a36..66c79623cf 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -968,32 +968,27 @@ _bt_page_recyclable(Page page)
  * deleting the page it points to.
  *
  * This routine assumes that the caller has pinned and locked the buffer.
- * Also, the given itemnos *must* appear in increasing order in the array.
+ * Also, the given deletable array *must* be sorted in ascending order.
  *
- * We record VACUUMs and b-tree deletes differently in WAL. InHotStandby
- * we need to be able to pin all of the blocks in the btree in physical
- * order when replaying the effects of a VACUUM, just as we do for the
- * original VACUUM itself. lastBlockVacuumed allows us to tell whether an
- * intermediate range of blocks has had no changes at all by VACUUM,
- * and so must be scanned anyway during replay. We always write a WAL record
- * for the last block in the index, whether or not it contained any items
- * to be removed. This allows us to scan right up to end of index to
- * ensure correct locking.
+ * We record VACUUMs and b-tree deletes differently in WAL.  Deletes must
+ * generate recovery conflicts by accessing the heap inline, whereas VACUUMs
+ * can rely on the initial heap scan taking care of the problem (pruning would
+ * have generated the conflicts needed for hot standby already).
  */
 void
-_bt_delitems_vacuum(Relation rel, Buffer buf,
-					OffsetNumber *itemnos, int nitems,
-					BlockNumber lastBlockVacuumed)
+_bt_delitems_vacuum(Relation rel, Buffer buf, OffsetNumber *deletable,
+					int ndeletable)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
 
+	Assert(ndeletable > 0);
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
 	/* Fix the page */
-	if (nitems > 0)
-		PageIndexMultiDelete(page, itemnos, nitems);
+	PageIndexMultiDelete(page, deletable, ndeletable);
 
 	/*
 	 * We can clear the vacuum cycle ID since this page has certainly been
@@ -1019,7 +1014,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		XLogRecPtr	recptr;
 		xl_btree_vacuum xlrec_vacuum;
 
-		xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+		xlrec_vacuum.ndeleted = ndeletable;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1030,8 +1025,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		 * is.  When XLogInsert stores the whole buffer, the offsets array
 		 * need not be stored too.
 		 */
-		if (nitems > 0)
-			XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
+		XLogRegisterBufData(0, (char *) deletable, ndeletable *
+							sizeof(OffsetNumber));
 
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
@@ -1050,8 +1045,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
  * Also, the given itemnos *must* appear in increasing order in the array.
  *
  * This is nearly the same as _bt_delitems_vacuum as far as what it does to
- * the page, but the WAL logging considerations are quite different.  See
- * comments for _bt_delitems_vacuum.
+ * the page, but it needs to generate its own recovery conflicts by accessing
+ * the heap.  See comments for _bt_delitems_vacuum.
  */
 void
 _bt_delitems_delete(Relation rel, Buffer buf,
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index c67235ab80..bbc1376b0a 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -46,8 +46,6 @@ typedef struct
 	IndexBulkDeleteCallback callback;
 	void	   *callback_state;
 	BTCycleId	cycleid;
-	BlockNumber lastBlockVacuumed;	/* highest blkno actually vacuumed */
-	BlockNumber lastBlockLocked;	/* highest blkno we've cleanup-locked */
 	BlockNumber totFreePages;	/* true total # of free pages */
 	TransactionId oldestBtpoXact;
 	MemoryContext pagedelcontext;
@@ -978,8 +976,6 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	vstate.callback = callback;
 	vstate.callback_state = callback_state;
 	vstate.cycleid = cycleid;
-	vstate.lastBlockVacuumed = BTREE_METAPAGE;	/* Initialise at first block */
-	vstate.lastBlockLocked = BTREE_METAPAGE;
 	vstate.totFreePages = 0;
 	vstate.oldestBtpoXact = InvalidTransactionId;
 
@@ -1040,39 +1036,6 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 		}
 	}
 
-	/*
-	 * Check to see if we need to issue one final WAL record for this index,
-	 * which may be needed for correctness on a hot standby node when non-MVCC
-	 * index scans could take place.
-	 *
-	 * If the WAL is replayed in hot standby, the replay process needs to get
-	 * cleanup locks on all index leaf pages, just as we've been doing here.
-	 * However, we won't issue any WAL records about pages that have no items
-	 * to be deleted.  For pages between pages we've vacuumed, the replay code
-	 * will take locks under the direction of the lastBlockVacuumed fields in
-	 * the XLOG_BTREE_VACUUM WAL records.  To cover pages after the last one
-	 * we vacuum, we need to issue a dummy XLOG_BTREE_VACUUM WAL record
-	 * against the last leaf page in the index, if that one wasn't vacuumed.
-	 */
-	if (XLogStandbyInfoActive() &&
-		vstate.lastBlockVacuumed < vstate.lastBlockLocked)
-	{
-		Buffer		buf;
-
-		/*
-		 * The page should be valid, but we can't use _bt_getbuf() because we
-		 * want to use a nondefault buffer access strategy.  Since we aren't
-		 * going to delete any items, getting cleanup lock again is probably
-		 * overkill, but for consistency do that anyway.
-		 */
-		buf = ReadBufferExtended(rel, MAIN_FORKNUM, vstate.lastBlockLocked,
-								 RBM_NORMAL, info->strategy);
-		LockBufferForCleanup(buf);
-		_bt_checkpage(rel, buf);
-		_bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
-		_bt_relbuf(rel, buf);
-	}
-
 	MemoryContextDelete(vstate.pagedelcontext);
 
 	/*
@@ -1203,13 +1166,6 @@ restart:
 		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 		LockBufferForCleanup(buf);
 
-		/*
-		 * Remember highest leaf page number we've taken cleanup lock on; see
-		 * notes in btvacuumscan
-		 */
-		if (blkno > vstate->lastBlockLocked)
-			vstate->lastBlockLocked = blkno;
-
 		/*
 		 * Check whether we need to recurse back to earlier pages.  What we
 		 * are concerned about is a page split that happened since we started
@@ -1245,9 +1201,9 @@ restart:
 				htup = &(itup->t_tid);
 
 				/*
-				 * During Hot Standby we currently assume that
-				 * XLOG_BTREE_VACUUM records do not produce conflicts. That is
-				 * only true as long as the callback function depends only
+				 * During Hot Standby we currently assume that it's okay that
+				 * XLOG_BTREE_VACUUM records do not produce conflicts. This is
+				 * only safe as long as the callback function depends only
 				 * upon whether the index tuple refers to heap tuples removed
 				 * in the initial heap scan. When vacuum starts it derives a
 				 * value of OldestXmin. Backends taking later snapshots could
@@ -1276,29 +1232,7 @@ restart:
 		 */
 		if (ndeletable > 0)
 		{
-			/*
-			 * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
-			 * all information to the replay code to allow it to get a cleanup
-			 * lock on all pages between the previous lastBlockVacuumed and
-			 * this page. This ensures that WAL replay locks all leaf pages at
-			 * some point, which is important should non-MVCC scans be
-			 * requested. This is currently unused on standby, but we record
-			 * it anyway, so that the WAL contains the required information.
-			 *
-			 * Since we can visit leaf pages out-of-order when recursing,
-			 * replay might end up locking such pages an extra time, but it
-			 * doesn't seem worth the amount of bookkeeping it'd take to avoid
-			 * that.
-			 */
-			_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
-								vstate->lastBlockVacuumed);
-
-			/*
-			 * Remember highest leaf page number we've issued a
-			 * XLOG_BTREE_VACUUM WAL record for.
-			 */
-			if (blkno > vstate->lastBlockVacuumed)
-				vstate->lastBlockVacuumed = blkno;
+			_bt_delitems_vacuum(rel, buf, deletable, ndeletable);
 
 			stats->tuples_removed += ndeletable;
 			/* must recompute maxoff */
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 44f6283950..72a601bb22 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -386,107 +386,16 @@ btree_xlog_vacuum(XLogReaderState *record)
 	Buffer		buffer;
 	Page		page;
 	BTPageOpaque opaque;
-#ifdef UNUSED
 	xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
 
-	/*
-	 * This section of code is thought to be no longer needed, after analysis
-	 * of the calling paths. It is retained to allow the code to be reinstated
-	 * if a flaw is revealed in that thinking.
-	 *
-	 * If we are running non-MVCC scans using this index we need to do some
-	 * additional work to ensure correctness, which is known as a "pin scan"
-	 * described in more detail in next paragraphs. We used to do the extra
-	 * work in all cases, whereas we now avoid that work in most cases. If
-	 * lastBlockVacuumed is set to InvalidBlockNumber then we skip the
-	 * additional work required for the pin scan.
-	 *
-	 * Avoiding this extra work is important since it requires us to touch
-	 * every page in the index, so is an O(N) operation. Worse, it is an
-	 * operation performed in the foreground during redo, so it delays
-	 * replication directly.
-	 *
-	 * If queries might be active then we need to ensure every leaf page is
-	 * unpinned between the lastBlockVacuumed and the current block, if there
-	 * are any.  This prevents replay of the VACUUM from reaching the stage of
-	 * removing heap tuples while there could still be indexscans "in flight"
-	 * to those particular tuples for those scans which could be confused by
-	 * finding new tuples at the old TID locations (see nbtree/README).
-	 *
-	 * It might be worth checking if there are actually any backends running;
-	 * if not, we could just skip this.
-	 *
-	 * Since VACUUM can visit leaf pages out-of-order, it might issue records
-	 * with lastBlockVacuumed >= block; that's not an error, it just means
-	 * nothing to do now.
-	 *
-	 * Note: since we touch all pages in the range, we will lock non-leaf
-	 * pages, and also any empty (all-zero) pages that may be in the index. It
-	 * doesn't seem worth the complexity to avoid that.  But it's important
-	 * that HotStandbyActiveInReplay() will not return true if the database
-	 * isn't yet consistent; so we need not fear reading still-corrupt blocks
-	 * here during crash recovery.
-	 */
-	if (HotStandbyActiveInReplay() && BlockNumberIsValid(xlrec->lastBlockVacuumed))
-	{
-		RelFileNode thisrnode;
-		BlockNumber thisblkno;
-		BlockNumber blkno;
-
-		XLogRecGetBlockTag(record, 0, &thisrnode, NULL, &thisblkno);
-
-		for (blkno = xlrec->lastBlockVacuumed + 1; blkno < thisblkno; blkno++)
-		{
-			/*
-			 * We use RBM_NORMAL_NO_LOG mode because it's not an error
-			 * condition to see all-zero pages.  The original btvacuumpage
-			 * scan would have skipped over all-zero pages, noting them in FSM
-			 * but not bothering to initialize them just yet; so we mustn't
-			 * throw an error here.  (We could skip acquiring the cleanup lock
-			 * if PageIsNew, but it's probably not worth the cycles to test.)
-			 *
-			 * XXX we don't actually need to read the block, we just need to
-			 * confirm it is unpinned. If we had a special call into the
-			 * buffer manager we could optimise this so that if the block is
-			 * not in shared_buffers we confirm it as unpinned. Optimizing
-			 * this is now moot, since in most cases we avoid the scan.
-			 */
-			buffer = XLogReadBufferExtended(thisrnode, MAIN_FORKNUM, blkno,
-											RBM_NORMAL_NO_LOG);
-			if (BufferIsValid(buffer))
-			{
-				LockBufferForCleanup(buffer);
-				UnlockReleaseBuffer(buffer);
-			}
-		}
-	}
-#endif
-
-	/*
-	 * Like in btvacuumpage(), we need to take a cleanup lock on every leaf
-	 * page. See nbtree/README for details.
-	 */
 	if (XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &buffer)
 		== BLK_NEEDS_REDO)
 	{
-		char	   *ptr;
-		Size		len;
-
-		ptr = XLogRecGetBlockData(record, 0, &len);
+		char	   *ptr = XLogRecGetBlockData(record, 0, NULL);
 
 		page = (Page) BufferGetPage(buffer);
 
-		if (len > 0)
-		{
-			OffsetNumber *unused;
-			OffsetNumber *unend;
-
-			unused = (OffsetNumber *) ptr;
-			unend = (OffsetNumber *) ((char *) ptr + len);
-
-			if ((unend - unused) > 0)
-				PageIndexMultiDelete(page, unused, unend - unused);
-		}
+		PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 
 		/*
 		 * Mark the page as not containing any LP_DEAD items --- see comments
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 4ee6d04a68..497f8dc77e 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -46,8 +46,7 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "lastBlockVacuumed %u",
-								 xlrec->lastBlockVacuumed);
+				appendStringInfo(buf, "ndeleted %u", xlrec->ndeleted);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
-- 
2.17.1

#116

Peter Geoghegan

pg@bowt.ie

about 6 years ago

In reply to: Peter Geoghegan (#103)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Tue, Nov 12, 2019 at 3:21 PM Peter Geoghegan <pg@bowt.ie> wrote:

* Decided to go back to turning deduplication on by default with
non-unique indexes, and off by default using unique indexes.

The unique index stuff was regressed enough with INSERT-heavy
workloads that I was put off, despite my initial enthusiasm for
enabling deduplication everywhere.

I have changed my mind about this again. I now think that it would
make sense to treat deduplication within unique indexes as a separate
feature that cannot be disabled by the GUC at all (though we'd
probably still respect the storage parameter for debugging purposes).
I have found that fixing the WAL record size issue has helped remove
what looked like a performance penalty for deduplication (but was
actually just a general regression). Also, I have found a way of
selectively applying deduplication within unique indexes that seems to
have no downside, and considerable upside.

The new criteria/heuristic for unique indexes is very simple: If a
unique index has an existing item that is a duplicate on the incoming
item at the point that we might have to split the page, then apply
deduplication. Otherwise (when the incoming item has no duplicates),
don't apply deduplication at all -- just accept that we'll have to
split the page. We already cache the bounds of our initial binary
search in insert state, so we can reuse that information within
_bt_findinsertloc() when considering deduplication in unique indexes.

This heuristic makes sense because deduplication within unique indexes
should only target leaf pages that cannot possibly receive new values.
In many cases, the only reason why almost all primary key leaf pages
can ever split is because of non-HOT updates whose new HOT chain needs
a new, equal entry in the primary key. This is the case with your
standard identity column/serial primary key, for example (only the
rightmost page will have a page split due to the insertion of new
logical rows -- everything other variety of page split must be due to
new physical tuples/versions). I imagine that it is possible for a
leaf page to be a "mixture" of these two basic/general tendencies,
but not for long. It really doesn't matter if we occasionally fail to
delay a page split where that was possible, nor does it matter if we
occasionally apply deduplication when that won't delay a split for
very long -- pretty soon the page will split anyway. A split ought to
separate the parts of the keyspace that exhibit each tendency. In
general, we're only interested in delaying page splits in unique
indexes *indefinitely*, since in effect that will prevent them
*entirely*. (So the goal is *significantly* different to our general
goal for deduplication -- it's about buying time for VACUUM to run or
whatever, rather than buying space.)

This heuristic helps the TPC-C "old order" tables PK from bloating
quite noticeably, since that was the only unique index that is really
affected by non-HOT UPDATEs (i.e. the UPDATE queries that touch that
table happen to not be HOT-safe in general, which is not the case for
any other table). It doesn't regress anything else from TPC-C, since
there really isn't a benefit for other tables. More importantly, the
working/draft version of the patch will often avoid a huge amount of
bloat in a pgbench-style workload that has an extra index on the
pgbench_accounts table, to prevent HOT updates. The accounts primary
key (pgbench_accounts_pkey) hardly grows at all with the patch, but
grows 2x on master.

This 2x space saving seems to occur reliably, unless there is a lot of
contention on individual *pages*, in which case the bloat can be
delayed but not prevented. We get that 2x space saving with either
uniformly distributed random updates on pgbench_accounts (i.e. the
pgbench default), or with a skewed distribution that hashes the PRNG's
value. Hashing like this simulates a workload where there the skew
isn't concentrated in one part of the key space (i.e. there is skew,
but very popular values are scattered throughout the index evenly,
rather than being concentrated together in just a few leaf pages).

Can anyone think of an adversarial case, that we may not do so well on
with the new "only deduplicate within unique indexes when new item
already has a duplicate" strategy? I'm having difficulty identifying
some kind of worst case.

--
Peter Geoghegan

#117

Peter Geoghegan

pg@bowt.ie

about 6 years ago

In reply to: Peter Geoghegan (#116)

1 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Tue, Dec 3, 2019 at 12:13 PM Peter Geoghegan <pg@bowt.ie> wrote:

The new criteria/heuristic for unique indexes is very simple: If a
unique index has an existing item that is a duplicate on the incoming
item at the point that we might have to split the page, then apply
deduplication. Otherwise (when the incoming item has no duplicates),
don't apply deduplication at all -- just accept that we'll have to
split the page.

the working/draft version of the patch will often avoid a huge amount of
bloat in a pgbench-style workload that has an extra index on the
pgbench_accounts table, to prevent HOT updates. The accounts primary
key (pgbench_accounts_pkey) hardly grows at all with the patch, but
grows 2x on master.

I have numbers from my benchmark against my working copy of the patch,
with this enhanced design for unique index deduplication.

With an extra index on pgbench_accounts's abalance column (that is
configured to not use deduplication for the test), and with the aid
variable (i.e. UPDATEs on pgbench_accounts) configured to use skew, I
have a variant of the standard pgbench TPC-B like benchmark. The
pgbench script I used was as follows:

\set r random_gaussian(1, 100000 * :scale, 4.0)
\set aid abs(hash(:r)) % (100000 * :scale)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES
(:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;

Results from interlaced 2 hour runs at pgbench scale 5,000 are as
follows (shown in reverse chronological order):

master_2_run_16.out: "tps = 7263.948703 (including connections establishing)"
patch_2_run_16.out: "tps = 7505.358148 (including connections establishing)"
master_1_run_32.out: "tps = 9998.868764 (including connections establishing)"
patch_1_run_32.out: "tps = 9781.798606 (including connections establishing)"
master_1_run_16.out: "tps = 8812.269270 (including connections establishing)"
patch_1_run_16.out: "tps = 9455.476883 (including connections establishing)"

The patch comes out ahead in the first 2 hour run, with later runs
looking like a more even match. I think that each run didn't last long
enough to even out the effects of autovacuum, but this is really about
index size rather than overall throughput, so it's not that important.
(I need to get a large server to do further performance validation
work, rather than just running overnight benchmarks on my main work
machine like this.)

The primary key index (pgbench_accounts_pkey) starts out at 10.45 GiB
in size, and ends at 12.695 GiB in size with the patch. Whereas with
master, it also starts out at 10.45 GiB, but finishes off at 19.392
GiB.

Clearly this is a significant difference -- the index is only ~65% of
its master-branch size with the patch. See attached tar archive with
logs, and pg_buffercache output after each run. (The extra index on
pgbench_accounts.abalance is pretty much the same size for
patch/master, since deduplication was disabled for the patch runs.)
And, as I said, I believe that we can make this unique index
deduplication stuff an internal thing that isn't even documented
(maybe a passing reference is appropriate when talking about general
deduplication).

--
Peter Geoghegan

Attachments:

overnight-benchmark.tar.gzapplication/x-gzip; name=overnight-benchmark.tar.gzDownload

��Q�]����.�����su�=$#��S�W�(��6`K�~���\{1������>�p������1�����'J���\~M�����C�������_~�����������o����ZSJ������O��?�?��?���k���?���L����?�d�����z�%����K��{���?�������*������?�����?��wz�{�_�������������S����_~�������_���?�j������O�Xc��o���u�������������%��1�����������_��w����?~����~�����_������o����������������Y�����������/��o�����������#��V~�f]�Y+�a����fgX�?��.��?���0o�G~+�����cX�Vj�Q��<M}ZI�a�+�m^�%����~�����mb�,�a�L�0����s�<SR1�L�a�9,�3.O<.�������m'���:����������u>�Ki_�1.������c*=e��C�����c+��g�q#��c,}�&�K:��=m,k����:N�s����\���5����4�|����k!��|�+��2R��g?�������m/�xz�t[������W��e��q3���l�q��2��9/�y�x��;?���8o�}O�,��>�7�{)��2��������;|���k?6��W�������������w�x���y�sV��a��s�{����<����w�c/���h����oT�P�����l��>�����-%=��}��\�����8�y����K+����Kjg���;�2�����j/�?��{�>t\�:�7��q~����R>�;�2;�CE@0��{:�������������B�>�����2K������v�v��������>�V�	�������E�71�S~���Lk����~2���2�P{1������^L����;����2���������	�Z��j���U��~b����^��Y�X�����K[;L���]�"����������(�����^�r��,�����o���~~MCy����K���O��z��j����S�\���S����:n>c���5/2��u;������_��=�{���`�v�/��KM�v��S�_;�n[+J�FY��~d�q��R�9������>��J~����5~���qa/y~^�+z�;��e�K�c����f�'���w������5�C\.�n?�2
�R�������������!qV����\���������RS���k��u�Ze��_�yS���M�7M�-��G��������A@�qplU�30��B"�LO('�>c��xbk��:�������F��e���������~.~#��?���3����6����!=�X��3�����;j<�a�	B�.����6��4p)��a9e�
}U���?M|��D�!�`�,����k�t��6�5����p���m1f�}�$q�]!����eM���].��8�����������,�|�V���&�hj��$3VS�a��uJ���%��_��VS���
cH�����Run�knd���nz�7]�Cy�c-�x��p�����DsJ�p�#�h(%|�.�S
�3�8���
2%F��j�;���K�l����K��7�P<��x���%���g�Tu�l��A�X1����sI�2������2���#-�@�Pm5�������o�G^";F&>Sf��y�0��A���)x�
BT�q��������.�%���&���w� 65�Q�����i��)?�B�p}d���?_v�D���|�5�%�#1�"��	�O2DY#��?�Q���N�-n-��g�0 ���B����5]#�v������W�$uxMmN�!��U]�������gBp�]G�
����� i�d��F<'��E�����Nm�AM\]��z����yV�����j�&�J>2q�`i�����d`���+�7
xb��3��:���������~���jy�S{�������<.(-K�������@�7;R��@.z�\h�����7�WbS���z������jwC��������|�NH���l<���W���lz}�a?��ZA���n`�3�������)^��~z����I�.��'*���P����x�@[�x
[�cf���W� ��Nm��3[�����#"�}��J�Y��##��Y���{��e�j�o������F�����Bi:C��4>b�z����xf���cBe�}EF:�
�~�����l��-?�t��
5��J�C���_My�����j]���p�C�>U�r
���kb���/P���:�e�xs���tGV|��a�qy�>������������u�U��x�=���"L��:���b����5&����+���vfG�3�*���sg���
U=7�\�����s�wG*������G�6!�X��Vlo�&
����^D6���1�T�w�8��Z�{�����������8���n���Z���������0N���m�g����y��uj�YJXb����!��lIE������n!=��Z����i���li�hi(�G��
���3X(��j�=�y�%6���3B~^���~X��&�0"���t���o�3�	������B��r�L�L��r����hmc����&����]C����d`Y7z��qi5��C#�1���������k?�a�8�zwyD��6]�����,���<%�0��#D7kUo-S�����w�������4W
I�\?�P�--7�?y��^/�h���&B}j�^x���a��7�S����1F6�&�n���2�>S-�$L,!G�g&�]��5������S������r�rz�gB���o�X��*H��������YSH$���u�X����Zm�\��G��t�u<��"*G�n� �Sso�}RB���X���R
�K�������2������aEG��W$G1���1�n�V����P��M�;���#���nkd1��m�=l����ZC��Y��6](����lw�������qt����ojF��k[q���i
������=z����I�P����j����B�������P�W���-H�4I�/�F/�����j��WB�^�G�2�P��F������oJAC	5�*}����b�5��izu��5z�-���?���N��F3�Lw�Rn�pFa4�E�*rP���^��G��6q�S����I\� �Hh���HE-��<��U�&���z��~�F�������3
|�A)da@;Y�p����0������W������w��%�����1 /�E�Bg���H��}�w�J�$�-!I�,�F�������iP5�T�3I�*!I�}@=�nm?��G����Z�����H��<QP�8i�R�|P�"]�u�SE�ey-�q��V������m���b���� k
#�D�d�P��q��q`(r�T���b���CPB����1���-W4�
E������B��Z���#�wz�w\����$E��Hs�������]����c�.�����Jp���$v�S�I�`�
�#��&���r7ZB��I�6���c� ��`/O�������
'�[1�����U����%i��:�L�5�
Az��:'b/��.!H�L�[�@��k�4?�#���BWB�FBjE��6!}����nO��C�����{���3�|�F!H���u�c�l�:�~
h$��]�!H{������.���Ud1e��aU�k@��@x�]�6�H�w>�:C��jJE�eF�Zq}���!�Q��V��P�?�D�R�7��YB���;NS�
�#H����
�)���GWK��Z���t�-��7��M���ddX�H�&RB,��������H�~�R�uAk��';0�����������"I3%r�k���3���nc!I�3D���=$�5C\�p�&*o���L��x@���2�4���Z�@�q����
�4E��#IW3ZmR�����k���YT
��*��H��Y3��.��B���P��EC�y]���*� \�ZA���3-���H�9z�\�ncG��6G(H��ebG����^/x�$�v�����O���w(����";����Z���'>bt��u=2z�'�������jIxU��u0*^5�~`6��'-q�=�8�U�����i:��VTPx�]:���`��4�I�(����S����{WC�!���=���>���w .����C,���x��Qy;�:����G�?�E��"�9�=���
�����r�=����+0u)i�{��TC)�@%����#A+��u���bNB�ro;yR�`�XE�#ha*� H>:��>�LN�2��Cb=Jb���	��i������F�n�������Y����1u���lk�r�g�W=���M`~�T�`���/2R�$[�"�P(HCo=�hTv�g[��������qG20�g�j��W�g����3���Z!�����T�g��7U�Q�[v��*$9p}�#�r:�c�7���M�MfR/7�<���V��H�"������a���Y:,r�;���cH�j��On8�t9�X�3��Y�w��|F�8��W�Y
~�k��z"�B���#7��ls[���E��#7��s����2^7�@h��Y��0��?�Y�!�W�E(0B��A�h5���)I��Fc�r2������6��0��

���r�-P�44$���$5{i� J����NNC��k��������7��SE�Mg5���2�2�[�y��G��tuK>"�
��]'���$����l���\W�[�?H��x({�����#[��\s����]����`<2�I�#���7J��G�����:�G&\�j�����Uf�e����Ft-��pG��d$5�(>-�e��+H������#���e"s��s �U����N %���h\"s����*W����?��9��c���y�Y�N�������t����$]�=N�+M 
�V�������c��3���o�a��������9���s���U!����FA�lc��0��m6��K��v���d��hv:�����s2�E����9VM�@�����6���A�{5`3lD��b��M=�n�mpa��#���H���
����*����c���r{�}��<S��[��_�
h������	�7x��z��w�K��,����4����Y�<������~(�	�S�h�%���5�W��� ��vu�2{��`Z�����k��ft�i%�E�s���?G""mY�hNd�Nr;$@�Hw�?�D��d�.w�~��*�WPLl���l�
}���~H�'�kz��mt��4$�
��0�O0z����&�������td)�����w�;ZTB&s�d��d3�%;�k��^��w��uT�j%�G�sv��j5���J����9q���%���w����=��pT&�PmC�%�0�{:�7��d��Jq��K�����9�vVuBYf�(h��	a�	���������7)����Gm��A�:�u�4��6��=mE�c��r����X�"�dYN.~���j��*)~��,�@�Id;���x�K�#��6�;Q��&��e��z����I��v�$�t������JDC*A��/�i�$8SY'I-�������m
^Lh*c���^;�gh{S��]TQ_�QG5�$�#����
�A"��.#���w������G�s)��L,�����0{�!�C�N�t��s-ag1����=cC����o������H��Hw.��i���V�q�,�RPj!h��>RR��v��c%O�i��V����U^x� ����/����-�H�^�=)��8jE�E�O�$�t�RYYI4v�`&��Kk#_��bA��g��)D����J��,u���S	�&��t�^q�b��*�^��Q���q����!EW����B�;$!����	�I�����[�n0�Qeq��;{�J���x�;����^()H����jBj��h�w���|��U�*n�3�-z��B��GCQ����:b�
�B�E~�=!$���VL��d��+�	,]�ERt�^0��B<��/����#/���Rt���q�������(�����(9t"x�>���m�������-@>�	H�dS���#E��Q6����&���#E/��j��U�15�#E�]� ��}(nj0�����W~Kq��F��_����.|c�"����A��f�i2�����P�m�����mS;5D�r!������.}#3���I4d�l��U��8�Fa��A�]'�c�n M�U��f����u71��cb������}%/�
X&�����[-���Y�����p@$� �B�3|>#w���F=�:Z��d�4��K��� �����Q-�0�,����x���h�V�/c��h�1�6�>�����tj��xU&���k4n(l!�1�#
by���-��z���(|Yb^B�@���
*������E��z���P�L���\(wL5�f�6��~1����S��5]j���F����'J���@=z����3�?���)�>��@�3���o�T �3KP�~�K��, @��H�R/}c'giI!��o7���N�"���Q���K��������^���"a^O[�\����P�� 1TL/�h���b;0A��\���l���3� �I�^�Ny��A��L�%��p)����0g��A���S/�R-3R��[U��a{C�"!�l����
2�?��'U���G���m%
���*�3����-g�����`���uF[�	�+��K����:�
��� �����QC����R�G��5�h�����8��=z��N��'fZ
�r$^��������yEWggM����B�
#Vg�"�_�O9zb�f���Z�Dl�G�^'0Qj��e��#H��!L��!�5r��=�\��I����!G���o��iUs9�6�R	$%W}���"�j���g�B����l����O�,c�S��o��8�BP���C��W�BA��v������nF�$XC��-�Z�W�BD�����$G0K����$8=zt������[A�9��]U� _HM/��<7!z�XP�^/|#��O������P����X�"RC�P�}C��h�MS���������
x
9��.�d�����7TG�������ol6�.�JV�<��7�e�f)jd���b~4��:���/����.��H��P?D��{4D}�f�fK���{�(���3�����N�������}c���t�=�����Q��Q��3���3J��WqzG��mr~p���/��W�����uF�y��F��"�h�5��gg�E�����{#��n���J1���pr���]�L�v!��"����IA�+#���Q3���/YA�l.���o��ds���$�C�O��>�����$V��K��t� h�|0w�d��r�L��rt��kS����=I#h~X�g^ �K�����y���3����L�]zWC�v�����P�e�'~[P�� R�����t�X���1���m�T+9�&^Ug��E�=�h�=��MBN�$���}g����b]/�k��_�n%��HG�n5����{�����RC���
��TpP�����
�����O!G��a�5������
gj�!ws����m]����;����f�r51plp�������1p�����H%#}�hwGm��]���j9*�v����������X���l���K���z�O�}����9A�����4-����uV��U�s�wm��sA��oL�Y��x��>����B'��C�C�l3�;���}S(r{���Uc����z�	�s}�a($M7�Lc f�z�7~6R|>�������Qt��W����9��QT��W���	�M~��U_�>�-�?i{�'6X����C���\~����C�
H�[!�4`iG~n��-�&8�����lUmu9K��
��#>[�d����]����#���n�V����~�',�GK�o�Q �1�yt}��g���J����<M�y����<�A�[C��P�9��Xs�%�
e�S�����}������������5w%S��vT�1'�c/Z/Z�7�iPHrrN��nu%����H���	�vs(���;�����<���*x��x�
�Q�:�k��
�D��It8�s�E�>oU��E�j��@�O{���:Go��L>���a������s���j���O<>g�
����mGl��i�Y�Wm�K��"�VL�h�������Eq�z����7^�F�="�S2�[�7�@+����]`-�������j���h�F�Tk���o4.+����k�*a��",��o��k�F���	�Fe��sr�Y����E�9�����J���{w�15�#1w~��.*�i�Q�yE��5 ����9%���#	Rw��o�g$V���64�&Tz`~.~�5�7�vYGF����R�`TCmGcn>wD�M������s@Gw��������rf��\��|�L��T�����'a�����oxc�� ���
R�����.=�}N�-���E_�\Q��S���������v���{�G�T����X�'�F�!2��M������zM��N�����~M�_�m/������#��v	�A�kM��6�#2���%�":Zk��1�}�������}�$3�Z�(�!@{tH��k��������:{���&:��P~o��������dq�j�2�}�V�E��%}�=,����r����	�u)���BeN�����P���p����= l(T�
4�}s���y��`\�}B�#2��`�yj��T��h�kM'M}I�`S!�H�+���r"�2��yv�71U��p�2�A�5��9��
��)�pdJ����LF�5�B�2���G�s.������)##W�x]N�D��;#W���G��ed������_�dd����w���=_4�m�N�0����p������b6�(zF��������R����+
/���'i7/B��X�����zh#k��3o��D_LS��
�a���ut�R*m���\���+>�G��:�,��`22R����R�^���^	R��g�D$=��-z|���u~"�9m,��� �Sj'�Ept��v�j��{��m�:�XB#����������h�_�d�~�5�j/��O���i{:C7
��~h����-����k�so�6��;��C�l**K�G�s�LOk�'}�M746��'�rZ���\RFF�j�v	� �/���rU�	��,�%A����^4�mJ`��g�Z�:M��&��f.�Yl����/�#Ol~�5�B���aL�����N�+!7/`����>/����'���RxBiI4<�kA5�A{���d����l����T���/�Nz��x�~Th�5D�]\�����6d\o���SN�=���"�g�3�t~p��r�/�c_K�H��hFJ�����d�~�@R^Y*�tF�J�'Q�u��Hg��{����Y1���*�B�8v.�R����X�<����](�z��G����O�SVM�I���D3�_�����X��R��M�G�^�F��D������A������+/��?&���j�_�kq2���N�5���(4����~�] _��A�7�y�w���KFf�����a����GC����]%��� �wAFz�+j	���mE���1C+���	z�rj���������.�EW�nlM�$%-�G��k����g&
-� �U�%��Yq���~�h�y9��^�`����N���`�/�c7��e�j���rt���@3�`�o����
YP�G�s�B-��qT-:�*("8j�����f�����_����jVH���t��&e^1WE��z���U����QL���;�������M���������'�\l�#G����z����1��v�����3����=f�(�_����p���"8vq���N��c^�&�)�'N�m�"8vi���B�r=7�������ej�z2��Ut�B"
������w�mM�N�t���(��}��znz+��bw��G�^;�8��ru�]����b�Rz���e�����Lm��G�eU��b�#]b��!t��3�!lN)���HL�'���0���U�(�0�;���U��SF�.��p�4�.��I��gY���.*���0�&� P�Tudd���$U��\���H���}C���*�qiG���Vp�=����M�AE:�v��1�`b3�]`�"8�'��AC<�Ep�pZ�-�g�q��89R��������.A�E�^��r��H/G���2i�����������S�r�b�!*�0���3�������!H��_�U� �}:� ��.M���T��O�s����^�T�d� ��%��1��z\���v��%�x� �����������;�R+>o��w���J��iM?��|��=��^~�$�U��#�~[��_'��u��+��t��o�G�mDG�:��h�y�M�H��GN�3�z��������4� ]�n�A��1�hC����JG��8G����i�^@��}�86����h8_D�K�(���P�����]�D��H��`]�d��P�}�r����h���z[�P����5��$�i�����y[�]2�F}����P]�d���I��=��+���
,��+P�#���|�v��Q����D���~?2%�aD~��P?	���D0"?�	T���J?B�v#O��+��uF�q4�%$�J�����f��Y�<un�.F^!O����g�''T�U.���+t��^����n](���sE��k���.����u7��N'h�6��=�
����/O,��1��~�nGo(6�� lB��U�dJ��F�}���q��f
������'Q�<�g��I�h��A������Xy����7���}������E�
��0��k-	&'��H��h����]
 ����	2wk����-�z���a��'���J�mL�B�k��G&5���}9�.��h��k7T����S���{�F�AI6�l�?3O>X
���C��fU,O��5.~��^!wk��#1�w.6�<����y�S5�z��$E�	U�k�q���[�f{v�d�f���AG�����q�������I�������7����5��N:	>�\�����n��0d%j��:�����?=��@A��f�]�(V�������4p�U��)]�g��@��H.�|�7�?�Pt������KU�����P��o���.o^�F5��������������6&�3u�Q��U<�����
S��������iA������y��
���,C���"�������|���1��l�����bUE*�G|n���Z�����%f�7KX;2�N�(��;����\4x��}�k��=�FB;2�g�7LK#6$�z�Cf�DVc�/�R9��<���f��v�y8�#��g��Q�����Uc���q��L����4��b��O<�S�n/�(��e^�K�n?��D�����a9��z�g��s ���~>�((m�r�������cr�<��fU�����20,���*��������&�\3��l�`k?�c�w}�(d����,+�kH@7#�M^��<��[r�7v�D%a�����9��<v!n���W���<J�������Mu�!4'��������Gh;���M����yt��}P���Rl.���X��7��yn���W*<�FHGf��W�v�"@a�Q�g}
�,����rZ��`���Y
��)�7P(j����p�/�G�a�eo�w�����0/{�(C���J���
��=C)��(����U`�����b��@+C.f���<���O��H7�ur�vU
���z�����[�\�
B�f2��lM����
U������-�����#/{F���kWn�G$�<������L9��o�+}��|���u]y�S���T��w���M7���t�sO�jJR�GU�����!�Z�+�C�������g���H��e�������x��%�-�b7P����P��fwde�,b�����W:��so�#
)iY���\�rf�%���7�9Mx2;5�rq���	�P$�����.Uzt�&�>�����L�]��n�/<���#Oz�.���4�m\��\O�#.���bE��:�K���,�Tb�����w
u�1��M����U3�0q�;aT�$l�����n��%���y�e����M��5[s���:6h��,���B7:K�f�8�G[����lZ�7��27�F���2}�k@�I!�F�s3���K��l�'5�3�����ro��m/��qI9IeSG�rDL\245�P���
e_l�BQ��^�6��NT�h������_�kL�g�y�i���s$[�����WU
i�.������R���NM��j����k[��J�%�g�M���*�X�=E9�|}����Q����k����7��w+ ���6��s
��t���t�=bs��
#iG���Sd1IHNk����;c^Mpp���i"�9�."�$CM���@(@s�����s�����m|����d���3�mK^�N�w�P�������{���H�����yA.��oj���q8WMg���$E�s�O8^�=�=>���.@�z����o���e7q����.�������4c�����Qo�0�����~Q�
/�c�Y�����k�9*��ub�z�.�!��"t5��B�w��V�r6�S�
a��Q�����fA�m��jL���+�t(�%M�k�A����D^�8��>���������E����M)�����@����L��������jS������5����enjJG��<eh��4���m�?�h9�������.�������#\C#�c/x��%��C|��"�G}:�6�C#�6������z�j���z@��8�H���A{���!�-�����u���EN��@'	75�S��)!^�FiM��^������[d7uJ�����a��5�J� _C/�����At���v�W+�U~���RQ�k]cv5���p�*�w�7�������Pq��"=�����H��%����Pn��������
�M�A�3��ev�B�"CZ����K}��3�s.>��vl����X8����TX�cG�����U!�;��]5���s"���h���7�������B���^�}��6!j��*��]�js�5���;���^~G���[�-�>tW!�]�MZ6��|a{l����]B�I�:2,i3.t�W���v�g��H|����C��C?���)������c��Cq������f
��N.[)����*.�FUt{��N(�����J���l0a�
��^��&������"���(a}��9�iD���Z�Hv �="w�*OwM��p��B]������
��
!��B���I���}2X�����8�_��Zah��5�tGPV�5��H�8%;�cw�L�R�
l�/���X����w}t2|�=32i!.f	G�u�n��QL����tx�G�G����*Y���M�b�b=�����r������L��0@i����,+�T��%�����)XqO�����$�%/"������p\�y�4]@�a(���j���4p��n�t����Q9M����B>���_X�S%5�� 9��CK��.�V��G������S���[���@R+R3Kch!�a��M�eP!^7,iw�S\+���'����<f�+�A!x�vwN<F��
KrV:�]��9$o��\�$$�iy��������b}�4]�G#��Mn���w�R������}l�\�����6�l..x��BE+�5��
�*�u0��^��#y��������kd�R�8��?*��)��&���u���%�g�_&o�
d�'���o(��k�D���9��,��M'
8�/�����e7<��LW�����X������B�D��f#��������w���)�T��9[d�$����#�K��#��Q.���td�?�{@��d�&��H��a�bv�$6O���L�T����C���&�����><��|���TK
��7�B���T��8^��.
����/�tHK`2+����1����/�&���Y
;�\��wz!��i�GRY*���zV���n,���S�h�>�5���H���;����Q���DQ<��V~U|��!y�3����pi��g@��j��m���WZ����R�-}�$�>$7\��^	o����FF!�^�������)FZ0������U:1e��{�����Y��G����VC�Q�=�td�V+hqD�lv�#y��BQ�\G��D������2�pT�Z���P�}�C���� ��>�k4�j���G�A�^������*�������07���"���w2����I�jH�}bY���}��|�W�|w�G�
���7�;;��rE�6�7�����Ik������nX����� i}__�@+%}���5�#y���GE�hm(��x���E(E��5 X!COp��������d%�b��C����B�Z�t�r�;��l�j86^`H��3
6x�!�S��'��N�N����!���FvuF�2�[�j�q�������/Z�[��=�v�ubI��������^��[��r�_w!�
�t�Nx��$%7@�*�r�y�=2��	�]�hJ7��'�0-���j�2�q"���t�o��nN�CQ0��p�4(�e��:������Ya�Z��s=�����k����T.G���b��������_F&��K0D�
u�F���P���x��X1��y("��g�5��S#�
��X�����<���#��]�^02R�4��_�4Y3
J�D�X#M3��.�r���6426(�J��;��3k��b�����9vI�cd��E	����g�a��<��O�e��W�Yc�,�Y`�	�	k���}�j+P��.��)�(R��NT!O�<`��	��
��3�"YO@����j������
�:����]4s�6�w���+������adn������"9;~�!�������Z�W�Zm�LfQ)h��Q�-R�sv�TJo.���u�l4s�]������
���[��`�\����4c]�����cC]���[
���tt���������j�F��lW%He_�z���>���c��|����G��Cqs�o��EG�����(�u
���+��|�v�?�AU�����o���V
v�#j�]#3TQM=�hd�?\o�U:����yJ�O����zMfB
���5B���8��T6�y���m�\�6bd~3�E��5��D�PP�Z9C����k������`Z8����t{vGvS�*�*��@���I���8�#�Mv��M��t�wM�6C����#d�6a����FFA��jc/�6VY@��������{v-M�#����X��1
���/�fq:��L����=������b}|�v�x��C������Hi��?2���*a����N;{PKf�1_��C�<AK��nWt$�&3�O���l��Gq�.��}L������
����@l�P�~=&7�:�qr#�����Q�\���_�� 7��s����y�����*[,�7D�~t
W����� ����0%kMRbc���������:�����n�c���s0��/G�^��
���Lg��<?��;���~�H�N�/)��a

S�jHO�\�j��uI���^����^(IK��TY����db�Jt�/��{�j5\�v��(��*`:9^��+G���;T��������>���"@+N�P���v�����r�-�v`y��#���gv��k#l4?�D-hERy�I��Y�K���~�R
��&�z	I�0�K���r�$���$D�M������	��7��J&3�gfT�th�$��R��
G��d�O����E������7����YD�aJ[*5XtcD~c����I����W����n�?�E���eQuh��A^�^`���,����Cq�mx�����(A�v���Q2��5�d�k�Q��F�4\�/�e�l]g|����>v%6d,��b�]F�3����z;f�Qb	�D�o�����?OS3=����@�� wh7�l����
�
c\y�p���]k���XN�*����}�_��V�0�����������-��z4�H��������=�wEP��/���e�t���0�����$9�&O�oXR�������.�d��U�Ib� ��_���T�����)'�N\�P��_�d?-"+����4���.hG�n��3Z����Jv"'�*p�Z�jB	s@{&����Y�Zg�����zy7��(I�z�c�()A��
R��:���L��&�	=)HIb����i������pO��
�m��Y�E�����BJv>�NR����y�H�>�7�v!%;!I��@�������������[dl����
�?��G��kUhs����`��{�oh�7�Fw�t�����6m
�^�BJ�������O
HI1���I�8 %�?
��^����{���t��qR�����6^!T��~#_��*�^�T����Q4�G���hM���&$��J>M��f�C_�o*���f�����@'��>/�)qVv�W��������:�[�~@J2*�+^�b�������[���#_�ZUL_b���#w��"���	�e��P�/�ALq!w���p����ni���5%ew�����������.�"�slB�.�!��4��Dd�x�I+t.�[(���XQ�W+~k8����A�zMS8���+L���x{�8�ddA%���]s
��`����hU��7�5�����{3i��HZ(���,(8`����P�m3H��T�Qe��E���W��~h�����j�Fi'E�x���P$�;p%,����������������O�X�i��m�fM?}���H�cSl��P��]��1�:���jE������D��YK;����q��J	�"qQ��e��8#xj��k��]�����������n�����(���2U��������
�7�z�d�y���Z�O����Yh�_O��/,��C�^A8��A�4�i�wI��3:"�����Q@�?�^B����y������`�O���k����������F����p�i �Q����'�P���/��oi�8��C�)
.d��(�Q�
,-c;r
�=_Sb�o�"�:���mM�v�\D�S���`�cg��{E��2�U�Yth8�N��+���z��[&:����8a����?��4U�<��S	k�����&�_i?
J_*?�m	�4�q������D�I�&��_DI-�A1v�U��Qb�������%���U_n_�F��7�_~k��4e�C�����,8M�����+��0���RJ����}Jcv��x�fS���]#�����T�^�m���C���|��:2Z��$F��8$�Q�kV��Zh��_B���������/�d'T0��_B��^��m�&o��{��R�9~�e�/WM�S/W"�	T��P�F�[z��vVh�g���xP�_B����]~6���9:Gj�����uU���(��>:�N���]��6N��#y����M1�����]]�74@o�
���(!���(���Q+i�Z,H+_&��}��3�V����_���IWm?$o/���@+`=$o/���*|j�CQ�[����c�d}`�%�N����Wk�>�X�!z������
Gs?����8k6
?��O=�'?����>0fx�,��R9b��'s�N=o�YU
�����P�dk���o�J���B�NR(�k�1��LiC�)u @�1z���v{��H��4]@vL����Xn7�\�$>��� 0q
m���)��x7��E��67]CO��s<�Cc����h�u�N�kW
��_m��j����������mZ����H�f��<����t�h>N����@����a�����;a������<#��)�����;�b�o����1M�d-'L)a�2�S��$��>������Y�����|W����.��5��;"��)�
��B�.(�`�Fam�_,�3
Yy{�%����D"[c�����h���<������_����C�l>"�D��n#�_�w!%������eY�_O���(/8����V<���9��������6�������;��d�����������B����Y�%��yOf���y��H��Q���m��I�79������	q��rJf2�g4����H��Z��;�����~S��m����4�������L�t�x�]f�u�I6���S6��^���V�S�n���3��0&��k�����������5j@Y����%���[T��SWo�����R�S��;�L�>�����������mQ�o��)�$;���V����`���)'�<6�
	KZ_���Z���T���4xI������he��O��ztnk������zEGJ,��M3��,����ny��;q���9Q�����o��8M=����$v��8g�mpJrC=����Q�S�3B�	�Q� ����Xu&�c�=/�$�wj��w�����:�%mfP�S�s�$�c���s���gv`�f�����-�{�Gk�kpJ*g����z���"�utQ�Z	NI%�#_bj~t
NI���G���R�u{l8���\sj@U��=v�!��B�<`	�)�]6%�zE�����j��-�T7JH5r�G���!P����,��U|�]�6J�E
x����nr��<6uCF����j�}m��/���N�Z��U*��jpJ~`#7WC+S���pe;2����NI���d����d{0�w yk�<UJ?�@���czZ�G���il��_EgpJ�9g�1upJ���L�?T�}�#�Z�]���\<�q�|u���#f����������������qZ&W/��5�T=%����W��~����)�y6-dT|�cC=����'��d���e�Q}���H�3`�'����	[��������4 �7�"������W�$����o.�2���
a�^L	�d�v�-���YP�J�y��#����c�����(;��uu"k;�����a���#a����;����Q����:����}/f�V�o���S/�����B��i[�6~kf�z�u���w[�r���
�&ud
�]<��|�{���<
v��UAWj�E��Ye��7�B���Xl`�]�*R�v�G}`J"��X�{����S\�x+i*\=��l��
�|h��BJlt�����z���.�+�������Y�QV����T���
�/(P%�J�����.C��t�juvC�N$�$}�I��x!%Y+_�����������^��?�������1�Q_�5|R�����c�����C
d��i�6S�x����/�jc�e".C�z���A|��+����6�x=U�����0�v���2o�q���@4+(���_�N���,�\������z����d�hMw��I��!_P����0GG|�FKz4�������/Eu
�n����#`��p���B����k����Ix�n��(�m�U�.����
�t��J�?���N���
 P0
�����]D�.I����&����Y�oD	�z�������	����]�u
y{mS�tdp%�!+W�7�Wj��S!B�����J
�p ����{�����s�6�I��m���u��w�'jh���uu.7�Qw���=(%���gjZ�Q�����2M|Z���]���E��*�
��Q�;J+�=�&�U�E�;C�
P�"J�0t��?��)u#�3���d�(#�������dl����C��M%s#�����v!%�������>
�L���*�Z;�v�������Sg&}����I���Mm)���xCD��r-�C?��Z$l�We������)YG
�
C�-9j�:7V�Tn7a{6lT�c��5 %%
d�4����R��`&�:�% %y�xa�h������G[~v9Py��n6}_l���������@��h�G��h�Q�iZ���R������
5�`���(��*�(���_F�����{#m�E�v�j�Rv��G�v1vz��4��Z��R�0�g��
=��� %�����g�
�$L����m��T����]D�&�*����s�E�[nUR��/d{ J*�*�;�X�E�������v%��M���l�D�v����= F-�K ��*V�8��]v�����G������85������-!7��fx��I�F�u*�O����t�E����ep�b^{ J^x�#��������~��(�r�0�(MnQ���J�N������v��9������C7AHKZ���P��Q���= !�E�v�vT�4k�V�|����TB��]FI������*�����L�Wg80U���L�u�-���(�7=�=�)�_�2JRC���$G`x%�U��>�M�e��I�?��o
x����2g���Jb��n(�J����
�ZG���h��@�����Y�	����'P�����4s�l��	��()���3�|2J�l�-�`��Q�:�G�vU��]FI"��\+�F�6Z���F�%��c�jc�E���cKUc��2�����G��^�Sw������_B�V�u��
1�wB����[p��U*���BW["���������-A�p����SY����kh�RN�D�."c�ny�j�kd�/�d7f���?t�����nH�p�?z������<B��Q�g���!���zx��)�M�,�*Cc�����cbd��0���~%��;�6�Q.f��|M��
I�)��0�F%����*����m�p�E�3=4��-���=Q���
�����5��?(%lcy����R2��$����/��W8~��A��~4��/:�Q��8�Io����|(��
�f�h�Su��P����G��}�-���r zh�i��jr���c�C�)��jPM<a������7Dw ��������?r���+������Ra�R�_L���_�U?��������	v���Y�
�z0M'�d����k2�.k�,E�=������Y
v��tMS1�y��r���<��V����9��Ff#��O�{(���+,�����
-��R����6,��K���z*`�������x{f �:���f7!�lx�������j��f�/�d��fCf:-��RZRI+7MT�����:1Q�����J?����^����h�������Z�Xn�����&�����������'����C]�=���2d5LRoM�
~���r�0\���4�~)%{���]��^JI��L����w����������y�=D����+�Sn����)G��Z�e����w��VyA��x���:s��������6���.#��Q=%�)-41���Qb��V��������p��y���w���[Q����pD�:L`eu��?z��y�1���m&���#y�Fj����y2M��J
Ty,��sN$���<?�G��}"C��R��~�:�{+�!���~�$�@T�\Y�=1�nn��	0����{Nc�1����#�;gx�D�G��[��Oj�TPx�t�5.S��d��&�!����N�O��g�,U��K'�Y��R�#K����C�w���-�$�v���k�J�')l}�Mkk\�{�w��&/�8p��CC���^d/�������u�4��
[��h��S�n^<Ig��)��r�WGj��CU��=��'�����=�������q�@V�!w�.a2�j�����p����~�h�����/�$�@���G�nE3�+
V�v7td�I��n�4Wu���8r���u��b�G��A����&Ic3�'�=����c44�'�Z��;�@�*x�S�m
�r����V��������L5�{�$O�w���s��;�$�=D�����p&W��#�$�!/�D����O������z��'��MGl�����:A5$�%�(���}8�{f"�����q�<����9�U�
�RJ	KHX+O���TXC�3t�$��'�PV=s}��B>�����r�mB��z�s���[���n[�k�ubdI7��d�V�|���Bc��G�IREi����'��/�]�������'I(�Hz����P$�k%�$��;��x����}p�'f��I
�:�~'����$T$2m}j:��x�z�W��~�d���V�
q4����]������^s���T����@�h��8���	i�LP��?O��?�o4"n&��n�&�l2B����6*D-a�<��\
����f�?�ke����cT)qA��!��Z��,�q��1
���?Sl�'I	��+-�bd��m��0�����:tcX����<�N�K�	>!��-Q�M�������Io��(X+��q�Ov�C�I���#d��B�;���!{n���7�U1TJG���bK�5C�'��1���{�������%D�����k���0��mUQ!sw��UO}4�����������<IU��F���Tr�'i;/eRT���'�Y�C�����8���%cb��R�zW:��d7@c
��,���i����G�n�Jd�i��N�40�|�]e�������Kt��h<�$�8#�����P��h/
yW���t:�B#�k�����?y��� m���Y��a�.�$�>W004yi\���0R�E��C���I`n>�T��km��0,��_�@F��$��2��G�������2���G�nE������B�x��5�����H��MR�I��Gv��������Y�=G�����|����.�d7wJ���x�(F�:��H��#���^�tl���N��%e�';�AK��v���Tf��/�d�e�I��	�Or�K���oc������K�{���#^����K�Jt�G�^�<�FU����E���N�b��n��IL�y!���8�uk��T1B��2IBj�lt����d0�'(R��L���u�����%�t��O���e�6��Y�
���s�h�u'I��^Y���q3��?�
� >M�`F��2}dnL�l!�wF�v�NQ9H�k�{N��\][���X�;C���(��4��f=���&��- 0f�i�]B]u�H6���p�Y�>��a��p�,�b��b�I�'��N�,��Us3�K�U���������<7�[���32��S��/��t��{�/��P1/�dmz=�qZ�w���A�pi]�0g��y����*��H�N��Vw���(M�{f������=]�%3��;�����|���T�\a���:/�$��d���~w�t�mPn�@��N(#����E���X�����i
C/��P�Bu�~�E�~���l:h��������\4������������Iv�8�s���=���t�.��D)A�z�;��W�����b��A����IP_�����d����a�q�$;3RAH�m��z�$�����S���I&A�	�t����L�P<���
�F����9V_c�z�Ij�o�W�(g�i��^���p�2���7�,�o����&u��i���i�K��,YO,�������H��v�������e�k�y�m�+S-�����!mv��D�OX�ObJ�$f�����N�������FOkx!n��^.I���3�S�j;������z��������.��r�����G?;i?�v�T�G�nH���m�������i~8���h�5���"T���I�h=o���V�9b	�N���X�)��S�~��%q��*J1�M].��?T!9WM���]�-�+��+^8Li#����_����)Y|a��c_�q�_��(�o�����I��*�h_�^'�#t5����T���.��u�&V��j?�y�$���o��(L���u({v����h-OE��)�3��2��^4���L����A�
.�K&�	!Z}1-u�^2�>�j1>�d\��L��t�7��v���oIWx��s���)z���Evd��u�+��#�����G��A���wK�^5�bo�X��7�������,�Gj�3��2XeN�
��&�:=0��H�.cTe�z��C��Lt�C����c���%���Eq��"�IR���GHZ�QE�� 4���� ���=�n��a�XG��"�nK�sS^�rP�<C��
���
��v/��F����ww�C�h;����%�~
�`��RqGF�M<�Bn�t\��~��%u\
8	p����F��Z����%���1he'~k4�Y�@�&�����]#�����2`�[*H�%��wF�m����m6y���oX�2�IF��}�����D�����p9[��aI����M�~�|-�#z&.Iw�54,�$T�xa����rIvGk-����|�(�v������h
������*�~��^~�|&^8���x��Zn��jB�����S���
�����5M�T��r��^��������� ^�v]n��jw���du����:�FP��K*aJ�,s�MLo�R��,���c������z*����\���u��
/�w����Mg��%����|�Z������:�%B�?������CWw\�dl��ix
=�Y�9mC�����K;���������e��YKJ2�@O�^���^]�
��5%�Cf,$��^���"���p���2\�(6���m���E�W\�6~k����UC�����=Mxa���	0���c��P�1�>8�z�rTd�:�K(�e��S!-k6N�jMG��}���A�M��P��R����R�A�{Tf��e������f������;Q�����n4$oq�����z�$Z���x*<���.��/���F�m�8P���q�.M�0AM��t|�pL[S�
��c����������@w5����FE�g�D��K
���f�p���B��/�Di!�(<|m]m)�N�-�N��<���C��6F�3��]S�N��g+�{
�u;�^�S�)Qpy0�������C�E�����Th5�
�/��q��kF����y����?FR����{4�Z*���B�1�d�
���]�T���2&�7\J�@1��FFf7o���Q�m;��1:�-U?��L��NNz!N'�L�����������5��n�U ���3_��P�1I7�;9��2BgKj�!w�^z;x<;�H�7R�]���W�s�g����r�:a%��k�%���V4T��|�U9WZ�p'�r|�"
���Z��mI�>�n�%@�A��(P�B����JR(J������S�ao�`��iL��_0I�����=�&�lJ�Q����5�
K����g�U~����G"��6�C�Q��f�a��g9j��EX��������=7�S������aI/�i�taz��!v�FK�^�_�!����7lD�>S�7��
�4ky_���<>)��y����'�No������T�%:D����JR�h�\W����k�6�)�,HGF�R20THTz:�A%���m�JVi2�����y"���y�$M��8�C�������*�KZSmjG*II���dIG`��7_�^z���x>�U���i�q���<}K�v������5�����+�ULc�2}%�����0��k�����]�HRft'�A%
_%5��T�G����	��J��G���ZQrv�G�)q����A`.�Gz"!�t����l_&R��x��0C������������������y��d������^*IB�Q������I/BsPI����"u�N��K%��_���T]$_*I"!��h����YG:�^�J����8L�����U-�h���S�t���h�V��w�������G�6���������6t�'�<U��]-!�$��2T@G��{���_
�Q�3UD�G�}����VR�3�	7�a�b0����J��*h:cb�.����sT����T4���s��gF��������������gz�U�I�{e�mQb$�C��t����D������r(����F��,Vv��s�W�:���.j���8aM��5lH��w��'�[�G���v����#b�]�_�����J���$�z�if��<6�	r��������=mb�<c��l���=����l�^sPI@�c���*�W	*I%��E|���+~�����9�$���8���g����d�$�`G��g���T��^��c����>�Z����n�(��*#�&�y��������Qp�8���vR���;���e%�Dk<���(�.������wt��+�U��v���c�`}������������n{t ����]ke!��>o���Q��l"�:��6`_[���xBI�1�����5�:��-�	J�j5e�p�*"���x+g
�*j���^��^�5�8���R�|��^'�C���U�#�Q"	��B\sy�����U�j�v���!������\�;�q�Bq�=���1�A�`s�gdf?Z�A��>A�\�^ I3�����4��g�9+j�t�^ I��T��W�@c�X���Xv��I�%*�����%���c*��'�}=l�C�4���
[L1<a���I���UN���Id-k�H��uiw��(�i�x��[�����k�h����	��Y1�C[��m�����vTk���+;E���]�v��7�
q�~��8���Hh��v�Hi-�����x�2]'oKv��u�-h���(�Q4T���|�z�l�82�`lI��F�$=CsQ���m�>��U/��3�H2]���H����p��\������z�Zz��6�p#�C�%�D�&>���~���<�M@��ET����w`�Kc��+�y����������P�X��|mS�
����~oG���n����>^���,T�/���I�e���h0����[�H����_I��������E����Dh����9x�FF2-@p��*��R��
Ey�	�T�X���6��lG�pGGu�@�kH�q:(�|���6�Z]����t�$�f�����v�$��j�c�XE�?;�w��0
�z����
���7v������B*�V�L����kd���.��0/�$��JmU�o{IX�:��:!.�dld�s��u�I�k�KJ�z�$���R���4�D���KM����1�A�TW;���*�4h���/�*C��\�>����?�����=FEq5���������U�Z�6�xE�>���l1�d��md��e���m��kKa����l�+)R���+p����U���I�����\��^&IpXt�B)�V��4h���|u���1����i�*q��j:x��h/&IA�NDG���ZYt�@Ck��S����������R���:F���$�Fc��l��7\2T�/8\��/=
��dJ���\�u���Cb��M8ytJy���Y���'n���M$AUCGUC�����~������.��Y�W-��I�c�	��������*����!�L�5���y���\�:^;��;�f��ks�~�J��\�Y~��MBI����!�t������t���]�j��R��~�[]��#�W]��5������z��g�KK$I}�w��:l+Z����MM�=��� .z�Fl 0�2�H��h}�B��Ztf���_�t_�������g���(�k/$��;���`h.Khe�`*��I�)�v|�]�k�I��
����I��`~��1�E�L������&�]�;���
��X������{�\��G��Q�&�gk�����t�l�r��������G;Z:K�A5�<�4��l�6��C��89�;��s*���I��B�������?K��U��H��9���^X�����o�$i��B	P!�����f�q����Z��$��7CE)V�.�$n@K������w��`.��`P���mo(�E���U�H��s?����M�������u�9��U<����,�k	��]&I�����c/����S��|x1I*�}��kF��L,h��gj:�dtAtx
���oR_��E^A��CD�X�����N��k�f�_�|5��14�%#�P#>������Z�|	����2�m�����\"-�3����mFV���
7�s)�/� ���%���S�bF�o}	��������a�����]����IJ2l��?��CG�������]~�>�T��b�Fc����I����6TI�T���z��G-�$6(!Rp�YN�g~b*���5j����RJ���c�|��C���y�T�P��$��;7�v���P��	|4o��T�=|�Sv$o>t/d#
��Q�w���*3��]"Ig�C'��yd�D���68q��Jr����Pi<�w*5�'gaI<R�ZJY���w*98(\��{\�{+.�"������,�	�K�y��LS���b�o���	s�U\��������;�v$o�Z.6�b���kf���z�*[�Y��l)x�"���C�T��XT2k��]"I!��-��p���.�H��3V-�$P���
�i��H���c� 34�k)y�xG��7;]���E�V��*�E���I�I
o� �����G���&�5�i �T}�#y;�.��-��+y�t�8��������kX��%XJ�L��OWM�!n\�Ig� ���������(:��Q���n?`�8��{n:����	i��}�!���c�@���������8`7�:JXN�G��Q`�����������(M���;q���g�����������{E�?�XP��5�oG�^�Y�P�0�DI���PI���zf���������~e� �����l�(i��6�IM�XK���	� }S�v/�������Zw�^��P�������l(%T�T�h�)%�M�� U����aG����	y��7�Fv�W=�Vu�����GK�U��2��0�p`��a���5m�n���\5gR�p�~p/$����|Q�8����!���n�8K��Q��?�8ZmbH"#�$41Pg����8�k����'�G��s.�P�Unm@��Ll���{~v/
t���4��|����Y����X2�VtV���K��-�Q����
�G��go���3n�?`���k�:R��V��xI����u����;N$���|
�<�$1u��x���'���[E,3�V��=M���B�Z��<�$��^��
<!��<���*��6qM�~/���KB3�G��c�Rq�(��#p��������}�S<������xvU��������@>P~+G�(�t|����p�$�'TXr�A�m{n��,X���G���~Q[��)y����X������?IJ�J ����02��yd�0�)>��Qm�H2����`���-��:��o=�$Q�n��w�y�i�	zM|�z����y4�U=����%�N�k������q��E���}G�^�]��P%q�����z��*��dm�_"��/��ai���c9z�{�$^�k�l^�����O�>�0>�IH�HRlM��%���e�\����'�d,Pmh�����$�<+�i����,�$Sc)���6�6i�(�����j�
����� ;J�-���3�0��H2�,0������������������5�����4,�#�`�>z���C�37?"�������[`!�:"�Fq�}<W���w�0>��B0�$�73v>Y8w$��QHq4����TG����g�9��.#O@��������X���zuC5�t
6'��$�D8.����@)�K$a����^8<��G3<Y��������G�^F���*��1�G���������`?���~�V�jj ���{G�����G�����vx�mc�Y������e�TI�����(3Ge�'I;�]I����^������������D�
q��������J~���K�v
x�h<�s0�G��4S	n������{�nj�,w{�P��U�'�	�f
��5A+��$WO�v	Y������v*�{c�SH��S�r�Qa�,'pZ75���I4�nf
�v��nY�Wl�G�B��%H(_����/�+�>�xF���q5:�Lh�%�T]�����`C{Io~�/"�t}������%��"]�,p�]�Z��:^����F��h���(,�V���$���Y.����HR��L���[��6�IC���X��r�"� �\����*��Eq:���/�#�%�4{���+{��G���dEz�(������l��
��M�J(o�I$
�]���P���?����P��	��y4;�`�S���w�sY��z��IB��fFA�i=�$e�om~>��"I�U������4	������_H��2�M�J��$q$I7N�����$�w�����-a;�M��2�zZ��'�.PW��$�l	����{���A����K��F\�
��V��`?���ci��������3�����`{S,��&��>��LIa& A"���H����P�
��H�C+�q��H��B�%��=����{�BO! >��g�N����Z�~t��,ej��rbG�g�� Ms>Q���l��%�8�&��M=e�K$�_*���ah��/�$2���tl��H5|�W�2�����F�Ui����K�VB���Cs.

 �|��}��Ib�i����{�L��3x��f�m7lSs��B�_�;�����|S��E{GRO�[�~�ZO$I�}��Y#�����)������h�B���}�G����k-����'Ff�v5_�O9���t4�qH��R2
QN�0�*Cb?���%����l�����"�4hc����E$)$���?��d�.�������y��z4AC���l�
n�R3�3�#/'	^���)��~q$ah�g��P�G�8�h2�-~�������]�ic����Y3��O�{��������qE'0jl�n�tg���@���Al��kx�1�[�f�m�tt�h�����GN�������-#���{�������w��A<t��j�����q$S��E�G��K�7*t��@�"�?��;b_�`WW"��B�0�hG�~���������&�~Tn���D�������JY+�5���>_�7���'��G��D
��14�%#S�#hA�jO��%�P@PPw�g:k#+f���%��gc���X������%��1)p���������Y�������'xf�3n����O�Wb����#���i�O�����;~�I�8I�ZB���qG�fg��"����:!����|�h0��
���v�8"����^��1��G�7���a�R�<�����-���~y$�N�W�I���~y$h��?((P�R���fMX1��
��It���-"����y�n�J
}^j��{����n��1{�(���n�u��I�>G�I��������H�k�����������}n�:�F����,�Gw��Y���l�k#�����*E�m
E��H��=�&j1!-������U?9����<�sdWi����.E#�q�n���k'��q�n����5ru^5T%KGRp��T&������%?��2���
���i���Q_�6��z?���#)S���\:4����HX��M0^8�yp
C��6z�h�Sc����#vW�
eV���d�v��\��S�p�E�@�������H�^Na���pI~�]
'�m��T�KtW��l�����(�>���w�xbx}7G���^�	�r�!7����m.|6�������~�l����J��P|�
Cs*�V�	�'��n��P��]"��JT�W����mQ>����+���C��O�$��7���������5x�.��	�{]���������Y�mK�*�<G]����&68G������u������k�aw&Q������s�0�P�b$��f#�!;Xh#��dF�����<����)A�Y�/��9<!���M�1]It��0�'�����4x��{2*G�H�g��(1Cw�q�5R����>�qo��8��[}"4�u��w�6��;�vxFFb|et��M�����}�:��p[�������pN��T(zh�p?+5���\����7Y�J^���T�m.��:a�O��V��6Pe���������`�L��#��8Q&	U�d#��Aq�n0��a^���^�h:�oT=bLM��#w;�8,��H��[�k5�P�`^~h�����4#g�����0J��)w��
s�;����W��i��&gR��P(�����f�v%k�!�B��������d��9�k�.S{c���JPE�����TC���N��8���Z?���9�3��=!,9��N��H[�5��f�^12-�l���6�J�boOu3*���m��G
��fy`�7�[���z�`,t�*����m�����h�U�zx�
C������qC������1���6���������E��q�����
�
��#���n.��dK��]#�'�DM)���<��{��`
�b���m���9h��6�i�Hv�pp�=��DG���R�s���4��@�2p8n+�.�HR�m �����QR-�W`s���.����4��I�[�E��mH�<
5������T��9�4W���E������y�$�C�
�����n�j�@S������]�����V��$�������"��#��_��kW$�3e�����;3~�����4��:�m�@g^�6�zF�P^h�H��h�������Dw ���+6-���#�nFR��`�F��LI����v����#��D���te��G2$��z�'�i�yy$���hWu���#qRb���1��H�d��O��L��L�T�`�[9���8��n���G���I�x��rq{.U���
<��#�U�YXZ8�G�K��u�Lo4�G�G�����*�	]�����y�mG��o#��l-Y��#�T)�ZX��l[�f(-���mu��#l�eX�*j'���	$)t&m��`�HR`����.�w�$!�b��zQ��V�i%L|,iy��r��Hw{�#�x�����jX?Hb
��YEQ��H�+��Z�����Q!]�g��K��:,	'3c"�U0�^<z�lm�2�GRY�����2��mS���3�^	:�D[���Z���<�OM���T��_���2P����=/��_�0��RG�^���9��k�Y*��0��6`������u�
!���=���o��F�4g:���K���V4��G��Q^���P~^�&���(���#�,s�4-�	�#	�:��!��W�I"{2���������2	��:t��l�W�6C�(�;����|8O(�6�:g�fv���)�VP��Tt��#	��O�������l���X������;
�������s4��0/�eG������UY�g{����m�f�������/���~g�!��M+���_#�>q\N	��ll��f:��!�_I�Ck�s�z�F���m�����g�Pg0y�g��k6X�v���D�2'�25���#�2u��[������KW*Sq_
�s>!�H
F�qvD��v�����G���$���[��V���!a�n���#�z�<�n�z5p�L���Z������=���4�^��v���n�7?Q2[�������&����u������H��WZ����{y$�����A=<��#)�sU���94�r���`L�H|�yq$��[��]s���� !OD����=����e����y:�}�b]<���^�������!�8���o0A�mw{hs`�`�@:])Y@�
�Xy��i�W-$^�]��.��,Z��
���H>["3KS���#�p]���	��z�;�t�8WMI�1��_>�!��y���B���������"����&CB���^Z3Z)���\��si�OXG�^��Ag�e�Q�����i������E"�Sp,����K"�pE�r��7���]"��tW�����f2B=C?���<���	��j�q�\��.Q�����Ai��Jov�,d������Vz����a�
1�E#1\uA�a�����		g����=�^���_����"����l\x�S���D��8to[�g�+R��]��Ya�Oov��]|�w��f�cs��B����D���=��]F�Si1�&�ua$������XG��U�K��L�TQh�����<6��jm�\XE�^CR6�m�>\�%Ej���R��!M�3��]I�(m������"qC-���sb�Owv
���IE����u���	U����5�43��VM��tg���@�l����!��n,�������d�7�W���6��
�XMQ,���CJ�E�2������<F��#��/=Qwp������I��:������yd+��u�(2�����H�����������]#���.D��Cs&��_
���E�4g?��<e��^�
�Q����$����,/p�F/����8�4�G�c:i��!�4������=�2Y�H�:�v_z����v��d��������L��,��d�t,��>^�lvX��<��;��1�7�8�[s*5�
x�>�5�R4F�|Zu��;��b��>����_���We�2\�"$X����Bb]wv���H��l��3��]:Q�������5 ��Rn�	$�A��H`�N:q����i�]�"DI~�7���B���x���u�s4��Z����A�J����tw
h�|�xF�!Q�ttC��[.�$�+2�;N����0E��CIh�wi$����Y��%i$?�]��q��+w]�v @��;k��m}j����7|�8�u���:l������c��Y7�u�q��L1���?'O-���>�5����N���R:�B �SZ2(��/����}�R��w�B}3)v���h
�`
�������\e^t.�R��m�#\r�R��7+�����q�M���	 	���P�����V�6��Q�������yc����}�$�E ������w��U���}$��J����:���(�>����jc�{E�L9��@�
���w�}��Hwl�S���~*�+~��H����^�K;����"c���DsF�6I9��f'~�>��Amt�4��$�(g�rW�k��y_$I������c�N��9�cY���+n��w��g�=����#%&3�
��h���������U�1{8hu���Cc��?�J	�q��J�d���%�f�9����u�H��vj���x
�M���v��]��x���[�=�����0	�6	o3���+�T�ml�P��F�/�d��eH{��o���ZG���,X�����{�hzS� j7\�����*]F$�u�Q���/j2t�F?�}y${���pt4������IK��@�	�_<���g�k���Gb���
��a�N���8��2&~���8Rw�;Uo+��?���z�����Qi�m�hB���E������V��U�Q�j�������k	~���q��Ej�������z�U�����M=�+U=�V}F)y?_��|`i���^�������3�>����[���6����X��t}��S'a��r�
rn�S��0c�~O)��X���0��dS��h��C!V�T��s��P/���S��:;VC^U�����?[;n�Y����uilF��Tv����Z�FY�*�j�(���pq��{m����F��)G������o&�R�^�<9F����$p�33��@�S����b�#��14��8�ac%@S'p���q���"�Qs�~1If�P�������V�^��������sv�=�-���f3��;�0��'����2��d���'��k����E��}DoI��gp'��w�#.�l}���X�����0�((�y� n��Y�#+�[�K�n�<\pqv
��v����R�3J�I�
Q����w�� �����O���}��?U};5����;W�\.IS�U�r����u��M���^��yJ�5<6H&C����^wU�h�*��)y�B�����a�{�#"��?x��5xw4�k���C�S��(�
�c�����\����,�����w��j����W��R��5}�Mk�v��=��r�?�p0�`s|����I�!�R��a�O��l����z�$���*JN�Y�L2�����&���#D�mi]t������I��Da�����M��q����A���������)�d�Itx����V.��J����lz��dn����I��-o���f����	B��.�K��c��02��j���	�����CU�����D����f���3)����fL�>ZI.IY�;���M.I����}�=z��%����mt�X��'Jj�r-�w�L�y�=\I���SBi�>[
���>#�
`l�m��o%������wp���K��u��K��6�$^�V����=*��W��Y1�������
��;��^3�$1���>�Z�^�	*#W�YI.��	,�3����}G��*%���,��D�|~���3���?�#V��[YQ�#�;	�������}�9����C������_vA"���$������[f�0oKR6(|��6�_}�KR9k�
}��%�;�v��}^,��AQ���[I,I{�[���g5��Kb$���>�<���������=H���+���Ci�[9��j�������#=O�Z/7�c�����<s�3|�NiC�����������k^,���}�02�$Dg�?+'���$��
���R'����}`���'t�������g�k&���_��l�M�����D�x�X����]h���dfy�P�*
��C���Mr/]:�mi��$��l��J�UgBbI
1|�������IL�Q�"�I�}��������s��$�-�wf���L�����|�g����L=#������g����m��QF���bR����&���l������$�`t��\�����+������g��CN���Hw{�P���k�����&�%	�Q�	�hw\��E"t(��4�
S�b=OrO\u�����U�_�&�zF#(��54M���z��JMI�������^7D5����g�X���pC��u(�$��k-4|��PbI�<"r.�6���r��8hKV�,{#��uh(zUH�'��9�O
��a-�B�r��{����\�F�p�"
��l�-�X�m_����5,X�R�~N��)XZ~h&AL��g(��[g��XJ.�N��[�>R���Q����o"�8����~����Kb{Da
Q���x�$�c
z���H��#���>��%1��Z��N����Y��6���1]0I��3�����p]w�B ���<rv��>��>s�_�T2���O��h�����D�D���#[�5x\	��O��o6�]�
�PE�Y=��,��X.`I�#s&�
�F���M���$��8���g���F�C�4�����i��X����%6�x�J����n�3��
nS�3��k�3�~0�����\����u�����NM�vY����g�$�����15�!-2�T�����l��q�N�v�1*el�V��m����k�[�>m/��U�"`<C�y�x}L��*�&�������fS�ud����.ff1%��E2�"9w<�*5�5�uA*����Hck��a�@-�329���N��8�gd�J�G���m�(}��7vU�N3�~��N�f��E����O�F����e�|��T���b��T�x�����*_|�O2:2��I��U/R������1{4��ud�;��z�����b��FG�Xi2.@���4j�ho��0(��4L�x����o:�}���9��5Gx_!��'	S�1nkL��'A�(��{F�[��Zl���9��rQk���Tj�IZ�<Z��O���w��[�8���S���5�b���t���Y�7��QO�����R��*VC�Z��#cF�4D�yHY?/D�0
��U�E	:��B� 8���jET����0!.``;'B�cON�v���#iA'��m�Y��TC���n���������g�I��Fc��h5��-2WZk���)��I��~5�'�V/��;�������x�6X��Qc��f�������@o��jz��q���q����u�����MY?�K(��m:�q6�X��=�Q�+Nb,LG����u��z��:��E�+��u7�n�q{b}`;Q�@�|��*D�9���3���:9����;��a��eu\��,��[�dq�|���k��I����Q�;�F����:����G�����y��I'������S?���a�CP�X����#�+:�����2��>���c�N��g���V�8k*��ij�)��N���E�N���*��??����(��������iWE���wcR���������]�6=��/�0�rYr�6�
��)PR��aW�Z$dQ�I�l����4w�^m�d`��jU�34��sw4��g���J\���f���=kJ���PknM��L���c�g��������/QeGT���
:�
���)z��V�/B�!N�r���Ke��<#��-�U)2�]O�%-5��E�#x�����k���������N�-%Ex_K:����P�jHn-U�_i4�d�c���P�epw
�Z�S��x�"�������19K�A\���]@�pl�	��M����������0���R?��u/'�H��O�;�c#c�����)��=t����n+�2`��j/Q|�3�N��S';��fA������,���^����0�W\����}qY���VK������y�|�u����a�C��^�
�5$�e�U����v%A���$�����y��������a~�����_�T64�k-C��5��E�=�r:2�I�G7V���{d��������i~�]:I0-P��
���w*i�_���y�zTh�z[Hp�*J�����m�I����f���z[�	��et��,G�DZ
���&�P�s���\5%�&V��u_-�-u���g��S�����n���H��p�{GP�E�B�,�jG��!�Iqq�:R����h�z����k�u������|+�2i��T��D�M��)z[t���iN%�����!
}Z����C6�����7�^M��5]K���=�P�OD��>�����j�S������)�>g����w�BM��Q�6�vd�Y��|("g;�w��uGa��i�\�SnGK�oJF^<�oPG��14W%o���c��Ob�'6�&C��6�gmr���f�� ����jR��(���D����Z���y	�y�!��s��[����_���8`�B�GK���
����"jW�<����vT�����p�b�;�w��9���-�Rz����N��c�J<���s*���T�iO�)�6��������/�RK�����
��t>���aL��-g`���������������@���,�}�G��aI�MW`�#z���K6�g��GsD�a��4=u.l�G�^�a���4���\'�c�b_����J;���ch�����6YTNr=�&�P���^����v�$�i�-�i�T�|�����(%�,S��'��$d^�l	()U�� ��m����5v9*��h8P���qK��:j����i��ph>���]������.-U���L����5	%��`���`��$�$�@"K-"��S�����c�2Bk�([K���~�J���(Y���F��p�����8���Q���%t�U���z�Z��`5��T^UgaJz����@�b'������U���'��]����P��_�{�i���ih��=��e/h���#���h��p�h����jI(�.�&������������YS_�%��_s����.� Z4y%q�1w���~�v	%fZe8P��7���,���:_hJ�L�-�VN%\�>�O��������K��HT�*|��t	��$������PR�� ��w����	%�+�%�����`�FdYudJ8�X������'9����������g���^�$��L@��G����LHBI_$���J�PJ����vW�E�=#�B*���C�Z�J*�a�!���=k����mu����]g�:d�EDfI(1vF���,T�Q�%���m�PB�Q!��w{
�
S����>��^B�F9#`DXJj�h�+k#[�Nh�D�R'����j8��j��y�kq���Y�"Q>#��H]X��������6������j]�]B	����**O[J\��`��w���C�1�>�3������5�����&���n��V�����P;�v������.8���$��.���]��nI(�*���W��PJ�S`�d���oI(_�@�����$�(N�B����g{���UO��I+���C[Oh�}�z��#h���n�*�=��:2e���qc����9{�O9#X�*f�%����#�P��]v���`=
I�#�������v��	�=q��.Q?����u���,�����������"���nS�]��vd�k��WG���	%�����hI(	5E���U6�32	%
j��~qc�NB�w�H�+�M^�Plr_�5�h�3�SJ�H�n<��3Af�@+�g������a��1�����{��yk�d.����wa���+�8�]B�#zkM�4A<��lG�0eh+�12�m��lM�_�X��PR���h����Jx*$�|�6S���
�������;�U �����;�r��������*���u5U*����S�H��~�7,ap�r�������@8��k�U����gh
�M���
A�9���P�z��s��!�%��(p,))\�PJ�����]�+
��R[���Q����*���^F	�22J6��e�6�+��P���Q��!i!\�/���D4��F�]F����A$�.-;7���~
���D����6�UC�d�3O���c6Y�)�(����+~W� u��9�a2|���f"�����s�5]����u1GY��\�G�^�`��2��9?��ZIG�s#u_u�-���m���
�K(��4���t�*�zZ�cUB
	b�114'���K��`k���]�A��F`O��/�$������t���J?����m�f?�u*����h�aWy���]H$����h����]�M�1~t�o���bd(^M�U�X���Tzi����$����Ek���l<���%Ts�,�T�_@I8��2�F��l^��UL�kKR��B������� O�v�U���5��l%Y���7���(��P�����p���v���Ef�c���n->� Kv�_�E(��X���K(���M���J:?�9h�
�\��7l:
'��7�F���WS�M�^����11��:���]WC��i b���{7I�u����������F�S���PG����P2��k�b85d��d���:
P��%
����4���o+���
��H������ue��-)��j8�c��}4�Q'
�*|�M���V����H�X��_MZ�[pz���"5yZ�[�*u(�&C�d~����Wj�/���E(��U�6�K(	w����\���J��
�p��H�J�2���QC��)JX�����c]����w�QU6��[�,�\��O���=���?���y����
��G�{:��j�Ae_��h�
p�_@��D��3�_@�h iu������J����3��Us*u������%E����O<�=�(Qh�V����/�$������=�%"�������0L�_��r� i�!c=*�����n��Go����W��c��*�Cs.MC�!Q!�Q������,�D|w%��;����j7��s�=�yz��Q�u����������q��T��-��Rn�n}�G�v���\��!�9jw����YQ���_>�&j�y~:x����}/6��O$
�J��C��|c������_>Ie�����5U��O��'�g��_|�Y�|��{j�?��^�a1DP���/�`Kp�{EW����~4���
������������['����R���hu��o&�R����T���|�80k��b��^��I,?h|��O�>aQ�����S�?%P�]z�|W����#DK��s�hPnX"��4k��<
����w��
-�g��i�nF|h�+�u
��`L� �8= �]��-��[�JT��y���	������$�_>IP�t}@��S�|g����14��j��l�*���'	C;
�Q����~�$�U�V������M���2zQ��m-��V�[�Zsa����GKo�������.H]]�s��Q.��u�����k5���{b��y�B�����k�6�!��5���ORd�d���^>��8�Y����{Q7�3�	V�O�|������
@Z=E�_�Z�`,�B�����F{��������GDo��U�{��^�G����c��p����
���)/E�L��po��X{K��s��>`(�����G���>�u7M����;�yU�r>��u�����*�R�=K����I����"�`Z��/��60��.�4���O�XeF�H���$a���$?
�z�I����X����g��yT���������1\�gcW��t�!�8�_���)@��P�����L����>]���{UN�3��)�O���{�������+���,2�	��������G��������[U�2`4��=|�:�a(^j��>|*��e)Eo��"��{���;>��Sg�x����bM>9Dz%��
���������r�c�E�����x-�;@q�������F����y\��6����������h\�N%�7���_>������F��O�i4������Si�0>���n������{\�<Z	��a,E�g�su���0�R��uF@J@=���r���5\�wQez��"��l�tF$��i��6�`��~D�HZmQ !� i������U�@�����IB�5���.�$8���N����J��ihHv���_>I7h��m/����O2�����P��
��u�?�T2`S
�g�N������'��DGSm;�����v��2��zQ��F�~{P�_��K����O�(R)(\C�[�x�Yi�#rfb6\_n.���7s�$��a�T6���_<�$�{o�0S��Z�6��S��L��WE��s��!����7�T5���?i�d�Ob3O������N��=�/�$:4)���	�~]�q���*�)���D�|�Q��g���h�}k��<UYZ�������A4�p�5MF���;:��OKN���{�������������$����~�$�h	b����h�!p���&~�|_U�$�Z����!�Mgp��}'(";W~�G���P�0c�k�8��mE.�gF�|���'�
�xB�������M��6T��#,W=�����(�����%�d���<����1 �!�8��,~�^�����q�����������`$����<Pb��i(�m�J	(1��/��>��4�O6�ts	(1���?����J|�k�@~�5P�Ox�$)��nO1�D��&N}#%�m[��02�����t$f��:s(�����*w��D��)��@��Q�����2�GJf���4�/��#�^�����
�6�����q��v��5�)q+J\���*tG��[�@����>��=�X+��L�xM[��p�b]��{1Z�5��s�S*Kr�	(���<��ZU�	(i
�N@��E����uc4�6p�� �}k|��[�k���TR�&��U,�{�l�x}$��+'zU,#%%��������/�dc���3w�@h-~'NM��k�8m�F��#�5+�WCE��5>{M��rD�h����H���kB�������9��*��L����O��H@�2��g�����=sh��v��\;���������kb/���N�����"�H9�l�"��7�5���;\�bx���^T�	(���\����q�)@��k"�������#�RR�P\Y�����=�#*�J��P��r�v8AV�:	6+�C�XW����� ?��1����`GTST�	(	|��N�
~gJ�������*�H@	\��IVm�3F��C���@[w������FF;K�����k�u�N�'P|e#�e�G������3��i��	�����l�z r]@	WM��0t�L����>`oj���G6�Go�g{k�&t]��ZR���/F���P�&��������~��9���h4��_�v�g��<6��_��M��	�_Qo�HB�3�@(Ac��0�E(����D�����
'�V�{\��7�
���W�66�,0[����Q���T������Q�w�g%�n����=X�2���O$%�h����wBD��<.�d���m��+];�Q��l���T���Fk�^�h�Hz���Y�4.�$t���DDc�O����N�����I�-N�}A�SmZ#���J6�1	S�.�D���nt^B����ma��UO������M
:D�$�������{W!/�r]�4�7�t��(���8��n��g�u�Uy<��~�X[�j�So*�w^HI����Un�U����P����Y����d��;���Q�'{8<����qk�������(q#K��Me�tk�k�W��`t�id^�v��a�P���h0/�$Z�)�L�c�	)���#�&�E����!�=�
�K/H	���cqQx�<"�~�A��t�ozl+5[�	Z�<"�j��59��|!Jk���c���d�����dh�����:�e��	w�
���2J[AM�TR����8����������u����pj ;R�������37>�3�����v�����6~j�m+Hj��S��/D	KX'����v�}5�����E����:�v����~�7��Q�r���$�
tP�f��t�W�t�bk�b�]�U����w����V�O�c��[�V�Fa�P,�RR��m|��q�kY����5�&����"�!-r�c�l��G��lB�tl�Y�=%O-��W�����{A�Q��Li��k�p	<��Jg�H�3:�*G]#�~��2J������]�]�2J|`E�����S��~p��x��(A�BR`�'_�vTj*�
�
5�2J�����]y,3%-��z�N
�N��(�B���O��h�tl�(\����ha�c�FZG�a�����i�M;g�$���z�����]`h�4����:�
��2��e�|��@�����([_*5N��(Y�2�����2J�FN�@R�K<�]{o����chN�Hb������������ 6��?^��Z��<�K)����Vrh���m�j�|�D�<rw	N��"��L�#w{��bs������������c`B75���(���5�#8rw���87��j=��5�\:\��?5�n������p�$�J�X�����d���N��.�/JI���\:N(�R���9�;��S&����
��]��Lt����h�l������=�_#�U1��}w����U�����;oW����tl�Z
�����k�c��ZC��7ae���KQ�u��1�����-�����YL��z1%fo����.�d ^2���*�oL	���U�k������p'���zr��HZ5����{�`�J��OM�NM��#{�`M���� �h��Q�0��A#�=q��-%�`5ujl�X��ml���b���d}&+%;������Z:8��)���U;��@n����>�����������sU_]g����=*�nY��1����N��������ln�����%b�������h���7��+�?��KG�vS�dG�Wg7����/���M	�T�-"Y]���1��,A�h5�R��0��=���R��hk$�7[�t������w��ME�b�������Ox`hV�|���'�
g���+ox9���>���!�CO���=��
��~5G���P�:=h�\G��{���
�������p�q�x�]�*2>
���Ya���m@�k?L��+U�'��Dw���S)��(q���E����-�9�U���^�
���1�����=�!�A3W��{G����������d�2\���l}L��n���uS+�:��c�	�ia[W����l���04�R�|�:��?U�gmF���h�����!���9�j����+Z���v����F'D��n���a����rJ�_��E���_Q��it�dJNI��VG���;L���|\G�����i��Y[��f��n`��'������t��������R���*~)�j��'&#����FV��?�0�j�Z){{Tqi��|�+eo�
{rEvD������E�d�.�p�qO�:�����) �������:�wGS�T�F��s.���aE�^p������ku/����u��s�h�.��C��Dq��f�0%
B��`h��(+����hbh������*l��{��b)�*����~Rb�Hi�q7�PnV��&]�tF���:d��\UN@�f�{DcoA
�ApbJ���@b	'���)�2�����*���)���f_X������UP96���gZG�^��\K����"~N�����9�Rq�S��5}F�����yR���Q����6��A#<*Zqp�F��:�yM`�����������X�FC��H5^G����v��\����(r�Z����#_��6}l�w��r%�8O5���.�;��
g��
����RMh#����"J���br>������J��Y�{n� >�����y���$�(zz.�mqg	%�>�cU�r�jw�����@���~]BIu�j��J���\�`Cm��.�$���h�,����d���`��R��u1��a�s�y���������{	%��P��,�J��b1G��T�����@�(���#x�X��P�/��q���(ta�X:���G�nS��cb��Q��C�PB�I�|�}��V4�58�]����bw�LZ�,���G�f��I���|-gIj�����������5t�B�EI(������b��P�4���m%E�mJ�����H����E�]�}��$�����Xq�;	%�$EDFUw��Y'���A�5��3��q-3��lOs�J�<0\��}&�dcdE�zS�c�k/�L@'�'�#q??�#e&DKj�<s�	�����������v��(��y��A��D�`;����#������$ro��o�����;�����=z���MJ�v����'��!*��4@a����CS����V	�,c���(��	6q>�r�Q�������;%M{����[�}�"J����q2�G;%���P�2��Q���/.���c_D�#b���T�������� X�1r�A�
���NE�}4����9r��5rR!�G�����J���{(A��sH;0Zt�������&gA����(��
q>�K�D����@����p���}8�&O(%���W�M;%��-���a&�u(����D=�&�����&�CT�jt�G�^�C�!������ZO��^_j8�Q�0�:��/�������&�\�HF@����}%%u��Y/�+���W��e���4Tt�=�B�cU�����+zC����a[E�}4��:�,�_�>�#a��[;A�@{�}�_3���
��$�t�B��q4��}0L�y4E���V�%�S���U���d�}��o%h(���(!%��$P��N�v�6��Y)������;vm���-B�#_�A�M=�}2J�b]�{����������$k`q�g%�W��	��R��NF�4�f����b���
�	�	��N�zTz����f)^/
��x	�����()��0c������9�_#ZI=���vm{����1�h��k^����6VCk�}���QU��Xw�)��X�Bj�jP#��O�K��.|��\��=�	�m��Ij���1�P�s_@I��[Y���J�R����1RR%
�o
%7T�tj��}�}���}	%V�$���t	%���
���r����
�&���#]�F����y��CG��:,����A�xF'�ndv,!������"�z}�����
����:v�uE����$�0�w
��V���M��^.����*�{��9����iN��l9�?C/������[q�\�lu��rO��j�e���0}����i�h��4,������?�n���7���M��~���T@��K!GW��1
�M��Zx���fQ0����xd��	�4�|���g�9�����8E��S)}��'�N���>C��-���T��u:����<���6Ea=#���T�*a��CG�q0c��f
7�\�v���K���m�i�p��i�$Xj�>�')t4.F�R����I�>����;����Ys�b���>E�=�R���RqQ?a]V����#�y;��Al�b��;=�A�!F���3�������:��_�l�0���As��
'M�M����z=��y���>��hGaz���Ie���Y��
}L��1%��vv`�p����M��kt���N���[�/���$Le��}��>5}���u���dz"@	����)sC;_>#��K�����>���r�$��Bf����Y�����C�����E�v����"h�2�����zD�����LZ�PLP`��|-����"����9�V�g3q�C�
����:&���4X�8��Y�k���0@7&�����3�6�!h7�P�]��hc�6��FfC�{CCP��Rc��wU2��U(��3J�v���p��34���
1���.�J�UG�g(�R������;]z.]�G������2�Q]6I�0Z��&�pZ�Z���!��/��.#��������vn�*��������>�#p�k�gQ-z,7|��L=�tB�+��8����kbJ�@��IZ���'��j.��j2�_0�g�E&zINC>��h�������&�f�!;v�k��:����'�.�(���u4�$5�IK�V�3q��� 7��	�����}L��)��#Y��1�7�XN�`>��J��R�"yNw�$�<_
B��Hv���>�2IaVD&!j�I��H���h�I����:b���j���f�N��U���j��J�5w rI�������_;���M2��-ah����U�.��<�pJK��R}��Y���$�i��Q���g�����j��#7<��$)��=����w�$��j��:���rw��/;Qy��R�n�����{@�I�v�r;Nq�RZ����������Vm�O^�8���)�=o���_�e�<�
�<X]���U���Zk
��O��������zS2b^���J�\�jB����M���B3���I=���^���4������w���j
S��%�<s�o���>C/2�
����:�g�E&q77��|���\2�
���NUS�6��br	�N@]5qT��^8�����k�B�ghrn�cC'B�~^����N��"(��0��]��jD��~O�:*�^�V���S�h����������1�<#�g{~�@�O�����)B��x����lt��j�\S��gz������
������3����srM���F���[5\��M������9�z�$���?����3����vbpec�\_os�%4�U�:!~�q�O-	�.6=��6����z3���M��zEQ�A��g�S)
�u(G�Gk���i�A5}�)z�L~2���KK����i��B>�9������}`hnqQ!�p���)z��8#�
{��)aW^��f~J_�;?�w/�c�����:�R��WJ�^F����8��m&�0,s_e�'�k���a���z��)z��cy��vK)z�r��`!�j��/|�����E����/�
�Q�
1I���!��)z�X9Sox3l�.{�B�>�j��%�-Go��u���]M'�����zT�^�q���?�S���Cx�B�'�32������ii�O�W��\�� 7�h���,�B�d|��z����U���T�Q��
!%�T��������M�Hy�>�P+U��^#�@�n.�;��(����	[k�IJ74�,��X��%�$%���j��b>�w���!~��^GF��_���� WS�~6�cjS�O
�z�*a��OR��%u���K�C�T���\���'�T��s��P�Z�j^�w�h����~��h�h��U�2��R������N��vSGn,�G���u�uhAC��f�������[�#J��&<�u�6{Mswc$;e����g���:\a�p'�E��IBvT�,f6>�S"���F�_�( �$�HJ����(�����3OC��a����	�i�G_Z�M���������8��L������y��p�$�0��bV��Q�=�EP��D�)���5��C3=������kZh� G����(���bSl� ��v��`�����gh�I�x.!l�[!����Q��!,���k��?���P`d��'���������"��$�C+��8���m�@4�QU��>C3
����(���rrM�����(�MD^�����%}���`�!a;b�s�u����&fp��'�vvM�E��0���G�jr	���q���M�:�q�2y��r;pL�T�i(0`�:��T���l��;xK*��$l����Ne���5�~-�$��R
�,���L*I�bA�	~�.e-�$�U�8�-F��H%g 3d�,lI%���u`�4��J2��9�X$��BK*I��h�����^����~� 4oI%iZ��1:�v�1&���u�/\���I�Ku�)z����/2�p����#�}mOU5�������_�����q��������t�z�eC3hi�6�6~��On\���z>U=h��Ay;����Q}�+{��_����?C�{$~Uw�v���V��H:v4�o	%���&aZ�x�	%�g:~���
�M�P������;�[I(��<�Z>�-�$��	�w��.�����h-�$N��DpR���J�NL2��s(�$H�:���-�$��B���9���v�(���U�Z���=P�r�P���]�y�=��p��d��&UZ6�4m�1A+k
	����}.���a��C���I�Y�BIZE�u4�,�sm	%�<�wv���<s����,3��<sh-���DL��oK����}<�����P��t��M�_�J}d�u�G,"������9d��Q�.�����+T	I��]�*��BIX�	���$>�	u��n<���<��c$��AX�]�`�oG�^{����A�� oG���c_�/ma:���3�H�>�#�5�z�Qe�>Jg���cVC7��q�	��(	u�>#������#����H��r�E��N�}#]#<]iq�Q?iG����P��`�����+�E�e�chG���v����##�I�I<0���\&I��B����!�$��Kf�I���I�D���.����j���d9��#���W��������M?�����=�"��s3K&ItRWw���en�$����80�S���z�������>����R���-]�%
_��`?h]�S�^�y�Om0n�2G��hK�7�$��������������9N�j!nIRQ�O����t�$�j���� ����U���
��K$	�I���=t���"��pF�{C<�D�F�E��x�E$�����$������h�����~��w��96E�����srh���#d�(�
7������2
�aRmG���4~�|p���9j
���M���z��C�1�XZ�hI�vu�����]�v�FH����9���U��p����M�kvy$�3�|��Vz����a�4���#	��D������]������C/�����)��������Z�cT��;�l�}b�@E�qKwv��C���;����l�6�?���v�$�l�Jt� l�$���J����*$���w���Z��	�32S���;�Q�_j�~��V	��;�2�YRH1d3^`������|���7x�:�x=��e�F9����G�1�{?bIJkv
��dUj!P�ni�.k!�������,������I�l
�	���k�i	�3�=C������[A���I2~���u��$��WC9����$�pm.����1K��
�D���[v���c�@�Ce�:q��
���H������yh�SY���[�d[�$vq$�v�Lr�#XA�3�g(�U����D���mq�e����j!���v�$�;o�/���%��	���fl�@*Wrw=Jd��f���U������7���}�E=���e�DC�	�����<��PW3�N���]��Q%��e����4Mt~Cs.����{r��\��Q�@���Z6�l���:^k6����[k�F�O�"I6��Hk�I��V��q�R���;�hk���S�����[�.�3���������]�;���)A���������+2+\	�������H��w��^�iK_v��"f>�mz��e�������&�����(���h��7y$u��z���5{�H:��I��
����%��G�'�
�zSO�y�?��5	]�q��������a9,cad$���}��E���e7�}�PR�.�����4�k��F�B^�E����z��*J��X���[6�l�E�a,gj4��S)��zU�/����H����oE���.��pRf��d�4g�0�c_�a��!a��h�+W�R�n�>�F�%NtuxFU�0?��d����*]I�`�u��xJ�0������d�
#&0RQva$��������T�r?�@��H�av�F>a*@�_����o��a��h1�q�I�������Rxm���b��rp�`��A���Q��p��J��G���W���@�9Z�����_M�>�W��mSnwX?�j��7���x�s�J������X����/�Si6~6(��`�o���
cAC�0��N% ���h~�FFZ\��,��Y`h?�V`uAL�	��`qc(QGE��=�?{�%��
���+_U?���=�]��+�9���n?z���>?���:�)	�P`/p<[x3w*U"qj�,��"�%��%��G��iG�T1��n���R�Xe������L���U���{����1�����l�qZ�b+��;j�s�9PoM���c�u��ycq���~q$�5Q"�N���H���
�z��~q$AE�N���V�=q$�3Y���g�h�����(��#y�k��g�jr+xHG���8^Si	XO��E;6]
���`h�o�
���ow*
�pU�.��#y>--��H���T"!�����qq$�dnq�JG�`�c���/�d���@fr|q�#ih�������2�aO	U
��X�R��*C8���	��$��J��6B�$p�������T��x?������z*��q'����v�Y����Q/^0��]+����y�u�*��mD�����O�G�@��s�qq$^;c��-}3/���!�I����������c������B�px����#1����}ah���h17=?��S�.�C��0;���&S��W���R��M����=�M��+����s.����h�bl��{�J��Np��R
�?[W�l<�()x���+c�5���#��*jE3�B��4�M5�fF~�I#�iz�QH��������:�*:^4���V�rR\�z`���K�ON�U�{�d�
����H�S�dG*pTo_Z0���=��<�����g�
��
�>�d�,M��l�X������aY%shV�������l1���7���t|6	#)�

�m�Lh�G��Q&���Tl���W��I%{��wE��!�k�P��)���s9"�L��@�i�F�1N�����h����h��=�p6Pu�af`�uY$�_xW9k�F�lk��l?��X(�"/�C�IS����5<!��i��daG����$��p�#Q�g*Y�$��5>!������.0�5'7E��m4�$�`��+0|��h��c��f�q�s3���B@W0�V��H�����{'����12�H��3�C}��&)����5�2]�(#���8;��"�^���W\�a�M���kBA��w��E�������f�4
��
�%E��"���5XT���{7��XbZi>^�n�
]4��3����Zw-��-�,#����U@(+�� u�#u���rT����r�nmC)4��c�u�r:}��~��E2�������%��6lg���~�J��+�@�Y��vY$��&��a*<4�|Z�H�NJ�W�Xh�y8�a���;~���D���;j������H(�tT\6=�����A@����]��$RR��5Ou-��h�.�(Nq�H�f��Y*���pw���m��;���p$~���~mo��-�q��H:b�
F���d��(z���`Q=u��!�CG�����=1d�����>���"��}_v�Ki{WVP-:U]��H�5�kF�YU��es�����&8��c�����������E$e{��o��J��;v���Ip�C�(>:������b�9����r�m�����.�d g�X�
{�,�����C�MO
�h���E��u�����I��������)�Vpz�i�^��-����[9�����d��s|�H~�U����x��#���H4�.�d#�Zy��}@X��hr�[�Z�,����^����V���_����[���=����;�ulG������`��/�,��q�����k�M���mBy�*�X�H�d����3Y$@Ty.���E2'���u�x�'�����~mK��7RLtd�H&K���BTJ��=*��>�R����o�G��K��L�-����f1������,��`���]	1�LO�>�������	m���k5h[��"�NI���.;����L�F����z���,�  �>���#^��X?���A
�g{�k�#	��+3����<jQ0q�b�,��5/��\����,Y$�A������EN��p��7��]KH�JC~�����m�&���kWF����h��#�����qvd�����<r�M�';�u�m��"y������
��"�f�T�c�e�S�����S*k�p�YfK������E�����+vUk�lz�	2�T�}!�`�^���*'(�\��]~v1��"oJ�������+����j`^p��j.?�������5;I$�#O>����!�[<Kx�g-�
���8`��zm�����&Z.Y����MP�����kO�����Y����Q�g�'�m�LM��;\gvXz+��_I��6�a>��y1$�DF;4?nM;mud�G�KM�v1$AHR��')��	�s�����S�.�@�
#�R��]vvM+������v������-��T�K���\��@=����5���-�Nz����#J�7[?����tB���#X�^8!�6�3A<� v�O_v18w���T��`�M�>��TQ '?�����quMR�~!$�����V��_��������)YC�pX,s:�l���G���@�{�m�F������c\OSv�Oo��\�bh�G�-��������5�	����|4�Ql�a��{���\�}�@�@��)�����TQ����e�� y�H��!����/�B���r�R�"�0�l,��`���6��J�Z��G���/,.����z�h�-O������_j���f��]�,�4e��P1k�N��VY(�x��(��h{�?`-���)�F��LhSUOSv
��r2���|M���y�r�;��u���B����/��+,,�����H[v8�^l�O�����,���V+X5���=M��
XD�T�7C)�+������_I���"���4e�_��������l=M�u8<^��04���O���)%�?�*]}�8���!	����*`�����������C2Xi��
��C�WC��! 2����1��S�W�����g'��������	bV��+���`zw�SK��bH&[�M >Q����n� p�h�|���O�^��k�)�l���.}OSv����~���C/ik�C�B�= �=e��RT��V�s����H�
��mu����G��;t����G��)�B]�����4��*v~K�U<M�-����Br� LSv]�j��������Y�I�J���V��4W��g��VX���y�ER&�8��T��_(����N ��Dm�n�DS���b+v��>L�<tI�,j��o?�DQ�<�{�H���\�N�&�NF��'�Di����bH��������/�$H_��@i��#OIV�N�*���H�CBEl^U_��H
��%3H�?���68Y
Z�����aV�F�"�/�dpb�)|���w�^������/�(��P��P��WO{vl9�L�����+z���[��w����12C)�:�}C(v�C�,-�Rx
(���!��-�Y�����bw���dw�T��m}1�V��)v����C��R�n����BJ~q��=��~�N<������7GH\�;��u=Yr��vz"}��VM���Ow�d�i�"9[I��u������1�/I�v��|��������~�N�P
��
`.31$=�������,a�s�@=�i�>���C����8#�/n&w��YI�`=^NNL9���y��yI$M��U���So�=�e������I�@��.:��S:���
�z��y�����GR���Sq���#��41?�AJg}�V�O(i^8_���rV��\8/�$p�Z���c��o�����>f�#J���P2��J��3%����8nDQ�L��Mp
Az'&������{M������n���1��$KcC�_��fJ�[q��u5��i,��=�~���J�|��C�U�G@�0�j���D;���������C�*j�;�T��0/��&���F��y9$a���"�g^I�'�~tFK�������l_8%��lQl�So4�)y7�&X����v������/���X���=�P 2������<��37W�>�Tc�:���(����T��Xy9$�X�O�#�_������;�����D'��
E��P�M��Cqfc���G8��t�Ej���sT��J
��w3E�H�Vc"�F��L��y��>?L�W��b���;1_��`'V]��yE�5�rfQ��TH������C�N�t4ob��!�����H����CIE@���!�tF��fvu��������E��/�d��xH����QJ�r��{�5nUN�H�q7��q�3�c������l5���W��I�������3��{������8'B���[���i���=/};�n�a2�2��Q��R���o8k-��G�����>Y	��vy35��Fz��rS��L��o�X�:�7�'��$X�^�;9��@�**��$��g���������wQ�_�j9<�]s��v���FT�La&|�|��e��xH��T3�`��fs�N�1O�����C����!hP!R�.������{+z-p�����|��y�O]^���VU��m2�	{�XA���y�h��t(^
�75���UT�K��yw@,��O��I�����w��3�=����q�i���	%���VF[_���^�D�iT;�g�8��s��s
���D �]$I�I������E��.;����7���$qU*�*�6�"I��B�6���^I2IP��v6�K)z?�Q����8������`U���p^��f���V�Z8>�������8���=�:|�nk]���o��y��V���{���`K?��<5�k���3�$���z[����,�5p�����`�CV��<��S��dI����sM��mkL�g2I*8(�	2��L����}*dp���($�"b�J&I���p���S9<+
��m9�y�T��V����T6xd����v�h!2���d���=��������������i�J&I�K)-a�a���L���dk�6����N����g;7���!�T�q4��V2I��yL�(�X�$	��|+
���6������������LV8��V���u����H��/;5���� |��u��=��'�W9���Z:���U�z^��I��J9�1�����z�8������g
������m�V���o��P�����=��������xBG�~>q��hDDv�����e�lN�f�d���4��oE����d���j3ZG��h,kv��s|����-�t��=*�ud2IX��>-�5��I�3� n n�>������t���t]{��d���h��I2Y�����I2����O>�J&I��eW�)n��\�!���'���$�(!c����$�$��q����$!��Y�g�+�$�rd����e+�$��f��}��������4vQ��w��]Xpi�.���#���������f�R�����_��lT����m����� 7�ehf�.��_����~������l`��U�^G�~����<0n���{E$�NB��g2I��������L�@���e,C�$�:�Q�<��%�d����99X2I�����%��d���+�v��-	�k�z���F�,x����TC�l�g��e�����V&v����l�f$0����$m'��
K�Jx+�$>�$�P���$	![������L��� l%�����C�����L�g2I6�4
������K'?��5;^KB� kx-iMBI�
)�\y�������HYG��S��P������\[�>�W��5#:����/N �z ��(�$��+��c�(J�����u`�����j$������gtA�)_��}����[w$��-T�^G�^{!eK������P�����^0I�� 'lH/0�n���8�b��?�tB*�)�7s�$�`���`���.�u4<��#�QiS�qv��;'����+����K3��q����L��X��g��v�2��SK��;�$��V4*�!v�I��d��z�5w����C��[2��3F�J�;�������N����yu��O�%�5o���4<�K�����h:�4���)�X�v�/�$Hp,X�?8�F���c��);�F�����n���f$
�������m4����Ra[xHI��83ppn����������k�Ax�$�sW$�����}�$!��Ua�gi�{�V�V�%��b�tj���]`}���N�vP��o���^8I��������^�����D8��c�~�}Dl+IVc�3�w��,���T���#���`�=ci��O���SJl��Y�����u�p�h��}t�]6U������&���F���chFRtaP���)C_@	L��
���}�$��,s�! .�$�(�DiB��I�/�`|pz���Q�� �cr5��Y�WE��*;{F����6A���i�O����)������9R�+���x�J�d�
�\�Y:�������^��C�97�����uf�����$*x����k��#�
Jag��/2�R��a�/�U���E

��n���M��zLL��eW�����w�6�J���M�|1�X��O)�$-����P����e�DQ�>&n�Ut������S�i���j:���(tE�i�w:��s�����0�r�yT���T��< ��%(��_�DZ���;��(�E�`5��n����t����N�-Qg+����pHoi�~�*���i�2;��F�P�6���t����Sw�IZ�����b/D��Bi�
���7���:6q��M�����O�]V�����[?2At�����/�vG��!��������=��@=(��{�|�}��/mm�����
B7�|������L=���>\{�
��C�[������L���P�t�_t���b�B��}�$�<�c?x�x(�����t��������}�$����p�f�.�$�A��Okc��K'��{����5��mtdZ�">|���hk�:a��>���I������<���
K";Q����H~�b40���L:I����B��e��$D+O��/�d4�J�	�K'	R��
_M�6��I���?�<��I�6���H"��t�_��?x �t�O�J���+SNLQ&1�����]�;�0*��	|���w8y.���P+�'��f���-q��<�R������x���������Mr�l��J�`k>�Y����#�1b��q���o`B�U�CI,q!������_h��b��r4���0Q*�X1���@���TM#R22����-�v��^��D��������T�[8>t(�C/���D��W��[I<I��?X��>��$���}V�A�����]<����9����p��>a����
�+z���z�us��En}4�B������D�/�<��RS�#G�(����B��[I6Io��%�l0��;-��i	���{/�f1���
��K�i�;��1C\<I�o��p\��IR�o��1e�m��������'O�D%�s�ub��w�����PY����w�B5]�a�K���z��*���&0�����fH*��k`�1N��4=��xFJ���/DCo(5.-A���P�g|���������j����`
*���z�$C����9P��V.�����14��N����y���w�$��������m�����zG��b�p�*��qm�������h���[`Q.O�v�����4�o���g�vC�
F����������UJ�����F�l�������OKY	h%�Cs���HyZ����\<I�`�+&odZ/��j�(���b/�	hn���?8�}�X��E�r4�X������k�H�?|�h�9��[9�����Bo��i'�gdFR��b�|t$�~#i�=L���5����$g����p��!^�������{�j*
�]���w���
�^u~�IN�����'�6��i�S�	������|$o;j�U.H���m��B�M���2��?#�n[!,�5R��h�b���S�%�F��>���3��)=I���K*���7m�c��G�������<�]�;��7��d��v��v�d��5��{��th�32����n�H`����Gt��s`��\l��Pw���gd�
�����d|�W�n�������.*�����&
P�6q��Q��bh���<C����j��a5=Z�����2'�7��]&Ic���t�k��c��
v����6�
�dY:�T|�G������=�H�����~KG^S�0�a���������B
���d�;��Y��l���=���
���;���*.��s�MZL�l!����
��?BI����}����JC����][U�sk�Zh���t�V�\h�:oC�����k���t`����$`8j��<������V>�����i�
��ur(	p��)�`���r��:</��T�� ��Kx�S��H��E�(�6����$�Dp`���u����R[�1ST!��$1�(������zP�r+H�g�Q��������mLk"Ip�������H_��u�>����Nm���l,���H������4�8j"I�������F_"I~���-P�U�����\�m�7��I�WC�6o=���$�Z��!��G�e{�����v<�CE��:wt����'���z�6��s���V����x����7����k�j��2n�����X�YFk �wY=�v/��z��!���3�LtDmz���.��L�VJR���I���a������^+=���P��A�����P�N������������=�N��<���u�L(Ici�F�b�M(I�����}�&�9��n����i��<�����Qm1�������v��UU��_{b(������M�jBI|"������Pc����Rq�C�]�D���C	%�|���BBI�`����,].�D����>���lr�$,o�������7�\{�`�F�CO�%���dX���02�6g8�rE������	������W����4hc�>
�����N�1���h�cb[b�����SG"��~�G����k�uEM�\:�F����i
��������Qk7��J�;g�%5�L���kBIzG�0�W��=���i����\�J�7����fBI�}F����yD��:G��V�G=��3�'�3��W��u7�5�l�|��F��R	��,���P����s���:�S�����	������td�6��n��'��H��G{�[=��z^��	�'/D��{���{��M��_���T-m=_tU9��m*��Pf�����)\OB��l=��)\;
�<\���R�^�5����[y	����k�5��p�'KH	��Z�H�@	�X���"I���mx��8B���;\����F5��j�����������R~��Q�1�{8��n����8��G�^A��xG��N=���&�r�� |�G�^]��u�L�x~FfF��R��)�Z�f�l;��6��YN0��������������X[u�!�����=����v lm����E��xA��vA$m:�'���E���\���4�@�QN��lY
w�O���4f���`)��������N@�N��[��'x��W���1�<I'��P
��n����(�7~�I��a�w�:7�<���H��sj;������/wy���Ko�p=mG���nL��
LMt������m�rY����7�zH�=5(�)\1��g:kj%C�H��b�rH�	\������.�-��i�CRt��1�;�R$b�zr��x*�<K��Z�LZ���x~�Fr(�hI!)��H�|�I!)�"�
��D8�'�Dq���������]���u�W��\��8�&]f:�^���P��|�d���U�Z.���UKOv� L�2���)Rt��[.� �K"	O��	�@�zwG�������3)5�j
�R��z�6�?�y��ZbHJ�X�+�>���djsC�=;��=#���(�qB6r���v��>�1��^MM�����$iL�lJM������]���D^b�N?v��I��h$�<��cG�;�=/z���
�
��x3B��tq���6����w�.��4�m��������$g�.�6�P\�3^�5��������{���S<u������c����'7�yt��0)�����X��Wd�����S������)qs2~1�����o�����2Hp`I?q�����8Q `����-����k��	w�����T�Oh$�w�ARg�ch~�|�Dk�Y�5jh^�v��k���bMMIS7��H��St]LI��r���r��c	���e�e�L������p�J��x|�%H��]?�������"A��������=I*�8�")I�C�Db��q�5��=
dE��2F������}�8Et��� ���9�A�y��z�8������3���Wd� }��{=�������d�qCKG6���?
����YKI
���)�_,�I ���S�!qH.��r9v>�K 	Bi+zL���P��B�H"��{�.W������Q���1p��F�EHQ!�D��4f��+p���`�9i�(J�����)]I�
'#N���;�!=�[�>�g������[��MS�������
��Pk��������E���tRm{�|#ia�18�
;� y�[!:P0���e}L��*�5K�������@�F/�������^���1���L:���&�����G��t�Z����"������W5���ttFe�~4*V����S,O����~d��y�oK;Sa��;%�'�O�hR�@��C�*�n�/�5Hq�r�T�L?Jb�uh��8w��RQeYO���4a�3���@�>��7C&����n�m�������^,�����T(�3>T3YZ�����p��E�:5S��[���#�MG�K��'x�7���}��>Q��~�J��J�%�'�IiV��q� +yO�v3/�c��e�����d��n7W�����a����[��t��I�W��g�"��R5�!�����J&M7�
Qx;G����AK�N�c��E�v3�>l��J�x��i�=����C�r�d<��G��U���z����X��qQ4�Z��>���sV��3�b�&����t|6��	�	�H��}�fuvc�����Z)�����MOK�y�9~�I�]�<���v��������=0~��td�R�e�#E��K�U��rCo(�����_�H��H�R��������Xi=���(�}~0�0���?�8dZ`���v{����N!���'��n���)e��v�n��3�/U��p~���W�����EZ�n����}8���>�b�7@���_�H���U;�Cu�~�#�*
�H��.}$p:��T���/}�
s!q���z����y'"s�����^M.UF�EnOVio�	����gW�2b�T�'~�u�3��S�B���g�'|`��<*�F?�*�r���^��h���(i@���H��d��X�y�?b�'�B��zamx���~�#�P�#
7�p�G����	�y����"���h�~�)xG<J������[O����/����?7*��Ut^Z��?b��}"d�]p��H���fQ���M���W�%�h�<xB�6rT��v8=+R�#v�����o:V���D���x�pN8�J�������b�K������8�7q`Gs��xY��^�U�~��#�������-=m�(�����������=&�����axB�Y�i:��|���9�j��������<y�gZ���i��'��v�~(��@�
�=[�Od��s�A���x�Z��/}������*Dq���;F���c��=����*������}T�[x�Vpt[�:2��@��wzXcS}�#T�
��%E���F���L����������~��?9GCK4�����%�����Si6Q9����Q�G��QQ*5�JD�S������.������X�G�����XCZ�����D[C����3��4ul[���P�GtZ�h/>	TG�6�@���<���e����������d�4e��H����${����	�b3}B��^�Y����Vd${M � �b-��5� ����_{���i����Mw���GH�ihUT��e�L������F�G�|��7��G�����/[�P��g[��*�lU�$G�G�F$�����c\���,�=���N����q-���]���j��1'h��d~=���������j�XV1��G(O��G�`��T�KE��
 Q�x��Gr`:QC`h0����)���AC�h�+xrfKc'_���W���k�0UT�G�^���RxY+&�D�T�t��������GX��A@.��.z��"9��5=2X�4��������	�h9=�+v �$[��#�#�h6�z>=�;�!D��q����q���m0=2t"���i\37����zP��<14���tc"J�H_N)���=1�.�i�Z����5C����g{�!cO-c��B�������g5j\5|��	��v?�b!	K����|����h�+����i�#a�O���+��L�(�?�@��	5@K�n	G�G��+�
;��5=���<�^H�,�"��a������������ ���X^^3�*u����g��CH=���CM=�������~d��#h	�����Y85+�.z��+��y�Qp<")c�Gb����3,�}/��%8�C�H�H��i(��m\�H�d����[���
�����/�����RY?���#Z�8-�_K L��=�U�b�*j�]E������x�}��9=���];�5�5#Ek�����y�I�GVG�����v�9a��N��n~���Y���k��Ga
�f�G�kY��e!�/3�h��9�������L��=��qdG~���gr";�Op�G�Wo���_�T��H��$��� b\�zn�P�� ��z�"�-�Ol�<{A���/{*Bp$xd�M�?�#�������=��v�m	A����}&X���S��zC��y�q�r!nw
D���j���{��n�g���������{���g�P%l�c��U�_��<"UN�n���Gb��(�8m^:�Y��KP���M���xY�0��J����r�i1U��h��7$�H���G*��+�b���f��C(���Y�{/�����Lc�q��'�4=��^uM���]��W��[�K�#Y?=�<8�\���������\E��;Z�GW��s�������'C�������26�&��p�g��u@��	���td����H+zNJ��
��U�EI)&G���R�O;��3�u��
���g�A
�V
?=��tc#��R
��q�S��-\��G���m��.z�]X������;�+Scr���������-�%5��J�2�z��;�u��j��A��s�������'|�b�;R�'D#U��N����<^��6�80--��e�owC���C-��5��*���
C/�fB�iO�k�V�-�9��{=�R4��R�&�0F�w����x(��E����jscQda-�^�������rv���14t$8�*{�GJ�?d�(l�P��!B������#
�s!(
�b����e���e���d(f�����#h��>��F/{$�9q[~-X�.{�l���#DWA��dlm��j��2���=��%���!��r���+*U��a����t0�1�!��~o;��d$�~��0�����l���^�Hg!�b���f���
���;�q����37pNJ
���B^w��5���<�Zp����'k�4���X[���D����5;h'�E���E��^�R�dU�
bG�������z"H�]��P�������f�<2;��I��K�~1���I��K���pB��|���9?����`K�V�:+:X���1����4J/g��eG���@��6�Q��������*��.������dkbh{���\�H��5��a�4�~5H
�������,(WF_��R�%�7x3���=��x�WE�[�Gj�@�rm��O��G*�ljz�d���j
���YK�G�`���9���4���Q����y�����E���f������#���F���6_v��'����8mi�~6N��}�.��!d �5����d��n(T�
�B���=�	 �R@��^�H�h� �N����KR����I>��]	�����r���x��`f,*�:*����������)g��-d�u���;�*������?x����t���mX���D�9	����r����b���R��(����1����5
=�#��P�Ka��1�W��<`�(�/�����^���L��~�#q�.�����R�MSg��;�zI+[����������#�8��d��d$5Oh?r�/z�;8�������B���=�<�#�W!��s�����D$�������������W��X�}c(�Z��z��������:t���@1�KkRc��{�~<�V���j-~��a
��aq��&��_�H|?J�`}
&��Y%���4U���Gk�[d��������:Q�E=��d%�,���N�����{l�N�v��/�2�h��n[=
'���w?r���H:+�1h:�G���+2��z�����6�~���z���:&��P�b��kj������m��zd5Z�xq�q_*X)�ZF��)
�Z�d�sV��-�nl5���dtw�7B����_\K���I�g���(���D"�V��e����p0����S�n�M������z�#���(���i�k�9���a�b�80�)L��=���\V������d���a2L���U��l�����]j��n��"�'}�F�?U�Y}�`J�v���3Z���{��$]��zZ�G����J�KC)�E�D������wz�#���G�	��_�H/��
X}����	G��h���R�GzWB�`���kf��O@���p�K} ~Q�%2�*�����]=Zu��p�7��p��f�?bV=zdv���aA���
�
r���h��D���mh����E���Gb���	��G�,w�5�V��	[�~6hgH\.z$pY�C)Bh�v���h������<���������QT�8�#�G'�y<i"�yj#w����5�������}Z5i�� YFEQ��6�yl�O��`:�5��9�+l,bX����d��%�}����G���B�p�U��D�DZ	4{����6�A/��q�k����������=��7X���y�k�P=W���c����r���-��#t���aGX��ka������xO�b;7������fE�a�����#���������\�)s�=�����i������A�iXL�Vx��=ok�����'�����Dl���oH�:���}@Sp�s	O�����tXiA�������P���%�����pA��DHd4hX��J~���|�G�����
��Gh���}YU�h�Z�Yh��=u������
�L�����
J_�Ii&|�Q05��<�������5X��c���K�@�Y9��#~n�FvD�����6:4�^f�-<��f�G���s��[�L�HoxB�5ihx���6v�h�*zDB�GM�8���	#|$>3E�������J�	h�u����	�I3�02�o��j�����<����_����l�)\��Dmk��6�V�-u����D?����������8��+&5�P�����B�5����</~��	BF��y��9�t���o��DQ���;+��K�#l��\��&sB��4q�����`��s^�H�[)��������>��L�H81�UO����w�K�8���V<!�������gQ��L�Hl%$Q�}c��'G���E��k/ \�H��N��*L��(��
X;��m�Ia��yd�����}��0�>�j,(���<U����Y������p[�������y&}�k� �f�>�y����R��A�Z}BG�^������Jk�h�+<X������[���+�S��q�g�<E��
�&�#dwdD�q4{�}���'��dC	FR�<2����x+����=6�YF�[e&}�U�yF���$�#6�	��8V$�L�����p�lD_�G��Qn	�9[3��W���W6!��}3�>���p8@q��>����y`���$�,G����4��>�}�@
g�������e�C���:�>��E�
c�)���!VVNT�
��G�� ^Hn��V�=�v��2~��h92��y��>��z�$z��2S���(�~�9�L�Hia���s&}�i�4�����Y3�#ua6��a�[I�He�����VQv&}��6��E��L�H#m@�G�����0xK{B��x^��D��~>[=*���;\�*���?��r�l��IM��v�z����X��>�;;�|l����#[/��q[0�3s����V�~yeN]�:��u.���#[���~=[��)\�z�G��Qn��s�V��J��wd�Z���8�m��o3�#��R�Jv�N�K�^�{�'�<1
������-_��@�3E�����GC�kj�'}�#P�+�h&}d�@�����H�;*�e�����w&}���H�'����K@,UNEcGZ�#X���wA�s"j�6!�
��	������Ll@�Z�B8�)�� y���k~pl�3_hn��Y��r�]~��G���]
�����'M�t��Ur"*R������,��<?[e"���#�
?d��D�^Z=�^N�P�~'�&z�����9(��u�#�-����Xm�zo���sXl��%���v :�������R��b���JKv��
c�j��i	��r�1��a�>R;��L�������1c�w:��e���m�&��HO]�W������w����d��Z"���b�:���#y�K��H��k�m�mw��d��u��^A27�A����<4���f�O���<B+��<�y��� ������s]�� ��$���}�G*6��r��{���h��]��C���oZ���O
�Nf�
�,�����R�u�#0p}_�R�a��#��
��R��:��lf������p��G&(�� v��(�}�^v�4�I�]���%���5���]�oV:/�H����8k�No$����8���7��Vn#-M����
C��iy�J7v�n�@��P@V���U{���O����6k�va��N.�=����b�����[�g����W���
]	��.����5�\3c��?[����i
��%��O4�H}5F�R�8���O��J������k�����(�z��S>����+�#S7���������v�_*�Vz��*�C�����w]�H����
$�u�#��R�;(O�K~��#�\�2,������(0TI��b��d(0��\�����A�-[�"R�����cd�����>P�{�u����Dau]�&f�DiCX�zQ
�.z�>4$���c��#�L
��U�\/���]�8����g��j�x��$JG����z�9������8��{QjSC��F�p��C���5�}�@�|�,�W_G�~V(6���'g�����'*�i���M�Z�w���I����v]�H��S��p�IQ��C�����.y$�LZn���C�G����J���S�X�����dq�����?�-$�H4M�.��=�v��,M^3#)�}H8�t|�G��R,�?���:�[�N�� x\G�f�X�M��V��|�:�`n���f%{����l8�X��R�����C������]�+:���ox�ft������>gS��u?G�JGv��~4�m�r�J��9gQ����n������.z���;8��z��-�p48�j.zd,�������m��U��=�7�2
��N�H#��!�����k:�����������%�Z]s��j�>:�������47z!6`�����}_�HSwW0a0���k�����B,��Q�)s7��=��,A�=2Tc_q=��=��e��r����/���>��SK�~�G�"��M��}�#��@��?�bl�0/���T��G����;
��[=l�����0�������{��
`=>������Z��t�B����o�L�����zs%m��F�����l��D�h��=���+��hn��j��`J~8��W��l�T��g�b���=\M��<�����#���,Z�
�� ���C�AK4�P�{��]���ou�`��J��$���[8';.W��N��F?	�H5��&z���{:����$8j��z'z��� 66��t�����������i��&���c�rw����j!�����i����Xt����=��x���������;�����4=���]:���!%��%���n��)yd���&k�
�~p����*'o�{w�GZ���c�c�O��}9���y%�FY����luCa(����"{�	[
_�	��ou"|���k&[�#�������m7����,�9�
K<6�rt����:r���J��r���������mP��k��h_�;���_�
���r�^�M��O8��@���5z3���K�+�U`!�i1�N����E�8��q�N�;|�"wk���aQG��E5$���[=r�0��&~m�������y���ND��rw��Bi�iR_L��OV���#,a�q��X�A�����^�-av�+�K�^dU��s�FR
��8��N75��+x���� ��#x���+�@'��������'�4�Ax�������Z�W��$�����q�W]���]��jl��4)L����������
�x����
;-�[���S�>L��$�D����f�k��;�EJuY2��Q�����x+Z�kZ�����`@��L���%y$�cK���Wh�{�r���h0����;U�9�k���2w�<��?q'���k���7xa! ��]~k����U�j�T�{%3��:�1y��
Mh~�UU����Dm�M�FX>t��.h����rw�Di�7}:���r��Oq�����r���n ��s_�;V������[)�AC�~��U�#�i#_k�J��0��`H���Y��:�1�s���#r{(�R	���qT�ud��~��f���U%J%\�d2[��#&GV�n��};��i�!($�����:�a�;Rs��ju��=n��\����6������&G�������gnPv�37�����,�����p5E=#�f��������<�AK��_���M}BG�v�Z�@�QX���m�7<�0P�	�A���=R�����
|����]��^��=�R���'s_y�5d����B����tV��Hc�������KpG�'u�&�k�P�V��mC{�����y�!�n�;�_#������*[���OO��]��#j���~o#�kBA����	��A���azFrk�1b��Hh��C��P������������:�E�c-��7�
^:�\�������\��k>s|1�`:r��Z6���F�������t������NH����'n��=����E_�'{FfY�;��2��<k����(J�K��#e?s[k�y�:�){���%���>��d�����N���r���7��t
L����s��:V��ld�	5�[�<tTl72�6��f�>�:�t���5�H����ybhu|+���1������P+�0����_X���j��	
{FGoY��7|�����m�H-��OfA9��I��F�o���:�\,�,8��[�����N�HU�3�%��#_�����a�Y�&G������x��=FZ��y�v�����Z���ybh)�d�
��F�g-�m�����G������+���/;�������E��Q��$����OmK����55��8������g-��P
����rt�����cc�M�H�����Q����Wn$��P�/;��W�N��J�_���h��r���5{�J�H��J�A���u%�#�Pxo��X^�!Q�i<�w�>�:R���P��1�^���VXMj}{F&u�����><#�\SP�4���		���������u�9X���	����[a�F|����	��qa]��u���_�/G�^�PXc
������upfE�����~u:����O�T9���^���X?Y��`n�	��3�0E����������:��k��	�x��Kal(��IRG:	Z��b�k/u��x��k���$u�z����O�`G���%R�N���$t$����N��n5��:@r���T8�3
kz���w�~�
���CA��l�"t��T�5qt��V&����<��9�Z�$�	9[�k&U��������m���I������W/��1�k7���<��)�o�V�l��3�E�I>#�,^���|����N�S��g�U�1i��
���so)m�\]��9�nB�v<����������V	7*
ad���z�#��rY���i����/�D
}���]HT���Fo6�,��#jc���_S1��N����hl����@~���!����WGW�:5�:R7�#ev��g>�=��oY������Q0�n0�<#��,���"&JW�������7������A/_�a9<�^�^���R����%�4m����T���"��4^����t�#���RP�61i_���0<���^�H�^��}5�82*������iMv��=]�����}�pf���ghNI�%)�X����i��%�r�#\[��|�.��:������t���z������� �
�����-����L��FoC����O�uQ�m��8�e%
A��ta���i�5���I�GV���4n$�����b��KX�����{k���n�_���t]������j�,��Fy�g�q�"��SV�,G����6S6���e���0�A�G�v�8X`5wY�f��c��4���:~4��v��Ge�.~�E�|*CP���ghz�=��i��
�z�q$�P�N��34���c�^�C���9MS���P<�*���	5G��k�1���������z>#o����UZR���#e�h�������RM�H�*�����K�����g�^$�e�����?J����������

a5>b����T��f��5���=j�uME�LSKT=b�Z\���#���k��
{���G�8��}Ov�U�W����4�z���{�l����O\-��g�����l=h��GR��nY��-��
������:���@���/�xDs����t�#�1��������8�x�BX�>�ucI��qU�����[d��_��g����M"����{t��Q�X���O,.���)^��5�����<)w����E���9�:��2��*��NG�+�;_��i���50$�`�
�����FO�jv��������pv�������`k��+��x�P���]w�,�A�^l���]�A��d�i���I��z�u�?����V��'o�����s�����1�y
��nH���
:������A�"\�[y��N��B�{�fmUI�?�Dj�mG�~�����N�E��V^���8�v������BR��c���\8�~���%t���u��:Yt�kG�q6�xTk���f���vW�����#�7q�@����x���n[�A
0=�k�8RX��.D
�z�^ReU5���������3�[���4e�1���zCI��_�O\5kD"1)��~qc�1]�����4�i:ft[p����9�]
2cC1�VsO;R�h�|��r�����jb�<�������������Pq�o�[v��j'��tIF[X�4���l�X��Sc�<C3���,�"
94�p�gO�Qt_�L����mb;�]yr`��T�[x������M���?P"n4]j�v���:1V�9nK���M�X���?���A��{�p�),��Q�7���������=��g�5��
#/		.m}3�
��&{C��Q!~����:2'�h>��(R`��}��H�`� �6q�������<9��T�[T:!�����^�m��P���_�e�b���g���jw��~pH�G��sn���QT�?���D�gN�?��v����W�>�F!U�Rw��m��P�����6�rGp���P�xC�_�Jo�����qs��P�����sG����'�z!�xc_'�d����W
`��T�k�)�*���'�����e��z�����{�0���=�!��f<�U/�p�����Y�T�JC��Z���78����R�������6S����Y}3�x�4����C�]������J�N�Y��~�v���15��R`�J�w����^�;6~���*�m�����KQ�U1y��=5���K.$�G��S�r[#�����>�f��)x�	���K�n����H���� ���#�����>@�@K��1�Xo7��Jy�i��F�RqM
��o������)�����d�f��W<�k��
��:��k�&����������<�d��i���y�(]��_��E������m�*a�A�o����.1��)���"�v��}C9^��oX���=\�pC��_�Q�GU����&I����HO98�7���
���H�S��NtR�A4��o��\�(��M��W����AmG��-u�g"�N����94�A�FG���.u���f�T���
���K����k"?���-��{��]���Ab<J�,z<�-��4��>B���(��
_U�_���=���f��d$l=e�RZ����C�~dn��Z)�P��#�~T�Y'\����^�~Dnw%��5u1]���rCe�����tLyk�Q�k�� �3y�W<�9Q��q���PM�"P����N�u��������p�^�VVv��*���g�,�`����
G���#o�'�Q�o/��P�Y?��3���t�d���q{v��gs\���h*�
�=�#�[��Z
A�z2G�N[��X�]
�3y��D�m[u���svX�*(;Ew�==�H����1U���� �(��i/��C]{�?o*1������t���kfM��&Vt�~�\h���<���:=����RG&�x����U�t��d�D�>y+�V�>�+��|bW��[}�=ch�`T�������9DV�&)E�Z�����*no�Z�e��_*�2=U��/������y�kPY<6H���
�d���Kv�s@q��{�������o5��G�^��l��h
o�X�>$��$0�[9R���[�PN��n)�Q�WTJN���x���#d���k�������Wg���=���yd��
�%]W>��~dl��Q����d���+h%�Z�����=����j����+�yb�����R0'd��I�	�oa����&��:�Z/��4l�ll�����X���=��l��&q%,>#�N��&P��oS�����Y
_������W��'���;�~��5���}bN8�����J���
b��!o���k����^��xH����F�ujwtrwn��G�^�pY!�X����z�w�D���b�NZv �U�d���#\���.?�"�*H?���VN���O����q�{NX������!�V{�t�<�PT�iO�m��=�#M��g=?��2����+;p��KF�&)Z���u��.��:�h�>#C��2G:��������u'Ga����G�o�5�����9RP�?`9R��3�2G&������@�S���|���L�z ���k5ZO���b�Mp8�S�ve�~yS�=�#����=���pdb�K�������d���������Y�j�� ��zYy��U!*�����'s��+[��v���(��
����;yF��`�<I�O�)��0�{2G&����Y�V�\mM��tm{f�V�9�,z��6�J�'�3��Pt��:��X��9�l���x'�/�#�qX��>�a�	%s��	z������B�t������d��t�o�}���
-W��F���d�����<��;(��������ll7����W����llS^���^T\��L�U�Z��X���H�'7I#�i����Xj��O��������$F����xD	�1`�����s�,�SX��#e�����M�����=KEu��2G��V���j�`�e��i���F�e��fB��;��iG�^Q��V=����09�Y��K�f�����:��?92�DP������q�����|�:AO���]_���s3=D���j�Z�*��	%"��xF/��(�X��9���
����-{,j�BG��A_L!�Os�qm������m*��^�H����id��_T��^�6R��q�#���6s-�:�����O������'��jFG��]��F5Gw\4�W���Sm��#G�~�����K������9���L�����k����>�L1��
�Du�"1�����BG��Q%R$�;�HZ�d[i���#L@Ka���v�#��z����q�#������0���#���q�i1
�v�����t�#]�CbaDG���V52-;
w�G��qz�z�zP�G*��@Ot0�Py6������;�����������M��q~2���$�BOua-��Q���H�Eowj�#�����������v��#Jl���z���8���m|�
2�8��	Y�7H�_�����T�3�G&j��W�K��^a�_����H����������Kv�CI�W=�!A�R_�du����61r!=3�Lz�z<��g=o�����*+�U���]�
|���l��}|��(����f=���l���H���m�~=^@;��Z����Cbcn����O17��z�2.s$����Tf�:�=;�����kA�&=�&)�E���L������A�e��M+<���������J�u�?U
�#�#%��J�� ���h�	�j����#������]�H0���������4F;�l����j�0]����D9Z�=#s}sm1H*�~�������Tu�w��9���p�1��9R��M,��q�\�p8�m���H/v�
^l�,B��t�#�|�g�E�2G�'�����C�����\	��b�pr�&���U0c�
gp��g�i��o�������F�@�}��l��%8�,���cY
����/?7K��q��AH���5�`!����d��2��	mb�ETl�xU��jG����h�/��EF�{wk���%{GFE�RQ���G����l����f��T!s�zd,��-'i��YL�>K�v�V!@2����(	WA�W���BG~�'��0�bb�2�	�����XGcG��k��C���\���mZ��.M#v�:�G
���vz�jI�Q������.�q�lP/�i���#U�A�*�����_�1�X
�T�C��F
U��<[*�U�g��+�@������D66D��}F�Z�#��$�Rw�i��JX�]�HF��A�
~�%��vP�g�
�^J���<��T��g<��k���`4,r�2x@iG��m�^y1-�)z�li��l��e�x>r�47b��H��+K�H���������k��":��bh���E��:��n���Z�h�^3��;z�a!�v�#�W�`
��]�H�� �{
�F��Y�b�W
�B5.G-��#u���3jY���u�jG��~���m��/<�#t[�JD������[����K3�y�}�R�n�Y�|����g���'d���������:��G�SZ7����L^3'Y0n�,�#?����F�uK�H�Z-���k�{M�vD��ltOo�9��v���k�v�����S�mk]�m��va)v��Q���B]�S��Ek���G<��f(�Gd�����T�G�'�jF��PZ$�0��:f�8�t��c�w~�#e0� |��$��JN�����Z�����L�{N�vK�vUX�v<3~��c�n�&����G�<���M��k�5����:���I��;bNHE}��f�����a�@���o�7*hl�����
�J���j���^������H����%$s���V�Q��*�i8
F���[4c���lS��]6��p��)�����MF��EZ����3��'��[Z����a�i^�6'�}�>��n?�?l�C��V��wL�K�-�k�u{W��������>� U�
R���wd��+6u�����u�����q�j����is�*��i���pS�nF]�������=��)Z��1�}{EGbmx�u�V��\�H'���yCQ��Fb�^@XW����VGS��pQ#����T�
���F;������N��O{Y�{��bWO����}��;�B��T����u��$�p�^����g�q�g�H[(�r(B Z�����+e��<P�U�*�<�wr;i���p���b�(���=6�0�i������s�-i5IGQe[�f�F��`�55�3��A���G��7}���5����`����F�M�lO�H��lAvC��Fm��m�f������k��)mG�'(b�ybh,�g������������th�#l?_����N��0r�/Tu��m��5�����3^�LF��?������j��@�w���h<O��U�������t��(��/�#���i@�T|g	qu�,��������s\Es��K�Q{�*��kv���h�s<�J����~$�9;"��Q�k�L4����V�'i��1?�ne�m�Q��5}^�,�3�����\��3h���'��R��W41�jQ�_�X�yQ��9{n���q�u����h��8W�>���W�(�������Y�����h^�	)4�n4d�,�#���&P�����\�����':���U�A�A�������gJ�X���(��-�#m�8pa���w�8.Q�y������/��O�H�x
��'q���vb��u�����Z[5G�\���I1���8D�j��$��/^I<
$�G�^������(���%D�'�@u:I��wl�'��z�����:J�p���%q��(' �yGB�W�(~�.q���C:\S?�#_?��a�u����z��u�S�@g��{.f�����xU���]w#_LY_��^y$���f,]��B����d5�3���R�^
b�����!�(�+���(|O������������P1p���i���@�����=%�	����?��.�#V�`�}�z�SI980��/&{�&���^~�#�k��q���3��A�+���!�G�^Q��:S� ����z6G:�����~d�������l��|���Y"��8���Z�K2z�����l�l�8aj�-(���=q&��F��=e�������	�3e���9�`uH��O$�TQ��)[���|�$��1�}����������� �o�`�^�z�?Ei��_o��C���j;�Z���xG�������Q�'���O�(�#q��~�G���S�=�n����#Y��H�R,���W�}���E_[<��!H
�����^mp6�_l����]z-��,��W�IR����&r����N�2� )�'
A2�$@�*2Xh��v��f���C�5�IsB=b��q7�m{���p�u���5%�����D]t5m��������}3G&��
"~u��<�&UN�n���M���,��*���P�^����l�R�t����yM����0Cq���{���x~��Z�'�-�����(q��6��ti�<RCV�0�Z	����������D��#�N�C)K���`u����|���D]��{���_���Ls��N�_jQ�V��Z3��6b.���p��xD1����{y��9���rE2�{^��D4���*�3x���b�q)��!.5s4�����s���U�mb<}�\���o�r������]C���C�r;�Q1�2���.���8����:$]2�7r�U_��@Nr���B()��T�4��8�5�m�M.�fH�0�/8��3�^��B��J�3@k���wLU����ZW��qe���R-�*����*9W�=� a'��q5�))H��/|�	��g	;�!Q�?8�_�S����%
tn��������~jl��5�2&6��N��#8b����2(��� 821Eq�S�#8b��p�T#B?��M�N���*d-������1&�'�!���j��h������6�4:��z
w��W�#�;�|\�A]����H�h����7����4�2g�z��$��I���j����J`�n�M��n����6A�!�g�>f���R�' "���u��_���+���}L�o��ov��lW�M����8�n��Gq$�~�(x31��9�N]�F��8bl5��tNfHg7rJ�V6���<�ee���g�>�I��
��9f�>�\
c	b�~c����#b0(&��H]����J�;��t�
��Uq�	v��K[��Qd�K4w]`_��#�.]8�b�{�Sx����tw�Ss/%�����U}�����U_e�X�tH�n�L��]}�i�:���|�!�+~h�������>���d}p}���W�#���|2A�4�����e��R�j ��}��
��1g$}HP$�Gr��&�A6Z��9��1[�W�1Cg	g�?fR�Wy���-DG���!���7�Rlp�5��*k����Z��,�E�����,�q�8�#k ��%%\5�����G�*r���������w�Ard����T1���|�F���Q|';�zk��Pp�1�x+!��,���+�J�K�>�����xA~Pc:��2���}r���
SU��j�>�#�( T����P������Z��
��|��=�w����h�
7q+��4"i�[�S]�u4G�BEI����h�d5����j��^w/m������}��x(aJi��<-���������_Gp$�P84ALc=�p�b�
?��g��v���;P����
e�l[�f�L�B�L(k�V�X��Xw���4	g]�\I&�*/X��Zu�6��*(�
��!��Q��R��x�G�f�t��;C�c�]���5(��V�.�b�����@�@���	H@�J���Gt�~�l�}Q�K�R���
�,���A����>�#�
�{-U%�tU�_�9�M�DK��gN�m,���[����Pi�Hh�u�jr���=.����)�]�H�E�iVhk�9�b�8��k�������g%H8����p�����4G*v����h�X�>u)5hp�g|����������������5��I������^-�	��F�7��������b	X�W�����*+f(+V��V������c��!�����I��]&����T���92oJ�b�4G(c@u���GiA/���J��h�X1���0 �����n�o��P�8�=M�Nm�������
��ir�8``e����%
Hc��PP���i���~B�6d���j	��W!z����!q�������
�����5���5�]��_K�����o�OCIg���W��W}.=���F���|�4n�XG8swV��M��5_rQ�^����EC�F+�+�V=!C+x�(U��N�P�V�ao;�$���n+c8;�
m����#:9��~h�;9���)(��Zx4'�
��F}x���u���J���;}LGs��(�����>��Eh��,G����5��9[}3��K@�S4���� o�9�Q�-����@wG
L
�zW�E�	�Z[��HX�� .N��Z�.�C
��P�[�9�����k����{��A��Gf��������|�r/����r;�y�\b�~S1~&�
��M�K�������v��c��(�0�[q�J;�� ����v����-
*J�^GS�$O�h���+�z����7p�z!T�����DGZ�P1�S�-������w<�p/*77Rh*~k��Z3�� }����H�Ju�q{LA-1Uoe�5��0e��9��'�$�2J+����S�IQ�\iUN�N������`N��mG������Zo,���pw�����9�5����o�����a����vx�z@};XB��n�=�l;<���5Cp�C����E���#�d�P�ae�hynR%�[��8�i��A��%h�������R�AO���	������H��C������1�G�$�r����z�@'V���b�����	��c�g1����G��
f_C�j��q�����R�vP��������s�������vMC��
.u�4g�!8b���F!Xi��#M�m
�7a�}G���o�2��BkxDE��"S����}Gb>��������9RYI$�3$�'�/����n}6�p6re�f�x��g7�v<�O�p��2��b(�.iq����M���m���Z�����?�#cc
/��r�����+�?8������Tp��;=}�d��p����n���h��W3�]�1�&��P�l��%���!8�;��20�>�^�+����FBP����nC�;)l��^�m[u%TCx��)��XxJ����7�y��q�u{�U���`����������UNK�����P�T��M�d��H��y
V
�2�F*�����Go�w��O�5����7R7Yx��]�z#�a��&X���7�p��AN�����9q��<z#���Vz
]�	����
�Q��V�un�����e��lI����^�}�dk�p�\���&��W%<m���I��5)��J�q�j�:5%��l�������l���J8����C�s�&�u����b���g�FBg�/x++�	�����J*�F���w�����^VN�[|�:]��^��*f����H���ak���7��f06�0������++4��yd�7��y�x���-��l��hZ'��H�&��KIX����y���xB��D�?�m�fvS���f����N�S�+�+���/��0�Aw]%#n�Q�Xf������/�
�H�L_9��2G�A6!����e��_6�0m������k���lx/E�/��������Q�7k����,�8��c�\�W�Z�	��Jj!���#���+1����:`�gB�/�K�����V��Z�O����l��0�~�K�g���U�F%�W��n�W��\�T�f��H���a������Xu�	'���������~&��O��C� �ZKN���P���
51���V��k*���RP��k���dc�LJE`��!���a�T���tG8�W�J��������_���Cod����a��|s��IZ��
?o-�&^h��-h��%i�9X�Zdh��������$/���)�����p��fr�z�Yo2a`�7�<z#�����?G�{o�+lm�*cq&G���P]�	�y������Q�4��E���(��J�_�:�����}e���/=b#u�0;�f��|����^O��W�I_�Q�&Nq5}LA�No��������XN(B��|i�x-�������3��:��A�=
'�J�]M}��W\������d'�1��oK_jp�S��������`_�h�HD�x���46�T�E�4���P�%��H���#p4�]���9�#�(r�w����nl�*t2����Xy��t$G�6�H�����Kc�%���z������r�S��OO�l�N�mlo�^�E���@jBVjq��x�
4!n��^AO�[�6��q��O�`�Po�8O���u]��X1�p�P�ny�`�����k��QQ]������� 7���E��=r#S�'��A}r#�f����T������%�����p�|L�7���QA�����l��5�����R��B��
��lo��I�U��6�u��2K%���u�*0Y$�:F����x���JD3r��!��4�.P&����If���2ej88�={Cg����F����{L�(���#7�n���!2�*���B�|��#7b����[�W�K}��4	�
��4��&�mpR����j
)�9<���n���������'s�F���)���[
�#/B	�:�E���d�q#T�������q�ZNV�!�4a0�i]���6b��Jp#$����H�p�]��Nl�Gm�M��������9��m�����[�6e
=U���ki�R�\��Ht�~���!��Sc���t�6M� N���_�h�]�E��v��z(o(C��o����85���6����4HM������>"Z�� ��6R^�,��Q���������Z0�@�:���6b��z�(��Q�]�\�����MY�r `�#}�������w��o�J93��
P�m�����+�kec�F�����_�����io�R��k'}�����B����������k��Cm�\iT���8��uiD�d��H�z�j#�.��_5��9�F����~5�k����U���h����]J�(�`��������9e<�h�74����b�kfG��}�R�OD~�|�CV�vD~��n�k�Zd����VD~�5c����x|��@����}���-����K�l��d���������K����p�U#�jCv&�����Y��=�hT�Z�FuR�dq�V���6B��]�M������9���Y��Hw��u�i`����k�O��%1!�3`Z�:1�������R�-��!b����S;�%a���O|�h$t.=)��/dp���s���.��j.I�qnm2~ui=�T7�g ��S3?�����i i2�������!l���������V��t�Z��6c{	7-CJY�)fv��e��W�.�
��:����i�\�!���C=R#�/��nZ)�������6
z���J&��t�g�����U�z����]����
��9r#m�vSdA������V���i�/�'����l#����Q��k���(������R���J�����c�W��H�Y���O6~4ys�Q�0��2
�C;�4MOR9��b^��yp�Ff�Ty��v-}�$$%
'�������~��0�~��G>j#f�_
�%>����[��X�xL1���6U3eZ]��d��"W�����9��6PH�'��r(k_��];��>�4Bi��$d
��u_�k�s���:N(u��
�&�RvWkJ7�v�����Ui7�W�P�1�*���
���'_�gP0��H*���(��*i�����1��/��W������V��^l�V�������ZQ{�5��v�N~J��A�k&�RgT�=���ueW�.����S���&����h�r#+��������� te��0�UI�u[������w��r��I���:�R�
6�l��
��k�����P�V����H+��H�;����L
<���������n���Bgr}-��JJ!%Ei}:w|-�(2g0��ch'�<�Z�NZ�zX���r7�EO�����,���
��J�Zv��L��i���_����A�7%de��`o�\!�Wz�FZB_��T�	0�Z��f�|��>����C���!7R(�� ��qr��Z3j����������T@�����$�-f�0�\�8�}�H���Hb��?����w_~����_>��/�?}������/_~��O�����������������?~j����������������_~|��w?���7����/�������|������p��U�����7?�������|�����zo7�����:w�������}��_����������+��us�������w������~�����m������w��+���o�gq��m�s��y�������_�o��������������}��7�����e+��R~�������]����?_������}�����������?��Zw=����?��//��O�?���r=��~���_������}��o�+�����y���������w_����^�}��o�y��7�������������4����k���_��k����_���z���w��?X8�2:�C�d���}����_������C//�����?�����|���D�����_���m/��O�O~������`�����|�v��������>�������=^���_�_>�������������=�J���������ow������q������9���/����7����/�_������x����_}Z//��/��������7��t}�}��u3�������������?���������������O_�������?��/�����_�������?��������������|�/���~����SR����y��^��=����~����������p��������[4}n���?~����y���%�����<>�����a��������T��v�x�w�{��+�w������b���%=�������J(?^������������������������=�{�["������,���w_����?�����������s@Gw�B]=�TA��������t��A����@���L
�W8A���Sj��?������,k4��7���}��
K�.ln��.3�M�������N����q�w`{+�-c[]���_{/(������������x���M�:�������~�o�n"��~����x����~*���V����~�7[G%y�W}.*����Qn����<��:g����Zx�u����1�~�\�Knl��WA:�^����C�~}�\e�����TF/Z���;��lh����F�[�����u���I�R�3��"�8{?u���&�~��LpJ���&P5�e�H�] ��2~������z�xH�BO��0�\���s���������X�y:�B>������:����A�^��5���(�����M��Fb|"�
U�;���[\?��a�����cx�����R��)`�7J������)L6���V�_]��>����4���c� ������[%��;:!�V �n������z�������4C�I�\��e�>3�,e�ySt��u�v���wDi<�}Uzg����CF]]$/y/tv�>e�=Y���^g\g��"�U��#����7��u������/U�wo��1���	�`2��?���8��U�3db��gJ�	��1�O'|�E���t'�;��T�)�;\3�����!�H��k^�'^@���i��V����J�2<!#��3�x)P���%D$�5�3Yg�R�$�%����Z���'Z�q�����L�.)�Ul���������KY������*���R'&b2�@�Z�x���uq|GU90/�1/�!E�w���w4�S���KI��@T|-�����Cn�O9��~D�7?�$B��UI3\��4���9�r�-�1�oW�K��W��e�����\��H���B#���LU��yy��|.�.���%��f�n��zw1V���v�7-��QT@Y�|oY��a/��1������"�$d_�v��u��d���{�f�U�CHD�l�U'<��S%6�d]��(l��}&�Gr���� �%�o��A�ZO�����w���s�I@�T&�X!p���D�i���f\���\U�c�f���DLJDn!���aj(E���7�"��w�]���1v3�����(�>��A�6�n��F���E8%1�T�U�&K:Mb��xB��q�Y	}g��������&���\YN1l�������	�'fm�I��P�2�G���cr���Ih�������I�2r��:�����-��8�l���u�W�[���|�Z���������^%���}��<��I�6c��/��qI����Y5���z,�nI�������&w�[u�wa���p,��&�j����K���sLL��z9����	G~W�x�c�!v�jfI��g��r�dc�fu��B����*�����@�U���0KB[c������������t���7���,����5��A#���tc
����*��W���A���������z_8$*�����>�o�]�C��s�z����5��>c>��\���I� XU���7���F5�z���_�
4C3�3f�G~��1��6��9�U@?3U)�����z���d]����!;���Q��o[�
@����V�B�����-��v�����
����c��]<~�������� k��U|�Qb�Q�n��"�����5�,�cU�i��c��W>9�#����E�(���W���E$����k�0�E"��
����� %|��\�N�1k����|���d�/X���'�R�T�2Uo�Up��Jz�Y���UU���9������r%�w�v��M�o��<��a59�"��l�v�f�AQ����7�}��s0W
�����Z�p0�:�������E�4��$��g�����DN�oUG���z���-`b�+��^P���A��X����s��u�g-��|�v�����n��u�����cb�J�g������(u���e|(�X��-x�����2Y`^�p�x��@��"#1�� ���nC���g�#X$�Da!�^HKJn�Syo��e���"����/�9�/������u8p���T�S�Q1�rU1��KE�n:���f�?�����H4��S@oG���w�L��41����o�?��_I.+�@�������k�0o���!u��S���G�\�G���8wR`u?�����X����l��v`��!�z�;?�6��tW��[Cn�vXkk�Z��YCnEA���c��z��9�i���,�:'�z���UN8{'�V4�*�po�<T��E��+��^��%$�L��=��Z�����S�w_��4V��V������c���2#_���
���U�2�������]�h`�f����us�`�,W61��W�����{�95�)9rN��:n���bt9�������	�]4ux���4���U0��BV}�����JtY�+:���:d�XkF���!_o
���.��r�1�k��<��"�Z������S� g��$g�9�@����v��c��t��I����1�
����~��!��>*v����8*q���(s��8[�V7!}�,;��
���,i��	��[o��E'����9�!��n��YqyO���4���#�V@O3�����#g���{F<(a��'�����cq�:����P��G����"��.�j��2<�!�������l:��
���9c�6����1�UM��ikerf�m|�S?����`@7��LK���v+BO�-Q�S����h��3 �!��-HU���B?[-������4=:�<ZG�m�W��f�������=6��k]�����ZJ�/}#���Z��1���w���"�XF���O��4�Y�W#����)RZ�N���q��:��85
�&;~��4��������(���@��@�|�I(�L �N��R_�8���.;Ztj5R���52��L������]k/<Y����w�52:��0J`����i�c���B0
���4��c�ui��k��\5�9���lQ#j�c �
����Y��`���M5*��]�= U�w;�@��]��=<�s�I7����3oa��^�s����k��\�A%��; ]5!r,��$����9v���O�����n+��
��d%��NH���C~B��$�����!���>q�z2Pesk����_y>���u��c�V;E��8��MN��	7����c�4}��P
)��?�R�8��~���i�jb��@�����*�(�����h��C2%�R���|�Q����0H����}r��c���!�%�<���`�]w�@�G�1>0\�0����(�\�,��+�.W����>���[5���'$������0����4A�O����>i��|���_��_h�_��:��mJG8���L��e&�g��1��j�������+��9s���G���������1��)�����pH��O���^��T�����a�U1�|3�h��/����^���x������/�������`�5)�+
��X_����>jp)�Iu|���^�N������qB�i3��	�����=��*O��V��zd)*������q�B��v2����!LQ8��d5�.VC��g�11���5�)
��X39W���h��i7��DN�S�0t�Am��0"�*
��Py
;@��V�*���0_F�\5��!P�����~$}Gqy���4�L�5r%T�eb#g9�Q18��2�	��(��s�:z��|�/��x�`(&U�p|����SX0�P_���I#G��f`*B��:�2�zE�����	ta�EP��.��*��=�����`U��:�\s��b�h�����)
m��B����0Quy��Q�E
�������sD��`��	�CnE�b�MY[�*�T���m�R�TN����1XV`g�C��D|r�^q����*z�]�=�p�NA�Oy-5t+����sB�"�?��P?�P�HT+T:����0�q
%���j��xz��C������r�U��_),!���n'����/��G���=��-xd���}��fq���a����������#��e���6�;|�����Yl�I�=r������J���P�(�X����2��	�FY�c
��9�����X�5-�o�Ou��dq��DNHY��V�Z���Ee�g~��%�+O{LJ����B��M�d���R�����Z�sG�YT�qg����TC�"s�(hh
�jT�LV
%�x��Os���Zn��| �������~�H�5`�s��:���
mB���G���:8��[����
�`���YL��z�`R��f
�r�Rd&[H7�����F��*�\=�w-)�����b!�&2�=\;���=��FC��!��9�G�Ia~���h�L�Frs
��E��D���VG�_����`�V������1Jj�s�U�Nwd������B.H���%���(��;���3 ����F��-=�l��$=�����%I��&-���s�Anq	?����b�����o�!��A�
��V�����7m0j��_p��F�dS�SD\Z�\L�/%����C����+:�<'�8�p�>��1s�a�6i�saJ�O��y���n�
����^!/�F_Jkya)k�.z`������c������--<�@����JI�C��4�	��)XG;2��]����a����g?H��r`i�!�����U�����;�97V	����aG��@?��'k����yl�:m�B��~E0���fZp��:�Z?��>��sQ���k�����x��&��:��@����|�s1H������c�3�SV��zEG�D3(���Te58�V���!�[�,�1�U)�M��r���M��~��!_�&2g���-0�N��E3(M��!/{1����.�����dAi#��(��_U
��
�K�:��V������+�gWT=���pER��<����)��
���sJ�k����N�r�d�zE�9��^a�b���a��������y��	J�Gy�A8	T:8�f����\[([u�d���	��r���C���4�C���@�
�U����+�9��8�|eM0W���.��S22�j�%�9���mV93�Ug;R��l�>��.��%@�M*���O��U����m�pq�W>���s0������E��������:�u�\��9����hW������!g��B��Q�e���'o�!O�r�R������
$�I���![N�,���;����#t����q�C��\������a��54���SxV�1�a�J�H���i�1��3�C&���������UW����#�t�tG���-t.Vk�d�V���<�X��Ys�����@1�+�K��u��9�pB�b)da�#a��`$���3y��CF��B�G|V��Cn7�|��[N`��'p�����M����7��$��3��j�5��Gb3hL��J��9�\�����*���,�
���u�b�ot���o��F���#r��A%�n`���n!GwU5�mN*���?	������������u���:�;��c�\LN���&����sa%��b�g��2S���uhM:�:���,�-A�C6u-rw�����p
�}�b0Z�����5�BC��!�M���O�����l@E������sn�2:��Lw��M`<d�%`Wd���\63@Q�a�wJ�\f�I���3��59]
�=��E.�:f���\1G�I�]a�����H1O Ho5�s��	��;��U}'-~D����9��MvD>���.�y��c����T�p��7�������
�|w�:rq��������!�
bO-@�$����0,Ig����u�w��A�:�j����L>taD��sQHD��:�V�O{$�C��5d�%8U���sa�'���H����Q��V��G�	���)�F�[_Gh%/��H/C������K((�1�y!A�*zh%'v<������C+<xn:����C+yM��f����!W�>4���6���F�sf��;5v�;�C��J%���i��V���P��x�TH����H\H
z9\���hl�
�
����6i�r�W���u�����5��e�7���Y�j�c�����3���W���f?���YW�?~�����>da�V�b��+,St+w�����>����B�#�k�����c����V8��1kwy�	ej$��m�9���e����^1��VF����|�!r���O�m��<[
�3�C��dh]@�Cc�OvP��y�`w��"������#��U��=�L��c��������C��
�"Y����pE]����u�G����*�88��'��)����1��*�A����3���<���p����[3u����!�5p~�L�z`�C���>\�r�=:�������J�u/��!7H��?�P�V3�Vr���A�ta�[5�{��B���!7XU3��L?d���mx����z/���r7��"��v��p��r�)+Y�=���WD��,;���=��w��V0�PC�>����j!v���gDu���D�)���P��(��B*�&� F��l�����,���(r������xhG���l��2R��5��+z�l$_��`��D������#k�d���2���Ba�G���=*B��v��%!x�6�5�>���|(�S��$�J�I}�3�O6DQ25c�<U��t(�n�`�Q��m84�wci�c}�c�����x��Xa�c��$�t��)�d���IFH�KE�h�3B���L��6�S�Xj8����)�Q%_g[d��Z���]�%0B���y����{�c��BH���40�c��j;q�Rt���>�`�w��n�y~��;��s����}���0
DM�L�+95�,7
�����g���:�p,�z�5t1��F�B�p�&�1[m�BdV�"�]r�>r���X���iQ��N����75���RPZ��sWEw�r+�lv�����v��U^Ij��Xr���u��/d��c��q`K�j���{M��I�����V�M��a,���GhZ����a0!k�M��;tt����0#3�C0���ZU�jZ����,Z:FR��?K-�mx8]�h#4-�B)��7$���8?w����p,y4�4���X��������G���qKv*0��M����rSbx�3nu����!������}�Y���A��5���c��w��i@��Xj8�<�-��'\t�GN���:J���c�
���O(�en�b���#)��jI��P��c��MC�����j>��
MC?�=R�Z0�q4-T��# �D#4-nV��PG�����q	���
���L�Y
l�
�����P�uti&��T������p,y����Q:����1���G���t�.����p,����ORhZ�
�ma�+�|9SG���t�GhZl��/{�J#��T`�R������|:�n���K��}�	�|���XrT����
����^�G�&��p,yl�F
�t,�X������}���Xr���[��U���Xr+�?VP5�G���#�*RX
�<B�"�[�pc���3�6����x��ia�_�����8K�#��HK��W����E���H2B����p��"��M���I���d�%�D��rS��+��?��
��Z�9�<k�>YF?���	,�v2�5���X9*�oO���4�c���L�
�L��p,���\���%�6��:�����$8�w\�:/�:InI�� ���=B����0����������m��o$0Ai2SfL��$��BQH�#4-LN�h�����chZX�Z���:�����a1�:p*�������hF^�
v9�<�����E�����
e���� ��	,��>���1vD��G��l��B������%���E�HM"�<��������+i�E:���HS�*3C.P��V-���B��1�f�g}�b[H���ia�8�`x-����F���Kl�[������#�@,�1�_���mn���d��Ay*���ia�p	�D&���1���F�V�"!�v T�����M�1��%Oh8�<��E��d�`���"2�p`���y6(��i�
��r�!m}'%���j(��[�uC 4-*(E��W��yQ���V���1��t���!�n��sC6�7y8�d7*�.��Y�=9?N���)��CF���c�|�C�����K��
�b������������jD��������4C��`��4��L|�M-�DUT�r�������T�3��;w�(�7�������M��n-��pg`�k������^�yOx����z�����.2������!�����WW���
�	5@�e���rS�����[�-}z&�r{��i���MC��.i�2y9C������Au�������z\�t�����{��E�19����+7���������M���P>d��_}�u����9�!�zS`B�[��f?u�@��ZN��W%w�&�#l����BS?#��c:������\v��+�D=#6.�\	M�~s���`V�1��x
PY\c'�o����������������
b�H34-��V��n�\R���H	:s����\hZ� �����C�b���}�_�;ghZ��q�J3h���M�h����+&f9jg}����9������1�����:G��uN�i"@��y�8�9���u`��� VKx�������`��b���Mm�����E``�&r����yPM������o�N��@p���������y��&��[9��=����Ab���G�Y7y*=����M}>�:�sB���=�s������+1���E���c��<�M�rE��9G��MG���:�E�����C�@$;�pDL�����J��5�����Lv��0�wy�����u���;���j��e30d#1���G��!��A^��l�^����!�e3k��	_n:�\�@BG��"���c�������{q���i�n����u*q����XH���$�B�b����f�*��i�9[6y�<g9���$��f��IX�!�I�T�}��WeNA���2�ry��������ByS+h"���V`���6	��-\�����@�����o�Sy�Y�����\+0��1�
��
����<���^[��M*�����"O�C�o�����!��r���S@�zD��Y�iQyF�I*�.���Y����O����n���	�z����&��	\�!�<�?N����C�V!����e�}9��sd1y`��|9��*�9�;�$��esDi]����.�M�Ou�gUb�R\����j$rC���(wE�Z�-�c��&�t�zbc���
����PY��VF����f%!�T����S��:�BVz�=~j?�o]d3t�Z�z����.��h�mG�A�U�nu����(9�F�L��DW-��c���Hi
�g��P&�L��M��!w��zy���9v����T�dM��p|��r���MI���&����(%=���,V���y��%��.���i����Cw65,O�E�� �M�n9��H`�������
M���D�XUn|���Yj�a�}����Q����n��up�&]����M�#���/��w����J�Z�i�����yhZ�������z��>~V�!/
��E^
A�;�M�H�9�\�R������&p�r�Le�w�!��N��:t���7H�
E���������qyO��Q��H�e9���&���H�SF�t�7�Tm4V4-�8�+_GhZVV����VF�W:NHi=hZT
�pc��|4-6��6���V�eaZzRY��A��k�C�6��$���;���C.ME��
Ih�-��C]�S`�2��*��a4S��+����X��(�[�iQ5=V����6�����u���C�����ew���i�N����i.4-6g�9�_��F�[�(��B	�
��C�V�hBn���ry�u���l��!4-&%���
�M�B��d%�=���L>>�$-x��;TMq���9;�#[/���A�����r�~hF�Q�B�AD6{i]�4�
����J
�2����� ��DLJ6`����Q���D;��b)���:lx!G�"S��������i1Q��)��d�;��.Yz�p �_K6��������������
FP�����l0;-k�sv$�m%�f�qI��G��:�H���?vHZl�STsH�f���"z��9%����m��M��wHZ���t!f}�.tX�6�8�����P�H\H��)q�J�O�sz��a�vp~u;��1)���ip�8�����/�������I?Um�!i���v�F0dxf����e	�Ngg���C����S7ObY��3U��$�:Dm�C�-W���!�4�wHZ������9��u4
��7��b���b�<r��WnU����q3�m�I?	r������B�Q�.����#(wt*uI��A/#�U�C�b�I�a$�@�����P���P����zWn�r�'�=rfA�����pbx�Y���J����G���.�F�I�I���d��;$-VC����5�;���D��&>[������9[X9���V=r,�k�����pJf�sAdJ�}���ke^'���=.?cw�������V���:m�q~2�jE�R�n'�T-�]u!�!ia��~����C���<�5�^�s����T9+���"r�W�b���	��|���M��Cd��/�8���������#�1,�l���d�c�|m�vBK�:r^HV�nZZ;�<M�O'6;w+���l�V�;'��u8�<Kk���DN���M1qZCF�p���>�k
p���(�����G��q���%hCf������o�{����/�.����_E��&d�������fU�������{�J.KO�n�6�-O���6���-�I�E}:Z�#u���l�����bo�u$���z�>{����;rX�m��$��/�J^f��"Z�H���,��%J^7�e�	$�:��LgPGYh)��{����!��V���L���I�����:8|���!i�3&R�~VG��P`���T#r*��2�4��1�����2$q(�,��j8��:$�C��
R��C�\��9�����[�� x%��9 �t�:r&5�� 
�AGn7R�q�a_>
����+z������W�g�H��n"�6������lO5h���ns�$�V>��ao��a�gr��"�� �;,�L���,�eI��c ���	�a�a�pz��z�Q�(�!�:]���H����4l�	mH�����e��n&���/�U'���
7
D4����y������sry�JS�[M�te�7���h���)��O��@�2���>�N���@���{�����+#���Q;�;]Wz��N�tgM��f�4M%/]�=rJ���twc�}�=u��4��$������qX��X2Fq��q��.�� ����^%Op�xBG��Ov�P��p��3��Y�����)^[{�������dN�Ik
��$q���������sBX�v<!�����)�J��������0���/���iN?g��n�y/���f�X�	��D*6,���a^��!���0��P�����_�o%���m�}���=�����<�:��7W�
�D5��V�^��r�����7P;M=�'i�]+�tJMi����{Yh^t��d�������%�`�����0p���j<���!%:�tl�-��#\3��nE��K��SO�lc���2\���Aq���A!��+�����Sx��� �u����<f|7�4���5=������8�_������>�-������E�����W��s������rZFn�������k��'|'����l9�c�i��X����f�`0w�0:v��0/N
���
�M���r���J�!��+��>S^�)��6w��1������$����8u��F%�M���3����xj��W�����@AF��zY���="a"��F�DX�y�P�� F�����HSs|������Tj7�[	�<1��;�Z�:]���5C�5'`/Q��	%��,'}+C[]���O�6�Z��(����B����o���4���))E/&c;��1�9��
���Oi��p��gS*99"mG`.��P}H���\������<P�������SczTE�����k����g���Cj�r�mX4Q-��sdzSE��HG��	�@�wg������)�{(+��������&:�cvhpc��qdE�r�9���{����KG�����������(�W$�z��b�=���T�p-W
�jK
�����Lu��w[Z�����x��-2v���2���.�[
��)jA�R�#)�hw��<�2r���R�Y����9i�8�Ll4!�<o���#)��O�d%��%(�����\��w{�!N(?��#��$��deNEc�5����9=D�������������V��O�f�'n8rR�}�h��x�)rR��x'�#��ki�$���w
�]rN����u<-<p��)rR�����,��)r�Y��U9���[C"�b������P���X���Li���C�6�&bee����F3Y�*BY����I�Q��a��F$
��7���)��I
�NU�"��dz��j��V]�dS�CG+�R#�L=E���ty`i�n�i�������D^2wn}L�E�~5�e��"�}V���Cq�0 v�>NC�HnL�n�yo�ohnXb�
������k���>vUy@�R����'�&����
�\"�1��G�h(Hf�����R��r9���4�Ao����z����X��cv���I�lp���I��Q#-N�r�:ykn	��3��������v���S�k�V������1"_�r9��y���!�8���i��6��$v3>��dTZ
BR��qT�>�����9�kiD�N��sV��[u\�6�h���D�[D��������l�W�Hkc�����'���NY^�!����
Z(�c����"���hD8���6�+
�
��k�k�O�B5�
k<�#�\9�9�}�Q�9��_�-��8�ed�9io������G$Y�DrR'������00Z����zJ��x8����gL2$Q|p[�R��B?8����O�=�s��@;����EE�C�U�� ���3����gmj���G�����1��c����$+0*����p�Nq�b�+]�N:]���riD��i���m��L�63IOx���j(%������[��m\5"��j�`\c0<f��'��R���&�+���6_v������T��w��.�	F��@!�w:2s4m`�n|p�{X�/�t��o��7��~S��5C�j�c���C�c�}��~�	��9%�<v�8���pBQR&���K����MDP�VbB��+j��6v6��+r@��<	�xPQ&�l��w��>��g��<��Q�S�{iw1�����'c:����-)Dx���L��
�����@�(���j}A8�K� ��l�W0A �@����U9��}�$���n,4��	$v�*�ZY[�9�=�u�-����

@��Q]���J$�
,���T�����B�'p�"���-�Id��[�����zB-q��M��~��n�����o��OzV,p�A)���B-<��g�Z

����2|�8�Yo���+}[�t�nh���F.!�1�
�m�T	��I��zA��o������gs����O��s	��M���Rm�P��O������8�=6����������M������V��r��18:�z����c:a����q ����	M<�������+V�t��[A��Zv�P��4wm`�ee����,�+:m[P%t?L��>����R�������A�M���v��v���������������a�4$Nu1rqD{6zI�W��r	3���Le�+�g�T��Yk.����_v�DJ��[M
q��W�tv��E�8��6D�agg>!�0���$7.9��:q|�����!V����W��w�tIy����yk�'f����14C�qC����9o�g���f����� ����u��
�_G4��
Y�������L�����&���IPK����6������J(K�*;����Y�yi��)���)�qJ�~���t>%(�$:��O��6���2��8��+�	z�`/X~|�cB9
%L
��k�+i4�PQ�����[��
O($�k�\	P5�������8�6Rc(�
����s<�����R�EX%�AF��}��P������`N����6��(�0�>�������'�`g�����V%���^m�l�n�''����v�C�����N	��6�ksF#!�����1w��,l��[���x�q��@��!<�Px��DT��y.�b�H�sPc�!�5���U�V��5a��l��R!�6�E1GKh�L����bZ%h���>B�[,Y������:�W�^�4�;�2�^������7�H!r}
zr��+��	a`h�������g���(u<u����Z����9���(������}�A�y(��D�OL��[	�:����C�.���^d�!���x��j����8^�-������x��]�AO!���P�6,M��T�i���m��"��+;�j=��Ff�FB����$$��~`G
*�q���|;�����8/@,\���r��pr]�>��0qZm��R	����+Y'������ u�6}�o��N
)t|T+ek�JN��jX�2:����2�7h9h����q���L;��h_�����Z�:~�����b��v���X�x$�ia�x��:��9T[��S��9��i������������xB���	]�]��N�P���NZ�����C�>���5�7VzMm#:�5�[QC�ke(@��=�'���15{�7J��h$8N��}^����2�f=
35��chg���_�l�C�6�/F��V�������O��.
�C�u�|���xs
��Y���>����H���p�id��P�25F��@�����������A��]K������_�&~h@�5���n@Dm�T�����,�v���������fVf�(h�"�����u<�����C��m�7ST�!^b��^���!^[�DGP.��T�5�i������K��V;����QM�\kt����Grh���9[�	�P\jSk��V�
W6��_'4~���������N7����T��a^�����e��P��h
��z��}����v[j;��Q�u��������T�m�o}-B7�d�VA�N�&��T�oi,���W���H����@����)��L�Ko�|��� Y'5��il�2����b=���h6D~:o������\�jp���dC�(I=��A��Y�o����3aT�G����,���<�����"��tHC�a�d�o�$hk@�}���y�hv��m�Z���IZ����F:$���D]�9��d����� ~o�x���^O,N���9#��k.J���{m�cJ�C?-�@�u2�����$S����&�R e'�y��x������=7��`�7Tx���8��0��q�u(�zG��B��P��������P0��I��"I��:������0�����^�.K��oVL�a���P������=�TF6Ee9@���z8��6�����p�[�Y�!�f�1���Y��WC= e^����D�N�[���1��f���F'&�>��U�
����6N�W���n`�����������k�
�$H,
H�C��<�2rc<$l����n��C����:V���C���To8s�O_�!a��6+8<q���m��~��bi��M�������c��2�j��W�P�9��g��]�'��W�[�&v��a���l�e`��EY�vdF��Q;M��������+�J���g,����e �����7m�	U	��-I=������J]��a����H��V����R���x�n��'+���nx�GV��d�������Jx�����rd�N�S�D?p���wk��Wud���:M���0w�< 'lRy���q�(����c������\']e+&�������o��P��=3�.7��5�R@������K����5o�IJvi0�R���r�gt7��]C�LU������5(7n7�oV��������>La�LNwn��kGgd���@Q�k��@����4����&	�{V��vtF��T��oUq�v��9�s�������e.��K@�����8��
F����N{�Ej�Dbh�����?�H��
�ha�h��:H�y��G�v�n�Uj�x/>>;*���	���p�k)F�������Fm2�(c�y�����	��u���z;P��}���>��u����M;��=�J<2b#���m���������@�W�����(�7��qD��������8��{��u��Mp'��h�t��Q����������;8,	9��pn���f�P%�{��JY�i���{��`uA�?87��>"'�����3!�����=h�T*�0|�6����{�H�87s~C����:U��F��������Ug��%}�N��vQ�!�j��9�}����*��jf����Is�p=M-~�6 ����n��7������G��C��s_e���3:���:�kQ��{
jR!����^��szS���l!1R;49iis�9���D����J�1e���2���(;Bb���$8�'c�{�������\� w�z�7��!��IYS�$)a��HE����
�|�!1bU��9�Z�S ����kb�6)�������z��'����������!����a���IV��_;�<�H������J���4��[�,�M��jg{�� ���#���J�*��������P�p}sp��G.Iw7]9�ZEm�z+�v�����ilT���6Qve�$�Jj�"�{�H(a�hW�w#�tlil'�"��;���r��1��LoJ*|�H��tu�������o1�T���m 9���O�N�����T�����I�Gda
�t�Ye�K�������S�D�����tm�J��n1��=�E2���"+�J��LQE=�����R�z�t��`Fn��HG�a��3*���^���tG����r���l������6>��J}B�i����t��A*)��
����
��ke�e�[)Muq��=�ELC��
����������>�F�d=��i�qSn��YqM����7��;g�-U|������E�Z|D��nw�NR�P��A�G_�����b�
l����������f�AE��0o��V�D
lh
TC!��X���kN��c�:�����P�59�q�'A9H��B���@�0�J�R:�P�J���B�j�����j���=9�Jg	��S�Mw�����!�d���C_�d���nizZ��/r�F,�g��������zn��/��[xl���"������Yv�����R0�����C_��au%]������_v��i�W����A2�b�!�	
�f}���{`n��[������<�n'�����n������*T	)`[�����1f"��e;x}��<l�<��]w��4U� oey6Q\a�3w���;v=
V=
��C_��@v�x��;������K�q�L�c��RUWB�I{�G���b�tO����M����k�)v��5t���w5X����f�D�*����Ha$����1�EZ������A�h]�&�qz�
�ZO�}���t~w�z���D�	�>�C�#��Vp]�U����	�Y�7]1A��[���.V1��/;U�	��v����5�o�'��	��Me���g�1��s��gk{�^mtU�*������3v��a���CkM�D�+����!���	uE�����,�G���s�z62d@��+���&~g��3���Dh$������?y���VB��7)`Uo%�t���i	���:{Y����R9�
hG�;^=55�I�>���2����L�xu���1�;�'8^]���Z��g���Pj���[���V��O��
�%����^9���i&��;Co�T��j����Q�A��������j�����9Z����+1��<t�E6��74����^D"�%�?C_$��S��Vw�z��}�V|e	�������>8�8R]V�.�>���su����
�>��Cj�A���c/+�o��4����s��.��lcZ��y�>G������5��t���J�`Cmj�q���+�0����3�^VI����Y�SW���6��2�6�f�dM�G1>Yo%��!yC���#�E�M1��[����y�}�>��X�E$��=�^��L�-��q�U3l!6�7d��8�2�V�[���:�^��UA����lC�hU��6�e:�5�^��P]9���	}8N=M�D�&a�������\�Po�5�~�Z�z:�&%�e��
�����Q�"����:�7��q��)����eT��Ip����Xy��`r�3^������C���c �q��"~r��q�5;r����/�q�5��b$�U�_6�~������.��u'��M�GF��g!��u�����d�����������'~g�`���,�W:����wv��m�����N�ZY#t�S�z�+SOW%�	yY�SOT����c)3��L��q�n�=�>7��7�SWt��]h����p����+�&C���q���M3����'�ZY���I�k^3t+���8�2�P�0��R2X�`oEy�V3�2�i8�8�s������$�&���C�#t������5�+��~U
���������'���N~���/>!n�;��l�.�whoo:�����1�,���8��)Y�t�a?������C�0������2�F��mS��s&E��:�����0j
�{8N�sA5>���:�^7�=~e<u��C�L��Q�W�8�����t��^�p��U]�+�x���e�z+�	2�^���3������y�z�Z�����!p��3��57���1x�;IO�~R��H������
?awe�o�5G������&=����������C���q�=3�����,�p�z���zp�RQ��8���o���?;f������`*��So�����p�D}������)RG	F���-�o�o9V�C��N������������,
oe�^6������n�dE�,�D��d�I6�6�)�nDM
i�:����X���J���3��dc�����d�=;o�`��{�x�Z��`��{��'�8�6���-z����������r�z��1?
����Xa@z8N�7{tch�6��SWS��y�tr&�O�~L��_	_����du���l��F#f,+g�`�>���|�_�<:�1d��+=��J�Y_j���;C{�/cP�I��3t�EN'�eVz�E��-~�2U��������2\3����(O��V�3pj����v���c���)���y�O����\���:�
6�lC
���"��a�^��c���:n���V���,$y*�^��.���tS]<=]�Q�o��S�������	�A�y�?�Q�f,��n}7��������t�z[�N�P��'�:�f��w
H����w����V�Fu���a�>��1R���f���H`h������i=�"1��wq����:���O���b���*�;C$z�u�����%�h��"20�^��F9F�oT��H���8��P��=(a:�7]
^������[�n����F����;�yh��(C�m��5�dv�6L[���u4h&a<C$����j�9Hb�5f�+�X�\T�ym3�S�f�!�3�v��Fg�8��+�]73O��x�4H�I������y�Ar��%���6�4H^�������0w@BVY3@k#��_8�`�����m4���P�8h����&��%z{��u1�7��!��RHd����U ��1�R2��Q���oI.�G5�e+|�B}E\����A�,Ee�[k�7��y�AL�U?TjM�����iF����b�UIB;����v��>5�.f�*�7��j|����z��r/+O ��@�Z�����@j7K��*66�v��;
6�#�m�P���>�H4��R$��b���[i�2��d��fAY���FC`��5�F������fE
3����0HO8���y��v5M��J�E���H2�c�6��k�HW��^��:���&��M�W5w]�K]U���UH#���� 9wmg�*�I
sU�m*��I:�P����J���pMmH�H������\aO](b6�h�3��&�6�A�������G�:g�������Rn~(���rE2J��e�l���Q%:#/I��b�:6d��yu_�r�k9��AH����ty:�3e�~�cJ�_I�������Bd� �!> ����d��C�@
DK�J���$��8��"@�~b9�6�<�y+@l��Q:�^�����s�����'���+y]EW���"�RN�	�b���2Hq�{t����2H9l�@�����vU/�+hAk�����]ev
�6��=)���"rY�(��Li�
	>:���g(��{������P��WY�d3(�yJ�P(q0@�����:~|����KVB���~�d��z����=��cQz`��h�n���.|�	e�_�l�C�^_�l���&m_g@��6�'�@�bsP��l2���"�)
bl�fp>�6�A��)��Ki�3�RE� �����9�T�c$�*�J�\2��*V�4H�J����N�J��+��&V&�z�����a����=%�(|QO��6 �G������
����<c14�����m���J�:�W�2l�����2lhU`�v�v���4��"���aAC�]��
����5�-y��s9#]��*j��b�IYN �x��'��r}=+�a�C��M��30�����
���=0�y@S]�;�b���N�G�*�|B��S�o+��
������������Sd-�-r�������b/��4��@���f�y���dc����YWD(���u�te�~T
4�I���zZ�|�K<E������:�L&~g��o�FJ����!w��D
�O���'�rR�
u%
yO�~�Yi��c�����[?a�4�/+�=���4H\�rnxf��f����Li�6��d�
V�|&����
W���b�k{��4�}V�L���:#�4_)�U�H\?72N�u?�o�� �9�Y1<s����3�A����<���)g�r7�]��0<`�f�":A[���Z����a���h�09��ARl��Ku�\I�����Y�I���"{)����HB
���9Q�J��|b�K�PS/��,�*)
�7||E��S�4�d��M]��� ��>�x
��|e<Dy6���6<3l� �J�G|;'Vb}H��U(��f'��9����3k�FZ��A����z9%�
���������`�o�Vu�zm��C`�+_�_��t������>�'����{���X���&q04Sx��������!�
�zvJ��l�=]6����d�]�&l(��yd��qW��N"�'Z�
*��Y*
��V�)�*Zb����4����n4�u����9��J��_��<�A�#w(�"Nj�� @�����Y�iC��!�>dt�U���|f��������������*D�T7L�>l������6H;�#J*FQ��Ni����(��te�P�X]�j��V<(QIb�2e�������aC{,�&��E���~�v��L����*Zh�N���+�rM�vf�x��q�^6����)
R�r��j`C)
rZ?�w2���|���f!w(x���w��LC�P�xfJ���e����
;p�q2�Re�^g��Ni�#1�6�lY1�8u��68�C�kN�N4����N
%�����r�4%����lw��~|�6��V��8���8��q�-�������'�6����s��6)�����4�A��q�~����$������)
bvyMDR��)/�����~���vcT�;�3��S�}�+�+�,Y �di��5-�D����k�0�2�P����V�#n���� �d(�������4�5��60c�����X����v;pj�$�-Z���z�a�������8�F�����������taN���,��Z�N�u����C`w����j��
�~����;����v��"���62���&�8����{���z,��,H��r�4�/���[����w_������3X�K����FQ����;�/�� v`vE�)
r�����=K��S���7�Ec��z�?�8N�A;p����5�62D�S�30m������=���5�;���,��C]��^i������ae��6�\���;	��.���V�wJ���zj���~��������'�4��f(
�|B���)�^��C"wN������|e����`��gb��<��P~����Go�6�G$8���M�l\�-������82����ad�� ���>MN$�SGG�{�}�:�U�x��44��_d����3���`����\/-v`���t�����$�zuQ������	!��j����fZ�;��j�E�;�jC�f?ZH
/!e��4���
�h���ZRds��)�|����cp=,	x����`�������,��Q�K��
�	e�:����xb������d�jY\<t�S�kY�Q�+r	h�����@����3?���_��@,7Mc�� P�����Q�St����e$�C�{j�f!����M�����tz=U�����W��u���iI�N�`����X7�8�������7�&����2R5~VfSl�JtHko��2�R��pj�`i@8�T%Oi��J�����~
�b�����7<��(H��Jy/�'J�7�
�Kj%���]��wh!����tp#]
������gP�x�s(���y*�������C�)���o4Ik_��4M�\X���P�t�R6��
u
vzT��o�����;�r�J��V9���+�xhb���HEVU������w�+�����:y�����Z�iIO��o
(�&`�Z�U�o*,�VZZ�&P���c<5�I0R�5H���������R����.e�����vH}ft����
�tK�
e}3���^����wHwi����U}�%�������dl���e��kLP���M-�U�L�:��5�~��z��������]����/��y��&��4�j%m;��"������[�7�.������MK���3����S�6B�2�IiI����Z���_K����R�nS���n5}�U��VR@����VJ��;�"���~�V����^�	��W����^i6�������>�ce�o}u��V�����EQ��	2�������t.
����t�!UJ�whT�_�
"��j
�9���T��������t��)m(����������6jA�o5�WAdS�������mR����~�D���M�v��g&n?jLZ��Wux�@����)4T�~�@��+������o�J��wc3�ai{w������_	�C�S#D��!hI	�z�<���"�
ip�C=�+!2y�S��[^	��/��������6\5Z�4�����vb����h�x��k�Pa�)"b�6��;���J�����kMM)�n��,������RDd�����&���@i��}?���#���-K[(�L���zE�i������C��oo���=�$
i��$lL�g��tT7���7P�_��'
J|f���m��CZ*2�������	���� Q��t��;	�N��������\���S{S���m���K�H>-a5�|)H�H}�~u@$n�s1*�2��e�^�����I!�g�?jK�q+O�R�J����g���h
9�!��V��44J��
��r~g��;"�	�^s<����D�;��Hb���Vd�)�}z=4>'�Hs�Z����s�Iq,��s&�B��
+������"��]j����+V�3c~���������]*$��xH�Pw����6�ce��4��������j�k<�I��k��P��l��������W<�����N�k5�C�
)h�����i'`AQ��>+��N�d���
�W<DIH��	���xH���,i
7k��t�d��x^�[}�C���pl�m
���%��S��ly��h����R4�_������VS<�*�����';���;�����QM,�l&�U�l����!��
-XB@���<A��k���]%z��7�z�9R'��Bsc�^��~�������*x��j5�C�)�}�%]-!�C[������iCTR�p��o�$e���3�Z|�������F�U��^���(��{�*)2�j���+_��gv�>�[WFf�(��E3��n�23�����Re������J����#�m����F
GT��K��9��u^#��~g����R��4x��g�4���Ra+����3DR�/���:�AEu��M��EQ*#��m�.��m����$��	-�}X��>���CB���
r='e������z��4��!2���n=�A�P�W"x�#m�����5�'��a	��M������h��tMGT3��? ��0��7�����D�R���1uA��u3�7����q�H��s4y/�`3����w6F�
���6�"c�{���������+-���B�	8eI�.����\9�wb��Rao�~�I]�hJ.�9�q���5C��������5��xf�������7I��r�-#����UJ)4�+�����:��N}D��8��T�3mY�����u��nK9k����ae4\o�0�c	���^���l�`��W�[��XH�:�aTr	wv_j})�.9+���(�������9���xH��B�'�*�T�����n�&r�D����{RZ
xB��4
,6�v�jR<�SL���u���x��&�������',L����!@��3�R�0t�I�D����h��Za��T��9�	�T�B�9�q7���~l�?}�����g�0h��~�d��C;�Hu����A��1X��46WR�mcL��_��%;�Na��@m�lu�15�6�E"b�z��>S�����C�'������lc�d���Z��Ox���R�P+�-�����u7��2�eW2�w<3�!��b�V���h�c���x�W���"��xf�e�x3�6G7����0@�N��JK
`o��N����a���k���~~E��^%���}��o;�l<�A�Q;���N��_���%�������
H~K����t��u�����$'Km)r}d�%����S}f�����*��x�ae��_2D������%���c�+.j���^�z
��T�n���2<���-p�58�")���,�5�|N� ���!������p"�]��+>����oik��,���(��o
�U��.�
]8�~������2��1���Y�?t�����zY��.�|Rr?��,�!�9�s{�m���9G��F[��k����AM�����e�a��S�iq�fv��a����(��`}8e�S_���R����SS�� ��+-�	����j��S�#���f:��[����C��d������g��������Tuk)r}U�*�$�MR��j������-E�g�D�e��0R����8�������%TGYZ��)r=8^������"�������dk)r=;eLy�(�����A�7K�bo-E�'y����o��� ����Qk#�����J-�36�+�8o�';p��k�v���C�N68�]3�8��>�"c?>BW�t���5e6�P]k�S���qbh�	�RS<�v��l��]�'Z���w���57~g�;6��O|�`2�G�@%�;w���8�U��n��R�h��{�x%�*��Z���5��e��������3eX���Pe^��o���G�\�����]X��e����N|���"�]o�	4���X��e�-�*Nv����m�%�l�S/��*��:��+r��#�K�]�6q�}��_�����hV�����LqY�"O�	-����O���5Do�S{w��;2[��W�qs%�q|����PK���kk]���Ze�^wh�����4�v��9�q�mo�p5�*)�x���|����]GwL��'���)�����-�&*w�Z�������8��^����/�����������E8�/�}9�d�W�����)�H�b�j,pjw�P���P�"�� �����R�z���1�F��"��;��R�K������a�
o�\��`�-t�l�mr�
9
+�]�J�e��rH����]F�
��5C����K���������w�3r�,qjS��D�|���m��+���L~���W��\Aw�%N�x���i������E�a��S�3�[��BN�6p�}�m,��MoA�����^�������}������zE�=�P�����4����4�z\b�'{j��~Rt���[P�Jc�(��<S{,v�3�����q!]��,E�+�w��Nsk�
��6d����64\#�E����9����U*R8��S�n�[`�:�Y���h��b����
���(����;8��_��{x����L��`����J����A��4��\q���!��,pj����w	�X��}�$���U�kW|���C\|���N�����������C���M���^�vT���z�C�����A=+�S�~�+)��#�y�`�j�;�9�H�����L���8���Pq����a��;�g���F��X���R���v��"���Y��v�l�����5���6T��^���3pj?C�5_�m����MLjmZ~gvK��=$��ba�S������������8����~�kk#�2�m��O��U�����u��M&�����]������
��0&���ad�Y��C�,)��+����GbHz�ZU�J���aC[�����Z!���O��! �����ve��n��Z���(����Y����^�+�-��\�q`�������~���PFo����j�������P��z���.�Z|*,�<�>�8u��f
�f���-�?�v2�6r�-p��5��������8u��a�
+�
��G[��J��4�q��o�>��<+ct
)s����Pcl������2��Qz�U��Z��FQ@V�txB���/�f��:��Y������m7<�'>�����T�^~VF�#J���)���q�-�@=+�S��S���E��8u���8gC��,p�q�lE?����C;�e����0J�a��C�r����64��8��"���GG�����ge�C��8o���N�`
�`
����P��)h}���:�Vfz�~ 69�
�x��z����q��"�=p���������3<UW��7Y��m�iC���ZdO=���d�;4�'���1TS�:z��GO�N�Pr?
�9��R�L�G
;�Uo�C��Q�W��?P=R�GO�������2�k����Qs�B_�R����G������z���h@�{1���S���`�5B���&=u?*��A��e��!J�p��������,�z��Ol"{{2���"�=p�\����c��@+�*)b�t�����������S#J��������<������
Q]����������;+�;���LU�!7K-!u?LU���zy���S�L�m�O��v8u���g�.��8uO������+�������Rq<��W���7�RaO����y�>���Y�����FNO���9�k�����G�����kz�z>����T��S���B��aX���]{����8��������+_����@�CO>�`�<�)�2z�~����S����O�y;L$����~n�}�^�q�;���:kj������G�]�w���
����H&�����*��7�eZ!n�	��X������X�5���S�����������2��Q&TgP����R��*�
�@����q�B��V������^=3���H*p���9�������z�t%�;~���FT��gf����r�^��|j��TA�a"u?
"c���VIz�~8O������S?{����WQ?8����^1���,u?69Fu�m�����L����L��F?��<��{F<���C;��o�=_-=D�.Q�|���m��mst���_��
�zv�PK�p=p����WA�����Q�&���=�8�\��y6�J�~N��m�O@�L�z��!j��q<�60�����d��hs�z�Y��+�y>����EQ���vUF���>�7��3����S_�k3Foz�%���58��N=��]E\�Iu��x�(�}�NCo����;���u�P���f��M���nr�1����*�Op$�3O�����>�,�D��M���8����uo�Om��>�MR���Y?�kN|���gc���q_�_D-����������>:{���N�X)���
�,r�]v+Q}��������m�m���^���r'����^���	��#u?*����Z�85Vndx��8�d?����0yue���W�����A#p���Q��[����A#qj�.���N�/���?7�f<3��_�U�W3te��2��V�m���
�A4}���W���n�Q���~���Z/���j���5���
�zm��2�\�NV��v��~0N�N${��C��!�J	�6���g���z��������Gu�����$�q�3SV���A�����F��q�D���Z�P�y�|���H�#�������4}<�6lhr�I��w����8'��k�<�O��*9v����{�O@�1�^�via���e#p��~A��Q��r����{��q����U���j$N�t���Dl�E���kA7�l�^�2b�sg�\��H��qCpj��]`6h$5��x���'�wY��
�D����_����k�����������>V��0GeN�DX�0��)���H����e�����W��"�1��u������N�(��zur*�G�����������m�e�+F�cX��z�+s%&��+#pj/Z%9�+�C+S;�R���t
{���D����P�~l��JaT��o��(TN�ae����`�Vkx��CsB������8���j��n����tg��&K��G��
>�0;��2qN�kG�s�);��65����!�iCs����g%pj?�U)�*1��F��~�f��5��uN�G�V&���3pj?���1�/�_%pj?����-�c�2u?���X��_��l��1E��/p�u!�u^������|���U/e�����W]@�G���y:������/?��C�G��sAY����~N�N���sL�k�	�O����k���9�O}:d������8e�O}y�
}j��F��?:.���`�>���Y�n�D���v�O����N�U}�o���8��������T���3��j:��o|j�	��O�9��z[�!k��j;�g�����%i���1`R�S7T��������N�k*�;�����n�]T�Cv�>�/�'�ae��5�[Vf�����R<�j��;�!�H|�%�-2�������K)���,�
t��x����A���$�^3u?{�.��������LM�`#u?*{N��ne�����M�jL��Y��~a�n�����s�>����&3p��9�����Aef���;��Aw����s���}Z���E�p���k�����r��H�L��3pj�Z=8����wN���z�6��	3pj��^[@aM��8�?�:%��iEqN�����9��geN��m�����ce�J_�:�`6�2�\;z��������bm�
0��g��'���)�[pNm�Y�}���+��<+:����Tgi~�Q�X]Z[������k}��x��f�����g�)�I6[��!�Z�~^���6Tx; J�&�liCW{���1S�c:����X�������
0M�Y3u?�]���nr�+��Qu?���h�L>��L�����:V��{�c������
��*q���T���S��~l��A$5-c�V�M0O��2��q������CU�-��]��3p��?U~O��6�cT������*f<S��LL�gB��3��S�K#g��8�>��UW(�i?���w6U-�P�/��0Y����e0�n�����gu�[��v�>�i�i�v����k&���ou����17S����e������X����$k�C�S�:
W��i�����La����z�N����<3p�]/�	���e=���wu�0��68��j�����3�P�D~QQ�J��9����.c�����j��J�3�(.�Yq�*���8��J���1u�~�4k��k�����!Z<U���S��,����+�g���P�e-bRz��kC|[T��b3p�]��Z��t��g��}�&��C�+��,c��g%p�ur����J�6p�nj�v�����f���3�(F��
~(u?��PFy��ae�Q�������+]8�O�N*�0�M���Ui{��m'$N�����F����m�w.hp��;��s�vG-F����m�R+V��U��	�������s��#-e���S��*VRsW����A-�
T:.3p�mT�_T3C�8���v�x��CG!V}Q��g���.4z3�[T�g��3
���]�����OEy���V��ryf��V��D���3gZV���*��+�:��y��B���^�{3p�}�3�]fT��{%p��I]�8>>p���bEFw��{����I�m�dN�#��]���GQT�
3t���O�t���z���y�|�kJ �+�LgN�+U��V|�(������)�K��8�l��W��6�@�r���zq��s���������	�����A����4�_�S��:o�j{��w���.���=�,t�-��[��Q��������V}k�[h�/eS�W��s�Y�����.dEOh�/�Z��QX���BYx���\���N!�yW���5t��Soy��k��2��+q�B.�Dd�TIc�����yt��M�5U���8�j/���,2�5�Y-m�9�j�=
+_.��m�o5�\����t3&y5�C��5U�����~u?�W��g�n��v�P#'o�j����{��+p����t)�hO��9���C������g��I�8�(�+�,����[��dn��}���S��,����u���2��T��8,i<3l���5I-����SN�f���Lg�����M���`����G�7U-�;���N������n���_����vR��_W��&�f�~������mV���o��(�����
��}�Y�S�NG��CT��������z�{�\+fvw*L�^	�zN�g�=�ka�������Z�	+p���l��B����H_Z[giN=O�M����^xf���E+<��UUq%����12��=�������&j�]���:��DN�3�7I>u�������p���(����{���}�����~�1+u?��������w���W�S�p��G���Wz7��^���HJ1���~���~ {u?�,��J����Ak�*��o�����
���W������|j����}�����������=�
wY��c0�pL�5E�V��
�����Hj!�H>5����N�g�E#��Q9]�3�����+]����Q����,1��9���I1��g�%��S���U�m��P�{NxG��8u1h]������Ql�3��J�Q,��C��EN�������^>�=�7/��4�`�������R EP����v�So�0��#-�x����:�k��
�zbo���{K}��Q/�����l�])��v��j JM��I]��i�Eyo+���XM�2���hT ��W�%���pX��0�z��\I�������~0/����%q���RO�����9���V����5�������S/�X��	<A�x�z����[+#�{C���	�8�� �O�Oo&�60����C�N��z�~�i�����=���SoC����W���k����0v�y���x�qBYZwp��<pjCd��u"�JS��S����O�-V�fC�6b*�O���)�:_V�Z<�d'������T��X��\�;���������
��d0 ����S����z�L�g�~�XS�0�+���'N=��h5�����S?���d��N	�L��-��<�P�SKh/>De�N�+_�~�E��eo�5��}W�C����tS��'�6lhR���������s�d���3u?��
�O� ����`���ST4������>�o���	�S��oh�VL4��	W���hS{���I����W�"K������)G8���j�a��������i��q�'N��Q�
�[8��o�i��Vf�^9��`mZ��-�3����1��j.���Q�>�`�A���Q����"�;��[�J	n���(W�g�N�$����3�mp?
{�6,�[<p�'d1�Lo�W�c��\�!���~����8A�g���O�\d���������a��S�"���eU��zNp�(ur�R���T`+N��S����6���
z���+��o<3���bR��'��z��O�END����h<�(��H�[�m�����z��|�Y�S;�+���,p�����w s�Q3��+��'N��2��[=e���'T��+�>��S�-�2�Z|����z����=p��W��\d:�S��<���<)O���f�5�-�S��U|�{s<3{����	��MVr?����~$`�S?���Y�������Q4��H��
?�)��sr��'N���4h����Sw���;��������IY��y��c�8+��^��D��o�LOO���~��Ivo�ACc������/pj�s�1b��N���EK�f:�S�Z�7��F>�������$�J�S*[+���1���>�Q���;|BS�������J�A��8���h��UeXy���4b�Q9Um�m�n_�;���X��K�}�~��k�����aC��YW79���S��S��������1�'��&���;pj����8�Z������T���wN�C+m���l�8�O�%�3�{;ue�e����U����~A/�/8��S�A�}e@��+S�qm��a�~gr��{E�vu?��L�����MEY�L���8	�+5�����
�2���3_��9eo�r�v�~�z��c
2������5�g�nS�ck��tF��D��]�.��O ~�6T�z�
]E��y��>P���*9����Y��;��N���,���/���w<�4��9G���q:����N�zi����`������
z���d�2����E~�jv�~�Y�z��F�HjN��(V.jX��R����M+S���bN��)pjwv<_�:�����w������{
+�7���E#�a�#�8�A�g`���SKH}��9P��p�g���S��3�$����n�zY���e]�cv�������v��8����t+/4���!�&_�4����Q6"���^	v�~�'�i_���-����N�D�M1�8��j�POJ���O}�k)�����Or�LL�
��|����	o���������e���3�����ZG����qM��,pj;�<��'�S��r��9�gj���M��T0te�e�2�NEo�W��6;d9�^�/u?�
wVB�d�wF�kcn������zC��d�8+�S���� ��q�~���C�	_���ao������N�;4c��UD%��(5q����Q�QM��������1pjzV������6p���OF.v;��+#�����j��������Y�Cl����{)�����fPrt*��J�e�J������{%���N`�:%v��v:
?���g��c��
���������GW��jvf�������CQ���i}��1����������pjp5�N��V�]E?��teh/v��q�
�W��bo
;DtN=�������8�����`b���a�Z����a���P.l�&�7����8Ss���U6�N>�S�����S����`V.�ha}�1�4��Z_�~����������l9�����peX������
d����S�i�J���������)���#���A����6Vp���@�H�<�O}vHo�
�O�H;pjw'~{����L�zQ��9U�`o3�����;���wY���E
N�{/��W��ge�(�^�u0\����x����3Y��b	���Q,�ca`64����Q��hs��S���Q�h�%�|���O�����
����FUz�i��
��Z�CU������A;}7�UP�T��Y}�V��B�*����e��I�����E�9�N	���S�Mn�3P;����;�Q�g-����g%�k�����oO�(�4&�v�JLRd��~�����0�5�gf�+���Q�F�6q�����~��;��~�a5&���L��j�P�])�J���Z9�@���V�6�-Xmn\yY�r.`��h
���!z��H�_%u?��AO�����G%�x8��d���9�g���v��vu_�Lw�����A�-�Ur��qV�`w�/�(;��������_�H�\u�����S_����4�~��S�����P�{~�����M�}V�
u�z��G�D<+��.��U���;G�(��p�J���\�e.���������	'��l���SCE�����TM�ge�C��'TA\Ng��}������������ZGg���YI������T%*+���y�v�J/<3�c�����&=k����T)Y���S��=���k.���9�TcA'���?+S���C��s��|�(ng��|D������v�m���(�����S���S�2�;�'��x��s0�v�N�~8�q�g �N��I�9���S�7c��5?�f��������~��OH���7W���3g��M��[���2���������9�-��
��gf��t*X@u�
����T��s;|��6����h�y��
�7��/(o���J�~4(�=��J>uY��v���n�O}�hb	��_%��~���?4]w(p�U
|���1�2{��;XktT��)[Ys����
t������]vul!�\�bGHc�4������y7l(p�v~�h���q�V8uW���|����~�q����6u?
�8��{8���4'*�-<3m���s�����O&Z�Yo�������x�t��~{*96���ZB����4�j��OH��q����q�r�b�[�Z �M���:�?��:^�Z��N�3�U�J�������g&�����(T�k`]��������z������
�
�l����CF�||�~4�|[�[���!��
��C���8usT�+5��v���qX@P���������n��&���>N��D
Xj
�����GF�r#��]ter�5[��:��r]w��L!s(�����q2���d<3���c�]:�CW����+���R�N=7g�#cE"j���Pu�R��{���G%�E=)�C�����*n�����u>/�����x��N��������K�����M����0������{f�c[�����&N]�f�(��j���)�eWUZ����SO"���������O���N��CM����hs^�8�l]y����C5h�����]��co���r/���3u?��=�|���&���zl�`��q,A;��*VZM���>��k��L��!T�F�3[�lju1�����G9U5r����PM���q�	5����`M���x����w&N}e��
�g���������xM�����*&g�U<3p�Z�Y�
�&���tJh�d��6T�f9z����Qj
�zm��x����Z����qz�U����[�/��G�U���l��%���������w=e���T��T�`}�S{a��F�W�����lr*\o�g��G3�3�99�J�~�o�L���Z�H.��������k5q�1
�z�<X�����V<z���������A�ww�����G���5DR�Sw��:="���842���vf��w��h�
���?�`�^����4���n�M�O]��(���e����\��f���Cg������VF��95���N`�����T�7�O����S����G?����&��)���):�o�~h`b��_�P
��tJT�wQ��*���1I��7���'��SOV�>5�8�:l'�C�Q	T pj?_�l`v5��N�P=�Ltkm�N��S�7Yz+���5a�������|&r�����y7�HT�j��~���J�e�&�S�]�'0)���S?_]b�Y�Z���r?�<?��5|O�����KF����O��m�m��WBab�]8�,D��r�	��0Aj<o�>�a	�S���ySqN��v�8���������&�So�?�����"����bZ/��S���4d�1���8������*1�6�P!';"���M?����	�fXM>�������O������`�
jj	�����E��h<p�9�p�\��|N=�#��$��N=���f�@���bZ+Y/�`b�g��7i�S?l��ACY-��6;��n��5[�����"���>�q�mC�c�����gk���25�/\]��������w)45<3����r�����y����������S7�f��s����O
���X��Z�~�51u$U)����J?'�16i���O������~�������������G_�C���������~�N�3S�uU)qNjVt�N�gZ��<�)���v��\@�1���}��2
�6�(�Z�2����S<���~t�����yvK����k�������|�q��*LP7�Z��O���}�9�m�u���8���'�f���Ti?���������mi^����;���������m��^�a���4j�S?�����>�?[��~���W���N�z����,�2������M�K�Y���@Ut��';p�]�I�����1��zJwY�S��cl����M�z��J#��x��L}����~�q�-]��:���N�9�25.}v�+V�z���J~O�H�S�zYB�n	�S�J���K����_��B������W��j�C�S�����m�lh�S����,L�iZCo�SO����]��W��S1�l����
�gG��QFV
���!Nh6t�L���5�C�A-pj?ux���oUU�Y�b�
(k��<15�y_%s�'��<����+��c��m��(O�������F��8��������L?�t���"������c��7���~��T|�{��r��9����NyV&'�u���fi�LK>�/�e���P�C6q#��7�j[��
�9=�I����K����[wYs�t���������U��M���*fK��3KJ�"
��i=���8������X���C�^�OH����#++�m�X��Z]vt	����O}�ZeouJ gOYK��Ne�|hL]�������8�@"��v!�F�Z_�S�U���&2����eB�
|�����-�����mS'��-`����{�l*�Uq>�����y>9QM���4W?Vv���5�m����>���)��SE�"�����p'�a�+C;���A���bX�64��@�\�������0�p7�C�So�~z�a}xf��P�����J�E���w���m�+3R���CQj���cGn�	����J�e��b	F���M����x��[�2b��^����+S���D?�3���=V�$7�s�����/�3d'g��Y����/��a,�m�C��G�Z�E�;G������b�R���z��������+��}�[WF����n�;a	���	
��4�zV����������2����V����p�H��c��aCF��f
����;{1��q")<��
���L�IY��e��Y����7R'*`�Sw�^/{2���zMK}�B���~����%N]YG�������s�T)��L���o�`C��aT��=�j��S��\���~�w�+;�6����0�r�����<+�����R��D5_XzWo�����+��](�ca��G�6�VQK��k��`�����^iN��
���5�ztNi�]������+�u^�C��k����[/�����Y��X���3���Z��c�2"o|���u��~�v����W����l��C�S?�GU0��d�S?��U`b�+��1yv KS�&�����wEK���N�N�9�H��1�U���^}�K''{V�
�k�g�)��R����)��=�E�6u?�I������������>=e��n��4d���3b�5���P��������k�i�4����x��%<?�J��0�~��[�V�mS���'&0V������>��-p��{emhYY�N�mco�T�p;N��D���p����p2 /U�����/M�(*�g�S������*
;�����,�+��9G��I���+��������^/D5��u���[��Wr�8����l9u?�����:�7n���93�����,�(�YMW��O��zs{���|�����N��(iT ��5t�����YUW�<3���u<2�?��o��M;��;����J�t'��>pj`�sNA���x:������y_����=y��z��h�m��q�w�f�9;�a�o���'����M:|��������,�\;K����U��?o����D4����j���UJ�C�8������P_Q��N=6'&,p���������1�q+�����(U���#��Q�o�w�����f9G������g_%u?�q.�"��z�2�����~�	�C��P�+{u?x�]���U��O����^����S���{����_�������^���qn����`�����Qm��Y}�8�:��*/Pc�wf�l3��.�P�e����2�P���>�����qO����cT���S���y�&��P���
u^��S����R?4��z���MMO�,���{e�P��������Eu?��]F<������k��V����S��������'��:��,p��!�m�%��C�fed�����4�Cg��C�%��j,�v�%$��X��f	c�wFL���
��m�.C�b�ytr`�����l�u����.��z�c���G�-?+���5~Z�����+�)3�������{B��	=u?���:*���_��a��G����O2u�����GA�q�
`������kz��m�Q�E-	>v�s�+��8����VT��w��S3��C����<����VD�����S��iqJ "���O�^�&�N:��Y1���,���M�*=p��M>|�U��aNmgn���vDR��6����gC`}��m���W��������X�v�a���{%b���8�U��]����O��Z]D6V�����@�����������W�J��~4��!����������S���������Q���L_%c�����|���8�������"�>��|O�2h��g}&�X`	����l�8�m�<\�o8uEUUT��9G��N�[���|cj��{�ns������KH>�W�	�VV|��CO�KF]��^N]���*B=p�����:���>�b�L����~�n����]���-�
j=��'�j*8h����='�vr������6�g76d�O}&���y�!������C��y��@�m��S�aeu!w0�m�������9��m=���<9�Y�����=u?sW�7�,E�z��O�*n��Zd��OO�w�����/��s%j:��vs���j,�=���l������m�
U_>��C�}��~r��Y�`��{��~����>t��S���	���N=O/���0�';p�~t\�x�8��~���L��Qa�N�3cI�����ANDO}�U��x�}����l��P*|<����h�C�@�L����S/��QqKY�=�(V�b�����"�#pj�Z�����'����`���T�vN��JW����S�����D��6�����P����1��7R���H�����W=e��9Gq���.��n�8����Ia�k�%$�,=�x������'[�vf9�������?��P���3�^6�9����h�����p(�9���C�����yT-`C�����
3�u
���\7�vq��r
G�~�r
���Wyg����~'|B��� ��O[o�����|z�q���09t$N�9�Es����Y	��OUR�n
zN��S�s��QGT�FK�T�A%�G#pj����`(���������6mHY���_��F��O>����;v(�cx�S��6��E#�
]�s��w&�zzIl��m��9��s����L��|������e(������%�g����Ss���W�����h�����F�H>�b��%
�����vV��rzf!c�?�G�U2j�4�d=m����TK������__}���F���������g�B�/�1�	�S����B��&�=���HUgp��}��C�W;
A�SW�.��o��E��������W�U�����3pj������+�7>���y?V��������S{_�WX��x��s��R������8�>Jqz�L�eX��U*Q�C�g���D�r.(��"�#p�]��N�# �#p�m�+���V���7��x����Q��W�)s��x�h0,�xL@$8�MG�:gn������oT�����8�8�#U������R�z���q��LM�BE���U�������@�>0 ��4f�C����&����x�_�x�jC�O}Tl�}�i����W�cv��[UN��7	��_?��
���tF�~l�M\�F�����L3�J�+O��6����
����a�1g���3u?��3�������-���������)�b�#�(�K�e��B�v�����f!�J�zv�e��@Yl����`�MP��10)v�7��
�x�C1d1��-�|�'!�
<��z�:8��_��{��A�4�f��<�9��{���kN����
���I��2�P5�C:1�����|�(������������hC��y)�������������2�H���8+>��6�!����~B���x�I@���D4�C�S��Y��
|0�WF���(���;�c��|��+8��^L��lN�#G���x������'*�sf\�{3p�V�;��3>�<�+��Pm��^����f ��)����*���1��Q��,:K��$P&�/X���4{T�W��`�B3�r�L"�N�hS����� �m!����3g�0��pj���Qd���^K��(h(���x5��(,>L���6.�(:�
��Q�/�5����`�<��Tz�#.oZ�h��|��9L�jgl0p~X=+�r
`�9L���
@^�&g|��x�C[n\��3�)�^���K0��:�n ��3x@�4�����
Z_
*�W���4
q����LRue�@�+�Jf��Am)�U8���)�q�A]WB�G��3�?N3����Q��L�����m�V�g���$�A,������mx�3d.�KJ[L<3lhp��0�k�0E�p��!��<s���)c�0����'���TI�����a���Q%�YW�L�)m�2qE��"��H_vs�l�*V�#�*�YKU8�h�cX=��"��T@�����X�[�L*��T6��7y�?���J���ls ���X�/�?�%h�����Z_�����-<����x~(��u+S@�rTI��2�L���A�66=peSd�-���J�;L����P��,�)�Fb�%��g%�).F��H
��3I��p>���&������jtx���e�����bX=I�� ��zV6\1��RE#������c���9�xt��4�3-e����9�%�����6d�DG5�A�)R}�J$B���x�P�+oh�������U 9_�G��TC�����^�I���^��2RJRu����+!wHR�!��w~�\�T�Iuo(��nW6w�������I���@�3�SK�z\T��cG.`�4��hqt�;���s{���V�S���������bC?�5�	Ja�V7���C��l)�;�Ms���������PZ��"3�����(2���1��\�lOH����H|���r0���gS��A��9L�P���<>��6s��)�f�M������������`�j+�L�U��������+S@�PN���@
�T}����A��zz�9!:>����;Qd���;L��k���RLW�TcPe��Si�^�2�2��N���A�7E\��Y�T�J����?lH����,��l�����p�J�����a���r��!���^�Q���U�D��uT)������iP�	c��o8��W��`3z��9�H�6/_����>A���������-c5g�C}������Jcn�����!��]6V&��?���6cj�2Fo>!�?�������W��K�h������B1�v��zVR��Q���>])�qI�R�b(���^��%��y��a����|2����X~��\�?�����9	�$�����W������x�&��W�g��
q�V�FR�D�8���t��vE
l��x��G�X�����^
Y
Y�%���APY�O�e)�a�uTR/5�\)�a��<eJ�Z)����T�Zk��r����}�)�
?��p[�a	)�q��6LcP%d�V���D�0���P�o�c���f;��P�TTU������V�����$bt���A�+q�~a�����J���Qq �MyNm�A���>���
Y����*&�����CS����f�+�?|8$�>^�6�?+�e���6�8u1�����a�N���f�����]�[�J��a��^��f}aoS��RE.��8��Dy*���G�!�C��&V��a����6���^�I��)�H9L����/G�@��^���rU1��N�[#1��dN��@
��dc��)R}�;Tf�M���.:��0���y
;����~g/���S���I~��C�p�9G��p�8��;td��m]���\�?a���R������O�q�jV��^mhS�DOY��~����9��Z�S{��>�'N��26�w���w�|[P/7%Y������0U�����`�S{��g"C���
��O4�u�kG.���z�-�,H�,�x����fU�Z�S��<���>!q�M�D+
�rV��cp��#rD��S7Hx��O�&F�8u����@���;;I��(�q���%��D�Y�� ��8�i��Dd�J��HU@d�85V�:,-��9�H��8��fGmq����N�GM�F����8�g�e`p>��Q���B��R��<�T���v&����Y/��+�Z	�$U��L9hCs^�n_�!f�[������J���nyS�e����z0/Y�Wdi)(9����� 4��+�p']W�
���X���aC�����lq4�N������qz���������S��S�b�CB��U���<el�W���y�-[�:�6�?t�y[|���X��!|[���S??9�2=+�o�|���c���)����]����gMW������5����/����W8��
O}��+W�������Y(3�_%�?�VkV�X�������T�����/���a�+�a�?����i�	������{&�B�o�F���6��cj���L���O�)<#*�%�zw��T�����S���U��n��:*�X�"K8��� 3���N�3�t~�m���S��S��T�n�v���Y��
�+_eO*cS�x����0�)^��J4���w��rQSH��2;W"�.xf8��_��R��f�p��E'�S�;A@�.3�/�����dUUN�L�a=��]���������������~���K���sR���K;��C9�0UUk~AY9���S�b�LB5'P�w���'P}zV�;54�|�	Tc*�1J����MUF
�����M(+��j�4�$i����3MO!�lS�P����f������5T0���P�����,�Y��m(�-DJ8�����>��CJ����%��i�����!I}�������@��
�e��O��
zq��T?@���<�!����)����?�q�
��3SQ�7�{a�g�
b���2K��k����k�w�Ai]����Q�V�?��)�=�?�<PQ��Sy�vO���d$mE�S���T��J*<_^�T���P�����}�]=�?������l9~�gn�m���C����?��J�C	T�KU
���t@���4%"L��@�>m��*�����3�?N��jx��J�����
F��X3	�s^t
]����sBs�����S��L���EE���>���l�ZB%�^�C���
(�L�^�J�?��+kU�	%+�z�i]�UP
oZ���w#��xgkK��PfC9�o��PG4N���V��g��������<���16�?���KO����_���MR��(z?�'��	I�9�[kv�?Z�b}���Mv���>�����s��<���}nVM�q��ue�8_�I�@�Q�(�P����4/��E�UH����F��D��6��*�����0�{�4�M�	�oi�`���S*�It��P}�hk[n��D��
,L�����
���5Sxr������6�iRa�w�����M�a�L�aB� ��_������m��a��.�&�l��ebe�em-�	����IV�J�s���[U5~>��&�%V����U�I�"��z"���g%����h@c$��w��7�l�k%�g
<.%EP8����!�g?�k��e�������~�?m����gf���z��S��
�z4p��CT�B�$x����'4���C�J5[�),�������;)f0��)�a��8�O�����
~h���S�N�����~U�9#zaO���&E5�3�W)s ��(.����]qB�����:�����Pml\���Iq������ZD�������m+�����[��y`�(��� )�S�
c
����j�i�>!q�����lH���
pJo�2%���D1N
��L��j���
�8!q�UH�d�2���-�����C��V)������~�?�b��9�X���(	a��
-6Yh#���8�HPw�;�2'�t���~'�J
����;u���)�q��1^�1:q���vLE������N������qn�������<������=�����>�L��5p���^�Q_YMnH�����_JF�t���;	����CwLI�;�?��E�������9��+Q�r����|��
V��E����L��J��(���z=��5leR"���Y���k%��������&b���p6Fw�ce�NYc���v�N6�^M��C�S�M9�	"�wNSl�R��8�lvT��N=�j���0�|N�@|��]t�8u�Z/[�8����S�#�F: .@"�^���m�������cXB���,�b�C�SO���b7�������������g6��3_I����[~iD�����]U��l�j���>�1C��|	D��vP �|E[�w���i��@sN���;�?
Y@�q��l�'������8�jW�=2�.�ge6�S*�ge���$T��	������m������z�OxV�t�I�
���B/)��xgNaiX�6)�Y�*,��?4Xi��ZU��ge��Q�����a�R-��YuA���K��m+Vs�������r����'������+�.;-���{XBM��;����R��A����*����	��+���g�*'.��]�/�?N�!&�����3	����j�dW?��y��Q;��2�?�F�����~���l�,��S�P/)�������3���m
�
A��x�,=+/N�&(6����ge��)��0�\��zi�`�Q�<�d9e��%)_o�������Y���|��^`	aCG�VP��e����S��r�g�7��^,m�-��6mH�%	��i��nq;���8���-��.�m�!�|a��C��A���V�o?+s�4+m��k���S?	���3������F�\z��
��r�I�T?�}��a���T]K����Wa��U�R/���v��`u-u��cp.J��h���O�rgQ��
��6	��U��M�����nY��������O�0�L����l��aK����R�H9_UMO[����V3��H[�����F]���d�<u�C�s���C�?0�gi�J����.,
[��/ml�~�@�k=��t�$[����~�:/<��x�����iWT��l<5��&1��l�"�	�z�5�Tf� ������gc��j�9_q�I>����i�)�H�L/p�z�g��ia���~93�ui��!x��
�U?�LS:lE,�����iJ�#>
.bm���H!p����
O���9����T$���u�;\���0�g�~��*��C��� ��lc����>�$�:D��\���HS����k���29�n��{�����m��R���b��4�S��+�
M	���t&d�S������G]iJ���~����f�)���?��~�)
��3�.g��.����E(dXw8��g)[��r�V=6�e?���6b�YM���c2���L��x	QD���X��u�g
4�W����}�"�@��uK��[t���4m��:��������tF�~:~,M[:T�o�C���-r�|�f�hHv���G4^rEmi�-9��Z�#�P������v�S����.��KU9�����9���L|'�}dU^i��?5���dL��v�������-- �����]����!���5%
,]���M
d�Z���*�*�b��0�'�2��u�Vy�gi����4�iE_X�k�{+m�o���3'SV��?����tL�����lS&rc�EtD��S���7wX���.
d�9r�����h��J9>+_IY�����K�����!��
�\_��-]'�\X������*>��j�r?KW���h�1��K{����{���o
[����fn���K����G�����@������������5a���*^8=�����r�K5��������|����geL������w��M����Z9[�D��6{?*�J]z����4%_���������v���������wj�����-M��N��f��y�@�u)m;����
F��Zb�G`[������9�����g.|��f_Xp�H^�����b���^��o�����Ju9
�=K��~��Z�
X�:���k?���p�-iZT������7���R��o0~����:G�BS�`^k��{�N9�Z���f��w;��l��X��w����O�)po?CUJ	��xB
��Ot�S���zVF�������tkjMi;���!^7M�q�}~���5���T�A
^8M�����aL�H����!y��s�vk����V������&��������Vg���F\�h���>K��t������1P�q����������
���1v�I�����Q�$������UU�K�����:B�N��H+�d���t������p�Kc��^����3���U�R�-8qo�QX���0��������C7����l�U����&a���5��'�����>%T�<2������BD8n8���<����������Z0W*���6j��})��,E0-�p�1 ����
�5`������L���g�h���a*��2,F�+�Nvk�`BzaW���	N�@��\Y�4��+M�v-�G�������R�����]xB9oo;����FQM/��?�w.�{u�Hx�f��O�Q�$������9,����";K�����~j��:�������g���>����Sn��*���q��YUa��\��g��K��c��[��z�m�O&���l���xL��=�U9-�����&�]N#P�0���'���_�
�\D�~���]A�����v#�Q:J=��L���5������=Ji�o;�����U�t������K���1
�P���O�#
J�f��7��%����.���g4�	�
!����Q$k����i%S��pH�cGMK6�?����}�-��������h��ds;��dV�����Q��������dknvN<��a����l�3\��^S�d-���J\�:���Y�1�gD��-<:uT��2X���Ny��Q���CG������-�L��F�1H<����P�����te����er!�2�Z����g��0���w|]i��iX����k�Z�6�Q
��
�k�ds��-
5��x<+��i��]hX����V*}<4�i�p�q�R�aO��oX���fX��RK{��1#�gQ�p�9c
���=�����������YUk�f7 �.O:��Y�6��>#^{n-�����?Z������8?����@��`�l�B����������>��Zw;I�D$$�{�\U�-����COv�Nx�T��[���@�pp����"�v����f�v��R�4
��YX�i��#G++Y�r����`�-�����!�]3���5jB%�Xm{��/W=���*Z�c��������W�x�a���D�_������w�q[E)��1T�Y���'�f�VP\�3����Sq{��KO-��}�g�=(��nm({� p���4��d_(
���e+����Q�rH����|}�+�l'VF�a�����������;B9E�v�tY��^j��k�b���_�/{�����#m���\T:���c$�@��;VF��oe���q~�����;�nZ+#�G]�'���P�{U�	��F����D�G�F�j-��������(X�XQmx�N��Fy_I�$����-��{}����$mO����Gn�+�8����������}���!��hI�.��Y�Opn'i�<7��PL��[���v:w#��I�.�/�s����X���x����RX�$m�c�����*rkI�.�1(9�W��	b��y���������&l�A�.�Xr����$e�um����-!�'#�[-x���h	a/�����q�K���<�����v�L\7��_{!�3BZy��;��I�~f�x��/��������'�]�yd(`bwI��LU�����!1��h_��{�����	16	�m�a����R�� h�a�s����B�k�Q���8G�*��.3�o}7[�>- ��� �%PW	��x
��Ul��;wmUY���aUj�w�~��J�0F��"7=-I���o�"vE��%i��+����U��i�z�g�q��K�����B��$m7���5��p�9	BOV�g��m��9�zv��0+c�v_�����+�(����#{���S����%i���@J#(�p��%9� �|K�@���z�7�/&X����+��~rI�������z���]�B�d��
�)c�8U�)�L��s�����A�-)��LNV��+�~�2��Q�\������[�O/���
�2�����o# ����]��o������p^�����*7��2�S�����V��
C�5�r�`*�Z���7aZ�
���J���L����T������]OR�M�zU{�Y�0�.��j�s�����V���~O@�}S'�/'�m��������Y��M!���,��u��Y��hI;�����������T|K���Ah��M-I'�{i���$i��J���z\C�%m;)
mR���&i��R�'J{�4-�J��7��n87���Qe4�����{X�a�c�qG�0w?P�V��Q�U�#,<���a�������E=�7ci^Wu\�]���"	�!��b����
�Ljn�k%k�m���"E�,Y�mqi�\g*"kiV��UUwjn����;�����.������2�4
���(�����������n%���)��t�p�����S:76���]	�J9�>�t+�Cq����&����a�y����#�H�������lX�2��J��k-���[���j����X���0ou��_en���v�^�n%������)�D0����b�/ZL����5��Y8i�a�`�����b���(�f�KV��1]j���6����r�-�o3r$+��I,�o������PF@$��Z�WV�3&S����T�
!��D�����R���K��9>&o����+9���Z��r�]��^��L>�}�����`}��V��usO�\���D�k�G�p��W���sd�P���TN��N����Q��K�|?<��T��xL��.�J�>��������yh�
'��3dP#�����&�m�9R	�h6���^��3�D�C<���4�o;*Z�����y�Ut"��?5
9�B��'�;��^�4���/�(CiT�����P:�,<�j���m���+L����kWr���U���?�J*F�V�p:��Y���B�xE��!=����~6�z�"�=���)V��5��/����U��lT�K3�N��i*�����S�g�����~����w���&�=��=i�:�NW���w����B��{�?���@�����ghV�zU+�;:��L�]M�{�����8�����b��ih�By3")q���R���P:��+�n�r��O)%%x�U)��
�b�����
�+
��LF!������|?�]����jq~����Z�[_����|5oB���=���$i:F�e�����������'�]��k����K3��{
4�H�+R�R��*N�.��Rn�K��X�����s���7�S���K���[@����{�ci��i������+Y<��76�<��W�~����
�qO���*��0
��WR6�D;\�u2��'�4��R�^g����F���g�^^O��kC�W�a@�n%������
KS8y��U�
������g��*��p�Xj�Uy���X���`3�^�\O��qzS�$�`��,}U��j-PC�a�B��(Q#�a�B��f�F/~DS@�~d�Rg�v�H��|��zc�#y�������3_���m=�o?mH��`*�Zg������R}�	|���h�Z��5c�HC����M�~+C�W��
#�
s�����w�j����)�����C��?
K������o���1���&�0%��|?	HU�K����|_��5%`��4�4S��xc�.)�`x�����
t?N����
��K��+#x�V��������d�����5��]���A�����H������i��W27 s��{�O����y�y��^��1g)�C;�l]�cq����EVn��z�)�����i�(����*���v	y����B-��Z��8���%
����R�O
 U����4^���L�A0�!���<�V�szL�������
��m%��W2;-&�T����+���^7�����O&�	;�?�����C����e)��pV��L�gti�����5dg�\��]��5�!��ew���P��T+m`��DP
=��`"^��C
��;
�
�����~0���X�O�PE����]Q�����	8@���Q���s;,k���5!o�uA�������3QY?V��a����q[U@D��_��=�����X��x��d�G+ux��E�U����6��C�>y���1e	 B�����i��iWb"_)q����F�����h����I�������L��]�c`t1F��|�J8���wj���]�5��C8�>}�v%�7�-N�����]�*4�q�E�b���k�ni���f��8F��JQ&�P��<���(�X�2�&��;��02<���j���R�J�*y������M�8���^g���C������^�S��w��^�m ��������F$��.K��Q��bepq���^��
���+1���%e��2b�4�4p�nUz���&�F_%��k y���u�aWrv�e@�����]�Dc}���_Y��t��)�`���������|�JH8^4������=;���sl6�14	b.Xis�L>�?��p�7\3bhQ��
��te���C��"5Zoy��O>�3�I��������p:�������D��wb7Q���d]������=��w�Lu���5��������&����g?�\4���_#!�J�}�")PY�\���d��Z�|(�J*�U����N�v%��w��YM��PY�
sE5<�JN�$���X��]����A���k����[��-�����v%�n��pXe�C�_Nd50��@��+��`���m��mk`O��G:�g9�\SE��[��&��tt3PN��]�����GO��A���i�v�=�J�u��	+-�����f�3�J��������n�v%��+a��p���2�N(��5���1tHH�5q��!�:��u����~��k5�e��%e�O���aT���u-g����`Q:�'g���y0������n%�����d���}�JZ�4�S��u+i{t-U��_�����R=����4�.r4���
�`����k_�_����es8����W�l�
����)�b�V��S�kf�F=��3�����f%���3S�5+E�	���8g���	��kVr�5}3���W��o�g�����z���iC ����w������~~���ZT�{�J��N�'5�����	�gi�����:��~Sz��N%�07�_�K����
E����H�o(-dtN�0E���*����x�J�\Ky����n�s#}�����������f%������?�f%d@�m�x�J�4O�C4�w�
�����t��=�����v��I�.���,Ui���g�x����^�l�����.#a���q/F3ZK�0�����6�����#`�5��Z�!d�UI���m���<��*9���u����$ ��V�|HI���s�P^���d4NY�v|��#�>�(�Q��8)���8d�i���2��]�3y��x���d����9P�W}�J-Y?�B*���49��(;�U3����5�
-	����
�Xp���,�>��I�{�� l:�8�}�T_���)�����J
B4���4*y0\`/����\�]�l
��K�v�K��p/|F��1%��br""q����{���!���2����k���;�M��#���\$q*��SIY����^MRlK!�?������������
L����&��
���~�J:�O���Tg�N%p�97�"W���:�a�p8(������j�#��e����$�F_���� ���
��T2;��nD�0 ��^�B����}Mo:R��}h�F%�M�`}"
_���3��OI���5*9U��B?V,M��k�u���kT�/_
�� �x�J����4-SS��|�g�x9��B�5����=�=��/_�P�Q����"p���?C���`w����F�����#�@qd,A�?���ms���5 ^�v�q�x�j]?�K�z��`\55Ig&�
�/S}5F%l��F%��QIq`�oEq�|m+�O�}/n�?��R,�Jr3��Ip6IZ�Cc�TH:6"0a$a�
�jo���/������0�q��pw�.���g�F%��K�V��F%�A�tV�!����O�/�w��w;������kTr	��@e���H��]�������#y�vxc $������vRb�l=kH�^Z(4:�6�����
#!�<r���GS+��-�.����}NJ<��u_z�J.W������Q����o�
������VT�'��QIa,9cI��Q	���G������u����Z����d/�4��U����"��`"�H��_�V`������g�P�TL��cp��k�����N%��F����H}[�Lo��
�!f���LN�7���.M�d��cE2U�5?�JP^7�����QOMzm�������q������gD�>��U06��}wx�7��U�������$ j�7K6Q�+m�
�WM�@����alf��cP����1��f}�8��`�K��cJ�iA��q���������@O������w
CZ2+m�^%�N��K�B���*i��U�F ���W�s�x�cre���*i�����
���������$[*��.�N`={�������:��b�3����XJO�@��c�%���S����E�L�����62S��3=�.5�`����L��cz�Y�T?u�W���`��W�q_��4N��<�V�qShU���rXA_����	}�Xy).@�����}��h���l��p���)��F��&39�������w�G��J����DX���u���	������U�U*�7��O�Nh��������\^0�����dr�=}o���gB�}]��	KD�^%��
����������T�3�J�x���DPL�!3��v�l��0����Iv�Ile�_z(�q����$:y��Z,��g#r������T!v�����]RM�&��%���\���������*9����E�siJp+�����v���W	a��"���3�����b`zt
�j�g�V�d�Q��}_��A17�^����
�[���B�����Uu�t*iO��@����L���\@D0�����/��z�\D:����kN��!@��z�Tz����5�o%GG��KX�r�J�j�~&�����f�
�'�}������cS{M����6�L����9�^W����M��fz��)���Y��2r�s���i@�@�}?e/T����w���,Y+�
g��UB��)U\�.q�=����m���+V�W���D����xWWfk�^���J���z�lj9`�
����>L��+�V}�~�\�����*�Hz.t��<����7i����*��[���M��T�;T�	T*@����h�(�&{��U�=C�c�������^���xB��:S��D{�J�p��~.@�gB�O@���&W�XG�,Em�z���T7���l|�J��@jn�#�p�_���"���n@8�cu�p6PY
*�*u�~A���G$�b�jZE���d�5"V����]���8�
�{����.
g+�xV ���OQ9��ce�mC�7��^^+���J*k��*g�����y+UW����Tww���^�����D������J:�pB�P��_�S��7B��~��u*q:x^/ZV���dl����y�z�JT�|�?��T��J�����H��&�t*�d��
�UY�T�������P"�
�{�lz�>����P��>��3S��;��_y�!U�l�
��		J�&�-�6�J*��
pe�\p�SI+@7�g�e�SI�u�>����u*�h�QB�&�j_��u7I���2o>!�����r%�O��V:�t�*����N%g ��OC�N���:�8�7=k���^�Gv�h����J��Zl�+������^�mX1�+h�\	L�h�n%�}�p�>�i��,�J�L=N��0�l�1i��w�w�fZj���QU����m������������)W�<�\c��l������h}h?��������:�l|��m����:�P�<bH��+��}������Eue8T2��u~be{�[��\���t*�0Y��a.�z�J�E<@��V:�\��d��3�J���U��M�����B��t*�I@�.W:�t�AN�hX���#3H���2�������/4DB`�sR6LW<(hV@��r��q+�b��_;���pGY�`�����
vdo�`�n���S����<��.+��vB�����DtV�s|+_�>��;��
�z�+����
O(rj����Z��2�2��s>�Y#!���U����h�d^�\7����h�>V��W�c��~+g��)�J���FB�B���<L�}�V�n='#� ����>��k������ix+�����+S,���������
�z>��:��yCk���~6}+e��_Y���@�zMd5�p�����J�>�;���t���+��Hc[�[{�)�/f��n�H��~ELA���\��JG���zk{�m������$VO���aeq:��R��������j|�2�L��6���'�H�y���G����z��=���xB�W?O�;5p�����3��$�\+�'Z=��%������z��N��C�bw\��|�3UK��&	kg��]�>pM��<�4���N��B�P���	���S�2������/i���� f,�����uj^���e;q���e�l�f�e��xc_��c���
�'6����������������S�J*��
(qrN��E��I���M|���W'�w���48@~���.�1v:07����|��maO���iER����������G] ��..R���_��z���-�T�Z+��#���!n�u�����Vt���3kEW��P;<�����co'�hN��:�<h�5���51�JL
h�9�6b�:�^<���S���D�qM�M^���v�;�V�J��g
��b�&}�S' m����L�z������2gw�����������.8M#��l�854��]����S�nN/n��;p�'7m@��N�����J���d�������������g
�L�?�mV����/N] �X��)��:����|��e�85G���kj%N}�T���HHG�A���Mp��]{��������������P���t��s�� "!�'��/�	�=��8��*�H@N�{�\gg���vu�m��g�l\��n�S0p�i:k�x������zVX�P�6�wr�OGIY{[�������n;���C��8u��g@����GVz�Cs�����3�&���Z��S��+�r����QXh��[\3���_�?M{�;p�]��39X��	�>����k"3��j{�C��-��������kn>!�V�^��j�G2�	�������)u��r$�;8V�����Fl�^X�����'L�TQE&N}\Bt��)���s��zYg6^�C�S��OF�.I�����Wgm/���C;p�aOH���l��;#�g;��iJ����>H���`��
�vZ����l_q�8�>SP��������~�_�}:����f�\
�6�n�.{f��9���"���G
��'V��n���^S0�a(���>��Z�o5|i���~*W�e�R���v�����X�7hE��Wr?��qeS����
��&�O K�S���C�b.�m���]Y]Pi����.��y��&�Sw�j��I���l�8�.���`�2O�zZ�^�
����7�v�BU$v�����
�J�}B�S�K�:��n����]x�3�������x�$�V�5v����	���\���>����
=p��W�H>8;p��*'�+�,*p�q=��������`������~v0*.6�o,��^Gi�[�b"��Y�C��yB�<��|V��Ps�5�Cd�ge�������6��2b�iwB
�����q�$BW.~�SW��1�1�r�5cr�5�����8�v�-�V��x���w����E}��S�3�����5}��S�3m��5��B$$Nm�����im��L|�.|����8��q~�H���������	~�d���o%qj�8�-���ge�������6�6b�b8���K{c���\�C�S�A������N�'�7^�s��~!��u$nyf�n�,�d�sM�'�6��A�������z�N'���q���&m*��W|�Y9����	�`���q8�<�T:53�[�!K]G�d�������/�F�{�]i$N=�������;������f8�q:Nm�S)���h���S���=^�	^,��+{���i�`�S/�\'���3��}p�N���g���>9Y��n�>�ME��I���Y�����>���w8�>�P�o��i+c:"V�>�9U��geN\����'~gr���,�����^z���d�@��Z�y	���+��������e85��z~>�g��~P��'n+�2����B�P�)n�������SS�����2��59!Z��>+3��o���H����E��A�
|�i
��H��(k��)����hO7'���Z
���s	�zo�z�vL2�~.z
��P��-/V��	��o��X���Wgwnl�P�����(V�p��O%�?4�?��>'�*��Kb��Y�g������F5�D�bb�o]S���9�qsd��u.<�wf�������0:��l*����
�z<��n*^��]?�������Lm/iR����^s
������=�#lfbe{��kQj!����h�mP�7l�W?9�F���n����������O�~�����K� 
��mO��Ug��'!U#m���p���s%�\���p��p"���=w�0|�>7u�s�hE?�w���9���������0���T�L��G��9����> ���'����	��:[���V�M�geP�|�{����c|!k���K��ug�8 ChH���	��I'G=	Y��*Hj�0�zg>�	oOG�/���xf��NM���c�#5���@��S�xBZ����3�h���X��`��G� K1�Z���E%�^S�=+���
��0�o�k�5�����3�'f�?+���|�7��T���5p�Q������=+����������728&R�9�n�\��n������r}����_�'z���@C7��
�kZ^7���p�p���[���1�>+[^H?35�}V�,���,�wF�����������C�%�>>���> ���,��O���Zk���1�T>+S����q&����o%������5u��)�����A��3�'VF]��N1�A�ue�P���:��+#�z����D��]���@���zV���w:�O�iX�C~t8���Pt������,�W9�����+k�a��O:�+w>!AC&�`��2p�qF��4��������6P�u��y��[���a}_�1�����W�����	%�w���b:�;��jO�O(�����>��[����������~e�[�I �_v��cN���MO����q��k+�Z�����n���=a�D
�zl}���'2te��>�%���F����{}-'��n���C#a�K9�:�=ch7�>�#�<��5�lvt:����H�����$���}V�Y�7��R�K9p:n=���x����C��jFE�k��c5�&����
�z���>�����P��c3�3��������T1`���!vM��1��@��`�w��r<�����1����V�~
�����WP��f=���s�Mg���2VUW�i$$�J��u��� \M~ucN��8��$1���G��'/�P��t2������y0{�~�B�5 �	h�]#���Y�����N�ex���o��7���2}@ZG���3���> W]V��v���b�yBG-�> �m����r���}*���k�W�>*�*�g�
5����6����l��Q�(n��3��6+'��I����!c�X����$�i��*<s���]Cu@eE�:=+����`�@�w�^2���B���1}@l�U�q��,}@�n�����> ���W=9���������~�DJ�����5/�gzZ��(
���H�M.7,e���������C�`�0�e
����L>O�[����n�=���[��2j�g;�|B��g7A����3N�D��O(b��������5����\�+7��3p�yU��*]��Z���w[�AU4�%�����\��2���=���CO��~�N��Z�Im]��h��-P������z^���ow����U�0�����Z���n�6=�n����+�������%nUW&N}�4�nR+5�����Xul��������@a�����7��<E�i�1�o�S��V�\Sv�~�uez���9�@��8���W���_Y��s���^l8�Z�������-V��U��M������_��?\]9��%7�����e���I[di�H�tu���J��sW2y���m����.�w�o%}@��d��w�[����1���z�ge`���J���w�>t�,:�du���> A��"��a���2z�Mch^�����8u;.�Ru4�@l3�P�cW�tp���p����T,D�O�Z����w��F�F ���}
�n��!?��vlDi�8<�����4��A�2��Z��*l�~gO��W��ry���H��	I��4�n��m�p�CF9W���(��ld���7����'���8�'�p�������6�dw��J�C��-��g�9�.�x�$�f��9Lc���=�`�..@h�4�d�.4z{���!J�7��
[���I�(Q���F N����[�����;q���V/p[�eQ���1�RhV������	Q1����9H����s��^�2hd�D��@�I6�����?��A+�����`���xP����j��I7PFPm���>��
f��f4;���'��C���Ce��4u �J#�Y`w��h����\o�}�g@��!3W�\	�z�bP�IC���iH4!�]H�
ed���W��'4��bZ���������
w�22-N�C��h@��5�&�D{��")��{�@���,��18��7�>����`SOhc��zT�W�"�-�l���>[�h����
����Wr��f�F����Q�`���+y�k-Y��	4�B�@�Z:~�PZ�k��$���_+5��^��N%���,����*�\07@~@��4Aw2�`T�ca�cUh��-�@�:������I#��w�,��r�����ly���hl$�,��k�@��}P�l�[��~�����jj�V�@4��2�Q�V�@���b��xkfT�n�Ki��+dj�3�����������#���a@��|	]�yp�)Pmir��F���}�&e4�
�bcVs+-�;���V�]v;z�49
�[�H��Z�Z�N����-�@�xP�@h��H�zD�>�;�L���!;�8��f�C�`�N��!,��w	����r����J�T�i$$P���w��+#������CwM{	�-b5p8V��y������lT��zW��������-����c`e�Z]rYZ����N��h���+�F ���tZ�bemT:�����Y�������W@�sMX1Lj�4����u�^�X����j���a����&���>W�{�@���_v�ts�=��1]�9�����~�u�2j{����f������Y�x"]v���a�t*:��h�i�S?U���&H*�49�.j@~a5X�V��������H|h����n�S��z���'�F ������Lw���w;�!}�iXm����5k��
|�h����>;�
�-p�YY-��O�YVt���u�x�^#���6��5s���l��tx+.�C�W��c��c�����qe��(�O9����5�.����J����t������~9Z�YDF/][�]���N�.M�j�,@���P���'�j�����<�&�y�����v#Ca�k9���_ �{��t����6 ��Z.8=���g(����X��
�A1��gj�h�
���sq�Lb����O���k`\4��U]f����
�����������s�������Ct5���U�Y�|����[@���E:U���������\�:t�)7��Sm�g��b�w�& K��
v����3��>Tcv���E��	��r��,�����+���V:y�Oh��DW�W�T�~9^�)g�j���G��&'��T�O�E�M���������`�3��Yn�r�BU��<����'���$"zSq�B�����~�e_��KR�$���Kb
�U��70���/���^�Y?`H7����B�����z�������2
&�g:��VY�!�j�fQ�L���j?4@�	��<D?����M���R]�}8�p�m�-� �F)>h�����V����m���b��"�L��V�=-@ ���pZ"�*���u����������o�d��c�k�#m��R�^�~���=-@~����>���p�H�Nr�t%d�+��AH����kRh���v�� ����z��VW
W��c��i�]�S����idd\�,�M���xn��UWF]�|��}���i4�p���nE=k'��S�S�VRj��lr`��5���a�\Q�������a8���\����������X�S���8�P �\��4�x��X����f�������D&�Rm�����q��BBG�`I�>���-�sO��)�>;cC���V;,:-z���i��/����F�k�}�;�>}�i�4k�05���X�
����z���{��g�
��w�����M4�^�g�2!��������l+��� ��^����Yqk�U��_����&�X�2!��Fs-�{ �?=������=-@������'VF�����!��<���b$h��a��2�����h��� ]����}��� 0�_�\�r{ZW�Om`M�{Z�\��`=����-��h&zZ�����'��kR78+����~-@�o�i������B��B��=-@�w��P �������V�L���Y�~�y2�("������lvV�<6d�/�z�	5T�����d4.�H
 y��������EN=V�����J�(���Y�����]S{�=���1=�1jg/�m�e&������|��f�JC�z���������pu[�7����3k{���cn(�fEf����|I�5��!==-@N�[U_�+"�
�zV�Z)p�4g*�
y��;�mt_;�O�0���M����[|�����h��J
����}+�J������H�@ehC��ir*��O��t}+��E2�*���}�� �������!-@N�Xg��8��V3����-,��=��q�`TWZ���+s�=aY�
�	z����?4]����@�[m��t�s>Q��,��
��B+`]K��O�?�W���gkD�������W�"
�������h���wFQ��R�.����W���Q3�.��!���k�/�Nzr�'��B�5v��[]yM�2aP��d����jGwM@Wb8~�'����������T�]�����<M@���i����z"�gWP������DH��
��8������(�-0M�z�@�{W��`I:���Hu[n�R
�+�6����o�<?J1Q�����W�����4p���M���y �s-v�'h��k�by�D��@��l���.��Pr�O�O�`�����>t���F�E��5!"_�<���6��~�zr���}�f
�& ���7�t�@�W�vpsB$�Y�D�H�����[�_jvt��'R�TS��'�|X�{�����_��6�fKU��M���R�0q����J
��<��'�t�7��>p�����p���Ky ��������FW
Z�i�R�Z����~*m�z��r�-�C���������gy�5�:c���R����(���r5\3��B�����<��4U����������(:��P#��'��b��B��~���n��;�Np�4#��H`e�k��`�"6v���,=@
U�dG$RmK�VF���o%��^�d�f��O�M���3B{�H���:?~���c(��u���P��=��M������^�le2�M�O�8Z���V������kv�L�+n�pQ�2���wK
����/�-��Qu����w}+������Ays�<�}R��DN�#�������TLb�m{)��Q�H��K�
jYO��������R�N�M�%AM�R�Gf��\S�g �?n0p��5��R���tX<�y�C��a�rE~H����"��w��� �s��i���dPY~=[�<xz����h{���o�5����1�U��H��+7��5�@�����c�"�r{ �?�sz�����:��������N��R�:���C[�j��5E�&l��D�q���
}�v�\zM�vx�'uB����o%��5&�U8��T ��	���B
s����M��P����:��gw��e����>2��w����-y��AF��
�kz�lfo�7����T����-b��'Rm���8(
@����<���;�)��r|
��i�0tJ�N�>O$��L�x�A;�p����=aQ��J'p�qlT���8V�f�����Q�N���o�+?��S������5��C}D�}��R�vN�zu��`�+s�<��I���S��������S��C�5�;Civ�iuO_��� ��qP�xB�����N�
Q#���w��nR��q��u�B�~t��m�?
Qaf4��#q�s���
zWU'#qj�w
�IP����+�
3����#pj������,����S7���1}^{�5=�O8�m��|+�S��v�2*z��puN�}a>W��D6��f�>(�`e�74�JL�U���zW�`�T�j�cN�3�We�@��r�F���8V��OH�/p�mt��8�������5���g�f���)���f���U�j��*�m��gJ���]���Y]��kH���f-�+�rT��;B��b�#=@�`�C��UZ9^���y�V�#=@��&.��1�����	g������t���
`N���z�<M���}�_�������5"},~+���S_�C0����a(?�nj�S���^dO������B&#L�q~�<����j����S7vI�;�/p�^
�����`�8u[��c�T���H��(+�w2�F��l�+?�gY��s�m�t�L�z���Z�m��������<^�6=@�������ae�C���3�k��Y
'�@-�)�������������`�����8������!�#pj?C����5��
����1���R����7qj����Y������0�[ �S����2�����t9Z	]��SDt������d&�O(@��-H��)�#@��S��B�b�l�S���~8���So�9~9<?�������cq��[�M3"��42U�xP�}����ScPm0g|rV���1F�m6���M%�L��~|fM@Fc�jP��N�>�vj��m���~M=�M��?"�T��Fz���K�1Hw0SG:�L-]��	�����ui�����bP�3�P�� �4�T�c�#O�����@���x5��� �g��\��|�6�MG�Zy�i�p��t)
I�
0�������
4"���y�QY?`���@;f�	7�49�/Z����vs�����pG`
P�N2#g7Cb��(��yF"hJ^HN��(�j�*Gw(�UXm���k7^K���%���n��`��!.L>�;�g��D`����3A��&�zpV�N:�`A%�3���k�T��Z�.
���-��&T�K�I3��Q�X�����@�GP[�Q�'�6�?�h���V������I_��������vA��_\3ld�j-�A���WT~�<��L@���lSgF
=�`�6_���N�8\7���tFT7S����L��Nw��k�����J��h3�?.�����+#�
U�Oh*�;��S|��p�jg}cN&_� c��GHmJG��	��z4:s_����|�9�q��[eZ��Q�kxT(�n&��8���yp���D:�@�}�n#�:����}��v��v����zW4�5��xB�2�3�pX�z���A%^���Z���c���kT�>�������|s(si�����&_Z����8H��u���V����}����f�I�+�h���3X����z.N�L���%��>��k���O�T�zZ�����(�f��O�+�����A�oZ]������v�@��V$����Y�g������G����Zg��cp ������D�\v��
ElfNV\c���Le.���8.�fw���nb+;��R�����5�b��!����Zw�����xM�������qzv����������zB��aez7h��*8~g��������L���LLxm��D�����FT��+��_���!�����*�L�)���~v0:�!w|e�V����
�����a�fV30��������c6��t���,�Py��`��@�[a����?f���.����i3\3��g�a�����)�?0��A[t�2�L�U9A�U�v��bkN5�FB�U�/��c(��y�:e��<*�:�U]����������c���Td5��q�=�}��5|���Q�R�6g�Ay:���v�@g�Y���h����	����Fo2��0m���LrM�����	V/�
g5z��������u���lxp���oMUf�������3�c�<��teZ�s�V/�'�S��z^��us7���sCE]V��H�����a�q���t�vG�Y/�J2@������,��@����v
��3��FG����6�����Re�,�?F��,��}(�?:5�����������L�t���M��+7��Y��	T�3�QQLQk�����~��*���'�������
���Xxs����2��NV&�oi`���+���4��U3�g�t�8��/����oN=��KO�������!'���}�w�>��e�A�JgN=[F���+c:�D�;1��
+�Afbeq���2Y��u���0T/��U}�lr�
V�����`�+|SW�����*me��t��%|�o���>|��Nb�y�W�>�������.X����H��6����8�MV
S������������P��
����5���)����dv���
�zU]6���r������l��ZgN=/��+�U���U=��[@����2��]������f,�O�
xem������:v4�1�p���������,	 :���ZU�Eu�+p�������,��Z�S�����m�����G����M��Y%�rOf����J�J�}Z-/K6���jA�x�C��}�-��kf
�>�Zk����aP�o`�pZ��QIa�!�>���J��	>j+�?���@}7�1T#��>��$��
�`�����f���r��?�fo��uN��������U}fTk$t�������2A ��:w��dU�����A��J����������n���]{�T����q���	���s%�?��W	1)%��������qc�`���r�!+�o�6��ve�w�B���wN}^�}��T
s�W��s������JV����P�m^�S�/��J�
}��S�'��������2�|�������J����Vor��l���P/U��<��	��G����1��J���(P_���rM��3I{y�ae�C�_fKtE�V��N��n�����T=A
�P�n��nKl���K]����Oh�W��N������-��V���?����V���*g���3p�u|C�0��*����\1�����M�����J��E�E�)��t�8]/�[��K���8������I�8�����R.��G���V��.�FQX������P�HL�5#�VG�w�_Xd�r)��bi8�^���N�{|����E��v���b>�
�����w���ta<�t�8����}jGq�T��TW�{��v��������;�G��8u��f���g���|H���Pi�t�~�r zXX�~�D������+#:�%�j�C��`'N������l�*b��S[����i9�2��b��
~)f���S�����<aN�0��r�����~+8w��_q0�w�4�*'j$��w����D�=�go;p��;�������;p�]��M*Xg+��������U[�f��";�����f�Ps�
���[+�_��NV��r�=a�|j���A�)��S�+�6���p��R��R���+��\y��i���G���6t��A;dOp�m�w�"���9h��VW�+c�Z�D�u�N�zN��+��n����a5K��Ru{�?.�L�NVw��������C�So��U����D�-��V�U�I���N�1��+
�3�?��Md�,p��W�17�
�X��n���U����
����-}���yN����#���l+�z'�����2�[i����E��~�;q��S�����Zg���&�#�D|����]F4+��A�:����6p�}4q�f+?aN���������,p�}Q=�F����H�=��q�t�)��(���3�V�
���u:N=�
�!��_���&��^�����~���U���9q��(v���'��sU}��vr?�w��S�&Mx�v��^4�;�r���[���q�l�8aK�<!�?�r*�o5���cRi=�c���B����Q��;p�}���}��o����x|�+����&�S��o��V6�]�����j����T���KU��T
��y���V{B��}S�Y{{�V��N>5j{��E������S���a�+m�'��
;q�3OQ;3���A���q1���J+����2���}X1T��(�j�v��e��<0�����+���t�n�����������w��}v����[�����Tp*&P5uo��S�Y����w?��������[q���S��'d�������w��P�s�K��(I���M��8u?�cL�'zM�6;�)���T�U�@�8u-G�"��C�S�(���R���8s|fX������n�*ww�<�&��a�6u?I�����;�F���j�f:1�U�����X
��zl��k����Z��T?����Ya&��6�G�������+�2�U��'j�3'�6�A^��9��:��������@����T���/#�0 ��T�r����!N�8����8*Z��/��>nV���l�V�~i�����W�(�U}���|	N��o�@�P�Y��K~��h:u��|VF�w<Wo@��������d��g�����t�������S��nSu�m�#���G	�z�O���g�(�V?W 3�I��zV�^T/����Q���R
�5n���7�������T/��C����U����1�)z9T}���a�*0�G�J��50b���q��j#�Q����9!��-���nS�J��7:������k����l���=Y�����
������:������uu��^w�5#�:d8?^�8�������}r���9e�*�v�j����|e��s�C�����C����:��f��]_v��Y������Y�+�AfWx�PM�k�>tXa���da�����G�:�e�@���\N�op���z�N��\	v�(��Tw8���'���r����m��h$Z=����${|��h�s~�w8������g��z�C��~7g���rZ
3��O(��'��X�,��4���(���0�����O8���6�j�.Z����n]�����IaM��@�zB^?�`]�2�������s���	�p}�����{�/c(�?�\:=x��SO�t�X� ��z8�8:����o%�?N��d"�K��@���)7��K��@������C/��kf}?��k�&i
����G����Q��g��
t���4te��a.{V�n57IVu�\y��a����'D-
*�dUWN�nt�*��C��,s �y|��>=���B��6���J�1i�����f�K=E��@�fvI��6�k���R^|z���ez�@�[��o^���	�V�B����6�o�u��j��Xw�t�htV�T�!������=3����y�)K����-�?
+���t�Vb��,��+���d3���m���z^�C����5��u�������D���B�B5:��@����~8
����g������Y����|j��,��TL�	u����lG~e�>���������1�z"���h�B0��U K9Sq*��h���&�]3�?�}�j��n�V��d�]~��^��H�����k�J�2"!�?�(O�b�P��{������t�h�?�������������9u���p���6dX�q����Z2��1�C���N=�t0�j��
��Y9��R���)����L�����.T���9v�N���kF��WI����W�+�.s�B
dog~�������30����F}q���s4t�6�6b�LW��e�Mw���T���
���FB�\����3�?��9F��"�ue�C�SN�Y�q�����q0�V�����Mt]#!p���T��w>�����
Sy�3������Y���2c�;����{_mY�C��N���>��zp�^���g�.������i`e�CPw4t#������m�_8�6e����2g�i��<�`���3�8�����-��8���8��t��2b���}��
�L��q���xN{�����t�����O8uMV�����c
�zNc��U������^B�5���
�B��p��m��h�����;��]43o%XD��M�uL��+�k�r9����b�rQ OY����e��������C��;_z��S������t�Q��N=oe~�����?�V�Ed�n��_�.����b�5p�u4�p���f~�`m���q����*���	����o:<��|�?�@���(������G��|����cu�Ok�fN\x���������ng~��B����Z?+_F,�eu�����9<^����w���z��B���>3��CD?p��q�4d�����W/��.'����W���]��8�����9��8��F�F-���������[�	�f�Cg:��O2����
��a~��^�jc�����I
����MQ;�:">]��Y����3{�#�48��T���z����n����o��V�_?Q-'�8�����92�t�p�;��y�'����NG����cr%���������=:C���n�
g�#����x�@�FM��&�Ku�'���?�������KU
r?J�2B����'�D�N���E���;�AfaV���55����PSg��m}���>c6ue'~������NE�����1���T�{���#��C��v�
1���S����C�����$p������
y�T�����+������&YWZ~+P��*P��t�8>��l9�_��j���`���h44��ln8���Bh�w�nS!T���+R�����&�����|�M��k����-p��9t��3�����^�w[te�����`[c�Nmg�������:�N�|C@yvWF�T$�N��&j����57i%s��=asj�v�[�~�}=?�;#���YM���Y3��W�I���+��a;��}�o[��lC�@w;u�zV�g��5�	�o%p�u���'�	)��N�+��I8�.8��\��;3�.��N�k�4�IF���v&������=^������L��[E>�{ber��o
hs������eWW�����6[�-�
}��:Z���p���"��j�Gko����-�Y-p��Z&���q�.��x��Y����K�_�TL#������>t����8���u\���x��S�3�O������M��x��7�y�>+c������-p�]:��I418�n<&�����|��������[��/��1:���;�Y����8wHyo��<Z���������[�h���
$bm�/*\95��Gm�"E?Z���6����4���t�(s10�}be:74x!L K�5n�����q{;�����py+�qD��8��'�����5�VR��qff�w�#�W�J�� �
�z_����b��z��Z�Z�5c2�i����[I���L�\�w��[��OE�S���o�%N��JS������t�	������:	�6������8��^���y�>�t�8<6�>�9U_���>�%z�p���k�����#p0C&8����D
��8����o�t�����_���SW:����3�Q�y<��[NSt�o;��6p���m����".-p�~\�?��aj�aV��2�?4n���%N�nmG�z�x�)8u[�%����@���F.���uS.O��0s��`��F���G����M�&��1��o�o��Y����<���r�S�u�p\��nm�5�6VWlau�xV��h�a������.�����i�e-�?����uf���z/:�^�![����k9��N���7��}l���4 C��@����fMA���W�S�k�o�B;�B�~Hgrh����6��N�H�����dg�������G�z;�A#o�����x��lb#
�z��|{+J�o;��r�r��i����v��(�>�?8��6;�T_��Tp�
���qp1�`hW�K-��
%��'.Ztet^���i���Z�Z������<���VL���3�&����*��d3.��:�|��@���+��<��6�m���]c����$�
��Y�;�U�����]��L�4�E����+���Q���ZcY��vP L��-�?���F���kP�Su�����Gq���2{�?�s�*x���Ag��E}����G����3���/3�0�u������R_u����a�U�����|
G�Z�g/[��P�tj����U����-��Y���6X'�L4�}�V�3�V���Z4��g����mE����@�[Q��D:�dw]���
�)���1�g���j������j)�,X������'��n����8C��h����d�;���������4t���t�0��r�Z'v���m��,(��u�����*Q������;��W����G�U�����f��q&k$���R��h��|`�L��3��1"�.�~���1v0�����h���X7�N�]�u�(p#5���3�?��Z��0����P�oz�����w��f�h�G��/�a�V?�&Wn����~V���	�N�\3*|h�:Y#���nU;�p$Kx��V7�9��''��<�H]14�z�W��O(��������I�������S0�?�B�<��F|��ckUG^����Yh��o(��g�G��\Z2#:�6�?n'2�u�|�?��w��(�`����}V��i������nj�L������*�����c.h��r�L7�6��"s�.�z�L��>Q]%��p�����������3�~��te�Ct��p�X�������oe��T�����a�h��)���X��gz�8�V��a�-���V���^t�S
���5����NK�[�5��h����o �M��:�V�.D��V7Q&Z�h����<M�F����0��F��������V1�<f������!IH5��0���8FZ����2��`i�q�%UP��m3[Y��P��c���2hD��xbq��D�f_�	�gYZU�+�K����'���z7�=t��V����������@��1j��:[����3hu����;c���2vn�T
���`;���p�FR���������w�"�e~+�������J��M����8�=,G8�Y�����=Pb��������"e���-�.�=�k��$�O������5�LZuE�Xh+���=�����'�k��=��y�|}B:�n���� n+j
�|AF'�������}����TJ�x=�����b��Mwm1���T�1p�����L�a���G�+��k�����X^��M���T���TC�����4�aj��0��P��}�
jSG��@u���B�k��$��b������y
�P�6m�h�Say9� ��V������x=��y����.����WV�O����U��>�;#��a���H�=!��u,�U�JS��k�>tAK�]7\3s�M�F�k�q�$c��n�b��e���S�\��������ne�z�����W9z�k�_��K�M�D|���:9�������k>�g����6�j�[A�dox�9���L�"���i!3cQ��6��P�c���K��q���n���}�i�q��:p��Jz��HlU�� �@�:�C����v����cn����fH��"�MF���=�*[)}���B�����{Us��
�L�}�h	C3� �~�^������a�K���]ks������6���f��C0��T;l���3��L�}G���k@4
+U��$�*��T�z�
���3�i���04�^G��q�))�,i��(#��}���'��
��������U�k|��]����F���`_+����vYq���q�����y�Xb�a�A�bR�M��F+�
Lj/�m4���[i�����������45����g
5
�A:�|������K��t�>�.��(P�0z�v��]6h����n���=O�*���P��ujw��!���<t*�e�>���x|Hli������^A�_�  ���cL�U����nc:�A5	����>�b����h���������5���S�=p��j�y�B<��z��?oEW^��z��S?W@N�Pu������m��">�?�#���J�i�C�CD��}�����5�78���w��2m����>���
���
�
G��[	����p��Ji��	�S�e���~Le�����<)�UbMO��kpB��S��s���6)�;Q�z����UJ��J��S7�4����)��;c���M%7�+iS�$��th\Y�>�.[WE�<AM�<�?
��Ce-Xu���&�<]G�x�4���sM����}@~M
W��>�Q�]O��x�����s�c�5�}��5�in��S�#����FZ���$�������s.]g�$��P�s��S�f��*�W�H�Nm���P�5Y��fZYU�T��C��x���p���������F�?������{�;��?����S�Lv�PZY=���e4����^g��JO�I5��T�8y'�l^3p��/KX� ����7aT���g������.M~W����)��f�����
v�.�n�[����n]�dY�5��nMwG8�s�^��/���K��$��W�y��������0-	���7����O>�9��>"6���O�C�S;"��P�/Nv�Pw[������UU�U6�:����r���
��,Y��	�:Fk1���gL��Gx%��x9-x����JK#T��}�E�A���l���+^I�"y��=0���	��p�;bM�!�4�JcWJ���n�����>~��=1����^s��=~'�&��N��>����p���� ��F���k�eG�J���i��C�3����^1����d~H��
�����|��)k�F�\gYw�0P�����?f|#�Ca�`���
u��R�#(�0���L���R"���)��^ot��?&�\e��ED�&�Q<��)��&�1S�T��x��C^8u/�lR����"�9�j<����u���Z���y*���,��'=����>����M���{��-���U1�1��W������4p����)�p�������)�N=",H�hN�0j�����o�5���15�+&�����+�s��?&�I�TH�E�Nf��h;��M`�����SGCA����7a|����Pv���b��S�)_o�`Q��y��7��O����P�>N]x��r�����d�j���D}	w}���PRO���)��-�k$wg6T���S��o3��"G��?�GzcA:yR�l�j��E��M�&�s�-G��vH��TM
������b^8uj�AW�+/�:}��7�����p����S?�u�)��S��:R��>��H;��q�N�{����������1�A��d�~��C�w{�v��]8uj����_������{��`�0�5N����;a��VN�T$:������G���-T���p���A��G�)�y����Mg�*^Nq~n8��&�,�&����\6^�*
���5�t�d��<We�e���G�����
���,5(:Y�O]�z@�2n:e����-L1���qY8�X���S��W���	���vj	��M�C��?f�II�,�.Xfpm�bj��r/���-�IK�!f�D|�%��`V��z���D�6�t�f]���Ry��t`1��YNO��YX�,O�we����Dv15S*����
i�41��+e������[r�2(�5��	��e�>�S���o:����y���|��������*N1q�\���������g	�H�&�+�PL�:��GF�������?9����I+���H
x*5W����������T2�z��un1�R����i���L�J�7��S�R5_�,���)�'%`Z.��? ��>���N�f��Ss,�	\g��h��5�,&��3H@������?K&�w���L�A����ZN�jCi��4����?����p�-}�	��	���z��c��'(R8s���\g�u����+�'�����}����+P�p��\w��kj��q:(&UL�#0� ����L����������?�gd�,a����a&�pT~_Y8u�H���0������Nb{���N1>ug�{�z�]f|�����"��}��n�ocdv��v��e��+o�2��U1��Ny��W/�,�����iD$��b��Rq;�e��
�fd��{���M���������S�����|A���D��@'��S��7��o�e�����QSkJ1���4�����S���,�.��X��S��Z��N��LY8�8�e��8�T�e;�	��x(B�z�4j����v��?��`�k"�������!QL�c2rTP
o��/VNq�g�D9��u���G��<`zr.��N�h6�tP��T�~�;w���.���2s*X�r���QN]G��Bo��GTc���T-�i������&�b]����1�XO��8����_,*�-�S���G�5��(��S�|�:"
����CY8uH
�T���m�2���8>P���O�!f��Z��e1���*�(4^6��]�=����t?*�C�e��Ju������6�d�����z��S�)���Y(��U1�z�"�UZUgv�����kp,|,5��b1>��w�w2�
~�t?"����=Z�f�N`W�s����Q��/P���3�X<X	�])_N���D������p��cd&�,�'�RN����R>e������%K�r���~�~�c6�4��R��A1��-�
���0n=W�����&���j�3SB0)�������|��H��iK��w���7�X�������"����F�.*/D�IUo�e�t`�e�gW��(�q���W��D=w�-JUB��.�:���U�����j8������w����S"��%Izf��S��f��������|���,l�EF ]N�3ey�C��j��E����;�fCT���B��U5�����9r���6���.Q;F����5t���@a�n�;5��?�=Zn����V����p��U�%O�<��0C�,��Xy�	������kM��ew����\#�so	�*��Qw���U�3�p�Z3,����=Fk�
��;f�(G�F�1R(�!� s��o?�U���N$���%��E�V|��Q�Y�'OAeuU+��*r�v
���������D�a5>���l'��T���������)�����)�Z
��$}���N���R��(�VN}T���!2�,�����oWY��j��<���-�
9r~�N:�b�
�����-X�QTa���a�������,�	m
��i��n�p��ER�{��)�]�1�&�!��n��-�-M;��v�����	T���P�}�{��,��?���YM�z�r�&����L�#T�6�����UN{������ivF5>��`SP���_8�R�K
�N��L��k�'cj��H;"+ZU���~��x��P��\N�&��j�}��r���n�J���p�����:��u.��%}��-�����K��L`F����nS�Z�	��qC_8u�;^���i�����I�[8u�yZ��#_��&PM�#$�ut����S���R�����r�����Z���w�M���-�2�	�n�T]�Df=�\649Z.����ZM�c��������[5���C�v��jC'��}rg���t?3�+1F�L�M*ZR�E�n���Q��L}E���~L��:KX�����5=�>������Q�'U���%�S�=�"��b�B��M��}��}�~ ��E�0P��.�z���2rZ��+���S������|}����:�����@�bj��`�X�����
�'6����]K���a���)�e*���U������y0����~����n�j�S�PN���;m�n�����������:���L�]�7�iV�Q{
�IA�����o=�F�k��G��C��4\�%0����P�R\���U[���Wd-���l�����{��jt������"���E����^��#9j#��z��+|������~�B��'/F�r��o����Sh��s+���\���~02��{kh�l��j�a���17�;`����S�Y�V�g�e7�h�B�����fA3m�l\��.nm���:p/S���:������F����|�U��\��,�<��{=��J[8u��kE��g��{�z>���eM���5����V_�����/�.��Lh���=u���U1�m���w^��4�����~�h�ML�c�9W�(�;R3��D=����m��p����F��f|��pk���XM�(�m85�n8��E��5�"^�[dP�g[8u�ZO'��59�����@
�"�X�M��Yj�A��l��e����T���������g��\6#|�N�C���t?"����\}%i�����2*��L�#S?�4����4�����vY�w�f���\�OX�>���)s���2=�����qAl����p�c����B���������^��Z����c$���WKX8uD��m�Y���c�����@Ivk|��p�����m����k`���N=z��������N��.��v����S��B�b����h�}�N��z����1��C��oW��M�c�����b���~�n��Om����`��a�+��j����>���%����N��S���%������4�����GQC�#��S�PSf(����%��G�T��.*;�-�:��t������0NO��I(�4���z��>@u?��L�#jM���@��f�Q�+��J�&����-g�;DR:C���}j��1�K[.�������p��S��[)o
�oW���C�����[8�w;�-#Q�g45"���N�<�L�����X$��	��T����Jk[@u��,����r?iz��8�)��������0<�7�.j�>
�i�l5���#��i>�k����K��k�$ �2�j(��.�!��{�x
/�X���a�D�2���'�L���L���6��T�N{��NUt����$:�O q�U���[m^������x�i��<�lj�+�])
��w
8���T�{����U����t���"�M��g�Y��������e}��OK�w��%@[���;�^)�p��R�Yz/���d�����"��"�iH�M����6 ��i���?���� �F�v,�[P��j�N_Pu�DJ������o��*]���[����R�$����*Q��R��q����cH���D�b�}�l
O
��t/��U�i�';���8S��T=S�Tt�E�5��&�1�2\O��>����f*�H>+��
��;!��^���	�$��8�fC#�P��]f��,3�_��&��B
����N+a��A�i�����G����1��&�1Isj	�~HG�I�$1H+�K]��?�?)�R8�eC�6�������yv(,����b���?���Pb)������������s��s��%G�u��H�p;Z���T}�Q�C<��,}A��Q:�B�������t�y��(��Tf���7K�mSa����1s]�m�
F�������!y��`n�
U��pM��������AV�z�&�Q(��v�Pj�&�Qvr@�����M����2Y��}��`Ud�nDo�����D��6d��`gG��(g�M�c��*qx
���Q����,��tUL�cW�x'�1Z+��!K �����T}���'�3��3:��6����T]=��K^)]�/�:}�����'�h7��IVP�	��p"-���b�xwP�]Pu�\O�:&TiK;�vD(���\��f�������s��eM�/�gj�U�	��l 
f�U��:�����wn��B*�>��N�?($�b�
--��bT���O�}���HSh�	��d1�B����W		A����?<S�t)<�M��E5o�|�����e����T]=S�����|_P�8�8C���ifC|���K���D�/��1�OXP�X-��s�7A����P2v�s����~4��P��P�G3��]A�J��������Q��~�����M�c�[���	������,?�29����?b��2�R���?<I��>�kn��3����k.��
$`������vO������&�
%k�'$��B��$F�P�gj�Z���$��lH���R��A�'���I�z�4e�U�n��e��(��9�	�6Zn����B*N����m�l�U��Y�D[n����U�����������i6�'m�l����������$���P�:����C�&���yx�i�!����'���,-���R.�d���&����8���h��Bs�;��3d��(��mn5Mq�\�P������^Y8u�|=����s�m`�>bM�-�
MZ�*�c0�u/K��h	���D�5M�������=Z����4�-u��g8�����?Z.r
7�P���eC��^�����Q��f=I�dr*Fk��7t�w%Tg�S��!�������DLfr����*}�N�l����|�#~o����P�!5���S7��m���:�M���ODo��Wg��F�o]���i8��>E�`}J5��p�J����:!c��Du�x�JY1�Z�E�C��u<9�����G�
�MNg�>�oS�����P��!'���s�(�^�a���:"[�� _@�c�^����E���-}.J�DTV���I��&��&r:J��m��>+�-���cT�t?C�5Z�4��(<��F�&��j�O�7��r1�aS @�e��\Dl����}f���	 ���s�p�8_�+H,�Ymh��ab���	�3Zv�C�NA��/3iw��������-�������KE�&Sv�ntli�A��	���/�
���!����-��e�m�i�j}�n�t�F��b�(�7����r�a=��8��K���M�cr�4��������z0J�e�e����� �B`	����s�k�D�LFO�,�}&��wLF������GrYvi�Emh���(]��55�x�4����?�#J5�����O����N�&R =����l������Op�#U�C����n&�w��YBK�C����oo��e�g_��-�:�tH��������~�#�)XX&������8����!���+�8�[G�h��������<\S2��YG<�F���L�#F���9]��S�B�����V?d��\�F��cT��5�)F��N�
��5�h��.��]�2�>q��`8�DP�A�'���mV�Q�T����N����B�rCo����O�:���
`�����p�q�-��G��'qj���S�YX�<����+�h
�W,������h�}�-oa�4�(���"���n��C��a����S�Y��>��l���L��ea��Z~h
*kK�t��d�9qkj$n:�p���k�;�
���nT=�-r�u����#��L�� ����}A�.-},��-�
MN���w�:��4+& 7p?TXq�\�A���"��q��}��� ��,`�L�#T=+F������L(���������S���i�_,5BZ}�\6Uj)!��,�7�����h�����?O������+���$
o:C��!�S��k�����;^k�h���7:�p��)<S�,u���{�g�lh���a���$R|����S&GK�6�.cUL>�@��R���i���!�8���$�ce1��z��hcC)b���t�����K_���uUL�zr~1Z�x|������	EF(��/�:$ep����1m���)��E=���Z6�D&v��'�h6Tw�,G��s����������ar���U��3N��.���b���Loz"%�C�������CY�C3���`�V��d��1�c�764#�oDc_��:��RU��5M^���|�@��0S}���"����5X�|���+_:*�7aX�/�
��T�x5��@����T���GK�b�C����RcS�JNfa�#4Ox|� ����V�o-�m2^�����l��<�UX�x7`Kx�����P�]x�p�����0,�z,�L�%C����"E����(��-��9Mx����&V��{��i#{��	S�����D@:���h���)9�o�MW��b�2^�X�����!�\i�hZ��fH��n��z�FK���7���D�l=,���f�P��jy���Ye�%n�
Y�L~���n=��x�k���NU�S��U�hn����}���/o>(��g��7}F�V�����./�{KG�G��p������_>p+����� �[��[�)^�B�Em�c��_=�iQ�Y78����V@���7��Y�X�!	��^���&�7tD����?*��dNh@�dj ����c-
w��hw*F����c]w�(��o�~a�1Q|�@�����V�H���}"t�\q��[�8
	XJ7;�u��.�
����hJX5A*�`�fG�5z
�p��U�H,��A��MN�����e�sm�:`��^1Mv��8M�I�p�t`��\k|�M(e��1��B�'����s���������gOW�{~�������?|p��o�s������/_���������;�����O�=������g/�������/��8<||q��je����W_�8?{4�����������`���j��a������7:y�p|���1�����_t����g�=q����#����n���#N<���1��<�����������|4�����O�o�>��:{�������J�������������qu�d���zD������'��?9��t�sX�1U��W��W��1��=����������������s����������_���������{_�]~���/~�����C{4u7M�������~���k����Wg�������J���~������m}�����~�����_>3��������v�����'�=>{�p���s��������qf>:|8��G�=�(�0~�_~�����o?��/���{�/�
�������#��Fx��^]��+������<>����W?�~��������~�pl���?���O?���/��uq9v������������{��wO�.�������������{��������}x���O?�������_|���?���~����}���~��������|���}������?:�������~y���iO/�������������?���'g�W�/���/^>���/��b���g/�^��h[R:���%�����=�[��|�3ZOS(�EWc��}L���_��<?_����/�:��������������9��3�����<�ss���W�����qvy~����x��C�������r�w�y�����������t%w>�z�|�����g/�w��s��;����/������t������7�;����1����^�?~z��|������}���on���7����������������x,���������8:���~���7��������X��RK���������78���h	��?����vvp8�j��o��pw\%�N�2�i��n����Xm��qo�e���vpc	���b�b,�n���O[�S��n*��~V��#�Jm�����X�A����}����X��6�!�L�a���������Y����k�S,��b����|���%���,�v��tw����"}�|����<��8�5~���?u�~����&������X,�|cI1�OO�bB��S����Zx���n}���"�������K<���������������c�J�-�r����uO�W��{�z�d�j
��N�r�������k�r���?
��"�s�s:����^^���|�������}t�X&�q�0��XZy�X�\�?���^\<��{�/�>:���o,~gM�d^�����Q�V�{h��\!]�g��_���u�x�Xn����X��8�|o��������|����h�g|�����
�[��������}O�e��M�?{�������9�o5��K9�G���'��|q��zX����8/���=y����������l�`�����N����������nO��������^}6�2�;��7<�=�;������oh/�#����j}����eK��^�����Nz��|G{9]���r���=�����t���������9=�������%����}^�����j��'��yy����c	��%�_g��n/���{��|G���l^~�{��|����7���v_7�_;������e��x�F1�>����op,{��Vk�m������>���h;�o9���=����w������������g/_<<�5b��W���h/������WcV^<�>����e,����8������/^>>����X�n{���`��}}�}���op��vc!��-��5���c�y�o:��0�����������,~y}�������|��|O�h=9���Z�����O��i,6/o�l���2������y,g_~�����q
|���S���r;������_g������_������o�7��on,�3�C{����wu�5������?�~.��
o����D�����������_�_]]<���:�����i�l_��W��^��l/�~��W�	\�y�=����w_�m,���}�5��x�<�|~����1��G�����f,_�c�����
�������{�G����|/c�/�����������|����y����g{��w�����������tL��������O_�#�3�k�g�=���G���������������#�������|������v�������]�6��,5�co%�������9�g���X�)r7%2�q���f�\�5�ncy������RN��}�!��)�������A����yyS����@�g�gvi����s������]������E�eoJ��gj�n�s������^��~���9�'����S�}�cy=��Ucq��^byF���������M\�7�������76�LAy��<��o��m���M����s�vqQ���|�X��[��_�Fo��-��Y��+b�o1�����6qQ�&����y�����|�xko�����E���s�����~�vr���p��&N���Cz�xko'�m���M�����6���&.���u�r�&������g��}�8ko��-�C��8��������m���M�������p��F~���������p�������[�[�Y��XR��������y����7�c��9�����7��%�YJ��\������?��<�x������w��pgJ#��wwg�����������������������o/��:�X��p����������I�������~z>��_���o��/�|1���WgWG���������O��8t�p<��.��=?\��?9v9����y��4���g/����^��]>{zw�M�����oO����������$�^����qF��{���~�������j���h��;���?����~:���~���?���;�4�w���������������������g���h~�'?�����|�p�,<vy5������w?�����g?������/>������W��_���>�����?~t�]�m�{������?>i�������sh����O>z��Gsj�o��?���'���������/~0���<�xqx����r~���|gN��$��q~�C:�����;����G�_\^��xrx����<���W7f�?����������|��G���=���{��~r�a�������b|=c�;���;������~��������e���/nf���_|q���zg������+4,����Fo��O?�(�Vw�/�_��f���|qq�_^9��?�P�'g�������e����������<�����/�|�����������=��|v���=�rJ��n�
�n��=��G����a|�����{�h�����_�s����~��~������
������gO�'~;���GG��r���������;7��zr�����	i>8{������{~���K��_5�_����3��������W�G�G�z\��8�yx�������0N>�������c��t,����������{G��x��1�=������f]�����=��f�����Zv�b��O��������x�lg��������8����'��J�������/�\\��'�������G�Z�_��hc/�8�{�������w�������/�x�����N����|���R��5����+/���M���x�b���?3c�����8sb��K?��g�/��V���qgt1�����*?��������-x����yp���|x��;��������ot��57�?��/ggg�v��V��h7�������=������^^�}y~���k����1Kv�������bJ�/����8����"�h8��t��z0��]��G��/���O��w�7�/?�����+���}��O�e���w���������Q����4����C��}�������_�������0��9����j����������~<kS|�?����~��O?���O~�����6<������#���M���7���c���{6+K\�x������+�R�|���W�|q�������������2.F=>���ht�����wc��=����7�=33������N�"gO���u�����h���o@���l������������'O��=9�4}t������?xu?��g73>,��r��9]}]���O������;~���'���������O�����N8|����<}���.��b[��������7�s�����-|�r��
��������s����O}-������������������S�3;j��.���������>9������{��%��I�U��tZX2�+�W��i��U=������f�����g�
	e���`Z��{�f��w�����y5c���
�E�W��m>k�q����Z��J�K-��[��L���*�QK��f��VN���5��O���Y�[�NJ�L����VN+���I�;��a	�G�vZ��/C�����X1��5D���1P����{'�y�o�
�����qJ�_���������b���~K�0��k�t����V
�����_��J�����[������iq�1��}~c/�uY�YZ��S�~����,U�`/�{/��K�N
��{
�����{[��*`v)��oE�������r���
���f1WiWV��y��?�����f�H�ju>��o��,���tR�<,{��Rj�]�z����D)�<�c�u)��T-i;��8�����e/UK��q�O��y��^��8y9����VQ;��q��g\%���+�;]�������N��������s�K�R��7�O
0�{��g��v�����]o���s�N�_r�����^j�/KD���&E�G�Dv:�)��j��K������{����|�*�JJ�v�<�O�����+�����^�o�����m�yc/=7�}GU�:-{����������2z��!z���O�%/{�c�d�w��;�����D���N���]8-r<��y�2�e/#f�}Z������������:�e/�����Y^�R2�q�yw������tB���9�����E��A�Gy���t�]������������_u�]���p%�����O*_�;,["��,���3�q����:������Z�&�6�Z��%I�V����*
}R[+U�p��>�b%7��r�w��7���(�h���+n���.,b���j��#���H��E���t]����qZ�.E����1�������\n���7�q}`3#��
�B��f4������Gk���G`_����f5)�;Dj5�P�j�y:���hf�
W?74��0��I�:��j�,���jX2������n�y�
zDT�������[C�M�[��=f��Shf9��	5�v�cl��������������p\�0��Wz/�l[!p�A�����g��O�mv3�=���-���M���#��B�����)������?��~��e��O��{�����;m7fX�����"��X��%`N_F�����i��=��������`4���5}�����p�[�P�jF$���[�t�r7��E�"�H���:A6�(���{�Y
e�
8���
���e���,�������j�'����A5�����*q_�7INm����#����\d��[��{�l^��G����b�#�,��H�]WW���t�u[��R�0�3�	�e���8k�f>%`7��M)G��f>!�b�+ �
~�TmE�����a�������������3T�|� ���OY
�c�bA��O����2�z.\��
����V.:Z�w�����2K��vpE����=.��r�})��[��1Qf1�`&TJ�g��t9@<�4^��Mg��(k����8�:��eA�)�O�g/��8���v�"����5E�������g\��&f?5p�\]��i>�G��dn��|P��^#���
�Z<��!�,��������AqhN�r��07��Lh���0�����qE���-�8����F�T���YP�0]a���f�,h����"<P�=���1?
���AK��fA#4M�����-������%�S�>k��������,y|g���P~��&��:[�Gw}d��g����+��p�1��=V�m��_H�p*ci���O������PG��>��3dQt�0�C��-s�`xrK0� !��E^���6.�����0+�K��c�Wk�.�<��l���QNB�1�(�[l��y����4�n;Q��O��'��w�G���D���d}�f?e\��W$��<w5�'w�M��l>(����k�g���5����l6Tg(�DQ
�����=;�0��������2 ������� ��9���4�@�*�a�U��Ss/5���dF'V���������F��������"��h�������<�5�
*�mi���������f?�{3���n}�����\xs}*����,e��rK��������*#6m�5i������@�0c��}�[��1��R���J��i^7J�LeA��]��2%�.a@x���������������bSG��
J���
y���GK��'y��=Gx%n��}.�1�K�t��C���l��H����cU�:�Y@gHVe���e�zf��v��C�-;���[���S�k�0�~�0�����\��sk64����!��A�����
���|���P�,��/(:%�����k�ZPtvz�(�b��������Pcf����A������?���@���q�G��b�N���*��<�(����%hK��f��G�C�������p��v��Q��^�V���SmX������'.~��^O@�4g9�@���J�+�����89�A��aMt���R�hY�g��5��v>l�:�bA�6�+y0HzV���>"��q4H:���l�$/�� ��sUP't
��4� �X3���|r�T�H�|g���B��76TZ	�4
K*>�"��O�1�8�"��=,@��V�N���2f�i��-	��nI�Nl�&b1�hK������y*�>bIN\������AI0@:��l�����:�u��������g�L���B����]�`�t����6/�s��]�{oU��7YO�.�q���g�DmaA��B�����L,��$���qB	[l$�)����]�S��l��B����H���4��f���s3 �E�����,D��&,�����]�	��K%]�����v�u����N[.4n=���������V��X�<&�+6:sH���]~���l����qA�U9(d��oopt
7N2��h��S*t@x,
w�
"^qCQ8:,8�����	�r�hN�C���o�����N�!m�a��r��1��j�n-�����=Z:,
G�[s������4���gX6��0��o��$�	U�+�0
�)����-$�����
�E,��
����xW!o��H���Ay� �G��!Vmi�������G��$z����}�u�2y<�a����:Efo����4����HW��I�
{�'b6����i=�e2S����������P������[W�+Ei��l��� {DxS3KJ�lA
�`\!�3@:�\�5�?!��G���nO�����������[���.����6��[����!���)-��� .(������%�=_s���gm�6w��d=0���g4h�p��u0D:��Hp�	Q[n�6u�������0JQ�0��<_��dL+���<;�H�E^}�!�#~l-�6t��'�t"����������$`��w���#��&\��GD"v0��z)��,<zx=y����M:�~d���f�kF��bp���k8Of38�XmI3�T3���$���B08:��k�(HFk��<Iu��7yYM�F��*���ilj�����:!�U��}n����x�M����sRW��h�]X���0`1/�4$Yxt������/nb�.m�� �5y�(���n3EMS`�B���]�KQ��j�CG����BA2��q���td�{�Az9��Ee�!�{�o4��uV=@#����:�tJpx��d���������{X��p�����@�Z���C��w����L"���P�6�Ky��r���q�����:h ]���&�T�8T	a���[d~jC�M(�qA�m":2��F'qS�����0B1�~����x��:������<}��G%dD�YE/��P1�e95c#{��!�a���sO�r�d����>�����^���r9����6�q�P��h������)������dF�I�MN@�h�e\��+X�t�&����[���C5f�S��G��e9��@5O���1W�v�1b��;w7�]rN2���Bv�c��l�.l��P��!���xI��t���IU��vm�@���LR0�Z�����PUE�N���I�����qA��:�>*�����8�!0��&��h��L:Re�[���$4�>����$�u%���WT�!ScB,�D4
''�-YF����)$�2����p�SX4�1�H���P�hLHc�IA�D�UTac>�W (��i���,'%��G��^V��p��A�
 �4�)~$A`GB~�;K\8s�'���tQC)#��u���qE��:1�#w�h���#��3_Y��p�,P=�S�E[o����c8:��G���$j�<50�Fs%A�|UY������v����
gv���s�,%
[
gv��5Us[��7}����^�@s�&�1y���P�))�R��b���u�fC���s���bC%����m�4	5.���&H���TDi�P��zCUw�$�'.�9�V���j�p����2�[�]��u-n:!(�P����P��K����h+�j�gW�M��|��7��B�D U�"�p���B}����B��{G�Hz�W���m�P��(��mr5����(:�-sg�r�ksNf�7z!�t��������=�_SS�&��B�hoK��e�34����k6U\ s&��I�z�f������s~�,5T7��C~��J��7�'�gMz�kB�&���������k�[<�
n6���'Z$�S��,���B�����S��l��8��'J]����g���S�5X��3O�7��V�f�0Mh�X��$�OY�����Bm����0��I�M��Uo�G��m�4�~�JaM,
��'����^+%���6
��a�+fV���`f�S��DD;��]SPe�0��?c=�����;��$����p���&"��'���[���;7�D^sp$HL�G� �V&qa�N2/*8���E�-�s�����(B�L�zvnK��
]#��U<���@U�8�<�K�����,�B�STI�
��NfZ����r��	�j����p������,-��z�F)p�YeI���[�H���������B���%P�����}�p%��Y������}����>��K��z�pj0������!�l����GsA���9�I�4���V��aM6������-�w�L�#xD�;1 �G�	��)O��I����u�-!��W�t���w4*7���Y�|���~[�c�t8f����w�cF{���z�O�K�\>h>��<eP��v�/�G)��#���<�!���P!i��C�f@�#f��g$ygI�z�}dN}�J��d�V5 ��diiTBd�,������p���Dy&�,6���/��QR�������!l!��D�2"�������!������Y��=��]@ts�<!'���)m\��T�J�A�g���PuJz�k������fD���[J��P�w"BIIFz�����$��7��8pi�Y��M�cBe��>u578:�i��>M������c�d�����pL�%�+��[t���zF\��F�~U:Q�h �ej�h\ahtt�R�-I2m*�j5�0�\Zn*� �)�`�d]C���T�	p�JY'�C.	��I
O$C�����Z�J�o:
)���Uy�M�F�)
�*��!e&#>O���P�P�:C5�����"���#z��C��g��M��'\��FE�&�B��������0���p0$��t�x����_GO��d�L�OuC�|��a/F}����|��\��F5�=���%-B��h�!�q�g���,��$���qj�<3�enJ�!>������HGW�F�g����Y$��������?������{N����:C�{����2��#5�Z�t	�?��'�J�{�of{ �3�X@��5�:�����93�y��%��n3 ���D2����X�jmz�9��m�Y���A�1��P����d�t���T���_)�����5'45��h���?i�=__!�J��^�6�s��\	a3-X�����z��K�1��I�H��kQ�7_���$�;+�����I��6�T|@�m*Y6D:6&XFT�J^�������Q����p��-���$����p��u$98y���
:K����	/��p0���������	$iR'��"�!�G!Oq
$�+����(�K�!�9�"=���K�J��Q.<q�1/���&T�l*!��O�*Hg�N>��nf1���7��;ii�_SM
������y���J�W��7�x���H�u6H:N��x�w&�}�m��!@�i�V8p:Ly��|�YhO.�=��|p��%�v�V:�1O�@rT�������m9��D��#�s7� dC����S�AK6�����R�:���B��WH�����F�L�@�!-�e�t����b0�P��&��"��{�@1"')No��7����H�����><���"�����y!�y��j��G��&��r����=��&�1�7u�i�b������"?������
m	:�F�9n��;�	=�7)���j�mgl*�&�1���[Pa@�Q�A�)'\=��Z*�2���!z�p�A�6����)�2�5k�fC��\�T���v�l*��*8N�Q$�
f�(���|O6���3���[T1E��X�������4�5oJe\�4�H[����8�R�_����(}��4�*�����fA��M�����u&")��)�4�X�t���(�,F�.]6P�_��Il�]��6)���Q(����Iq�Zl��xP��8J���W'	�:(=�27� �C��8��|�R��Q�M�c
�"�WG���Ggi"I:�u�Mj�p\{���}a��7d�1���H�@�
�����!v����k��QA�#%'Z1R�w
�w��� ����R�)���wj�v<=W�����[�H|�v�0)
�#=�����gPk�8�S�C�s����l	��\�d]b��V"]���vUa�M����+!5+7vNl�����|���;u^j������af�Wm���dyM$8��-7��5Z'3�>��Gz����w�h
�>�^���0h}XR�5Q�Aw�@1K:�J����McIgG��'�un�u����4����S_N\���1����=,�d��tMVRs�=i��>����
�N�O��J���7%�Y����O�����13�T�U?}��f@H���;������^)�T�/���"o0=�l�H���Ve��k�_��!�g1�[������;^do��;!�@�*���#�+3�X��k��I�����A��Q�TnqFPt�P�]����q�6�������o\b���������$�1F�
�4X11�\{�Qgu�q��(�6p�)���`NM
@+��[L�����r:T	����G��bp��������Us�M�G��E��BviG7�����8\-�����6\�
��
L�����o�����^n{�*�n!#�hx���q�XG����np11�����a�M����C5NM
*�����"���oZ�<������Y^&����`�gv���B�/����)*�^����@�I��E���U11����^��qNn��H�-Hhm����0�;�7.��
�E�����N��V+RRA������2�^�/j�������� uI��B�[g�/Y�M]��q���H��y���&h�T�6�<���$I������UL�c��Q�MS����qxX��-��nbNm5:V5��@����d4gqu���
��\�Y�(�7�c�V��s���NW�31���a9�N�8
uC���bb�O��5�Y11�H9�zA�pY�|7�:�jB�+&�Q#��3�CP�bb�u�XMa��qT�=�%P���{�W��a�u�j��5���*����^3D
�%���4����V�����N�)�f���%[j��U���GSQ`���Zk��P��3zL]����?U���<d�D��T�SQ��2�^�|A�3gUHTq���p8U	�;���|!��j���m�c��pB�]����z!(��g=[7
POGK@������:ZD�M�1�����\��w:.��
�A-�9u�t�����F��9f���L�d8f���A3�]���A����/��"T��[K��c5wyA,b��Ji����*R���*��G*JiTe���������U�OP6���>�J J8�6�9�����F���X��u��>/l��A��D��`���p"��:6���z=��]e5�
VU�� f�Xn��E��� �����a.S[T^�\�s�F��G����"��/�W�����V���V{�����T	�����W����l��]��Aw}��J�t8:+��e��i��p�����t���w!�%N�bs�����Bw� f����SUQ� f]w��%�Fr��Z|`���$������X2���TuyJ��P����T�xu�>�;rw<���|s]����,�G�����dk����p%0��S��Y�Y{\(0}0�r\��pE��.���U'���@�D8���q	��n"SZL��%(��i�3sx��PdL�Q�'e_�F���r�&���������t�&���uW�\o+=��SA�-AG�)B��YF�\��V+=T/�N�d�e�����������
A8BuS�(t@5QC[�������P����QI��(�����Y���%�E���p�Be��Hi
���B�����j�� Gi���V7�)��o�[�����#�/��|���.�9U����Z�q�}c����R7G�������\�fA�@B�;�vS�����e�F~V'm���w���S�����Y���Jg�n�g�QV��|���.����[�)��za�SU�av�*����|E�ja������2Om���D^�P�vU��Y']F��c]o���.���
��8�b��G�w)���6�@%E.	�����(!��S]�������Be5mT3[�&�L�Q�C`Mv>��3�%���'8����)�P� ��[>]��������%���ES����>7��N^}��aCM��r|b�NM)��EQ3�\�k����CY�0����p�w0�L�.��4�`�j�y	@W��v:��t[oB!��Z�����Ok��
�$���&����\[��i5��*h���Mzr�����1V��<�a(*�hP]�t����xL����x�qSra���g����P4��1�����D{o�-�1o�t����m�Ut�n�g��p�2�x�����G�Y�LU��nR.W*���{m�g�r�	w�$�� �W����W�j��gx�@
�k8c�;+��v��A�����cz��xA�����&$M'�nL�����I�V��.gByb-\X7�*�j�i�f@A��S4�\�$��A] 1�]��!�a�M�h�2(o�u��p�}���	�u����]
���"g{|'��P�Fz�'��z�w��7�M��e����L��P���,�"������t8<�T��CI�M�cV�SM(`k�`�t8<�D�X�B����'�"b9R6�Y�E:Y�2�M���Q�����M�c*��t<�b���T����
����}dU���h7H:Q��P�^L�c����a�UA�?���a�	�� :�T' �M��m����!f��8&MKk�
��L�#N�wH[0{�KK���,��x2h�fA�>(����6!����-)\$Gg��8Z��D&��C3T:Vj�F�kI�j��c%������J�����9nr��Jg_�Z���LM��0�>�`���:��� �X�L�3�]nJ.|�"I��0�R"���� ���f?Ti�_i���Ka:�
c�>�~z�e�/����ol�z��(v����5ylJ�m�I'�!�J���06�����I6��L3Lzd5w�A��8���_h�$�*��Zf\�>�Z��s��SRP������;P�HjF�N,�X�t�b�� �Xiz�z�kI��O��S�r+j�GH2@�4��m2��a��A� �)������f@���h�r�7��St��	�������*��Q;���` ��4�Y7����x�@�V�i�QN��=9��*��(�gi�z�O4�a*;�&�'^&cm�����h���
�N}H�m�������(���pD�I�u��N��N	���a�i�����^����\*���y���l*
��t*Un�&��1RW�%�Iw�:+8����L���3�����-M��44�������Pv�&�	wP�H�`�&�1������L�f�tJ�r5o�i�I�Y�G�D�jL:C[ �zL���S&�g�e��p�$^R��������<�?mx����p\U�d�6S�H�`�-A;���mS�(��D�}�i�
��������4�O#�����u���_��Q1.�?0�kY��-V�>5cr��G�2�9u�����N�W
���Pt�6!��"=��4�� �DIb�`�d8z��?�4�z=����d�

��:E�=��e���v"������o���H�]Nr�����v�^���S������>�JjI�
����N3D[z��6�tbQX����W1����oG�����:4{�V���$=SSun=<�����X��%�)���-H��|R��kx��pL\�����z��=�F\R
�V1�f,�q���$
�m�������EFd����pL�&�-�x�	U��������!�\� �����+
z�>��`���FQ���4�
U�������p�@��A�M��S����]E��!�ig��U��N�u���j4c$��2�:`�zh3����N L(H���A�aU2��������a���#;M�M�c�u'��A�}7A�XPi����q��#�}������ �3��K���:�1�Y�K�N�t8*H����*aW�t8"����&�t8��
d���:��[Vz�0:W�M�#�b��zU����1AiY���Z#29������GH�0�o:�h�n�w��������y�I. ��Q5�;�F��P�����hw:���#2�Y����j_0����$;�������,g��x]G�F)�pY�,,�u?j�C���0�U�j�Y�3V�r_t�.�v
�BD���Rf9Z�0�D�dOi�+/�9���9sp�]{\�z�Y����=�`@_�s�4���'����Q#�VX�R����Q`��9��	�7����0��q7��,�n�V��1Y���N4H&g��h�s�U���������Zb��|�B�[���+��I������;LG�{m�2z�j��J1u��p*�<�d�twl:�1�sUu���G�9i<�m��<�������������H��(�0��k~�B�����j9��#��|\�s�)�_MN��"2E�t8fM^��_h*}�t8Ti�*V�s6�B�	 ��j�e9���
q�����,g
�I�7������,���GJ�HC������t���L�c�������?���g)LyYiE��<���4���-'���&�E�����/�o(\G��G���U���p�0B�1�����pL�O��v�~c������:�5<�b��X(s�$m��QWn�g\��,�'.7��A�;9
��4=�oB�'�G	�%�&��
��Gv���h=�X��[�!���jUzQ�������U�Q����Za��7�����1�Pdf��}j,(��n�]b�QT����l:a�q���x����B
:V��6�y>��2A�������W1�9J��
c��������c��:�uE�~�1S������7����i�G,{���5]��+Z���� ��z-&�3��M������&�����"5�4k:�
�o*�(_�����8��ic�g�:C�])�������M��: ��Jb���M�W����'�&�1Y8���������q���N�Hv��GY��r���oBLU!T4*��*i��mB�B���c����5o�z%%K��}�������,���{��O���H��0��5��C��������K��=*YZ�}��)v�������FH���������%�;�<{��^����y������g&�=*������c��\�{9�	���n�x0h���0�nU�+^C/����Y��Q����I��8���@.���F�f��4>^�����Sq��k!�|�
�;��j��7hcY;�9K)U�&Yxu���)L�P���[B�����N�AX�jV�����IL1V����d$-��1ME=�I��������w���6e���W*�$L�c�9@�l.���is���a��u�F�vr�#1hX��Z������H�U��V6 :i]��R����ss�l��*�Fk�t��:���Ta%�yw"�����;��
7��A�*��ia��t+�R�b?	M7��s%�[�
OD:Xg�t��n���S�����&��<�B����|n���	u8V�-T�I<��)~����������}��T|��z*���Vq�;�U��M7JP���>6d�q�oO(�����p�q-w��������=��!���M��"BQ�R]'wO�P����/���;f"M ��a�fI1���Tv���I� 9s�l�������(���Nx�;�������v�O����Y���������*�S�V)��f!5
�v�B�S�
��S�R�k,i?��$0�x�����kA��iwL�4]U�4��_2����S���;f�)mJY'�g�&�Q*��)�@o�S�H����U��������/*+b�zE�a��P=���)4������^@D|gp��vpf�����zG�s}*�3�_[.=���,.8zc�U����@� ��5�M�c�%��T=�O�<'$�����x��P�7�9E�����<B5��b�^
����j�5��������<uo�����l��
�j�zM��������n9��$����d~�!K������;n������T�A#.����F�����@O=������.az�fJ;���z�q�������G�u�6]�yE����jz�	{xj�%*[�6�r{�M��3�;���Fx��V���N;�M]W����4������ExLlU;d5�	6_7-�?%�������{A�Gf�\�S���>����h�PF��6`	���XC�m0y����1S��M��H|t��G=	���1
Xm��&}3������mc��XT�c&:��"���������������Z���y5�~���E�uqp�O�����!P�I��f�K��V�P�"e\n��uh�%Y����r�F��������4�,W�H��ob�4���5��9c���Rr/2~V_��=v?^�+���rC�����c���
u��	����X���q��!"������q���mN=���1S�TY����_�d"]VO��3l�wl�}��w�o������-%mzR7�M��k9����]�����mn�=��=h
s����xT�bB���kt����"���3��"~��;y�&��'�����D� |81%�h�WG�+c�q>
k45G�{'�;
�����_
��"��
����)Vo����
�`��@�c���c��"D@�5j�{��������'	�`���/$���F�T�6���s�a��U[�0fBnqBud�7%��H���qNmR ��y3�,"�0�;"�v_I<q��G� p�M��&��6���4��?,�������"��u�0�2���7I�H���L��~����?d��E"j(�Opo��dT�e��p���.iY����/�;;T6�n^���Z��e"�_j�4U�T��yp��li&%��"r�fK)S����M�l)e�U���M$G(��$P"z�C.P5#@e6a�N�A@����
D�M$t�:
�'��V��}g6���(���$��h-��T/(��I�0=�n�Q
u�\���I���gm�U�
���Z�Y�{�B=&��������L%C��P�M��
�;�8�����!����������V��/y�q�I�������|%�-�!�p;�\�"}QUU��zo�w�-��`���{��}aFy�36XH~��	��^-��]w���S����j��w��z���G�����[`�����)�V��S�����GdX����S����s�{�Im���xM��CQ�q�w����a���[v`5T�PU��.�V�/��f�I;����--%����R�an�c��IS��^�F��}�Q���ua�N3�G=8�AiD�0�<�{4p�n{5�4n�jBL��o�w���Jr�8���a�w�t[�0<�j��;�T��5������x||AS�����\C���|���<�{F���	;�$�Y^���@�6��������������2�$�T���'���.�����@�h[`h�0���0"�y*y|���)z�BRkE�3����IGYPS�|3�.�AP��Uw^��7C\��_[y�K�]RJg����{�C���?;ny�K���B���h-q"�p����p���D��-��o�p[E*�a����nIA�G�)FC?�g�A�8�>�O��E�n�T���9��ACL�H��~�n2����3C�w�Y����o��0"	��GT�>��|���6�ybhk��',Ie�l/Y���n��c���-RY������n!��,b#�V�x�V;#'Zl�����n�������}[;2�3�W�_��g�������^���o�F:�B�V���{�^�	-_��fmG�^��0}�Aw{�h��M��,T1��<�h�>�Vi�RmI����m�fbFt/m�����<1TX����}kbd�P���Y2��J�FJE^���#s-�\�$��h�#�f�j����3����q����C��4���m!y����1=L��	���nZ{�qzb����g;P���e{����{�����B����}��t���9A]-�#�1�?E��l���[E�:���Z��w��9A��:���Z�nq���[I�HW����=�3�#��G��I���-z����b$����=2X�ipQ�L�%|�}l���,�#�C�g"?j�j�����{�v�lo�;��	�O�v�l�k.���O���So-��L
BU��Ch"��	,��O�����MT�l	!Y\W\�;�����i�����^I����1�v���#5V
������&���8��Q�w�!pM��zt�^$��KE��Y�k9"�Xz�AT��aV8��Wp������v$�9��B��.\3e�/�e��S?�T���g40/���@Q�
|S�����R'f�/;E�E$�T_q@����\��R]W=�
x�g���bR��Z�Ndk��^W��C�X$^��F]��4&��- FP����U���ojE���g�(���jl�^���������^�� ����gs�$���Qc��������'6O�L�v|J55/����S�&�J+�6|W9�%����`��}T��$�;��#b?i��������h��"e�������p�Hi4;��:�����4$Z�m�Z����/���M�c����(�R)Y�T�K]�pk*-����U�<hI_K�BMC������]?�e�iz��R����Q��:���G����P������m���q���UuSkWR:A|�Zil�����������&��7U{�U�|m�I���>8��>#��
$��4<���T��Y�`�������-�C�d%�����tg����r����{Qy���]�;Pp�v-��+1�L����0}�V�X���+c��9-�@��b0p�+)8�]�^=I�+�d	��*���TB�����T;j�O��'v������v�l���n�oO��/���A�0���z�i/T	��M�)w�����#���]B�?������;���U���?5]�u6D��+
7����V���$
w�]X�$����xJJq��(
�F�c�n��y�����VB���|L:�k�F�0}L�>A-��k?w��NC���_2K�F��)J��v:2c)\����	��Jl@
^�n��J���uT��t>��D�3�f��C_�HP'X%H�,m������*?9U&���m��9���m�Xcu(���Qx��m�:�>~��E9m�-1������K�v��D�H�$i�������h��Co��d�'c�
������9Q��@|�'��nu'f��])4���������i������`�����K���
�����l����vu2[���qy�:���������@��(������!@��pNL��gj,��J�_��;n8i%�p�����/��7�zX^�^_���vg�-�z����Hhm=
�>r��6�$����:���=���ktu�L|�i�nQ���P�:�^�����v�;F�!�2s�o5M�-�?(`�bA��4m���l���:��G������a�wdiY���|j��`��|K��o^�cW���Uh�_(�sd�fL7�<GSv4o���;����:�����g�m�������y�8D��k
��-%��l�J$�!
S�~b�o�s��_z�7
��c����J�|c(B	+���D��ro��u]ZJ�-���s��V���[x�������/���0�)�����N�m��^����XJ����	+�!�����<D�(=+�����W5�Kw�Q^G��YMk_��bK��aF��5�	X��
���P��q4�]:N�4���6�n8�`P����J������
���d,4���~*���O��OB������g@���J�V	�
f�p�d��<V`8�:�����k��-59�W���+>����G���{��*��	f����S��h6�u��O�����\�6~Vcc���

J��p�_������������a��i�����
%�B�ku���<����'\q������2�;^�#��J��H����y�����eC�]�z�<�������L�\��{�d�A#��#�
�~D�mg���O\3Ci~���	���J,��`�����
���M��V�W�R/A���?�{����� �0����V�6��`P�){��)?�k��){�9����~\��\8�k]�g�o�Q�{���+�{USJ?��3�a�<�n�9y��]�c��QEX��J��`V�x�J����~�'����Tbk��kL!T!r��4��0#@i=Uo@�!X2:n8�����]���4,TV�2:�
�y��J����X�!6�U��@C���d�<{H����xi?��o�)������/����Z��Yd�)y[�ZS�p T�%�4����&g����=zUL�M��=%o���h�qaq�J��X�
��
%:����dO��cW��( ��}\VI�Ou(���M�~���N=�t�.�O��C�����Z�=�M"����{t�N>��i��y�PAD�,h	���@%��6jw��.��G�P}�M�=%o�?�l:~�
��	
�@���7�?$�>vE)y{'�����o5�����cE
�6��~Q%��X?'��w ���u�
���:��s�4��~T��6�}�A4?��s��y��kS��{��\;�/��W-���)AFCmV��lR�~4!2\2*�f(�e�������/>w�<�a��&.G��Q4%� \�K�
���Q�g_,�F���[��IKY����.���T�(]Uv��~9%��+�`�!�����o�f�U�ja~]��-��!O|n��uyoW��@���+-m4��C��nx�'�F(������!�#X�����Y�T@V�
_>w����@4�}]���4��
�8���M6R��
W���[TrJ����#�0��RE����R��@�8���J�3t39�H>��=�n�y�Km���*i���3�����o���
>�}3�3&8��G����T�y|x�c���7��hj���F"���`��B�l'EF��_����G����A
��C�P"j:E����q%E]�l�g�~���r��$�!����T?���6q�#?tFOyLp�*IN����	�	*1��;��?�'��5p@p�b:�Ub�
�)�my2%���
��j��)�����J����tU.<X:	j.����$U�Y�2,�&�*iCg�����{��F]�b�b8z�,I�/��X�Ff��g����>T2�XX�~��	���|��'��s;z�7J���J6�u�����-��En=J�Q�g���4@�O(ch.��Q��c�~4���}>��V{�R�������8Rf0��Q�m(Z`�o
��m��PIe��6��P=A%�#��-R�	*��WN�W�?����
�x
�c�o�T�<��D(�}�'���c���ocKPI(;:r���6A%@�Du,��h�+�����oD������
00�$��l�Ed�@m��ZfY�����-��	*�,��0������Kk���C��T���H�?��	*	��V;���jOPIT��V`�	�b?��m�}[7=������[�k������
���5{k�^��cu*zN�������B�>E�E=8~���2�e��w���=�������:������	i�Ej�~e	*�]�D���<������)��N���eG^.qK����eo�k��5v��G����	��P����-{�;#�R)��<�5*Ux+��Q���)o�>�
�#n���W���+����M����>�Q���5O��j�|��r?*��v�J� �G��=9%�>��:�U129%��w�T,����1��g�-��\y �9�^�|#
�p$�gR�$��B��l1Y�v3?UXhB����B�V�d�by9%�U���90���f���0��Q���mh0��������e����U5\�7����\�����e�eY&�������%a�
r5�zCE�����_vD�N
/PI)l9��||m/PI�$X�4����G����W��D�;���zA%�
�H��[����j�������%l!M��~A%q� /�r��
���wc��]e���?
�
��,�K��z������:����O�Gy�#�	����V9�����LM���	ud�U��M�k8
{��%��i#B���=������������I�|/��pz�KT
���	
�~��%RU�Y��������q�rT�_��e��P�^��nl��J����L'��aG{=����{�$q�
�����l��������\]W!V=�G��]���t~@�q$�����SWd������I�1�5�����8���)ldm\4�qOa�E5���Z�GbJv��
���XhF:��a����&�����^������q�kA�����Q��x�JX���iihm�x�JT���l�����J���nb�4^��,.���i)tES�
�������^q��
_���+���	/\W�8j�F�f��U����u�h�����vM��C%����H�v��[��1i�0<z�^��J������He�:����8z�w%�Mx���n�����8Rk+��RJ>��
F���q)%�q(��;���c�3tp3�����b��c�RJBBV���GG:��fCb?���#�u��yph�6.���������.�����R��J���tl�+���ujpG�U��oT]�|�m�@j�}��sJ;>/)�p�F�R�yF�=��~\HIt�������7������I��^�BJ�w�@����U��6���=�Ux���W�P[\�������j���_�F�Z�}�>@�
g
w�������K:�[qtXX�R��.�%����#S��t�Qo�Q��&���-�l6�Og���<	/^�����y��R���v�����o�M�x�SZX4������Uu���ET#������a>[8=fA���������'�Y������P!%�^��������!��6n�k����?h�s�tl7�����j�ax��#5�I���rJf������rJ>%����P_NrJ�2L����
�������H����8}�o��e����ef������0���c-�����l���-�P�{w!%�U����y?��X3�����K���I�107|���J��,\5��;�;�����x[!t��UI����	i0����������M���x� ��H�~����[�uZ���QF����%���\��rw��$����X�.�$����}��<�c;�Q%����y�:�<�_eJ,��������PQk�V!���l3AU!\P�(#�KZ�������#�KZ%����+n8+o���E����q�KF
���/��G��k����!����8��Om�D���	��>�l.
�4�����l#��7b���z)%>�����Qx)%Q���|3���%v�"K���V0��wU��H���C�@��Z�2S���%�XN�'b3��{u���j�|�J�/NQ��1S�������e���LPIYK��h�M�&`�d�mW:��f��C�"
���&�YR���2��}5=���R�7�G��x��R��GQ�</�$�O/��WM��������7�����Q�V�����y3�t�pm�>�����PL�,?����(9=�!��~��'����2'���7tf������h���K��%:�Yo%	�{��M{m��7�����RJbFU���(��I)!G?v�x��0)%OBq18p]3��pn�D!6��L�13eo��&`U�O��P��nD�5��y�byl�[��^���}J�������g[l��'�3e��V����.�nf��7q.�.��Q�}5�2�t�bg��f4]W�hp8S���lu
�$���&���F&�0�p&�����b`0D��L8�G���:��L���8�U�z�B�����0uooSp���*�yu�����Y��3��s�T��`c��f���M���5��i��G?	�a��"��g�����U��7/���X���������PC	����$����*�oOrU�J��z{@�u�����i�L�������������y��;�%�4���{�����G��*�$��M��@1���h�{���|m<X����=�X���������s��b]OA�p�J>�~�J0����lNh�9��R����������/5Eo�2db�h���-�F�
k�jVqw2D�v�aM8��������yI%Q��_�'
���yw-�K8z`yL���-T�>���80�������$l��g�V����
�Q���~�+Flt�HP�P�����]����nlY��a�������Y\;�s*�����t��p�T��uy?w���8�\���i�^Q!�0^������0�i��y]��:��T6����%~J�a�r��������X�0~��4n����we���(|�����W��*P0Y�mG���_A,kO8�q��C!EQ�<���6D�����w\^2��3lf�G�^�C���^W}FT�O��TX�A��T2�pN������Fo�T������Fo��;T�(�-��\��g�����������*�5$�'j3e�^r��C�����J��$C���JZ��-���@�?6���2&� �uQ%:�=7�Q����:��M��X������
E@u�g]U
y%��tEv���w6yl�6G�_+�JT����
�F�*��d+���0U
O(Mp
 �
wm�����8bR;;<����`&�6c<
QJ�ZU2PcCGQO����0RVS)k%�$�ny��Kh"�U�����C:/�D�lm[G�X��*�
yF������C��N�P����'tb��O}��n +Q%K������l���U�n�e������2��>+<]'�u{P:��hZ�1x�CME��	m�G�g%���X0tm����G`4_	NG3t�9�:��6����'����q�x��=�[����8p�Y����n�$�-���W�J_��{���
h%�$s%y�Fg}B�*��{2�	h��9!Q%q2Z��@	�|%�dW��p�,=�_G�^�aY@.���iv�U���' 
�H�k�6�����u������8 �O�(H��n|:���=1�� �>�NU����'G\�H����j���a�d�9^�6Q%8}5��c��4��.;��:����c�<�9�h���E��$#��y������vPo�����=���3�<BG��!�
��h�4�4�#���
���gz_��5�	�E��5�c����%�������Y3Q%�R��
�D�����A���#��{��!��H�;P2CG��k�
�5Yq���[�����r���ZX6Z��q%�7�e������1Y����q��J���1��Vp����=���I�~+I*	L�<!�|;��l{HQB�Z_��F'iYS8��'�o%I%e`/X��l�_�U�+?��v���ADa��,{9B!e������R�ye������U�����'�Q�3��~��B�L,�|��Yo�a(�
CS��NawA�D�xElS.~E���\;:�������^�mRN*Zy�������V�|4�%g��������H��.�7��G�A^G���������=�-��	|H����\TQl��$��`�9��F��}M[�WiPm�z�|-cb�H���{P��7�J��L�v�bEp@
H���]��,��sc�%�(�&���'�&����A�P�.�d�I/�~�J
���~m/R���m�=NW�~��]�����������{T�
�u���~�G���fI]�7����������?Vb������7��\�RbL@k��6�����^�^����>�����,%�m�1g�a�N�v���XG����}
�8���J,�����g��a�J^��;;K�P��k[����j��C���7�th���R4���7�%n���q�C�|7�VI�$YC��}9%��
�p�#b��K�������G�vS3��Cw�PG&+�?�p55������h��(\�vN.�'��f(y�i}�r��Y�~z~N�N������������N!; %Jw�*�#������c!����fh����]����F������N*Z��>��tl����Z���&�vc��{�W��zj��5��{�=���}��m��JU_Y�Li_L���L~)�����I)N���A��}��Si.qU���+aJF%�OX�������W�=����J��� *�G��Z�T���b4\3����V��'&4�����|���2�m+�a<Lb����_��G@���z�-�h��:,��#�-�NY;$F=l0.���;��O6����������]7-��S��K))���3��s���6��5t��x'������
����h�>
��)g�&��J�@AX@���������;���
,������_���w�������n�'	2��a�%K,��Nf���k�����C�w��[4ZWZ�b
7�zk����>��k��������~��.u�����P?���01���$<��&{<n��!�����q�Ii�oM���R���ul���p��$��U#�u�{�h'J�2��U������ti���V������z����7��n�D[!N�v[����5�xJ�,Q�T:{=c'�~�j��G��|�}9%q���Nv����S6_}9PTQ`�/�d�'�2������
-�!�������7���q�CO
��N�����J�r���/����-`�
j[����
����a�MRUz!�C���T���.���i����}phW�r�a�-��9�(9^N��\�(s1d�a��s�6����M������M����v����k��x�R��@�����x5�b�G�)�X�7�H�D
����i� %���b�.9�nk��\+��v��?0����Y)�n��4��-�>9�n3U�`�C���z���
�xP<�_��B�	v��i��)������^����(�- �������=�	�����tx�$�1u��bz��IF���A���\G^����ihS��c��J����h��zC����s-w�����*��K�g�%K�i8����Jj�}���s-�j���MJ�Qv��������:x�7]�|��8�[�k��}�9����M���-��{�����'N�>t�����\HI�K�fP���j^���������34C):u�\�B��V.�$�8��S�gh��(�0�6��$s[8���~00x7���2���IN���m������>�T�������U���������������
�n���#;F�W��LxF^�rW�G�7�9��RQ#����(R����F)�h%Uo��\�
���7c��,�mL���S�����������m�%�Da��N8��B�7�����(hDS���sw}M8y�m�=l�g��^2"��[���V��'"xK��4�d���z�W����Z��k����S���
C/��5u��|���BJ*����P������v��\R��F�a�R���u�g�E��f�dc�����]���s?y	Q��zu��3K���>SD��uS7�������{S�@�b���yRo�l�1�#S�~��n�����C3��&��/�1����g?H���R����]�\����L� �F������z���U�����S���t�V@��WG��O��K'����i@����W��Wy��[���{��x�T��s�R�����3�l�
$�V.�������a������&��;���BJ>D��xU��.��;���Bk�[���Z����=Eo���N��������c�PU0��m���7*?L?�#z��{R�K�K��w4���NG���:�������I�U�W�7s�r=��� 4�������4McndM�������r��u7)P��yH�X`��Pe��p��E���5l��o�BJZ�O����	�
)�PC96vbRb��������2�lJ�AC���&5�u��|`��yi]Ge�^l
�$����2�R�-S�?m���\��	BJ�rK���e�t�?�w�2�������{��-�
�2��=����\`y~p!C�������2?p��0w� %j��r����BJ&e�Q�����%V�.m$.R>]]1�xW����=�)�d>���K\�\�Q�����k�m>ND������@��FS-Y���l����&�����_������/�F�I���4�UB�32!%�jk,}� �	)��C4���p����������xxB'�����0��m�����<`�4������������Rm�����O(���i�����&�d���A��H}B)vO-���wk9r)�����l\�������)qybh�����bZ/��34��p�Y�M�I�����V�uQnDL8C�G�6����t�vk�'����eE�<������j���;�f���f�I��l~��v!%,�Z�����=p�f(�]XZ��:�j��k�}��}�Oh�H�0�%���4NT���'t!%�Se�����BJF�m�DY�RU�&��7 �������o%!%��v7}����9�Q��4��'��%�0\�3
J��D�\%:�6��<�����iN]�
_YBJ����U<�C"��D
W02!%�H$��lR���x�d�.��#�e�:W/���nL�L}D�]��cs6E������k�#tk���H�;(��1p�4�M`"���u%!%���Df.��G��V�M��BVs��m3��$k!3�V��x�V���o���+��i�o�j�}5��� ���+N�T�G�6��{~Fnd�	)���i+��j�;0����>u�^R��Zd3�B�y!%_�"�0S_H��+3TM=��	)��(#�������,�*�����pd�Vs���=@������<j��V���������e�b��_���-��� ������%������XFS��F�
��&���aC������hw�5$���YR�[��~�"~,#�"���P��k^FI%
��a3�c\FI3L$4��n��(Y�^��hg�g��8{b'��������T�}
�?{o��_
����d��U����"U����/�E(Y��| ��:�d/��F��g(&����H��|�z��_���s+��������!�[��f�����*��\x8��m>`9XMba��R�����R/���_�����G��n�}�
U����d#�.����f$�	���/x6��/�$������R��pq��9 �� %N�k_�IT_�'����M'�)s�1��������3�����h��]���
�,;�}�%��tI�WG������NM��{`t�q�Dl�H�D��[���?}/4������K<��\���z����X8>���g�m-Y�V�N
�a�����a��RC����
Xh�8�����]>-�6�$��1��5t�\�r3\u�'<@e�jcx���[E�����vI%�:�{�'��@<J���g��"]j�\�ma�=�u=Un��.��~*��1m5�n����������!
9���p6����E)r��}:��%�VE��;C3��{����'�%"X�T������{���L��7����,m��vI%�cB!VF�3��Jh��u
�#h������z��g�k7�G�:�����t��'���{��U��XV�%�t�S����]R�`��$e�7��PX�*4����vI%�|�Y�V�����/�

�6����p�B"��C>������_�5����e��� ��04��NLo���b�Z����^8V�.���J���f�}�#n��}���1���f��m��t���q{6Ui��=�
��z�)q9|��Q�K*�#j(�V9��H%^I�14#)�����;�<+-�?��.��&M�[Z�\/��A9���vQ%V&�KZ��Y�]�����Gf��J�3���a��n�[���*�"E�
�i���F���R�3�����iBb�qX�Z���N���g��-�{4�KL����(�?gu
@9��xH��$�'$E;�*n���l?���N��Z�Zz������ ���7�hl�� 

�l-������O�c��z3	�A�/��i8����c����A]o,��U�I��V�kD\T�do"�7�K����BA5�v$�Q���nM
"�H���ZWt��&:r����o5J���^�rJ��~](�W�J�����<�Q:��yCi�,�sV��%��Ey��4��u��yC�����#��m�P��6a�v�5n6��:�Q;r��-��U�i��U/����Le���Ti8�,0�s����#��z��������=�cE�*�~���h�;���6��K����{K�c��'q���]NI�9��7��d�=�(����1�S�~/����S�w.$c
��m-U��;R���.�$��rBm	��U��"����M�8��N��W�8�lP
��.u���z{�8$��I������}�\z0�����SJ��U����.�$*:��w����b6V L"]�[����5�G���mk����P:[���e�p�Wg�]VI$l:�j�a�my��mb�
��b�D����/�R�.V�@Y����mG���vO=C��E�S/Z�9�y�`/5;�w/Z�=PP�Tp��.=9"��Y��u�-q3�0�[;�����w��q�����k!�Y{����)��
�x`Z�e��rm^U����J�u�v�'TvY%��N���Ym$����	M���*��:�t�����.���_Te��(x��lG��u"�k���J��fG��u0��j���!���
���Cj�E3���5BX�Qe������V��-��6
�x$�8��vcI�}�!��/�����Qp���-Vs��CO���P1'�,����*v������{�JW�v4;���m���N&�v��EO�-c����u�?�w���w~b����:����q������ma����rn�7���O{j'���;�6��v�9jQ��X%L������/V	K��*�=�]VI�@tk��7�J��6�U�0��+|���h�]��k)|W_�����"���/ye}3�c�h��Fk�g������zto�VV����*�em/��CO���a����FR�	D`�H����Cq:�����������{�U��������i_VI���6c��>����(�7�6��-Y%b�ro��3�p��|X�*??�Co%�����Nm{��
l0�6���|���i-��77�����4:O��@��j\'�6�e
�`���{{�@�H@W��.���,�5&L��Vb���FOg�K+��Q�)8�vi%�Ep�[���,�vb���si%Zz���Z��c#��Zu&�)g�'���oMZ�d�,z�76�I+�Jn�h������<��]�;�������8]�}�+Gf��Y���A������wsW-�^����L�cH b��X8�����("����N������5-��D���p���D96�V3��y���gS���>����D����F�9�R/�7�
z#��X*�?$���6��x���`���}����.��u��>��E����Z��H9�cM��)x������5���������_��HL��G����2YQ�m=����J����M���RJb{���s}`h��4$�����_z)%9lE^��]JI����" ���si�J��@'��}���47=���=H�z���k�����3����v^��������Y����hk�<�Q�EW@�X OLI��dU�n$��v�/���U>�����2�t��S�
o�����(����788��Q��_L���\f_OL	�r�]��[G&���]�B��S��{���mbJz����G��'�XG�I�>��)����&�����-W
L	�F_Z��_��k"m5�������l��<1����}�(���Sm�]G���>w����Y
����p���}�;1R��(�+0r�n�� ���g<eA�0'�j��tu���S�'���'�.���4S��{v�,Z=7��z��AG����"-}���yP���m��}�V�]��~1%
�������SoV�\�DOLI|�[GBQ��S{&m���<1%]���[�1������;���?����l6� �����"@���q�3M��FG�X��S����R}$'G�~�#Q�D��t�a�5p�����S�
h^z��G��/4
�������x��=�51�b#��!������Z�*��\3�'U|�t�m�t��W�>[�]?;f�j���g��'�����Y�@�|)^G�V���)�6(�:'%�yTC��M�oD���<{��Y�gM���c�_��=Z�yb�Y�a,i�*aJX�P�T��'�
 u��k�}�k{w����T��_L�B����`"nS�J,��6�F�h8V�;�	{MC��@6>�
����'��<�����P�����7��Cc/�/�5U�����Nu{�i��5!!�!9����]?�p��}.5�����T)�>L��o�RJ�7i�j�������#�2��RRx
RQ8�ju?�����^S�:�9>��l
�h��>��5Ss��B�|��Q���t���Yrw�H�j�e�aTA����������C��?��Q��:����v?�f�����]��/B�3�p����z-�����z�#^�M�g�F6���M����5F�ij����UP�2�h�_���WS8��^DI�Gdr(��N)^�K��`�L=��(�rG��+K�1��(�3l�z�y��(��iP9m_FId*@R����n���-G�\�/�$:��o�$
���z�������
/��2��5��X����X�6z��{�F�#��0o
o�F�m�-����|�$[����5YQa��� ir�4�2�������8���5=*j�|�nP���	j�dO�vyv�#Bg6���'�	y����� *����!�/<��m���`?������O)��H=��%�{�W���6�o�I��X0Vr,R�E'�0�-l�P����]�F�#�� C��j���i_&��`��'������O�
�QV1����2@F�������R�l��w���:�7��F��_�/���d�F��6������J����! Wc�J�����,�~Y��&�	���[_��.�BwLi���Q��^�P���U�bA3%��5$.����w��7[Z�2��2/*�zK	� cW�i#�9{O2N�7`<�������^�C��=hFi<��G��]�Ld��� ���O;���!�����cO:I
��k���_:I�r%s,��f ����K���D�^��^����P;*!�4j�'���e5��[�P�ZCaMTSzO������������F���T��c�${mP�� ��b��)l7����m��g���'f�^E,dw�������Mr�m(mu�`')����]Ti�BL���;�t�zg2`�t���v�cRT��O��?0�\u����M���=:C�{�� �`�����9�����x����gW�����]�X} qA�o����-����Ti���t��S�*e��&bR��O1���������n=���NR��`����K'i�e	H<^
�K'	'*���������v(����K'��y�J��)��	�&����.��)�{�QP�^�QJ���Q��
��t��g4�BpM���#u�J�bwU 6��������<}�I*�s�.�i��N�� w4^�����N_���sp�/�$���P�@!�/:	;�v�o��^:	Yn���JyO:�N�$\�����J&
�8$��NL�u�JUU�G�������������-A���!�H"w�����f<�T����A�
������@Hplt9?\:��_��|�����-�jJ��?�m���z�lt�z���il�{�)�#w?����5�<��I
3�H��j�U�PD�CM����'z�w�]0R�n'�����R��S��0J����S�����g@8�p�R�+z��#s�t�j���4�[���!N�H����\�,�;n�r=�K7E��^�V0�#y{�����B�f��v���t1�����x�@Q(0��8G*����'�1���`y��oU���P\�au\���p��>�"����z�U�Q�p��v�����q�$���g����q�$Q+�������t%��w:���J���I�l�.r#%��`��/���8����Z0�o����B���M5�R������,�������s���|�K��VyU<���-�J��������
�d
i���<.�d�!4��8.�$�O6�
ch������e�����K����-�:t�G����S����>���gj�)����g��^5`��xT�U�%���W��H4�3� ��\�u$��"���F>����B	���Q����	�\�RS�����"
�[��z�$�T�*���l��r9������'D	��t��������}���[*l�5������?�IK\a������{�]�U����&1Gf������XZ�������^���a
��H(�����4���s��z�1��b���+�����T������t<$,q/6	M@����e��	@p�p)z{a������e���$%7�E��G�������C�$��%F�"(�q�$�CF^��<�����
Z�4��Sa2��"=��������Y�&�x����weBh_F���h#$���T�%f<��u���&	9����}�I���}�s\5C�qm�GQ�<.����*��hM��RAQ�iy�H4��6C�`��
Yp:���'�����������
�_zT�'Z�V�����#�L2Q5����?����'������H=�LR:`���I�#�$e4���	n���&Q����/�M�mAYz]��O�]w(��n\I�5x��K{�?�z
���������|�)7L��.,������;������@�o�&���U=���ceFo�Dh(��$p6�g����-20����n�=�/�I���Z�A@e?6Mi��a]�@�z1Laj��|��]u$�d�+!:q����i$��j�@�L�b/y�Nx�R��B]��x�EF��P��x��#Qv\A��]����F�	|�m������,�����$�&T8�9^�dJ��Q�GT:������F=k��Y1�����zl�L:I{�������x������k"|g�I~Y9r��&�8|�3�$]i����&����O*P�]����t#Y"��B��N����������r�
"!�$z��
VQ��t�ZYB�?��I�\��
jF�
mI#$-���NR
�n����n�:�xBI'Qy1�$�?��0��=���6�"�H���Q�:���B�@k�/��vq�?�Ys�E�a�{U���~�p���q��bM����#�x=�&�E��(�����@����#p�O!�Fo��p�'���r�h��|Qk�<����	�h��DUl�<�����waW�F���\~SUn��'	�,��q���x�6Q5�p��|�I�X��fx^<I���oE������Fu�a:���)6��m-���'��j��B�b���Cb[W�yl7��mK���I��_G�U�����=G�M)N�����NaR��'��0��-u	���D�1)����9�<����:�tK�i���ll�;:Fo���tq��u����G����������q���L<	���a���}��]�`XJ���G���y�8S�o%���h������9�����`9�O��y��MV�}:2�]0��P��%����H.��0�;#A
���ZF|����v����\H�kUG�c�>:��`����?�����n?���R��I��#�yTl[
uZp���:���6����"QF�*d&��YIE��Rew{���d�!��G��W=[�������<��t�Q[�F��$����Dj�x���	�Z������v��vC�����Z����C��x���E=�r�����A�]ea������:5�Q�?S�lp>�����#^��u�uE8���o�x�w������6�x�d;�o?�u�)^�Zu\r�vm�3c��i�oEc��I��P'���Y��=A�[��h#�x]}�U���~gW���H����+^V�B�������BY�a�_l���M&����z�m�/��S���+��1���O��Q"��|�N������z�	��b��]!|�sefD��;F&��;k:���h�z�Aw.E����=���*TE�6=NkP�rm�`���K��_df����&����3���0J}���b�y�$��o�������uL�K!=C�D-�@J�an�6���u���=�����Z�?��B���u���'���

���$�y�p������{�rULJ�gX/>	Om6*�p���x��A�L,��4������1c;�f������@t/P	+�$���U�eH����{���R�tj���#�x���*�^\g��K_\fH�dp�z=��8�Y�E�v��]�������@���J���0W����<�pYT�{e�����.���J��&�����S��u�$��\�0�pzx�I�����)Of�%|������&�N�=v�SM��I*����ZY-Sn&J
�**����w�����b�����G����{F<*sc��q.����w�;%
et<�{"&�0�)t��+]�M+&"z�������Q��I@^��?�uNj�}�+\i����C�}�L�`�����-*��1;�]�
�d�]����I:{{~05X�.�d��F�7*�����\y`�������:?�$��^`�2�QA������h���ZH�+���h������|���<��Ay�4Y�I�J�v
<����
���#�yi/���H����]C�3}5-)h�G��sy���7~��o��[��<���(���%  �K�2������J�U��j^MR��9����o�;19�^�����G���u�I��������Un<����c!�&+����\��<P�Ddn7k��=������zY����-F�OM[#w`TV����h�=�^C;���=�E�4����#��h���"��u��=6Q�����n���M����f��X3�;k2��^0�)�8j]��y��I"�E_V��������
�A��u�G��O���1e,���g������6�I�.�~�I������#K{�I\w�u���/�{�����T�Y�&LC����>��C�w�5�`���>;F4I��_�Z��s�A�Z�6H\L&(@y�X��w��P|6*�����Iy���o$Z�u�T�gC�Z^Vj�?^
`xA��P����#>&���]7�8<&�/6Ic�Q��XY��](RV����{�����J��\G&1�O��z�n�:Z����������7������^4Ia�v�%m+�$
�y�����&i��}^*������(���D@_���T:C	_�q�w�<1�*��z�$�3S��C�	g��&�%f���_��yU%����>���?-3�������|'�d�6n �;����`-$�~p;�n�(���^d����
V�}�$e�l�!�v���e��^ID�'��.��&�D��U��M�Fa5!Ks5�����/�F�q��&i����VQ_��qG�[������5��'�g8��G���2s������	�V:��_C���~�������&��@���
����}0������f��|H����W���X�w��L���xP"Q6�/����q�=��I��8[H\4��L��_6��5�K&	��
s���0�$�\���)��I���O$����E��bcpl����-�q�*�������5�j����g���G�.��� 4JbW����];wNO=O�Q�G]����7t]��L���R�y����4�${����I;��gG<\2IT���j���L2:�%C�7<�\���!|�
������
��6�5G��0�IGt-������	�.��F=C����������cE�N����Z�����)z{%t�
�Ip�~�J���\�S��!}���P�N��6c���
u�;Eo[4n���5K$���T6���&!��A/G����>�2����0E����(}�����q7�p�8�Q�~��[E�_��=&�G��e�`��G�K�����Uql��
�:;Uo�]�|���m�I�'��.�KDc7,��;o�SZ��������Y#x0�O���cb��P�����m'��:�������~����7
�i�n�������}�$��V	5��q�$Q��>s-X��L����[���Y����2�}U��p���ur���{��a?����������=�����mz�����h�3N�������<V�8�UEvC%���w)&�V���7��'�����M�>6v!Z�5S�v�@�o���I
,�
���7m�{G�>]�/���_�n�H�>%����{�S�}T����N0��h����G�L{@P��M�0;���!�@'�1���g��+��V^������h7�4�������=���(g�����.�$?]*d��3�I%�h��u���8��)w{����
��q�+M"m�tw��%z����%xMi����P��J<��X����))�h_,I�C���!��X��������/�$��b��n���X���j0}6	zu��Qx`o�S-+K��@��W��de%{O�d-k������y�[s��7ii�\���'�Rw������,�o�A���D��U�������������������YB9e?��mUm��Z�K���_�H-���$�������&��a�nC���3�$�fG������D�8r	��� �\����#I�PH.I�P�H4����YI.�X����������Z]��c�3K.I���_w�(���	��I�L�(�������&6@�`�9�-,`��q�M����������R���L,I�bO<���m���W����7��1�_=�����2-�a�u�G'F�f�s=PK���Ty#�W��y�$���|]���:�&�����<��Z�K����)��I�rI:�$x�e��\��|�{���\���$����{��{�$�j����u�<<��p<L����	7�$�T�	���:14ty0��A#���+\�r�>��1��P�)���R���\�������?#w~�0co�E�7�����s��������E{L-�e.MM4�.��<���������[��*o$'�KRH4A�xB�K��7�Lk�w�������/���r��G:2'�����\.�p}BNZL��^.	��m���k��$=?� @�{����"/y����G�Mc��J��������:{�&G�~��U��su}�G��/�
�`����_�2_p����{������'��"�3�Mt�X'��_f�g�2u6�]���`��g;`	��<+G���ejlb��
^.ICa=O,�5�\����a���<
���������<14���M����6�>
3�R����w�o��
�������}���IrI6{}�b?����\�Eh��q��m*���v���Xud
E
#���L}��K�Q	�*cH���\o�=��X�S�����xuU}+�KR+��:P�:q�T�!>��K�U�����,�����6�%��Z�W�~�|�9+2��5/�$�:u(qY��_.I��B84m�a����Mf���#^�(���������%�-,L�T��3��Eay�g���Dv�k��fZ����6�#`�%aaRy.Pu�?��D���:����:���a�O��r���:a�A`�N��sU��Ll��%j{u\u�0�)]0I������5w��H��s�8UxF^.���|.�)���%	'�b�X������t�L��!�
7��R8�����!���m!9]���s��k��4m5��-��V���p����&@���b^��������s�w��g�Nk�:$���g]i�C�.�(�
��4m4=M[|��4
Zp�6�&~g��zM��
�{�6�&�Mf�z�$�yU�b]�k}���/V���LX/�$xub���L�UoI�D���A���I.I��IY<'2M�Q�p�,��C7�U�$��Yx���)��O�@� �������<l��Q�<#�r$��+T�;�6%����9
#��uh���g5-�Qc��j�	��V��G����[�
W�H���.X�����h?W�	~!��C_\���')w����R6�<G����`�������FH{�8\��L4q�/0IY�f0�(������j�iD�Gs���N7��`d��/����YT��&�^zr�=����f(����14C�'�|���z�$�����i�S�`�>0	���%-������>�4i�6��i���LQ���E4w�M�}9a�.	�^�v5��4i��E�w�+z`^JU;z�:(�7VMY��u7�����ghNK1��P�(�*���c*�g2��uL�<+\t5��	M��F�T����;b�<
�5/�du�P�0)���5h���s�����*)��:����#[)�����5�E;�������f�m<k����AK2�����M�'�]:-]0I���jX�����I�D����������Xm
&�#p�k��'����k��3���pp�����l���J���]��5�?�/�S���KRP���%y�^.IE��a���g��KpO�g].Ie+��Y��_��}!�:>��%	P�b&X�����&���&������?�LW��o�(����^�H����M[�?#��	t�����!����0����|W�Q{�%���I�1��~���/�R'k��3�:
v��D�X�>�����7���%	�^�@5�����<����8���t�n�11V=m�x�G��[���u����5����"��2�Q�R���qA^x3��v�ep���-���P�_W��r�O�_���PN���S��i���c���7��o0�6�x�^,���7|����z��xbI�@'��#v��U����O;*>#��l(P��U�
���$�]���c-O��_�	�R���0�Z���>�tiWG)'�$���������J
�f���2��2f���T���
����j�i�JR�����{K*I	�<&����T���U|�X��x�z�E�E_ _�~+)��2+al���]�;�u"=����5��X
���WX���c{1Iz�P� ������<Nb	��������y�I!���S��������s���~>N����j�R�6+�����c��)��#����n����|��t�7���(^OFNW�Y��=#S�J��6��"#���S�����Hg�v����P�������$*�D��P���2Il@0�0��_&�������W���
o(w���i��.��Y�03�@���mm���7��_K&�EU�pJ�����`���{�C�&�s@��uch����������x�-G�Ll��
��x�d��������v8���QC���v�$��/��q���.��y������U�h�I�$R �P�:C��m����X���z��H3�����R�6tf��|p�R��A��a��
C/��M8�01%���u|iH q���v��9������T5^6Ow@k����
��qG����F�Hq�.�$������L���B	�k����f(Ey�E�d��.�dR����:�ZJ���[$��1-��m�U�(�UA���m�X�(Dw78�[J���d��������=�f��sWV59(�)O�g��������&?��P|d�v��&�Oz����x�g��M\�?Tv���b�H��3�����RG7tr��P�6��dC/�d��I8E�|�$�a�WE
1�����hjkG���F���������v�Qu&������^��v��B�~P������m�|��Z6��G��&�\�R�n���-�i-���U~��Cm�-������a]d����i�7��2��D��Y�9� �6wo��mdJbd��&��A 	�m�l���k(��H�{�B.��)[�6������XE�I0M�5�j��K�QG�~�(��ly�"������Q+^����lq���Bx;Rw��GQS�U�oHWR���i���
��K$1n���� )]"�w��?�-�i�Hb��
8R
Y]*�^����*%K?C[��C���
'��i�k�yM���p��zg
#�^��-��3�5v��k��l�r�����M���t5����$-�u(K��H���!f-5�tuOg�����KW�f��BW���{V��:iP�<(�8'R
�D��h���8�:����D����HlC7_�8�\�XW`i�����g�����h�BC
+y�L�5����4h��%���f�?��I�%�$j6�L���7���#� �UT6CD��#�/���!b����!Z�K�V�������}����t&Z����
-���K]���_���)w���.����$�.
^�e���b%�rW�l4=��#n����������h,q{Fbv�N�E���h��V�h��#��/>|#;�����k��5�����- E���G��(��xBZ^b�#)��d���g�=o{�������k��`GIE��z�o�f��G��
�'�2Qk�#������X��<#O������P6�yb����gx��FB�Hl�n[�55�G�+=����`�MI�8*cSJP������o�z�,uw��z��L-���p�6��	M49�K��#f�O
���&z�G�^�[`�REk�-y$A�W�:�*b�#	:1H	��
#��v���vE����W�B�.�Q����=�vg�W$-
yV(��G�vt
��GD��sP��|�>=�@2���T�{2y��>�GR���s �IW�#b��-��MT���L�J�����k��e����uX
y
���#�U#�����������U���g�ok(s�z9}�GR����o%�6k�nA�������	-0��p�����/����H���Gb��{��eG���=�LJ�7���G�������Mg�����z�+(��,y$�lc�w���G20������%y$}��6.>���m����>G��K�d$Pw����G2>1TH���D w�	
���@K�:�1�Lr2�K�� �5TEL|e�G��Z��p���#�����������TJ�n�H����/L~�[���??�{q$�od~)\Gn�_ ���*v�������>k�oG�^�a�3�*=[��vd�=	�y6t�:U��th��C������Q����6�Y
~�GO�����D����|���;U�j��T��g�S`0�P�R�.�''	���{i$����d��;}�H��/x�K#���Dt�o�4�Ht(������HB?�Uq�7��Zp�-������z��>Ht "I����S��~5���Z�E=Mc�#�u�Y���Z�}�+_��j�G2K��ly��h������*�G����8pd-���<�����DW���b�����U�%�R�J�����UB�#�<F��P�;p��������`j�@�������l���9;�) t`����4�U94Y�H��H{�m�I���ze����'\04��O���?�+~�������i���>LV��$��i*�Q�������6���3����B���b��Y�_Y��?����ad�����e���g���6�������l7�wMP���3~i$��R������FB6�b���������/8����K#q�U'�8����F�6j���O\��K�i�|�r�G�~���6^��z|���W�O��F�>����b�0^��T����v/����������F!�C�|~i$!�k?�����J:;~q�F2*��W�f�6����	��4i
�X�J~i$�6�0&&����H��	J�4)�lY�$����l�_��f����\�	�����t2�
u�S�_O���.��1�b��g��5G���G
mo�S��Z+��22u�gNJE�
e@
5�U�H?������N,�q�F��2����4f�G�_3%]i.�$d|��40
�h����:	j�b;T��4f�����8�a�E$-��%�i����VE�<?�-}�/	����JB��"�
u*���d����OzK�N�mw��f�kg��m�^���'�[����HZQ�NkZ�Y�79����Ny�Vz��E�G�2qG��8~Q$Q�������EB������_(���o$�|�T;����V�g�M�E�6t�
��3;� ^��
+cJ��!h������JpA�=��#q?ql�f�Iu���9)��][� S�n�?l%�R;-����?|�,X�/�$l��pR�o�����`%��
g(E�����E2)�����b�T�����e�D��	���-�uh��E����*a�Y�S�JG~��Q����ag��9�+�����c��j��x���������A$�#p����\j��N/��.����6j��t�����g\��h��A{T<��"�.��S��e���^�0(�9�EbF��1��;O�����C�GU��,��B�>J�����;Kz�R-��	��3�d�����+����8Ko	���r'�J�E��/������E��/pHC?Y$?e�D9T����b��9#>�b���P\V�=���:���h�n�c�Q�]�>���LG��+	8\��o]�[��m�NJL��K�T����6�+x�wV���b����Z�<�Q5�������rx4��"	���yU�,���~�:�k�c���88v���~i���Y����������[z��Q����5��?C�j7������e�*�?JS��_xF7����	�����?���p%k7�>���$�R����z
���s?�,��N�H���HB?Q�9]���"	��~���iM�����\58��)w��q9�x9��=�`L���G����90yO�,����bq��0���D��|�~��
a�O��gZ������I����UF�cw����C�O(��b)w��y
d�/����i�������{�H*�pg
���,�(��W��Lt��}����]�;��*���=��v����$�{kQT���v�M����KG�t��o���C���f(�COqP�8���������D�%�b��_��x�0���O��E���1�h�
���i�7<TF��"���~6�(���ysy��d����3���;��p��f�����(���L�Y���k&������+�����JNg��o��zw���6HK�����8
�D;�zCiu�sd�do��^c(wb���E�����o!	�����@�iT���"���I��6�R���4����k����]��i�������VF��jKE�Yv$�az���V�y�/�*I���M��2+i��1���<�8E^�������T�)�gk}���IQ=]������t�
^r�r�R��{����U�T���]��z�NvQ$�5��AC7Q$�����h�����Ez���e,���t��D����(���&�G�]�����(�
?8/�{V�v$fCM"vu�V��
�Pdvt�����w=dG9;2��-�}M�?p�<,A$���bZ�����k�um�F&��]k~,u�'hpz����yI$�4�Y�o��$���~	
�s��vI$�P����k��]It]�zU$����;��1�c�LO�OFo(��c`;B����jC�%u�~S(���`����������Z����{l��v���I��m���+">���������~��"z�����e�.�h�{�HLU�����
�KzX�l��F�)��i0���Ye���������$�TZb�0���z�H�	��O|��;�D������U��$�^����
x�<I$��S�;������D�_~y�O]��H�m�������Kk�����T���n�=����
m������D?�������d��,��n�uXo�q��{y�(Z�+8e������{6�D�yQ��/�����g$&�"��4��J^~A$j#S@(h�^I�(���)L���gl�%5/��>��k���~�����SG����K����V=��"YM��
$WU5����L��!�di�F������^�"�'2�>�r��i+��d<�CE+���N��_�h��`�+*��������"Qzh��:�:'$�$hkzMd�U��<A$C�$����`�� �����'�[-?j�W�F	�
p"oe�H0@$J��=`��h�+bB�'_��J�a3�D7�>�/	�w�Yx�H�P��E�O�*
h���m�&G�������SM�[ww~A$�O>G �|�
$����o�#c�B��Hzc69*��l�m�%���/�dP�E���u%A$��V��i�������~�O���r�*v�Q��UF<PJC��~�]&Fj�@|+�>����o�����.���#`{��
R���q{�����C:� �6C�������,#���"%�$�GI�������x�}B��z���z����u�Lj`���o=�j��F�t�G�x��U�*P���V+��z���
`�[y��,G��.�UO	J�~��\3A$C����8xM��m���	v<~d���;�G�P`���	"Y����;������
�@���M��[ix+|�)Z��G.��i���gK�	���e�9av]yQ��)Z��U�P����~D��S�kN��S��JEI{�y�?������`?�w�<>�9��2p`EJ�:�cCh2���;�D�Im\��E:����!"�+���g�ug~.����h��y�Ff��"KT�c��;3�������N�f��gs��4��Ak���`d�D^���	��%�4� *������H:�-���F=�$�(�TSK�����I�#`+0���_$�I/����/	]������E"��?��1�DRx6EP�F���b�[6P���:��TI�����&��VPb���[��,�I�iC
FW�o&�����i�R��Lw��\�1��*;S�n��%(����#2o��������z�i�5������%�#\w4Q^����*���&r�������p3�0�=�tP�fz����)���T��D��Ke�^���"��U��
��gz����q��:�P;_$�O�hc6��PZ�Ca�rMYf����C�c���3�i�������9�&�}��{�������6���7s ���w��eIuU(���-�"Rf�\�����vA$���Zde��/	�C,x�P�8/�$��������Hje"���xt^Ii�
��C���ny_�>��c&�$Zc�����Q�����XA���ov��Q'���qUM�g�e�1�JZm2o�H����i��Y�v[�w4�i�o?S���8dt�S+�<Jv����������3��=����:������FYU�'���RH���S3�������s�������=`� �h�	���}�����8�w�-"�w����CRg��.(�����F���R�{����j4�M�0?�3�� ����&����k�|I�/�\� E��G&�d#��R��0��L[f�K'�����H��NNf��+{u�Z''n���"�DXC���U��H6���(��D����U���z�
�
Uw��H���H��YK�[�X����E�"q��m�k\�4BnS�Z�\��m{��V����a&�$Z��g4����_N�36�e'�6Z��	")9�9\p��G����>P����3A$u���Ql
��|�H���,����H��
B����1/�$z��PdJ�����~
��I����y��������
9�Q�WTk<L�,Zw4��]`a@w$g(@�	"iurR"�@��g�H��hO�AA�� � ��
%\������`obz 9/�$z*�����r�{D���
P�,��:�����������zO��S}aH&��a�X�)t7(����d��Bwt��n��9-�A���6�h����)���PL��������@�22�K"1pK���5�M���3l���M����
DU3�<Z�(����*�r�#u�(X��r�����|��=��Oj��'M��9����,�v�Qdlq������9�?������}
8��H�g��o�!�$��(�p�S����+�t�vV�d�����N�P�m]I��:ff���$�� n��D���*)QF<������=V�HBj��V,����y������)	*,���s��xR��D��uI$sv1>\\��Hg4�Y9�T����<��+x���/�-��x���R�nQ2��'Cs��y\!G����^$��'�^3(#^G����
}�h�O�:b�`��,������+�+�Z�[��JQ����F���,A�eZ[��1��J�s1g����$�fj)�e���WMIL���04C)
�4�@g�b]�;�M��������:z�����(�Y�!Z������M���l��]�!'��U�C������`
���R��,�W�J���[����t���v��jc�����H�&C��z��.�����#�C/����������"o;*�&�������J5�Uo�I`�w;�4m�X�q���(��)x�e��k���<7���m�F7&�+x?�z|������P�9e�~�y4�����J�������fT���)x�8�S��vz��.��sF��@�]E��(`�j��E��������9=	x����UP����$JO�*��-��Hwgw���G��u�r���z���o��\G��(�z�����������
�_S�i�����?��b��������y�CIgTYj�����+���>��n$��H�U�:^���������H��iZ����w
C���x=��N,����C����=�(�Rv<��Z��<�*l����{����TH��M8TW*���%��r�������g���R��9AT�N�;V�������{�T���U7}��_�-b�x����dOt1lEM�U��Q�-����H[�G���n������@�G3_
�$ooM�����~�����Bc���f�H4�u�������0��2��Y���!��a���lP����5��3M���V�F0�JI��cMB�K�"a;U�H�OMI���'������dkCm�Z��yD��h�hS�G�f�+9$`�:Xm��5�F�^���X9�R����*���04���v��8X��SQ�"]�������!� ���������=���b�3Nw��L��'�2}3����s?�3���4�kI�J�v��m�
=A[�r����r-���*��<�hH��[��=������O���(���O��^�R���5�
s��L�R�������$�}Tn���e�,�g��g���bJ��a����`�cw<��#����������}$����]����N������+�hE�Z�(��;A$�C�s0�S�%A$ma����b�N	���])6~g�H
�S�ZT����G��(
��h'�d��5�~c�	"���F>���#n�_��{��@���xq{�B�6�O������sO4�lO�R1v�\39$��`��b���'���#�0�}���J��G��A
��T��t�lnE�����k�����?S�j*���+�eJ����}T�g$2��,�"���=��O*�J�::d����o8�k��c���f�q���a'[�L����'���#A+�81��o�G���[�l�98�NI���-Z(t����������H��w9$:�u����Dv��M��hnp�}&�$�������}����hw�	h����e����g��*��z�}����whR�8.�Q���A�g`_�F�9��[�� �Y?5�H�c$�d �|1���&�dV|��~�0���q�����DBl�C��,�D�\���;�G]�
p_{~���tm��o�k�1��� ��#�^�_����b�����h�k9���*���h�6��Z��Z���=w��y�v�l\���@0�w�V+�>�������	���_o���Pca6II7�$��h
��#u�:��}~M�L��p���F��2k��V���=�������W�4���wt[S@��Y�	"�:f��|��HfV��T�s�8crJ�����nw�Y������"<�����#[���Z-��g�y���������=���mi_s���O��#�h��.����Q�N��m��[��$,A$�2�[b6~�i�N�z����������Ea���J�
]����c�,��a��MR���47��#5�S��c�T�o�
f��9,N:P��R;U��h���������%���<	�X�8L \�z�Q��fBd�5���z�']��2�6���Q���6�_JC>!$(c��V����N�/���Jm��N������Y��H]������g�)B6mj�����!i�-�DJt!$����@��S���1@+��}��v���v��L�����o������������i���Q���O��/�d��������h�"�P��������-�l�����RrUk29��NT�=#����)����.N������,�v�Lr��R`R���wh~��)��(�#�DZ���C��}�tf� ��������������k���ghD�Sy�j�k��FA-�.��\������p}L/�@���>���9���~;���tu�h��-h�K�����K3���#![��L#���?��7Aj_���<��^S�[�i�;=��XA����
�t����C�����h�mB#��bZ���!FIWv�����khk�gh���"� �`����+�D2:��dGi]
I�Lp���"���M���3����B�a�_�W��z)$��p�A�T���?��Mx9&��3�,o�p/�5�]���45��;L�����c.����>�#b�S�^�_��	���="�?�u=B�4i@�~F����|��6����h����*6��dw��Yu(����w(|"��dC2�� H�K}�lP�����[.�$�E���RU��C��>&���4���F�Nap�G���L_w}�����T��B�Y��LPG�Q�p���{m���Lw��4�:�U�3�NI8��8�^Kv���Ar6��fs�[F�G��j��$Gb7'oX��C�]J��eGM����� ���������S�7�J/�
;1�%V:���IK�G�OM2���
Y>�#j���o/��\�����������x$t���!����uw�V`3���o$������
������Zp�W$��r�~aH�0h�eH��*!A[�e"-<���U40[u�q�����=�q
]���YW[5�o`��1���#�����4`�4z�C�e��_Uh�y��:^
�[�K3c�T�y��~�
��]��mr�o!�<�p=F��)���&����t��*������.{�����������qb��������8K������`�U��h	�(�r��Q���������jO�2��ZM��Q���m����W�6�|���
'L2��_��o��V�K!	��>$��*&��������BI��	�a��3����
�����!����0�G����9Z�
U��+�o�DJ�-:�i	1���R�~v��0��<���v�J Y"�q^I'�634�p������2��3�����E�l��%�bH�m�m���������ca����#u?��i�)�F@])�����nzf���Q_��Z2�*��~MI���?�]Vk��dp�p����%�8����n(k����h��K��.[�:?���[U��3�$KUO(� �����I)�����*��K����W�����\��l���,I���d��3��p���(�dCG���B���]�}@��:���%�)�a��5��G>�>abX���{�7trH��_� 0
�M0����s�������1��G�S��r��F]G�6 $A�R�{���tp��NGea��$5�#�p��vw4-����WFYS�nQ7��X�ah�J{@�����x�^�VG���j��34���b|�����B�yt"��3��#V�'@�=Q�r���������5=���d��F�6����=��|���.&�#w�_�������Ek��m~���:���T�(�`����.T�$��gw5��^>��28�
�rwo���A�V���)w��#������/����
���an�U�vu�t���#w�T�c��>����=��+���*�N-G�v��6xmM��"�8R%�Z���RA-X�
C_� ��@�#�wR�d���q��J��'��������(�,�W��zw���9��o}��7<�,�N
��w����q�i�bw]-��2A�$e��K��	��������QA��JK��o�"H�$7brs��{���H(~�
�����@����{�Uj���G��jM������Z��F!�>>��	:�Jw�RH�p�r��)x�������N�;���qu�L���g��jr�n:�{���x��CcL��#x�h}����=����t$�"Y�Q'`�C���	�MOVN7������Y�~��������gt^���}�;��v��HV{~���F���7�_1�k�&��d�.)Id"��<'p�<�MOY����*G��o`O�v��Z��}fYI����oQJId��Y������&����S�m���������������X���m�R;6�+�559 �j��G���-N���j��@oOI[�W#�+��)��tN��/	��X�s�z!$x��d^��� ��c�(������k�������5e��kh�1B�B|1$������UbH&)C��G�>��r�&�/���F���k�F�C���<����TdY��i��g�e�����������?�f<C3�v��m2H$�hG�^s�v��F����32$L�d��Km� A���}|C��g��GTC��@}0��v�Y5��7�Z�#p�Zy��I[Mv-���{#�4���0j[��'����	�3�h�a�Q
�V�v�<&]����a����$���#��<%��qz�xD��n_����m@�j�A29��E�-$�)LB�v���l/�DO��6}�d�,��G��zM|f� 	�X�k�2H��Y��7�%����f�R�N*|���]{E��v����t;��*��
bJ���%��?�X+����e���y@����;lu�i?��q��b��D�eRPG����)s;���A4����	!1��{�3+yjnce�r	��>��G���DA�� �?#Ba����&��*j�'����rm��<_���=�A�yh�&
:�[BH�@���{�`�K���������>�����9�l|B��1��|B��u�������� ��|+�^�i-���1�%��o\s�@��Q�%��k��F�����f�s���[2H��D4�Rm���d�|[eo�O��� �xB� 	`F���
�T]����g�/D���F���;��������3���0��5�db!���kC}��^���(��9
��J<�<���>�i��w}B� �l$3p�6U�h�AR�������N�vmwr2X1�g{b�h��2���<k���|v��#�<��aW�����~F~�,<#Q����h����a&�Mh1}B�A���N�"�2H(mu�j��,$��m�)r;��62�(�(�g�#[��h�@E�T3p;����ND����b�h3�=O��kc����*��l?��Qp5�n��h��������JI��TpD:����'����Z%��	%��}f0�'�_k37f�gV�M^3�����D�,�o����Fm <�zB��h��A�kN���	7fG����;���2W�� )��Ub6�W�%�TB+��E)[��e����:��=��&d���l7.�gh����a��y���W��*�a�[zU>%��T�K/ 9Tx�1Z*���<Q'�8�+�g�-�d�������&,r���T0��������A\���A�l����{���vq��!i�p��5�-���q��hH���\I��_������"����V��]Il�t(R����3���+)b��S�.�F�D,���T����!
��l�����=���6�E�����o�l���L�x����it��%w�.��E��=6�������~S�Z�����U�-��:2#iT�m$�s��f$�6�$�0�#�"����Eu�^� �8�R�Z���_I�����F}� ����UI�����]��z�u" ��%2t�F1��L��;��([H}�.�=Q$�cBhtX����9�T��*4��G[���#�I�e�Xc��i��j���tE�w���z����M����E�)Kgkkrz��kk��,K���v�	J\N���U/���������:�����t��iRY�Y��H�$}��o�yi�0>GX0��=�Gz�EpA�:��������e<Q����"�T�d%YvL��!�����+�J=;D����q�X
N�v?t��i�K�HV�P0\�C=��`e�H� ��m�Y��i,���H����*,F
K)34��#y~	��;9&F2m�o�\����
��W��f=a$_e��M���k=b��V[!�d=��e����{����=��TDt [�8��#L�;����g�(4���i�0\���c��������5B�)������=�E!��0���U�=;���w��w�8��'T��pZ�	���+�������u��?�p,��N���v/BB�pF�����g����V��Z���=�����*��!r�I�4F|qi�ncl���U�~C�y�%��G~L�l)T��H�T��5��\#���#������?�V��]�<R?Uh�E��=[D����N�V�l�~O��C��AQY�<�����K7�����B&�����������#�����4���QIu��ad����lz����������MM�aJ����H v�����/���r�n���H�g��k(j��������C��{�HZg���Xi�P�,��,=����o��Pk_��V��|9�H���	h�������:W*����#��[dsll��7\��<�eU����8�gd�R����IN.����q�Y�G��lh�g@grBK�v��Q9��S��H�@�<M���G���Ak@4�F�p.�
�f����X��H�zSo�C�(�, �M�v �%��������=��%���N>����VA�@�S��y3  �0+�J�QA��2��qy$AeG'���C��-x3�\U�x��P��P�W���������P��2�G��W�CY��04��X=4��aPYu�{,E
z"V5�G�sb��K�[�y���6fG�]����V�<�U/|g�5.�d(w����[oUd���!�u�o��}TN/�����D!�/��04#)�����Y����4�)-��-g�o��H�����<��E����C�:�y)�_�x��]U 5OO������I��s$�d���Q�����8��������SK���P<�^T��e\I�N�%$�h�������oFC��P��V��b�J9.���b	3j{��^I� �;��z";v
��
f���H���qU�6du)y���P���/�J�A����!]�GJ�%<@z��.X>q�g�
%��]\�EO�=��D�1���I�;�%��3(jS�"������M�/��4�o�"���#������Q�C�P�Td;<�]�G*��F��t�e��[S�~�[R/p�)b�7��U�����'���^Zh5�j���_�0�Z�q5�9yRT1��5����s03o=HG��`Y�}�Io�p[Gecq�R�:��qy$��	�m��D�^�rRq���xF)x��0��^C���t�����Z?V��t�p�K���Ob1?�w�����0�:tN$�}K�Pq�����U*"n�785�ZY��b!�D6��4;��6L�x7�P7��Je�n��xJo�(�ci{*���x�3.����!
�#y$?l��o�3��o(���������hd��FJ���)����1��q��E�k�t�#�<$�x�y|R��;�!�S�.���*���G���u8Q%���p��x���-E4F��I��#��~��1O$�/���;���u�.;L~F�Nh�3I�C����������hY�X^��r*��GG:����$��s'5PW�Uz�$	�������S����+�#m�Jt���V*�w��4��d,�`f�������@+�Z	1����N\5uU��o/���5��)u���G��F������3�u�F~���y�VQ,�+����]������bG9���$��c�;M��4_�a2R�~���Vt�oH�R�����e�Y]
����2��-yz�TdI����pU�����:��}���Y��]���O������]����������=����Z4��S;R���xp3T5L\��l���Q��j������T�I��iI��9���m�2��6d\e,�$]M����hP����=����G?�h�����5u5���{v��$��#P"*�(�>���
�P�>p��Mgm`q����>�������*�P���[�����32���5�`|���,-4���ze	%	d�B�D%�'F&�Du�j$4���%��[�ec$����lU���lw;,�$���^#�7jy�$�j�8�����m����4X]o�m��j������m[���N����
�����*��L�N�����*kp7�8_��Wg��E�f�d�YB�"�M��d���^���n����'i�������d��)��,��5�I'%CGO�y�I{
v���{�LE[�����V����/<;CY?�vm��U�������%������:4S`x������j_�d�D�l��#n1'$�di���&�>D|2I�m-�q~+��I�*���a����h�n��m��b�g{�l������\�I�W6r��y���0r��ra�f��������%4�AC�$�dU�Cu����o ��!��4��I�bH�#���g�L�_��{$���;/���j!�3�$�.�}�0T�������uOKJ�H�}�ae��J���c��`[��d�44�)�����#�R]�������"�'\�#O9W���[0f�����q��q��y�Ohn[���nh�$	��|+��E�$-�$k`�m �,��~�" �z�lR�;�<@l�>��t��t��,�J�� 
sB���9�+�� V9�4;��Of�;��{��@2I�Q
J\���<�Y��������]*������'�vtk_O7]Y�����������:����%�$����8h�ko:�KH��&�H'�tj��UA�j|p��(�O�@:��������Q-�*��{9����KG	H#A�.���#��n������gd�����M�4)�.�]0I�!������I�*�F��e5�^�5�~�Kk��I���CL�f�����G��]��i,��3��������d��G���3��W:��Q�n�F�l�����e�}�t�v�G������J���2C���DG�\� ���~o���`Ue�`,M��6��::��z�$�	������^�D��>��Lf�Jbd�(c�\��\�;�{���Q�����I�,km`��n��RIvC��n�p�����(�w�I:�{��k�5|�/�O=M��������*�z�IjU�T%'�thzE:�-� a+�#`�������\i���|�zE�+����L�v
���&�=m���XT���/�d0�z��
�i�����[U��_,I8��3X����b�6P��}�;�^sZj(�-hA��+O�v
��sbu����]��,n���_�j�[���>x5�S/�$&79w��j4�&	#�E�3t�&qZ���T\���I�
'r%���/��i��l���_0I�2M�+i"�G��BC��0[?����/�&;)'�[JZL\��q�v�f�Onv��M�v��6�|����_������%}3�K�O6���-G����VS�4���S�^�C/��t�[<�F>�#j���yh��h�!�Y��`~COO�v{&04kg}/RNK!�W�������q�f�7�{��[4��z�h����K����YY�7�.��p�_7���G���������Xr]l�]���<�H���`Q�B�������O+���p����TE��3B�H�=�$
�N��@��"]�X���3<U��%�:����
�[/��u�;��f����N�U'�w��x��f��G���L)*���{{�V����G;���|�K�Kb�~���*{kzrIZ@���L��/.�����(?���K���������rIb�
Nf4����!�.���M*���i����e�G��w�����z�
6�G�sbZ�h�w��Tl� I_���������Y�bI~(�C�wb���p��A����K�l1���x�3Ci�|c�Z?��v�Ft�����?rwm�Q��~	�q�)R�J'��a�N�������������{�+�m���GdG�~�fA�8N,�S��#�j�����:��XW9*Z\��w?/�CQ���L�X�]�6H�}������H�FS�n�K��M��?X�����k����w�B��I�c��PM�����4�CXG���$~/�vj��T���M���U����A��h��0�^,I�����>�tk7g�I��yh;�X�M��p�_��I�vG���h���0��=�5Z�<Ny���(���T��_]�����6�9zj�u��1�8���7�����C�v���p���30���y�����"���I���<]�B���S�k�c��%)l/?>��K��P�l���}��PB��� B�v7S���^����1����������Y� ��_z�]'��)"��I�MQ��3�H��7�H����w�Ws�$q��*s���y��`��y�{�������K����|^0I���LE$W�X
S��Y���������E���O�k2�L������
��yxT++���T��W��k�����y�$c��<�X��g��s����9��M�����8Q�.�1�N���I&q���u>#���e
RXxFY�T�������y�$q��!:\����4���)����~�$�����������\>�����M�)����}��;O�V���m�������f0�}hZI:��mQ5##6Z_������>�@rj�`�I|�����U_�TxU���T/�$h�^u���Pz�;����t7S��T��y5��&�����CR������Xw�����,���S���|�I��4i��`f�����Uj�4��yW�����F��^.��Nl��j���+yGy���?���L�!-�.��)y���(4�t�����.����P��L0�OSN���HUX�|�I6���IR�U/�dw�I���.���J�f���^�{[<uD!�B��,��gC�l�����+V����U�j6�vrM�S]��`�o������yog�;�QX����
�^�D���`���t(�4��LR�Kj�<5��	&�iH�Cq>������s��y���.���O�}A��Z�9���q�&��K�������w�4Q�),l@�H��X��4�hL�/5A�A���'�#���?-O!^4lV��
*e�N�AJ����	l����&�$A�j^Z@bI���&�)�*�����Ak�dA�FnbI�����<v��i?������X'~�W�O�������YIQAVw��Q��}������������.����������u�(�#��PU��M5�y���)����j�������K86�M��3}�?�CYL�����I�	�!]a��R�AC��)u�C��Pf���/�$���Nn�����������������M
�t����p$��p_j
�������s��E
�?�}H�:>����u�s��51�����O2�Jw��Y8�D��-'�����n���{��������q�dJ�U���B�d��s4�Wz�c���)��wh�,�6�]�%�2M�@�*g���n��ZK����=L	�a��yp"��ned�T�Z�%���(b�yy�X�%�����RW�:�=i�{�0�6�\G���`����`��F��M�#�d���O�T1�L�L��2��J���1t�$K8��Y��iu��x�������zwk ���$Gg�u��U���`����KM9��1���^G�^]q�A�������:���Fi����V*���\�$��1�pI��0�R��J.Ip��=�Y1k^.�;�0���DU���	N!(*����� ��M�FD����%')�$H��<���1t�_�����_�%	���(X�e����8�	z"����b.o��d*�
ZG�^��:A�YG�^!!)���xB'���:I��%Y�#��~g�����5� C��%�3��5�J�����r$��l��#�����=�\�U�6�BM��L�v��5�}�"��[w�_x��H���:vlie�u�`\����%�$��A]�cGa�|�����S�ut�'��i>Y����<1=���4���.�d�|p��2�J.I�_UF��E�+�$������G��Qg�����Lx{��Z2�Ams����c�~�U�[�%�l!������w�<�>�bX��z��8���"�����z��`�z���u6I.��M#��h�U0�$��Yh����%�
������� +���z��������R�.�*^S{b��Y&Y��^`@j_	&��~B�~&�#:������f&�4k���WA������4�I:_��'�L����a=S�p�Y��������D$R;��2���e�hK��v=���46�]��=��b�@���u���H�����&�$�P4?F�54^]�B��Mk��NU�Y�B�Di����dp���At`>��.��ov�uMC;�u�k����K����g%�A5�
tdB�M����g�k��Fb�1�6QX�K�
�C���K��K��
�����.�$RP�j��2�oFR�)�E�z�S�9i���Y����Z��@n��y���^������w������L���%R�.!:5}HH� t���Q5#��F���q�W�-i3X�'o8qID}��r\��I��V��vU�IU��h����!��0q_�v!�|�Q���U��lt_w�Vw�I~*$U��[��l�X��j>5�U�f,m��7>��.�}�$��}�`�Yz6�/�����[�~��b����������tj�N�����y����F�}l-�;[HFghI�F��z�v�������@_M������V)�N�4j�E��F�����}d�(���{�z�kT����9�Lm
�D�
2���;��K��mKB,�7Z���8	X�P��/�$���`}3\5�N��f���������-��K;
`�_Wu���K&Y�9�dh�����{,���
]���� �H$m:����i����"�����:R9�g�����N�v
�?hX��x3��D����84C����es���{!7M
'�_�j�Q���%��C}R�$��h?��������\�V����3�/�y��C��Cr������,����t�M�}�}�Ki�~���������I�DG/��l��6���K��kW�[���7n��XS������
QAg��l���t����L!@1���s��j�;g}�DQQ������2	(���<���9S�z��%��!
7<�L�R�ki1�/�����6�z���'|C�pFC�=h-��I�4�z��*����m��F�R����������3�}�$
j�M���}4n/�T�w��c���m��,HP2����}F���f^���L�n�O��������}�e�~�$��@
��.3�/�$|m�����PZ�+��k���6�f�����
�i�nN��AQ����u�N��Vk=�����$���;�m�g�{�n�����R	�7�6Ci9���zou��&Y�eu0��6��I��'��&�����l|6�:��'�Bc�<3�J�v5T�I��a\,IT�-����G�Zw�)�`Cr�p�%i� �/�?����,3'r�G���r�;:��04k����a��,8�v�I����p��	_2I��at1@��������\T���E2����{S�bc�>r��z��*5ed�4��m� P�U�w���S�4���YeZ��5�w�H"1t����"����/,����{�Hr����W�j�sUFk2��|�[�GCR���_�Y�������j���}��q�-��7�>W�gp��T��
��}��s������a������]5��v���>L=������;��>*j�:�����rw��	���N+)w��N����
?Co�m�5}�/�+=C3���N��q���34kH�LV�����v|�& ��2x�.�����6���e���7���B+,��jjK�d/,Y���{�1���)��O�z%���L��Ax�Oj��m7����M���7�l�!}F�KnIeu���q�J���s��d<C_�4��PZ�k���5d�
~y�^������u�j�q��j%(�������Lja�M�6���-���Y�^����V���-,"���G��M��P��T�#���m��T������w����"�@��g���lr�9VzY�CI:(x�7�6T�����Ws�$S��_�����K2�\�o�o�X";5L�K��VAa���>��x;�#h���]������U�{w->��=�f����Q�7����gd��"D�3�h��[�?8���(�X���cgw��7s'%C
���4
��d�*���� �W(1
�T�w3n(
L��a��K��Z"1��	'���
�y���
���"��p��^�z*nM��/S]�F���b�<�O�HC����^L������6n(��	%���*��$�Z���M�oVI6R_��:B��%Y�+�'�������N�d�t���%�����=���)�$���N�q�vwy�f��6�2��O)���HX3����*�hv�iEc��T�_;�lWQQ�������@��[9����e� }XI��D5��M��KM�{����A�z6�:���ee������G�7�~�j.�$h�w����7]������@o�bIF�`H�|C���[��~0p6��k5a~�w\���Yb�R-��pRZ9��wC{C-�8^}F�d)�N�=�����g�����Y	�T�����r��(q��*Zu�n������w��������f"I���(�zFec��|�w��Sh��D��{:!].	��~�O<���4��h��>�4v?/^�7����ZI��|����Yy��������"��]��W��h�$J�E�����7�p��)	G���V%�����#s*��V��T������$�8���#/��W2�8��}QI������^*��Q�:����E%8�m����
_*Id�`���x�W�~q6����y7u����>U�������7_t^��F%��Ee�D�6oj��(��>�w�RT'�6y��U�o����K^���|���.��?P��/n�Wh�&��GO��#���L*F���@Gh�s@`:2���kV�,U���Ov�(����&�$���X�Yd�/�&�d*�0��a��g]���On��k��<0�"Y��`��w���_��{d��h�>�����E�B,�����C�
?O�`Ss�z��Y�3�{'������*�5��/D�3���=����qWk�q������@Hr���mwtX]�{M���g"�.��_�aNH(��,��6���S�=�O����	%��}��)|F��
D�dQu��M(I��n�F_��{����qM�5u)�	%q�5�Y��$���8�?H}�g5�$��zM<�3a��r����&Pb���Y��	�
�B=(�`��h�_����8����gd��f�����y����`4{�-�'����x�	?����}	%)�gsu������l�c����w!ZD[� �<u�U�,�������BIE����+;Z���X�E�n�u[E�T���1t�$����7���k���P/�>5GX=:v�jy���<�32�!~��inb�3���*�?�v�}F�Z";g�#�;2��.VG��9��������
�2��#b�X�E��p?M����a�~(|E�9�V��=Z�����02���~�YjBI��A��B�w�$��
%Y�5J ��ad��=1��?,M��z�T�x�&���~+	%�z_��)�����������d��TJP��O��S�mBI�B���i��_vJ�%������51U��Na����:{H:���U�q,���!!�[�$(>��A�����5�$A����Sk�w����WU���$��z��"�t�?g�'tD�=��`��U�n���>�-J����#�h��
�&V��i?���������B��V@��\5�����R���vvV�����xL���~�ls(����#]��:��1��M�xGD���:��xm������������"N�k��q�#K0��h��%)��6�lZ=����������<#�zm�����d�A�z=��F(I�j�^��}�kj�m]]&��K$�fW
3��=��%���v�zn�D���U�*P��������w�G�c?��u	��+���5�/�d����:T���k���������IpY��|p1��]�JpV���f�l����;��8��j)������Z�J�m��t="���f���8�#]���[WT���A�] I�f���W����yia�k("�] I�G���z�O�C�Q�}��F��]�v[<���I��>C�)�N��`���wu�dXUU����zB�i���)�bmMB[Z�A^��@�=��ZZ�+� ��57q���}�y����v4l��j�\�FX���th�h��u#8�*��~��|;��!�@v���9%u��f8KnG��}#�W�[,��H2v�ba�mz1��f^����IZ�P%]����t�lsuWY$z���okI����L������4Y�>+?r���hW�Ij�v?J�I�c��B�������wt�k���#f��l����)[���P�NLxH�D'�f(v%^�Toi�������k�$z�l�w#6�:?l�o"I�����]��v�$��U��x��h���b��5���
��v�C�����Z!pG>X����a��e���#��K'�8�������+)�nB�y������g�����������]�
�V���^�v�F����8������&��v�$a������O)c):���1�������Ck���~j�ZN�67n7���a�\���I/"	��������]"I58�Mt��b�I�9a&�W CK�vs����l�������`_�a�(�3�
me�uG�-'����X�O�N���L������(�>l#�WmI������{��*	D����4:*���MOO�E�����gh���;����o�H����[��:���4�&~���O���Plqy��T���@|�7|���	@�	f4�yI������R���{VE��h%�:��4�"I�T�Ll9�g�"Ib��HdK(Nh��n�qA�����c�i��]��Tk/(�b(�O����BI��%���^(I0�t����BI����u�c�	����Jd�jHG5g;j�@?�g����ivhG���~����0��=:�J$	}H����l6I�`���n6�l�(��R�%!�������3����W(���	%lm^TC��
<
���|�C���
�wc(�PC)��\�\���f��n|q�|�U�Q�����khqo�����q������c������}�����EO
����R��q�����>��H��WE���kc���d�
����[��:pC �����a
�@���5�RI��B� jF��/��T���4 ����3��w����_�{'�-�������2-��
�=E$�A
S�����L�5�h����/=�m���t_����w�S@]V	��	����;1K4T)N�z�Pj�y�X�5]��������G�s�~�$q�p�1p��J�@�~�R����4��t�X+�z��v��ea��7���s�����A�%�
e�?CO������PF���M;���5$0����]p�Iv�$�������V�n���S��p�O!sg�l�U3�����F'SmO���c5'0����\��n�5�x��K��0� yt}�/.I�#�TF���J����'(W� n���2�fxLY���O���4�v������7t~�wD�����=�q��j�����oP���=E�>�����R�\�9����{<�����X�����rIF�$�����S�0��1�����#z�$i�	xj_���3�_(��>����Q9������b���>E+�+�)e���<���bbJ��G�+]Z11���S��A6��
�h��jF��[��91��^@���cAoH^�=
���PO�'�)����P��Eq;�P��nM�{
��,l��!���k���O�EO_������F����M�c����������P��)��M��Kqs�p.���`��I��3)����{4?Q�NJ���3���6�s�������7�!t�<�7�S���L�e�����Y��0����{��~�-.0�n�Hm�(���M�W�.�>2�W����#�Fz�����(����'@+Z�/=��`GS�����'�d�������$���������R?��{G��
��}3	&Y��j!�?YO�IV���/}H�`dB%�q�C��U��������(;���L�hd)oHc�>�G���{h��[=��G��sM���Y�G��;5Z��)��P����&i���z�M��wi�������S3����4����FU��!���Rv@�<���v���=��j�S��6&�X^Uaj=�V��w���t�����CZ!�������R^��Yp������v"���A'����p�v�N]���I����#��Y��I
��
�#
[�')�W����8"��h��U��Q�C�H���M�d�@�Vrr�u�����}
W��uC�Q�f���=7���~�N�N^(��_�H��M��P�AQ�8zw7e>�"��$qa��c�&L
�����~\:���F�
�Q���t��y����w&�D�Y�?�F�#�����!f��=p���2j��^�N���o9\����K(��N��o�P��l$��,4���L��3���*� m�w��e�{������m��$����^��G�^Q# ���&~���{F�#����`�G��M~�-��|�	�D�Ir����<��n�K�i#�g�1t��N��<�!��Q���D�uU�,o	#O��b�##>�$NH8=�(KI'^���B]FG�IL��5��|$�dv\��l^k#
�^�F��.z�P�I����,W��t'm�cu�La{u��9�Ta{\:�������SG�����vws���u������<��������/�����ch�7Q~�o�������?/KR"�����v]�2���#�NRA�1$��I�I������g+;.����5�e������I[��I'�e�����V5���+�?��H��f����
;��u���;�^-5��#�H:I1�N�q��NG��q�*��C�
�0<��mK�o�Q!jpU�I'^V_�>�P�IF���pRo�-���� n�V`;*��%g�����3G{Fm�+)���M�������!�@�G~�t���V�g����#�$�+�&l�
��F��L}�?w{�!{���T��8
����h�B�Qw��t�J��"��F�I�P��5+F�����u �g��U���}+�qBhrG�If�|��[��q��'���`�Vo�H8�$G��aAnr���&[#�O����G���DNkJ����=7�D`4�+c*��q:�plO,���.�<����*z�tl����C��O�����/��I")SJ�nk���|����C5/�$����@E����|��e�[Pr0~���h���gh&Fd�8�I���9�}`]�a8�X6R���� @�0�#`/��~:?�v>��t��y���Dw�k�5��f�pf���ef��H�ur8��XSN-�=���E���E����n�s(f���������P��s�q��c����Y�I�5k���a!��$��'	(�DR!�@���<[����O���K�������42�y�z5-�����N�����T����d�C?��]?���z�����?���`�R�sl��a�s��eO����ZZ���W����tK�v	w���m��`�YZ��$�$_�K�v������M��,�$e����������l��P�1U���'�>�zU�us�1���L�>,�������h��D�����[#��������mO�h9����dC;���TO�tz��G��!���ckXi#�}�pS���N��-��5Z�)#
/!zZ:�kt�Q2��������FL����H�tSZ �h;�gh�k�G��'���CaY��M�����t�?\�~�^�Z��u��u�A�����<��rnK-;l�"�3���/\�_,�U�^hA���-\��JU[�N4l}�>�4�p[�}�D�-{o��*F��j����Y�IJh������9����g�M���C��g�&�X��LO����k��I��C��bh�I�n���-�9i|f��%J���iW�X,����Z��k)�	'���;�6���a�*��[/3i����Q��P��#���o�D�Fp����F-G#S%,K�v��#
��%u;���&�$�������k�>x�8�����Fz!5���d/0/�Q��������Pz�ql����n���P�	�������@��}���j5�&�����d������.���	�8�4���E�����;J��>��&���u��O��&��K||e�7���=�����aAN4���f*�+�b�o��`G��@�-����6�[�������&}��|MCJ����.��Q�����v�$�&x����8bM|�@[j�j�Rg��zH�]4I����b8v��&i�"�m���/'�G�nXXm1��&]��v�X�����{�M��B{+z��?5�[���B�b~����7�3�hA�#/1�������.��&�u#�3�E���;:^��O���:-r���&y�Qp��N��.�$�X�5���J���G��|���3p�I|�!�^�>s���gQ-v�=���!y^9+�	<���ua�&9�:��LkN��I�d?c��c�B�������^4����@;�G~��Q���B�1�_4��`y�dr��X��#�2�6�bc,���G���V�����G�GR����)u��V�c����z�k�'
��g)u���Z���R�n��A�(s��o�9��6�:8+��$�?:�]zCiCI�U��G�S��gn)jrt5�X��U��/�#'�:v}���S'=_�#���%�J��s�=���������b��%�Cb�������k�q`�����?��
a��y�$��v��di  LO���O��f7��R��T��������]���dW��uO����FA�IW���{G�<u;nM����Iu6�Y�_`�6J�����L�V��ZN�{��=�����M����LK��)X
�rw���:>�������o������T2�W�2�dw��[G�P2��~�p�-�?b����O4���J���\M?��=�F����]�F�����%)X��U�/�$����7���rI>'�h���K���uX����|����\2�K�����	( S�_�;�����Tx9G���>�grP-5���q�xD��?��G-��\������k��������B��|~�$Q�:�"X0.�$PK�"�*�j��1e�pH���zCi�'s������O���S��A�U�U�c������&i]-��\���%�a8;n�(�KO��<�5��1��w[�[��gm���J��pm~��y
�}n'����%�v4_�j�P�GF`4AoxHYT��l��iu��x{p���XM	�8���{�B��<���iH[�<���S�T�k��t��T�i�/:N��v&�7�_9o�s�T�p�i,M&��%�d[�%���l����nu=%�����n��&^��uy��(Z��nc�3G��n�(�#�$Z�>5�}����6�R|�\�����~Ci��j�S��>;�YPU����G4��(4B�4�_T��c�U�O]9���aWPz�RI���l`6`�3?���3�����F��'����#��b���3�%-c54[����k�/�x1�ot=r������J^12&
��^g��G�^aH�����F�����b���������4%�v�0B�����fG�������%?Z����E������� �QV���~��5Y��'w���{��.Jt��AS����U��L�C���@'���5���SQ����1yn<���0���*w��^�;r5�I�������@�8�e8u�n�.E�O���if���_������[���C3G����3��4JHyV��uB���q�*���]_����F�K�Ypb�8:�����*���Ntd6�������z��G�~VIA
)~����;��={EiV'"W��<*�����UPH��<"��}�`:�iz�7G��l��S�c��#E80�����3q$s�	U���vk��#�k6����	�Z�d�R��p�+G��]\��n�<'�kn}���~BG2�6m�MMe��8��t�wr�Y��C{�H��#nN�<����t*�G�������L0'�#m�0fK!vx����<��j�P,t���<�Qt�R����q�L];�i)�=j3�G�^�$p���[Q�c]{E��W����KI�5��H{ 	�l�lY�@���yq$,�Ww�.7���s�.n���wM���W{����~v���3��>���o��8S�����@���z"�&��z�4Gg#
Oi����LID��S��(Z=3G����"^M�3q$(�y������y���Nr�$=P>��5�]!�a�@�Qp�g������<t���	���<y�'�
�8d�~�G��F����2{�����6�����L�)���2G
s��;.��a?iDU��n&�$J�s��j�Z�LI���3��w&��1k�w�<���{�b�+�6��q����|���F��ut4�n���`��<�ga���g
���y���[Y�=�<��3q$��T]������<��>q��2��7�	���16�Z�
x�L��@h��&
w�8� ,��]�lO�{o������������6]��
Z��-`�t�l0P(�����*NUQ�����/���K/����f>���{�D�����lRLA��7#3O�8��s�SX��
XGb��X|F�?�)�q$u�|��bD��#y��8���O�8�v���D�����gM���OP!�/�zV�'�����}�HR���`��#��[I8<�]��#����T(�K������"���G2S����ZH��#�nd��K�����\��Q����w��-w�C�X?Y���Y�u��z�����z�;G{r�N�vd��I�X��Y~}�H"�
�p,��F{�
��D����^&��g�%>�����s�(�H�u
��$�Q�RA���!�]�mTy�/�z�1)=�K���l}�Q�P�?����]���7MW5cYZ����H|�6Y-}���&�a��Z6�K���bqi�Gp�������u+9��^�u�,
��.p��r{�>�)2rcH,-E1$�h����7��zhi4�+��F-#�/r�1$�6�=C���0$m��M�0HC��YGXRl��������b-���8D;`Ki�
g�p�v��	��}��:<b`��k5��},���B	���9�p�6�Y���4\s�#��r�K����ccH��H	&,�����1$������
wZ�W���LLx��4_��xU7$S��`Rx��C7����':��+�42���iT�M��
oyxN��X��s��!��6]�������aA�H?X����CE�z|3z�����!��4��;FbiP7:���8$Z�x\�%�����!��2�S`JW���d��hY#nSb�I�A3Z���X��^{Cw���!���+�D�sH�{�#q,�@����R���uI��&�0�I>��J���vz�2<1;Z������d���� 'L�i,{���?�J"A����������xlI�p�#��J}RvC���<a'����Q$�N�/��J�c�H
{�	�5hErep�z 26�$������m_�U.x��F� �yv���fS��V3��4����>';���_�E����J���F99\u���$�6y����z���S#S����8�FFp���"���lcbP��7`��jJ	�����j����'7�J��q���X;�����
g7��."a��&�������
#��e�PhQ]hxnv����]j!���S�bd.���{nv���C����.{��� �h��3�04 Q�ae�;�h����J#&�������zt?K�.V((D==�klf4��]�,6��I�N�������G��'I#�m~�<wlI��jD��$:6�$���E���[j��
�aC�<#��"��O��`�>l����#(s(j6���R�������'%����3�V{5����gf������c����d]��u�!��c�H���:����-=v�?_�=�FR����1�j�a$���z���XZ����w�^�Q,����������Rwl0�A40�����)����Wu[J���l8�M#I���bu����i$-�3��K ��M#)�����
��odJ����	o	7��4D�N#1'����,��{l�vb�����%H[��C�3��Z0��+���e����v��y��)5K:����D�ig������q��HZ%�A��#��}��)��*oI�r
sL\�s=�H`r9���5�dTP��T��<c��cBc��8r�qW�_�(gf�����I�!��}�G�6@������Uc�zl�L��\�����vT�t��ugW��)�X��1]�N���5��$0]�N����(�m�����T@O���%Y6`�����	{�m���~������P r��9�E�d��X]���<����h�n[B]�T'Q�z8��]����P���}�:�wq�P_&����l)�#g��;�����:^�8�
�c�	NA�����R$�������w6F���#7��t�;������|�t�;�J7�n9Usbf�����k��3M'���l��cG ��I�c��������s����E;R>�{O�o7]k,b������������,L�-�1��#	!��C���p$%�������em��s�'�u�;�BSb��n������dZ���������wiY[�7���H!'�HR��un���Ol<��^5�cn����Qo��*3����~6�����o��
7��G�ws����|��1T^�i��4TkkS��Vs�Hr��`K���e��d�c������w��z5C���{���w�,�dC:��s!�<wb[4�tw~��.�{�z��$��B-M�,���G$Z�i,L��K��Q���Hd$;OgpK��kaqe��\�w=6�
;
CS?�"��&�X�������&������K�|7��W_ 	����D��C�L�w���)Y6�~6D�h>�<IJ�����I��2������$i�������I�6I��,�K���{K���4Z�57�$N�eg���+�����cH}{G�4�OW�s����DU�����`=��kA��D���.����F}a�����zt��N$��p=
����+������z[3Z�w��U���I,K��MIqRV�����=@
V�\�w��5f����X�yM��,����
�&��5Y�'�>��Q�o�D��$��������Dc��fRM���;����JA	���3u"�hH�nd�`��D�����P&:���)*?��[A�����v��9p��X�!=�4�[��Zw��@M���V����8/���F�'�{d��Xw�>���G����%v��*?��/dX�K�C���W�#�a��u/�{�����v�{MW�Kb�F������].g��j���o"�L8q!}`��Jw1����b��
�+�%'lhJ ���tg�!��E��R��V��=D6���R��R�N����-o����T������o��}q[�d��"hm���\�MQ�����t$I!E��d\s�O}��H�&9Cka#I���LDXT��$�����_I2�Zp$�e�iQ
��H�2�HBq}�����F����P�V8��$�q$*A�s#I��wvA��\�p���~+i�l7���0z��S��������,����i������ZD2[�@e�l(�9wV�TP\���r�Z�a��|�3�
-q;��4FM��V��#���W��,H��:��F�����^o��#I�6���S�lG��T54��]+��5�
��%��	I�TuU�U����<��D>v}�K�-1�g�r�H/�e������
�
��`C���� I���x+��k+�)����v$�q���V�H?h�C[k�>1,~	�yLq�����TG���Y��.��@@#�
����-mh#Ir��R���[q$��@�53�J�}:�D�8�5;�~���I#����3r�I�Db������a2��'+aSRZJ�SI��}�k8p�K�.��J��|�>����zA�>'qI� ���'o�P�
%���$���Z�������$h:l0�[@���d�C-A{��������������37����G*U7��,?6�����=/�B|OSKX
��	�JIxB�M��=���>�Ha�\6T4�#_������$%��>��G����'�������%���3������V}��^��I������R��u���G��{�tD[eT��1��9��Ex�(f��t�6�d�aCA��e�$�9�29�i�RN�l�����G6�v��sD�6�?
��Wn7ft��c���G��	�1�k.�zZ���7��%Y��"��_�A�g��$#��zE���<�`�D��q�:������Y��-�k��C��Q7f�_��4����b���u��:*l�����������X��f]�r���+]�#|[�5�=M������-�����}���S����c��������"����j�	s
hF����P�������K�a\u���Z�I���0���A\@����Zp�v�Z2�����^����di��m��%	�X�+������if�"���^����Z����-ek�*����~�\;�T�z��U)����ZR�#�Kx�I�'K@������9 ���G���%-��0B|5s[��0Pj8�������K~E��02��KG��C�/�G�
�jK��%	���_
�U�\���\�_���������>��� ����*m@[\����cO*��������/�p��7���c���[�R;���Ohi����{>�:�H������y?q�����>;F��K�Wu�hY�H�Vd���14IGL��xrv�f��?e�C7����n2�$���H[���I�v���nH*Q��T����Zt�:j��1���zJ�@H$#�"��.`7�����V���<��]hb'v�'f[	��d@j��1���-������;uK�,�����0���A����V>jBa��u�f9�r���r#�������L�zb+HbR���I]Yl�6m�5V'��/����dv�,���m	X
N��$��/(��No�Hh*����m����>�dH:��#A.�@��m�Z�<�XJ7�F��7��f�ud++��E��>:B,-�mq��sR�L83������1+Gm=�p-K�.Ic�hN��R��R�R��v+���nI��Xy:���|�H��"_��[��MM�����0�Wu�	=+;Zy�����)�Z��
|��c��q$}������xL�J[/CQ���|�HFD����y�O����[��q�v9;��_��j4"�����]���������]�GK6MG�����t��\A	MY���ge����k��5��������>���h�����+g��wV��!W(N�P5��#��aM�c��0��8�:$�l����qI��v|���U�L���M�h�z`�K�.IC%�$��p%.y{��x��zV;f�MI������<y-^`:�`-?�HP(�.�-�K���8�rh�;B�M#�kZ�Nd�u�H,�I3����7��&�z2M�w�i$��d�8j��w3�e�
,�Ydcv�Z"{�\4�R�z�zqD����M�Y��K�p���+�Q�&L#�C�x��d_�"�[7���3���h���d�de*�$�cY=�H.`�K	�����\� `�
������#n�p�
I����P�Dk��n[����bw�Rw��������}�����yV�������������EZJw)Zpj���X3"|v�;���mt:;�nI�D�pmD�Q$�xj'f�������Q^/�G��7���B�	x�6��e����������	����F�\"����^��Dl����yaZZw��:O� �H���H�V5t�}������S��R��#����6%����N��f��J��PV�[�K�)!�^)m���f"�t�P7%�(������	oSJ��:o\���6��vb*5��"i1����=a��%t.�@@��~5�Y$9������C�1����`��8�MHb�ij��[��-M�������d��-��q���E�-�R�0Y�2��O,�L
�b5�	;@�D���Zn:�H"YrW����R�������z|q���%y�������N���X.M7��b�kk39G�z��f���/x��V��������$�i�����7���"I�a��SU�K�Ea:�B=�H�y���*$�b�[��,�oH��>�--��X��2&w���KZ:�����tq��=~��n��U��B�:��}xQ�3������|2%"+��R559����$�
������G�R����x����h����MX���"9v��l����Q$Q3D����E�zG8�:Y��E��x��/e�{u�;t�UN<tN���20��z����-9��� k����]�w+G�h���p�l�DpJA�R�z�T�)�I&Z���lM�Q����di�H.��R��9Z�(������h`��G�I��8�u��d���	�U57���V�*`���&�:�$�_&t�8�� `Z������B�D��i������+���0i��%�]�w���;J�H�-�����y2����)e��$5����%���	��]5�$m	rq�P���r�Hl��_p���6�d��!%�~7��]H$��+����8�y����5�hu4CL*�Db����JtW�#W�m�-9�%V��/��A��k��D���I$� �{d|�A-�I$me��x8@��K����%������N"irZ�����g����T��$����=d�R'��l~-���j���V�5�0��D5����l�n�D����|MP�G.��{J�-6����L��d�jO�S��t��UC����1r�l�Q>��{Z��t=��r�!U'����I�u�vq��"�4����a����
x����|N"y�p��"D��k�H�k_:4�a@:�H����-�GKs��L4>��t	B]���KdWM
V*x'u��d;$�6,�K�!�{�%&��.���+0Z��w���� ���_U��R�{k�����x.;�����'&S�C�}fv���piR!a��#+��&gY�g�u";��5@$"�	��#DRp�����!oI��KSi��H��=�q�V0r�H��(`�z�E���A$��/�8����\�v7���Z��&�7�K����:
!�9����K"1q�u<=�K��Q����/�a����,\�1|eK���g�y"#Wd/���N�Z~(�hr�;����=L�Q.,>�+s��i]_�W&�~^�vC;�b����^3�
iK�i�}I��� Kn������.o	K�����2�����8���o�A$����o_s�H2�
#5^�")l��'�"��G�y��.�y�H",�����A$�#^��
'�K���xU��k��u5;�K�YT��
��jv��@����S;�����a]����$��PDaRvI+�������N�w�Ed��3;�����j?�VV<d9cR$��B��K��AS���eu�`K�~O� ���r�]����I���}�\��r��"u<�%b��8��uE�!�$�e�W�o�A$��&=��@A��H��+��j�"y�H���2�oe�����l�`�K���#���X���f��U�)�XDb�Q������]��
H�� �����H��"���~����T=t�K���I����>PH����5X�U������}�(/��$�}K�����4�a7�f�����l�M<�T?d#�-���H�P��jg�c��H:�8��z���n=�}�����:�$T�������������0���� ��\A���q�"!.����<��A$,	NO_��H�mB��	O� �N�CB������A$��C�#��4x�S'�	c����YB��5[�
�������ZrR�-��:,�z�h(�o���dW��t��WC	h������|�j}K���#S��0Q��X�:�'$YW2��R���4�'vLs�gg+�i>�cA��$�9'nB���g�ysH2U�p)2T�_����_(v�����X:���g(�Cb5+J�F����1$���@�M��W�������,�*���%�N�pF&M�b�<}E�4��_4�����h]���w,i��:rYR��"a���]�v��/]s�[�.Y����|��N�<iJ7�ryK��^�����io���(���h'I�Ac ���1yn�ait��gpJ	��U�O�Y�CC+K�>8�4�J>�/��t�M�:26DdzHT6��6��j����
!���x3p�w��T;�@�'��tl�f�Q�F����N�Gg	����W��N��~ �(�A�U���]�L�O��)�W�a�������c,���~@Fbq����hG�R>^�����MWNH5wg\���_`K;:hX�t\s���������v\�-�z��I��f�`�[��jh�<�P���P$=�����gf�p�8�5�W�-aZO*Y���^_k�]M���$�!*�wmqgG.K^4I���@�n�,�ZM�t(�
������~h�����z��"��!���u}k��=3;<V��}7�
��%d��xC���(�lIb=�@��v�ie'f'���>%}�Fb��0�4�
#��<�B��U����k���"6���Ct�~U��lIe���C{~C���2�ZQ��tI�0�ZQ����5E����v����4F�j���R2j#A&CY�v7����V�t[<1;V��a�T$\t����2�le�H,1��2%��r2�FSR$�^�� ?������7��FL���Yze�H��{��������|�,�������;�'��gm��E��m|d�F3��d,i�'��( e����~���(���[����k���zl��w��F���8Xq��.�i��6���S�����:hP��z"Hd�'�������i��Q��T�('IA�{�$t����t)i@ui���
���I�~c� l��%n��M=v'm��Q�A���#)��;�x^�F&��Y,��D���i����4�	����q�6Q��]8&e��|$�x3U����D"����l���/�)-`��������l�����AZ��5_F���CQ0�ylei���{���b��C�1��(��%"~�[ ��R����[�Y�:������`$HSC����1r��5��J���"�`$�s|�������
8��\�0�'�#��Fb����b=�R=_U��Y�q�^�o��p����i�FKV�K�;��P�S���i������#@�X��z�������.���� ������SZ�wz*`�I��(Y��],��g�;�����������T���,�^xF0��f�t�VdHh�Er|_0�V�<�E�0]�*,���4�q,��UQ��}<���+��w���itR
�pU���@p��Tj�K��8���jQ�S���J�r���K�.��[*��Ee�E]������V��1��F7���5����fvo��iD
��nJ3CH(�f-c��w��Z.�������%����o��w��F���C.|uI����SfM.��Fb$��C����%�1��i��XuI0~�������R����k���1�f�����l�h�v[]���W#�����^���mJZy�~Zu�;��Y39�hR]���l��-�1Fu�v���
t��q�e�>obW��M#��P�d��@���i$�����Y3���k����^�����4���gbgU�\6��h(`����4�iT9�+�$g�T�"�U59����?8r1��c��~�>�h�`u�;�
�����h�Pu����\QX�3��%��t���Al�����]�n�U��u�@��z����o�C�3�0�l�:�g�0�T�/�2������[�����n)uT����c��%��pn+Z.L��K�� �.OR`K�,nk4�Z��D�
j�>iNV��nCw����&���vx��"���3���;����5,4O�)�
�l����HU�����	wu�H��������&�\��RH�of�HZADF��n�D"�7����^����Vu-OXi\��V�
V��}�+��jY�1������/�+��&a�p��X1��	��S���w��L�q:�x������O�@��k��zxb*|qK���t����8j�km�.E�l���f�4�G��������r���ou�H�������n�H�A@��� �n��e��(W]��S���jjWa>NuI+MY$��a���"Q�v�6�2����&�iiE\s	K�k�HT��D��zf����	������}{0F�K����m���Vg��F>��
��-�;jS�S���:�d$���7zpVg�X�n��jZ������Y���F;��xe'v�K3,��E��Bw�|g���"���nl�v�8z�E�>�V �D �,���'�=P��`��r�*Q����F�/*�����r3bP����s��yz���2wI<~Ku|��H�d#��]���s?	��<���p�Z�P��]�����Xu1������MY��;7P����%�����[0��P�e%oB���s����E~I��W������Hm����s����zP��FR���*\�zI:r���
�G$_O ��HF�$f��R7B�i$Cs��j�2�\����^���#em��M:���jmi�%�:��g�#���]G�Y�������a���F���(����l�F2a	U�Q?����q����Q6v�m��#\(�T���n�����E���U�F�5Q��wIbu���|��n�@l4#�
��(�U��k&	^Ks��>��Q"�m��+6*�����]%+9m=sd����j�%�
2�&5��D�rfk{���&/�%kwSU��/JL���]~�*�����X/x��%j�@	� ���mi�%�|M�������g��D
a�}:IB�������4��aCd�1PsI��7�^@;hN#���/)��iK�>6�W4�\��=l]�kV����F2&<Xg���}���a����H��b�k���-1{F��4k���t�~@G6���t:�Vw#����\[JvG5kAu��fV���y��
���oK����W�������$b%��}�q����<1-XW�)�\��dj��3��
>!��g���c��������;�s����R���Dl��>Q[�4=�9��R�l�:��k�*�W�i��"*������1r�����J����F���HX@�y[��7�o��V���~}����������V��?4��%t��N#i�S���,�I)���������3#����i��|e�����R�������&fQ���R���"�a�=s0���	'��#�}z�H�p
���Ew�;�E$4t=+j.\k�����D�ws�:���(��\�������!�=c��t�b�����~��Z���Dl���B�Y*���!��I���������N���
���C�������W�dp��I!3mK��[�������XQO������m#C���J!���F��uZ���7����u� ��/��T-�:�e�(~���C��I�@^���/���H�m�����ucS
����X��8��EF�8msIjA����R��SFd��hpg�KR������`��N��������*�x��P�\���C���XBS����~��9N,���6���~q����M_��EUF^y&��-���m{D��Qp�k�o92r�n��A��&�GO�}�|����m,�%�y�,�_1��0m#�(�:�qQ�������U�Y��&2Mfc���nGS���H�I�\�3(� 2�:tS��d2�
�^�)$4~������u�����BO>�[�av��c�����ch�J?QI:Nn&����M%�L`I�8C��=K�}�~�t;�=K�R��'�)-��
�,b�A��I�!N��M7��'i���[�E�'Sb]Z�vQ�U�Dc���A��q0ILEb��W0^���Z�K���J��'"�����Ac�����$�6��M*t1�LbxT��l��=O��U<a�������d8�~i)�*�ru�����vW��PR:��9��4�&�/E���COE,uY}�sIb�\a�9�%Y���C��b�LR�������M}�IjD�]���~7��N�E�^�����f����jRX�`��d��,b���
&��e�;5�
&)
����]��w�-;lO�d�,�/m�=���*;tmL��EM-7,Z�����P�����M������G�5���,J�L2�GH,(X�7����Mg�m
����L�G`��z�
&���t����{�v
���0��z�v���e]<
e��f]�u�^��d�UNN@�Vku��D;T����P?��}Dk�@��v�;M4��;��H(Y�����>��mI��@s���%5�����L?�
&�h_?8�z��O`��h����O��#CG���Dx��
&���������7�����z�3��i�sr�m�\l6z�r0�U�DNZo{�}?�I�z����]��m7�Z_���vgS��N��g���=,���Q��Y��U0��u'S�}���5��nKjh��@s�x�nI�E��������%������ph����Z��H�o.�c+��P�/�8}sI��76X�7�$V`���e����)�'d�t��.w[���C Gl��n�;N���::��%w��G�
�qe|_jwn��=�dhTD����R��M9��]?T���F����o}SIR�������
)%z���>N�������b_��$a)������7�$k�WB2�]U���<2`h��_��$��{���3��c�_�����1-��t���rYMbF�i_jwIb�q�����/��!�����A�(�YI�x���uS��H-�bE&6�K���R\7�k��m,I	��D����$�=��PJ���%1�<�7���$��Qa���Kb=5)E���h#xnI�!Q/���nIcSI�|Tc��m�.pc	�-�A�a�E7�a`��-��2XGb]���:r�����������v�I��@������*����M%���\�Ne`�nI�5:��T<�M%�)5}� Pm8\�N&�����T����5E��4L�������n�U�}�6>��<>$���eKA��n(����������lNk#+i�c��v\�G��Umj��v�Q�
p� �����P,K�No�D%I�o��r����d��j�B�
7\�>�V(�I�Y������v����
��7��s��7�J�a�����S����6*2��j!�`���%�[
U�c��,��9��RWEul���iU��M@)�8�����'j�6<��N��tU�CPb��%5���5�t���&��G���[����"�����jw�
G�X���J:����p��=Z�����t��
�J���p&Ij�������$��:Jh����iF,���E��������"�r���&9��rw�DxWF���\�N�X�+M�8��CH�
)�|L��bHZ���K��bt�������lc��p��0����P�E�H�s$����R�{l,Ih�]	�Al,I��\pt����vK#3�o��Q=_)��b��t
�|%�����fb��;� m?��sl(I��
30�t��������9���<\����\���SY}L�vgC�kI3�8#�='6��eLX-x��O�9�@VxS�Xjw3��eY�����j�R Ib��f��MkA��Ibi����$����L���0�����	B?��y,�����d��y�E�Ib(�~��&��=���SA$�x������$��-��O�����- S�����c)�c�n�=#vR�m�D�v�
��c)��j���1�MA2�p���y���
lV�5W��2�����3�d7�I��%������3I��B�I���nc���j/���B�)�\u�E7����'�UIb�0m���w���
o����S}\��Q%�3#�%'�������@������S���T\���
1.VP����|I���|��!���=</�������Ar���Gc	������KV�%�L���K`f�R����:r�H-!yg����-�{�)��MF��N]����Ts����E�l��F |%�X.r�~�jH~U=i:�������bzR7:��B���*��$SshL-`���#�O����t��K�A��V��z���z�d�(�A���sI��Dsk)����c�]u]�0-���K�NH�?6���0U�]'�Ys����e$�Y��$v>$�n�uE���P�8P�|��88Jb�y���opn(IC���k����('���}n�:Jr��p�E��\�Q(X�I����P�����f�s`�p��P���g��(���"\=��%y��#��BS������gP3��}�����P8R|X��7Z���k�Rq��VA����L�Y�<��}���0�%l���	v�@A��t�a�sR�����\�deUb����*1�Lb�J�`�������4�:��d��vM�i|��\�v�ZEc�M.��OG���X��>�����T��S�|9�*���&���_|�������5�@�K���t��CH�C�0s)�cL�N'����.2�5�B���\��0�Vj�
4=��g�V�"u`-���vH"��i�^�\rv/Z�z����q�t4	7_������IBe�1�����h�#����	r4	:�e�$MG�$��f�
4���&�6���2���7_�:D�q �p4��S+��%�#�_����SK��y�m�^>�.{���L��*v#6�����\*�H�o�`�k���(�"N��;��xj����
��m_S�5�l('=�HJ���$�(���(�p<K*��%a7�IO�{���
Hd��$�������S�������R����\��|�9o���k���g���QW1F:�& p$�}<!��O,��� -}.�z���m��l�#���Fh=�V��6�hD���z.�z������&FnX[����%��s�xmp�,�YU���F]����*��������W){c����~|�'�j#�z.�:WmM4AB���%^�i�!�d!$�M'j���l��GB��`V�F��=��#��L�F�
;�7�\���<���m�������p`���K��v�_(���1�zb�#��V��#���I�"�Ot#I��I����p��z��$�wji/�����2������;]+��O��0�Nb����\��CqM>�M�yP|�R��C�n����Ks��`-�X�t��*O�q��V���C��M��G���	�!��#���y"K������k�@
���m ���Uf2/�&��
$A*�=&�
���'h��_Hh���6���!��#�l)w����y����sS}��p|3#��e�G��(:r�A�@�^��C�� |g������mdCO����w��6�	�8���c�	��kB��7���=����c��$1�e"�mb��PwK�)�3��/��Cg�D�Z�W�D�c`��n���91�U�k$:Y_Q5�%b�>P[�2C���7��h!ivJ
��z~v�Z���N��e��P�������%U������Of[�U�@����>��:����j��$v���������cQb'$`��{u�4�e��v������l�|��V�t���1��yO���ME���f]|u����D���R�������
K���F7�$N�����w�Hb�'$_I���PGH$���r{��4�|A3%�0��M#I���LO���}��1O%��u�e��3�����������T����{�Y��@F��{Xrv��	�kQ��V~��r����wA����c�H���Y�f���h����t�X�>�
#���G
^�a�H"�d@<o����a���W����D;��4�|���UQ]�p�ou���������*�k�Qwe��TeI�j��=�U��h�v�\4������������P���,1�Xq������R-Z��2f�$���k��o��j
�ER*����.nI���l��6��M�I��K
���"����!y}3K�.Q�t�!9�i��.i{�������HwJ�B��DO������]���TKrq��i�[�{FO��f��H)�D�O������]���G��%u������%5O�Uxv55G���������h!��uI|<�?
m��T�O�"�����,�W����X���h����^
w)'������'�������
����fY��9��A���z�$�����H�/������8z;ZV[�	��+���S�P�
W���K�.��X.���.
�H��Z��[:wn;�xJ���:����B��of�HR�?kt��H�mJ�b�3����#%��j��c��D��� nZ�~���;���O���'i�����>��-�����NQ�I�)��a�~�k�)\�2���6�$`6Z���"6�$Qk����F���
}�l�]�>�H�	I���.��!�R�:ts�/����Kl��>�i@������A�"���#c�t�����x2i�[�5�S����t�GG�$�t�������8�nK�@BV,Seb��kC��N2,��7�U�1a���f��_�n�[�.�
�@�4��'������wn���5w.@��P@)��S�7�$sA�>n���m�k��Qt��R��J����Oi�HJ��!B�ht�v*^��"-=�n�T-[����nK�LT�-����;�E
-Kb�1�Kv�i��:�� ��`]�A�UqX���A�����\\Zw3�R�)s2k�l7�����N!w^i�H������6�$W���]�U��S-���1a�$�`���>c�w�b����w�`����_�~�'���3�x��.�{Y	:r��z��P5�x�U����]U����@��������O�r�4��q�H�~��'����[:���]�>�����	oKj8�$����=s��U%�`���*�mI��h!|����Q$=�4�)����o���K��I����J;�DRY{�(�R��/U�y�F���g����m�HZ�/$�JwSq��#����hb������5%��
�1D\R���=TT��7���vJ��BbT���t'�$�P"�h?:�$YA�Y�B���"I����y���Q$d� E� �t�vj�3�b
�9��:��E�G�TU'��-.���C�A�~R��x��&�[��-��=�A+�<\u�R������}��H���)��KO�����4�Ck�*�~A�fV!.��>!i"%���nI�[�h�38r��C!&���c�D7�$U�{���|��m��D>��"�����K
W��ByYW�rI��(���U�8��P$eO>�@��K�~ �J�����0resg�}����|����9!�[c�D�1r�������x$G�ThW���G�H/���&�)<��T��(�����G*H����98��vn���,�i��H�w(�$������1��B�yj\"�(���LFAJ�K��]+o��7K���kU����U����$�i���l���3��9$����]��F�9$V���Up�
�jH�)(�+A_���D�^HS�1�[1+�9�CWU@)��uT?hN`\
wI���jb�����U�;~2k����=�����W���1����F������f�b4�%nIe����������"s~4�2yw����t�D�D�������������������O��=��w�~���������~������������|r��Gd������o^q���o��}�����^}��g�o�^�����_��zwL%��}����_�u|��wo^��'{���S{6o��~�����x\��������dR:>����_}����p�����w�`�f��WO���Nf������F�*w�~������������^�?]�����}���W�_�F��cs��~}���?���w�����qF/����|u������?�������xkGhv�?��w���x_�����/�{����W��������'�V�?�������_�������z���_��7���/�~����~v}���5������������w/������������������������P���������O�������~������������}����_|����f�?���O��?�w�����'�w���������+���u�-�����o���{�a��������8�w{��>0�w���~��x�o��x��u���?��������g���x��������?��������o����w�>&�������������7�^~s�����/���}���_��S�<��'?�����������������������/����F�_�������?��_��?��cq��cz���)R��/������_�~������|�����=�����������~�g��x��������w�����������N�����{�q���Z����������W��?@��
N�~![�������p�o��|{������������?w���p��<������'�������Z����{�����Z��������?��
���o��{{������_����wo_���_>��_���o�=~����������a6�����O������_����������8���x����������o�/��~���?�~��O����/������x����w��/������>���G��am�[{�_>y�����r��c.����%����?�����q���%0y������bww��p�/��%��m��[F�8����
�S�;������,�:��L�G�e|�s�x���6��q�4��{�r4+-O���}D�_<���'�%~�������6���)A{.��&��cV$�sY��O���JB���f���~�������s���-\��B��:�cy�����P��������a.�c.��`��s���U&����s��g����z��=���\~�w�~�w����������hvn�x�������_����=�������_�x�a���rD����m��i�7�7��7�;2U��sy������}���~�G���e?���*�S����s������G�����/���O�K���<������}����7/�{��������m�k��z�^��>�>��2�����)�����}�d~�\�.�������7O�/�y|D���O�������Y4����/�>��[��G�%����c�[������y�����<�S������'{I7���o?�[����������K]���?�����i.���	��L^�\~�sI�U|�%�w��[���/�C��O����#.{
8���N�'~�?�=���-����?��������<\���?>����7��r��Z�����xG������"�B����~�u��\��{.���\.������������\��|��2�\|������5�o�'�����:}�f~���i�����~�G������r�~�;z��g����~G���:���}uX�m���:��_�}���������~���[��g�K�����������/�=�^}p2?�s����7?@i�qs��I~�;Z���y?M�B����?�y��2<�����d~!{�k?���D�;�������e�����T~�w������_l.O~����?�T~�wd~������9,���7_��/��e����5�����������n,�����w���u?������������\>�o�����w�^������zzR?1���sY)
���Bsy�/7g����im��r9'y�����;���������sy�^�}���o��?�<���_�j���r���������X~){�z��?�{�������V�����������u�O������������O��Y6��?������g������[�{��?���G���7������
k����i~����?�S���>��p#�3�r
��)�lMwC�������K�������^|��w�<{�����NE!�2HaY*P��)X��zYw�.�������t�b�M}��:�� M�4lU��>�3E�
�T�dE�R^�P�������Z�p#P���zs�a)i�t" �r���
0lPJ�\���D-4V�f����������:��\��D-��q,�?��$j1]xe�.������aV4��V�E����<��.Q�$�_@�LX�VH�)p	Ux�Z
�TZ����y���P����5�VQ���g�v�P&���r��o��Ze��$��u.�P+�,���(�\�R/�PKW9��C���Ho�Ej5�x��v�PC�<���]X8�3t�����@*����
SA�+�@��<a��'	�
/�Pk�Q��!��-�Pk�0�H��xs��Z'�%��B�hv{��D���XX���Q�����zvz{MV�<X�7y.�^���J���-�PK����r7�z$�T�����"	�N�s�����#�C�{Hh���[�������BOGe�'������Y��x��:���}�?�])R_���&t����E��5���Y�uBe{��b�:����,b�p�|=p�u�[���qzDJA{�~0����(��T|������.���63E��@�:����}.>�1Op�9�"TN������ ��qy��LB���h��zn �Dt�(�w�J�z)�Y���s{!�)�_k����L���(k�_�/����*��I@E�6���%�x���'�yV�?����Z�c�W/I7��6U�x~*y`�sY�r!���2��=������_�������=����xZ�{��#�hA4�������y��(�p�K�<b�X�3����5����y��4#�pE���YTH��������l�~�s��i��M�����{X����
�q�p���\��a�{�VC���/�nd�*�z~���E��7�����R8�������{�|.����.�\���Y@����;��@4��[\��	�}�\�[�]�%�Y$���2(d��	R}��h�z�� ����4���~�sx{�|��F�������S��k���]Q����I����Y�����(:��\:��zR��&�t{�N�Co�{�$�\{�RY��>I����{X��>:0N���K�s�Y��*���6���='�
$���z�K�@=�<QG�`w���^�Y k�A[+�K��"
�G��D]j�j41����fE�z|�z�k[m��c�v����z�E�3c�{���S�n���>f�86��km��u�Kd��a��dh(K�#��a�����j�n99���~�������hW�}�_u9���m���K�}���G�p9��095iz����k��!<S"u��?4���+LN;Z8s�X��q�7Ni����2`���(*�?Y������X)������5�,j�N����/u���
g�u��T�"^��fY�5?��N4�D`�?�
��]{R6\qY�c'�~5��G� �lY6A��?6A�N�j%������t5���Z�m������c@wLM.��/��<����L
W$�Ovz�y�u\(����[�a�Y���j��E7��T!����T�a]=���z��<{X\��P���R�t��.u�{���Z����C�\��YU��������#V+8�%��Ri��GiK�A)��EW4��@�+�����P)b9K
M[��=a}�SF��l�����IS���v�t(���n�����I����v|'�u B���w��1U���Y����?~,��B#J�gtz��=��&8d9Dpv��z���������G��@��^�u���R	��R������������Y�S3J��|�#U�o�W�#'���{qt��\�%�1�wj���F������S�U9�*��;+������C��s����?L��]@�u����f!�f�D���W��h�h&Wo������ q)�T��s�vZ������#��P!��=65�9]Z��@|�I�7���:��$�zN��$��;�M���P<������m v�����>������+�[W�Q��,�z�J�2�����LN��%a���}f�� �{����/�KsR/j�����8�z���s(|)D��A�����9������f��������t�m]�?�LQ���3����g��c����K���������8eO���8�tl��{�\X�>����{�n.�{\��,��|V�}X��4�������r���d��p�CS��#|�d�q���r�x�[�!���"��y�Cu���K�f'���ij�mV���z9���|X�"zJ��M���m��F.	6��;������t�i_�ea�{t�)�����|��}?L.a �C'M[�����"��[��q��0PrZ���::�eR�{\>�$}���4��J���^q��J�)�����U�% _�bR�?�,Z��z�MD�F��vMVt
�U����!�s�>�����.m���s��vX�>����HKC�E��2[�$Hz�pV��^k&��=���@|f�#w
����'d���\Cn���ae�+��K���F�{\>'������$�=-
yZ�v]t�QK��ii��]u8��S]���c���KC�- &o��j�r
9r;?���EzJ�<����Z I���v�MQy-6�2�r����>Lj�f�z��!w��U��*����c��,�S	�M��4�l�������l����=_i�6�����<�x��#�juxu�KC�=��aE��jeK����-5LKC��p`�E_MKC�m`s�*����|`�����c�9�g��G���!����A�J" ��!�:/2���IKCvl�*Y����O<�z����.�T�5������u77��n
�������=z���Vo�����u,
�x��WY�����4�9�d�(�ii��e�Y@�:�-g�{����/
y"��L��t\KKC>�t�j��y�KC��r�����;Y��R0F-����!VD�s���"2rE-��:�W�B�M�z��]���vn�h����g�����T��
Q���D�iY/�"+t������l���(�i��sj�g�0PL�yJWD�j�4��-���Z�3k�Q��r��=j��D���"L��R�,9K@�������=�`E�\*z��g"Fr�]��m��:�����l	�s��H�r�$-(-!y����z��,!�x<�)W��D5��$��t��R������p�Y���Zm�\�V��"��,K/����c������|���>F-b�}�av�������x����>��\
�Bf�T�E�vKL>n��H���';�|LU�o����%'#��Zoz��+W��Ge@ "]rA���1�D��:�%(��4���\�.@KP��2DH��4�����+���Y�d	����;�K�W]�]Pn6�*���yc�`(�&;#t-u����l���5�z�,��H���[O��/v'!�������#h �mEC���D�x��hH9s��ozn��H�	A�������R���z�����������%�\����q�KX�Uw���d�v��KX>>_��CU�rD���|lZ��+��^q�#�t]h[�{����<jV�W9�����n�,�MzE���-����������2��S�U�8=�L�����:��2�5���95����iQ�<�k3d�!��>�I� ��8�����r(��,��4;�b$�f�v�i_�r�p�!�jD-�_^gj�����0��Q���r�M�<�vxD=#�Y�U+g��������������N�?>b��@'R����g^��8�b&��zE���g��MB������rDY���|���K5Z���:K1��`�zQ��\�K}{���&}8������������i�G��iL�W��''�L������)��/�L��������%�P������-�Z�IZU������<���E�^?���}�j	�=�`6As���J��V��"J�r�GX;Kt���e9�B�O4��O�!�S�&	
<9�]�z��5�%,�)�I��M:�5�c_Pt��h>|vP��t�kF����+9�mv^�RvH�pVVWE�"���r�^���|<S&��kl�@��������KX�G ��Vl!�Y9��4��G�eh9kv^��{�IN����+L:U����)��5&�2+
�{tdEa�m���(gV�Cb(&�����j�.��Y>���p�����V��E'2$�����;�Dh�XvpE���ByB��y��Q��Q�=�u,]��c�FV�����+~�����H�q���
����>H��������+��+��� ��F���/��V6���D��u�=TT�N�	$��Lu��b�������4���<	�<���)9�<v:����+�u��r��]Q������9�cE�-!�C��,g2M�R�-�{����W�`s��eQ;1���$|�������z@��X�r+�����y��j���5~�M���:���Z����?�������#J�4s;��Iz�y�27Y�'I;yi�#PF�Lz����r)����v8����%�j1	{\�q��O���E}8^l�����U���%�y��\0���<B�<R��L>����t�uL����+j��t�"$M���)}8kW>���K�,��P�z�u����W`*����.�^yi�	*Y}��&_aW�4�T���p�8p+KCnc\�>�$�e	��sy�;�(H�I�j�dm	�����y +�����1�z�%R<992ur�����4�a�MY���Y��.��Hf���e.b��
�S]>�N�\(q�<U\�H�2L�e�k����4t}�RU\C�4	�6U<@q
yNT7'��O���zN���
���t�q��ZAvz���]����'"��g}�KC.C�6������{�]����WsTx��je��E^N���4�QSF�7�-r���)�,�@����!��S�eN��d9�kqk�j���L.���\S���*BPq�Ee����Q�K����.�%\���p�.��nV��''B<��~��c��:�)��G�\XR���g\Q���#���5�2U\�K	?I�r�Spa�3I����\���A��z9����zNA�H��ko�
���"\q��1��j�=:��d��$�����	(�X;��ei�9h,wD�uX����U�������������s��Q"�����2o�C[#/n9�~����:p������z��sc��z���^l��
�Ia�����%}�������=&��qi�9����L��|�QiY��Og#��A�9\�c�Po����.�u�H
&`�{��d��i-Y����!*m r�ta]rA�H��b�:�9���:j���i�4��.�b��:���eL7�Tf�R�����4����8���g��o��A�(ei��W��R,Pp1���;�H�=.
yF�9�L�m�!T�J�.�KC��.5n>
t��<�t���&���!_(��2��T��GCX�
��+�r��w@
���k�!��-u�b���I�"E��S]Y@D�h�{�j�KC>���*��e_���c������f��-KC>�j�F�XN7eKC�#)������m������!	t��<��@4�qi���=���29���g���V�k�UJc1�&�#{
����#��S�b����$4M�.KC>V�]���|���6�+z�+9E� ��=?'0�-��Z#+�G��aT��!�g����z�(��0�	d6s0�r.j�����L������m?2�j��F�X���<MJ�BV�H�V���Y� !S,��R����1g$7��FTD
�A����O<�X�:$��f���I�I������)�.
��K�,sa���������9ui�9k�KA�s��NuU���u�-���zIK��!?|���!=E��<���~�M�g{	�@���+���A#��d�Y��<�:V+y��!�:0�ZN+��4���f�e�������!��uV�Um����KL���D�A�kr�Q]C Y���<��Z5=F,X3���}�.��ZU#���:���rvy:���v�{�����5bC���
�H��G�Ty�.
y���H�V6N]����$G�ui��5�V�~����m,6OA.��"{
_�k��^�}(��������FQ�<(�p�r��5;��,
�T�����U�������.(?��NuYN�x��Q�������~q�z��s�@M�e���[{W~�x��=:���zEn�E^��0:��U����XU������.
y6���k�����`�pC�$B���<����e�����l"�
f�C�_������KU���V�8_HU���4�nhod�#eZ4��4��1���|�A��ui�9����[da�����!7C�)�j`���u.: W-�V�u,
9K���I�<��P���Q��jG
:�#��lR�Oy��s�F3�]�.�.���m�r5�\w�y���UG:�BJ�$T�����g�+��+di��u��GX�!c��M.\������`W�%:2P>�C�3.�� N�Jb�F��v�`<wa��=�Xg������\G��{�����T�E���a^k��t�l)i��b���4��`��@[��_���K�H_(
&��
����e��>we��(-�dR��9e�����ln&Ld��\Y�-\���^���t�2��7��d����\*	T��uA]:r
�#����=z������
���#��6
BI��t�s��?�]���t��NKBI���<H�o��j��
��!-��F_��mu��"����&����`���+V��7Ic��%��s$Vt���F^��r��w���H&]V��t��r�������U�������:��z�,7�0)Y�d�}���'��PAt������}�5S�����OH��������w3T�dC�*���N�w��O"&�6���3Y���fTR���%N#mh&
�m���>&b=�H_%�k�����[-�h�����4!F�Nh��@�KTs�����Sd������D73�@OhgMS��8�����k�e�2�QH+�0mI�G$����n`�iKR.+���
�4�lKRNf�U�Y=�,\E|d[�r�5k��v�jl��i�����'��8��3�8�">�Tw)D��"R=9�������-I�
M/7'���b.)�������S]��IBE������h+�F7
V���8�-P������W\�X�%��t��H������t�!����-%I��w.Uz'#vN���>�����)���]D3n�-�nKRn����2�pV
:5T_I��$�d���=��Y%�IvKmI�������4k� ��$�6�t�F���h�x�K�1V����a���E�hmQ����:���-I��!�vkB���+.�iTM��Q6�yZr!�#��~���������%I��#�l�$Gu ��>|K�c�T���K���ufh;m"�b�$1���{]����o�xS�M�\e��$@WH�v`lCo�Q����x�Z�u(�j��l�������'$��-��;�Z�J��d�n�$�x���\���_���7�e�$���p��v�h���Jo��?X3����-B���p���R������q@h��_�����QqHR@�
=�	�� ���J�_+��cc�m4�3�sBNn�b�D�k�E�n��!��+��X�&$ ����Q��1�V�r�0&/4��BN6�?���u��g��l^!���l��l��;6���h�E\@qF�:0���8�����r�9e�=�u�{m�7h���u���	bboz���%/�dA �!'�4z��t�eY�9W��3Bf�O8�00;6o57th��SN�
����%Q���rr/�6�h��AoX!'w4_�9OJK��q�V���hs+����gEH��O+������S@�W(�
9�+=���($��BN~n�P��FV/����O��V��vJj�G����������;u�
-�6c��������#T�V�j�-T���P�s��S��)[���z��pD���kd:������.?Ex�+�����jH����:^��>W��iM��+udw�B�<?��Z�#��&�1�Z��*!#����,M�z��9wS{���=���M@�(�P`����b.����0+1���������5��n:0��/~�nI�l��h��mv���hK��i��O�����]s���]���[����*&������^!��A'�]cv�vl��m�G>h-����Q��
��a�8tE���n Cg�!c%�'���E�����������-m�������#m�'�����t$�c�wo4x�6�N�����w�,:�!!j��j�K�S������k(���/���bN���/����V�^1�:�{d�]�%[h��	p�]z~���7��	�R��%y���	�����l�r���+F��1��+�&�x�E���@�9���w>\���t`j�@\Nx|&
�,5�S����R
����mq5y]u[]�H��Fbg�D[K���u���^[r���W��Yh�n��u�Yh������E�c�9'���}z}�������~���5����S-5�]yD��`�!w�q)���SC��U�<�9��	b�K'���l�!v�P��O��L��bo�
���+�%�[c��$8��I%����P�`�!?3�%yF^b����}���E�� 04d�l�k(��RzK��������\��h�1�=�}����E[����Zh�����A��R�e�!�}��Ah��>���S���w��k�v�U�*��	e3]�.��rH���$�
g����U����Lr���p2ZZ�Odh�V�t�%��:�8��E[��	���N��$�Y)&	/KK��nf�����S��?W�wt��k,���f��N���������p������4V�j,g���Pd���oL��/�d�L2e�x�oU�����������%���|e�J�j�!/�v����8G]Q�������U����w���]I-4�	/���]yJ��BC����������hbp]�������B�j��V&@�-����J�p{����4�ux�9�<�+b�A�BC��M{���*s54���Yvd�14�����']C����@�!��T� 4��4�s
h�V���j�!w�Q���qNj��NU��f3�%�D���5F����'��qS�Uo5fN-�9�[��j�V��Q�o�h����N�����F�k�����>�-�v~V�[����.d���@�D[��-$+�^1f��-��4����28-�����������(�Y��zj��m(�V�2�=}�����fP�Z]o5v���U�3P���
�m-�zv+���7'j����<�CO�%�{���(dF1&wq����3@=�CCv��Q�<(L�����P��\,��y���'VW�G
y��~��W�����'�V���Mf��2|�W�<4�=Y�x|h�p�
Mhcj��������Z���~�SN����w(���%:���\�#��]� O�Ea����<�WZ��L� ��u�RX�==�'��3�%Prb�D[��v%�
a���O��f�Z���#���f[rt��"�v�V`W��������{I�{�O�6E���^1�����~NV.�C�"r�Z�s�Z`�b��WDf� ��!Q�'��Pn�p��J����8�������<��������U���3����_���l/��H?��-�)�R��+[�W$A���k��]LQ;<���h�h5R}�CE�M���`::z6���tP�����n1p�Q%:��J��SgOT�!���O>��3��s+/��[4���R'k����MNV�s�"o��N�>"�=��:y �[5�J��)��>�&b���|:��7b	(�TO@\�d4'6m�&{V�b���]_G40/Sn��u	�O�<g��hT�N����<������|^���Pq��*��'7���E��#W�:OL�S�#�&CG>�����t)+����O��$����x�tx>��| )$��8�����|��i��|�:K#�	���-�CI��H7�'�%������n�j������w��l�q��F��[����>t�K���<�}j���N�����b�jJ�<G�����WI.��������Y.W�n+�9�$�S;��k,�O�CI��'�:
����$��^����XK�:�u�6�q�c�[��t�6�� �X5l}�����?�������a��]P�J��^8�;~f�2�i��I����t�B�X�k��l�G�:��DEa�f{SS������������i���y����I�^aK5y"R���;�������O�V�-���|�;=HB�������o<�����A���~BN|��o�~!��H��CP~F"8Oh���I�8I[4��+��N����'mpc��r'�br��31E�	K>��D�f�&%9�CQn�}���\��j��(7hf����Yu`�
W��z�]w(���{Zb������*��z��\=_G�;��T�#���Yr�+�����ud����HA��A	%;�]�wW%���lqw��0rm��$[�uHNi'��X�����~k;����Nne�������Ip,��$[���82WSQ>1��}�SM��&.��2I5;�y����`p����W�/��-?�?U�;�6t;?]:�������;��=��n�w���|n���/�����E�;e����`����������&�Z*�CP^�[�P���������{��WA�
A����ht;e��`�-7G�[}9Lt���%��I�h�:Z�f��/���V��wd��J&J�#Ikw�(�[@3^Q_G�'Ogi�1mEsJ[b��r��:��z��s�=d��<��[:X�S��q��6�t�KW�!*���WLWrE��"[����<x���-]����\�g�db-�I��X+��u�K�E�%�:p������^@����d��vU�7{����l1X�U���Y%�bv��������'�`Gb6��=�t2����;���v[�T��P1<;��uJq�W��3{f�3����MF�s*M�V~�CO�9���� ��,�U��'������z�oF�`Q�v���S}���I�����)�w�#<@(�)'�j���?gV3�h+(�Vwk�������1����
G��K�-��caE��{�5�dbxf���|���uaI�x�K�h���� ������������v���GXo��BK���������%����og����{��]&�ex���dh�
|��X����C�"������(ex�Z�SK.�������;����w��AS�MN��oi_~��iL�d�� ������%[������Z,�CI~����`/k�b�9�k!X���d�����/�e�SG>�e]�*�&z���'�GDVJ���#{q��F����"��]%�}������A����U���%M�#��sX�pYDU�
y	%}q���p\r�&���J���oL�	j�&[w���
�2�!��CE�� �,vi���K_�$�s�IA��"[-�i�V�������,[��\:2����B������+EC����4�L����T��li~%�PI3$UKH���e�?�}M�?#�f�D��V�=�K��I��D�.\�]G&�`���<��	��l�/��q��r�;6��d�����]�w��\��>�
I;���L:�6�3�q�R��U+�u��r��BS�1z�Y��vu�!����Ci4���ZBUE�Lv`r���6�����s��/�bU������M����u��#5u���#����u�^t���NE�5��W�}5���5������CO�>��M���3>�e����"�D��������|+�_v
��9�� ;,�[�3�y���,��G�g��t:{�Bl���yU��=����3E��6�]���������Em�����gRT�"�1%��##:�|1�TC�w���������M��T����D/cp�v�	�R'c����6a���[�,E��X����&���#��g�U�	�yEg6�;�1t�b�6Ff����"N��l�d{>�x����wf���������H����=[��>#3�e����33]w���0oSv��g�oV���sk�M�%�Sw����)g�E2�NGF��X������&lz����\{���gh�~l�:�!j��gd�C^�e_�b������et�c#��>}����&�f8�������4u&��O��ZZ6���
�{�����b�'��5\�{���IW�>�����g��}��n�z��8��
�{�HG"R�{�����EI~���)L�e/�M���x�C!D�&Z��<��&q�/,D���!�q�����2.|����6��g
���5�8����-v^�z��r��*�)���X��)G�n���Z�bN�`�H�3�VW��n��Iz:IaC��d��K�0tU.p-��"����}��p.n���g��}�~6P���X3t��������[��D.W&��B������������|e/0�4��f���5�%f�	�������i����j�7��'OP"��_w���Y���>�����N^D�;�!J��s���B����T�����n��,��	�04�>f�r|����������,-K]5���m>�,�if��d�3+<����g���^�����Q9oq.S��32��F��^�Z[k�����T����1���W��P��4��S����������l�\�oE�5k
�z�m�&aw�T�
���:Z�>##���@��'����N���h�^�J]C��G��R�V'\k�����h<-/��9��Mv�C�lC�~����J~�5��������p��d"���eS�J�4�U������5y���%�>#c:A���*���I�'SgcG2��j���(���t��s����dj�����OrO�m��}��2u�+����j�����q����X
c��5�������$�.d�V��`
�cR������5��"�m���:�h�C<�T��������W�\����0�e����/��5���R]�&��T�E��^U�j���:u}F[���|
t�zf�
���>C���R�1�F>{�	��	$l��5>�����dl\���?����+�_�2����l����<;��
����q���`KJ��+������P�S���������	]]L  z����v�����gGe]S�y����9�`P��)��
��#�bVlmhhBu7e��S]#�Y��������ZM��_��k!����
��Yp1��;�8�7�32�����a!���n�������R7n@�2���B��}.�;�JV�����T���p�k�:2����S�I���Z��Q'U���Xh�|�����������ete�j�ZS��O,��R�r����z�@k��&}�4u���Xu��B�>�l��aX��	2�z��:�#C�������u�\�X�<�jH��1<���]��T=V�Qw@���M�9x(g�
�UjR8��������&��Y��!.U���tD������l�%��f��
{_G�1I
�
�akM�s�����(b������c���|�N(�w�{�G�����.'	vv�����e3�����FaD^"���+,a��������H�
^7~g��6�1�7���GG�  
�n���UG�"��j
��X��^nWk�[vF�c!���-"V
�����L�&n8�.}]	F�2�`}�p���J�z��`���`�?C#�� ���M]wC��{�@Y�F���B���K�Y�^�ZS�^�w�(���o��8��6����]Dm	�85D�Y`��ad�b��I+�*!MGf�3	
���R��>OVX����K��9��|�pIy^m%��J1M����Uud��g<�����������4�>��32�D������>�$uLCy��DC��Lk��/){�hV�mZ���'���5c�f�Zy���9�'�Ak5�6���H�5$L�����u�������L�_4��Jb�f����������R�����mjKf�i�x'�`d�!���4��.�B��(��(�<�����IH�\�a�tm!Y��V�����b��Nm!Y7�X�D:�iPGfo/�j/�`k�P6<L��35��3,����
;�Xx�9��'HN�|�\���102K<6!��O�AZZ�����X�pTo!Y�I���>�6m �M��Y0�1��������USYK�GcUl�jR�lc��e'��
u|F&��]�D�������`������)���:�[y����wc
�����B���3���u3v����������5v��S���m*�}mv'�^�o��yp���%�:�R���T�9�����y��P���o%��#i��5_k5;�Vbk���>�4e���&��2=8�+���/��`��W�b]�_�G��I��6m�
����!BOHKx+���*�@n�<#��
�g����@E�z;N�@�>��dE�X����F�2>!��J��-��
W�W�|[�b���"��&������U�j�>������SG>@�?���^����c�Y�}2>�y����U��|p�*���|�������P3d��Umw0���s�E� �S�^����4���"NH�z��!L�����d�C�S
�9t����������kg�gd����9�FY��=��i��w���lB8����c/��
���q�i������uWX�v������0�����I��v���O���Z�/�R�����MO�=�s�H��
v�1�;�L�d�u�q��5���������_�����7w�	;��
����;|��9T�����4���U�n��,@�hYfK���Re�OU�*�-�j[DNXq������pG�j���-��>pX������N]�5p��0t��z$=���z�Y�:�1p�O�	j���S��+�QW]G&>fS�������c��WU@�����h�IE�����KU�'���D����E�E���aw���f���d�a�P"�32�Pod�m�>�C��D�P�q����E�J������T�4J��S��}e`5Q*�32-����M�f��A������`�>��p$mw���S{c7��XVb:2�
���2[X�z�9d���Py�6�����@
z�[���Y��g@������Y"�g���A���n?F��WR�0r�u��N�N�TU�������=tj;E���M
���r/lA����/a���+����f,�c4�xl��$��yb/�����^mh��IMf�{�����_}}�+�����e��uh�X����t�N�O@����#_|�d���6�����<�����@���9J�^��H�9���`T"T�����v�cK���5N@�����C���,'�U����R�N���C'�y�:2�eG��v�g#su���j�Z ��s����1b
�6����N�}r�f���s�tZBka��1��+�'�+Oh�W������	�g�C<��E���6��J�%XF�1����o�������w/+�Q'M�o=ujg.��2�=��9�����O/�y���I64<x��9�����V���(K��'�`���N�L�$��hV�c��z���O�����&�64Gay���������zr��S������%�=tj�,�r�tF���`
�$}�pU�����8��$�:��D��bZ��g�el�D���	�A�W�3�qX�R��<�k��-<k����>{���<�gd���:PDz�,���N����qF1{���'t�}p��D����H����E�l4d�E�w�shp&��������������9�9������N�+��D)�����m�DL��'������	�j�S���<N�^q�,W�/{��f���B�n�q�Y8�������KT�<]�������t�_g��Wz"@�6*���E��E�,������l_]��G����A�����BJF��ok<��D��,�}��t��n��}���3����a��t��j�3_�UXh����	���m��3�?�T���1����X��N=�~e����:e����jcq�P�F"@*���V\�}���m&t�0F"@�Y@Q<;h�n��'���^�=l��94�"2��O�=�w���j"��S�sZ�juR�H#t����v��(V��#���Ma#t��b��*�x�f��O���X�N��z�*h���f�;��"@
�I�.j:2�#����	%ds�9��#R+��Q���9���
3��3���.�0P�3j�u\�*��#����d��B�?�l���uh
�C{6*{G����[���0�B�����g��i���zv�S��M���=Q��5!t�| 5�vZ��D��j������x ��l��g��P�5�F���"@��Y�RQ�72��.v����H�=o�;�e�N�O�@����{����1�.)F&F�*+�=����z^�9d�]�`#t�f�V&���w�����%�HV=k�z�G���Ns:�������5[���9�������J�#�`#Q�s���,	 �m�
y������<u�c��tF@��"�Z�:�>�1@t�i$:��k�Q/8B���Y��3�j#t�m�t�[,x�N���{6i���N���@�#�S��x��
ZM����l�O����l�kF�=�����/�����TU����s��::u_�`w�Y��I��.,*�O(���^���3^Ha�7�a���������s>a��9s�LF&G���W�##��m�i]+�����7B�����B�
���z�6Km�]�9�J��r���~�R=�a4��eF�n�|����8q����������?G*p�����<�2�GM�:�>�T���v�=��X7�rG�4(
I�l�3�&�B�2*����k��6�Q���~�d���];	 6����'��H��*�O�����v:���	��~e/�����2��YFD��=���85��}�����6R���
�2��#����6R*�X��q$d��������T��Wd�8<$�LQ9�fOSa}�����^�^�BOvR���I(tKi�>6{
�Q��*r������$8��'����N+#�e�����=B�n������H�I���,I��w�shU$/�Q��G��c2��E#�v���"���O35HZ�Do�3J����������F;��_�*���O}I20�����L������`Z�:����������T�x�^��-Y�����zu�Z�j�)�L����M�Y�~Val����M��@���u�����*�����kf+!�
S�t3}���Q�,�2\3�"c�`���k���OB#�_S��3}����P�P�<��1�j�bqt\3���v`x�VB�n�%^h�t��m`��:PG)r3����>X��/��y.�Pj<5��a{���K���3[����O�e�C�'���'g���6j����z��;Q�����D8�W��J�����;�����q�XMt�
�zm6	������W[�
��O�c��X���z����f�����w���2�����Z�z����	1?��Yg666V���7C�^���9�R�|��o���V���V>7��������V�Uc�
�z^u��6�-����`����t����=3ru���td���t�������z����w�aWU����:�*����>C��voH+Q�f�^���PJ����F���y�&����1.ey����u�1b$+��6��_Ya�5f_�?�l��x<[�&��XdY��1��%�c�CTQk1��^�����H}@���z��]��J�V��\d�J~V��f�+k�+��e����������^e���`~Z��H��MW��Z�A����9S����
F����_p�f�?�jZ�=[��9to�x�y�����z���h���\6[���CG[U���Su���/�uf��i��L����[����M����4������X�w]���� M�@^�y�Z���;;��[E�iy.#+����bMH�G��='�����T��s�t��,����f�yYR���	�jB�n���8�l����aZ��Dlb9�
��q�r_	��N{N��t ��n9��r��u��+sB�w�����l_pr����(u�U���������]P��3�jW��-9��<s�^����zE+�J��
]&�I��lO��Wq�#���'b�i��>\6M�5cB
f)���C����f<f��^����(�����1J�X���!V{aF|;I�����g������<I�l�;�;�eE������l�V�j���kE��y�'�jW���B��g-Si�y[4g�R�>����
r�����S=���]��������EG.S�R+uj�f��bt_���{��4��.��#S*�~@Y�����wx��&�}�������P��
��O���UH����j%��tVj=����2�����;C�����t��Rv%��P����c��h|%������)�#��l4��}�Q�t��n�\M�[jxB������9�J�e�����RV��V*jl�kn�J�0���Rgr�75Ua�:2t�B�AE���y�����1��`��s:F��i���Z�y�=���Ct
�s���:�Oo�YRud�C��O���/��i��Ws:+tj_��\������;'�R:�^�G��A��'v����V!C�]/���~�H��������� �k2|�N=��?g�2���[�S��~�CX�~��l�Et�Ki�PW4����Y�7���c5I����T��oe���\��W�����:���si���
bj�i�'��sh�n1j��J���C��������&�q�P��z�n���U��'�\F�\ 9��:u�8#q������|��n]Ei+t�����U�LU�������`�s�O�z��T

�f��cj=zf8Z���r�N����.�F�:og�|���!����8o��B������S���=�O(���<�{�������
R���Y�S�F&`��>�tU�"�
������k��������z�
�he���o03:�@�0]���('W:�\�S?�t_�G#��S5�:u?L5u��`-��pU�
$��;��a%��/:�\3!���*�8��#�N�{�������w!g��e�9���P\�kb�����A��#�e�n��5�P�C�A�n��+�5#�9aC�[�MV��������	�s��=Q��g5Be��j���*K������mZ��^��n48�;�&���o�:��&u��b���RO?��+t�v�t���������3���+�Y������s/c]/��1p�Q#���.�57����q�x}�d�c�O�� ���*W�w%�������h����C$����1��G�4U&����j�m��7������3>������39�6����#j�MB�SOt���O~[GF�����Vu��z�O�z"������/�c!�J�i�S�~Q<���=���q\@:�Q+��%���d����MU;��T�x�"�1g���o	��6q����_q^�8�:�^��=Y�����"O�l�,tjs�hP��|{c������:�s2`^�i�Tu�zN�:�5�����N�/U`���F�����r���$?�I�>������V^�g���XwAK���i����;s/��a��&F�^v|�ZBe|��#c:�%}���
���q�?yE��������.�zM���8�mE�3�/[�?�j��^�+������Ce7E+��T|�-���t�����M��3���J�?NU����h!�%�cP�X$Wj1�����8�������3�#�8����<������q��R.B
�6����p|5����O���8t�,w����[`?5�z�*xB���u�D5�O���J���km���A��zB���������ad������D��:�>�����C���*{Z���%���5����$��8a
��������M�}&��Ps��oD-��2��t(�zM���0f|���T�INgJL}*�12z�52�.O��",t�Q�]����3�*h���Sw�3��I������y�uOh��!��O]�B��v���H\3���'uQY�&$�cm���5�q���]�:y.�f���
x�&����.����y%�X���Umc(����N�
���o$�Y���TK��8+�>���C�_Q�������Qq�����1���K�\T"�Wc�S?#��q���,9�{;����Sm+���j�y���
���'�1U?,t��+��ck�i��B���n�4<gC}+I�Xdx��sUb�������$c�O��{��2���QG�P�CS��V��+�}���y������J���q6J��Zs
J�����u"T����������K�~���`���p����gv�@
��N}\��_�N����4�DY����`��6��T��j\3rm����#���~�V���J��N}jk�;����&��j�6�P����29����Ju-�e���[hm������I?����������j���B��H��?�i�g��ke���#�����9C�qo���� �mvN�	k�E:���?*�;���a{��h�	cNU!����i��	�n|����S-x�{��T��\��&�N����
���?$��c��T�7^rK[H0����/���o��3�/���$$����d����=Lz����J�� p�S���kj)�O���|�l�;`Zz��8Kl;<4<��v��ua�n���(/����\�N[�v�(m������^SW1O��K���H��������=�����"�����C�^�X���Z���V���H���!W[!��A"����q,Z���c�n�E�/����'���������r��.o�}��Z��z����7?�v�y*k�O��V���,*� ����J��#+`���m�X��y�#>�@m�+���g���� �C�����n���L[uaixc�_Mk{��� 
�����������P�r���e��xB�=��F�����9�/����`d�w����z�������;*��*���xh��%��P���1�����\=�P�>��V4������Rk>Xsru�V4������U�fjjz"�F��[c������9t���xx��02�
�PE84�'�
6OIfb��Uwc�51�q�_������Lx�?������<�S����!�fb��x�H
��r����N�5d{�U\D��`_	����&t�+��'����������!W�y���j�;�\�v�jh��
�,cU��=�0
�jV�Cj�9B������D���t��q������m��j
��?��k&h���qz51���6"b�
���31DE�gA-�l��g<�]���\m
����R���\���u�US�P��f
qB��������G{W���X���@G)����W�����_O�Ga�����}���w��<e���^+��8�LU�
����[W����$�02R��O�*�QY�m�?NL��>9�t��HUY��;R��^X�I�=�����)�fw44���O:A�-�h��!W�i���(3�*t{�?(
�q�P�^���(�/����g����?����UgBh���~t��!�
��`L��nM�xh�km��}���3a����v����T��sv�d&Ruhw�;����sA�
��y��omM�t��x�m���xK�����8�=��}��t=#����0�����������ZG�U��5�0��0���u� *p��VT��%����FK<��;��6�]	W��f�P��r7j�x���g���vI:����0�:6m�'��������( ��S��#� �^f�������8-�����5��S�������1�j�9�0����:���!Zl
�,�;tj?���5'�����~��q�����948����F;tj?�=uGB������i;"cdOw��~@��n��&Ff�=�]:�������O7<��u�K[u%�Ww�F�NLugNgvX#
�������L����j��������2���g���w�N=���Z�������]�n��7FF�����F3������4' +���v������f�*�;tj_l��l����:�O����0�6��c�����0-����06���"d���kj!;w��'������v����,��_�o%u����@�)W�������:��5����{��G��������/0��`����%��(P{��d�r�U�j2r/���/��O���W6������M�����5u&$����:���IKO�L+�����F��w��n����-�2}B�S��������t�hU�X5S�>>����j�;u�~A��{�������(UM�;u�F�N���Vz{�%�m�@���N��t?P5�:"�����C�m��L�`�T����P�
5{�_��@q��AD��:�sl&r��[�R�������[	��7�^����V]+l��u�!zK[u�J������������5�����g�N�SB��PL�P�N[���UO4L0n[�e�V3�����Uw��i3�A�q[������m�w����p��Bv�>-��F�a,.R�����>�bk��t���p$1������^��81&��^��P������u����}i�����;	���z_@d_+w��sS]@�T��C��>�+�f�	�S�4%�VFo�e{�:�eOtF��5u���R3�PD���|�@V� ��XB��V"9��]_Y��e��x%�����s
}.=4��/�c�������-Q
u�2�>�-���wc4n�h2\���H��({��S?Q�meQ��?�O|e�b�V^���J������n���'�d������l��Q"�R����ll�>�s���\GF;��q
��(�i�������	�^F���Vx�8�����f*r�|Ff<d����'��gdb�:�@k�wvYs��y�!)�j%t�}�T��@-��ZI�G���5A��V^�G�DTp5��Y�\�7Fj���4����<���NmP�NT*
���32��GA�n�X���%�c����s�k�(���5�3��������
�z��Z
uE��&�N�����+�[�P��? ���[�XZ��:1'@v���������,��5c�5>h��L����'�*t����Q�q����C����^�;����x�D����:�tEo�o�5��qf����
���S�"T�Y�����N}4	W�%�y�3�6|������n��E ��o%1���^Jl�!�[x����D�p]�R�N�.�(�k�����.�k�����N��vS�6�e!�n�I)F���To��6z����������m�C�qW�������R�RcL��_���`�! N�z�k��0'���N��S)��q��N]
��N�Z��O�;�����u5I?����gEt���~�Q�K�h7���V^?uE�����bk�N��6�;��jR�C�S����	{����>���q�!�M^�G# �}�g���?��=V�,=##�	��j�C�z�����R���j��Y/=��vO!;��N
��/��U��W��l��*�������!,��{�nC:�����D�=�H���[F��;�J!������+9t�=���u(t�~F��[��c��:u?�P��&����:�B��G������-�}�S��_����;��N}�>kt��(�����������:� rD�����[I��[�gd�?��oP��m�?�^���4�L<l����r����_�_�_�L�v�����'�S}i��5�����9��0����x(t��,3������=#C�t��,�BL:�>8E$B����:�q�j4���mR��SGR�Bl��?J�]4�b�����&�1���7A�6����S_���Z}�'Y������o��Zz��d����jg$�x��:�B�~��nB��8`%��������/]�R�XD0*�>\���q��������w
��'�B)V����X� /����:��	�fI��32{*��#��z����H��@��A�!V������HCi��3r���*���
F����do�	�Xur������L}u�7�JlhW/Pkdh-�>�n���T��!V?�CF>G
 Y���5M����	�o���F4���$����j���Xg[S%��w��J;|�%�U�}����fG�v��P_�K���{M�m�!G^o�����X����`�m�Td{cQ�&=�Y�?&�-Af��Q��({v�	!V��~�%���
�5_pC
������9��Q���/;���U�9!���<J��Hx����d�B���>7� u4�Ru&$����5~+X�[���U�AP]�����V�}n�����b�5�7l��Lx����(��do}{*���i��9����}���3V��x��������o�4��M��� �?D��
O��	]��X��sh�T���%���T����E�Y=#3yOh�<KKQ����-���G�Uv��`�!V���p�~T����	*�B��}!V�!�/�w���C��1p,�>�������X�����bW���,��kf��p���_Y��}���*�CW����)$�������#j� ���D�j���@�c;�2]B��Z�lYH�������H���5\3������/������ Vo��z���\8�����!U�ap��F��b��K$�J�g�=�����x�����S�wtPf�7�b�=!w(ls�"C�^����+���b��#$�Z�P��������ATIU�fOE$)������l*p�T������RX����T])�o��4e[C���	(���PW���8�*vF�����eq�#�8udO��N],�w��c��o��dO(�����lX�y������� iO�gdL_+�I%"�j[��1z3����.Zr�#�k*3!���Y��'�7��L�GSQ��v�R;4�����i2���R4�R�U]���V�F���gribG{������x:�h���w#�?�b!�7��6����e������	��n�w�p}�	�s�>���j�?�nWY%N�!U������RuMS�-D�5����&JV�"NC���f�N�8"�ur��Q��c�����R�������:B����9�|�;��J�`:o[����c���k<��q�@��o��z�������9Fg_K���n|2oi���������n�R�4U��/��oX�;��u��.�p#�ki�F�����W:�J�C��\�ad������AJ��}��4-j5E��o%tj�JQ*����Z����<�;��WM������76D�f=##yo��?$�����9��qq�5�{�P9=���Yt��-�h-��r_!TJ��V���_qq`k-��t8�9[g|������b�}V�m��O��������43z!"w��P��D��u
��4�v�w�}��,uj���	��8E�:���,��}V���S��_��/<�,����Q��z�!�s_A�E{M��������6{*;���Y��E-{*�����y�f������G�$�6��g�C(M~XB�����gs5�x����[���4���T]��oDR�cK��^��F����4U��s_S�v�:u���Gy�����B��^�?u(�
��:�z��(�Q����z������8�
�Dx��k�E����ahn���z4�xc��:X�z�|��SkV\3�������F�:���%���F�����P.������S���q��AA�5��8���]����32�;��9����Z������P�'��S�:�~,�c�	������`��f����*�#����T=�f-�+��ji��b0�[��i�>�z�8#���g�����J���8�������8h�bI\��f��	!�WWK�Gw�+(�4>���ch&�a&�N��v�(J�������SE-Pr�,����':�9��7�|
#c����`�q��{.�Y�O�u�%������B�����:����X��t���xBY Tp��VM>���4��u�����R��Un�C�S�^�8��1XB�������@��\6	�����:���V�g������jBP�~�i��k�Gv�U\3��sBCGlv���,t�yPO�.��k�M��������N
��6�2�����:�ZZ���;ONc���l�x���u�����e�4��T�S�����E�_K��{��c���g�?�K&y*�j�����N�\��v�����8�����7�m(����J��k���;��:u�l���7V=t�n�PG��2�����:���IO=t�
CW�k�v�s�9��o+X��32�r����&�Cs��v.DxD�-���v�����-�����r
lz�?Q�~[r�w�!U?���)_I��=������^�����?�hm�x�0��Mo�����V���j�����S���h��=����y���1��S=�`��EB= �]����!�qn�5k��%
��Q���&d3=�
e4��
m_a#�I*������Ow1�����*Rm���z��0�V=h��g���4��Px�,���y��T�-�F��\�Tq��u���X�U�Z3r�'���`z�F'[3��Q���M������I6��6��L�D�N�V��zh��o�T����h}v
��6���r����Pi�6�0�`���i^�c����>#��Va3��j���J��*,�}<���	�\��y�o���6hn��T/�8�������lz2�hGu-����b�q*h'��G�����Y�P�v���i��������/s��a��0�O�kh�IO@�
�Xt#��#����GM��}��4�^�����E�z=��|��
�
��T�]r �����rz�
���p�������i���C�;tq(T.����N�qM%t����J�^7�V��jN�K\�`�_�5��$$rR���d��4�G���8���=
�n�_�����Hz�Hm�zZG���v�����C�v_8�]0LuG���Q�p����$�@t������4�����~���"5@'`U��R��W�bV����'��U�5�����C����B���1����B���hT��w2�O4���-��O4� ��;�z7���P����x�n���b�F�w�H*��z�����l���K��f�z�D�t�d�vt��=��^y���]�8���\�7����
:��%<���%�7�v���7��,�����VO@%��uN���=1�Cu��-3*a��YY@`/�-V�T��r���S��W�
G��<��M���J���=�Wt�������:E�K����ANU��"��+������D;c�����b����F���>p�0�4����ld!f�R�PCi:��4hW]Yk��64�u�f����r~����[�1���}����kF������B)\3��fa��Ek��gd�/�E5�(cd2v��W�SM�#��g�B�^�~�����������g���0�Tf�>=��4]/vf�����i���[��3��L!�(P\��U9!dD�_
���4���dg�����%Yo��Q���BFcQ~�!�Bvq���?��!d���v�$���-���P�rC�hC��gd�!�����>!-
�c�yB���7�`	�j�$q�������C�����g�Av�zH��/M����EpS���c]��g(xBA�+D|l�X��i2�/���x��5c���0���H8��-�}�`=���5���
	�����b��&���P$N��� -1�d
���iQ^�p���+D�
�6J�G���x�w��R���q�Xh��dZ��4]O��N� f_�������kW�H��������}B�=���r5���x;.`1v@��H�u')w������~���
�I��P���2rs���-�����M#FG���gsh^�F�6�P=s�q���9�p�N;��+�6����j��q��4I��@��B�M��U���g�A���,���?��va�%R=#� �J��O���G��1���V�kf!#1�,/m�3�e�_���q����K���8u�bm�����`�V�_���9L�P�A
�}nD�����u�#`�XDcM�������������!v����3;.Z������v\,Hu����5���Mv����|%�F�>(�8�e���^�_F#M�~ID(�/XM�t=���������ks�;�V�����e���k���	����'�p���XMx���NY����j�gd����Pa�t5���m1Y;J����g�9�	--�fGUy&0�'=�s(M�sm-@�16	�����:DaM�5c/��0���r�
���P,n��R�/]ShR����jf��}��	9�L����K�#���iv�A#���� �H�����lB��N-*8���+W0�w�gH�k�	���9J��T'����w�E�C�HD�6���k���#
���X�hc"%V#t��
���i�S�#���&���IM����z�= �Mfv\<XC�y�L8����L"Ff��+ k��>���1Q��f�A~�Vp�Q��L8H�y�*�����n�<�O����y���T5��v\l������W"���V��L�z���X����{�Pa��9{��z&��	�������~f��^��kB��,���mp/c����M8����q��9�.��>�s��j�j_���{�q0�2������{�����>�m= }{�|� �{<^p���������>������ ">4���N�af#����1S�Z(u�	P�*K3tj|+����4������������@6��NxB���6�z�*�j�0C�~�A�M���8b��� �n�E��S?�H����Og��q���,��5��V���x��5_�4#TU���p���z��k>{�N}�xB�	`�N�����,��L���lG�Vp�	�����[��R���s"c�hM��g�A��^V`hDi�N�a('��3^��:�W�1���n��~��V�#!�����-�#!�i�;�pv�fo\�_���#��e�1�3� G�n��X�P�A�U�J�>������g`����f����/�U���NM��3��8C��F�	};�P���<O��M���Sw,~lg2��l����������<[E��|5���2N:X�rFB�d�4��^8Msk0����p����A���A*{5�g�S��\����H<�������X��T���c��H�z������4�����%6�H�"!fv\�jq��wbJ�e���������L��]�e�U���S���J��f1g������v�	�	9(�Wh������/��{�P�e����~���$�Y�G�N�����Ij5�B��>;��=�>��U�����M�^3tjsG���"#��!!����jo�������4�u�1zM��:�uU��pu���Wc�`��-����T����
�|�3t�~4`�g?]O:S�v3u�����@�F�����-�}�&c��zN�[��	���L�C�_�����cM��D��LG��W������������U73u��f����
��������
,D��S�����>1^?5��n�	��f������i�����\�S?�8�F������0�/-U#�U>s������1����C`����p����R)���J�z�~E�
%_�Y���+�����.D�cMG�:��P8BP*�6��z�������^[[!J�������Zj���W�&+t����9N]�U�^�S��.���������~��
m�rvb\3}��M6�Y�3}������U}+t�ZN�N�X[���w�f�[�_���
��[��E�R�3���5���:�C�>yY*��������(�`�����[�'����a����;12����:1	��� ,��%����^J���E+����CS��?�4��|c��Y�x�R��J���Qa���y�WX@���em�n�FM?�.��l&}
�=m}��Jz�sQRd',e��4�W�T`Z>�n.H�sA"(F������Jz��84���$��]*-�����#|���g�'{+y��%0�CX���Z!Y����m��W�������P�%���B���p'������+%����Jh�����Wa�B���ns��=��������|��]�Z�2�FPG�u[�f�����fQ��Y�E���Q���f���2��I�����[|fX:h]���4U�\�YO�>g��"����i����,*��g��cd��w�^o��XC[��Z�������q��@j��-�X�����J�zR����	+����7q�j���N��;��u���~���=��+��������Wk�n��.eg�q��k�/���i���i�A
c�X���,�k
�
�����q����}&�&+���b��Ew�X9�C��m�@�v����������u$�[�adv�X�;D
��V�@������1��z�eL�v ����+����������3���p=�8P����+�����������r=�E���jH������ic�[Epr�r��C��D�66�]����l15��^���k"���=��A_�$Z:U�����������%����i�����H<����-O�Q���Gt����v����<OK&]r���_�t���7��@��S�yph+����i�L�����<��7HST�_/�_@������=m\7Ll>����n�%)W���9������B��}�����D�\���B�����Wc�����O�����6IN�?��g	�']�?w�V���	{�n��O��RK�,� �,
8�L��F��*���R��g`K$�����#�]�������he?u���#�
��T��>�0"5ZQ�>�l�i��p�����H�q�apz
�����K�B �����x�?2�b��k�������b��w������_�B'<�(VSw������������3R����
�3������I(�Z��i�+��"~-����\xBZ���/��H�L��������bi�^�W���1�:~g�a4�%Tl����];����v��X���*�}M���9t�jA^,��o%���O���k~gj���:�.,7����\5V~�i���5�/��vmN`T����
�z��u�L���i������e�{YH����Q�d`d�E�s����M��������>Mt��Jl�}�,t�������P�Z��g#Ck
H{��t�;�J���%��9t��*���]0tk�����3
5�[�a,���9?�m��S���;������9��9ik�!�	R:N�QM������Tp��*9Z"A���L��H�FT�&�g"A��ip���UA�D�K����yt�m���A����f�Rd![��$�)@H���B����i���*��f��+�Ql�+���a�������������kb�Wk��l���W�����_��n��"�O$H%���p��C,��,����������EI3:["A�4"���H�9����ENZh+�e�S��������:�c�A��8-�b]
#f�^��$X�d����P�z��+���eI���}�3���`c�X�t��5A��}'�"�.d�����b�~���������@"M��P���M
e���K(����rW�x�������s��b���5�����AK�5:�����C��IK�_);�V	2��;Q���i��DRV����-��LS=�u�_���������n����&�����u�H�bY���R?p�sM�Y��������)�/;���{C4�]0��4���$uag���1��kz��[�6��+�E���3R��~l����;�!-��(_<�V��uh5���C�����d�tM >�w�X����+X�����9��\���>c����VN�������`UG����.�J���qb�����H��uFoKG���4Z��{������nt�u�m���Uh(B�������E��<u����"���������H���C��SS3���F$$_�f_m����>F>��}�:�I����1��k&_�dy1]{��1�.>����	r��_�	�g�r�D�H-�m��k��O8gO�m���s�!�����/���P�6� ����5	R'lS,����S�1����=�/u�g�*v?�x�S�F���[-.(�
���O6@q���9�J��������td���^tj����
x�����J��A��z��8���c�f�:�^W��b������)�A/l8?�;_�5�2�CV��=K�8�	�@�������P����uO$H����h�fZ@*�����%�*e0&�5�������Kl�|���8�a}#���DO�zn�>��:�:uC��	X���Lkl�f���D���Wo��f�����H_�:#GWtO�]]�Y(W����������0�	������c_I$����-V�D�t�����=���{���[I����$�wa���Y�pr�r��M�u(t���|�O=EB{������w��%����Qj��{���������N�g���n�=;� [�|��Bl�g�y�c";�6��i��lO��*.���^
o����~+�9vJ`HX`��3� �*���NUu��-���5�K=u�"a���V�(��	=���t�W�p��,|,U�����td$]��jbm��z�����`��&;<���r��F��}z���B�$�����U_(�� �d��1*���c{�7 b6�48�'���:�/�Y?�d���~f�st�;y2A����`V�V�V��Y+�6���mg��uL�m��<��>�[�!��z�������b���C���.6`��Kg�fI���������@�:�����J�v(��=
��t�I^���d5����`VA�C� ���^��:��
��7ba#�j����4��d055Q�!U�A;�@9�p\3mhZ��S�GPB��U�3�*OakuRA�b���`��UC�,�����U?�����S5��!V���9�o�����y[�P�UC�r��YS��[�����U/8����OI������;� �"�"�V���
�W�l$ P��C���q� ���kF��]j���a�%{�`])15����Kr��Kgq���zp2,�
q��G��X�9�����G��^�y
8�5�Qe�������uSmr(P��k�z����b�n8+���y!�IpB����*����3�E�X�� �C�����ZA�5C���)��2$��k�\t���@:$?�C�>�R��@����ANR@g/b��7����f�e����A:��IV�F��e�Mc�����.���},����x������2��l�L�w�D��h12f�����7�I�g;!G���f��N/ �������R���������]�����+��v�@��S++�����}�Q\n��o���x�Z����g���^�@����Qz���C��
o�
�����4k�GjG�(�h ��k�:���G6h�;5�}�>p4�Vi'��azC���=���M���%��O(76CO�L�{�<j�b��K�	)W	����w��p�jra����b����`{��v������W3ss{t�^�f�gnn�����[���g�J��u�C�l�-{=w���%�y����O��Z6�e�OTHi�7i��8����OfL��&���vh�����7�`�8��|���S����ee��"���G/���F�4�_�`���A!ei�M�������O���p�w{+�_�}����.��$����������&-����/�����N�uqx���.wH���Gg�QG��R�\8XG���g���V���G��
<������}�����IA�S�F����<6����4�/���9�w����;i!���A���5�b�h8�/�t�3h!h��1���� =$��6K��w��������5p�rb�gnm��C�=�1����aY��4p���vR���k��'6����h�_X�!`'�_W�W������7l�~���@:bCLZ��P����vs)�(��������r�3���P���M�����H�#�������Z8��fZ�K1��M�
�I9��@�Q�+k���8�W��]�P�A�|i!���gLf��d�_]���p4D���}��x�H!���W��z�����b�^R���.���ai,v��k����L\�����X=#���������;������o"=�utt�[����3�Al������a�W���X]���Y�V� ���|F�zT.D�����������C�v��s(q!�
�6�����k������u?#�<Vs��B�^^��U�	P�12,��T#���*��z���Z�	g,�V^\�bu�r������s�cd�B
�:u�a f�^R��7��������7��%�d����|m��G#�0�~+og���j�hY��k���Rq\��M�$�6R���V>!�`����:4Fg~�=W���,��ew�S��;#��:�k���D�P(����>��iZ�]w�P��P��B��(_����xP�V�J����*��U����S��
����^^\�����J\��\M`�U$T/�bO��W^��02�h�p�5��VLg_h��8�hD#�2���o��&����K���v��%q!!����
>��|�A`7/�u�g;r]����{Y�B�}�����32�3��x�z+���-�NY����y>���DG�?r��oo_v��v��	T���B��}u�������!V�}����9tu�nH8"����t!dU+$��4������W��`!]�JaK���Y�t�������������ai�l���Dy��8��w�V��UVe���]�B����v~�z��9��r�qvh��_\�&A��R�	i��d������`�9)Y}+�p]B����z������d}����? zK���7���3="���l�$�Uf��6I}�����
U7D��:9v���"C�8����#��b����Xk�
�s�������� ��6C+Y[S<�L�D��
����{��k>�4�I@��7�xG���=�����[�/�B�#8�F�?+���0���9�}�����tP�A��r!G^B%��[P���1g�iF���9��Z��$������l����Y�7R ��/
�`o
���]'�j��3c2��t����6�����6�hV����u2W�,%^m<��v�	�VV2��O�����&^�;���U�0�2���H�C@y���|� �6Q�h�L����b���e��.����2�����'���2��T�+)���^dw�0R.�Hv���D	��W.�� ���x�B8y|"2Vo��\GJ]���I-o^F
���C5q���c,6���$����Rs{VF^v�$*��(�Qte��CdQm��~���$����:_����������n����L�l��oe���j��?SI?�9�NN��>!x�Pm���Q��x����s8��l:}����b;�%���UM�����2��c5p�u8!�� ���C���uN�a{���W.d�d�vU�|���L�!�4G��\HE�C�����8�/h�z�ZB���'�����A2=e�S['ip`Z���N�����<$,�����!G��dK��v5G2*��-�j~m������V�
�n�?�}����!�!�2Tj��S���3���x_
���	���Z�gN��R�do�N�*)��()��C�S�3�Mcj���=�����4�l�1�|�2���&{�=�.[��SLoi�W��~�42�_%jF�zG�ah^V�6T�nj��%����o��f�m�u��M�ucT39�68����kf�f�5����J��0�S�\�c�������GM����L`����O����!<��T}M�_�4�get~4��/+�0j��8�'g
V��j��>�p�+���V=;?ja����JsyVf^FJ�	kd<g�?t�U������i�GJ".����Rs��$R��V�*I�>c�0'����L�����7����W!:Y1��j��4���y���q������u�WI����d|�<��r!G�;D|H�g���j!��yY��0^��8b���f
���(�g�B�E��W�+S��\��1�JZe?)�	h�O�B��~�����c�����E�P���0&�����u&S�����f�H��>tH�:��E��h��
����"�]��#�c
M1-/�I3y�g6��/��Z��39�G��B�#�&�-M���(2���
�l_<�����Fc'�wv�G���.#�P�&b}|6D����6������8(�2#��k�"�\I�m�1�6��kB�V����U� ��!�$ZnH��\9�PW���������d=!��1'������a;��h��>)���d�4ua����\j|F��R�����&�R��vh��[_��`b�������T��|
���b�'z5��]_�~���Tk��q})��E�;����o7��^��^��Xw�$p�m��i]��@�c�;te�a��\(N��;�tm�}z�*�"�d-�k���Arm���+#����Y�]��Z@�{r@,u����}���p�+�:~n�QgI�3A����mb�}�	�4C�R`G�I���^������.��JK��ziLu�T��ZJ���,��Q#��[������$����
��������T�44hZ�j��1[3��L��9�d�`eZR#�p5m6���FH��K��_x��H�\�m��.t�����W�
�pz��W;dW��6 �W;dQf�S���j�,��si�~���
��rl-	��Xd�Yy��#�LG������M���!����b{��!���j��ZZ����������>����)QQ����K�J�97����)��e�p�*����u�$�aI�E��6
�7r���h�:5Z��Aul7���m �{�:~n:�S��J�����)�4�
�&�*�^����#�B������)]�e������vx��Z:{\�4��`����=z������fp��T{z��I$��p�� RNo�����6�,S��������C���JF49�G	&����t��������2�MS?������q~K��z�~�f�����������!���X�kUa�������v�!�x��@��w���;�!\����t#i�5����������+�U��Y=E������&Th������&<p_�=���gV�o����+�����X�aI���K�MS@����%��R�E��Aoa�����H='~|��Oi���qjp�^���h�n�g}�����hw���c�@��7�r�CSB�����X����"��(O�[���AM	����n
����������.��Rw(n/���x ���jp�t�X
�'���W���g�����}�����U�����q��3m���`}�o�n�I6�����m�7[���D���G:������-$�]�B@�������5M�~��_=i�"R��i*%.q`^��A��p��S��4���}u�Z
.�6�i2���MFelm�ARi��p�Z�j�q��;��|��LW�LKi�3�����J��f����hD�����6����V��~�5�L���G|����w�+�q�L�k�*�#X��k��N����H�q1��
V7K
�YSs��ce*�_<���'���N����:��,����� �4��T�6������V���w\���?	�r"-n����g7���f����v-^���F���V~E-T��e�l�3Q�@���������U��RCd���`�H��r���_��M����^�CHt6��2b��b�T� ,���3��H?��bCECld{4v����x��CC5��2���Hagz�Co�%7�t�+��C��"��,@���������jdW�i�!����e�����Hc����yl�!R��P+�w<3G[�����/�P��t�K7���, �Y\�����=��N���>S�`K5���P���k-��y5���o��,�l����!-9�lq;(M��I7������c<���||D�3����5�!s�J����/�Mnv��+
>����f�����X�4L��g;�E*�T����3�#��M�=S��g��,5D���g��)*a�!�l:�20��d�!�.�)H5����
u�zT��W� ����aCL��n�n����P���geJS�[T�_xffg� ;s�`j}�!�nP���W�����6D��mj��w��_-!5D�������g�
y���Qj�����-�cx�����
z�k����7������q�U���etfB�K��d��5��b������i$e��XY?N���b�+mh�~�'��]���|>(��s���+#m�n����G
�z�6���c`����3���lx���5>�����!��`����6������������!��������Y���7��B��Y}���S�1Ko�.K����4���CU>-0�YG
�.�ze���GL2��ZV��������"������(O���Q���~�^��E���`��sl�5�J������O�'�
J�\r������������F�j!l�j����5_�������D1����RE�����1�C`K�3<@~m��k���,�SCU�(��v��iC����a��
���W��
d��W�s���_����h��
���	�e��&���G-�~V�:�\�]x}��-����x��{�\�0�:�\�2 ��3
&������l�Un�����I������e
����j��[�`
aD�^|>�)�p=eD����{c���	Uo�i����teJ�Nm^����9���X���Y�4��U�2)g�+v(e�*�(e���2lh
����t��/��IW�AF/o	�HGN�<3 �#{�sw �V���S��H�H�b`�O��U{�Avv�����M��GxLB����`e8�����M�������5��'�T����������X@����i�� y�����^�U�?r�_��{@��H��)��P���pa)m��U����E�>bg=��Uo!��}e���1��2��_�|g����.����}5���k��
�6e���tu�L	�;�h��X=��1Z�<a��uz@�6��p(/J+F��S�z��}N��IO~�I]��i�R:EOz�]76f4 ����;6�|�T0�����Tl}2y��}���Y�@��a�R2}��*�||B��[Q����P���AnA�'=��u�3�'t���k�^}�;��,����E����
���E�M%��k#1���Rs�R*z@������?+#1��V��w6X2=���I,���\��6��	�8e�{@��m�4�P��~�]]�Y��~$�5��)K����U�[��K��H��#V�:p�������=���P��l�&�=����@F�V&T�����+Sf��P~�T�T�����P,!��G�3�t=+)w�d�5(@��sdL�y�W��P��;��������u�U�H�[�({@�����>a�=�H����������@��0����\��;��>FC�������$4>�S%B�[�\p$
���KYo�&U����p+���Xz@��39��#
��rF�����{@�k)� q@��T�N9R��=�>�W�����:�Z��)w}J�U*e�������&�u�m��l'�^��[��������u�y�3������w��S�1j�����B������>�����J�&)w}��t����boS;���
%v����@�:�0�L��p;|h�0�DA�����W�0m�|����0�'P���X,eI���~��gs��a���k��x�^�z�o}�Yx_��~+%��,�C;s{�>jh8^�	�S���3�bo
(O�����?��b�>��n��,���6��&VK�I�e���Eh���8u%&E��;�K�C�h �l=�8u�,�6���C^^?t���lV]~�'�^����,�C����S���U�gw�m�������4<w(��%����EW������S���S�M	�	r}�b��������K���>�8�	�����S�&�R1<)�4~z��J���/
'���Rx��C��?������8�M6;#x�!�i��S��E!��r-V�847�l�����:u!�����q�'N�K
h3v����!��|O�_��:���7"����Y/� �
�9"coo���CP/���z�)�����By�]�+C��JX�N
LJq�F��'N=��v�!D�J�����a�+�x0�x�z9�t�����ge��a��k:��r,��l�y����[�e�#)���4]6���u"'f�j	���b�I�{��?TT����ze�����Cx�u<��� 	�o�0�Nd	�
��<p�q���/�Hj�I���7���C�S�j����8f��>�j��C�����S�y���Z���T���<[0��g"z��b�y�������z��"J]�0�^��v��*�S/o\�6E����uh�e�)w������z���o'�&r{�C�S?^�~HDS�����ye%��T���Sw��~-�g����������a#�������0�z�10���# ��lk�	�|����h���.�����`�|��Sy�c't�����x1F����!�R��:������J�t��1%���=p���X}��#�
�z�@<��]�������!���S��9����5WO�1J���BS�����U��QV��B�]���Q�b����K	�������$��8����6��.�f����{5�����[]=p�>+��r'��3��7�/�*Z������g�WY��H*	����##�<E���c��!��!�w�h�M���S���h\�(�����k�Z)��_u��3U~l�c���MlE_��+k
5:[Z}�8e�S��zGk�Q���P�6���t�
8�����0��qeK����#�z�Lh���r,��r���M�\��s,��a�n��
�z���3����y�5Do��1N�/9�����*�?~����hT3��@�
���mu]u����
�}\���E�aC�s���8u;O���Mc0���S���;��2��#�2�
�@Q5Wv�X���o��jC#p�'xC��-6��=#��l�4���0F��������/��P!6T�]�,F��?�:�L��C�S�)F�y�3�9�h��1�+{�����5��a$N}r���E��t������Q��`e��?
Hz*�1j��u��I5~���NS��C��=�z����T������S��+'P�I&�h������%�k���A-�*��m��X4��
v����L��fJ������',<3lh-��{��?�w����!/������@W��20�Dk�#�?&����h=�Wq�d@be6)���A0��8,��{���!��=�t�W~����c�W�m�X
}-�����|�v�p�-��5s,�_���X��v�m�����
4�_�����Yg^�deE���G$\er�_E����%j��wE?FJ4�]UJ�i5h����.
�<3����x�<+�U���8��F�8�,��?v��k_��������_����T�eP @���O}�h�:�O�Ud:J6�O���V�u�S�|j
&]R.*�2<9��N���.<3m�;2�B�8�W���@���VkG��
^��E
t�N~Wf�sUH=�9������dF��2�G��c�������q��u��uoS�\��U��{�����'G���]g��R;FJ����O
��|����[��EC��S��tH��j�������S?������J�o����D���Qry&vHq���� �0�>��JX���T)���������������5�@�~N����C��I��8�-N���t��3S
��k6�"�������N��_�?V*���W9�#�?.q�[�T�J��N.l��?��N=�����@?�����'���'C���J�����tI����7p�9(d[��![~g1j�YI�^3p�q$�d�7�,!��$u�x�V������x�M�R���,F��b��C�S;X]��k��H���>���0v�����(�f�<����IU��|�>��������b�f>�B_T��z�H>�nxf!�g%��s�|RXQ���JT���\���n�3;�;�+���|u?��b�����\	��K1D}��z�K���8��8�flb���m�;*)��{���h���S�f��N��3�*4��E$B�_���&����d[O6x�3p��u���t'7kN����88�r��2����)k��x�L���83��h�L�T��3)%�zE�(��������Q*Yz<+��QJG���g7]��#".]e���O��wEF�������I���z��W���$�_+�����������f�+�y�#H�������>��������
]��3��D�}�o���^�X#��m�W�E�o�8�H�B��O=���o�oSo����R5����qno��!w>>u?:��|��N��I���<��LYO���h\���������S�J�����[������@$u����?dm��JEag������(�JH���v��u���3#��T{0`S�H�gn?����qV�O}�|��8�h��L>u�@~UI��z�+�f�<s�k6�og������#w@��L�S�v�*>a.�gra�<��<�U�������T)n&��\��4�	5���>���>!q��m���b���O]X@�!��W���qAv���^i2�7V�6��W?����YI�70��d/	~mr?��c��ne���n���OX�}�S�Z@���>��|��/�>����8���VM2��zN�GJ}�	�M\��3p���x�J�tKW����
�?}���l@��L�z4"��6l�((��'��3�:����S�h<u?�r��A�]3u?\�����kVf_�b�7c�Ttw�����q!G2������d@^+VF_�l����E-~f�cO*9��1X���;*3�h���R��L��[pm�x��3pjC�t���hE�|���
����;W�,<���F^��6qj_�`�1fk&N�����=��8��dW��g�����K��6tzCP��B�Z��So�#��*�����T�����=Sh7���I->u?�.�>���!��y�g������8�l��
�;�'�����iL��,u?:V���*i����r��exf�u��
��������}~��D�`�����{+�=�v8��f�O����u���>x����o��
�z���)��@�j���a��V�J�����InA�D4E�V�~79��5�^�����1�����J������aE�<3c��l��,x����������8��!B�b��V�W��E�$�-X�l����������.�]g���Q�f\�e�����:�P�](+��w��0�Z�J����z�P�Bo�������q����8�������\�S��#NG%:_*V�Co��N`�:V����Y�P��B�J�z���=����J>��(���;�+_=F�:k:�{[�SO�������|�Bn3e��3�O}�
�S8�U��~���](nX1u1��
��z����-��T���H��W���C�����W���j����q��G���*�ae���|~t���F��F];=+��Q����\P���S3w(�L-!q�������8e��Q�6O��V�4��k||�R������U��|j�F��S��v�F�f�4n���CEY�Z����'~�f����������:����!p����`E?W�V�V��?9����
������;�]���7�F'�����hW����'L*�jF���G3Po^�W���W����%�z�2�z�|�O�r���W����R'���C�����GM+��9k�l~C��h����uA���i���Q���T�G�j��
��&��A���&N]8n~�#:�=�^�B���Ac��;ta�H�GZ��qzH����L����AU(�C�S�5�N���8������T�B�8��W�����u?��|���c�"��9��GQ�J��V9�Yy��7H�m���t��OO�|MgW�
��0{�����������\�7���h]���xr��8`����J|�$z�^>�2Z;'��@�f��m��t�z9W��D��2�����>�R�C�'��������fCW�r�W�S��z�I�A[�O}�&�Q��,��O�B�P�������F$u+n�a�R�����k�aoC����_��������Qk8���M��i
p�1�d� �����,���S��W��D���
�J}j�xR�����������4��ZB��}��(��O')V��P����z��~������x�OtlU�`;1�Fm~�:���kgL�����]aD�J����J	���k&N�*��6����>���C]���.�"'������%�����#6q��w�����8����Z'�u�Cz���S�5��qw<3b��<GE�`;��OJ�JT���n�25:��&�
�6��k��J��lN��9���z�;q�3gDn^���������1U���<'[w(�(6�t��G$[����|�F������v�~qM�������6��;�����N��Jd����UV��x2�����y�����������2�x����f������v�����W���FR;q���)�`����k� �U5���{��>�C����5�Z���E��
�b��Mr'�z0�i�����D�'��+1�b�s����X��du�?��r�������~�������^�s;�J��}����;��>�
`��'4��K��,�~m���A�Jo����]�����8��VmS��1u�\�Kk�8�/:.�5���S��{�����Z��,rN�mUUK]k:C���F��7��=�O�~�������|����'@�w���������H�{r�.�(cME\v�~Tv6�L�V�w�����\Y���E�I�'(W}�~\+�W��Q�U]�ZPzu?x>�������x�k��(x�����������y�N#�H�����;UJ��>�Q���2��f����Cvl�Tl|����b�15�;�(N�����u����z����
�
����?Q;cg�����C�������1�4,��.�����J���O������Sp
�o�S��b64���ae�e���%��g�~<�0�������=��j�/#_7R����@�"t�����Q1���T�D8��K�����{8����/��g��='����o�����EW|B��FL������AO=�Z���]��b���������3p���'�����C�S�31�~������c��Z�Ne5L:��*��>S�4�DF��Y��O}T�`��U���|��Wl��L-!��m�F��7���8�;�:����C����
"�����D�d#z��'��/��������!������M�3p�������0��������'0s
��7��eL!��S�Q��
��Rm�8�|����[pg^vq.MO����W>�l�;���3�;>r���%�2���'���2�PsLL��l8S����Z�'�z5_yV&>���`P3���������e�+:��'
����\xf����G�n4�L����Y�����&��3��{�,N2����%��%�(�������=k��6��L�Lm�y�}���A/�S�US/A'���k�����zv//N�Pv�����^�n�H!����������:#dRy����� ���X�Y~���N�s]���S���x;h�U/�S��s���Cm����
�����y�M�	�&}���>u����q>��P���SU�SiUOY��GM;�{I�C��=�S��No����1�����]�8�2�0���E�8��dX�KKOw(pj_�oB���d�8u�f}�Dj��?+�.���I���C�,���������1���dX]\����S�vx���zk��Z���2�K�~�_�Z@��;���C[�
�Ew(p�=0����xh�FJ>��h�^*�Y�2��+O�����oL��L0��Y�?<���5q�>�q2���J}�~)!��3_?�x�~��=����;��s�]E�$N}X#�����S�8���N����(������-�#=�)�����$��� &�6�s��
&�#z�zVR��L���H �m����3sT;_.�>�|���
��]�k���9������iC��{����8�rNM�"�N5�%p�u�m�����%$��(Qa2!;��|�����l�?unP/�S�F~�s���{�.,4�f��22�&n�����EL����Q��M-~dY��9�g���S5�t��=v��$��W��PU�j����z�oZz�kS����8��o��6u?���7�P\�/u?�v7���C1�����mo@?R��^�&���68�j�l�e69�g��l���3�{�So�!���
��T�����L���7pj;u{��t��a6��2��C"����L2��^��/(�����x���c,��pjb���[����
Qj��QG���S��Y�}��X��i	���6�����
:�$�2����P�~�A~<��~\S�9���NN�����
wY���.�]��aLX|�~<I?��6:�+���S����,�>u��
~�Xj���^�`�V��%q��f�X*oAx���Ps�CCs$�~V��'��(O�O�z>k�SW��`G���y�m���o5N�����#���W�N����x��?{M���1�q�5�����L���.�,��+�=�9<��m�=�.c���oZ�N��51�>A#�Z2���Pv�igw���������<sce�e������&��,�.�0�����h�f5��S����`]�vJ�Wyu?��	N�,�����e%UJ�{N���gY��9�����.T����M���6���D4��j�S����k6�P���Rwr:���2���*�����iC�4D�H~���8��h�?�f�5������/�j��{����@�Nv��/U�n@w����O}�BY�����l�����rD��N��z�����~m��O�6�=���+��q|����y�N��-b���+�.;�����FM����#���L|�M������z�T���h{~�p��K���Pj6�~�����O����b�:���������g��uW>u�X�������[wY���02��[U|h�.K����(
{E����r)xS��+S[X9K7*`Uo���e�����xh����P��R��9o��
������Z5��=S���@�J*n�����h��'���S�|��k�����m�Q<�:9+�Uc)�X�v���!�_>����>��JN>u';gJ����tTI4�<E�m����w�������[
�������������2��u"����|����^�U0����@����c�P�~��|����NU�E?����~\�	���C�~���&o���(W�3s��/NM��4��t"X��S��3
��+����q���q��9�2=��S����1�O��>+���m
�3C�������x(pj�~Oee�8u�w�(����?4�'��/Qj�~��=u���.�S���G�*%�r(�Vs���f u'�ae�e��g<���Z������~e�E�W	�z;����:����G
�d.�~(��������;L���S�}e�����W
������qU������*m����`}+ko����T���85&����Na��S�K�V�(l�/�H���~1�N��7�3���X�.H>���;k{�+��~���x�
��T����V��sg�jC��q�
��\�m��S?���|Hc�8�_�J���r�����@��Z{u?}�+U��-pj����1����������{�e�Jo�S������`�(jt�$�������
�g�?�S��}��a��{f��4�(���u�8��ik�b��
���+����v���8��<t�O 6�L��|�q)�����l�S�������w�L>5���sN�����S��'5o������28K�������������I�����I�]�pz>S�cR#�����&-pj��4���Uq��������
��>��`#�T����������5�3�Y1�^���s)&��nM���g��SU�ge�����\��3a��S�+�^�x������-p���c���68�,��Y�j:,>p��9k�x����R������W]u��]�)���}�W�c�S7 hnX��:��A��������+�����3u?N�W��������~��42���-���^i��]�N-u?*=u��4Tzqjvw�������/�	�Y�Z_�~tG��`�o�
���6��\u�����[Q4k�����d@��SW���V�����SW�,]������O]�8o�0T{����0�
}�`�-��m������[�~t� �z��Fo��Q�����c�R��r�]k�]�{&N�8��Rp�gr��~��^i=+��a��/��C�S��i���o�=�GqQW�j�_%l�Q��]}�z�r�b_�[��p+������3�_K�n�v���[P:d�O�����N�+v��!�7����9������.�;�|����]��E�VIZ������b�J����Z{N+5�p;���]]T�*zVR��_z���
���;-���L���G���T.�������~M�C��N\����U��������~{�
���n39�����^�9>VBE�1��������-�Y���g�C�Q
�{��L��*���uU�|��u�l"�X�:����oR��5[�Q�^�2N����QdFW�����O�/��N6"�3��M���Pn�E�
Q����;G��mE��a	����F�H~��/��]�_{�[�~����	�+k��
K��v(��� ��P������6�9��B�8�Qe�kos�����:4 ����G���T���Z������Y���)t<�������M���^���o���79U09��|�+��&�i��N=�!z+�|������siS�z�6d%c�]�@/UQ����&��d���!��w����Vf<D%�KM;�,u?:��������T�+@yT�Y��C��.H��^V�������C���{#G*WG��P���1y��|�4�-�S�������������
0W����[}cj���������p;��������dbse��&���b�R;f�����t,���L�M�����s�*>�*&�*X�����+���=�O}����������S�S�"�&~m��_����/<3f�4E����I<��[��
}����L�P���TU���E�+���hRivU�����c�"�6'�k���O}x�`z���>�:YC��y����nW�igO�������t���C���K���/��v�8�Q�q{��sa^�l��-���7�\*L�g�CF�����[��~f�H�19!U9�8��B�gf�Z!�����3C>����v�k����[k�Sw���~A?���2�e���*�LQV�85t'���6g�]E�#���7��u?,!�����|���S/7�v-�f����<�n���8��D��Ay�����Cg������G�
�w�g��B�w�	X�����s*����
S�����6�(E'�4D�����gep�� �.}GUHmhd����Xj/X�j_#s{����P%��_1F�z�KA��;�{qj��
-��4���WL
,=K��_U��S���O}������m�]6y�0i�z������qB�~�����s��X��>6vhA���N=�����p8�k_|��-x�>!pj?�z��6bRxf��-v2):�Y!���xt
Ect�%���[�����io����qX�kR�	wY���p~U-	��_���8��V^���s';|�QH�J�Q�������,�(�����nZs�w���)#���K���������R�c�����+~m�PS!���"-����t�5����7?�����+GJ>u����W��i��>]%�3��W���O}�
Vr�'V�����Ty�7n��S��Y$g��35;
/�"<3����[�Wl��O��s�Z 6	�zMr�W�f�6��	=�KuF���8ucv�s�]�s����������'����24U�n�;G�+;��U��;G�3
;������l*�����Da{��f��q;��z���Xp�!�U��L-=vJ,`���g�������a�����@������8��
������3������T/�N�|����/p�}ta�M��^-!p���9��o������ h��i��?�(vp0��L�gh6����]������.s��,���S�3�Q�H�M:;d{������6��������{��2z�x��~��8��zN}<v(lhWDR�>�����wi��I�����l��Nx��������>�S���s7��4��N��{���������V�������*CX��[�������q�z�����{.D��z��cR������L��~�9Vf����JL�j��|5��e���������'+�������K�2@��4��O������n�_8�o(����������o��qVR��p����z��6�w�(@�d��F_���+~:�5�R��N���q����,xf��k"�h�?�����j����S���G��T�zv���������o{�~��j�D���K�N�Of}OX_���p�USH!�K{����9��	��'N����S�&�';�������2~m��Q�m��K���w�cz^��~����p<S��	�p��>�,����E���7��T-`}��R���(B/�?��O�����;�w�}�.�[p|�e�y�k���C��0� n���V���S�
�U���U�8���^�S�c�+s���+����G����TZ#=pj��g%u?�����T����Qt�J7�W�,����v�/���G�R���Z��=q�������J�@L��~�t<=m=���[�����������s�{����	Pu�}+S�c�e��p������	��)u?���������z������[�H>���\X�e�N����2�CC&��Lj$�{%q�������s^GO>�A8�K�{����_����������`=�(,b���}mf}��{&N}PX]��'�����d���3����	o��d'N=���8�x|�����2���hO���:��';p�'n����h��{�Q<�Ag��/��S�����r��c���A���&iK�S\��E9h�8�7��\�!|f��N~|g�����|j�~_gG���SO���J���8�0��t�4������4��<�����?��d����1��	�
:��CQ���T�q]��OL�_���u?�M�1B9�S�c��;T�CaC�j�,w������vT�[�c�|&N}���2����1oB9�9�_%u���10���.��S�c�H$��&��x�~�:���Z��8��U�/����v����(�:.�+�k�E��-����S�,C^�{;r%'�b�P��S�c�����"O>��d%co-<X��cQ�Ps������
������������B�!Ea��� ��QW@#F��y9�[o��9�T�A���)���y����Qs-���S��P����qV���
��X�VZU\u$O��r)��2��Nz����r�lUE<���4��2�9��S[!+��58���;�8u��8��������~��X����>������g>�;�|�j��g7.<3u?�'�U/X��6���_�s�<u?��_Z#�fy�~����]�-+2�������(����~�2%du?I������$s��V�=p��W�
�J8��Vk=p�vT���J�U��^����D��p�P��S�|i�s��>������}�9Gq-L��<,>�(��8����.����Nw���6�9Gq�xM�fy���w��#)|��_��<��S�I�~T���_�����D{��5��/u?N|�����p>g�e��g��x��r�P9��O�0<�����u~Y�]8�����sk�A����aQ
������20�3�m~�!�vjg
��L������x�����]a��r���'��
L�k����t
+'N=��T�xo�����1&���i1�9���1
~����g������(x���W79P��*B������JE
}e^�|��s�6~m�1^(���8�����U�O8u_����x�!�O��s>�jua��N����������JG���2����S7�w8��=�0���8]|Z�D$����s�@�[���g�.,5��G�������B���6�c���T����H>�Z���5
�����{+��q��9��:�	���KG�~�2p�^�X�s�6���[W�N>s��^i��F��>�N��v���
u"K��nxf���B�Ko��8u�B���_s�8u%�9�R?4������TW����Nm��?j��g�c1���o�b���'�/u?N��zR������SV��)�j$���z��������P������H������|E"�������=��7�~�����'N����/���u$��\]���S-����E�H������9G�tkUzE�G����~����������5w�Aeu���g_�g/~���oju
�*�<�������u�)&5����T�v��G��i���+�Ro���S{�����c����|��V��X�s����E�D��S���.��L�J��������*xf�����wM�s|��s����`��s�*{�a���z�q����Q�k��~p���
��F������LDt�v�������C�J������0���#pj�;������d$N}|����P��R��2\��kF7���5���/�0x>�nK&�����H��0���)C�`�8ug���e���oE<"����O������=S���'���>�������:TyaN��5���+��������8�|�>�S�~M��	�t#p��F��������������V�~������h8��P�SVIf�*���hT=m�y1qhN��!��u��v)�w*�F\��������R\��'Nm{�-\�2�1�u$�z���?�?G�~��^/v�*W}��N�7�g%p�5��s���S�S9�_��U�8��7���������/�ON�Vl|N���}����+#p�u�f�$�t����`i��[�"�c�]��x|���Th9G�h��42������#�-]�J��L�LR�eZ)�6u?�I]�{x���p�i��3�����a*h��[�+��!�O����#u?>�����;V�u����������e79������*t�����������	��wY��g����i��H}�M�9�"^U��z
#=����O=v��8�B����~4����#pjs��S��Y�N���	%���7q����Lvqbe�M��_�E��3p���rzV���W��S�U�v�W�xN��>��_�������|N�UDtN��n�z������S??����[�������"�������bZ�&;fv�P�eS��~z/�.�,�8�u�C?<h�=������������?S��::%�LG��3u?V��&���|5���@�H�*�S��9��Rt�����Y��6��u�H����w������Vf�tY�!�`~;��E������3q���S����V�f�����X�Q��L>��DM3���6��SY������2���7������L>��~�|�\��e3�����
0F�����];G��!�� �bN=�5��h�9*�J��Q�@���8�.T������3�~�W��Y�v3p����tu7VF�b)��&��5/��S�B��9������KW�
M�������_�1u��M.��0�;(���I�q�&o�tf����`��j�>����>:�X8��f���wO{N7��b�3p�'���1�V�gO����T{���iC�������!�!�����b�'���v�y�1���+n2�����
0��5�3S�����wv��O����8�>>pj��/E}�������\���������+s^�>u��|�<��8�"N}�&;��
�����U�O�����M�U�I6�~�+�R����C��z�|�SN��������6��GqIN�`�6�2��3VO6�Rq;����~�z5�����aP�+�n���Z�k����KQ��KQ;�������t����V�N�.�a��*�r&N=3�qo����hL���4��h��.�-8���_%p�3
?����:�g���������}X�z�?t?�������L���`C�B';���\����w�=��|u?��s�~�P�~L��t*��CPm��S�s*���`e�C�>�\>3p�'�����`o�[;S�� ��5���,�����58�P|����fLMUE���S/v�d�	�����4�7���Ge������>�����c�z`|f�C]s���TeN����u��1:����lq��`������#o$Do�So[FK�g��_�6��B����8�>:.�v��R�2bjL��P�t
���lRO��e�������k��1��S)�q�6������^A��R�c����u����2�J{��(��.2te��K�����_���~��������~(v�����zu?�3�a	�C_��qp� ��+S��W*i@kD{1W��O�*i@��+����hc���v��S�ui8e�5���q����Z=X��G���@�������
9��vEa���AF�����.[��qM�5���o�^��C��a��=����7�<��W�~L}O��rD;�3�2�]�U��1���(Xjao��FV�~,�.�q:��V{k���������|��Y��W���Ve�=�f����C�)�x��
b�I]{����Qx��C�1���8���C<�������
�W����<��R��zu?8��4�>K�d�-�Z��������
M�=m+pj7�!���oyN=��Wi��]����g)������>t?�r_Wd�7��j\�<{���c��9����k����d_Z����8��28�cD����k:�
���\���*_����S����H����.��x���g�b�1=X���|�I-�����R��^*|��������`��|&��u�@��ZwX�������<U^X/��#�l���'��S;G9�}:��?d�2�Yb?}��2cj��1J���nM{���*Y��n*]�~h��wh��y)EV��~��eo7:��VfV�~������N}E�W����F���}��onv�"��9��nUr��bej�,j�����T�K��S�D�`�����1�b���!p����Sz���u?�2��WI�v���C-��� ���F��2_�e	�L����������pdt3mh�7���_X~�9?���:ewN=O�����G}W+p�y�=�����������uIMOL�Z�SWe������=#Z�����Xw�3�n:���>����,p���J�@a�6p�q4����V,V�~�K��z�����T)�T�����.�j5 ��^�����R��16D��c���
�zv����'�L�����|O���O}t"�F�`:�n%�z�O�S��N�Y���������=S�����=\3p�'���c�e8�/]E`��S'��:"cjcr�J>�)���K�+Z|��OV-���}��d'HY]?U�!�������I�C�����qOMw]�;u?��}Ed�.����Y}g.���N��A���e���;u?�7���_6t�C�i�:��O��$+�~h���~@
��g��e�����=�J��g&N}k(C�V��8�;�RwnbX�15+�WlK�����	<�������*����G;p�v�n��v�k��vp����_��;c���`_d��%��^��
��N���hTuT��'��u{�d�����~L����b���_����,_9.;u?�@��5��s�����B�s�b]`#nvT��R���K�@�\<���f
H�2|��^K���\�*���G�qj;q�N�{�����~.Aj�L��T�u���h�%��x���8u����]6Xw��S��j�`]bR����U��7T+��A����6����SY�4���.�t�@�x���� �?�/r����������T)A^�y:;��v������gN�<sa:u\pV�>ue���b��N}��	�������C���+�������%����\;��
{I����}�����+&��������R�]�9����8u��j���0���O]2������������n����[�6���GMm�
���W��0�#���;p����*k���������w���|����I��������:��+�{8�/���[��N�zR���@��8������c�]b;��'�|=e'&S�{��++P��oE?�����s��c'Nm�ct0��k�}������#�^z5����8uCg�x�l�1X_����:Z�~������u;,��I�{������>'Mj
}'N��DU��>�N���)�o�l����^�N!�B�L��U��N	Sm�8��~q�����?�
s�:����O�Z�J��j	�S/V
,���8u��d����~�vvS��C���4������'���O]��q�U����3�P-�J�����Q���Q}'���e6���.�3�.�����&�����U��Qi@
s���� ��������
��<�o�15;���P�2����z�x��VIv�~�S�9��b�r��~�����&�tR��3���

�6����g�|����F��Q��8��\]�ee��%p�����y4K�����ZuD��o���������l!���r<3l�7�����:������[R�;����&t"���K��k�v���[�3s�5�u�x�%p����3�����vL�3+�"��N��l��x�+��Y]����^^}j��l��T��L���W�J���~m�!���9t����K�~X���&������^�R��ge�Pc����H�������U->q��5��m*�����Y�P��+���8���%$Nm�4���Q->q�������UR����Z��)�|f��Oc}]�J��NFNGm���f�~�|VT��J����i�T�"�����}O<s,�����K��m+�������;doL-q�#�w��zV���&���q��N}���)v�I�
{���m���k��{N]�/��9R�����`v���E�%7���34A=+)�P��P��f��2f+��h����V��P����'���us_B�L��L��@����L �0������O�|u<3��;�D����g@��<�	��S��2��N
����8���c@�:`��7.�58�)�q�A=f�[W��z��>b�u�W	�zc�*�6�z	�zm@��J}���`�:(#��|P��2��D��6�@�uHTmh!LM��Kvf����y�a7e���t*�%��y�e����!6�JD��j)_E�7�K���G��>�[��P���5~�\z
" J��Z�2Bl&��MrX<��U�T����h8+�>!��=)�4�5��@����!p��q����J���{�]v�D*���
@�L�t�|C�����J�K��C����V��Q$�T�Ou�c��xI�W��i�J�5)�a*���]K���P�M]
B
����ht�pO��YIB56����lx� 2���=�.7��h0�	��o}�ae��oC���w=+)�q�id�
���\	�P�r��������S-~��������)��.93��pe6):3�m��@u�*�c�W���!�L���x#Pamz�01/9H�>,��(����i~�z;�\[�.K��Cn���W����8����"�L����������iCQ����mgbV�9B�M{��Xz��{IzE,�����0���6K��nm6��rwF��1�P���(n����Y+]�J��h(?���+C����;�:�������]��Y����B����X�6�
�kJ�K*44��5�j�*�y�T�U����2�P1 "��jR����;Is������-���l��$N��uoS�c���j�V�WS�cq���NRA�ge�6\� ��������/Z!�6�c.�9aC)���)�6��U��: �y�ge�?N��G$?9��t���V���n����Q�I��u<3{��UU�)hRW[����i�=f�E����I>,V�Zh`A^�������V�idc���X�7*M��V�|�r��8���=�cq����UO[�����hM�{�5���J
7k`�����Z�)N<��?E�����=#5�LvgLu~��������x������Y����s
��W#w����xM�L��'�Z�kS�Hu��5��+�SS���I��e���T=��b���lxf�����Z�CU���K��������-<3����r��i�[s��`�cA���<�T=/�������ZK��8+��q��)�J?������W�b'�a��1�/���kr������(�XQ���+��uw�����2i����6��{h(t���u�lg�����p�XL�z�^=���o��r����8�#�)F�,>�?fC8^8�KS��r�9?���CJ��[E�UH�2
jjT�j������ld���M[�{����1#��P�\<��J|��������ke�T=��M(:`}��1	h��!�IN�1aa��r���Yv���>��������H@���YP��XM��}:�u�?�[��>[����hM��R�ik�k��^WV}��l	k��^s��f6x7���$!��m�H;�T].Z���V=/q;R��V2kS<�,m�%}��)�0��5f�YC�����dc=�9L�!�P��/��jc��O�N>I����=�H��
Z4(W��_�"DF����;I���kZ��Em(���W������9�pS��^�����X�����b���{V����FB�|<%.�$U?��@S���Y	�z�1RJ\B���Y
����BOY��ao��q	��k�����dFVgv%e+��clP�Hu��-�)6�q_+7V&�qM�&
����C�
���L��$U�qJ���m]�����p1Z��9��!uU�%R}*�xU�T��ge��D�)+;~m�)��]�Z�U�9�:�C���H�<r���$i*��Rm��QJ�8'�R��D��W�a�m���)�~-�)>�2�04mI�>��:�g��f�2���nS�2����9�U�PJ��zKR�f�c�(�3�C9���-��S�h{I�W�+c0�y[���6�������gf<���)hJ��7�@M�2�������7<3�A�}r��n��������Q9�P�O"�i}���w�}c�3a;=�T�#%�$�����z���z��d����^g���+�:��g��O�#�{}(�����e����P���5����z�-c�������&�-��v�*B;7��>�U��
�Q�0���������R���{�

��\���&�? ��7��|�)>g|��r�����8����y�}��6DV+��;�S�`�^�G�29D�vrH���[�����6� �	�z�V>�!���)��(}
@~i^��v����A����8Y�1��g��!=s�t���q��p��3�.�����wN���y�Pm�k}>�cH��_�
��@������T����B9�:r�b�K�����H)�q�����������#���'(C�����_,�+�x�
_l�T�gKNu�F?N�r�
%���Vt�yh�n��S}p����n��T��L0����R������q>�C)����7����"�'R�gH}���^N���K	��:N���+�5���������6S\Ovr������]���l�@�{��U�,{�9���K��]%8?W��@�>���PA{��������t��!5������"�v���2���P�b�@�>3���^�L�/�)n�5MY!�@u-e���1��L�D�n����	��H�]=7�rhj�+yh��L#�_M!�jsE�
B1�
T*���1V�
���:;[ ��\�1��"M
�zm���r�T������T�w���~~n���Z�x�g�kF�~|��O`���I��ro���s���r�f���@��Q)��O�>�������!�J����C��+#?;�P�8E�~����r�����v�x��f��]i�5$j,:���EI+�]�����V^C�S������!�� �+�l%
�I��P����W}
i��N�^����o:~�����WM�t�<8|Z~��I���,��?:����d7��=m����Z������R�(t��@����`��[�\���XER���!=d�!���1f7T�gSditn5�����l1�A��=m�nK�4��y�F+��h�o]��\�x�1��Sw7�s;��3���%x}(����V��xm�R?l���������W*�Dx+�k^?��)���B�f}�a���h[��?b\_�.����2m���c�3�k��]��ND��Ay��}e��&�X���L�;�E�;:���V��[�����M\
>%�k;���i��h����t5����Q�_%���`�W�G#+K����"Z�@�W���	hO��
9��h	�6!�Q�e���:�k����M3��8��4k5:t���Z�/�X��@i?�L����g]_�kK���RFC����2+ :-�C��X��66�� Y�pS��"Ks��P��%���g�4NS��%zm����5"��D�{3�-�1�zY��O���D;C��h���'��Yw�@���0���_�Pz�����iC�>�x������~�`����4k�!����E'~mdi��!Y��5O���z=N+�
�#n\Z��w��ZGf�
%������,�o�u7��z�y���Ja�w���(}��*R��R����7��-��u��%��bO�Y	�za��	�dRK�z�u��@R�d%z����|B��k-F5W�����-`�4�3���^[���~�b6X|*����t@v��^wWR�]����6i�]�T�R���"HW?t!SC��R�w4w:�2�$�B��J3��6�\�;#�u%��T��_��UQ{AF�t���6������4���2������������{e�-���T�-DRmR���u%p��S�b�����Y)�RW�e*��/���L'�k������W���s�_>�JWw��(��OK������+���������B�����N�IJW��.�Z�'�����:������3�2(O��`|���e���X|&2�D�}��D�:d:���} /k���U2��3
��{&S�nG��Td�j�X<U��
-���>�&���������z Kw����94���(�Jm4�FH���V�&�����F�����Vjz��|G�<S3��,���jc�p��1u�X�_�{����� ����g��a�k�
�+7�\=�)�q�h�D3}O=o�;0)7���!g������RO='k�L�N�@��� p��XB��,#2�Wt���2l��;;Xb���9��pp�����R=�]�2t))��[��+'�Vw(�j��q����������;����148Fw#6Y����b���K��N�k�n'XEU��;b���5���w�P�N���zJW�E���~��#�|��	���uFE#*��S�e8�$�*����Rf_��z��c�2�����#p�o�����.
�=Se����������S?��{Q�w~����S�#K������id|#�����S�#g����]q��8�����J�zV�eBa���R���u�F��A�U��=pj?���k9���PA�����C�;������P���{TN���r�:*������r|��Gs�+�O�8u��f����>CZ������8u�@H�lH��dY�q=*�����2���������w����_B��=S��
�G|��9b�`���r�$���8���4*���T�.�ro'�6��:b����:RO��)cS����*�@>�;��/q�'D�]�Dz����1
G
-<���fL�_��N=:��>���2��Jm������N=.q��c*N�S��ZQ�W�[��}m�'��K�y�.^S��0��{��E�J��O�	)���	�]6T����*�N=O��"J���.G,NG
���������#/�(���T9|S���6�P��s���)�[0p�98�� �c�8����K�gIr�7���tK���EpAqJ�!g��� @@"�*�*����|p�Z	Zk�Zh���Z���'���0�����e=X��I����p�����;v�#�lC�(l1��w
�m�V��b8���)�����fC��6��v��bq�s��������1K����kAlB_2�t|��I��;�\���C^Z	:��A����^'����yO�H!�>|W���(O=���U�h�N�W��Z�2�W�nC�n�U�����s�'r\p�:N��gs����-���tV�i'��a8N��fU�*�
�����`��Ii+�jj�|�a@��(�Q]
$��[�����0�;h������uq?��"KsZ�#�9��TT�_��V���:�
Ey���e��y>N$��0�3��k�G��5EGZ�5�T��L��-CA��S�q_���vu�HW�a_���;t����z��-��������n1�+[�����wb���b��'
Y���-�9�c���+;}"H+�S
��u_S��j8�����\X�}59�F������e�[�ZO�O@9@uB��jP���iu5���f��K������@0r��sZ�>M������w�j8uNz�;F�j{$��bj�jvY��<��N��Nl��n�O�dE��T��g$g��#��m5�z����������@&�%��}G���sq����*�W�������]:�������56��O0���eE������9h��\�����n���G�6�9O>��?�A���������l���L>��K�j�������
����W����sU;jU!����x��!���-��H�~\�f�����+w9�j	��%ur'�S
����x�q�7���p-<'��^������QM�H�M�:w��U��k���]*���P�?��������Y���	���S3��e�=
��`���#��Q(V�Sge�j���!
����W���C�	xo�p�:���^�����[��@�E�j �3��T6"c��7
���Xb�����ZXY�O������E�zC���N����u��u>u���6o��J�{S���~���+�Jz�Q������Q��������������L8w���Lff:�s��a8u���:ds:N����4$7K����uU�f<��K��Y]���>�'��N��1u���h��C���?DNa��=�s?�����'�������O`�B���3o���RuN���R4El���j8�Sl
�9���s�[���+��]��i%��q	�u����1����jug����1uC��]z0�S��;���#���6AuF��SO�5;����������X����:��%�W��q���R�eR�
s:������HX����J��r��t>��S����Ex���&V�)`"�p��S��L�4�n]-v�����^�\��G����
!5��[ �Uq���Ns�z5�QM" hI�g3��$6�Rl�+i|��VM���7>��e#c�E�@)�������TFV���C�#g���p�^+��E�t4Jm�S�o��a������V��<�V5��6�9W��a�V�m3���t��s��fNy
��l��N��*�|���H���@?J���TQ�=�QR8�����y{�c��E���p������R�A�nC�M�:��	�S7�?��[��N����z�����+��eX�6H3��q�:�6cg�y`3��^�Y����s�Y6+$�:����4|h������::�+�S��m��H����=�+�
�@_���O�N�����9��1{P��@~������::�o���(NF��^k�S�:q�)�R�����E���j�LT�N]��������|�N�T��;`N���������T[�|��������@k���_���c�H��
jC�SOV���������9��?��S���j�	������\G�sE���x�u��'��t�x{���?�y\�+�p�V������A�*�p�����`t�4��O�Q����p����#h��lm3��nm���xE�o
����'P����U��v`(x�y�����S��������u?�d�����9c��5�����b���]�
��w�����2`m�Z��V��Z}L��lW�^A���W�8���W�ef5���������;���c���5B�#-�w�M�o�y��C�<�^i��5o��XwzGb�����+�����1:�:x��U��2���f�E��x��SklR��rl�����V���Nj��Ks�j�"����J����KAK|n�`�S��p��������8�d$����a�wWL�*�C�q�����#��qB�|���yj����kN�	��Q65py"��p��#o�~���bN28��O�Xz��w`VD�9�z���&4+�N�:������9�^��I�L����6�^pw����7q���;5E��;0P�o�����F��C�`8uI�G!?��p���V�������
yw�kHK���!�`8u[wX�i�:W�+�S/>5�-���p���d�o���5
�5;��hq������%�_�������>���S�P���j�~�n8u�l�)i����S��~u�#a��szm�q�B�"bN�t����j��AU��N������^�N��_��[:�4��
�z"���4���$�4t�u�S/�_��z���9m(�"���U�6w��n���VP���z7����S�!c�N];;:�oM�Z�n8�=�Xlh�s��S���i@I�L�����>u�B�
x���"QcQ��,�����q��ut��h��@@m��H�K~[}��<u?*8��2M��p�{MU)a}��&=�
UhE��gw��������}r�s����E�a����Ak�F+�"�=y�,�{F���������2mz���SW*i�M�����md����t��{%�"FX����p���c,W+�����]����D{7���eVqb���\����jU�4��S��a���Ls�:����I'���	��
�����p�:&�_�R�����C���D�~��+���`��j�v�����ld� �
��#���x�e'N=����1X��B�:b��|6��wW���u�aVZ��wW<�6:��g��Z������
�z/��c�\��HT7v����s�9�����8���mGp�{����\�>��v���2�xRy������������������!cAQZ_ujd�o�����[�@hS o���\�����S��"��'�J�}�p�������F������T.
�e���n`��8r���5�j�����q�����0��+��vC�-��+fZB�Wy5�9>4���o�1��ut��:*��#����V�z����~Or���������O�	�{�5@*$�H��
��S��u8��O�z�
 D��n���4��jI�����5V�I
��xj����j���f,�j��#��C�+��Y�HmqAt������O���<���[���u�����2����������,��������h���Z�����n������*_`}T��sz]Gh����6���K#G3����xN��NE�FJ�j��V�Lt9b��6d8�}5���1��sN}xSd��p?��C:2�����Y�&�Y�������{�CU���S��Fv��.�X��C���f�oWX���`����1���������t>�RyB��z:8N]����9����1o��y����Y�'��17�O]�7N����|������o���L�H����8�����]B�`�
}N�W�hE"�=���S���"��g����O����$�w�o]Gz<�z{����9�
�?;r�C}�0�:G�����Y:�AGN]��������{��G�8���sz��VC~_�*������&�R'F�:���=X#�!��h��`	�����2r�'�,����&��n5���������=��bb�m�\��|���������iN�V�����E&�4?T6�tQ	�=�S7�cVe�����7�n�s�|N�C�������NY�p�aD�X�C�S��|�
m���g:�5�'j�3�:�ox�t�e��
#/�e����L5���K#G5=q��\��e�z���2}C����N��0�o��b�"�������(�*>�Q$�WQ�q�����Y����+�8�q�d%�Z��~<tq��j��R����)>����9�a8u[>[�Q����w}�Nm�L�_Ey���me�uN����u���)�j	���TK[x8N�t������zj8N�{e���+���S�i?�=@�p��tf,�Ul8N��JP�KLJ-��\�
cdd�������y�n���G��v[��4��Aa�����r�p�{�R5�Q�<a�>�L���������3�O����s�m����P�S���Y����p��m�9�b��������T�Wob8u
��$����p���`>;_���S��a�T��r�~�C��]���p}�������o���S����%m�����B��rt����W����z���S�U�����4�S��F����0��&U���-2F<�mH>;C`N����`u�t�����{TN��~����9��9��:6Ub���>ul��&����8u��6ut�!��n�l���H��P^�����8�]<�P�GT�|�����B0�+�S�l���;P�S�c�5��w�Wq���r:<?�^w^��X-��c8��k`�M����S7��F���!zs}�����I�H�u?�r���\�p>��e�+Q=�N�8�����y�[��s�h|)��MJq�p����4G���S�Lb��[��Wg�r�\�|�v��
<������<q��8�J��N�����\�Qm:N��\�v�]�������8����5?�8��u�L�"K38��5�����t��o=�:,>�
�Y����n�P33
�NC����&�^A��8uQs��T��������Q�R}�t}��;
���V�=q��H���w���KOJ�d~E���q��q���-�q��T�J	jbg����1���Uu�s�x
nl������D�I�
�i�P�Z�����~�@u:(
�����W�L�VE~��t:NY9��V����S����=������4�����e�VR��Zz��'�����p�T�&�-^����^��p����8�v�}e�>���_�������z�-�w����ac@R�Hq����K�u^�x��M������:.'��o��S����J�s�f�������g>���7D�-�+�O��d%���F��S���~�*q&�=
��Q�c����,T���S���@��Y�e������D��R.�!���.�����+�����Y6�QwR���|��������9��
����4���>u�`#R;���dE�����#����\X���G�m�3����5���uQU�E��i8u^�T�Z�6�6�&��S�����*J7�����8u��`	�}u?������iS�B�:��p�Y����m�
a:�:gZB���1�O�6.,W�Z��W�I%G�gO�������c��p�c$cM��g���P@���|6bM����/��v5�7
�>���zF��cN�o����[\;���9]�#T�#%���7��1����*�
��S�N��l~���S�s�e��,�.����C{�n}C�S��:�	���x�4���xH��q0�i8u����U-Ps:]�c��S(z��o��sj)�'��`�s��������T�f�|Y?�-N$��KW����6uN���.la�K��t:N=X[���M�q��+V~sW��M�SG��Lp
Q
7�s�X��\��1
��y�P����p�VKa��:.xN��_}������P���C�x^���B�i8uZ�q�z���R�U���{+_��q��~/��}������5��1����Z�S��3{RK#-������=���n�, �(�'��?��T���%�S��"�����t��Mv�a�QxM����4wQo�'��rqRI�t9]�l�p���y}���n�J������������h:���Fm����Z��2��j$��H�+]�@��nk�s�j�u�����#=�Ae�R8'F�:o�\����Z��>��`~��U����Vn:-N�6]�����U�S�M�����?Fz���yRW#������U�\��Wl�������:.���s�YV&�sKx�MFR)j�8DI��g�f��f��6�>��S��&r�70 [0���o�c�P�P������n��9�l�1�y��y$0�j���s������#����O�����SO�&���#�xT���F���S�>���>�Am�p��;W�m�jC�>5y,�!��F6h='����O���%����6�!���`]��n�����L|���*��\��j��kN�R�7�,fz��!��Sk�N�}q|��5K��n�D�c���J���������~��j������4����n�Q���|�NN~��{:��R���.X;�t-=��D�<y
�n�}�2��5`N�/���	��u���>5���8F������Z0�b�X'�#5�!����se	�5��?�M�#;�#�n�6�%`��
�����B�}�*E=����
 ��4N�U-�p��.GGl���&���M�9������c���T�M�����	���
%
������4*���M����C��m�[���O���H];�������`]������b����x�����}���!�O�|�g8u���T�b�8N��2E�67�����7I%pC�i1�d���17V�<����
=�5����}���7�m�Z��~�aC�jj���|�UQ���m���m�L�JCaB�-N�s����2��SK����<�p�>�{�l@lb8uo�Z_�j}�S��9'5�?�^��Z/��U��p����pD������#���J��4���^�9��)b��u_VyR��l�+�S�U�-�v&e�2�:��W�����e��io8u����Y��v;<jg������2�:"xo���S�"#��pw�����D�]������[��O?��Rd����}s��H1��e
qBBn�"�4��W�G��^S}����c���x�N�3��qP�o�ua�d$�^�8W�.!j�{A|�!-N}� �W�&�a���3�@��}�p��Nn�zC��S�X'��J73cN����sB�����p�cr&��I����Cs��.C��f����<�P��P1�Y#^���+�/:4���8c4�����$�U�q��U�7X}LO���:�5Z},8���,���jp���g�.�N�:4�P4��/'���q��1%��PH"`}���l�S?�UO67L���w��Q@���{�j=�A�Re�J���.�@x C�9�nN�^nB�?B�<`�`�R��!W-�f�-`�z�[U�uA��@�%o~���m�zC�%����,���@���O�9bd�sZ��4P��"��wk��	���y�~��fC��z�j�]JC����i���w� ���I�9X�����Y��W�h��(<�w��i��=�ZxN
�����	i2
EY�s5��5<�j=�O9�	{���ZSS(L}Bvr~@k0����]�z5t�����A/��*���q2��:F�%?eB������Q"<�J�8���)P������5�:'6c���#Vk����#�hKXAG�
�T)�q
b�^���Seq��lhltw�������F"�r��ZRx��[i�s�~��@t�Aa����t"Q�H��
���3�y��D���Pq���O�`~S�� O4@�b��4R#��3�{z<�Iq����>�/�	eTA�����&�*&,��@ROB���U��=�'hZh��S?d�u���~	�5�hT�$Vb�.2�����Pu�@�:�{:�zQ�6O���c�����9�H�� �|ZK+�v���{��S�����n]d�m
��[�g�*��6��b���~��I)��uDRN�����0m
����L�{F������]�Z�����U�I���n1��.0�e4�Ioq�6�f���{���HtZ�K,M2�q�[�FrbuH�+[
��qy�%����^hv���55[1"�1��w�z(���'z
:�z���j}Hc`��B�u����p[vbu�TE<��O����;�z�6)m��vq�� VC4z�`�F��m�b�l4�`=b92�=*z���C�*�Pq��~��#�q��:v��^1�z�m�lM��b�uOc2
��TI���'�8���!%�H���'C
%�%�{'��2k:��D[��R����.56%n*n�I�HX��TJg#�dB��dxu��=�.�8O�U,��e^/��F,�H'�gr6gR�B
'>�&T��J�9]dyj�N��N�4?�dY�5y�{����fb�M���
�A�M[�#����N$�:F�*�jl��q|r�\x�����N]��d�q�%M�#]D�i��|7�d8uE����TJYI.T� k%'%��{N=1�	��czC����1)Au��N= 4
*��������G�_�p����v�&i��������j&n:�a��)
[���n��e�j�Y�����%��%:��\$GK���s�@��W���{� �T:��a}N�^�����f>=��`8u��j�Z�F�U1��8�*n�,	��
��7�$t�,�9N="���o�n��*G6Hj ����P
���P�3�z.�a���,��4�z.����7�d8�S���8�4?t8\=y�7T����jF�����0��@nW1U��=�y/C�4 2F"=N�S�^I�3�[k���+J��RV)���n}@��7��R2���3���Dv/(6�\$�� 
��b��*.�k)��S������T4z}�e�&.��m��o1�H������Cr����������l�9��l����s���g�l/,�@Gn��H�@ZCynCIw���Hh=��m��tbu�(m�5��.2�#J�Q�"�� eR���,�"�d8u^�q*1AhT�69N��C��o���
6�^�����Y�y"�\���~�tG���� Y�g�d��	5E"�)�����S�H�((�q3N]����[QyF|k8uYB2g�\IR�/9N]�xH�H,4�;�H��;�\�[+xk$o�P*�q��B��6�!f�X�����9�D���>��7t�Q�I���������&��Ii��jU�'/��*�����1>hEu[�P;��[,.AF���@����2�����R�R�@Ky����,C
�(�u��"��]|�x���������p��������U�:P}��m6�����35��Wt
N���Wl��Vk6��@�z_��8P��"�������0�>��E(5��B�5�7P��.A���VS��!V|����@�Z��
�n��,U%Cf�������������j	��������P�
�Y���1�����
(V�a�����J�DomX��a�\Y�t�Ux�M�5��c����jV0�����8x�T��4�����1�Ud4��w1����
��s"	�A�PDv�I�Jf�G:;?
�l�EX�w���
�:�/�T�0I�������TWQ�sB��T@�!���1�H	Cu����F-���������Qr}�����^����Fv���	z�~��H'TW�[���0�+��nh��p�l
�g������ u�*Vk	��*2������x�CI�y]r�W2l6��'��Py\�>��-�#RmC��l@u��|�J��&�Q1�4j�����I#��]���T�"�R6�k$N���8��;\���zbMh�;S�Y�=��7W�P>�2Te�H��Z�+Uw�9\?M��$��P�DK��=�P]�5YW���-_tTd_D�2
ge'T�qDT���}�zFj��H���%�gs����sz�}��f7USn����ug��I�j
���	��u�7q�MA+��M��j���)(��@jB�������
 ��[�KMcfW���#������-�����=��8IsZ<�b���\	8�
�n��FA����Yv�#���t�'Wk~�h���7���b@uZwX��f�xh*�"P]VG(��"Py�N�n�6�-�T�&���bD��
 =�F
���@V�@�+����+����?y�a�����~+�
xCN ����bR������9�������Z��q����2c`��i��O?�9����<P�&u���?�IB#*�N��C����@m������N5��	X}���NY�emu�������HSD�QF���T�����vY�@��Av�z�N(A�4w����2UGk��J��t���)IO^�������&e�O��Bm�I����}6D�=��yG�H����
'��rN���>'KG7�.Vk����*�N�-;�z����5=5|���m��jF
*���:2�n�AZx��PHU��R�����������]yG�������V�idX�AV������;*6�O��K��+UG"����V�2���������*��&�N,x�]����������'�����Hf6�UoR��bb��,�`MGf��o��C���nh�?I���^p����f�&�p�Qrei7I���6f$�����T��]02h�8N��`���,���+U�21����������3b�T56)�S��8uE���:��H+R�j�
����9�b8u_��J�*|N}�����o�U��b8��tv&O{X�wT\7����n�K�`7<N&����v�,&�w���uG����E-�	��/�M�a���8N]�����=n�#��j��{[�Vb}����bA<��m6���{�
���~��L{�R�������~���9���hIAX��������F��O��%}M�e����>�HJ�+.��:/���5�N��H;�����F�xN/��F����8�Ci�d������������,dZ<1J�s��> �	���=��V>S���2������>4�A5����3t�����%�N,�H#�A�o��4�T9�l(����W��wT,�������S���u�K��.�1H�[q�����Sp���������t��9��6�I{������?&e�:i�z�(.��z�jI��*_\�cq\�;�ePB+�S����q:xM��w�s��X\�#�{���)�8N=�Q��8,��?b���>�qj(������ilb8uNQjA��m�N]W\���U��o8���.��,�Fg8�X�S-�e�8�
�>���?)���z��M��X
����3���K��Z�C,��O�j]�Cs�kg_�`�Qql��x��v�t�rD;r����R5�eA���'T�$�veIT#�no8u����tdSM�����z����v?�r����l���g�d-�[���mh�2��#�;pf
0����U� �E��)���7�^��N�^�����}N�^e�*uC^*�T��S��!�u���9u�[L���q�2������t>��
����!��+��hg�{�9ig�������f�s:�z�?+�.��9-���|����p���%�"9���+�`���Y���O��u�Wkj!@q>����`�p>����W�
��S���7N�#�p�S�#pgS*�����_V��������V�A<w$��[G��@�1$}������#Y�'T�S�L�T��F�H��
'HD�(�Q�����P�5��9h4�;�
T����>K��%�����i
�AK����65���S���zy�[���p�)�
�����t���`V:)nR]�#
�:������Gc7��1d�����,��X�d���c�w�N���9��V�6�(���.������C�j=|C,�*�*.����p[VY�����\��eM������������4�z���VK>�C.��5JM����j:m����w���,�J�0��D��:�z1���Q�_]�wTL<�*��Ea����2����h��.@4����9�����nN��U]��S�6J*���8�$��l��C�q��
@d���Ou�z0�D�2�z�NV*6B�F��p�������]b�9�����C�\V����3a$�����z�4w�&�P�n���r�
Y���/�p��3�A5�M����+�����d�&��n72Z�
�:��u��6�Qw�wTls����}�`��eK<��9����2Fz]GD�j�}����euR�JC�����*~��<����zu��61��S#J������o���S�9���u9p�N=��C���#�^6T�n3H��V�X;[GRCot��?*��4�o��"����X	���?F�de��Vl�*fC�M��cT�w���v3
�TTGu�����1���q�������sZ��)+z�h}��a8uN���M�`M����%Cxr��Ee#V���@��eb�xCfC�C��%�|������&��X�p�{��D55�y��Z��*cI��]�+K�N]�
O.)
����T���t����"��?V��H����[��U=�?f�G���f��)������vm=��'�zC�����Z�,�v=�QM^
�~������gG�@�2�rqw0�z&E\�)��������(g��z
0�P5���8��l�������w�Mo���
��Q�j@�q�+�$�f���)�Q�6�5�W��~�E?dQ�:y��q���T�F�a�)���R��?���Ui	%!��R]�c%]���`��c�N���u�zh���5�P��V��~����hJq�h����r�B���b�
�B���x"d4���f
��u\�_�����6���)���b����`������X.U9V�����U��t�:�{���s�������t-P��(cs�����������S���"	&������4�7����p�3�+@k���e�lj2�oz�7gT/��Q��4:i�T�U�%N^q�����tF��@�p��>������9�zU^���arN�X5W���1�h�xN��n������l���N�&k�0��w���H�t&#��%d����R}|��L/1�N����8��k��fHu�[�FV�+�R�����W�,z;�?��6o�w����?���>��9���J��S?�Hu�:En$�%g�%dvv��y+�%��������������'������NE�����e:������>A��u;�\�#��c�Z����37��X�'��G`�`B��jX��Q�N�����1���.�:bG|��l1���U��A��H-��w����i4����1��J1"{�k���z+��7���'8R�YgQ�[�O�.�?8�b��h����E���B����q_��6�^)e���P8��R�G�T{��5C��P�Y�J���Z�C��������S�##���5�j����r����Ks����=l��4Z�(��kR��f��z|kHu���>b�V�+����:�TgT�������%R�bd�$q��,�'o@�lV��y+�F���`U>b;�?��H�?�����-@�b0zSKh~/K����S5��!���0�{�Va�sZ�lDo��� Fk��p]�o��O���T��������`s��Ug�jL����]�cU��2O����gT��V��W���	s����mHu�o��
������d�Bu'}��T/�L��;�'OAG�WkD�����>��3��M�
*(�iHu_���B����w�����Je�tjP8���ZI�\�c2����h��R=B�z��
T�����:�\�����Z��������Q����v*�JK������6D�����DL��X3����GL����7'
�6B���m��[�@�%o���eiK�l}�]�:�?� ����W��F%4���O���U��In��W�M�'������ck�8���<Yh�d�����WN�����:��;�3�#y�Y���� b��������?B��e��9���3�����So��p��;Z5�<7�t�}C
U�K+��Jh����aN�Cl�>7���#]=fk~V�ZBnC�6J�
 �HW "�n�R5��l������XtgT/%���G����Q���t��;�t��I+�������OFRo�U�t�u����P]������:b��1�V��Ng�HW9������sZ<+�e�5�2�
���q�f�SQ��n8uN�����v��4�:E���#r�[����������FA%]7�z,m��z,hX�]�cq�����pZ+�]��fx�md�H�C�p$U�V������`�W%jiO����Q�}����������r�\NgT/TUY��	��t�z����F�b���N}�`�g���+�S���I���b��1R�7���^w��9�m!�V��O��A�N`�v��s��r9,
�4o�
�NM5	T�+��^<������E/nC��z����m���b
<�+<��C�3�+<�����������W�0�+e���	_����j���o�����z �����]�#�
j�e�w�����j�	�d�Z�n8uZ��+S�NaC�c�6����n�N=�d�;�Y#zs��Qk�8?�Q"+XX?��x7�zl\��Y�R�����3�����p�1��"-^-�p�1#�mg{U���3�'5}�N�&���?f
�+���2�zB/i��`�����U3l��p:�Bu�����[��v�������kZZ�Oi1u!W ��8��Q=3�lT�OD5�S�����R�C���bb#���>�p�\���H�v���9]�avV����l������������SO�8'2P������Wc"JM�<�w�C��:l���xw��9��$|N�tm���Y���8P�A�E�������*V�n]�
�����p������e���S���K-��s��Qii��1�1���[���i>��4��p��j[�c�nC9����$��
��q~f���W�a3�>x���sB5�=�S0���-�����	��E�j����������[���vT�+�q�BDt�m�����2���g������Ls�`6�]��y��|6P�	��p�z����j��F3�n8u�u���2����O��5���>����+����$�m�s���
}���#�

F��hJ������o%IG����CT�HX����~@�I���p���vz��*\O�N?����5�#�,[��Z�T=b��S��Uc2�-i1uau��RT�5��'�J���H��N���m�)���4��<�����21�������tK������
�m�����n�^�l�J�����em� e����#�w{�y:h|;�����
`Ew�j��������.�d!7kN}�"+o���t�����^���Z�C[G���a�S���������������m�i<��YXx��B������XQ;X����#yL���}�)����X��*�a8�L���b�����:p���0���*��V��4�:��lm��P4�s������:���|�I�7���3{�c�57����-�OA���+T���W�^)nCsS3������� ��,a�N=�V#�X���i8uF���n���+i��?�1W��zj��g���e�vk8��C6�Vk�P�.kD'��3\��c�,����=j�D���6�.>�
Vk8u)�<:v2�Q�T��sW{
M��
���~h�0�[�+$�jz�:���+<����^Ui�S7��
�(
W�X{E������W�h	�mN����1�shS�P$b��G�bj�����p��7
�"������i���]�W�N����a(�h~�i�a�YY
g���i��l~l]��n��+�\��U���y�5�{
j���OhT*���0�����U��f
1�q����+��=e�W�H��2t7��8�?46YZP�PDt��Gfo�F�&����;;<���	r:���Mq����UOAW��Ur�XU�{N�{�F�1ie�t�)�/f�T+�(��:i\������@a������= GT�W���"�'����?b��t+�P���U]��|�F6E��tg��G���t��/�u�S���2x5���S/LBo:�Z`$�|��
��=2���*�/�"�a����	�S�;�!��u�:�M��������s�i��y����z"R�{���nb�}�IN}�����B�/j��p���F�
�YW���n�P`6�P��DN��$�0�����@u����3
��P8Y�!z�@�p:P=2�q-�_"�]GZ���({{V� >�P�R3�I���4]�z��$c�0q�����>�~�g+E�bg�5Ui#��?�^�)9�)���O�oT�k��S�c&���"�i@���"G�V�9M<���}9� �����E�y�����b`��%x�&�.��B��9=���2��������<�eU����IZ�^u���5�����}�f���M�#���e�Ih<@C�@�+�?�P]�T�tM�m:��b����7h�QskP�S�4�V���e�%U��i@u_�P��)�{%�
AX�:��sZ�+.IK�E���)���$zU�wP}(
OlX�a�4������]�w���;;eJ��.sBuL��r#��&��fcv-���,x�,��v���+z�xC.�1-�'������
2N���s��>T���jH�V�Yy5����6����[.�OH����c��7"����B�������b������)g��:)?�z��7���5C�rI�:hP��~����k"y5
��%��2�e����^v6�+����G��j�%����m��~���s�*z��.���������S�j7K��y
d���A��t��U����
��?���U9��	�i@u]WnG��t�P�V]��"F��5��s�t�_%M��*?�9l�X3
����k����x�"�z|?�(=f6��i��/��z��9�eG���2F�
-���ZP�.���hC���q�,�G16��x�h'/�\��I��������i@u�,�]-�W����b��]���.���J��6FRjCT��t�eJtgP�� =�eE�*���[j�q�R=���gQ Y��w�m_6
$��q:���d�A�H+����@�D�5������Fd�as�hC+(y����sO��n�G��.�w�����%$��g@�q���K:�4<���(������-��)����6d�:���zb�el�0����t�����m�P��
��P��#�ND�\��p;�!)�t��I9����"��c�^�����lH1Y~�s��(\�D8T>r:�����^=?Qn3
��I�t��]QbR�S�L�eDEa� =N�Xx����#��Z��s"�������m����������(�Y����^`�hC�*��.Cz�&p���fCyXB�9����t��!�t"c�����$��S���Al����\�#�p@�P��z0��/Y=Do����u��o��(�����`8��m�J%�~�^�h*�95��2��WSE������@�S��v%��i�21=4�l��+n���{p��`%�jc��tAF��Q��o(�g��m;�����!mW�S����P��?��31���#b�D[)����G��7�&Yy���}M��T�'[��i8uzw�K�GG

������R��u?4�o!���4?T(�����s���?*	���iEW��k�L�V�����?�+�P�����[)�����j������c	����DF�e�S�% �t�M�#]�jk��xH�����n�RU��D�R��N���������,{�c��F�7���[)V�#-K`�����?ot������0��sz�t$mdk(�o��?V�.�^`T��t���n��j�����N�����������B�%�	.��n�sR�;�	�1��L*"�V���5h�Pp:8N]6�9H�|��
b�/[�J��'N��P�| N0����qPDt�n
�NMv,�=���N��sP���i_��=���*�|�l���O���lxM���>����s���`�%C�!�X
rtR�������U��'{+����e|�m��tp��%2�'�V���������e�9:�>���l�����=�DD5sz��LGZL���d��Vy�c��:�V�IV�z����������:����5�'8�z�
������=N��WO��'���1Fm%�q�/�#]�!M�R$��%��G�@�U��3�z,�$}��{�����YcN�1�z��>�$T��_�+�R�P��EwFR��N�^��,��HoO>�-���������E����e��Y=GE�b��v�����S���j#�o8u�wA��(���	�aZ�%�t8�?XHG	5�:�^�-��w�-�p�{+	�#F'T�vo"9M���c,�BL=�Z;�o��������s��j��s��t>u�R��J	s�:�`1O������CU�����?V��2Z�F�#�����-����6#��Q��<��?��H1na�')�|��.KH��M'`��eI%�2H��Ez4�:��|YGL�s�t!�6E���Z��u"���M�6�H;�"ErX*�9��������Z�
��G���C�9)&
���Q�����F
�>�5��H����N=���1�*�G����!>��<�eD'�5U��y6R���gUK0�:/������T6N}/���E%}Z��V�n���=���}B�|��N����	�H�}���SU_#�TBw���c�_��x��6:N=h}#�Rr�������-j���.�Q��TI�N��j����X�������<����C�T�=:N�)���4��b��kd����v�~�B�}�k�Y&���e`No�0�����Co���nl����:��	2�5P���?�,����$[�xCSSl�e���9������OP��q�T�6��G9,XE8�������$`��)��:GR^#�,[��g�y��!�()Q2�D��h8�\�zu������u	
�
��&Q���9j��
����\�F�g�e<�
��3��?2�A=���rb1�.����q��]�
;z<�?��O;������m(3C����
����Wm��?*Y@
�������V������&V�4�Y4m9��e�S���`Rr���R�M�.����Gn��@n������K��u�.�Q�Zi���gt���X�]�	�+�Sny��(���p����3�9���q�����>p"N=�V�-���
��Q��U�53;������W3��?��U������	<��\_E���?n"q|��x���wIi ����8�C3����*���i6�)��PED���u������r
���@���'������	��uN]F!k$�#����U0U��x��H�
��N���)�.�@�j����'cS�s�zI�h������Z���$��t���$��D��
��o7: KU�CqxmP�����	�uN�6s��7d6�<X�]V�[��S���)�%��>��? ��!��"#O�z����a����<)-qp>5**�h�0�9���76�>�`.�'m�2�o����1��^����'O����6">F�x��LX'EU�kN]f��%dN'<���Q	\O��n�%���������?W���}����w/^��y����>�y����}��������������W��|}���7���L?����������_?�����o��}�����}q�����7_�|q������.����o�n�:�v|���7�k��������s5oo��|�����I^y<��W�R�����Dw������?�������/������9���X����.���S[y�������W_�|��z�/_�zu�8���w�_�|����~'�j^������;������wq���woo^�������G��q������_�X�����������=|��������������{�����o>[_�����N�����/������o��}��o?{�����g7��x
�C�8�>L~?���;����_��|wk#��eNp��x=~�������������=���������?~~��7_o����_~���am7����#���}��2�����gs�������������g������r�_�o>�����_�����>�����9\����/��<�����wX�������9��7q��/��{��_������/~����p�7�����<�������/���O7�=����z�O>��7��{����n����7��n~���O}v�c����g�?�W�����|�������������;��on>����������G�I!��1=����?���Ww_��������y�����^|����k�=_���>��?���������y��yN��x����}>{���Ssc[)7�?W��������-��ob��n���.�pl�u~�����o��|����7w�������W��?���W~��������������q��t�����7_�����������
���77�oo��=��|��on�������_������7������xw���7�n~SKN�y������?����/���?������\��v��������|u������g����������}�?�=��������=_������G{�������vc����7���������>����Z������������[��������6;������o�C�OKK������ZZ�4��q_��sX iK�����<�����i���z���_��.Z����l���8�K�����u1|l-�A��W�=f�Yk��w�O�w��q�������%�p��ZI�����c���EM?o-�')�k(��_��5_~�fY�����k��Z,���kY�,���Q1r5��)�����G���ZV�����_�w������������m����z�����,���������
<������������}t|��{��,�j
��n�?v-�G����}���H���_����/o�~x}`-���-�_S��o�|!?����1��1]��:���\���wo^|���}��^K-\�������kY������>��/^}u��?m-�G��_'�����"t���^�����W����������~�OX���Z���~p-����^��Z���������������w��������=������������Wo���~�Z������r����?��?i-S?���(]Y�BO�H��U������w/������G�ys��^>�����/7��kIW��5�����K�H��ot��c}�������{?���X�S/���e�n>��<�����S����e�G��v�va��y������C���C�����#����jO��O��t-��7�/���}w������^�G~�G{��/�	k���#������F��v�������\���r=����\�F�p�}�GW6��c/?�n�mw���~��{?�>z-�_��{�����V�3���_�m\����?�������e��C���Z���w���n1��Z������k�v����#����-�������r��'�/o����o���r��#�����������7����������_}�3�bk�~��x���>}�J���$���jkys���x@���y���I�����G������F����=�����o������7wO����`�������%~�{9��w�����g�G��x���V����{�����Lx���e�Zk1z��?��Z���_�O��=�?z���]�c�����*�^�?�7z����Z����Z~�������o��]�c��y?��B�X�O�F?|=D��x���{(������+=�~�Z���G����EY6��?������\����o[�y��?���'����ww����GA����R���#��{�H��C���?�
�sQaBl���TW	FH=����������+ ����/������gw��zvQ
�B��UC�O�jcp1���r�*u��N�^��
����1L�L�����������`�U6#��������>�T�������b�"���xio����vD+���;+�^�Z��u�M
�\^�W�}'MDj]\���W��x�L��!���JMr�r�)��
��}V|�&Rn.,���w:��$��
+�Mk����R%n�B�!�Q�~�.vf�B1���*: �dd������Uwm.�:^j@Mg[�����bH	�9��+����QA_'G�6�NN�/����R�N��^��U�HMM(�8c��"-��rR��7��]���S&
 ��4}Hn9��j����������4eLn9M;������9�����*�P��\����S:Tk�xyX��r"Jd�;���d�����\���h�g������cw�0	����66�����+��W��W�~P�K|R>c)<�eF3�����p2���;6�����~Fm+������g8r
��.j�
��J��F�g��r�8�D�����wS DV���X������T��n�1pp���]�q�
:�zx��K�v��Q�r��Mi�������jxj?��V����T:r�X��r��-�P���r20���Sy;�e�[N�fL3^:���g��Q��E����c[�w�<��gz��jQ�:9*iK��������d�(W��B��"^����S+���QB���s<�>c��r�V�i9����hE:������y�$M;}{��[��������
�[�8Pq����jG>�9����fx[�[N��QNQ&mn9��s���f�n9���n+^s����S�uv��y�"����U'�O�%�~�O!��aN���
���#�h9�u�U�#�'�NZ��tsd�*�Y��$�*�Plu�)�ff��,f8�!���hU��������NsrS�8���p�I�j�w����$��x�=���@q��S�vL��b&���^3��<�[N�I�&N9�s�[N+�3^��t�\�Xy����~w9���@���h
w9�"$�i
r���W���u����/�����P.U
�����]��"�T�5�b���BB��$�V���&t<�x����b����R�R[��������1c(
��D�@��v�()�7���7�����) �d��b��K��Mu��O�KMJ�U����Z���D''��e��	4~I�rCv��c�QG��pE��e�c.�*�:�$(w��c$;�Fx�Vu��w��>x~�U-#/�d���[�*x.S�)Q�8���C+�K_B�����hC��E�E(���)������Y��K[����R�{�3���@W�R����s
��Xr�1Vz�������a��d}�HwA�G�����	9��3�8#�S>��GD�qm@AX�U�c^
�u$����s^����)�t�R�������6�r��v���*�G �'���I����4���R����2�Anj�J}��{CO�4�bb�K�\���]�O��<����j1�,��#��
}J��� 2fX����F�1g����E�5�c��p�1R�Y�.�����H���^.���1�����5�cY�����N��WA�6"����C�%6��"B�^�9��Z��b�������9���m��i��e�||��!mAH����+�k��'�L��(�$3��T;p������I���s�f�h]���e�E���z:�\��y�����i�P��
���Z��q��g��t/�(=�=g������*d�,Al����K7��������%��L8	��
��j�U�_E���8?�A$\��+�W���g5�brO����!�k�0�O��/H�[��W/L������u��GC@(T�� 3�����sY�,��#'��P�J_�?V;�%
L��.PX��PH�>p���n�jq*�����y��d�>g���-�_����B�����������_��U_����~�N\PZ��������j�
��C)(?���c �(��.N?+�s�cF]����u*�W^��<;1�V��e���R,������8)�&��S{�@��]��\����
��������
�i�� h�3y�l�t���&����!l��S����
��3j�W�_j���6J��~sE�c$��::�\�t�F��W�CD�"�,��7����
����J���k�v�5a�u� ��k���k���w��k�yhw75�������w�M:�������
�`�%a������06�t0u1�
��o�+��@t�T�?\�57�H�q�G��0g��^�
��CX9�8�b�k�v}N�[F����W��j:������s�����ht8��0R������A	*�J��I�F���EII`h�,��#@�,�oH�'I������u�'e� �r\p��jO�����+v����!�3�>���������m
nC[���^���*��]��X|28z�Chn��/]�)�O�r�c������Nr0:��jw�S<����R0'��|�����+��x�:����O���J��>����#=�I)��y�����Z��)�Wr0���f]=����%`k�%^������Z�u1-�t�Oek;R������|���S�U\+R:)������s�s�	<.�#��L�F����C �O5�"��
��l%G��+� %/^s_�F�����sJ�u?Mh��2��)���h��!	�]�������5�������\!�^(#c]�L�M�����3�\](+����"yp���H�3������D��� 7�IfN&SR"�TN)v��(�/�p�B���]�`����H��`.(���>�x!Y
s���ptJ��E:]��Yky���3�[����z���c��H����T^m���q�ht�������:���2�i	�7���i%Q4c�CE2����%r�Ar4�����������1�����R�S�H�$r;�r:�s;����L����`�7�����~��0��F@����BTav]���i�c:�2�����R{@QR��`�BdK�E�N�u�@��o��QX����&��&�lh4�XtD��CW:�%��������_��{9�l������`����:�f]�S;g\�v��������ke-�dCOV}&R��,�����O��~�W�z1�-�����m\�$�Z�O2H��(m�=���SZ�����LJD-�sJ��I��	;��M�T�O���O�TB����w]gE��q�����9-��'IP�E��x>'ib,�D;��`vl����9�t`:�P���
s!90�W
X),��';���6������o����������4������3��9R�7Nb�&�~����>vdm��=��s.�t�`-96}�Mt��SG���Pq#N-�=)��V��Sd�96�'K�"�N9Y����I�����������"�4O��/${�9��I��qsl�xW��d�c%/�O�t8��
��)y����N���Pf���,����R�)��l�t�Y�V�z�Q8);4�Rr��NY$���.�X��	���jOv�$��j�}vh��P;�J�����(f��MH���.E��?�i�����q��hi�����n�����x���m7�D����H���:��:K|6��I����^�����$�����
T("�W]���ti��&�'�~���2h��S�Dv�c�e�
�������Y��
��.%9yr������5���.���U��S�4�Ij�F�K�XwB�07JV�sh���$�k����1T�����#�����a��L��Nq���{�������9q���9���L6���4;4]C�H�1��'�l�Q�(�j�_��]����k$f����>��������O����:�l�)��H2���[�Y��wG�����������L��� E�r���L#Q���J{��L���)��S���t��D��yC�L����7Z���gG�����j���L����
,P{?���|�@��#��{�>'++������A=����r3�'Qz�w��>�&7�D�F8���*��7�A��|Pe��I�.Lsmg��_M��u�4W�#������J���A�v�C�������������_u�]R�����Xv�E]���2�t&��#���� ������������� �WG
&;8}?2_���$J�N~@C�9�
������z����y����p���
8=��	8�G�R�&���dr8i��d�YP����w���V���U�`���C�m]���Y(�ih��t+A��jAA������4(��@�W��t[g�����B$;4�V����T�u���5_�������#{�j)K(V��t�	 V�9���tM$E1(VNp�Nz
(�����G���{�l���T�4;6}���W��g,�P��44�sl�- l\3�&�
����z�����j��t�9`�TBqv��V]��'�w~�Mc�]��������U�P�ne�Z!��3�>����@Hn-;:W����d�|7���N���D?�a������X�XT�up��:q���IK����m�,��(0�.�>W��[�@ AG�	Me��9�������
�u$/VQG�	�����	�~N?���1*��T\�{��~��s&}d����X�>���J���
8���
��h�k����M��V�iPt���t_�$��+�P���D�\��]�4!��{��HYLqp�G-�~�����L�'
����|�&+N�=� �>���G_p
F��f��9{��F������]�3��E%(������1��r�����!���D<��u��jS9U<��D��b|N��4W������jy`G~N����}��K"�yAN�����9��rx�W����p�O��{|N�e���#;�i	E
y����#�
�R���8<��&r���ZQM9��E_�;+��)HBIO�Ue�1@�!�cz�>m4I�'	/���}4��P�krqt��%w�D������}����"/(�^�t��
9:�K$�����&��t;�g`$kK$^�9�[��������'�����oH���	�B��(��tJ��HP�V�-N6���a�����r��N�v�����t�S�3���Y�e�&�\��X���@C�S/���H&e2D<��hF�*bN�W�S1�w*["��E����th ��95.qt�v���f2j�:�y����������Gptz��1������X�M���c����X���7���G[e��H]��P��C�������\�$�Yf��1��e�����?Qo������5�@�	x�'#��9��m�ye����e���i	h$:���}k�z�$:���)A��7��1�^g��*�(�N�����L�(c��4�����T��D��3����)�6���[%��z���+G>S��tMz%���YS��4����^�b4~����9�>�v�l��F����n�7D���\��{vtl�<8�8��=X�P/8|��G'�|�z��|R��k<�����+���2F��	-`e[�+��^j��Y	)���_��C�nN�xt�B6A���C�����uv�H�~T��f�L���&cg^|\����L����K��)��>��kk;LW\&����4NS����	@������FR��;�i�$K����.����{vk|�xtq��o`#)fd�+��'Y����y�;1�5T&T0y\%�����Um\%���	C�����*y�T�9TjKx\%���&�c��U�x�#���2m�5*�^�)��w�S��)<�q�<�d?;����LzL�Wy��w^%���k\);dW��%o�p��3���*y�5��J��C��.�q�;��m�;�Ap�	�v�Jri3�U��2%iJ�n�P5L%�����L��.4L�u�OB�*�K
f2�c�H���I��'�,�`��<��x<� �L�!H���k@��flq��p��8�	�r��Y���Qh�M.�����_`��9��eB
���B�����R�R�*��4�-S�����A��OH��!��&�����o{�<�2���
y��yG�C��r�<�.�W��X��p�}�*Z�v� j}�&N����VX�4- �m�����
0:�������G�v��!���"����h�������|Z���eG���]����Df#l�:�*c�~,���x�XF���L&~����j�>S���x�A)�-��X��umcg
�L��+��~r5����QG.�Q�wr�����8fW��.��P��e�X5$���I:��
y�H
�Aqd�|�!�Bp��qj��B��G�'�wi��}T4�+"�dH�N��)D��?3B!���w}�1��[�KK8�U���-����[���H�(�����&p:M�������#����aD-�+��������x�)�C	G����F�e������>v���p�<�^�
RVcW���3�d�����dW{5����
MC�HcT���6D�I��j�[��?��������A�,���*�� �q4��B�pB��@;v�<<���~��������l{���7I4�{ �t)��t�D1�
�������c����M�v�g�j_7��mZ���Q��F_W����m��@��-�����P�J��!�c��9B�dK�C���Q���
Lz4��]�MN�����6/Vd�$KwY<��V��-����:!t�n
3��Ge��Uv��	]%��R:y
��Cv�<<�������*yx'<�C�))�������������J��!d�J����������Z0�V8���?C!�O��-��=�}�2����MwG1��b�jAJTw�~�IZ�K������s�S}&���	]%��N��y�\%�z)�V���U�=���zzv�<��r]%��t��{U��%����Ev�<Q��e��:�;������C������Cq��h3�w�2x��M�cd��R)��_Z���z��o������Ut�t����0{�]C��<�J�	M��	$�7C�_y�<8��2:����@��J�����
y���Vlo^!���[�J�����r�S�6�J�Lpz4���Fi	'i&8=<��>�
�*�����t�C�U������*�J��<�l��d���FN��2^���^�$��L���O��)J��4�m�������Lz�}l�KvDL�
y�6�����������35���M?;�IQ��`Hy��gRRVL�C��B��p����3��1���+���r�<\�A/��S�8�Jy8m�������J���\nxfC>�GC�J����+kV�Sb���Lt�{���s�T)���t�D��~��NG�>~fwV�lS�#]?K&�.����U����}~�N��/{�<���CU#�y�_���u�o{�`@�2�"����4�Q�v����=��/����Z.��u���)�W��)�h��oG�jyt�K,��
�:�G4�� ��p����]?�i������pMl|�g�9���^��$7��#�::e��^�O�c/"�z!�2����I�S��] ����-��-�d���Lp�^J�/���iA�\��N���m/z���&��'�b[za[���+����,�����D�������+�����v�������O��mC�=&�,��y���>'Wjl��ue�>��F�)�9o�u�-��aX7��C�>j�W��+���tt�#lB���\i�o��;�����n<��A�n:�i�1�iE}jyP�Z�2;��Ifwy�2��^-�R�z�6T�D�m+k���`�|�D��g������Zg�mQ#���������"q�P5�Mx�J��B)5��2�.�
�;<�����W�q�7B�xzBmP[}A-W�c�f
-�P���������L�!H,�������
�����]�g1����1VBA�<����W+)94��ZSG��LuC�O?gl���2^k���_��
)�Z��\)�2������Lxz��u����Lx�1C��@���HxzNU��C�E�3�����Qz=S����Q:P�@��g�	m��[
��!\-�]���[=_��������u�* �
M
j�^Y������VK����N_�}t��Jx�\�D�4�2�t��:m�i����jy4�^�KGW�4�j�(��Ke��-������%+�	m�/���	�-���}���eev@�����F�-��}%������rx��Y>
~]-�Z��m%�^W���v�"
�+�isx5x�tn�J|��|SC��c���������l�����k���z�(o>��M�O{/H8���iA���B
��+��"Sz��o�Jy8FA|M��W�����L���C�R�i�/�i�T���(/�0���~Jy�
t��.�u�<������-��u�<�+rcn������d�B\Hh�+���NP�����8)W�������DQ+�����4�����+�Q�����l���xg�/ib��+�Q����j�:�J�'���+�4���|%�d)a�)�C�I���"����0�-v�g�Z���cw�
xz-%�=�����A!��NZ�����E�|8�]%V��AI�k��tY��R��������e�Jh<�����{N���FX��������P�������h����J���8�U�;ue�^��|��=v�<<g�8�j[�%�Q(������r�<|�����pm�u�<\���3!l i��JS��C�F�^W���r��D|%:=�[�:�@����tv�����[9��r:�T{G����:������+�Q����%qb���i��E5�ZB����3ZB{p%�k�:��G��P�����t&8�8!����^���>�<Z��bo��L`�t�
�n�����,_� ��q��N��*��*��_��H�I� ����,�k�9��.l���+�<\�Q���A)C�k@�y����-�zq0�=�P�?��c��h�.T�X@#��G-�,��W��J�bR����J*y�����N�1�T����J�|mu[���*"r}MN=���J%�I��S��W*y���M�������1��	�ck�x�yN���DR�����~\z�B��j)��9le"�8��a9����49}�a�F�������e��X�{
���U����)��F����n=m]C��t�
 z����ix7��w�[lm������������@.YbU)��MMU�l�&s���H`���$$��W�oL���,G��C��@��om�s!t�Un���,�YE���u�������S�<q��<�:j���%G|�
�O<�>?_��e������������@������o7��us�r�0��b������y^F�����
w���b+79���W>�W��I� {��V�����!m�;P�u�u�5�����Rl��<�3�R����F.�; ����
ri����F�_�������p������������w��N]ZwMc��=���������z�*��2�3�C�*Ol_,�*`������s��&Zw;���*�#P�x�[��
��9dn�������i(��@��|�7��m�g~������UC�����l8���#����a�Y<ji9���Z �<@�yz�3��#H����������yz3�<q 
TPr���������U�����V�����B��45z��e���p:�9����9���<��v�����n;<�9�\��bO���u'�\�qa�y��8����Z����G��:&�b��h��:��T��#�Y��O���I��;��'��j�" }�������t�pV+�!�2��VCHx;�>OK�Tr5���x���I~�n����*@w���)����frQ����ol����	��WE�c��{p���	����w]:l�R�p���/�;:�Ua�����V���@���v/5x�tL�/����K�x�
����<�������pS�A����$�G_-�Y��;������tA�v��s�ol���W���s\�G@�
#7lN("&{�'.�ev�������?c����	�Tk�i�P�����NF7/d�R���!������1�J�R�D�wVb��!7�A�U��J`�������8�t��u|"<�h�-_��S�W��j"�������_�G���U`��W.����V����T ����4���W�2����u_���Wa9���Sz��l7����{��i��� ��1�6~y]�����C|N`����:*�m���!OG��Y�������-��
�T��u�o���2���Cf+v5��:�����.Yx�~A�������C�wk�`_������1�72�����G�b�:� ��s �|��i��~��h=�G�Pf�<_�0@���[z���Z��}2Bf��+��,0�=_^�W��XO`���<,z�)8�!o�`��0�;�0�7T��c�@���'�A�������i�]��I���M��*.wJ"�i`�Za��''0������B-xuh�����R��{9��������`��u�w�t����u�br)"������Xh���s��4��BA85o+�i�������sN��brIZ��]E1�	,sC���/`3}P#�� w=��r=����y�REb[�8�!�	�3Y����OH��gy���[G�<]�?����q|��'���pZf���bZ#����d�<�so��%��_2�����a9����yl[7's+��*�@�#;�!�^Q���S���;Q�h��D'0�~&B2�����/Cv]
������3����'K�<�n�9C��~9��usC~\�u?���I�0�Y`�+�vz��=���!��\`N�z������r���0�Tt�$��t����A�\���B��|��!/��c�eH�zF"�:=��kN]��>��\!�'�se�;��_�~c�9�����}C��p��;@�32�YEj���?���$���6��r����N`���B${�	���|^�{��j`��e��F�m��UO`�Q�p:������!��z�W�h�	�f�bi�����3z�K��S��$%��G��BH����8gQ�o@�!�|&�l&b�6�s$'�:p4�&�'0�39��Pm50��<����.����
%�W@`����g�K��O`������%��Xb�!w���7�0��?��o<�l����Y=���'FV�5c����cN���XHQ<yN\�1@c:/O�O�`�,�q]����"L.'
z��jD@W�e�\�q!k�C�m�/��+���C�tV���nN�H��5��!{��]�-�s�$�S_y��r��|���'V�*8�Cva|GH
��dv�C���J���,��]h6+jrw�`���]0F���q�� �ju���F�3�|���z�*�����p��r��q��pp��I���������(���x��AHO��;M��?1���W��&X:�����)��4���pC���m�N,4uC�����|��H�4Qi����<�i���,�sg	������T�	����&�eha�[�"���Jr��#o�������ZJ2������*�=j	0y4-����E�������"���������L���=����2� ^Ct��7I^jI=�R��E�	��ubi���X���p}����:UK����K�LS*/i>5����������O��D���[)�g(�hBRK�����f�����p�b�Wn�����f�E��� ��~���F�x.����t1��X����W��)��:��3�����3R������Ge� ^��)�Vi����?��[��P�7�13���#��l>�j����iEOX�1�H5�����:j���;���j��SjSk����6�g��]�D������|-��Q*�5c+=L)%2���`���|��MT���k�_�����Z�C�K�l�8�{�@'�
�����Ee�N�F�������i������Z�l���I
��B�(Y�6�aH2n������dY�G���������O�;��F��D�uV[@�/��%���
<6P8��ja1P'����G��Y�#z���F����5�d��I<B�w�w��-��C-!�hs�������vx��g��������,m��z�@�6��;��������k����@����RF�2h,�����(�����O-[M7���+��@a�o=xj^k��({�/���O�WT��!�tzXE�cCs�m����[y��:����	CZ�[g�:��f~��'p�o�{�^L�6v,MK�1�:����F������rp��
���_K��e����:��R5��W��|���z��V���K����o��)q����Y���������(k�����
����#���s��D"����\*���X�L+�j==��t_=pe[o��K�����u/��hN�U�`���n��s>1��F}ao���u�rCCZ�b�{���zW�����v=�����I^���aV�p@
Yi���4��N����W�d����a�������V������4�/�������Q+�!f������M�O���A��_o��3�4��J2�`�a���������a�Bq��Fs����W5�����405z'g�}�������-�
����'����}m��%55���l�p;�=;��A����=;���,�6T���K��a�[w�]Y#/,B1x�������~
����/��TE�ge�Pi�(��i�^��^�9��x>��Lz!��v�m�P�O �Q}BMu�������X�4Ef��~H3�����,�"2KW��Fg���Jm��56F��
��ZSb��v��S�����U�=Q����)�a/�
t�t��k������A����S��S�Ig��W������)���6m��f*aMOv����$�o	wYM�
�w�j[��{z� ����+�VSl�J/-A�/����H|���_%���8���G��y�C��P�+���T�c�L�-��a���2mh��������2+ g����
�>_3V�t��R�6`���(_[_�M�Y6�i	e�;\g��k-��$�o����a�e����6�����R�<39!��s��Rm(1�U���SC1����R��{�������N}���wvOb�A�����:��N�t�����H�q���CvUS�s�� ���i	�8����_e�]v&��������F��:�ok�f=��FNv��F���(�L':��HE$�u;�LX��V�>+��a�@������Z��!D�e���7��5s$�[����C��o?��iC�:(W��8���EdtwY�;��Z{FV�Z8e/���W=����"a�W�H~�eD.XS���z#D�J��W/]�gR�^s�]�&����������w�R'z#%�z3���9��P`��RJ�}u�J�rT>���F�����8��WvF
�z���~z0�+�{��&
�I4;x�����'��Tb��+�q�p��vI>+���F�C��)V�~�q�*��`-�Nm�5�_�v��sPW���EY�ug�8��]U�_2�R�N�D��`����FGa��u���H�c�������oM��V�,��NG��2����mUo��������^3�:��[Q���m3���
9��:RM��k�5ee�vH�����b���
:�Q*V��6���&�-�be�����5a}�SW�����A�s)~[��C��G�T(�����b��s��?+�2�;�P6\��N��?���<�>>E;����S��5U;����.OY�v�
�/�
���D|�v����L�jh�/hJB�8�(�8�D�C-a�x�����KT�x����x�Y*�k��]�1!9M�`�E����<�)$
�te������jv(�hg#Cg5H#�V�
Ql�6�F�z����^un��)�Rx2[.��T)�
l(���Om)��_��/���+s�E+T��%(��R��wHe�(�����C.���x�C
�B�T�Q�B�v8�|�Rmh:��l5cjNmU�� ����g����l~���1F������G�u���-�Nm���-X�������S�g�&�AjkS�)����|ce���1M=�����m�'P�N�V-5=���L���UR���7���R���2��	Th���C�o���5����v�p����%N�}����aF��
�z/�	m4<��ym)�1'l6��.���M6xhMQ���������V1�U�x	u��F�geJ

*�Q�u���2��Y�U����y���I��Im�Sw�Z���I#cXBr��&�?�I��NV�P���*�>�c���2q�E��3���!U>��TL��	+S��2�>���jE7VfXQ�xf��`���L�����a����R��t��u���R��9��N.��)��+U� 9Us��j���!W���xf
�5h�P��#2N���iU�=)��������\I�����\�'��u�f
�������P[��?rz2�rzu�m���,Z��x�T9�YyY/���������2���"���bo�v4%g_����k�S;�
-6���d�zt�B�K��/q��>s���(wm3s�6���X
v(�����e���A�w���
(�zV���������X�����<o8�:�����r�[�H;�]zg�<��o�RFz�Cq����L�?4��MzV�����^�P�e�\�>��X����@�����o�J��,���*�q�YXy'�pPk�����;H����d��;1F��j�3��8���N�=JQ�RA�]5]��U��t��� �IJJw�f�����8��l�xO�VK�������;�+�&�S�MT
���&p�3X����3xf�C�>~��!u��>v�L*� s
���F7�W�g�������
�I��g����B�C)	�����-3�MM��h|���l��|��^�����'��gx(H�D<�|j���2�NP7��mf
�uj��~�8u������r�WF^�Q�#1��>te�C����������+�U��ge
ZQ+v-*a]�6��#Q^Hs��� ���E�������/��**���K�g�ae�!#V��^�`=p��F���Ok=p��s�U��{���>�L�U�{V�
% �&>�_%p�3(�?9A��2����^p�6w({���k����M��{	���j<<�N�3Uo�=3�M���N�#DF��J���>�O�������m��gQ�a"G�C���@���n���S���gj�a������_�Nk���@�������=p��������L=�98����te
z6`��2�\�~H���>A�k��8�
����jC��u{}����W	���Db���PO>���k��^:����S;f�9��r����nD�f���~|�z�Q�e����W�7���C��z���4�sn;��28hm�n�1��^VNm�^�
>��<���q����(
�Sd��i!G:o���a��~p#]
����X��N�� ���y�r?��Rdw��������~�.� ��p������_�f����E�(�bP��0�<n��o�kch�Zb�Wd�D))����~e@z��RQ5��W�9���6����U���Z����h���8��VoTp�$Z�U����t>fg��Z=�=�c���R��F��@���!K�����W��8#U��K��PU��E�_sY������V��������-�C�����4��7�K���5�f��8��Z}�St�~��m��c���zZr��U	n��R�����Ig���.4��ej!�'���r��z4x����Ge;e��3�?|�V��L��{��4�B����VO�/D(����;I�z��M������������~5��h��W?1�K�*�xf���A��9�T�z��
G|�PC��:�j['H�^���M���3,,nYY`�;��5��\5=��9j����\J��WO����u%�o�sW�Iek8��ce��
���%\=�\�{��������Bl=�5p�����
��P+������rSHw(�?*�Q�Jp�)����r��B�=�?����q
��G��*�E��L�RF��O�pue�~��>�W/H4�v�R�X�!�����#i�������z;��U����������S*�(������}<��������u����W���v��w�}Q�0�n]��'����y[�PJ4*��y��Q
@�����P��{�6����kz�������u����
#�?����k*If��f�{��v�����$T$�h�)��^�Tl�����pue��[l�{&\��pX�����H��t��
oA�����8���//�6�?8]��5h )��_���>m)�1x#U�W����0z��X,QKhiC�`uQ$o�RVlc�<+o{���wr����fL���;{k�:���BjV�b�����jA����	������b�b�#�������S��CF:��uX|��5F��$�z06Q�����u�����&\����b��yS���������������/r������/�
������E9����?d/A���y����X<e��*"9R���8�Y�3���t$������,������7��L���'KlU��/i��#Gz�N�L�s���$G�W�9����a	#s���m�z��v�d����+h��V��M[�4t�o=+V���Q�"���h�*�:A.��+b+����G��!E�=e	T�p�69v6V�������<�K�l���."�+�A�r�y��^{p%���;R�cRH�%��x��4iP�DOY��i~��5<3�?T$�G;��3����o)!��7�?�n������3[F|�7T�AT.���CIE�������������?��F�T
�I�~(q��1�f��]]�����9��8u��ZA�)�/p�'��X��;�cen���r�`Qg�L5Z-:t_:�c�
���p�����G�d����~�%j��r��)��Bh^�,y=+9b����))��s
0���~���UW������������j���WF��J��]�L���`_e�3�`x����iC�#����y���bl�+cL�zo����������"�hm�����.��A��>Mm(p��vlZ%���O���8u�6OLZ�����mn�}�����2�#qj3�G*�Q
~g��K������F�K�����H7~��z�z��RP.(�z;��]ixf��GIl���S[N[��	}&���;dI��fN}�fF7t����W�����]��~�=�fao������3� �aW�c��q(}��M�jJ^6<��mo5�2C
@�e���N����r82(+1�����\���~�����I�:li���,K��!����-��8�yI��$i��L������Cg�� ��$�z��<��Z���:��S��;��m�iE�U8���N�����JqX����`�saS���8���#o�D�Nv��&6�UR�ck=�����X������!��1��L�SbU���,<3�!J�v�`����
z��lh�n*;����������"X�=�nk"�YG#�����������7R�xD�z����������_}B��W���������d{�2���~hZ27�<3����+��CY���VT1u�}�0�N��R�0�K(*�c�S{�_kt�&G+362Z/���S�m�L6�o���l)��'�>F�����S�Wa(�g�8�Z��W�,v(e�^^x|�h�R�c��Y��O�zo��Q��)�fW��1�B��Y����������8�]���3���ta������QC�����Ge���VH����$�����r����_��mL84��-	�>�SN�"K��GT8u��Q�i<D�����6����^I��_y�oq"����\P�,�0�y��P���
Z�,�?�~�?d���+���M�g��S��r���� �)��^��({�R������hG���}4�qo���f<��s���������7�
V�Ia����|����������^��})���+�e,hPt�����t]I�7<3��^���<!p���i��1p�,,p����J�`tZC�8u7���?C�'�,p�q4z���(Mf�S7�[G\�����L�7J*��'��&N����D��0,q�91swSdY�������hFZ��������W���/94
"o<3l��u�5;�J��uJA.��[�W����j`!���3�?|�J1GB�y�?Z�P���6K��F������������OH���''�I��)����9��VD��S�����-n�������,��Ll����z���ns��&+��AEc��|j��Qm�w�2���-�K+�3p������5
-x����S�S��W�`V������Lo�(�|f��.9��O�+S-a&N��M�&���6T��&G��)�)���S[�7�������w�'�
���-����'�;���%����f�#��\�2����[*��)��O�*iP7~g��y�C�����CcvZ[G��N����FN~-����_(��1�L>���2U*g^����)��6��Z�0+�[}BU�����P�������U�nK;�B�.��~(�?c0~��\���.����Dy4s�)��=3W�2�m�C�xE5����
�k_�������+�����C3�?��/�1�tf�x���O��T������>d�����O{aR�u��N}<��3�*�6m��;��c&���N<Df��3�=�b52D�6p��|�n�_�]���v��y>���H|�����9����i��x�L5D��#��;�����S��:���5�s��Qs5���6x���m*��A�Rs���S��A���#�������fM��9�.+w|�e3pj�
��eZ����������7��1u�����Iz�"��hTm0�T�yoP����]�	��:��@ybJt3<3�N@�P=S�z��-��W#
����+�����R1������fg��TS!3�_2���W���RM3�����-Z�;�Q��D/�&��N�Q��(����3�j��
���Hu��iq/��
6����	A4�����q��~v<�W�T���"F���R�.���5s�����������1&|X2���&G��\F5�\N��P���n/7����;S��5�7�5<3*f�\���x&���������{X|*8�B��{.��� ��X.�����u��������f�z�,�} ���
���.ED��T?�M��W����%��gFDd<e�@��o�Y�>,��3����PLR�3���e��z����Q���`U�lp�@��`hZ]6�����������NF�k$l����H�_
'���,��y*���y<3���@GR�d��T?^4���R=}�:�S�g%�������^}��)T}�S���j�H������,�H��a_���S�
�����e*d�]�6t�5t��z�J����C��Lz�|U�����YS6�W�T�w+�g]��o����,	]���	�fMy]]����
 ~/M�m�����S9KWR=1����5�Y�T?�7k�k�@���)��0����~�-u�
VF<�y���.=+9Pqk����_u�VT��s�q��lX�u�
����Hu9�R{B�6U�����_������"{*�|���G�	�>��T��TUi������6TQa�c]��_q�P��J�WpS����7I��������L�svhG�4<3����%xz�hZ���4�X�n|�R�cw�����:�NW ����c�������y�������u�?�F4����R��E���� �*��_+��z������(��m+�?^��@�P������
�����T�p%,��3zj}f*�����'uJ�n/����D�j��T����9��J���_lV����@��v��Fv�L���md2boC����L���Hu[}�7�!�@�T8D<��_�z�R�����	
����H�r���_�q���@��Pwb��ae�Xm(U�@�7	�zy&��&�u�;���������3r�����(~�3e7�T�p������*����+�gV��W�����})@���ve-�������k����������[�=x�=�6������G�f�e�1��g��E9���z�t�X�S{�+�H�z��)����14WQ�A��J��At������
�z��-��7A��
�zyB����".�T�h����#��^�B�X��c�o����K
��Y+�PC�^�QV��Z��n��,QM��������b���^�L06��`]���h�T�q�m�1�*��Z������d( /�z��Qo����J���R�o��%N���!5����nC3����'��N�s��~��,�{��X��3q;N�d���2�V��k���jz�c��
�zm���a�_�P}&OmH�A+p������8�w���� �w��+r�B�Y@~Gk*+�B�yi�����
�zr�1�P['���}���������3���d�B����l h8e��Q9��0��z�J����1�Y������2*�E-aNmu���^y���	;p��isdP���v��
,��jv*T������]����o^�����i�������6D�i�eP��������N�'w�����lf:�C�S�e��WU����W�:����]�+���|v]yU���=��D��P]:UZ�������}�'p
A���x��
�\���;�����N��}#�2�"�5���SB:�G�U���<(��,z�c\�����^���m��Fu9C�IEm������8u�����v*��~����~����������R����<��0�*x6��z>������*�=����C�#��7��S���%�i�����
5� `�U��X	5 ���Sw�zvtv,��v���-���F�_%q�I,����/������
�P��^��C��,pj{"h4�V�v2��`�ak����s����������p>���x�P�v2R�n��G{)2R1X*T�T�pC�l�T��a���a�M�6�?�n����������������>5�<ez;��G%E��	P����Ww��w"6	�z7��po���v���5��	f�u����K�*�����n��|V�w_��`��k�ryv*����
}E�����`zj{���T�P|��I ����������M��b���������q��$N�7n�O�E�����^
��Le��'�z*�k�����^�>xf�����.��*kd��G��w7�W;�7s����So��n��mae��u�-5�������Q�|�a���w����(�m�������*&#�T�0����6s��.���L+�l'��yO�a�>�;���Q���>5Nh{{k��D�
@�v����C����6�MR���`�x����z�;�=��kN=\Ob0��f+�T��'cjTNw*�B��}�;�?����)>�0R�cs�f(�/��Pd��*�����a���/���m1<3U�UT�b�T��E��w����W���}�Lo�U����U�A�P��YI��B��EO�X��zo��o�?���8���k�D?��6')���NC
}!N������S�k�����XSq������;��N����������]�U����rR�z�[����������*�Ms��
�����(�u��
��r%'�T�m*�Nj�R�P��'�?�_����k�T�pF����L����WW8��jL}�~�<�����I��������L��N*T������SxR�������R��I>�'����;9I�?��Ku�N��_\��ni�\O��}
�����N���	��W�d2����Mu��64��<w�����8��
R���������>V,���N}���h4)�����?��>��O������*j��P}^*VT�h���G]<�`]����+���P�F��I���.�*�:V�:�[�j�v���]�S�*�g����>9�Zw��nszr?T����b��N��$����~�O���}���������(��q��^���M4�(��^iE���t+�V�;#��9q�FD��T��N���B�I�j��O*x�+(O-�����x�s�������Xs=�S[��e��l(�?��?=�E����N*8� ��u
�����\u�VE���s�?�����}��w��������EL��_���^z3>9I�C��DM�$N��tO���2�2���l�R�~����RC��j�	��x�L5}^g+�.������R����>�'������'�O�����k����2�N����6+��,���SS��p��/p��W,=������������*Do�2p���$��9G������r�����U�33�o��k�R��L�C�[���S?����"~�(�	��Ia!X�L�*�{R��1`�����]��rh�x��eo��g����S�cq�7�����'u?'FR��YYs5�[*2���'u?\1DO���Sk�'u?�D�����g'N�sTd���`�l8�S[�8���������.��[KW��gFo�T�+���n#2��yM\x5j}�#�����g���<B���i��2pj������v���I>ue�1��a
�I>uY`��,����3�ih@'����g#���S�2�����`�F�����7����^�8�t�Z'�����2��U�#a�l�������&��yR��]���k�mS��Q������'u?Zy��'�J�~����uz0<s|X���`�R�5:C���5u?�k��������^��|K������5�	[���x(qj���l�){�d��S��7���R+���9���4��n[I�������i
+C�����s�b��X����Q���Q
��J�(US�����Y���,M��M���������T���g�v�j��3QQ<x��l8P�!�@��VR���N0���_%u?�#�=mC�J���9�X�����C�y��Y9r%�c^=��UR��;�����������C_e>�V3�����~����25�n%p����,���r:9��X�`���
��x�L��o8�*|Q3������8e��CKg������1�D����u��o��MW�
9���v�w��m�~��t��t�g��x������y&����n�J�!F5o?�6�8u�U�4N|���}�PX�N�����+��C=m�wY�����X����m����?r��C9�� ��`������C�.1F����JNR,�`��*��8���4�
�����u�|�<aG�uo�O��7n�E!^s����(�I�[�,U�k%p�c������P��g��h�.X�;S)a���;n����st�����P���������^���0V���aCK��X	�8�Y��1�J,���v�G��,�`�~M���)�m���S
'*
q|�����%pb����k��r�x��@u}�*p=
�B��H���t��T�&�Fu�x�'+���>K���j�L�Q��j��U#knRag�&�����	p�5\�z�����.xf�D�|9��
�|:����������:9h~+�U�KKuZ���_e�5k��,�`i�Q��T��S:����p����o<Kw,�����z����3r��C	qn����
2z��J����kQC�!�':�N; ks���h������JUSr�o|�����`�u9A��u���
�F�����������'u�)�xv}a����,
���s:��"7��.�;9T��Y����
��+n����2�U-�T����Uc�,H�����F���BE~�v���P/����q�Q0�
�n����;+�(bi��&t��w�U��j�4�a�2��sm����F�|�-�4r��:�����oe�s��]��cF��E�q�
�c�Y,�HOz&�������>��59�Oe�����5�V�������M��t�%��Oz�Zi��\Lo������=��9U��Hv�������k`e"Hg9����������9t�hAU�~����aw�Z~�f�GjR���/tE2I���j���"�L6 6�D�g�5�����s�����i�����[
@��L��a�U�M
D������[A������h��1o�i]���&k�i@zU������D8�vk�j�5����Sn<5�On�mz������
4�
������zC�B���}�a��m�I-	G
��������[���N9�f�����Qem7\Z�07�z�,��T���a3���v-������W�~[Z�����:����\�R,�O��[�/
��hUm�g���Td��a0�jK��J��r������/������m�����K?w�:H�T-���~�h��V��VGw�)����_���3`Z���&�D�y���(�%k��kOc��F�
H'U�3��19*��*>k� ]�TX���q��%������C�������1+.����1ey������k��diI���������~���TWK��������X
��9��F��6����(����&�����2IH��N���O��!G?1���^7��7��`����k����a@��4e�Ss"c��28��<��O��i�q��WjrG([�&��^a���*Q��i�n�|/�����4U���akB������z��������i�&����f~��,]�R���
a-T����7q�pu45��h��BhP�M���������R�z+!n`x���x�M��s6��g�/������^J������w���}4����zMi�<Oa��+y�)
�6�U����������DN���dx���>
"q���` ���b�V������7~k:�a�'N��}��w�9G��<5�%#�FBA�S_
5]C1���|?�������]�{O ���0�"�V^x;J�
���l��2%O��X ��Q����U�	 "�E��"m
�S}�NS2�Em�E`������f�/�f�EdjJ��aS��{7���cN)����|n�������_����\�� ��+@�����\:�!��&��Cs� ���
�
�sj5 ����r�9��}!��}�����qa�	{;��e Y�H/��Io!�Q����A0���������{UI��w�-@��;R�c����@e+iI�P�0��.����rd�d�V�
7~m�K�-uG�t)��1D+�-����x��.�xiv��,���~�'"^�+���NRU��H��-�r��F�_���Kk	{��zaT��V�[���fu��1-
V���K%�� �v�]�.�SCE&�W����}���5>�a���4t�]L���S�����s�ZK���]B���r�u��O�������
 @*��3����[m��E�_j����i�ljQS
�����	M�L������n>OJ����Hk�1Q��k�OK��y���9K�R������k]�%��u�8�SlZk�1
&��\���\#���c�����>�
����U���]���*tJg*q�%�]���J���~��Q��6��Y��R7���5�/\��Q.��,�D�_5P��}�v1��2p��>��P�"���j	z�����Y�4��l2��q�C4��yVF�I���'�
���g�p�l)����� ��,W.8����m6�r�&S-�::-\�@
[�r��[�)B�Y��$�� o1$�����������\@n#����UI������FG	��\Y�S�%�����7^�K�x�N�T+|��C"��q"����~��=$��_�N�+Pl��C������.�M���x�I��K3Pz��`��Vu�E��� ����m�HixP�G�_AK����{����T�3�K�l(��_��[�����o]
(���[�����A����L[z�p=�:o��P�������7"��)�2�#�T��%��3�Y_1@�OK����"^e}�ciz&���S��$�	yw��5#�t/0����/
a���T��G�@/Up����[�sQw��(���S�d�;������WA[�J���s��9�x�G�.v��g	�{,uIe���~����v��\Q]�4�mN;��aI���3����.��������s�^]�j}�s�BF�������
��^0�Z�������t]��V�0��>1i���XRA���!��n��
�/}���HSA�+/�B��U.i����\�y��O������,H1���;����b��~M�R��{Orw�K;Tg�{U�9y8G��?���+GeLD����m����Uq���5m�g(��]p�R�ix���xX�~���
l{{V�3t@fa��$�
@]�wv6<���;&8�L<��G+�=�I<�gPo�
��T�v�%Mb8m[�Id�4��*$?V]]�������t]9r%�r�W�|/iCD�������:���Nz*h���}U�������������2;qVN��h9����H~�/����T���6Ov�����~���D
c�m���[��
)���}|.���8����aC�M�}����R��-��>���
r5N���=oj	�e��z���o�Z)`��u-����Ka�o�	+����_�\,�������
~����=)xf4O�!�G�������l"��:�{��>'(�������=�87H�����K��%�kE��������~q������������8�%O���f��.���I:Y��b�Q�S��5���b��.��w/y���������kI�P�Z*��do�K�L���@v�TZ��np|��I�7���uzN}��~�8���y�����o�����������b�)���>K��lVt�V6�i\���2�bsuk���2k�h\���~�b@�@*���>n��W}��#rH4��Pv}�MM=6�f?/���js�6} ��-������k��^-����v�i�SDkj�������9&URt�.���;��R�C?�E�}.�������n)��|���\�2�~�Z�)R��n�s���������NdS�k*�l����rw�\�����wuZ���~���j���aI Q��Tb��������>�M�eW�.�R�6)��?����S���oMf[+�Kq�V,��%���A!g��c`�v��KA���.� �����_M�����b�����$y�C���a��%y�:.c���=����/a������y=�'y��`��Q�lJ������G!�;|;'�+�G�v8���\�QE`���~u�Q�	�c�������k�MO��:|��_St=eK���L:�4#����rQ}�h~�`�4O�_�;��Q.�lB����Fq�tH�������W�d�Q4��
8�do�I�1�{���j���i�8q@�5��p������=���8���0���~�����8ySa�-��W��J�YS���Q��Koe��tLk����t���o�+S�|�oM��w�v�M]���y���[���iM�>�5�L
��D�����K�8�H���S����X�be4�,UD�^#�c��(II��G�����Q.���%�h������-�l�Tb4,����e�_�NF���
 p(�=���*����o2n]�M}K#3S����m�[Q���T���,��]Jb#��?�J��}8p�G��@	e��6G�H����:d��`���&��I�PL`����P��'�#���`�_5�8mm�~/��J����58�I��)�3.y�����im$y�gf��X������������Vx����H/Sjx����
��1p��%o���G�|����b_�9
���+��]����/y����k��~��4�m�YP�hUk${���K�2���z���
6�44�m��(_f
/���������]Ja��C���&Rf�H�v�AB���e����!��s��W{�l<#[Lc������6w��ZRi�o�0n��K-�*AZd	{��4��7��<a�����K�WM����.r�b�bM���m�)[w����J�G�(f$��\���&������j@�7�	_��E(����4J��%������wM��z����C��&!��hj{o3�����k�i�=��}NM�
�V���c���R��{��V����D�}���zMiQ|�����W����$��K���Y�_�	�����j�l���\�(�5�n� ��o-�������C<�T���w;$������3�~>O�Z'����:�H�_�d���k�:0�q���h(�:I�v�f�P~�X�!4A�B4�X�6%g�5���Z��|����T���8�d��|#��]�y������D
o���!�YW���]���
�{����a�@w%K��������~�C��(:�G���������~�'������c���������_j�����R���_zI����F3�G��j/�s����w}���i��ZRN@�^3*�4��1*�C���%8r^���88�{�T��}Qz4���{�q���Z����oodV�����#P�����b*�2/W���N4����V�G�����T���
�so8���#Y,�7�+�x�x�/�h����nf�W)��M�#A�)�[����K��O�m�0���������K�R��I�n�.�����*�VS�1����<8���lb��	���En����`���]�6�'g����v�I3���X����1���S!��X�������{��5TAXz#���p���������N����w�H�X��[���R���}�X�D�{}�����d�����*v8Q�^HJ�L,,M�����0������si����c���fy~R��.M���.��2L��w��
^���h����4a���ju��p��a~��}��d�������=�S,�S]AC��/�s=�� bj�i���6�?z���j��kP�	W�v1����4����w�*����H�.�������jlW��O-�"v���I��9o�L��9��+��%@l)�Xz�k��,:�~�K�����1���������0=q�����M$��;�����)����}�$~i�����~��J��ww���%�f�Q���I��=j�pHK����`�#ck���~��Wk5�y>3����c|A�^;��q;��I:}����l�?|+;�R�����L�:o�m��t��p������p�����Q��]��gXk���1��B��6UL��W�����Y�(e���>pL����I����������%�����LUR����q�	������#3��5�&���������^.V�s��:�U`�0=4�'�,'��[`�c(����������+)������������|�5G��\�����fzS|�����%���)W2)bW�m%���\	��_Omx�P	���*�������-���\�
��AK���5d��� �-Y���dTq��U`p��t���2Z�Cl�w-������6������3���{NU�hH�+��U+�N����F�G��O ������D]���q7��=	��[7�s- ��9��
$�-dY ��;�de��k������Yh}x��O.�d��F�;���:b%�`	9�r�R(c�#��ZT�)�1���~.C��������^��������������	�q�b�S�
@[C*��:��E>)����\��9�h�l�3�Uc���VRB�C�$|����z[`J��T+y�^�>8��j��g�	�%-�I?��,��D���mC��7�e�i�k������w�f���� ���������������@{;�[�:
]c�x���k����+S5�
�w���
�j�*RQFp�msV C��������������fL3�����C�����0��d��94&�������$���+o=�s�����2�AI{h1���9�u%")��g��5�|�
o�����zSx������=p;��Y����7�����j ��g����hA�6R�1�2l���W�����l"���������A����Usk=z��_mX���E���������9F3�g'��oD��`��`��������3��4�65��M��� �&��$�����_����z�^�P6<u�Lgz���_W��=�����
%id�3���u����f||@������	���
�����R����D�{Q���i�T���:�kSuPOY*�� 9+��8,��
m0�������%}��-��]����z���W��92�~�H����9����S�YG����[����]K�U����}��r:V'9������9CU�	�4J-+����^P5a��s�CrA���3`�����o��7�>�y����)���Z���\�5[�Z��iT"����K����2�;�'����&�K~![~�� ����!���#������p�6dW�_�9�*�};��#&�������6TaC($���!oU�Eg��g���l|��
����z�W�l�@v53�������iXw��s�]v@@?����To�|��t��3���l��h�\�<3�H�>�]V.���U��Ms{[�`:9�>F*�eq��S�u1r��9f����C���
�2��SW�Q�"��?9�>��)�������>��?�F�Y	���Bm$*d����8oGWR��m�����6�we�����&������8��'���i���iCl~W���S��Q*����OU��V����wF��L��#69eq|���W�^|�I����y�����|���!-��xP3p������r���������?P\ST3p�s^���o��>��]
a;c���!����/�Ua
��s�i�2�6~h�f+���4 '���_�@���)5�d`���
�v�~�U�e�^����s��]
k��	�������aT���z%�T���%�m����\;��U�&���R������6�w)��;V�9A��u��a��"�R�G��4(��Se�J`+�U�Ly4����H{)�R�Y=�J�u�C:��t+���uuZ;0��Y�J�ulx0pf��Q\wd��&Y��|���F@����+���W�E^�XZZ��.����#%��8�M?T�!���]\�A�.~5��_��?�>N/����Uenw��.&�yF+������.4C���;�B4x(g���b��) �������A#4��w,��d��1���+�m�����4dY���)vi�x
��jG�n)ty"�XI�.����|f�9[�^p/=��[J�)	�e�t2�G�v��'D��l�����W�U��<5��p�����#�b���1�^w%��+��'
��&MKV�<�p�����F���_-p���i��������*���W?���O�T'1���mH
0�Z��2
ii�����Xz�B��=��m]���:o�\_,M��B�T�U68���������p_\��MQ��o�V�u���#���	�Cj��F���w���-��g��n�Cj�����Wj�0�x�+_m]���t�.0��{��cp4qS%�];4xg&���s=b�.�NX���EkQ��������KH�E��#���v�
l�Z���~��������v��T�������\���#>�7
�8�[I���5o�m�]���]6m��ai��A�5~*�U�veg�b,l��6���a�
vaN5���t('������7Bm72�.�@���e�d>k
��T���V�������U�v�=,�SR�v]��N���ym�ZWi�WT���	�*m�W\H}X��J�������\�m���������u��K�\#r������X��6a<��P��@�(���.��E&d=
�DC�|���z��&��@�
�����{������[���+��a:��c���R��z#��������iI��f�JL�
���r
�T]Wh���_`)��R��R���~�5}��Wh{2sd'7����6�3P�F�
m����Aa�����3r���
���#�d��bM���k��+�����H�1F_�JM�['i]�,��m}]zv���9�s��;T^@�gC4�����	��m��������)c4$q�
����%(>�4-�YM*�
��f������@���^S�
mOr�n���K�Z����p��/����Mz9����b�\���&;1��_~�z�z�vb����]��v<5���si`)~������]�Jg"��J����d�|��{'W��5b����Q>9��1 ���f��~dtU�����+o�4_Mx�a{S�������+��Uq���������J��]��=����7x�	%���	z�CQ�N;E�w�pL�M��	K�1���J�2��w���wd��t��8n��~L��0M�D:�����@/���c�X����R��+�4e�6�(�~9&�|IH(����L��^�<sh������V��;�;��+��[J�>=n����N���Z_a6�N����w��`t'�m��b��'�7B%�p�m��E�c;���f{�')j�Sp���=�`���f��H)K���*\���|[���cu���ilq[��C��;��u^��(��1�;���M��R����]�DQ���
��`���l�L\)���o������`C����f�* ��'|��B��.�����
�����L�Cu���%�+;�];5oh��]
{4VdHE�B�.�	z�E>����(�|a���~�&��7LS���r�{�����>5Y.u7�No��h*i�dq��,��D��k��\�Vp�i�\�B�_
������\PM�X�]�E��\����Z�,%&�^}���k���J�J�~+�#��:�0pp�VG�G`/��Z������i���~����? [,���1���Q�h��yV G���30���������8vM G~���F����-��n���f�������[fOUF�W�r���\I�"�Qw�6���31�B�o��[�E��K&y������(���X�a��LW��)B2��P@�W G�e�3!�kZ�
��c|��x&��@��Ha%��������?s������0���g�m��@�X��,��ST�
�Jl��l��12�#�\�\�zrd�#X�$���brdi��
Hi�&�1�@
���A�������r�#u�M�
��@�XJ�B�yP�n]�^��b�o�����3"V*�U�}r�w�e�:�[
�A�����f���k��
�������`�mr��_(�GV��
;4mchi�r:���A��~WB<����6:�W�����Hf��=X6���YH��,�PGl�H�ms6�P�"�	H���^�_�6�����L/�.l�Y:��u-{���^�-��C2��72Bqs��dw�
�@��Q�z���[������	�{`d��yAvhou��XC��n��[���5de�U�'��4Egr��4�=�����@�T(�*6zZ�@�4FL�&
e���kT`�c
����mI�
`���t�}��'�
�]�zR��e���U�W�R������b;�#vkW,�PY��HhcQ�"�I�;�#W9�BIR�x���OW�I�~�A� �c�O�
�Hj�{4��]��M
X3��u����C�+�����UGFy�d7xR�������J�����~V�W�9�'�OZ���Y��\P���y�5�&����9�W����^�l���7���6#=�(1����&���H���U��^�;l�dA0l����p.�M4�}�H�����x�&�$�"��	N�9��;Yi�����
ch��9�����5�nB��CJ����M�B<���5-����E���]C�i����r�#}
�t@��[��*���]d�,@�xf Gx"M�?o��)����3;�#���&Zd��gz��]��w���H�lB�IGr��k�w���.����5����\�nC����G /���Z���myU
�-�{l�wc�� K�������t�����R��v������;��/b�\���B�6����T��n��������T G�������;�#�l���Y��!� Gz��F�w G~�}�W�Y��������I�6�rds�1�v��Qpt	
����";�$U� Kz�V����d�t�e���CO}vb�(��!�;bJ��T3�\����y7�'��K�F�a�*�<8�1�P���������qX��/����f	�E��7]���{��
�j����i=IU`N�v�����<�x�\a�&"������X@S��&��a�Z�\*q����]��r� PxV5n��;�Y9P�[�rwX�Mk�_����g$2������-A�W��J����L��bnc
u�h�C>a�Z��W�zF�e���j��	����mc�{�S}r�]j�����g�{�I��nEq��k^M�Vw��,��!w�O�_}��0��#}
u���5�����5�������)�W�p� ���DZ�t������o3�~�[G�HpT\��2��~�H�C�����0jr�zO�F^79A���?*����;������m1�el/�����D�q��W\�n5����
$�3�Kg���P����������N�M�;ad�e�x���/z�=���zD�)�s=s��y5��u�j����W \j�:2��������k��b���z���-��X'6�	�jMa$�*�Y�����\��}T�	��`��0D�?�STt��\���G!�OG�s=+�\d=���\��2�����
Bv��7�1'���������������umHS|<E`=#=����a�-�.
��g���!J���$r\C����7�j���}�]W���
�h�r0�aX�nP��
�����zK��!�$@_�{��Hb�Sg�*N�C��*'��d�d(J�_]
�4Ay6z�$h�����0m`G,O��<3|M���8d�]��z+��)t��)��Yt���}0`a�����u�g�A�.]���t�n=������^j]��]��6k��������r�7�������T��_!G�N2\�~���,i�[�h?�QAq#�����{e���n�����Q#�����i���:�~��[Sd[?&�O%�id[?��za�x��X%B%;�G�q\����F�GIu5��]!>ZluyG���vOL�/IO����U��!�j!`O���^�B�����c��+�u�x�����;~��p����
�t��AIS���n��{�#<VTDv�F�vE�U`x����
{��F��0t��6�,���_&��[�lL�a��nW������
���J������
��q�����}#��:��YI�vE����W��S��
g�����{4��P�k�6�����+�������\�u<�v�����C�YP�-B��M�&_nY�9
�H���z�H��C��-��F���.y�&�K$-^/*�����md�7��?=����]q�n���c
F;�<����k����~]�������2��C����FIyU�%����t�|���C�A����n�����'R J���T�
h���{���f�+��B���p��r���x���a���E��js�D��j�wj=��Y?��@��F��z����>k�_V������H�~\i<��H�3�SE��<RU��&�<#�h�de��k�zf��a�d�����Y����~�-�(����_�6^��e�u����;���y���=��R�/�A�O=.R-�2g�e�#��eYx������1JL�eA�!G$]*<�����g�F��H�QT�34����(V�.CO�5A�
�nj�]�`�Y��������h�I�"b�>t�����sV����u�Kq>��A��]~�>4���j��
�����}e�7F������6����p�G=����j<5r�]��Ee�����;v���c6�fuj�=<��������0G��&����|�u�#�t����j���uOG������I�398"=-�H�<��#b=i��`��������J#�`!�yWuu�d"N�,rl�;�l���qG��W�TDr��HG����f|H����A��b�s�*j]�C��Z\�����t����U��=/�z,�	Sh���F�1�'�^	[40"���	$��/�����Z�/�lwM��"a���/W��%����L����.m��(sSK	 ��H�������$e�������]�e'R*:,���X���-��8U�����8���lW0t���=k)"$���F��<r�j<���'��se+T2/�.� �$�aEU2sD����]&����m1:� �U1K��-���:����LK[c)��QD�$]�n~��,�*�@
�DDz��ph��S?���ck��#��{�k4���8�<9��i�	�Y\�g���.P
���FU�K��
~O���R �\�4Y��V��������� z�8�Cg����O���A��!Cgol�V.h��&=@�|����]��*���`�_uA���>�Lr*=#a9 bq�dX�g�yf���Np7�����vY	�emt���5���m���X	.g?n��!#��;�� 2���.Y�PD��g��o�kh�j��I����?��U&���������M���� ����{$��4�w�D�#�����u-�3���TG���JF�m��=i�;�Chf"lg�'�:���&�21�H���B����������3B��;�D����32���aLa�gd��I7qg/���>;��t���U����?AW_q	{Z�����ps�
vfK�
 0�6�D����M(�`�%�R�����9*�

�G��+�6�EM�W�+R����!@(�`�S%c
P�g��d�
U���[�f	�H�;~}��Z�n�U@�����k$���s�J��.,����+�Omv��`	��U��������M[7zM^P!���\�E�*���
�������8Q	�H��@�;�md��,~���4��,fd�{�B)�]v�|��J<�&Lv(����b���3���L�	'#�#b��CYo�%"�w^��:�L?�*=�E��X\��X	�J}�
���P#<iA���V 2]kd�N�������H�Ix.�X�6K���_�w�b�)���6^[X4`]j	��h
w��~|�x�XC��^H3�u.�X��b���g��J��d�F�,h_����i�E7����7"WV�B���o��a��Lm����Z"{03�����dk���:-(�W/�n�`��c�����k������]�����B��`���e�%�������Jtw�����&�����	zf#��j�V�,�� ��;�V=�F�&CME���V=�u�A�}�w��z��]8]��?�`�#!�q��Zu+�Q�/����L����f�g:�w
5�`�|h�0,V�MBK`�]�^�q��3�#}
M6<^D��t�z+���`�u���4{�h�a��k+��H	�|4���N�opwvD�|Z�mY��u�eP���Up��HZ������w�k���\�7����O+&�I���T���<�p��N��Bp����u��	$]H���v��SL�6!k���������TM�7k����4kt/�[�<�_�u4�[���u�5XqxyR%)�S?'o�E�x���\	����*/+�gO*�o�q��[���R��:�*�l����C��v�	P'#��m�k�������n�
�B�Z�Z:dd"R���>����h�se���I�C,�qt�}X��S��f�,���0\�n[1������� 5�7��+;[���b:2b���U��,�����Y3M}H����q�W	��xf�!"��zG����R+��`��DA"[�D�@�����j��x
t�A���M���;�C��B�EG:�m�jRz��iW���Mg��@]�	�q4��@��Mt�_����s<��c��A�@K��!�=]���
I#�j?�Z�_���bV����H�u�
����m�9pjT�������_��D�5V���j��s\��nx�5�!y'��7����D����m-��v��7��������J��%V���On��H���m�,��#�c���A�������!3;�v)h5�!��QFW���� �X��>��d�?T�G��<j��WSk���{C6E��Fmue�
T���:��4��"U(����6��
T����STUy�J
�y ����:�	_�V��G�:uu�zwf{^�0X�]������1���^/�G�������[	�;D��������]e�Ty6�6Pj�_�_W|�3��7F�{r��>���6��:�����<�m{�!��s���k�yo��5�{���X�8+�j.x��S��r���p�=]����IvvN�'�V]�� 75�3�&���k��c�1t�l��4��_��zp!����3��qj\Hf�M�LZ��zp!��[<���\H���)�M<�T��s�0����D�>����V��D��s�����0/Kz��z\F���������t�������������C�BMU
���#��I^0,.Y�}�dH��3��8��	�)m#]s��uE��YG[1��K�}�~�s������u�6�p���z��c����K��	8(����6:�R�7�fU]��F�F�=�0����u;T&������,���j�S]a�� ��NX���n��k(���|l5��������]LR����0H��	��	w�@�\�f������r�Vf��"t����L�:U�k���r]���
4��L���u��"�7��:C�\��PP(����C��:@��0�����G�]�y��r��=����k(���V?��kh�\[��W�-
�k(����U�B�M=���k�^���{�/v�`M�m�K&����C�] �k��X��������`�5��|��J�b��ih!_[���J�S��$�[�9�j�	>��>Cs��q@�Y������u�>��G���/RH&I�������I�������c!-rg\HT�7W�������Z��4����Y�`k9����
}S�bi.b�N�� hx�)b�0��8��i:L�Ah�2��i9��W\;3c�]����P�.�:����k����:��X�����:����f1 w�*���BL�<����C������h��
Ia������d�D�Os5���;_M
tX�-�[<l~9f�qz�D���e��B���B[WD�BcH�-j��{5D(����������>N��I;�T6aF��JZ�^[�Gn�b=]A�	iy]R����s�ST�Zo������V`SG�I����k���2��BfF�&�}J
ed�B����X���2zX��V�dI����\��X������v�	�Z�����b)�Z�KX�����G�8N��!�v`!+��M������(��B� N����ga�\%��6;jR�@W���/�,��j�C���������
�\���Q�gg$�/���zN����^dZ?�Z�|���{���E����}�gpM�>;�4��]#*�5��j?c"*W�L�g4�CnZ'_�Y�@�����[�-������76_����v���.�9a�A��0�Y�EE;Wv��^X�@-�l��xV�J�,��v��|��w_1��w4]\���tI{+B�������������
�{F���8~�D^������������d�V��#h����A���a�L<�U�70�clM����A����`��!�l_"�I#���B6�Y����+��{��t�����$!�#{�9+�ro**���Z�T�j��f�4���u�y���*������N���5���5L�8)rA���b�D��3Y��i�k���Z�jl����R�4{�u6�9�+������u%�����WR���$���]PG��������t%����8���>��J�l��o�Hl��D�B,[df0�JV�1�d���v�tV�5m+��3/�����b����
w��`���ZFH��D
VZ�=#3svu�V����.�R���R�-Y��a����t:	���v?��RMb^�^~{�B��I[�0WN�bz�B�^XC�5/�+d"go�b$��p�i��`����7�H���
V[�n��N_C[����k12X!&5V�'�J�fX��0�T�I����Mad�B�L�b��~O���6A6y����_?O��b��=h��
�7[��wz��Tq��Y"^��:��gIW|7���t��]?Gl�����3�2!�m��m�m�f��T�,�s���#�(�?a[��#}
ev����PjLVHe��������n!${�='�a������:��
���7x�z����u��M������R�t�����b�m�*��fz�`�Tv��V=z�B,�"kh���mV��TZ@/U���`���W+����+�*;�/�sbd�B:��Y����]�^�"� ���j�8�
Y>��}a���^���2j{�Y�.X7�����R�?��wn��Vs-�@��&.X�>$��"\Iu�-��B	mB���Yvq��I����C��fM�&�u������:8��`�\
�.�BnVHc�����V����;/�g�BZ���n������������3fh�������g(���`���g���jVV��������
�������^=��,b2i��tfQ���n4�D�g�l��j�F:k�q��]fw{����`�\>�x��q���o�fuW�WgS�Ek����5�����n�G�!�K	���u��������u
�R������,<���
d�,���+dT������n�V%�����'�zv�h���+�X�ug+[��N������=X!W�n��u��N�}R���v����j�� �`�]��������T{C��>�^����^�w(�*~/����X�]����
1�~��w������FU�o��w��b�\]��F�Z��BMS9��54�M����W�!�C�w$�\]�~n��ea���7X!�!��&��JV�������V���d���
VH_���������!���u���m���.��Y�	WCSVz��;Q�E��Z�:��HI��JZw������a���:X4h��:������3��f���T?\�^�S�3�����5!�38	-�/>�v)r����~�S����@@C��p��o��I�:��@����L�?R�!�$e��V��Q���1��V�m�N�+O^�2���`�������F�B�������`n}
�_^�L��]T?�������
�:���A��S��%��o�*����Y��:�d+y����H�?��F���N��R'y�j��������U�j�P�X3wi���c��!�]V�?��2r��nk�����cMv�����{F����4����Zw<�r�����
��.���]�+!X!f�u�����
V�d����>02Jb�~�,IT�v+��L=�6'�N�����=!%|�5�t��z#]��d�H��`��b�s�:���	�U��^	���g���vX!����{�p�zo�]������XC
7��bO����qV6�f�s��F�5DF� WG�����������6�
��������������s�`����r�t���f�U�]q#X!�����3;X!��U�a%�N=/-��[����M�/�f�����0pw����
�~'���F��sJ�u���A]�n��5C�_\�����O]
Q����B�'�_��i@o����i��'������|�U�J���������7����{�����b!:�}��T�n�`���A�N�����{��`���&�����~�B���eQ:I��ve��"��#��d�#|��
���Z��8�x_7Gy3��k�{��������4/��~���c:��u���nMy$��]�O����j���Vk�(#]��K��X+ �U3���X����O.#����{�h��\���.2}X
�p{D��Z���]�6S������q���o�g����a�
��}p��qp���<I��z��1;��L��-���F��1��BP����qp!��i���?��:�9�F��5����D~="��������,���0I�\[�.�@�9������e�
��LQ��]�~���o;��)��z���|�����N~V�?�����?�rm�80��O�C�n�F�F�H�H�[����;��O�w�F����������\/rL�LPy+����u����b�8��Pk�6qX!V����a4�7+�L�����F��a��of�����a�O��C��Y�[��Z���w�,MW�+r,�_�#�Q��$������^1����u}���AK���bB�	>C��'���0�gUR�-
��b
�~Q���1*�j��	w�Cm*$
j�u�X
�:(#l��<����EbE�����b����W��yP!���7����_��k�a1�2C�^��X��H�.la�!�,.� -X A����J���-������L����G���Eo#h�2]��[�/�6l��������d� ���b�R�|�T���1!c��E�<^�0!c["����2v/�L��;� �[�&$��fAB&�d
�����B���Ul��BL���H��1�����1�\I*�.���5�Qe��������P���
���S��j,��,��B�����C��mT^��Q�PO��T@���<=;�C����`�.U��3��`]�`h��N�9wR�� >m
{�������vO�����-��mq�?	h:���G���3������|)���D�mpB�G7�k�wzI���{$�KV�3���_l��������H"J�0�t=;�B���i�Z�9{8H���T��3��d����;��%mc<���n1k	�M;_��|��i�����������v4mvH�Z�:~�g_�5�U�_|fT�������%��z����p����\�V?43��mY]�.c���I�R�������t�����/��{4���k�����V���D������
=mr�>��_/�u�Pe2I���E$���8��;���0���������J��B*�^���K��B*���o����,����t|���t
�=#y������d�2d>�'�JG:&4U�\mt]8�\�.�Q
����tE�d����(��WV=�����/G����sh!GA�*��������7����`�L�A��C��D�������$:V��j�ZH�X����`��r=���U���e��Kt;a�>�ZvA�����2?�
�k������l��f8V�e?�i�lz����d�����3-d��F�F��Y��;���5���5d�9�H���������=�m-�l$�d�BF��%����/h!�����~X�I�BFY|&�$�;Ok���n^M�V�B���!2y��ZH�|����uh!e��L�2�)�-$�����T�&��BC��lTyl�����b���O�<���#�����)��.=����}j��P�;���+�W�*���M���*v����m5�$X]����u��
��p��k���we��J�4J�\�~&�l��s���XC���w���\�m��I��~����W���M�N<�g`5��9CA�!�4�&�w~W���c]}._?����������*����.�H��}L:���1����c�(����
Z��}���_A���|�k����rU����_%������8�
ZHj���+o���n��`�j�!�t����r�z����C�bZ.^���g�AZ�"/���&�����7��%�M��o?6V���5�X�\�����<�k�i������QQ��`x��tl��42������	^H�{h�l�*{�m4nBht-�\l���W���+?d�'��qV�B,F��������r�zY�2����3��3�
�`����m�����H��b��:��`���RX��������Sg=�2��K�h�E�a�\���O���k�c��zoo������"�KoW+h!�]�o�������9r����n��_�}j�+=�������z�O��/�Z�i�3���<5�4�k`�B����L4��u-�����`�+Aidal6s����0�l1Ku����^��&t�5��N������F�uJ������t��7������ao]���Z��|[�#�60�i�\�-�w�Yf�M0"X��#-��3���<������� V�������Ek���>P:y������n�|�-�����`J��*�1�n�n��x�����o0���C�B��%R1��!��������D�k%h�
Z���b��a�,��Z#_PX��,h�N=��*��@���3��PT�m���U����N]j5��t���B���o�D�>tv>5+�7b:�-���`M�"�
Z�B
k��E������XeXY���J�B6�/L����_�?#C�W��fN�y���2���N�5+�&�}�;�N�<��%T�������-��:P�N�ub}����u{h!���;92���Pw"�����:�cu.
C�-�l���Y��0rU<�w
������e��:2ZO��
U�z�o��;Q����$���������a��5��b)���U��MG�Y�L���6�U����cQT
c-�:(��8�0����4�l ��`d��-���!��v���LKj��� I���u�z��J��jv�B
�?.���+;h!�����ju�F�uY�j&����fmue�g��
��Z4!�
���v����5�!'�-���S�2$�vvgH>����d�UG%�����o�y�U}�AI��
F�5�y�i�:�I��i��-KW<�3�4���M`�����M����|���%�2�%��4]�����]m�!����:���
��A��v�z�P�Gv��u�-�1�q0�0��Pc�a0������[�?���e���"���	t�����z5MF;���qU�w-�U�^fqA0DA���t��BY�
U�z�H�C?���-�	�'�zB�T�YeB��z_���
�3h!%cn'2�?��So+����1f(�����v��}-�YR��]�� ��p3y0���O�����OPmm5��,��DF��@'z����6��j�`�G�����XU�����6���XDD~o8T��.T�2���T����"�#��~8��
���H|~t�P��`QH>��0�.T�R����*�o���W����E�Dw�mQ���f���E�kK El�w��1x�����:�6�����|b���>�>���Z�-qjs��R�7|7�F�+�{���Q�=#;?]N�p��=��D��������Z��,�?�/Z\�*$uV���X��VW*S��.����V�+���W;��J&��=}gLo����������Z��/��������t-���|�z�{����2�9��s�d�/FH�����P4��#$eD.:��t���2��������C���:d��y���}��W�0��+�?���z����E��i@����k{����S0IQyV�rK�[���c���o�9`��
�&��iv9_u����7����-j��Ci�;tk�4R)2\Q��C�n�+$�/P�[�w�k:�N�8�]������5��	m�u��52^���:9��C����:V��t���k���ELY-�������^J�3�}l�U��@����-�z0�kp�,��7�q��Ag���}�!��Sb�32��
:'��r����v`������y���l���x��_u���,���R(���K�p5�����^�{�Ec�;�Z���������g�m�^�}������E��,��4�h������h���2�;�|Q��<C�e��={G����:�M�(pT*���h���<��]��gd��Z��q���.�GZ�hA��Z�.�3��GQ�Nt�S��VH�!�b�,�p���G}V���p�*��E�x���m����o�4�4��P	��%�#��I�pcSu�z�E���4*�6b����T����n���T��e�U�*��{2�Ti����B���:��YqH!%_�'�L��:��z	�\�.c��~K�UMh�1�2v-��lH�mH�i)��
���>���t�p�Zch��pUj�9v&�NM[MK�����1�,�\�K]��`������.c�2�:�I��gzzQ�L/By���u{��c������N	�s������}+�:EG&;d�D��-K:����l�ul8b2zLQ���.$�YPB������N+��PBf�����f�PB��V�
WP�����j(��t�g��G�������x�i�_�q-XX��t����Wk
�3��Q�����r��H�F����`�c�\>j���\�X>-��)�@A��jyT�A�h1�����F�����b1B*@s��,���������bx��[�
��=���������G��x�0��Qz��lo
|K	1�n�)��f��	H��e��B�#��H�((!A�qd@B6��/�O�y�=���S@B��8MOp<�����������	�g@B�&%OT~��V��#������4�6Lxd	�Y4@_:����+��
N���r�XvT$���m�B"x
5���?##��]}e�`JW�������E��~�������,<���� �o�K�K����+d�e�#=Q�$�N�m|O/���P,Z����t�8%
�3���^Y8�DP������%�n��t�\�Kc�&:1��gzQ,b��Y�Yvv��s������>��JG��m�Z-�K���w���e���1F�z�����X��/ !�R��D�c
$�VX�L�������t�93Y7�H_C#s�X���JH���^0������y%=����-K
��J<#R��������J����	����#a5U}�.b��gr���3}
����R����%��$V��p��Wq	�o�dq�����l[�xf�f��t��1��������4�jE�JH�K����6!������T�\o�}ud$=��Ub� g(���$��8��54����*7�����=�IH���6x�����j[9 !���71G#RQ�H0��w=#��o���_�3t !�w_���	a��(��hL�ayUo<$������������� ���:�,���(�l\^�*���+�����)�=�YC�!���3c
=�B]CL�V8�j�a�K^��'�;������'�5���m�W�6�UmB���t����5t��t�D������D��W	HHfQ��@P�R��t�7Q�6H��������p$���u�	�����&HH&.�e�		y���7��c�MH��f��A�n8ybe�Z��g]x[_CI�#`PwY�� f��<Wz��AP�g#N�]�������=]�V�-�m�>%�AG�Om������;��3���9�{��3��n��f�+��������u���2n���ni����g���s@BjA|Q�U-�[����k�=|����| !.��4���{M4:s��y,�J!un]�^�q$oW��2�����x���3�?[�����v�=�F�r@B�����c@B�]+�3 !W���iH����L�w��H�`?�B��~/�B\-����#V�Q6a����S�M��v/h����Y��Y���@g|O����TEv���H������
�5=�bq���WTK���!�p
)�����R�E���$���k��b%*.�vh��bi��\�^�=F�-1*�`���`(�u���v�����6����>����lEC:�����z���L]}�S���a��u�����?����N�g@B�����i���Y�bz<G��w�z����FIr��K1G5'�%���:u����2��XC�S���f-�@=NV��������5�z�>���5Tq�M��T��,�lmi�Z�=�S/���-F���R"���]��hRo�dk���^!�T���-����T�d���d//�_�_�w��#�>u�V���!+���eD1a%�Z���c�����
��J!��%<3t��6�6#��Jq��g�������yF:h&k��sT4�u ���N]�F��e�����6���!s������
��{6����|�/����3���F�`|�^������i�%z1Z;[]���&j�\��vh�`L�e�S?�g��>�12�PB��e5��*	�+[�;���h������
�3 !��t@G%�Q���NvLf��{�j	��������Y_
����:u[l�UW���V\��@�6h��MPm��N]�*��a��������aLH2�J@Bz�-r�����hT��k�7�}�S�D<
����,kNk�Yt912���:����vh��2x/K�U\��5'��t%$�r
)��9Tm.�S�^����jW��C|����3 !�E���<��	W�;���(I	H��k��^_R"��]�d/�!~��!C��A���]�D�A���$���Z��l��z|9��������~ABvr
��J@BLE�_���U\��c�	��;�W�b�QnI�
����F_*����:���dc���U\�~�No���g�������60B6<F����j�6������~/�<U"[���:��Fr��LT�~����,
�}8��b
�t
�m��H_C��h)+n�K0B����la���SW��m�o�Z>##�qR��<���"���tn�2}�>�\���o��S�r�"���U\��hb_�=����N]��#��X�p���+�����h�?g�!��G���]6�b6��
�3������LO��V����y������g���F����Nm
2Iz�>Hb��#6�:�r� �.n��Y`8]��&�������=x���(�Y���A�Hd$U��(���J}����9x��X�I�SOI>78K-
_��W(�6.�:�H+.W?��
�:5bJ�U[=�&�/�z�3�B��pB����^m_����������P���P��\�6J�z eb�YJ��^��+~k,%����'���>Ki7>��nUW����p(V�
05�Y�������C�,5-����/Z�� �
����f�nA@������r]���q����v��e����y=m��?�o�����\�\��f��O�x[�����]{l��!��9���n97����qD
k@BP)l�}���*
������2j>[���F� �{%�mX��=�6�j���"1��BIa�v������6uh�s=���k�5�7���1*���G*����
��d��>C�?�j��N�������U]�n���*]�^v���s5x����#k�A���R��H�%*�=-�Q$���5�=�����\����z=���X#�S14�ReQ���4�z !�����I�:���f���H%��Vkms�
Uo��8���4�K5<�z8!V� 
8-B��DL�
<�+���R���xf`�6���"G��8��O&d55~����7/EJ�Z�N�R�-i5Xu%�� ��FF�q���7+y4J��G���	�������U]���u�3��?c�'�54�mp���`���X�r�~����9�m��N�m���R�Y���=�='6n��������O�������i`�5��~qB�6�j$58!i�'E�H����������\C��q-��V�;�2�8����S�k�l��������t]��$1��
�����9�@H�`{=��
v*(2�[��vm�j���KX��Z����BEWl�����XUcT��8�h��
�f-���b-A��6r�T���	�������2]��+���o�W'$[�����-��)����
T����XR�\N;�u�	i\�d�	P�:���u|���4�����R[
L� ���8<����wO]��:�`��$�3��hs�Q�]/�)�w��=0�:&��j`B���U���^/M{,A�����tY��<b<���S�k�U�2���3]��
|�����S�mWe�F���e��s����	����jx�LH��J�I���'����}[��&$1D{�l�*1��";�����X������T���<`�Je�F���3����������:12���zX��e�v�,����q3un3��j�_u1{��B�IK�k(0!� ";�LSRRJ3�d=����*��}�?���`�2���.k�	1�C����:��e���l<�z�R�v���
P�����-��n
�:u�����nD��Xjo�kSk.b�� ���/WM-rh!bw���U����=yaB�;}
�|A0Xd���kh,�Bj�Q�v0!k",�����`BZg�6�wL�E���K-0!�b�$�}m����|T>�E�CGj�m�N� r%��=�|I�n8��*3F�w�
 5�`Bj&x�3�+>0!��p��'�F����s_h�j._�D.��L$�:2h��`�|����&m�)P*�J���4�}W�����Y�l5��,K�!^'l�]�<���&�(�V�\�.UC/��>�������};�P������pD;���d������6���qd���0� H���E��@�
:|f�fZa��k�us�z�@�e���T-���,slC6���\����P��U6���cg��������t�!��^-0!�^���{&��1�YT��,0!V*���5�0!;�D�Y��3�����~!���E�ie���,%�7q�ztb
�ZXX�\�y�R�����D=a!yAoW�%�����1��sk	ys��"L��tH�m�3O�#�
���Zs���`oL{��X��iX2��s�ub��FIW����0`wc����a�
�zhD��Z��\�;�x���|��	M��
LHa�-]��H�����b)�E��u&�"�����p0!��`�:�����a�hT��Fg�L��u��B���o��2�T@�[�_���CA�r$�5��g!T����=g�C���#'�+�S����a�C��:��v(7<�1�hH��h���:u��,�L�������u��U��`{DO�#�?t���n�&Dw���M@�#���C|&=)��Z`B
�������'��b�������C=��M��t�*gR�vK�_���x����2����DB����p|�����UYH�[u9+�BxD�?58�5kv%-4.j�~��.���.x]]I�~�g�P�O��ZM��=c��o���U�=��!O�W��F{����tO������v�1Tkd`�����o�k��]���kI./���:�c��ru��cm�.���.������3f8�g��S��e��>��?_+8���p�#��X��.CQ`��Uo�G�u)	�������G�u1k���wi��G�u1�R�KY� ���������j����=�4�M���&����:����kz��4h\4������Tf-jl��*��3�E7\�^zu����XtS?MW�6�Us�%����
k�th���xx�T�qhN�s�����m�����"���B�2`!4��GvY����[.�3��Zi��{��XM�L�R~�.�����"�������.��!EO��S��2��f�2��=�l�����Z��v�Dl;=��!����'k�o���w����xf,%�_��#~2��R�h���a52�#�V���A��������$��x*	�%a�f<u�S	�����W�G&v��,I��T��G&v5�KfX�y$����C�y�s8U$���]����1�������e����h�k8��hJ���*�X6��mx��6���&���452E�-Cw9�N��R�����A�KiP��C��[,����<�Y��-��,�����Yj��67kn�
���K�� �
8+^8�Np{r��C�0���Z��ne�4?�+����a�x������I�����q*7�.�H�.{a�1#'i�[���Z//�=@{$e�L8f%�/I�������C�6r�����zv�{�-W����n�B����6	�8]���]����b��@�e������V0�k�:��g�
���Z�b=��;0M�t���O������V���"����b�0������n}�L?m���nzdf���y�Mc4
�t��5����~pdf�+3;����������}���5����\��$�sF�s�iS��B����a�L��o���.z=���+���m�kx����I���6B�T�%}F�Z%�L�@'i�Z�hV	Z��30c-���MS]���:�2�?�����S._�GDl6nr�|�\���i�7���2�\��%
��Y��Z�L~OH�m�OI�\}���{�~�p�����Y'}�	s�&��n�}�iPO�.}�^���R�OW��{��V���izV���-�Fg�����bh����c/��&����I��{�������������r��kx�c�����+����]�������w=���fOw:C�����Xq�,����������]k=�#��]%l3j����������:����w�0`�T���<u������8�j]�F���*����fe�����	C���D=\�~fX�#�P��	C�4YR+�R�����*2W�����X�PD��C�DS��j�5j�G:^~��_�a~�����D7�~]�~�Bl�mY���w�v���[�oF>k)��p-=�F���jtA��������: ��
��F���'�.85:�:��3�a����
g�c�x��	�u�i*t$����fs��L������&wHD�5>\�~v+�r(e����2����kY4�0Bo���r��5�A���������n�H<���V��,���F����k<�@�������������Sc1��F�a��������=����3����j$B���}_R�7<5L�c��]3
�����w��)����0M�,����Xx�8���8����qjH�!�Z��a?��@����(�i����{��\MS^�e���A�nb���;��W�C�������wQ�.=��V�������hGx�������'c��xE�4�fi"�,��l�������g_N4�����@-sI�~W%z��=�~��������X�Ip�q:#���EB�a������Bhfb<���{�=0!	jn�p��tm�m�Lx�0L�������xa��z���p�O�R���Z_7T&����H,3�������d��ZR��O�{L����P
����'Bm�V���#�������M��=c����'���x��{,��������{l�}&�nD���%$c�k�4�b
��[G\S�Wp���/�4�/MLSh�Jo����?��u�D��f�
��U�Lr!�;������H��,}���P��QDu�����f}4�)���*��O��H�[���V1��)�{���8�]�	��2�JT�]�'Pt���� �^ �����A�����!e?m�|�^F4��
b�.D���#C6FJ�@:���!����#uO�WcHd/�������3!!s�3K��zDc������"00��m����0���3Nc�
�
�P�1\��fq��\���]���P�S�&�����W0���Q�p�{g*`�2cOW�+�������{�3���@��tU�!������bsQ��� ���(�������uk�J��=���"�����j}m�47V���{f\
tI4
��r0�=���G5]��}AW`�$
��h���z@�@������7�J�*�3�������[O��!W�������XI�1��Q����R����k���������K�3���};��/��n���o���:���]��o'���oe�E���S��]O�xj4���=���6����V��L�?C�NV6�l��W��tU����jH�`��<(|,��	v���*��HY����
� �v��������/;x7JC;���v����	E2�����ScJ"?*d��vj�R���F����&���P����<��IM�m�p,�����mP���z0��m����xj�%��S�k��q�YL:���t���^m�9�j�f���A!?����j�l����,
����^���X�+�Fr�����T��c!��m<�76�}h����X�x94���ab�����1��+]��/�'��v��.K�K�������I?��;���sa���:E����5���ej���s����rN=����	+��T<5�Rg�u]����v�$�T����'��(ia%m�i���^-������o�������3��<J�P�8��	i;7�C�f��������Z-��!04���%Z����	)�����T�>�����|u����XK��p%xSU��v���+
?LD������&�0�D
<�2�����}�vI
-:��V���]�n�����?��\�k�	��q���	k���l��=,(%X����U���m,��<����h��N�Wz�@��Y�C#�;�?��,�����A�Q0�����@�����Gfw)�iX,�2�9c)��������4�n�b��6������.�P��z)�U��6
>�v�/�S��@������m(j�g�v����k��E�XM?�7�a|W\�9�X�Qq	L���s����u��kR^��S
l���l����.M���8p��T�����
�|�~z�\p�"��X��&#�|i-����2���Q�3#R��������O��4i�D�9��8�,A+��F,�����]�[�T���}��+�R_8R��f�la��&��H�.c"�S�F��%��w]��x�XM?�o�F��hT`Fnw�	�neA�.S��9�hA��_���=D��O��d\e�M-lt���u�^�]S��b����s	D^UA��Rk�"��9nX�T������5��/L����Q�k�9J� �F=�����
5�z�,)o��R�fv������������oM��eM�����k���fn�>5�����>�����w��<u�~����U����*�C6�B(5/>kX&+��a� r�o=>SEQ[E^�5�pr��T��Nz��||&��
3'&��3(w�N�z���_A�^*�$���+2��y@J����������%���+�����P�10I�0iO�O�A}���Fme*WA>�z�']�~��"�n��>U�u��"�P�A��2�	�r�~������Ttu_�~�+�0W&Y�n=h�R�oTA��:h�TI�A�i���B��?�#����F5�A������hc�C������,�zb)������\��yU�8i�����#�e�� *����SB s��X��}c1���J���O+��?=7�< bGWD����s Y
�����H��^��	O��L�)�[�4YA�f$���r�u�&&�k���v-:4L�Au��:a���D}6�s��]uie�������f��%�-=p�"������?p�0TU�u�&����C�	W�������k��pt�������%
���q��`^!�����j�]����Mv�*]�[aC��G��p_F���L���|*�8����[�*
���������H��xj9�u�Z��U#s��ML��.?�������(9�[�����X3F�r���
x��OT��60��_ikS���5�Z�����E/�n�r�����gx���(5�y��&���aj���+�n�-:�+���
��c�N�f^����9Q��i�o=<
���4
=%L�gkeY=�)j+
d��^x�g
����^���x��
�
j���z��7���x5<�zj������?w2\"��i4|�#Zv�C��>N(�E���Z�8�CO��:~*r�W(���������+�����V}���dR�IH���.���uhd���F�����W�-�g(����\�}��Y������Q��z��!�H�
,�.����M��P������@��@�|i��	x���}�oZ��
X��/;�G ����p��O,�.��n���Lp��
;$�f�;8�{Q/z�>�r�.������v|\]A.�X��8����7C�;����a't����w��+�EA~��9����� ���a�lm���xa�Os�&V����y[^:4�oU��J�P��Hsc("]�S�!��@�`+��2`j��!G�bhs���]N���kc8Kd�%�<�}����JS��M�����
��o�����g1]�!����C�{��R�J:�,���q��0��,�-��^X/�;�n7��syh��.����q���5�q�>j���%�����i*i���C�[]����n��m�KT7
t_�B�O�]�V�������V������19�N?�.�������:�w#�4�k�OC��.���G����i��2�{(���������G[\�3k��k����|,���@���^��|�Q)��;Fz���\6���^TIq{���a�t��4�
������<�.����o���
�������oidD Q�����pMfe*��f
���U���N������`b��(����&�i��k�x�8��fv����i�`?��M9�DK�vH�����B��c�C��I��2�5��i��n{2���4xU������6Y��7)�h'3��13O
�	����g
n����8fg�9hZ(�C��9o�dg`1��=S�u��q_`���.qE(�m�MV��s�C}���g�'��*��6��K�]���}��;����{nbh�M�����h�l2���8,h��&����9Ak�=#c�<��j�$[���Q��l�%���������l�Sm��m*�N���X.9�2�!�l�T�*�[Z�B���l\�3�5<3��_�!��;k}���i�z���C	�6�0�6"g�3m��&�����q�I��U7���J�6a�/�][%�`�r�%:��z��I]��0W����:�}F<x��vBc8�0(6�~1�t��B����m�=�h b��'������1�?l�[]C�{���z�+�����T���6�������� �����
F:�� _��L�:p6lb���>�Z�m/�&����DW0`����Z}a�U�����kx&��/�&K=9;Gl�q��=jG�d-�����l�������M�D�G�*2�:�`�����Y�r-�)�&���u��g:���c@����S�M,AG��L��lru�l�&9��H��[I]H���s��V��D}`������$
VMu�m�(rF�ig�������^�"�)������<4����a=�I)�<�W�����r#XM����=T�Rz��l��y����JG�@j3�g��h����c`�WRk
���������B��i�M�A��XI� #�"}45
�j�~�u�O���P�S��K����\���&�X��j��X�z����E5)�p��o��z�&�n��k�3|$����w	=CCA�J��b']L�jb�
����Mb�������	�9��C]��W��	]P���H�^*J����=�(���d_Fe�Hw�2i2�5x�
S��i�r�V	;���4�#+K�0I�CvQ����nK��Fv\du�|=h]"��4P>�#|������Q"���y�F�nb0�X����PMj��'��� ��� )9a���A�t���qg�����n]�.l?�=D�7�*���cS�(H8�YWR����k� �5��������R_!��f��H�H����:)����QM�#����t=#];z.9�����
���Ty1�7����c
�k
���ls={u:kb%L���������5���������U��J���3�P�f���e���3L�m����;����v){�7Zsw���d�����Mz�J���[m��:��u�5�X]U�:�;]D���	�����w������;���4��._���5�R���sI�����w�!�Z����d���w���������u�u�����l~�lX0��W]��������5�eD}q�B\;�5�e�,zs*������@c�U���O��W���)��a���N{�F{�	�a��3{�������V���gtKM��v���`SR�����Ye����?6��6�c����UeU���87���`��j��mj���'5O����';���I���s3�y.�`�&�a�DT������=WWm�����,j��.�(��g��m9��?����w��(�;Tqm	
{����
�B����{$N�U[����t
��T���nu*_#;�Vzr
���V+,���J�!�`�d�-���!k|�{e2/@W��X�u����]�^��� ���o=���L���6
(�9�?�����*-��4��\���}����A3�@�_�3�E���!�m]�.��L��Eo:9E,mQ�L�s����C�l���s��x��5T�� �@_�W��5����hi��'�^Gz�B�M������	Ml��W
$���19���f]	9�!���Z�����s��xW�wf�f�A
<����.�9'���3�vh7�H��;f��eP?4��K��d`�>�l_v�z�y����3}
U�?gF�������u��n��Uu��p=��Os4��,�d���������*�>���l]�F��R�&����U�2�.h6!�M�z[�%�g{��UAB���s�z�7��&�����dm�[����z�f��W&6��y��&�%����/?��b�F�q/��]�JR6��
���:�1f]�.Xo�HilI�����z��j~��;�ru�Z]��]��j�����J#��f2)o��r��=�P��M'���=�|jx�T�I��B����{�P]��zh�z^��u�mO^�g�R���N��� e������e�Y�,���]�~�����[G����Q����V���AX��n����:���/�9j��u�OU��
w�j�z����W�D�[��w%&|f*��kh05Xt�8�I����u��u��o��HU����{R��o���32�U����`�{�!��K����:��,]�+��������od����X>#�P������X�3����]��9!�,U~k�3}
��5[<U�jr�['��A�"\���H.���:~I����M_�F}��&���m���p�������z������:#��Y����������bhVRv�zO��\�F�����e6b=�.y����x���,5
G��[�@���rsC��f�`}$Z���� -�Q�z��1D�0����G�u9��r�z�V��$:tn]������9R����_�&y]}�I��t&#��z�A����q�"��������K��j��Z"<��k'�ki�-TR����)��NF��m�D�N��-��W��M�����-<����fs��2�p����C��T3
x��� /�X���OQHZ�.V�����ks�j=[:Z@<7�����G���&np�\���#�Sic�/"��c�Ch�����,WpP������Dd�E�u�$\�=�@���	���b�#�<+La�y��;��Oi���!�	��+���w�DYN'�B��e��S��������r����1����z�Ny���-�B	��B�*� @��>3�%��x.D�����������C�Rq����i>��H��������R��,����%��}�'�T���t\]E�p(����4������
�H�������s��!��{H������.��:�~��o�$����z���`�:�5���|�L��)��([\�n��f��2��k� ����<\��]K���_]J��Y:�O5�O�Az�������x`T����>n�<�t��lm)IeA��+p������6�t�|.����u��n=���T5�F�u��uu�L��S��
��rQ�S���a�e@�A�uD<����q���u��v�l�7�%G����6cna�\����J�Dz4e�����uI���hk�6F�������G9Oq�zX�D�����;����(X�k�;���.�����llm���O?���.`������`�F�!��n=9.F�������xl8.Z�������X��a��
����7���R����������f<����E6cYY�������"�G�~��$��B_c>C�*v(���?{o��W�����?�E�m������gjl#�@��a����]M���!Y�W��^x��^z����4�I|��q��//�dU�K�DJ��"������O@G��&XGd��E�����d�u_�hU3��D��\�qArU�Mr��
�4�������\����&\��mG��v�8�VF��2�������!����q�RHX���c41���
e���DM��+���&{hp�s��0�Z��)�5x�V���\2��������/���Dp3�;
��K{]��:�uL��dl�<W�Q����-����L?�F6�n��-5���v����~�)w/m�o]��d�u�z/�+L�;����v�r�-`���M�K
�KEe�%�������VN��G�2"x}���9�q����D6j�u�
#���
)�c�GE�k����x�&��n��ixT��j��=
�nu"�)T��X����|&�5;V�l�v��l�S-�����ms�0Qqpr}�g�{�
\�Z���0h=C���rk���e��HC+2u�[c%M����6��-�:�m������n�N^}<�I��6���m��`M�����.������c�^�
G���J�KT@mUZ�.*���)z���W�9x�G��58u1�t��:���se���l�u��t�U�Mm�v�<�,^���g�H������f[�J>p�9&�_�)iq5����� H6����
��,�����&�I�yO��C|&,^�����=�tH�5�j	[�[��Q��_���}�q
x��tm�3B���%�M����|B���c�U..&�V���C�+k6��'�
���N�BG�
%�YVe1�3�{Zn����t�y���(*�qlW+/�A��d�v�=U��
��9r$9�Z��
�n�����wZs:��P(�ZQ��2������QT��V��.a����Z���n��f��^�fT���`=�iC�[\���g�
MV���p�nC�������e����(��p��:\�+5#2��xa6��7�e�*o?
��K�J���#������M�T�
�>2\|��K���b.�6y
����^'V�J��bm��Z���zv��W���(nDWU�j.nC��<f�^���~��X�!�n'R`�?N��1X;4_��V�-C�������S�����N�JGZ=����S���i15*/�R���Q��lhu��������
����&-���������M��������j �W�����]
d��JE�N{=���cRvYB����W1��w���d������g�k;�*C,;V=kD���g�����uRP�am�m��8k�H�C=Nx�����lXu�08���iXum�z�U�v}8F>�PZ$�!�h����.��0_Z����O0���tH���Y�������5��w��}�pf{?�1���(U��p��;����!`m��&����
u�!f����Ho�Pi}
>A�����S�@��m�������ET��S��E|�V���w���`��@�����L���@���U�WQD4w��� y�]�Zk��a�1��!(iT��+m�P��UlZEu��85��mi�h�����p�����B������5c�
f������1�Z���D
��v�&9�����U��,��@"H&�
#FW�^u�J<&&���v{���y��m�N���{d�����K�O`+����$XW�O�H����r��$#���N="�'�g�n]�:id��Y1��4o',g�p�:B`4��m��)9,��(9����>�L��K�O�O-o*�S���\�E#�����d<��v �p�~��X�O���W	$Q��_�Y��P��'�Qu��4z��z:��6T:U�������q|��N�SG:�znz)��	:�e���<��8�z���+�����Ga������#Oe�����~�N}�V��FxO����"�2a	�S��[X��P��Y5�
��2y�p�4����i���Pn�N}��O@�PT��D��V�,�[���)���Z���N=��]�����L����YT���j ���q�"��p������~zKR�u"6��3�L/b?��������p�*���[�F)��C���3/c����q�N=�ih�'�h0�l�2� vA����u��V�c�8N

�����v����'����f�{z����p�G4�g>�P����`�v���c4Z_�
a�Y�������������Bm[�[W�
�i�HW�<����S�� �}SQ�N}���l��#W�n�
���g��V/���CE���NCq-���1F��VSo����l]�;�H����Ww������B�S�u�K�GM��3���*��O�tc+(_m�D@��N�����l��Wv-�S��qB]���	�w(�9h�T�}i*�Wq����Pb	�����#����t��^���j-Jq�����^=W'8���zF�DW������LW�~DK�
{��+[1��(����������Z���c2���:������������6��N�[�
�=�����|����%�����!j�K���!��,~ [6�:T�=F��2��p����U*���sb1|��2:	�iJ��U��M�����N���k�����4|��,���%U�-.�*.4�GU����rkm�lED���,FdDX@��p����7�� tQ\dI^hm�b��N=W�����S+���H�7��
{x����+����fq)����n+�t0��N�P�
cj�+��Jf�X��s�p����"��������-�����q��9|M����{#2b�f�H�s�z�e0���	�p�T�����_�FZ�t��f�Z?��U��P���D��B )B5]���]$mj�T���������7)�]]��#���+��B 
=U�FNu!��Wt�mJ�MG���1:of4w��H��k����@�&����di6T6�ZB���
Q�SUD�:�zi�)o�
4%��So}����{�����\����F�����i�z�����H��jr�L�`=e@zQ�T����xO����M�gF����������������+P�:��E����2�&n0[�C+�T)-�[4����r'���j���r���]�������`�B���j��tm)��1[��\������H�C��j�����S�!.��6�E}���3P��S��b��8�<�<�8�9[�1N�=�+�UDt��xW���h[��Y�|�%�>��H�#qG��f��'�g��x���s�j�\(��g��Vo}B+�3f�rDT�o<�W�.����*��.�*���W�HJa �����^>4<��(D~���@�r�!�%���C��t��{�����j#����;	Q+Q.�V�Q
�����LeED���!����5!����[�@����Q�X���BD��?���^��_e�3#�������T9$��>��?�)��q��SVa5�z�	M�	����v5�z�g�W�.)������Oz`�R�+b�.��r,8�1��)�T��t�^�_jN���0$�Z�`�i�#"hw2���V=�f
��i�3��nS����-5�}-Z�P\�c�%���"F�����#�x�6�(l3P
�n���I��I�9��	��*�,(j�S�z�*��x��u	�P�.�����~�mMa(6Y�����t�h�]
�NQ[�����9���M����e$��j���/�uA����.�y�t�[4��C�O������6�����R��*NPO	��B*J�W��K��r�@Z2��K����8`�P��(�EX;�1�Vo���@$V
�n������BY��`	�����.��#�8���H"Q�"-lZX^]$��6J� �u	��J�i��o�3\$G����H�����J4��u�AE����
ad]�*~'�����A�wi�^��U��&U��ST����A����E5�:e��W!r��I��A�����Fm�� �rm��#�u	�E4�p���5�%@��7I$w.��A�DL�h�u��D-�[H.U��[e*P��V�U�A��Q��B;M��� �TI�m��f�u[	R��*
�7�i�&���zf7�i��pF��� }B�!W��c��P��`�Bkk6�)��Y�iVs�����IVt���`�uF�N��A�G:F�
\�,P3��N�Em���z����K�F�DS�41[?�z��Yq�>5�hY�Q���Ra,iY��gv��F}���(�EK�N��V�ip#N�U���j��K��|�N?������O��P�I��� �����Q/��K�D�0$Rl4�j.�Y@���3fk��G�r`�Z���n��z�����x��$z�A�=D�
�#]�c�)���WN	�!y�U���D<�);���� K��5W���K�,�sS��i���H��GXBQ�Msj��������L��K�D��������P	�+�",8��C5��f���hD3��-�&��`��k����h$$�V���j�l[���5J�hX�u�{��'<u���n]���3")�%@r� _d���g�|3���{�=]F����l���U3�����>�5��R���� �������u��l�d)��������"��	+�22<#r��%���W���pq�2�� ��=��P��\�2��,W�x���"l���,��r	�%��K��S���H��G���<�=����m�/=2��>t$��a�uIhg��XD5�U�[��W<��3
�.��3��e�>�����;{��T����f[$Pi1��$Z����K�$�����8�]�m��,��X[�C��j����]�G�=l�
��f�\�<N{�VW����Opju���(��%@*K<"2�:�L/��0h�=����X��F������xe����u	�G����6�V�M�������=���;�[�<�%@2�V������.U�{��lJr�r��������4��	�m�Q)8����9����,Uv��W�x���1p
NFoN��&�F�����}ar%"�q����]��vY����������b�>���$�d�q0��g������"�@"�Z=7��;�����a@�t{3������ Y�f8u_o��G�o|�K���hZ4�)������N	y��C�����e5G��r��-w0�:u���G$������%@�yw`�mIo^���S�pd���	F�\������'�^�.2y_"���t	���&Yt���������!�K�!4l�=D��K�R�&%�^��H�x�N��~�]d����	K{��F�G���%�e1v�Y��T������� �-��� P�]$�5�$@=Xw	��s�D$��n8u�R[S2ZSrk7��t���H����}�q���F�c*��]��[;�Ro���������
���e#��K�Tuw��w�"h��/|�Z����S�l|&�%�u���H��(f{)E����R�s	�%
��Y<���K��m��
bj�%@r��� |���QG��Ib�
yC�I��+|O�I����0���w��I��S#�n8�C�������1��SW�H��%���L+5kJ�o�+���~O�A!r]��x����8u��,U���xw	��f~�U�tMw	�B�����,�m����4,�%@�����YE�t9��8[����.���-��t	�L!��=��ew	�G,b����@����b��E#�����Aw���5���I����yJ�� ���5��uJ�g�0����S?�_z��Dy����S�������P�H�rfWHhU����S����&�����z[�
�e�`�s���K�,���0$W�^q	��q���0��)�K��4�#�S`	���M8�CXD�+� K�Z�J���Z�K�t2��>�@�%@F��-��T�S��� ��g�
q7�z
�v�2���K��")��xO/5c� 9�(0��S�LG��w�zo��jh�\"����mr9h��s�[*f����P�\�������t��.?t,�����!t��PI����=@<d8����I�����a)(��Q�w�!J�������p��d������N=��,���g���g�������%@�T�@'!��]d�y��Fv)�~J�P��WzM�>���&>�u?4��:��3P�S�]�Yw�e��Sf�p��^�;Nx����&W��\�p�p"T$%Q��N�z���z�kN���Md�6wo��&[�5���	�����w����{N=�D�������T�B?�DzO�k8���(��S��jD�]���������'��Sd��\����b���KH���[*�y_ES��&.�p�����&�k�:�m�?�!���lE��%��c�K
�GrmS�{�H��S�#]���GJ_����;�O-��m���N�3]F&�t�{&�����s�b���[������2whD"���
�kcY���a8�\�u��WQ�f8N=�%�~%T����y5!k�L���T�P������GE\����0�+y5���V��Q�2\d�g��6�	��F:�!|�����4\$E�H�$���K��Ly2T['�0�K��6NA
?���p	�DA�mF����2���M��b8u�l�,�s��b����������Sc�#��z�9���+#)�3N	�c��uK��:N�;�:����g�
�D�����0�������,�x���*�Q��*��FG����h����J�p��6Qdp��������B��{�#����v�}�4q`���|�6�@��`N]�7��r����H�\I���`����(�m����w@��b5�@V�Ti�`}A����,��dW��#]$#�S�����#]���R0�
�?�^�H���Ia�%�R��R;���1\$��O���s{
��L
�t?���2��H?�R��I�!���jf�(�Udi\����Tf��h���e��������wl�:�QON>���9��!���Bm�������T�l�����������6����]�#���=�:o����������q�M���
V��?6JNE�q������U�����D����T��r(s2���_��q�l	���c�� ��xfG����{Xwd�6�3]��C��Y@C�xj�kLqhE��T*��(4��U���Z-AM�y���
�����~O���h8����qm
g<�#r���mXuIZ-���gV�
�T��;Y����s��!G��h�8��b����;��i��p���[���Kp���l����`���S�����92\�c	h%}&x��N\�cibh]���f��+T�5��T
�~L?*Xz�:�nK�)_����U��v��{����O�����m@@�n�j7t��'������M������5S	A}�s����5�
+������T
A���%����J-��?��
�Y�"w�����##g����agp�GC=��`�jE���i���@J�31��K'Ejz��HG*�7�������R��em�*���R4X���q?]���"��>�b�fC5��B���+���j7��6|�����Uu��p�N��h*�|��l��t����l��{eV�+C+�kz���:��U&�L�1Sa���c��SG��kg�8�O��d>������~���l8���4��1������S����3[�+�4
�n�MZ7�Egr��-V���������W`	�U���PqC��3-�^����Q���Sg�r:�����f�q@���b~Ao��a�	�z?����E��iXuD����g<�6�L�����t/$��UO����{��~�C�U���|�i��B�m�
u��~�1<�^>Am��,�����bW���IU[�M��x�������9��u�i���l=/@b�>r�t0�:
UXM����s�{nz@�l*>>
�������z]����I�v��[���T�K<���t���#�G�ecO�����9���N�����
;�����&�2D������D��)4����V�#*C<�5��w��B�o���#��^��64��F��0\�c!����,S2����X#)D5`�O�T�X5���Y���bXu����1[m@q�V�����a��!Uo��k,���J������	��� �K�c��2�T��r������l�^4
�n�!�q<�7�`����4m5������h��3O���a�m�'�M�G��k,]whg!w@����VGU&X��1��S����:��?���j����klj7����"�kl�����1�5�*�o��z������F��7�����.��Xr������p�T
�"�D����i�l�;�&�
��kT/-��p�	5����;;w"�j����*m����e�Q�j5b���5��!��6E��kTwF�{�������>����8N��O������:N����7�:b��c� n5�X!� ��b�=O{�TVhP����?��j�3���GU��5[�'S{[���^5�
mH����e�����5����T��;���>��s�f������W1�:CK�,V� ���=���jT<?q:L��O�4M���6��6R/����3���
K�:��(u;������XZr�j�^9F�Y6�W����<FzkN"K���C�H����_@��f�zz?"�����:V��W�:�c�k2�Db��V��</k���v3���
�������%�Y�x_V�0UN��NC����e���4���''h.�W�ai~�6�+	����E�!e.}
��b���`nP�U�Z���}��
TdO}�4�l<1b�E���+����6����WV��8^��N��M��TM�\�c��iS���4���&k{��TZg4m�p�t��^X_�zU����ka���k��������'dt�u�C}�kT'ft�@=���~�m������O���*�jh!�w��k�y�6�h��[p�t��4fr���!�G�%�
N=���*��!cW�g8���*MX��1��cR�r�����������;�*�����L�XN�c�d���S���.GjE�q�6�����,g���J�c��Xm
�+�����(�M�q���N��=�t�}���X3�+��-��GR.��>T�c�N�����CW��c��a�y���>��X]j���U����X�SU��s��Y����~L]!F�xO��-��3:�p�����8��yoO�����8�p�V�����?4���S���j��>�p��y���2igR�5�:���n��w���>�5�F"����k��'�r�H;��������?b���3�>��?��HKiHDP�?�?X����D�!��hTib�BF��uV���bjC�S���2X�orjpm�V=����I55aC�S��^��c���;����"@�i	��\�#5��Ag��s��S���}��k8�X��:���v�!������4^��dK5S��3�������v�a���F���.F.Ef�����{*�VX�������M��J��,2��2�~�e��<?��������q-c�N=VE����U�jC���rU�`�"�+����4�$|�p������G+�N����>!�kb�~_�����!�~�m���z�vjT��B<����T�O�����|����i����8��\8u���}BZ0���-���qBJ���Kq*�eq�h���Oxc����u`o�Q�j���6��j��S����7AL]���m�o�v��U\_�>aR�����#��h�'���S����;DV<!N=IU���J�t������UUiu��#G���V8#������
��u����c
������;
��5�Z����������g:N����}�z�%�qi�i%S��jO�
���*0%k��u:�H�
ZU��i\s���#-/[�QT���qt�����ersz�4��6�S�N4����7F��=
���'��M�~���}������7N=��#m�fMGz<Do�������zi��b���?"^7E3�5��������)X�j(�������<��h�%�FuD��^��C<5��N�`�#��r�&���%���PI��������O��%T�k����#^�b��{)n}��o`�x/�#���8ah����vf����y_
��@y��	���s���'�R�T�G��[4������&�
���k������L���>aR+"u�+��Q���R�C�x0�����mU�d�jC�S�E����gw=�t�k��j�)p?�]��pDj}�P'�k�	�`_���Z]G:��jTe������?rG���BZr��k�E��>x	��\X�tUB�H� 
<�q�:V�����	�3����Ha,��Wt������	Q�kT��5f��o]��Uo�S�l��2v5�}2������A�q���h8u�������U�e��-l\3W�	�S��QM&UQ{�����g�Z��'����s��I��Z��u��>�'n1:y&�	�S���Yb}�-��6T�c}�i�����	����gW�b��l#bM��n�j���6d8��o�g�q�kT���V�4������Mx�=O�jV��P�[�����kM[
dE�l
�.�BR���g�j�#
�n7P�����5��t���+'��5b���!�����H����#-U(������Q�b�@'g`���������]E��+2���B�f8�C�8������fCsSe'o����r9�HvY�U�~P�����Pob8�����H�k��1�5�<�z�r��������Uqb��~*�D*�h�p�~Dp�*���N���P���=���	����3O�v���������yoO���������v2��L����2(��S
���"�	YGf�l����O�l����XL�~G����s��SWD�	��RU��}�KQ�s�RD��F��p��N�����7�1���� [V,uus�:�lh���/�����u=)K�O�S��>x�,wez&��[&���������Y���u�z�����K��Q��-W�m����J��p���J:u?24���Z|:u?����w��q���^�M��[#����Jk�����U\����(O����u?�����5���$���
]��Y��%��G���9�����f��@yF��4Z(��l��r:$���t�����	�Zr>u������RKp��wD�#U��iw�Yo����Z��!N�Po�����fk�CcT��s�p��j���
�{�7�%�����X6�V����
0�:��
�ns��#�\�����5������t���*���=�iK����H�=�
����|O�`@?���}1OT��J�����^P�
�m��de7U�{K�mh�B�&vr>���#Y�>!�
M�#)�<N�VT�{�5��&����hf���]%��kIzQ�iyY�3��c(?!9N��	�l`��.4��=��T��i�=n�*�'U0[S����	_�l�GJ�1Fv�U=��x#�;->�y��=u?�*�}%a���Ef����?�O}�z9[F��&��{��&5U
"�z��M���x��5���>�#�����7��|�kT������U�u>���Uo�
��g�vL@����a���Fu���q�e��R�;�����:��\�#2b�����
�v�CTF����
���!L��+��P�H�<Q��+j19[���W��&MN��t*#N}��Z]T���;N�Xz��P�'����J�"��t���1�;�M3P���~$��	qBAFg8u�'��,�e�S�����g��N�3Q�H-��I�S��i$�����^�x��vu����S���7����L�Q�T���7w����7A]�*������������l���Y�u�%��n�;d`5i�j�NUk���� 
��:�$��_	���li�K^FUE���S?�q��O#)���E/EV��"H���W����|��S��l����]��r���e�e�S��Ol'zM��S�8�S��N
����54�����`���S�N�"�&��������'FNr}�L=�J�������v"�q����d����1��7�L�����]��n*%�%Q�9��G�=R�}�\��~�m@��Zv���U���:�L;�z���m�3��'T�}|�3O>5vv`_���R"w2�0[��ZkD��r�����"F�=X�kZc�]�\��][��:�r����fM-�p�~�Pc)�|������C����'ti6���Tc��L#����=B�`T����%�,�`��������L�+�\�#S�.v�|O�l`7
*QA���q����^���l8u[L���x6�:G���rE�����S���s�F�=m��d��X�PUVw�}`���S�h��em�V�'z�3���r���e���4c �h��2������k��L�Ie�S/UZ�_!���ov��{����R�As�l8u��������~$���@~�bo�q�M�:���t��q�'B������e��{+���nk������
U�]U�t�P�f�OJ������	8��s��k8u^\+�M���W���|������b*��8��SO�xf��+~�A��ys�����W�����Ax�y�e�
R����S���6D��a�kX��M�=NA��^���W��������t6}0�����sS���
���
�U�D'�'�>�R������L�A;�MW�x�2r�����|j�m�3�_�vm���tof���W��� �(�5�2��KR���}uL�%h(�V��-��u�+bj��GD���4������]`
1�2���vw3�?��6���UQu�m�x�5P�����`b���vk�>ue����Se�f��^9��&�s�Wq]��J����*����2���Flb8�X8��}��St2��G��2�}Y�������d[�i�3���D�|B?9h<���RP�8N}�Y�^E��	
�����	�NY�e��.�i��VJ,m�0������3N��t�q��H�h�TZ����G�Z�](5�>�����`��f��d�����c�����u?*�0�'r{����p�@����u�������=�����	9`����5��l�l8�ha{&�r1����qa�3�����>>�nz{��[?��
k�|�L�P�T����a���]���7w�QKR��Vm-���O@�EC��v&<F�#(��9�f�k{�u���p�5��u�:�?��B�';N��5C'7K�d%�~�u��/���������j��X���YN����?������P@d�q������R��]��S����Z�B����b8���j6�#������i���_��u��.��xv<�pj�gA'�L���8�}���h����i_\�����}�.��l���q�7!#G�����s0��Hdt�q���w��:��;:�Y��z�'�H%*�]A���c$�ET`���1��[>S�G�q�>&�nx���5�z��������c�n�]��*�S�����������7�&T��S�7�R����N}��R�c�l��*�,�����N]�BN���	m�����X+-���>����<_P���8N�X)���&F�yU2k�T�����c,�S������z/X\��S#gS����8�{O{08��~�,��&)`���������W\���^����#���N����Et`DmPq��*>T�M8��riZz����A[����S=���O��8�:je�B������S����PKq����}N��B����u���Z#�`���2n��
�T7��� S���b��6
<��^:�u?�9|:�;D��Q����:��+��)RX�����{e*��8N�*�5N��P��8uQN�����W�>�	}�����P����I���Bq}��=�&N$����Q%�.s���M��:���K-�O�9�������3B�	���������j�Eq}�F���O{��GAEe�{6�f���"��F������.�2w3�z
����J��_���}0�3;H��SPU�~�W����M���1Ao|]�^n���:����sPAM�).��KW|P$IT?V��7�T���Al��B�b@����W�Vo����4���
���
�z�4<���W�I%�������A��b0�S������5�	����k[0�l��
���+M���m��@������ 5��������u�����`m]�����+S��2<1�$�^^��c�x�����O�K����b@u]�>�bd���b@u]�%��k���g
���AK�3�T�-y�F�A[.��m6�j+�Ih����Tbb��7�:�(�����<�H�4���fC���:d�*�0�xdd�>�	D�m#p��b@ul��3����j@u�H.�r�vT�`�K�d���t��
�P+�I������Q�'%&�������Y��� �R
�N+-�����BNPLR
�>"��R}�Z�
�#����P��.@�p�_��O/]�$UJ�cm]�#� _A��X��PM��e�ST�(��. U
�����h��`�T�U�:/}���+�sU��*�Pi���@���+�S��!������Ket�7�2%�Tl��7���&�2�
�U�����&�5���H�[=���m+��h (���+��O=y�������W%lV�+Y�{jS�V�n��=1R���-l��W0�Y�S�.�K����}��'�3������n]�:��h�ZP
�>l��'c0��j>��7��������_Q�C��!X]�������j>��l�0/��?)�v���J��������Q�#��~���
e�3�P�g�r�v��V~�y'�q�E�.�l��������Io-�����<o#���z'��{�2W�PF �qK#s�{��}$�#���N��:��A�+T���$������`C3M�`g%e�WG�C�|&Q�HS[��a�AT]�:e�7����H�d��7WC����~1�_E-���:��@�t+��T���}������G�aRtU���T���12���x0�:��dG��hju�������q��v��B6l�^��t�z�!�3Y��}T�+T/��]}&,��+;�R-|��
��|^6���_�T�n�#a��W���B5y]���T#���$U|�C�w3�-������9Yw6����fC�i�u$<��y��s�do���,����*����{���T���8�\�cA�Z��*t���.
�,������:�pV�RDY}�u������FMvt��T�9p"5�	�����+�'j��g�PC y�Nfxh��&$PlX
��#��V�e���c�\�J�\�jPu� ���AH|j���q����
�>���T�&��S5��x����T�i�U
�>�*i`
%�F����c��_>���3��Lc���) ��Q���pl5�z���q�?�����f����vuf58�+���PS
���8;R��I��wT��%�
�;�i����g.����a$[��oTI�7�����C"�i�]��[��T����dCs��ne�p�we�SQ���w��;����'��&��N)3����N
���Ni����+��w��w����7��xnMK,V�Z�
���[pkJ�D�Td�ID?��o�B�r1[t�� �w6�tv�����Ak�����)k��-Tn^��]vvzoX!7��d�L��
F7�Y��@�s�Mj�
i�I�/�B	E��r��k�
�[tC'9�oQ��	w��m��%��ni����b����� 'c�G�VuU�c��]Y,��r>�C��;L�[J��*7\�S�O�q���+�jy��y��rkh���������,Tas��J�����yxs����t���~2���<"vy4<�����b'�5OeR���e��u2�N�(�lv�����0+V���}����K��Er;Z�L�T��	������G�=���4��hv��SqI�
���R\���Q��I9"���c��
8��+���".S�
�Jyt���|�[����e��o����r�����Z�+��vg�\7���3$�$l*4���IG\�i<aS�s������E��5P;�,(%:��|�i�3Y6�+TOCb�Up��.t��!-P]�h���F`��!-$�_����!���8O�5�����VP�������6z�P���V=��r�pyMu������b�
�[���������[Z��W�4mN��-���3JN���
�����������y�/'9V�2y���������<��GmnJ�E9�
���)��u�1���������ZY��V��n������]��c��)M�4�Sa�N���f���sT��uwL�q�l����c��GF�X��;��@2��;����nVv����[���vzE��=�iw��=������������`��2jD�����RCFN��O�Rjv>�������sV���i�V�����8-�`u3"�X0�,)	�R�GbI��������q��#�7�����ius�OC8�i�v�%�������m�/Sf$�l����Gv=��bK bR�v���������t���pu�w��:���5u�����ow'm���r���w���
BC����6��g�6;{I�xO�I��z�@�m%�5��KQU�cSO����m�wgt�-G��d�0��������l��������pw�;���Q�6B��;��"�,��#���v��R`��z2vG�S$�zd�:TY�����'�~aQ��N�DZ������v��.r��x��wu��H�a.��~�vw�"�J�!�������e2�{F�����Y���s4��,��g��I����xw������7n�"��:9��[�]��V���<�����a�E���&��'���=
i0&�'��\w�;�F&����
:��V�F=	�)P����JA��2���~7��&�U3�e��o7��g
���Yc���;��
Cv����u�%C�����:��V[���=`��~9�~u����miAL:ych�4nKK�n���w+���iK� �BM�zv[Z<c]a�-A����9�.Tk��G����o�m�
��K>F�A2w*���N��I�mQ�0�{.�����Os�1��J����eFfF.����[v��KH�������x�V��
��z��
���-k)�65���u��=�Vu�����N���c1��m�D��`@�C����d1L�����w-.����V��H�\�T��I
��_9orG,����^���!�QD��p���������Pw����A��uS}y��m+������#����=8�}W�����C:Nm�h[����P�3	���B���t��Rp�~�lp�]{�H�uE����:���3��5*
Vw����)I�Tdo�Mi!��Q����nu��/�B�,�pHo����+����Pw���T��4�����BT-�;���������Pw�O�d����}�j	2;��x�d��Fy���������9m�K����s?tC�������}������������
|��]f�����)SQq�;#V�2��:l���9�A��a!�]��#�������(�:�&7��z?m�_hC���];��p+tT�{���M������
I����#�*�'�
��E��]�������Rv�}�������^�Y8"���=����,�v�d��z�����?{��
y����R%;_}��q�>!��������>vQ���>��z����vj7X�����UT�-�k��(Q�U���=�7��e���S�`�X�[�"�Zv��{g�7�a��xw�V9Hi���U��sU�[^Sw����p�+Y���p2��x(59���:R����5<�p}����Ip�����e���4��h�Br����������w7������=<3����c���rM
�t��m��B��00{��(���Y���a�c%����~!�����"�l�H��������YO�aP�X�>���������zd��Jm��]�-Y��cme�l=Ak.{�Z�7��a")���@��a@�Ld�vT��9���Qw�CV�u}������}����Iy�k6M���q,�a�#��X�\��;"c���H�a����Jm(�W��5�,�Ro�,�n��`����������4���X�?�lmt�^��������ga7���N���^�FU���������e�v(�����h�gg�����Q���h�'t=W}�]�M��lhlk����Fv�?�lR�]
�{���B��>�UG�P�T���5xQ�����T������H�p�����t�V��^e;���P?��!��q����
����Y��\`]��6�up(��l��{���k\�'i���f/E�������.����7�y+�������!�����
�������q ���^	��cK��+Wo8tb@����^�z�R�YIr�zlgh��Z�Qg=M)��M)U��_�-���f��e��Y���|	AO�j��y9�tu�8M���,j���=+m+2���d8zV#r5	�h�1���R�ELEY�����+��DVPc8v����7%�R��c�a��z�����YgKv!����N��|�B�6����C'���k��LxhcM�J�
gi����8���,����������d�:�M�@��Y�a��n��Q���n�q���r��W��t��m�mN�vbLzR�
gig��%�:l�%I)����>s�[J���f\�C�����JI����a(v^�����
e������@Q����:�����J�c�$i/�-����MsJ�����%���4;�A�I���#��\��G�I�^:aj(Ie��$#����h��8I�A%h��r�Z���qUN�]3c]&W��)�&���w�nMY����4�'U(� k	�t��l:��'��(qL*�u�p���S�4T-)@��F��6�B������������=��~��4���K��%bE���Du�t��t�v�lK�AgL��<�����b���D���+ �i������hC���2���9���������Ou�������"�����kG��O�dR2���C�vd��
*M�	���9��nK���q��n�[A�_x����i��a7��5P�T�~�=�H>�v�IDx�����Rt����q�i�q%'j
P���R�tR�4$�db�Y�Y~J�"b���NQF���H�<EI����B���nJ�
[�^;��u�;Q;v��d��Q��y��lq�V~����D�����pQC��.��M���e�<eI
+�PO�51����"���/d:O;d�D��S]9b}��l���u��F	���i�����V�h�s�Y��& ��=��y�,IZ��p��j�N�Na;/X����nI������DeT�|��0d�J�4�.K�����e���o����)��tY�#,�_T���$e�H��:�t��8��O�\�T��U*]3��j������m*���i��]�C�����xN�{�-��LH����<���������']���/��.�I��T����S��O�v$R�Y�&���k������zaK4C�����
�:Y{�zdL�=��
`����01�Y!�y(o>�b��
]�l�f��?�A��E)5$����&Y�2�4vk�3����������
���N���~a�z��,-f'��W��0���������JD��y��=WT$��6.�=E�P���(~�l��UeluxX#��M�w/IV�u�;5*n�5HD~r�'2����o�t�;�����Gz�>�N[��V6{p�;my�VL�r�!����(���n����T��������T�NY�:NGC��q�k��5l�%oY��f�)m��gwSZ�Z�2 �u�#���E�PMPV�t���d��P�z�7]�$�&[������<�I����j!�t�;5��4`\&O����+�q*���C
��x|j?�M����'|qs;�(��;n\P@(��y:
�g�l�!�!��2�!�a��r�uI��ra�����'iG�L��RU+�P1L��M��Kaj��n5JWo�����P��*�dNG��oRK���9\f�@-���t���Y(��[��p�o�v?�Q�u���|oH�s�P��p �T���^C�)-Y����zHC��m+��dY
z����Km>m�v�������o�C�<�k�����%jQu��I.1d�Z��h�4^R2��Bm��AG:��K�&��Q�iJl����G7�����*0�������<�&x��:�����J�im�z�h�j�����f������o�c������y��-fV#�����R��\^t��C
����t������tA�d�A�&�'?�b$��
�j����!��<���}���:I���Y�'����s����'=�9���L��CY�d
���Sp��^��8������z
��D�����{
�l�<!�W��Na�U���-�-x�����j���?�[J&TUi'��(L:��Q����~%�=��SZ}p�5^>z8�q���_=��F���8
���M�B�����\����i���E�2�w�%�����	�opw;B�n`���wa���V���b��n�o�c�j�0IOh�hr4��	�tV&_�Qm��I�"4��������6�JFH��� ����U2"��pG��	t��p;�]��^����Sm����T�C��{�K�Pq�p��"a�P����jH�Z�m����NH�N��>��MQK��$)��N?��jI��M��,U5�z����~��r�������%T	���UF���w<���g�|��^m����8T��KU�P@,h�r�D���[��5rU�������p�2���*�.!�����W/j�o�P*����	�'@��t=�m�7�����Tg�+l8N;�[�%)�h���A
��S[%�xKs�4�����#s��uY����

U/�iG[�fS+�p�|��Z#�9%,K�8Fz���*(v��m��d.�g���e���������j��f��s��������?)�3OY�vN�w��@X���w�"FA�����eIVp����7�K1p�A��2T��VGh����r$�m+����]��if6��.K)����"<rY�KxD,��4n�xO�9U�.K�B	E
y���d�'���k����$�G�|���8��m��.�+PC���5r��C������<�`
G���=W>�{*U����y��!����m�L�A
�p!�%���z�
���Y��������V��!J4H�C�|���Y�y�h�v�_��SP���fCK�P<Xv��{4<�/�X��i�(:�lh��*���E:�#��"��D2PDj�G�Q���+�:�������������F4,��flC����ew�z�JK��bXv_�N}&R������-'R� ��W�hHv��
����F���	��N�*�c��Z�)����d��xiC�#�����P�����Cx�H�G����P"V�{m�-�z&-#���pZ�����U��5�!N�]�����9Ts��x��+d�X�S��0
���=V3HI�&��T��[�/�}J�l�")��9����jn	���v
Ni6TUvo������\��N�������2
J��q���7�l��������?
�y��tR�N����5�����#��*�@��JUOm��a	��M���C�[a_�s����
���L�'Rv"p8`}%�=��x�?#cM=W����A����i���L����V����a"��g*
�>F�'�V)�
����=�1)�G����U��j"T��G��G.����7��G	�
��agl=:5��E�c���kd�������6�P���m�g�}#����D��D�H/�V���8�;xj��d1��BI�U��r���b������"��������@�����
�W��5�uK0�z����>��� �o�n3fk6�(s*
G���;�����q�����{:\��!�0BFFgp��[��dF�����c����u�X=WL
�=�`�4Z9oV���V6T]���7�PR|<R�S��Y���yr� �\�lD?���Gd	���i��
���Ncd��S�T�>������p��)�)xT�p�����S��h8���:�Mde~E��g��^��5�#N=���h<��N}D����
�'���qT��r���5Z��^�QV����s���q�qf�!Ve����q���'R0���;�������KLV��\����P5��q�8��E���1�$����#��9A6��p�Y����f�
���k|��t?�>���S����x�r\�������m��c�9�z�jT)N .	������H�8q{�����+����D��]!W�g�D���B��V+��+�����A�x���`8�l����D.�M!WW�G<���d8�H��	^�<2�����|����/^��y���n?�y��O�~��O���o~������x���������g75[�?~��7���������������~���o��}}��w7_~�����c*G<z������}�����_}���z��do�\��s6on�~���o�p<������~����PF-������7�������oo��.�Y����vw����������V�}�����_~��W��|������o��}��7/�����_���zH������?������kqs?��on^�����7�_�=�����u|�on��7�oo^�����o�}���7o^<��x��<~���7���r�d�1~����/��y���_�������������q�+����0��c���������V��{�W���}n#��eNO��J������7����Wz��]������������w_q����=���W��v���{���o���r���u���3��������ne�����z���������}�S��_����t���O�)��������3|{���;f���'��ON�O��X�/{���8����_���/����9n���~��_����o������xs��?������5��z�'�����o_|{���������|���_}v��>��k������������������~���o~���~�������7�������_)��,�g!?{��g5>������{����_?{u���/����g�E<[����|���'�={�gz�����b{�������������c�8��Rn�?W��������������i��j���&���qJ�>�������_������_?�������wW~�����?���|���?�)z����W7_=?���ono~���.�9�_�<s���#L��w_����^<�y����=���H��������o�������7��%�_>������?��������������'����������o^>��v���w�?�������y���}���/��l���/�9>���?�������{������<�������������������G��q���%0�z�����a���ZM�~�.|�Zy������R�7�s��z5��k�����}.�L��e<�9?�m��Gy�W�7��9C������Z8���}.�A��W�<��Is�OfMOf�����>9������f8b�C��i���u������f��<YrpZ�,m"����p�P���|�oa��E�?u.��bV��^�Fyz$�3�:�9�)�����G�K�C���7��~h��tK,)-��;��]���%��z��i�����������\����c������������_1F�����#�K�!�rs�q�Yu"��c�,1�#�
�b.������wo�w�����2�4��}T��.��������L�%���*s�������o_��������?x.�r.��vG��K��e�:/�Z�}B�^������\h��J]|��_�F�^�����u��rs�j/���}��8�����\������;�r����0?h.7�\��>Z|�q�����w_�w����\���'������?+�����o��|�g������%^���o_}��7�Qs�x���z�.��\�{DR/�:�4�D�r1�%���}t����������^��rse.���\s+?r.��\�u�����������#�����=���'��B���l�������=�����>�m�����^}���gZ��_�������O�G�\> ������1����%]�Fw���������ce~���[��w��1��^���4���o�Cs�w}��k�a?|]���;����\���?=�r�R����y]��?�������7�r���>n���G<��_|s����_^��O�G���yp.v%����H�#����e=�}��s����oL����\s���?���e�e�\�wN����B��������|���F_>���?�g�������]����%]����~}������~��������zk��M�;������$��H~�����O�_�#d�^����&�3�G~���o�y�l�����7��}����_}�X��s�������\���������g�0��x}�}����X�r������/^�g��
=L�����+������g�?�\~\���]���Gr�]�{���Y<>�?�7z��q.;?����\��7z$W��u9��wM�����o����>bx��7o���2��8_}�'����O�\���e������G��q��/~����W��������������(���a��k:�3�w�>�W���1����L��|R�=���������������>}����WO/�YL�mD�]W�����I�������D�Q�2���B0e�7@���0SL������*
�\rK������)t�a�}l��U�aX�����dm����n����e���a��D���������R���6�Mh��HE����`?D@�����Yt	�kbr9.��u�v�6�����_?�������Ka�M�?�O����������G�����������������x�[.#<)�L�W�sI��R�a�$��1��1��O��OJ�s	�)3s�=��5}�\>6����d�����v�e�����h��:B��~��9�ef���d	(��m�����2�����T<��1
p����Og���tX�\j��|L��R��?������.���u�g���}�~�{a/m���<����r�����}�m�v�����1jn*���7�s�������k|�=��x�A�����s��k�6�\�������^����\�O��wa�?��}\����}.�������_���.���w��p�����\���?#��C�P?��}~��)���\M�����#�����[����������?��~}��������H~�Q�r���a|�O��O���a?����'�����'��>�O��w������||��#�����?s��K������H����������������%>�������S���,m�|��L7!�>�?�,�>��?F�h�#����ZO����CU�I�PkhR+��eX�6m�����aA�5a�Wo���-E��w��<~��pjU'����7J��'e���v��9i���:�w���o�v���|�(�f��9���j�^������/'mI������]Z�o�������l�#-����eKk�<g��Z�rJCRk����'�vj��0����djO(o�}��+:d��{��{��6��!=:�#r���|�Z�6���
�����/�
,���w�I�D��@���*{�V��	������OdC��#\�e�0k���y1Ux�:�T��Y7p�wl�N/�����Ne6�,N�����{
5[�|"L�O���*����V\����n9}bq�(WL.��6�-��J��[Nj�����\�����9jP�����[�����+�Sv����������[��Q]��D����rZG��2�V�4E���1��:0������R���H
�5?>v������
FX�,g���x�����s�|�����1n�V	�u�5>�1O�Z+�C��Y��5�-K�~����s�D�m�W����4����T����-'o�#����(n9%cq���	��rVG_9s��I�������@������t#����rq�[N=���wL�92��O��s��JEl=w�����7Jk�����'>'#toU���JN]�	�K����3��P��.m��s�r�/�c;-�YMn��<4����9*,G�6��Q�31�t��6������F��� 
X�[�,��^.�;���#�����O#���3���wD
���������:�Mvs���b3�>.��o|r�F�0�&m3�r�DWPj�������m��q���M
 Ih���w<�������
[�-�}s�����k}{�p����^Z�Gi{�����Dg�D����&����*�W�]�����7��n�'��:�CXGj���|��Q.M|���O�;������x���X��;�nj��IH�X��3>?���x�;Z7���ql����QV�-gtt�N�1	����]�N�Q"����`�D?.g�Y��R��q�8�-g��2�6/�:�rVJ��Hw/-��������yiy\��F���f��?^���v�+��Tr^nc�`��#;jE���DH&<���48>��4��kQeS�
���g����f�!*��%3]�
�bp�)L��P[
A��`N'���=8V}f��9r�����K{�r�����u�Y�%�Q�������<��B9 �N�U��,A��]�����9�}l(��&Hi0:N���Glp!
F�K9���=b_��c>jB��z�L�Z�@w��#��F�:2�k{3F7�>`���sJ���P+�w��v5���&�����v�*'	 L(lO9fct��k,�����L��]#�����17�&TK�V���G��WG-���e�*�q�E�|�PFG�K�������uX9�@E�YNgRa�X�!0FL�1�H�%��+�`Cc�{���z���dV}Hlx���s�6���nA��QL���C���a�/z[>�-��������R;���K�:��8�\������
�����@
�-��;�\j���t���T��L.ud��G�!�J@B��9�\V�>���cj��1��I������}�����%�d���.�A��y�N@=�e.��sp��(#��VV#f�jTA5��G��&K���vGm����f�f�-D��&k^��nk*"	!$��8sM���u��;��Z&�&����	���AB��L���m���,��Q9Xs[9����O���]��XP��YXm�r��&�%�!�r:m�eE���G��9�����*N��%����n�*hS��Z�K�����8��/�mI�P�s-
9Nb�Y
!0����o����r��+,v��-��<r�s����������s���/Q;��>������>V��v�W�u�6���U�o?L�u-��M
��<���C����@��,l��h�#�y�{^� p�?�p3|yZe�LT�M�H�y ��	 ������kNR��H�gAlL5_8���
��y���<7����la������T'�j��@����!��JT��`�{d�.�������/�'��TD���>��]�������SZ7����Wte�3����\A���L�Y7�V��9�T2.��2�X7�������.6���z��������A'���|E
�^P�J;�g�	����'T���S���J����?&4v�*gSV`�����u�1�LbA�R|,�Y�U[ �i'��JC�Q���V�v(�9/��?0}�3�������f�3������QLa��G�M,�����R��G�����7�<��M�&R���<�[�p�`�val�@��?��vz/�A�@�����D��w���K�I��%��\���q�����&"����Z���;��mLJ�+&T�t	?�+<�w�>�2~��=q�7�*����-����J ��\k�-wg9�t+��������%x�}>q��@$%��k�Jk$R�#/wc�z��������;�;�o����&�p����K�i�&��mR�o�������u���*+M6X��g���g���;�%��[�X&P/�t.�'�>��V ��t�-V\+����U�^����D$��X&4�+[�x��2�����J�[��\�"+���8��:�����z��^.�Qi*�����j2�A!����H������JU"=S�7���0��g%��
��MFol�!D��x�r�dV���BG�a#/�N(��bX�9�:�^bM�|��_�V��Bp�����e������t���A� uA�u��N2��{�ke�W��S/t����"�M�C�H��fk'l�G	<z��9��H%���#[]�t�����lB��X���3���
��l�%Y}�yB��t�
��K����-��F9)�G��nt}Y�I7J�������R��%�h��'%��C�y��z��\�M��%�h�'\�����h+�S��;-B�WF�����LA��=�������x��RD>W��������(�3����]ZS����b��G����3Or����6cd�P���_�
���X���
����$�m�g�������Mz�m��d|*6thkt���	Y�6l�
�8���f������A��������tq�����>3lhEW�!�N
�0�K�\)�������+%�.@]�aA�>d�_�L�H��L��k2�'A����U��dz1���U���m�pB����`z�|����+/?hV�,�L�l���L�g��F�L�g��&�
�G ��Pff����A�����
vGyxm�A	�����f�W<�1��!G>����1q@���B
���M���j	� �<3�iw���e��4m���WCh*l6/�l�3�d�������+��;���R��V�R #��#8��]�D���i-��N��C��}����TsK`���&��u����M	x����i�2�Aq"�X"7l�G�}��"��������T����$�����\�O;��$�=�R1���kb0��2�3�
MR�{X���~�
T~�$�����Z)�7v�.��M{[�q�4[�^���i��!��l��Z��k&��P�������������$��|IfNJIc�v�C������R)���N���=<1���;�32]��aB�����/���m������b��3�d�����%*@�6]� �����7,���\�iL�3hA���@\N�f�+�>��($����	m�P���1-��Z������$3�RHl���t��6xv���<&dO�*�/��
lze���h��]��N�$3���e�c,��+>F�����gl�����,W�G	t��[G�|������7������J��5���@��J��	U�J��#�U��<_����g%�{��pC��ms���@�kO�8&��$��@��dT��TiR��N�(Jp�e�������c�Q��
_��-�v���!�D
�0��nH��R�`.[��qk����J���
�i}����������%Lh[Ll�J�<����	�����t�����!}�����1��$�U�6���Q����/��m����R��>�O�5+���6m$6m�6�zB�O�#B�P��zV�nUw�'����>]��W�����]��v���r���4*x������j��$��V�.�W�cH�i��J��@����"OaB�3Y�����pC#e� ��l�LM������t��+_��V]7���wx��B�<�vm��30]�g����
3����x��v��i�uj�64���t}`, ���e������O�[E#�O������(��[��=iUe{5[�*1�b���^	��wC��mE0�w���>�wG�3�����iV<V�:��{���������{�
�^��A��k��]z��z�HUo�M{��&V��O��)�_�����o���g"�v���;.R���5�|�o���i����|z�2T�k�g�	���>]�
|����2p��>�[���v�v���&v�:.|�����,�jC�O�}����<�t����U��9��<�o�bT��(��}��0>z[1������^HS�����`��(��	y��������3�<���=!H@J��y��}td9m�.1�>��v��h�~1[|z%�
G����;nhf��h������O/���!Z�(I>�62�w�����=64v�)7�@)���m������.��@������x�%\�:��� B����P�����@|�UB������L���C_�����J��O���a�|��z��+���������e�N�i�}� OT��3��3�	�Ma�B���qx��WB&8�U�1<��;��W��M/7'k�C�&=*���2o-i�@R�K���NR�t��O�������"�3���RV�IR�������=��cWX������G���4��R2
/��	������l�&4��hE��P�#
x�1�7��#j���4�*�daP����Q�qfu|I��~��>&���-�?�|m�o/8(S��K�n��*����Qg}�0
T.�A�/y���&�v
�?��T�(�0����<=��_��.9���N��j@��\dc%��=�NO����j��F��>=�0�^
��2=�~��BH����W��O�5k�4p������{)\�d����7�/�En�O>m+z���v������sC��P;��A��/����n�����Cxzl�&���%�����V������<=�s��j��E������r)Tx�������^*O���G��K%��mW�4������&���T�Ge�����PJH�x�4�9�t�5A%{V0]���]�P!{:�#�.j[
P����)�tz�X���Q�]���/��s!e}5�	5��z)G�����B�c�������{��{������5�>P1oaW��1L��yT��C�cgE�)"�U5�YQ_m�o���~k��Z24�T:eN�����;��y��HW��L���GFl�(�h����AM��Z����G!3�6����z��M%S[�A�m9��c	e�2)K�g\|}&����wV��XR��&]�9	P-���1�r6P-�`��#7>�<�M������`�5V����N�(�����~�����k*@t�\��y�F��LEy���G+���z�Y��;~�<��:����������A�#LR?W~����4P�4�N�E�9�G����Q="��]E��M�S�p3*��;���9�uX/*���*����J�_V��/-7%���U�x�������J�T��<Q*�E�Tj����`��Q?�<�Z2l1����gEf�2�H��&�����;Ey%��y��
U!�����3||p�
O��m`����h���^������(����r=��I�J��s=*GG�dO���U�����QP�H"�z=��3XI���*z��y���+�,6�?o[�m��Ze�8�$bo����?�����]E�AN�Q�HT���[� f���j}J��
j�B�8T�����]K��rP��+���ah��c=�t�b?���:�,�����(����G���#PH[���!���{4J�"L��(�6
�I�^A�G_��w����p�������W���g��[�^E����Q�A[W�#n@Jzv��������#����Ay�w�������L�0�^P�����j�P����GED��zA��y[l��A��C��|���+��G����4�(]ZE�w�h�I�oR�hvd�jC�^9��P��=��4������n@�>������\�����w��J%U��RF�1�u���xf�R���-Y�~�v�W����B��2L�e�)����(@����Z�Ji��mg
�h�Y�Xy���|wA�Sg)�J/��p�������,�kBD�8+
a���G+���H������"?����}Hzt��p`�~�+��9P�y[���GydD}���px�G�C��+�a33S���O�5E~��D7}y�'o2u`W��i}�t��W�q��,T@&�R����6k�={�V����U��+��zi:���.��`���N�n���@0R���kBl ��v
L��G��.M�������3��%}�����s9�	����<9�U��@�=�#����]Q��,0���gp�mPH$��vE=y�|t)�+��9����A�i�Q�7��ym�R�w=��;��QCP�,n���W��M���;8�;��aC�{>]F/�Es���s�p�:�y�y����`+p]J����5�i�m�j�(�>�i�����zR���Ya%ol1���}�����K�giz�w�e��h�$<�;m�^
���]]Vuu��?D=�z��)��}�z�iPA_Q�}�@�.!��
9j�:�r�t(u����C��+�*�"�z�
D����mB���9����L��
IK���;1��
�IW����IrB&����'�>]�e���9u�w������2����3mN-�E2Yy�f9�mq�u}��c��Z�����vE=����8�RGlW�#1\$3������b�<S����z�V��L��jqT�3L���^y*�Q�����������������S��{+z���@�$�22�2v���NU�<��t����7v������o��J��]�V��}�zN���*�E=~U;e%C1��k��:�vkP�Jz�]h#~��mthV��LJ9-H��eH�&hr�*Q��������2T=:#���|�vU=��6�(m3oW�c���uO�|���9�C*JG,�������K��i�
����cO��@��{�J��%�lW��w�����L���� �>��Q ~�a��������QD�P�m�G��NU��{�z>������L���XL
�#0���y�� -�JW���$
���\U�f �N2]��<���N�w����s���T�l�B�Ja���<�����Q�E�N���h����+�>��{�-�]��4�����h������pU=F���[�k�2�W�Y�������fw"����YoD��F�L���}����-+^��������)����;n��A�B]l��Q��&�����O�_spl�g�	u3�z�sj�pU=v��hh�����u�a��-���++[��m/��Qb���Q�P��*6b7Uh�P��s|GTK�B�6.�u�p�����L$k���*���P �%98`fzR�B��r��(E��6����������zlu�U���9Y �u4�-��wi���R�A���G�������u�mTiQ%�uO�Y��da��@�{*�!/P_��nW��"\��g���GSP	<%m/n���*���]k�&��-�6'/O��hT<>Z��?�@��]U�-��[������QD%|���aOG3�6M�����W1��U���s�Z+�9�5!c��2��zA��[]����������w���)�g�C��yR�3�g^U�M��`
=���0�:y�SR��u����H�������{w�3a�R��W���>o���W�C���T=�w~�z@�Q����J���YQ�mo����U�_���
uS��m�J�4q�=�}P	i %���~E=z���VjV=��C���T��@���z�]�l��#W������U�#q�h�MV�5>�L�d]y�a�K����]U�J1�I������N��3�d��2K�7\S�����-���YZiU����$
��sJ��>�w34���d�=������1�3���f��:����
)���SM(T=�0D�<�8d��d���T�/u�:�h�|@��)�@����������P�J=����/+@��%{�z�O}����CRU����AW��h]x�� ��!��=�
 ��.&<�~�[�EyJ4�`��2�g}�J�U$��j�U������������0�)Ms�>�8P�&+Il���q����$x@�j��q�u��z0��a\�M����d���S�A,��zX����j��_Y�����J<SH���No,L����).��zl��M��g��l����Zw0bx�X���G�@����)z��cS����g�p�z��cwY���:}f������L!��+���a�P��W��2��]��aB��`�n�+��%!�)C�09*W��UHn>�jBWuzK8}��3���b�ZP/��Kb�!:���G��A�W�(�
�=��I�N{���#"���������8���YC��{��VK
�Sr�~���s�$mh�J����D^Z�~Z�d���*$�Jm�h�M�����s�A���*���Q4I9�t��a���E�����>�����?���NG-�b�5�9�to�~��ZaE7'@�^2���(�
������J|B�`P%T�s����I�m�S_���G���kU����U[c���5�:xtE���z��X�f��1��`��U���^N�������Gb�{N=�:5�,�o2�x����������#%�Gav��D{����W��R����T���2�,z z=�������G\X����<���~`�>[���s��zd*����*�,�)��O.��~0��D�q'W������G���xU
�=�QK��\/�@�M"�'���p��������us�Y�.�Yy����y�g!�M���>�����=�),�S'/���yU���.�2�<�s�����8&-L���|U����t��Y/R�;��o<�������R�)���k��&>�I�r��S�P��OYxX�u��j�Q����0,���Oc����9g���d�����@�>G��F�������q���RO��A#����5nN��IAq�y�����us��ili;�
����(^�cej9k����sV���q�f�i����o��|�5�=�G}N�������4���K�~��V��c��b:~h��.��.�P|?q��c9��H�a�z���2�����K��li.=��`rc��3Z|��x0%K������3z##��i�y��Y#�����r���&Hc&R6��rx[
t��s�ft����pE ���;��o��-kpy�mR���9r���.^���UI����WP���C�q� ����3�]�����`��c��������UA0G���;�yG�lN�F�2���m=#���wPyt�*r�2t����p�4\sR���N1���r*;�&BX�'���q�:uH�r@yy����+���'�-��Ol�rj9N;E�'vLPu�q���XQ���#��Hk*�:b�����rxw�U5�%����,$����+2z����F�����-T�|y�vO�OT��9%b�@;u��80�g�A��y�<�����].��SF�����F����<�W��zD�j���A�FCW����D���V��G�����D����L7�X�f���:N�@#��sA;R�(Nk�y0�1�
��Y���0���X�������*�����T�j��h����!)���Y�y<��Utov`V
���!�F��J��������d�����z��`��0HE<��!����HY�*b��G�B�v��[]a��z��@���}b�j�.#��x���}�7"���7B��
����`��i5�H�P2���\d�*n��!��]��)����Tg����71�=Mv�'.��x0�i����Q
�`�s�*Q�{��N��|�j�'��%�#0����CVzoF[�8r�����H����`�������B���q0����g�Y*�c��$�G'$��`����������g��7�$"h4�<[�b�;f
�Q����d�2��c9[�K���=��<�<����1^����V[2JG50&��<��d:�0�s�7��:qU��t���A|�1&��uy6�tK��:�����br����1C���������N��$�1��;
^��-��D�y0d/z:��.����C�����5��f�#5���rz��`�*����T�h��s�W=,��QT����<����p�����v�7g�#�z0�9k!d1�<%�����C����Q�%��1�pS1����Ceuk���:�H���t:�z�G�< ���Z�G(�t8�����M?���zn���T6��������yd	�A��&N�� ���5t�k,����#�"�������En���8{��D?�*?�
^�p1Ev^p��eM��0V�D��eZ�N�\Q#���#��is�u�0�,x�~������6Y����u$1��"���?9=���?K~�J�9�y.�2�65��6>��6?��P�9K9�����0{��$}f�K1�%^+��	���"SN��\3q�N�{hS�z0M_�$�=�6h}����G�k����.��>Yn����8��X����	a��:�z������G��^��,�����?�LN{>1�D�ov����,�Xp.���y'6=��~�fC9#����(�7�3�8��q#�'?C9#��*�4�F�������)!����p@���w��99��~E�g�v+�tF�J�z�SZp�3O��x��Z��)FN�[
O4��g�t������c������k��
� �5�1��\� �<��o�_����]7'����B�]��o ������k�����T����1ruc����ua��e0R�z�d�����.�������Y�����@���H9w���	�2�q����;n0sZeF*��y���#/���} ��j�=^Az�����<��d�&�#��3'�b��/?#�o]/��/�g<
	1���~�9�����/�-�"|��5��yx����"��Y���5�[��\	o���w��N���f2�����7X�`��
��5 �y���;k���9P������������_w���$PGQ,I�`�Y�����!d�����u��{�	��5�a��G4UG9��9��!�QY����9�+%��n��J�`�vd����G�����m!r��@{p����
�U��3>��*�Z��������A�G�$�W�x���V����8�G���wbe�B2�B(w_���[pQ��Q��Sf��bD��DX=9���)#��gh%�z��'g�K����y���8!��:��9X!Yr���	5ul������en4�[�5e�DB��5��R�����~9��������b��X���E��&���E�;Q��g5��G����F�	��Z��R��w���G�<|&m��s���^l��J_x����������q�5��>����P��� �q��4����?f
#����s����F�=���zc�g� (�����X�pf�K<�o:�n���d����Q��=�Imi��9u�3����V}�����L���Ei��(Ek��S�Jq����W'��Q�g�����{U<u��3yfC�D7���Q��S�7=IMa��F]�- ���@��SW]����W��F��y0��:�!A��������M��L����Vq���o\�P����s���kmK��@J�.LD:!���4C]J9���W���o�0�ib���g�p�oT�uA}�������,������S���_����k����aT9y������H�4^�.����3�'�X�?G]SS'@�����������������S6lSXRe��8J�MZ��)6x��j���"�������\�a�n�7W�Y*����[[Q�-GK��97���F56�$�L3��\��Y%FF��g����A��`��n;�(�.S���;U��fdC�JHk���-|[|���XQ��#DGp��``���yb�����M��%��{���M�������%�5
)��iA���^o���� -�
�Y�VU\�
mv�+�#�v{@��:��d�!��
gp��3t=���g1�z$v�e��64�D��
�
���������B��>���A��~[	�M]�����WA������]�R{pl�{�Z��Cl�s��#}��6�i	�%|}�A������3;�%t�Bo�W��s%,��	����Ko��V;���Qk�=u�y�k���-/��;��M=�!��9���"�GO�A��r��+H�T�k#���
��#-�Tk^�J���.;H��"��+�����>�x������6�-�������u���X���t�!�NJ�h�?�7�q>M� ke�������-��c	���;�
�dS�*��!���M0����yU�.j����A��m}�*��������~������6�>}B�;��9,�;I���
�L�t�wW
�k�a3����Q2VFS<�p'����3�8���C�8!��s��W��u�m��7�WEo�b�m?��7������6$�v[���x���J����5�}�M�L���6�$L��������Q����rh<�v�GkT��);�uN?���3�]6k�)����������D5���'��%>�-G�����(�S��YGa�-@�m����������g���J6W���[Vu~:�+�X_�s8K��w�o~�_Z���O?���~(�9��W�8�~�#��V�'�3+,��uf����[���}+�hd�����Zw��y3����@���Z'P�jUK8�uo�0_iU���{�0*^���GV����z>��O�#�@.��;v�������E?����^),��;�u~��(d������qt��d���{
��gz��\Gn�OB��i��C��P�h��Xy�!�`������#c������A���z��d���|�j��[l�W���A����V_�/��
�3
M>��,t;2�����C�cO���;`�d�r�h_��T�'_��P�����d�>;�T�
7��d�z��*!_%O=�!��������������!����Zyl��>N������`�����N96�p^
���-�����*��R�o{���@�+C���"������?��;H���_W�"���:'"����t@�������E��rY+�
�&N|Ov������x����(� ��G%q��Svp��3��"1�@����U;�
O88������A���C������ $vh~�N]I1 �g��G.�����%j���;�YI�3bR!�Q���@Q�8�?����,p�9����O]�6 �Ce�
4���6�����m[�_W^����Gv5�2l�WQ�L*�=�_QU�GMD,��@�|4sxg��R�p]����.s]ym
��
):Y�p�I��.������V>1�#����������a�i�B�
_%pjge7����%D>
�L:)�_���C�f�G_Tn���H��yf��;��+I�FR����[�����YB�c�3�E�+��������Q��x���:�_��+CHq<bX��L]w��'}���%�>�x{7���3���s%E�uo/���'����&�������k
+�z��)������������+�j����Z����{]�G�M����`	���q����
�ME�)��Z�+W<��9�ebe��O�Xj��Z��u*&U�6���;tpj��ey���I������)"j	WA��9O��F�S;%2���������$��ue����C��U
��]G@���E<��S���D4��P��z
�<b>�?�����p�9�f5���hV98��bR���,���[Bd�>������rp�Y��M�(��������6�f����5�[@E�_���r�)j�U���U-�����o.�S��B]���A��G��"�����S���A�R�;CT��zE�to����������j�P�)��Z�����N@�_iH���S7�{��<�E�g.l6�\U���F)]��v�V
���������7/)�V����i7�`t���*�<B���������VJc���R OP����]z�h��[eq��U�V;\��YA�Y9`���� �J�<}�����N�K��1�_S��h|�[b�%�*��i�
��#�^J��
V(���@J���g�Z���BZ9b�"���\�eC�j��nH����I�xf��llP���\q3��<�X��f��l���B�>HavG!����
�<��k���G����x��)3��� -W$d�~+����w���4�~�Ni�!F���rk�U��T�z ]
�n��NA�Z�{h������C7 �8 �6�z$RA�n�t��m4��u�;�u-�Fc��W��w��q���\��\Ue�1��+���@�2)��\�����R�1a��+�j��/��a^�d\J�2��W2�9[�!�^��k�:v
D[
��2E����o)�RK�*E5���9����h��RT��x�qT;VFE�p�*4l����.���8F���Ru�d���_��G�hk?
��������]��"7�+���������k����Z��t��o�U�L�)��4
�c�=mK���4�t��o���J�e��y�T�3)��pI����N���Oj�R^o((�q�#���p�L
���h0i�e�(DHP����
��T��g�,��M���h&df#��6
C-�l{�_��VlQ������:����:�7��^�;�Ttn���h�����Q��Uo|����;��&I�f7��g��,7����R�S������@U@?�Gv��*o�NF���A�k%7}@;��9�v��nzS@E"r!{d����.m������p�n���U�������=c���
v��+��B.��U]QK�����a���m�W�N�����*�a��v�4u2Xq!���(!BI�\u)W�zkK@���>X*��A�;)�mW/$s��#LZ������{��^�CV���*%������a�koG�7�"���DKYZ�*��>;��� W�G x`m�&4�u�mV�J�k�J��A��	!���Vb+�u������-qP���`�H<��a,�2X",%e�Uy�iG�Ho�Bv����p�	w�U�lF]
h;,���w`�k��d�Y�A0�C��d�m�
.���kj!������b��;��S���l��s���Q;�7�N��vP����h�Ua+;�v�<��UPt%���A��S�Qd�^��K�a��Zyb��+{�i?ak�������������hw�F�w���Z�f.� �����>�B,����f�j!%#N)��t��Zy#����f���w�Z���J�'���Z��"���a�j!��+d���RX���C���_�,�B�5�VM����r��������I��cC��;3/���yl�����p��Y��82;3~��!���g�<X@������k��B-����g�3��X/��s�c�n>p���������gA�O�W\s���!X�;S^�������CN���R�$�
"�j!���Q@�m�������z����2x�%���������`iR��t�)�`��mQ�*��=~�f��"G���=�B�]e�>�;��J�?��������{l�2������-�)l����9�
o{lc���e��x����kr�5'��2&����ZR�`aoK�T)�qY���_�����������VyVx#
]�3�%��N{t���}�m�^ O6���ew�����r���?<S�~��>M��UK�K�������r=��0g���p=�6���w�.���Z����_��	N���TZ��!�L���
3m'�[����c���{�*7;J��
G�i�����9�T��3Cq��;'��(����j��{]s$��������{@���\�+���of�+���QMXy���U0���mO<�u8��7
���	������o��:V�
M~ON�W���ZH{>s��2�T��UE�]j!�A����V1?��0��9;\B��c�z������Q��gr`=<���G���G�A	4j!{���*�����`�C��fXw������{��A�-m0����-B������6�-P;?pu��,��k\y�����T�fcPK8p�%�� �� ���K�����r}y0���lP���
W{�������6<��ew6n^��aC��PYU�W	&��|2N����!9�TA��D?H���v���Iiw��Z�&��m���68ua��|����S�����o`o�n������������	q�#698u�z
��9$
o{xEO�g
@�����j�/����Dvqj"�.>����A��������Svp��{���71YYw����cCe�T�xfcm�ws�.�cD+6���-Q'��K�����_��Q�H�v���Q#���r<�V�Q�v���zS�TW�xW��{x��.�6�^||ffzZ
�AX��^6*�>&e��E���(��
�Ay��8U"{�����X��@crD�����f��p?�B���q'�����A�.���F'R�alJF��T�I������������W������%�f]1Q?H�MFD���FCDt�j�7���avw����m�
��R	.lK�Qv�
���� ��:��@�*w�j�`�����T8���(�}��K�>�����im��^H�+�/�����Q������}5�w�.����nT���m���`Zo:��!V@5K���v�O���;&�.�'�6���3�%
�����(T���SST�Ya�9��&�W?t�Y�>�V^��G�
L��k �[\0
N�&T�<��=���^te�@�fQ����SC/��q�����28�#����T����_�����D�5�T;����KY?89C1���}bL��;e��T���1��5�����������O��� ��U�z���I|��"�5��J���_i�5����������n�0���m_%4g<q/(���
Mb��FW,��^�������m�-��Il%S4u��X������"���,�����<��B]���`�<}�{�*S�-�<���>����&�g���z>�o��z!�P�4SV;�9�m�<��U��^��u��VC/$Q��?�������b�rL��?�+IP�C�D�a��S�?cjt�L��T?4�j���3�r�`M��n�
�~�+u]C����u��(� �f�rq|���IV��;3��T��[�b,�$����������;4�FFW�6PF*�>����B�zw�j�^�I��H��V-�G_���i�
V���
R�[b�%��Xy�U�E7�2
�����; _9HuG;�f���F�7�@�'���3�B*;��e2��TO3�o{T�kh�$UF���1�y*f���0j���A��];h��k�Y���$7{��^�����&�j��:2� #69Hu�x_o���C5�[�@�n;�^	�� ���QR�jY��x��7$y��E6Eak�Z����@?N=�h�F�����UB&�����C�/N]Q��Y�6�d�}
��zp��5v"�P��z�����3N=�;���o��m
]��j��Y�2��j�3���UN=3�#��x����������j��NQ��������3��'l����������<�zUN������*�e+B���S���;�!w����9tV1X�Z�B��&q�3�
�������V�6�C�x��n�����A�$	�*���z���)e�{���C����E'1[�0�����;�#z��8�id;@���k�Px�]U�DM�
O�4h������eT#��,�A�=���:-�:T��A��35�%EU�P�n�
)���mW��h��6�����2�o9ze��A����y{^Y���[VG;P�Z�'U���i�4�(3X`�-�2[���<a�XVbN;`���{l�L���x�Jt}k� �$�B���z��*���)�^�
gy���<3��#�DM��/s��S���*�J�h��Mi��CS�%�^���4)\!2
-SI�-1f�}�@�B�v�;�'3VF`��w4�`F�k��\���<F������T�;�;�D6��T��zr4SC�0e+7Ac{B���j�:eX����r�b�T+�lA�l���D)���r �!�N1|CS�r 3���Q����`����*�'
~��4V�(@�1�-�@��
j5��-�@r������p������j��~��$e,6�����nX�	��r���	�J���G��<��-��KJs�������W�*�!�� �ge��[)AR&�N�*9�I��Q��� i2�TK�����O��79F�M�F����3,��RZ���U(�>B[�2~�,��*Xylg��$y66���aH��J ������<xH�z�5�!�:
P:Z�T��\A��(��*�@.A�J� �D%��F�]Y}��e�u��;��
���J]7"���s�m!����yB�p�������� H����r~Mxf�l�����W$���+��+BZ.F�b��	�)�����Bdt���CxP����f��*Z�i�n����Li�*_���^ib���r�k�'�=T�X��J�_X�����@��#�(hm/�)=J �Zqr��GX4k�Xy��WZ���9����;
�4�=HKn�B�
%��+�lR",��O�ub�Z���-A
%6J���3Cd��*%B
�Z����r4$��� �VI�
<�*Ao�������SF� A)�q��A�f1�t���������w���&Jw�gVdi�n��q�
5-��+b����9��r�h�1�;���^����.B�Z�����X�D��'�zEwr��)*ek��� ��lAF�����>���(U ����5U$�TQ7������bw�����9)���a�5���c���� �r49��/�����6s���AA�=���'��1�2���R��}�����*��2�/�g^Q�����_=h��B6�!�������	�R���\����0^c6�A�{���H:����M��Tq���~pk+Jj[����R\���fa�]��(��$�16q��H+h�e;��� ���>�FA�u���0��~P����������3������mC��!�q�g9���f��U?�u�\��������3HP���j���GF)��\ZG)S.���XF�s$��L��~A
�MA��� Hg<�*��<w����d���cC�� �WX�j�V6��J���������A���������v���N�4����<����Z�6�|k<�jm�Z�e(��u��Z��rQ���4}��Z/K�Y�L(���@�G�����x��!�|,AQ�~@�>���H��
A�-��"^ �Cp�����Y��3��Z_�$�8'G�q+�}��ce���)V��x���Y�Jq�g�����Y��2-�	�S4�-�����������$���z%����A��^����H�]Td���xHc���ZmScRy���~���I�r�:)�5�f�g�$��!���=�� [iC��q�� �u��~�-�I�S���3����I���]��u��������������� ����8��n{��w�YW2��<~h�H��1VA1�~pj���(����C?8�M�*����+Ib����n�����4�����&�TE�zW�&�������~)�lg/��r ��9��}�So1VED��m�!��=�^qF���B�H���	�pp�Q�2�(����sA�-��4��C`���zV���b��]9��'�I'��A��0��)=pjc���Cj@�.��l�G*��=b�"�k��Nm���n$P$��6�>��WA_%p�����f������|O6%C&��z��&����=6T9�dp�(0�+��v��
Or �P�8�!�tD���
�H�*��[���r Q&��iC�o������qB��+ok�j\WF��#^B|�oG��3`d����zr&q����MG�1�����z;��S�n8e}*������c� �[6�����2*��)w|���.C��9�����^�P<�w�&M}�88�W��
#�[i����n����#c��
z$�9�A��qp�������u��h��?�Ti�����x�b����+\]iR���������S1�r ��=g���d��+�w��os�������z��h/"�&2B�/"J��B��$G� ����y�]x�j���r �(�K�{N=��5��Uc�8�|D�������!\����n�b�C�����u8�g��8����;��*������!��26+��E����}-�Y�C#����v�#�X!?��T����6�
�D7o������������������A
���r ���)I��r �]gQM���r <e�U���S�����0Yc���!/c!����(,��F'�;O���r�q�O#4VF�P�hc������v��x<zk��<Z�|����/������3jp?+�];�5N�E����n��4/�������]��
�zVv�����2�23���/�}������S���J[=q�'�C���<e���y����GGk���L�������4N=�P"��n�V?��C�����9�GJ���!�0������l���H��|<��}#p��JxA��!6	��q8}"K����?���A��S�qp�1AFo	���E���U@o�9�-���������S����k�88u�q�v= 6����.[)-,�}���z�Pg/=�7o���AVa G2>��e��`3s�W	?TQQ����
�z��X��: y���D7(O��	!�9F�9���������F�G���
�����h��s�������4��jQ�NX�H;2��
�(�RU��B�
��q��������&�(m�Z��G��4�a����7
FG�#"����s+�����9������!v
]�z �J�c�)��@!�@~7��b`K	1�A���Z�:L���4�q��\J�����1)�c�:��te
pD�;`�z�%�Hc�6������u��*��W�u��6���.=��[V�p�kx$�����x�
�:���W���U����v����R����C��A�s�3"%�+lTha�:�d�j����ti��Vy���h�Mx�c��o����deW����RF�O�sgH��=������5n�!
R6��;,�x�4!]Z��pXAA�Gn��UJzU��4��\�ex���Ik�3�AJ&��V��t���/��s����YlR��s�L���3�A~F����U����v�$z�t-7]z���66���v��cKy�3�ouR�pZ���]��~�8%<�e���LC�Y�3���mB���5�]���g���%��;�������K��q �a�H7��-M�;X*���-9�07tZ������'L:�rZ���L��*`�i�������,J1�i����t�_R~����t#����(�5�|(����G'�^G�C��-zR-F+D�J��D%xZTC\mV��&4� �='��
n��a����}d%S'D��E�;��;���<�h+>�k�����B!y�P_F�&	3�B�#qc�
�!b�!�,RK�#�K�z�*��/.i[���	Q�xj���sM
�Ii���s_Q��F��h�]*xhB����Y�#e���9,
G�i�u�SN�]+x�Q.�D��H�?5�BR5�aK���V��d�"��jgK���3@���;%��|�8�!R��6OA�{��b!e��F����}�B�B�3�B�=cz�	AK����+�����������b!3�B&-(��4/����L�id5W,d"Yux`�3H��)�w(�����p�$�yI��+Q8@������HG��H'����[�V[\�+����&��S Yy5���<3��,$.�R�p;=M����n�=�D�N����Ub�^�:�`C�Bh]W�9��d��Sdk���LP$l��oS<��G�H"���/6?�m�N����$l��l'C�]��������b��xfD�X���-��
�	�|`��&�	����Zs�v9��������`=gj3fEt����B�z��W�'��*����s3L���rLQ��m�	��m�w�R3�������GM�!b:C,��_��7�<�R��`�m��3�������'�mh)�=�X:,�C��U+�,,����"��a�	+O�������B�Z+�
9u�XV���Zylh����	;T�.�`�s��5���iI�t]����(����tM����2�C��F�=�3���C���_��h����P��*8to�=�A��b��o{��9��=�����e�L�[E��^���(QGd,���_��MSQEQ����=�=s�@�T����S��G������4pK�)�u�KO,:
C{�MWR ��O��N�U����"�M��<�����$=�2��������3�����S&G���5��c�(�^����Q]�r��-���#�TW�I��W�1'5mU������Y��e�����q�q��)]���v�Z�,1u���{P� �Ogv)`��HJe$���Z��%�_� 
���Z����l�
�da �Hxh��x
Gd�5��������$��aK6Q5�L?5li��$��g^�V[�����������
���x�"�N��Z�����aL��T� �`�����te��N�p����&g6��tu�w�C+��iI����V�H�j���g����/��Tk���-������nS��i����������=lLH|��v���������r��i�����[�c��fO��u
0����|m��H�a�h�:O�4s)����)������C�:=�
<�a��X��Z������.!����Y�p��������J
4;{eU$����Qg�^��:\�-�m��a��H���^�XAu"�wm�����a������I�>���&��5�I�����k���2+
$2w/�v�(�R)jK=���j��YhS��E�z�����l�\�#�� ��Z'y_w��%uVl�)C���/u��Q(O�~i�� ���u���xj���_�������A"Il�aM|�i�4���Z�g%�Pwx�1m9m��sd����=)N��	������.
~)�����:�BGM�?�����j����!4��:���HG��AF��f�����#��)xdC\G���4mUM	�����`����I�0>
��y�$������kK�p:������L������{=�������3?=B�ym�cv�sZ�q.��%�#��zZ/�{�u`�j�,�A�.�=R�$r�v��������R�V�S����)~$m�%�;o�P��PkV����#�Z��m�_K�#'R���6lSXSu\���E%���?���P���{\O����8����P��2�$��_!���u/�{�Y�Kb�o5,��Jf������~�K�66-��F7]z=S�"�����.�N4���_�q2_�tj���L��-����$�u	s��H����z��H1��W�:��U�f_�8rx� q��p�S2%��K��(39�Th�����sm�o��
������y*����y��*#uf�����
=$�R�zH���R\�����o�Fk����7�.
��vM����Ra��r�-u���Y�^����w
P=��!�����w��Yn��$��yA�F�M�~��������n���BZ^�~-[x�=�X���c�n��Ljl��?�kM	$�gFU�o��;/9��^���i�O����@�mWF���B�2��~L�z�0��d����]\;E+6���`x���,0_*�V��_�y{��TnV�K�u�_�?L�Y ��	K�%W;���{�A�N���U��9��XY����J�����7^�4����0��/���SK|�If��S�1���j�|���5�c,�p��������/0���L�N~����
�gs5���m3$�#�WOz��}Sx����m��
bH�%��1���[�@L����j������#�����v�C�'Pc
���aL*+�F��~����m�'��^�W�[q\���]�s\�1>k�&O��
�T��1&��H���J[X��t�8L������
F�2�hD�A���F�+3���Z��������v^����V���'{ �n
��l-A=��{p�������@�w����
:w9r2��Z���s�$�o*��m�h]���VA�����&�����V�Xiu8�m����^��������nSC;�A�����u���x���3��0�G6�+(��MW���3�(���3�^���F�@+ds���2�����l9�gL��z��[LI��E���/�'�����F��&���G�a�{~�"�NF���{'S�at���Tj����B�mq��N��%>N�����5��Q=x�6Y����������%��L�����S�����\9x�����5���P������[��8��������aK�����u��vE[�v�C�XK��+��wIHS����0����[���S���������Iz�A\[�w������,��c�znJ������>�5p���~���4<��[���������a�*�
���V�K��6�:o�B��l��40�)����\��nC_o���z�sb�P��N�,���mO��4�&s����m�
�
jzZ�(�V
�sSy�����[����S,������W��A����
�#���	��|�)������C0���T>�G���qK�e�8j��SS�|�X��d������0o�HS��.��m�dko����G1���>�N`ig��aL����GO�a�i�NS�h���Kh�T3�����:��[�YW�� ���0����6�wMOJh��9�E80ce�K����@�i����w������U�(X�^���� ���T�"r�R�E��s�N��4�����������~����b?���M�\
��4���=@�J��P�%$|���5
���;L��1�A���������������-��i��B
��v#�������{��#������5����I��O
[��kM�"�W�{	��7U��Z:�����m%C�� \	���P����N��J��x�(D����`f���)d^.���'n��]�[��I�.�������m*��<����t4�2o������P	�����*��V��i��3;t''AR���L��"&������'a���<s���1m��@]b������+�h�N���� �p�������#�PL�b���,��~I����$i�L�|l8��$Y�c�)��$��W��
+�������m�@���J�8�6���mj���,���e`�F���w��������:��l��
�����[��+��i��x�
��X���"&��!��^ih%&I�v|��i�%�{@�i?����i���}&,*a�|����&T�$h("&����M�y��>i�	�7��w�$&m�{L�l����F��1I��i4���$Y�H�LN���$k�����
1�T��m���$�9�mc�k��X�$_+�
JLhNWE�,&If~O�j����XS���H��?6d;+���S��Jv�L�ne��S$�qv����{�\N����YL��B�:���'|���QB�:���cO�������2;p���<S�l;h����6��0{"��q��[;X��D�:�����B�{O���B�K�������uM
!o`�����K�����$�|�k����S���@$e�3&�9��8�E���I�3qo�L,&InUVL��3�nC�$Q��o�6;����\��Gz;���g�H'Y]�o�I��?� '���l��Lf�_����
{�\���8N����|��I����oBr���o{����/�X��+v�k�v�s��>���d
�mXE��P����n�Nho{�2'}�����C���s�$vV���$��8�
`��K�����y6bM�I����~m�� m�g�Q�����I��8���S����MZ�V�������J���p�b1I2��Z6���u��
�S��b1Ir7���-D��M�$���z�Y���Zy�P���B�2c�ZOg��M�M;���d�R\������O
�`ov�u�:�7ueS"��z:%����6�(�vN����$�=|Io��U���W��$!u����t�����
=�>g���zE\�&VI$��t����H��

��V=��D����!!�b�����F����$�3���:&Ivq4�<���{����:��XL��
#�����rn�	Fm�*��O|�x��nN='o$�.U���L��lT���$q����,T��V�p��Y����� �I�
#���&j���-y&����R�U��u"Q���R]�~�:���R���=�eD���e1�V�udQ�&�
!��"�&9�[�8�m��V�R��V����[n(���4j�������g��oz���1��!��Y�Y���a�m&�
�m���N�Y������c�G�b�G�w��#gqEv�������gQ��cG�N^LA��:���������4���>�D�P���c��4g���|�����R@�d�3�U/D���Ej�_�||�Y�M/��N0p�qM���B~�\���u+���0�]���!>���b�}�@�6�������CZ�=�!���N��o
��h	�C�$9cl�3m��#I�qlU�x����
 Z�+F��q�oR��u��o�p��L#w���>R_����h�~F�3
�5�i����d�5��`c'#��1D��<D��3��*_-X����P�������V����D,�1��_;D�9�����X���k���_	��(HI
���D�D,��J��<_��b�%�~E�'��(w����8�I��S�L��T����;,	q�;
*�z��;|u�;C�}dT��i���&��9?M������W����dyK�-=�������w��7N����[�8u�����~�����:�Su�]�����_{c�P��fw���?��]|e�'gXz���'��p�Q�N_�L��=�R�q!�!�l��t��P�
��!B}0G K�v�H�H�\����~�JW*�	�����.�{�����b�{��:{-
Y���3�kx�]����������������E�M0�WX�,�0�Q�:��=Q]jOU�s��� �x���<4��!U����-�]_���)l�WR��4���G�G�p�@������+h�D<tH�D�,���^���`i�p\q����85J�+�����T�V����C���[Y��8L���m�+�BC�.]�u�[B�����p��wB����&W�{3�0��Rs��W���g^8=�����i����2~kSe���;����.J?��wXz5%9i��
G:$k��IfXw$�� 
�u���"�I�)���2GS��a��[�+����w�O�L<,a�uM����m������P�'���>t��
��qe���wO{���;!I�W������ ��2�O-�1%`.!C�3S��
�z����<~��,���I&�a��@_�C��<i�� ��!C�b�%2U��)�M��y8�W�{��`�[N��7c��$��\�j�����Sa[�M���=S��N���WF���;�������_�j�n��N�2��h��_����������(����
���M(4~���W�{��T����s3�-qN�s�����5�9q
�������q��rt}^���XK�t��E��|�vp����A������?��`�����JgM�@�L��D)�g#�!OJ��G��t����sS�!�����p2;
�L��{���x*����tmi��B��W���-O�-��te���h�<�ze�7J����+�v��+��(�g��<�p�r�N���n��*$[������?�JB6hUE�_
�;�v��;�k���[��G��`�O����:5P����[.1�][����C��1k��^
@j��4m��2�#����y`���g��82�5��?���0����q�=���u�
��=��|048����h:�B4��
>��m�ti(�cxe�$��&�S�C\�Bf���U�wKT��1)�Y?D���QX�Q(�����W01-��#]z�:�{}
�6�<]���An���{�����Q����R)��2w���m����i����c&�C�����W���p��-i�J���1��n�~`�U�����on$��e����M?�=�=qXy�N�x�I��C|��[�$(
���+�s	x��+�
�JF;S��~�o���|�z�����4�k��"K��6�(�Z/�=X�"~aE/����@(
�Q���[�_I
�j��������Va,��W�@��f�Uc���=d���I����+�hk����)�t��.����tDf���s���q��W��B�Ne��w���.{�����lt����%$�2[���l?$o����@z*^�� I,/���t�z{5���������j<iW�ID�qm	��j9�D����8.����p�5�c!{�m���@F�G�3���YA�9��v�'�3W�4��A�`oz���X1�[��)��b�e>7<������}9/:r�,Y����1��Az�B�����;7����V.*9oA� Y�K�P$�G7O� ��[��\���du�U�Zgy���@��rt��oj��m� R"�C�%m��}�g����5��:�cs���y H����d�g�rj[���'���/�=��mMK4+��J�t���/h���p����!%�0����������%����`~O�Z H�2>�(�i� �~M��P��f�Pc2"N$��-m��������(�A��q��r�@��N
O��A���u�0��kH HVC{�#@��#w�v%	0!�bnw��
#��:�P����i�ujQ23�T��@�4@\��9�@��=^�*�:BH�����QL�6��aC���#�0�j<$39��Ho�m�So�Y:�� ��Av���� �H�@t}z H�����h�"�q��SgH��_����m(
��;�-����cCu*�xH�z�� i����Ui4~$,�Wb3�(j�I������[����t�'��A2����F��"��#�avQ�3�v;��p�E<;�^<$�.$��:|�<x0H��]
	U�|K�-�sb��X^|K�c���U���=����T�|\��FT�=(�T���~=V�����������2py���$g'����^�v�)h��#m{:z0�A������L�m�cSK���s�������������H�����N�t=���Hb�0C[)�����Y��vl�#;�X��B2�Wh �`����VsP�",�3F�h�(����F������M����i�����6Wj}�,��o�zfz��[����WM�b �����!Y)�ZK�C��mC���
#��	���4}��_9�,�X�_$X����U���&:�����E�Y+vM�'�Q���l	���O$������IO^�%�*����	
��V�g�@$����g��4p@����l�U�V�N$�����������v�pK
I����u�`��[��]�C!�J�u_f
#�����J-��,��_�o��jO��z�Lt��y@H����h5�^�����G�s���38�B��x	�:�3�!��Lh����9p��82�!���x����I�s���gy H&C%����3�����
E4���C H�s{�6i�D��}����� �d�����9gqN�H�n+���1r�Pe=��-�����t�����3`Hl�vY��r�V���cW?�Xx��,$3!��Q�}�#;],K}�
�+����<$���4$fQi>�����M"zK�A�d.���J����9�"� Q�6��H7��~$���B#��[�60�*2��tdp�2_�����G��kc����V�'v������y���l�XQ�Y3y�9�?�&J��F�O=���U��g(6����&�:X�8��SQ[����:^8���^K�$�I��O�h��3����*"�L��:4RB������?E�Fa��[�%d:4�)��cCps}�c�?(������dA��a~������U%Xp
�~�o��6eU��������8���@d�����>�Z�������Gv��^/��}�|�%#�f2_C��h��a�Y�O����m��d��]tA
	��M�o�zNF�.b�>R��EWm���x�!F\�N/?j#��{�����H�-����{��?�r������{w?�������\O��v���]��V&=�U�.Q��j��#;��=�IM+������S,���	�!���18���G~����������G�%��8�	j���me"�t�9�i�z|�8�������7 ;^8�#W��%4��GFA��BVF:_��-�5�D��J�$�u�q��"�-�n&8Y�
�f{�����7�a�����R��X�bvEE�
��.r��-"=���������A�H��H]'#�@������/=�6��U�zd�&Uq�������P)c��#;�j|*^^@db��w�������#;M����J�0�L���ATT�&Uvz���B����~�X�[��f&�5U�~�RG�
�����a+�q��Q�8c�n����Z7����";���DDK��#-H`A
��zdb���_149>�������t��q�*�;�9�n��i���Nl.������Z��AR1��L����_Cq
�Hj�P��~n������'tyK���-�M�G������E�H?���T�gV+�F6vG����u��5���69�����75��_��t�L�l�R��#���mt�^�����[T��[���ke�#�g�!�����k;��j �a�k��|$���N�@2����(�����)�#�h�u�����8�s��������8��]���z�n�kY�:M���LX[v��(k�e�rw]1�&��
FF���6�%��{������P�T��cJ���������W'u
����������*�� �~�#�:0�w�ah����t���0������q����g�����ks�~-k���K�������w6�"GA������f�q�8��I�'r
\�~��@C��e�#[�D��R�q����~��Y�(�Q�Wdl��\t3G�^e���q�����;�d-z�G��q�]�C
�Q�a�H���0S���7���*lL�B(�8��4p��U���[��M[+=KPh����{���j���R�*��w^)@$Y���aJ�~	,<D@4Om��_�tu�/�wY�\M���Zc
��GV���V�#C�/�3���0� ��1��	�4������u��c6"g{SJ9c�j���Il�K��#4��t?k�P���w��Y��
C��"W����dJ���[��_�0�W�^K���N<�I+<�� �nes;�f��`��������2L���i����J���;;��G�ZS��?����4��o���.�,���[���C)������k�����/�Q���^zyA��F
�r�F�G��H�4���F��D6�� ����������m+�}��R��r{�I�����
~g�y�	��k25�cy�4Jt��wxg[��U��y��Z�*[�~�(H�Lx��:B��ed����������+*�6r��U�#r�>�'���G
K��0���d���q5������0��5�jw��#Pc������pE��b!�jl;�L��?����[a��� ����#���"���t�#%�dc��C����"n�:���aJ�����P������1�OR�b�r����~��m(��r�L\��%��?��&hY	C���31S�����x�pE7�pG���-B�G{�BZ�����m8�I�j����D+W��~��L���^#{W����y�#�:��>V*��t������H���d��&��[�|�q��5���pC�x��U�z�
���������u	be��mpG�70]u(�����^�
�����+[�33[�b�wd�	�+v0��;RHa`�-�������%\$���-o�>�L:��N�n/]Sy�pA5���p �n��ou�O��������n�D"���U:F3���d}.����L����p&F3b@%3"�>���&��	������1q!@}���v���w�/������`�S�����*�@��/����'������Um
i�Yy��[�n5c�f���6��Qe{QR��P�����B2��%�������35�G���Ha�t�!V6�;R�`V�Vf�����=Y�����&V�`�����H�.����&�3�>�z:J����T���XE���ob���E��0\�s���D+�����V��J��{�`�)���m���MO�a �?�����,�nm�]X	:;r�czz�;���4�q��=;�m�U��]��Y��~�o6��H!�x�@S�����}4&��sK�?�	�C�'klxv��\�2�����I��F���$��*s{�#5�����gn��(�H�n�:��e�R�&�D���,_C����v��Q���%� kf���3yi�w��X�V����?�f���A�<?��M[4L��J5!�<<�l�sf�Z���VBs���7��DuD�p����u�W��q��2�3��>w���xe{����5�����:��������Hg��k}b��D��z���0�0����
��=Jt�AI��:����l��q�T�7:�[���r�n$h���Hf����:�m���=:[q�����X�'�!(Kz��[�K����d����l��YH��������T0bHD����X����Z
��X?s�I����z�����FO,�����E��$�
�PG
<�
Ugp�3��{#����Y�������V?GSg�5[Z�mwv�������h	{�L9v*�2�T�le���4s+��[�>3l(�0���������,]Si�`n�N=V�����V���������������uj[����O6xFc����l�6�v��u�V�%�)�{FUmFv�_�����:b�$vjRXeAi,A����+[�����W�4�bn�z�&�Z�E�n[���N�.�ezj}A1���f�<�yA��:��Iud�L��3z0�A��I���|�mC+��}~O�=�[�W�$oWMSJgPG���E���a�������V�'l���6�y�P��'�#�*l��V�DD�`�������mC��f�FF%mg%8�wq�	��U�w=6:���>1���S��,�����Z����������0c��#Yk�{B�~L)�Mm��x�mD>'�,8@!����lj�{b���y_V'!2MYjK��j����q5+��8��4�������������Z�w�q�����Cf����h6��f�z(e^��������bn���������O�\������t�e�=#c+��m��fMGF��s�@�[�K}�3t�E��n��0���]�Y�PB�:��^��7~�����Odl_�.GC���F�
Y�5KS?j�]?�;s(>�%m�z�Nt�i�u�=�`�2R{	��_�/��0����,�&C*�H7(j��s]��=�8�YSPk��	�t�C�0��������C�h�f��9�.�3w��#����/_���J�Fv��2�W9��Rj?��"�x@x_�M%��Su�b/F���PFj�,"���-�HM�P|��f(#�A��|��/laIdy�!�s�32$G�$&�%<3�i|�o�e���_��4�v�[������k�d6C�[�>:�H��a�u�fo�\��}��5X�����s
=�ff4��5m���cGl�|��f���9�����|�~AF���O�������dM�%h�{ #K���J'^G�������36��]i5z=��^���\�O��:��������k5����^M�����������jj!a/QX������ ��Bw��
~����J(�������x���H����
CZb�V�4���/;���'�~� ��Y�0|����"�:�\Y��z0�#�:
���H��Gb�*0�l
"r�0��q &&�p�\�U�Rs*$�qb$�F�:O� 1�3��Z�G��%k*ld�|���.0&�v-�W�B�0����.��z#~YC��;i6�]`��A��k3	Ce����C��(�u������U�2�SXu��0��W���V���a����?��7a�/L	�C��Y��.DFr&I��&�>t#k������C�b)�=!W}�C�DL�n�}�t~�����E�6���l���HLy�^q���	���������S*������aJ���C�~��H���d��.� �<�]	p����5#ya��	mhS���M���X��0���xa��ad9ZZO�� �<�
A�&2R14���!�" ^$6���Z����j�
k�m`�pGN�#�8��$Yr����Tj��W>Nh�F[�@xN�W����kEZ��s�!�'�2i�!i����U�a�`��8K��D����=C�n�=7h�C���C�A�'���� S��M��S�_)��
�Mo.�N��4���I�
�_����Z���������U`����������]���
:M9B&W8
�Rmxh����td�^�s�����b�.O�6��ol��P�Bw[��1s�\��u����v3�����?���^
�q����C���Um��E��v�"���"P"�H��|���M7�Ha���t$e�X.gWb�b%���)�]���h8U�������C)4��q�y?���0��Y��s��q�"K�P-�I�������A�6u�o�{�����7��n������I��Zp�E\+qj�Zqx�����}�,C}���<�"�S�T�Tt���
����53T��A'�r��f�W����1D�@��-\$�����c�
V`�"S�vk��H��;`xRC�9�"K���]B?���t���\dd*�n#K��r�|��
��$:u,��� �H�!p|��j	�������NP!a�]$����4+4���E>�&�Os�&W��a��z`����
 �
#<x���2��:�Rw�C��@w���C�Rw�	�mv�{���B�.���e~U��TVO
�u����/�� �Y�W�^�t$���R��bO:x�B���&���Ha���n��|�"��(�Y�T0��E*D#*����,�^��U�\�E�bE����.�����)���n��"m{j�p�7,��s��Y�M���G9�����T;~��������z0�E�p3,T����n������U*`��D�g�+�U�/�a9���]����$
�,�	b����-r��

oBX���"KC���9��=/R��� !�V�x���X+�����U�J�Ej�����3���k��o�fx�����`�?}��������
���B�n�3�j�s^d9��;�a����k(X���2�P�E��,���j���<�mC�m�6����\��L���mC���T��t����$�o���u���`����=�{:hd�	��IHNdg��[����,�-����f��p0�P����mj2}]�R�o5r����z�3;�%xQK���(�3��u\e��=���
�Za�=���d�1CQC?�y��an%�����a/WukJ�E2�#&�	��r�"�n�����#�?T
���u�&��\*��t�������_y���C�|�l)�O^��VRd�320G
y�s|A%��A8?�:
������	b���{��������+e��
������g��!@�/x���%�"���y�Cx�]
Y��u �<�r���,c ;%���D����c�����!��`	���%�"-#����2�U�
9���k�J0%�"nD	G���g����VB�H�D�xoD�\%��m�9lh����6�J$��Av�����Any�*s�&[����gA��������0C�gd`'��$q���tr<Gff��+���l�,��<�x�A5b����wF��%1�B�3����6OA�8�-\�����\�����pBn���+�<m���GG=O	��3r2�&�{Y	��`5�`I��w�������D��x(r�`E]O�-\����s�S0�"���)t�J�}���~_��O�A��r&U���q��um����X6)��L�Vv��x���qa12l�����i�l�z�ZJ��`8����^
�BB���'�f�����Hj��'P�z%�"�a�{_�E��ug����W���������[�CP����^d�B���/���-V�UU��+{`ng�-�'u0|�-U�B���bI�OZ_�EV*���
w���\���{v�:����4h����n��f�������`_��'���������n���[G�E�(����g�R�|a�.',!��v������r�"�����0"{y��I��~��S�P��x����1��w��������UL�����s�:c��m�
Uz���=,��WQu%8#�
5fv:o�Iw���8���v��d����Ut������SW�IS:\����3�"���IV�.�D�$����i��o/b���x$�x�6X�{U��x��#�`��g�>T��tX�i�^�JB��6�g��H�{�~���\��,0��$�"�h_'N+;�"��N�`dp!�����~�Ca}�����u���z��g�uj_I$H��)%"�"�� �	�=���S7�H�����y�H���:UN{D�l��-(��>�
#�

j������[�E��+G�]1�f��������[�\H0�`
�6����	��6���:
p���P�S��K����U��9h���*���u({��l�����,N���EV)�R��G~�m�zN���n��b�-��nBJ��Ptr���
����EI]8���42�����,Y8���jw������q���7�EG�E�]ok'�z��i�����F����`�tOQ���V�����:��I]GF����'���~�E�u�
����n�I���]�t���r"�
���/s�"���
')����Eru�Vv PK:x�Rp�b�.J|��E~�1Vh��O��]��_����H��iRn�Y�Eq��>���UI����������T�������c��� ���u����Qj���Ym�?��]*~�n���_�1Vc��uK������1�����Y�H�C�z�56?P��^t2$Pf`�.�{��eC�plJ�����l��He�k�����E�@��F�^���E��V?o��Ak���^���pX\dq����	�����i;9�6���J����V����-5]5	|::V�4`��������7�`����;:�v$zw����R�&N6�,����A� 0Gq���u%Bz!�\�:R3��o�J��k�����E��d�}�����9hb�>@*9ch�RgN�U^�y_�����C�j%����<N7�N7�!�'<h�A�T����������!�
����"�Y)����UD�`�,�080��t7`���qj��Mu�.F����f,��*E�fI�q8V4}]5.�ZJ�����p��
��dy�~������OC�.�.�� ����t�"q�"��s�9{�S2��+����Z&
������F��O�������u�V�#��z����D����h�W�M��*� �Z�E�j?���;p�\pk���h.��)�
^�^bp���g��v�"+���	�	G���,�e�l:��f�;��[��[�ER��!����s���������]���x?�����o�Y�,��pS=h�
�����v���������=�Q6�dP�6���,! 7�k��������B��5�����������<����LP�^/������u�v��C�:W���#w�Z5����0���{$����U��EV���:O�#n��P�3����]�|j%/	�������!;'k|���=*/��!����2W+�IoEuk�nY�
q���Rw�j��IC�Wo�5�}�]_�}��tXld���d�����q�W&,��s��������jSC�������\����I	���1���4�����j�I��`��0��&|����r
�;7*FbR�S���_( ���Il���
X9/��m���z�z��t
+^7dJ��������DJR��b����R�G��G��k(��qbXXR��F/��F������������J��7�"�������[���p*��}�������B��s�v�(�)Av��1r�`O�>0�_$5��mW1�_d���pK� ��1vs�v��5�"�]���U����u?��2
�fs|����m��w�Q&
5�����8C<�N��P��$w��:����&�t�D!%��P
��z��L�"d�S��t���8���z���[nc6����E����d'��T���l����X��x�>�J� u�3T������l��B�����
A���m��>�~��^�SB�%*p���[S��s&�>�12�"+^Y��������,es�P�qtc	�;/��ByP�j\=��6w�L@�>kH�y��ZJ8.�[��9��2F����R��#�i��F�vk�t'�7�G�Jw5��/�C�-(��[��h����>���rud�EL��rc��'[��,��IV��=����:�����Un�<�W��l��Un����e��5M��.�;�"&<s����UT�=C����,+M�#8a.��o�mU�Bk�E���Y��JT�=p=��5������Z*�yfc����Y�����,�R:p�d�>3-^��p��H���
'0��G���g���P���]K���7cn��v-�e'� �].�J]dO 
�q5�"F�
�'v���tM~]#����1�����>nY�wGy����U�>��K��#�
����X����E�> �TYd�����,�@&���U���w�D�-i���U@
�������ZY*��*��l�������S��������`-��{na�z��a�D�J�Z�EVQ����2����V�Y}&�
Tk[�����eD�GF:h�Z������l��
��
_����P�RFd��b,�l[�6�������,zwn[���
>���
F-�"�������6d�{��G��o��{$�����.���9�$��;k�-d�0Eo�/g�@��tOhi�k�`���[�EV^��*lU�
��_�C����\d%�)���m�[�E��r;���KW��[��Z�E��w��F1P���gz���]V�MG�����`(�U��r=VmE�Z+���V��2K��=�
�����b�k$g 9���{FFAvB;�����Z�E���4�*-�"n�������p���.����������j�����{��\�
WG������zG���U��\�g�&����kF������w�����l��������eK��g\da�u�F�V��R�
�{�����U���y��0L�'Z�E�i������j=�f������6� �����\��.����gv�E��S���F[�Ef'p��1����#7�I8�
ek��<���p��d�U�������U�F�����������'l�zd���b���������U&n[����o5wrAq��B�~
��U��s6T2�����!&9_�g�-U��r����m+��%�N��7���P=�@��e��"�BJ��)�������g�H~f�i"|x
�H��*��(a���3��UK8p��.��FC���g��"���iI,�����z��.�'vj��#�"�8�	��'l�����ABZXm��6����;/���������a��,�����f(���X��I��S��rO���g"��6��+m���X����i�{�$��h����*��u��{��
w����T����f�H���y����#	�:��>p�i�tp[���������{�D��pD��o��3��`��%\d�7���M�oY���U@��-�"=�r�Cun�����o����o�[-[��+��:��O������/������K��T��}���j����N=��%��b�[��o�#����\�/�zUG�
��DqV��.|H�h��#p�:�����)�!8�s���mh2��o��o��W����T���[����j�x� ����i:�Ei,�#����P]W�L���

 o��������&�dE5���
AB�G+G'��4'Sr�Xd����������6~P�#y�E��jm��+��mDs�������B�y�����[���M=Nxt~��3�z������T�����j�J�\�t'b��j�����������F#=����W���S�T�J�c���� [Z�G���2�,��+�DnlGVDAy9j�|+��T)i2��jMsU|+���7�{��B�l��������k#���y�O}K�������8�����J �0ED����"����(�X��Zs�r���Q��*����P�������������*�hZr]�������ct�h�����Ft��I
�z&n���!9�#��W�CMG6��y�C��o��3����%j"��D*��
������8���T��?fT�2��D�u'��-Y�������[���.�D�9�XCi��Z8f�Q���S<<����Zc��������k�����I��OQ
Q���c�rv�/��<�8�F��ejx���p�4��i:��J �d/|�
����jM!^_
�yq2-�� c�jG���KL���-�1:�SZ�����(���P�;���
��V�{����
�����I�_*x���1��(�����|]���]���7�3��y
1-������X\xU�jG(���,@����/��l���Zm�[��>��
�R��A?4�U��n���CY�W�0���D��,�G(����X/�PI�f~h"��O�O=n��~�Q����E���:� �#l��nUB�U���a�����V���K�h/ZX���=�Q3E�\�M$]L�J��~�����t�{���0�U�_�|:��AI��,S�D�������wJ?4�A_�2`���&�%������E@��Yk
|�1���^V�M<���itx�m��B�J'�4y����TI��$:;����fh����]k���P��9rF�#0R���*�����UJ������f"|�D�,���9�a�����k�����e{O
K����D��G��&�.r+j���C)��2��/`�o�
�6�M�B�r7L�h�CYu���-M�!o���F&,��L���J�p��
�k?�����F��z�=��d���l
U��I����|�{��^0����2����F>A�w���(��>����Dm��
�np�z�!#,���%�+��1�����
~��lzP"�%g�o��k�D�����;i�_?(��#����c�<;fG��,]5���	6����z G�9�GN��r���"z�pIg��UB����=a��V�iz(��l[��p �9��������}+�uM%ZWB�z�[��k���Z1�n���5�����A�7P"s�9X���=P"YId�/y�g�����:����L%@�@�v�n��&�6P"����D�����5�����-�`�ul[�~��!�������%�~y'l��o?��9�q�4�Cc3����[���P"���O
�H�L*�!1���:Q��}_
Df���p��j*R�:w/6X@�GG��e��jI�y�p��������Y'�0u����/��A�l��Q"<��d�*w�F���
~���	������O��%������/�H|*.�C�M�8��4����&R���i�������������g��������F�N�Zq^l�������(��@M�Bw����

�5,�-tR���8k���y�u��:�`�Iw8����`�����D��u��E?8���Y�o8�A=�pC��tp"y�'�Q��(=p"KG�Bj�_������@o��\�4����Y���ivb�>�6�����������=p"��
���*j������~=p"�����X��>p")������'bv��	�\�N�6�g�>����vGW�6C������j�!X��@��;I��Y�������I���09f(�H2�<ijQ���$ �*����84{���j�F?8��F�T���q&B%�N����6��PW\��gN$c�y�z�4�i|�?��@Ap�K�Yn�C�N��&n0h��'���$X�M����^��v���������oj��EGFW���V"��1r��h�*��(�vl9{d���3a���]��,����w���!��wq��Z����b���q"]�)��L����d?����G����@��g���J#�?��nH.�*6���,�!0�j�
��7*�*�>��0��D�8�BQm�n��:���������A��o{=�S�����q�TlN$O�F/�
-a��J�����`��8�����������3���3p"����Z�5J��E�W�~��
���5Ru�I�k���m����`���H�6��'����:B���c�d���C�Y���9Ro#p"�'� :@��8�����D�����2��a�~���a�:��7����612�����<��O[�����}E����8���q5�m�r�����L����c+�������F
��V����������:'�bTZn���x����*fEl+�?��>����a�q�o��D.P�~Z�4'����;Q�7'����[�`�q)���?'�������
�v�CN�q@g�g'��z���aNd\���U�o8��&7�N4�d����H�+n��[Ze���U6����?6dS=��(�����^��#���Q�e	[��� �������]&���]G2�oe���;�CU�[���P����������e����������O���B�,�	��?	�v��U��g���(�g�
���5"6�R���:���~{�\�;�>��	p"����'R�m���]��?�:�
 �8�R��9���m���������������W����]��'|��S?.J	;OX�����<���������0������S��X�����:�:@GVZ~g���u�05:���x�#��J�D���o����]���z����:�H.|���S�@�t�n��8��bT�/}���)��tiN��H��Jq"x�������f(�e	����
fh��u�w�D����u�?��k7��U,M��#p"W���>'�6
��x���'����H���6�.�a�g��������8��8���,��8�����
��a��),&��@A��8�Bo�_������O��2<34F�H�zf�h�hz��B1��w<3��������*����x�c�DIf�D:��x�M�mh�^��jr���x&��aM��mC�I��2�N�������>X���5J|[�'<���!�����#�>�����w�)8O�5�>w�02J@��W�%dH��{vc��*.3G�c��C3�~rAt���vz����9\U7�AYefZ�r!�d��AY�|y�`j�[��+�\������P�DV2�>i�����Ni2c�M5�`���'�E1LQT�������&��3e���T����^�WK�l\U Z~�����u�;�73���H�������zD�"�('��j�;�[�~��h����1��R������������d��
�k��8+���
F~�[S�����D
������=����5�B;�@�#��[�~�����	��<0�,��\��/l��y�pW��W�������wM[�n��v��X.8)[��~]�������/K2����5��)��nH8�c>�d�o����l4f��S4n�����C���Je�H5��W�h=dN�ua��&{����U'9Y�4�t0Nd����}�!Y#���D*�bx��'R���Z40�P����yB!V3C�nD��D��������-�j�����wmY��]�fhn��f��qYY0Z" �?C��'�z���;��V���h�e��NL;j��|�h��94����������']}(��Y�����4St��k��y��n���2��-`?�
�L�]6�|�)B����<gzlI+�Qi������s�C(�N�����Ld(+�����$���A�N3x"ixJwo�p�I=����?N����a�F��yp"s�}wn��/��U�
��5��D�_����F�`:�Ec�4�0��;B,B��@�4#c�V=�}r�Y�
�S�3-���'���T~�Fh4������=TTD��Q#c���mU����\3��\L�������t����&X"iu�����K�1�B�Y"j
��A�h�+�~jlJ�h���;0������J����v�W�ae�����tujL���<cW���zZS�D�����A�VO�gh�J�48.�`TY55�v���O�H\l���K��/m#���"<�*'MM)�=3
��dL��u�����k\=�)ZIM[���9�+u6��#���HL��34��0�����V�e���O��t+��.���Jd����#�`�J�h
�_���_5��[E7��(���-o/"���E��.��Rfz_%/c��|���mp�bhXS����46��}�[���6d���	��]I���2]��#q��Q�����)$�%��������]hgQ1�].���V�kr������$���G�%����%G$Y�j�\��y*oo���
3�����J����f�b�a������^�UN����<���m�%�LZ�Vx�i�o�:w[�E�g�<��������[g�i�{T�NX����qP�C	�ebo�2w[���,�a�add���c�f���^gtW��1C;#;e����u|�`�m�SV=$��o�%n��e��P]k�s+���F��$�c;
�Ha��U��>#�����@2��>,3���}��tX"�8a�jI��H������2.��9,�F��`��bh���O���j��o�{��E@�LMp+����9y���n���wN�fzR�m�iC�F�}�a
�o��W%�C�RGu�6$�J�I���dh����/����]�V��6$_�����3!)����j�c2T����u�>����6�<3��'C$���Omx���&@"�n�kh��E�7@"�h������r�dOb������P��'j�$m-�+[dg��mK�=����njI[��+;Ck���+���O6{~�P22���Sf��}�}2�u\�{���V��C��l��oy�>��FJ��|v��=�K�p��S�V�G�r�����
i���X14F6<s��+��Z����_�������'t�F����P�="7(A��v���Ls�s���D�Y�^k�OZ��2r�:VAL9��	[������"������W�.^������r�y���1r���������##?�"'\;�E�����������U"?{�<jNx�������mcN��G�muO��6���w����v��i�6�r~O���h��[�9m���w��{�qew�*�����S�[Z�)�]�=���i��.k�U�C�Hk���W
�3����'��m.4������������A�=jvf�k���R�Q�Q�����Rk�a"f������n�������?�CO�m�ZIx��C�����~��|�����C�\u��!�����C�?s:��4&��\�3��A�����aL�s5Ya��Z����������H�������������/��z����Y�[i�z�����;0M	/��������5��IB��^1�:��aM~m	/�\>��$��U&�hcVG�X9*?� d?>M�3��/���#���?��r���9�T����g���������*���j��]!;�p=��u��1M���`>x��th�NZ14T���.�1���z��l�Z�c(�7(����-
�������
mkN`��v����q���zd�u����z���2���`�iU������Xf�l���Kg;�q�1Ewn�]l����	kr2}Gz���yj����c�7F��K=�#��y����}�����]��(��p��?���&a�F�U�`����l|�Vq���sg�/���c>�B
(���X� u~��3�����������2��P>��(��}�p$q���-:M0��/I�%1D���5M�IDwYB�N7D���H�.���~W���jJ9p�e]%�T��j��Z�+��C��k*w^U:�r���'9i�j��JV����^�'���|��}�����z�����da	�M����1Xva�Vm��[_74��z���_�#=4��*�������aK�#�|�2�Xj@4G&��p�C���8���.�^������^�!1��d�����^�ep�w�u3h:n�e�L�Ry�^��ZSt�,��!�!��|��5Yf���>w���.����L�'��~�&
�F�g�����t��cQ�)���T�0����V	=V<���8�[��5SBF�W�H�.�^DW���S�T��[\-C�Z�#��\�u��
%R�m9�bK%wl/:���]Vev�|������+�F?B����i��Q�+�����\M}��i�3��q�u�D2wY 2��������5Y!S���c9�C�jZ/6�g�w����e��#�5�@��9dY��{?�Uq��r}�������+����>c�3�s�]At�Mp#����o:����	���g�v������"*�SJ��&
���~��:���������P#�%p��|j��D>��*��Tuh�f@�z����Y8�3�M��[k�pB�'S��4kh���H���aM��?��'U�JH��m�`����8�Z��_P��f�\�*6DsR���n	�.�):�iPd;��t�Yr�~�U	��#�
U�U,������&Q!4�aUBL����"B������mB�����A-ZU�
��T_�R��[S3Hy�������U�9����P��c�Z�����W�G�s����q�q�/!��:-u�efS�=[S�Ys��V�D�H[����{�K����V��L���"��!wp�S`�[�^���
���p�����a0&�E�1-�Rf)!����_B��m�,���`��'!����^W?N���L�V�+����]M�04�����k���Jh��������x��dc�����_�5]�5��%��5u���5�n����\����xM��\b�kX�����01�R��J=^�e:!�AVk
�����t�����r��W�v
��1�r��\���N�f�hKh��RiJ�E6�PJh��7	�m`hDzTS*s�1K��������:�?4�V�Y��;
���F����g�$Bo����#������9!G�izE	
�y*+�jd������Q�<C�513#!d���C�����������=r
?��A��8#����*���0uF�*���A�a�)�CS�J�����y2c��+s	�Sf3�1H�-[�S�W+�2����I��"����
�]-���04�iI���VM��"x[
v��������.�����Z��:�	�
b�L�jrnV�1�_�����CcgB�����{�Bo�����W
~h���_�UB���������$��X�����e�Zhq�\���
�I��	��B�*��B��9�*������c#Tp_i�Zo3�ch�L+��k���d+hb�:����g�.B1��kI��[%Dp7�;�Y�g��7Dp_Gz�����?5�i�{:��}�!��
���:�x�0�5��TD�Ag(!�?O
�/����"�?+��[X�U?N��?3<>g�C-�z���c��
�V��n�9��]C_�_H/�j�q�*��_�V	���>
7���w&~����1y��o�]8Dp��J@d�`��cLY��eL��
C���[�#�`�tz�I���R���^K���s���^������<�y���(����p��q�%�HK�z��>�9� ���������5�D*u�pK��������i����i����[��^����~u�-Dp_M�����-������������B�U������x�0��l���t��-Dp7%��R?�S��$8�yX���~����u��D�Bo�|���
~�m��R���5J;�U
�����!	U��$+��f�	�~���}��Pcj��1��K�_��'�ph�����;�BL���ln�
�������iR������6���X�������a�\�!�7cbB�kh���D����3����-���[iYe1������{�V�#�:�-��#	<-����@L	C�>�/. ����D}JO�ep��p��3�8���?$����A{Q��B/���tD
&^8$��j����;S�L����x�+65�g`!��T����pjh����3U��������1�2
������D
�k�W$a��#q�YTP���&�SwW�_�\;RwM-$��4/�/�gk{��4&
2X;��U�1Uxj!�?C����`!�wd������^	�/n�NS��YH����wB���>SH�}68t����>`���m������r]�����_;����!�0-O
qK��g,�����32�sk�d��!7����-�F����5p#>$��t��I�����S.`'�5��.`'�����6�v2��{Y�z��v��#�W�hF����g��kXh��BQ���4�9�d%�!b��/8�;Y�mvG��-`'��D����q`'���f�vB����v��>u�Q<
3�a'�r�|����8�cOl�v21C;�p��6T@�hqVU���x����5�,���^2w�u�����������?��|����a[��[snY5��v��y�N~O���;��#a	;Y<��4"h��v��3�rc�S�6`'����
X+��6V:!=x���8��������n����k����
����v2���'h-��a'����r>{�N�'�z�H��n+Z8x}_�0�N�('5(g�^���B��m��c:���>;���C=��U��CQ7T���uR�������5z�x��u�	�0p���+SD?
��E��zP'��p����[�����"h���:u
�{U�_���-��@'L�o��T_����	+X�t�}C�����z@'�;���O=%��j��cX
���6��
B��0X}�N����i����6')�__����@C��hzZ�,�
���z@'��"&X���[=��I��5�&$��2[WBT�^(jH�y�<������1����R]e�-�Su�j4�^]==*�2c	�Y�Nz������3�O=$x*�$�k'�g�a�`CN�7uK��k�]�v����$������Zt?D�O=wM��.\I��C:aOC��I:�����S���%Y�9��P�������;�k�������p
������Z�H���zH'n��o���z$�E�R��*x�C:q�����'\��z������X��N�z#�!q�UB�����SO���U�tE�@�#��<p�����[������vuC#�n�{LrC:��5�����cF|��-�H��_
!�������EoT�:�e� ~=whZ���4���[0�|Z�V�I���:��}9�!p�U�������!p���%lp�_��M�"bT����x\[-�d�
nU!p/q�Ul��[[�eL��4�����N2[����0�Ps������!��28����;�����2�IG*O�C:i����}.�Ic��
o ���MY����o+T��T��������Pv��M��2<��9q���X7*���9Y�q���C�������c��i"����N+�:�u~#��,YKm	BF7<�pN�����F�w1B�����Q#��,�(�B���9����?�g��Nk%�����"�t8j��	�'r��pG���q"�;���z�@DN�wQ#�;���	9�	�Z�N��JW��hp�n���j�=��DPhj�x�7��'hv
�5r�-�;,D�ac:���u��!k���6g7UxV�|]�����_������8�
ivZ�_�	�~V��2�z
��Z�EU�x�]u�q�t�{��+Q�?�Op_2�9��vH'k/�b1�
^0��5��v�%��������2!����&�:i\��,Uz2)������d��A�������;+�p������K:����`)6�Q�tmt2��A2���4Y�v�?ti:�yy����(_���>UE�8�B�A�}�����x��`c�������vp��Kz�l!�[f�f��,Z>�9>��(�u�sMa���)jxb�}_����3P�:4h���Y���P�J	5�WZ��,u�0&���:Y�o-����9*�:ihh�QQ���Qw	�^Eq����]B��[����B�W�l���������["���>�CU #��m�/��$XtX,�[Z_I
"���:4l�_U�����������O��fR��9���k�z�Y:����|	�J���8�zi�����m\�Z�v��-=�Kt����pN�������x������t ���v8'}�td�h���IwD��
 ������T3���C\[VP���bh8M��3!���u+�������D�S��3���b��z�j�?7B�4`� 
�zD��������1�v0'�`	�����Q|reV\�/��E%j�����0'L�����>��E/�Sq�jyO;���|������d����vM�ks�*��TG����
��w5r��(eZ$�b+U�Tk�g��"�.Rk�gb��l���-$�����t:�L/�I�S��J�1�����T4�Oz�m[oI?� Y�����J�[�����0��0����t����&��d�P�o�������8��s���zQC:>klM���{)�#�5���t�.B�.��9��HH�@�a
�M��K2������F4"�t,����JZd�<�1r'{w��#�:.��W%	z�dBnE�� ����S#�6+?�5�j�r����f~Vm�~li������v('�.T	�9h/��R����^�r|�0��P�=]���F?N(�*���!*h�LggrB����nF$
hdz��s��zP���li����V�C9Y%�(����^����E�(�WxM;�����Y_��d�����_xj����AV���C9Y���{���Qd�R�A�`Om���r����XX�5�-�E^eo,d�i�C�
�����h�n�t�������m��i�Y��4l��,�ic�o����'��w��*�(�-�C��+���"��*Gx���R�t�U=�q
m2kA�w�Gj�
�#�eTYSJ=����#�4iU�e�����j���_�j����W��������L���<����'5�����b�x~R����L������U�<��h�L
�#V�z�����:d�����\}��sf\x��L@����������j��o��.v^�Z���m7�ya�_vT�?kQ��$����I+0���'���@�!?���B��+U�@<����Y����n���r2vY����iB�N�x����U�=�����p6�����e��������F�������E~]���t"���-~��Q�z+�a��s1f�c{��
���t�i?���t���I	���Z�DT������9�n��3|+�����1�U���o��L��]�|cJ�o����9	g(���W5��[��!kJu��85������4k���j��o��yF��Qj]s1rW$Go)G��TI����x�I��{�H�e[��kn����,x���z���55y�������k$:a�4�[���0C�d23��[�~N��G��|������l����:4a��I��'�������|em��GG��������q�o��YH8'M������fD���h�����T[&�C5�k�*�o����Z+������N=����c4��3���{;B:�Rb�y\���o����,��,�T����*�,%a+M��1�N/���
�{1��'~9+��W�wb�,��*~��SA���!�*`~��/���T���)��6%�,�1%/#��
7}��
�r�$���[��\�������[����6e)�������Ot7��_�En_I���
�m���q{���
���;���}u6_s�o���
�/A�~�����G�oLEQ�{n��P,�����+)MGn�U��s[t����[�v��K3���������p��������tFB��*3�����l��t�o�m�r��8RghK�>5l�-xd;��l�R��{���k%����9B��
K�������k����64J�����C6"G�ZU�U.�V��j���MH?�Z�[�����}(���}��i�E�J\AEq�_�%��
X�wf<�.�V�{�X���6����{���y_�xf�	m)�.L��%m��?������QUGn4�i=eA�������bv������aC�4qF;^�e��
��VA��F�������Z��R�Z_OaC�6T���h���������c���d����t��#������[��Z�kl�:�'S�6M�m���w����y&v����}�x�������1Cq��{c[��+�o
������	�E��o	�/�7�H�}�K�������3�"��[��C3X��+�K�v�}(j� ��9�#�
���{$���)�-`��q��k�`��������z�o���\g��

��D���rn���9����������6Ti}h��`��u�������
����;>F���d3&"E��t�gE;����j@�o��������"��p�����$��)t=r�/HB��j!a��u_�&�E�
~�mCK���=&�;o��u��GV����Z�����?u�:r#W�]a�����n'���R&R�8�
��7T*�-Z��*�:�!�f��8�l�t�+�al������MwMox�2Wxyoz
6���[��4����k>����-Z����%dor�aC	m�'�o��f��)��H�������U�6s��$��\�B�0��ebK�9����E��J�EC$��r��Qr!�&w0�	[��@�jL���V������D������5pr&����6��tO�Y^���{3�2H����W��
g�;�m�C<�����R�ruv%�i�N�����j��s{����Tu<���la#��s=0|�8�K��n[�������I}k�}yozfg�+����T=V�Q��h�5�}+�}v����������Ea������E����l1J0C����n�����[�~N^dM���q��N=���o���y�N=:� ������$aQ���x���n�����ZM?��yAP�p���C�^<4U����$�o��W�$Y�T�H�*�SO�]U�%4����S��M��'XG��f`�W�DW�7�3���"o��^�����+�6_3��K������,��W����Lx���62;'<����ms�-�H����N=SFU`G=�����N=&{u�}�P`��G�>��A��[��<T*�u�mw�l�u�W
VY`�SGb|w��#��R���B���g\�>��W�:�WvN�V>��o�uj�m���]�F,
����d��j�g����7�a<u�m�z����:2�
7��S����	g�e���!r���'�J�CD��S��������w&�����S�4���y����=W���\G�$Y�m���������#�P�[��:���������#w2�s��
��������=�ie���u�Q��P�o{|j�}�M&��$��7n(30�4��S?����
������:������z"���D���U�s<s�e��#)�s�8���
��B��h]
�[��>�N=[�G2�w��u�il	�Q�Po����nC-=�J���}h�^5}����N�x������5q~l�z�JjY�}��n2�N=��z�W3t��!D1�W���o���P����:��E�����uj�M���bCK���R"��>�j��eOh�	ZY�>
~|e��Q��.PyF���={l�z������H�G���nU4��]p�N�L��ML}�:�3�~�!���S��Q�4��d�3
>&�������q����Dh�sl�z.p�f|�7)�f��S�F���38���S�F���3.�k��S�F�����6��S�J(���Hu�:�T�lAE����#C�53���N���k}:R�������
���<ue3:B�^-�pC�O�q�]�
�J����l�}���0B��S���}��Y����8o��y�����K��y�F�d�X��Mv�����������>5���oR#�>��D���'<��wM? Ojl�z^�)JPDG;�P�����v�u��@b$�v����M����Z���!����1�%l��Z���]c����o���0%�.�����}��8�S�=)�����,��F�/�!)9�v��x��R+C[j��+��m�z��*�P�;�������wK�?�_���R��
lH*�*74gGq��w�2�p�G!��Z��z�U��j�j����7`����`2��Y��67�X=�Y��e
����VKn���v���9�K�@<#��J�\}R��-W�Y��:�p!t�q����7D3��F C����&/�:��I��}����������6>`#�!	��k�Q~�Y�#4��MS�gJ-�����gv�@�f=�E��P�������;qs�P��fm6��
d����n��U��j���V�#ZE�A����]�*rEEu�DNc��%ZE�{��#p�E��1_F������{zE���j���z��9����[�^������5��d,:cQTV�rZH����_CcK�d-���:uhlI�J{����C�]m�0�1����4C�^V�c����;�q�<F�
��:7j��5��0#�:�p7��^sk���[���W��ch%1��h%�J�����/���1�Wt���W�o����
��^-R�R]c�v�t
}[d;@U�	Y�}��x3�Q�h�GG�y�DV��[��}���T�ft��t'dqD��B$�'���k�h���-`��o*'��n,:��!Y���mP*�y�RMhN��l��t"�����)�^>��u���b�4�FpP�� r�!TY\�~�<
"���d+�Kr��4r�4K�\5��s5��:i��>(��vrCp����#������k��J�[�H�����L�%4+~��a,��
����<������gj�������+���U��:X���h�Z;,p���*2<+fN�m�Rk��UK����klG����*Ta����U�4�3�$�d�d���$d0���v
�)�+$����4�e;�Y�f������_-���0A��r0!�7��31��=O_��0UEJT��n1{.V�b��&1�H�^,�?��f����Nl����	���L�T�b�o{����q\�y���7� m���UI�,�3�X#��D�G(@�GDj�a�����3��O5�{��z����{����N�<y��+�������I�*DQ���18�];�a�����"3���%��8�v6D*���"!ip�*"c��}�wS8��0r�Smbi�5�m!�I��b���	I��>�����9�=?��%���e��v�������V�������c���Fx4Br����
����%(�G��B�5R�
sD�����n���9�]�����	 ��+�q@���F�y�\i!'��{��������z�S�W�Z
�<6���U����QT����#�+l������4rSm�F�=C�X;"���%��x[�H��z�wv���Ph�l����T���
!�z]2���Ih���a<��7	����Pu���3!;2SE��kN���[�Q����)H����e?����!wc8+p�FB����q��v[�i��nC���(**�6�IB%���SjD����o�8�0Vd������8���KUn���Po�8�L	��Bi
o�|����
�?��Xh�\���_���R����i���FHn��A�'j�-4B2W����O���}d��HZB����-�fjg���Y����*��u`�o{����Rn?�Sh���"~m��8�:D}V�����?�-��)�kc?��4N���g��������Q!@�3����+DK�5���
���y���A7}��^Wz������
��V+Tq�koE-���>�3����0C9�2��2�E�����d�f0�*�34B:EO3�SFFm�A���������������#�T��g����0���w9����m��D�������+����rwxFF]����Z���Q�h�������/4B��j��%oV�3��;3�o[x"a������{PF�FH����a���*�s3�u#C#�yoT�Q�L����tIx���xc
�Pm@����Jr"Qpw���P"����c�nC����]���D�Z����lCej�?�Oe>��qOG#�o{�]M|g�wv"*��9f=�xB��`�H��E#���#4B5�.U��s����j����h�L(���#����/g����{�j�
#�$A��RQYjC��N�g�����nCvi-P�9��maC�s�Y�����/�
M��UG:\��x��?����q�X�����������go��l[����9IVq��+Np�z���h�S;\��G@���GV���
���R�����+��o�`���-}�IU9�>���T$Yb@�K�x��1*�]'�����YjV��gom�|=���bEg(4B�R�j�4F5�6���7�8c��F�����M���Fr����>�q��I]BbM"h����3��B���:N��L.SJ����������{���o8�4Dos��N��<����gY���n��&Vw�da��t41j�-���S/��[����xfh�0=�X���L�CU����]h���g�95�
��F�y�]a�B#��	��$����F�����*��Fy6d�*���t����q�{��|��&855������QQ���V��Z�=!N����]_GzM,�����4����nM��������������1�d���v�8��.K�w������_�JT��j��S�����H��H�!h�o��P��=!�{�������~�O�y��B#djw���w^��Wg.���]x���a�C�(f����������^�tc���b���Ek�J��M:������5��bEDs��[a������#����\�OS����`�w��=��b�q����P���+u���)���VQ�w��&L�N$�������6�rD����aC#Q��z�]G���&��-��s>15���./M�����CO��7����g������/���**�.�U�-�FH6DR^�$�����*���}�m�
m+���un�m�H|,�)�����?��3���8�n8���	�/p����C��L-�q�Z�o8W�������
*(f����>�ed������]a�M�{v�z����
5�6D�[V����64B,�'p���*G�O;U��HJ��9��v)���H*�F����|���94B,Q��:�I��qj{\8�'P����j�}��B�$;N�6�jk�������k�"���(upn#�q���������N�;����z���WJ���q��)l`oU�e�S��l���*d=����]��Q���nC��C�VEI�=G�g+�D�D��n������v�i�G#dqRy�,4BzaE5{�e���K��!�������O������U���F8"��U�T��N�CA������H��m�.DR�:�<##�A$�y�����S7�&I��?N]�u�f_������U��v���q�V�:�b��F��8����)kF2�����rv���t��3+?!;Nm��%I?�L�C[]�_Y�=��=�'wY"�BO{��m7����8����8�M*��f�h�,�BHO8y�������o#��p/�-2���u�gDo�e����j��0f`��/�*��.W�|�axfh��(2��o���;�,�q��q0�������8�[����?`��������S[Q^��"q�q�����8��+�!�~o��+��<� ���Y����W�Q��i��F��g�����j�9���W�j�3��:<A�L��������Hx�����5B����e�^C���r�s�j`���:C4��>3����^s@#��Xr`�;g^���&}����-����������e��U�������(j�G��9FK����;{)?Q8�XQ2<������VG���+��^�	�4\[:����\m<�1+F��_p�thU��S/�W���=p�9Z�qGL�\�������Q� L:��M��
��������r�����y������+X/���B@����u^��^��&Z����z�����1���K<�������uJ���^
K�u!�5A����� H>@/���N�l ��P�.Hn8$��m���FW>�j$9azO�k�S�Q}��������^�9�Q��T �C�`�J��ZuY2��!�A��bV���W��r!0�d�n`��yBT���_��Jv\�)lx[wG�w��JS&Rq�z/�����*���uk�!?'rS>�i&�8n=;{�fP}"g=�jX�U����Z��O���e�C��u���w���\��m��S7:�\wc���I
��J��%����������L',�������2h�<���A!������_B�U�m��F:)m_�%�f�
y�����o!�,e`��2�^��C���,fy�d���!
��=Tc���u9� ��(E�a�Gd����nh�����4Vw����Ps���8Y�a/@�]����uC��
:��0uM��$N�;��QUa/�NQI~G���'���@a���vE�A��2���aI����pCt�)GdN�P
b�;K�E�!n������Qe��;������p���OX�q�WX'x�5�8���L�������:���j]�[��5�����%&�H���4DA�&�l(5�{�(0�X���;��-�Ddl�qQ���5JVu���gR`�����
�����r]!�2#8jH��M����8>R�N(<#C����(��po
Q�
gW	iw��CdGUSGbn��UB���q'>��
Q�v���<���(He��wx�Q�j��t��I/!
�����DpN�ul�C*�O�=�IW&|�����g�b���Q�������8��l��E�&������$G0j	v#X_�l��J<I�Dt���bn+7�Q�Iq�Jix�����v�o�v��� ��H]�6DA2�>���p_
Q�E[A
o�>.G$aU*�Xan�Fy���(���7I��#q����0�Q� �o���0,����Y�����t�/G6�P�P���9"�������`}J��!
�(tPx"���DAP>[��=��(H��PX����G��p=��IG����
"\WT��ga<D��9U��g.���i�p�-Q�y���r)LMan���S��O�jj��la=�jdi:�:v=����"������4���3dS2���
���2v�=��fV�n5[P����5�Q�&�d1�jj��l$X� p�6�m��Pl*C^
E�5DA�%�������d����;�r�&G�(y�Y���3��u�;)@��
�u����[Ri
Q�[�f����E�*!�S���(H�������.5�Ua�q���������<6�(�R���(��(��lK�^
Q��i}(���(����$�W���1����r���?5��Y�-�������wo�g6p���_u���>n���u�u4�n�e�;��
��m����X5�����2)��6T�
adf�{x�#
2I�Zx�3C���|,!DAF!��}��WBd�J�]F�-4�!
���i�
[#�	Q�9���!6i�<�����:v��~"�K�&k��<a���Z����jm]I���<� �����z��!
�(�3�\�h����{Y&=�#��
������m��5I8�L�����[�W�!�P��g]���(H��
]b��T���A�J�,n��A���,��1u2���ml��[C��x[�+��S?��"L,�Ad�8�4J�$
t(�SC������!
��J��Y:J��!
�����B����z�S���Q8��3�\Yxf�!�sH�5q���R����R���(�n�,�a��;����8u�z��\�����E�� �z�`����<y_
�zDA
2���D�� �4UG�����>va�9�32>� eQ(��e���H�����t?ds�n�����(�'R��-I6j}!
R����	��zDAX�H�>g(DA6:����`g;N��_�����Cu�zl���{7e��g��^�(%	�3���]*��W���:N]7�,>A�����#p���V����:������g�[����r?5��@�7�
:��i
��E�c�m}X'-�����B�
o��M�Z|���K����bd�c��d!�Zp�	Q�v�w�Z�I����
��(H5
�%J�U�1��;H�v&�RQZ��^[TI���v��vDA.)��S0�����F$7�����_�M��c)������I��e����d)��9N���C�ZQN��������&%i��������2��c���Xl����9N��%��C�+-����'O^XBp�Ie������kUp�Y���� ��o���,<3D�b�[��j�S/#n�oR4���^��G5c=C�z�����q�o�S�����S
F��LF3��b�����8�mQ:�l�q������+ ��s�8u��9}���M�3�%�}��Z4�;Rs��9����� I���-���2�3���9N��X��O33�~h1_8��R�[���i�p"V�q��	����\D�����-O��&������dQMQ�|E���v'��S/4���LU����L�����Z����b�sI�������W�=db!��W���+S�����
Q���l;�&�sj9`s��n9A
� �O���hs�z�`Lvx�����D0���v������
:���R�D�C���I��t2<X��w�O���*��B��v5Yu���� ���Y@S4���(��.�e��$Q�����!,��,]�2z�nG�fJ /��-��7�#o��h���9��q�o�S+RX��q����se)4��k<�v�,�
]�x���n��I�Dpw���1V��@kH���D&�a�k��N���;@@��(H�|&p0m�����z���*_�3ChoP��U��#Nm��`y[���K�Qj�3�X#E�'`�?T���:�r��*K`&�	����wd�3����(�$����|���-����Y�	�S����>gv���+w�l7�������_��N]y�N���.WS!��r����������9N]w7_�MI��
5V����Z(�	�X1�8��H�]�Q?�����2��[d��U�Z����"�>�H����2�P(����w`����9Nm��jo
��m��
�G�K�L[s����A���G����@��fk����(�9
����~�-"���E��8��%�,�;#����w�z��K�������R).?�l~����5�6t���(j����O��)��2��k����{Be��g��+����
`��C�$��h���f��.*�X]�vz�����/���W��f����k".D'5N��2+,��k�x���"�H���#C46Q�
���w��
����,�F=��otw�����S�1����*9����6O<��48��_N�����05�H��m��G(���P7����A����W'(aN~ q���
� �%��c�ziF������CEW�Ju��@u�������V�Cd\(\�* }"�JmI
��
=T�%���P����!n�3�0�v�q��z�\��\�T�3= ��qa,�E��z�"�rkE�z�-%���f���#�����|�S�Ou�/U����_<���F��.$�F��;T��-�Rgip�!22
��e��6�;�J��NW0���V[��Gj�rU7@��b��������/�(��C����
��ZB��I��6���C���-u&,!(����
���
�X�`���h����5 ZB�
M �U�E�W�mM�u���bo���K|f@D������U?q@�������+b��N?��s?" � ��p�	���������j�_V5t<��'Cc��hu�*.���|�?��nF��zTKj!���=������4q���04��%���C��K�������Kz�0�3:.�H��Q�S��/lxa��[��}��v{���.���Cd7+��>�N�����$B�F��U���b�EC�����@����m����e�Ps��y[h)�������1����%cbK�`�L|w��|Z�9PO���UG��u��������������9i�H���+���`���Y�IU�
�� `�lm�0�.
�-���(�h����'�5Ia?}��i����J{h�����m��V�;f=6st�\O�Ph�l�[��8C�����iz�nd���>�5+:;p�=��14�n�&�%64@j��grO��B����K�:2�6��1���}tU��J�n���&U:�@�Cd�X�r�h-���e�g�k��g�Yv�2{�����u���|�|��H�1��@�oN�����CdQ�� �8��O�z5���z�l��K(��U�a#0����F���*�0�+%~'�����1��X�ui��m8f=f��
E^��#.�J��d5;��~hP��a8<�������
��60�y ��A&��#4@���BO���HaC�����":�~���.l����������gH=�p�z�K+#��_�c�o��jC��
��G��!\w1��I�5�XB�x(S�
\y4'Y���.�>Zy�7���P(�J�d:���
�h�>���?���`!_T�O�;
VtR�LkYFh��K����Z�?B��u\�*�t?4h���x���(?��	�5������`�����x�H}`���������3�~h��,��p:�HS���a�u����@���_j������K(���F��mqs��X��,jx�&�#4@f��IUfEF����K�,$��������1P��? c����Q`�G��Y��SY#t�'��*rSs#�����
����<�jM�m�N��	�
���:y�����l$���J��Jh�T��3����C�JT��wz<�0������G3�I�=�
cT���F�+#4@U���'8^��-��������sA��5�mO���N��gY�)HE�e���hQ
���Vgz����x(���nn�����|����]��ld��'��UT�b����
��NH��}���G'� |�������E��+�����@�J;�O��Bn5�b��=��f�[[�+�4B��������
mF*���;v���+���&xjh��gY�[�-Rm�q�i�r=Yg�o;�]3D�L���S��\���x�s���]^cR���
�����N=h�
-F�6��-i�.M�ND��S�Q'4h����	�
?��9�g�6#�2�g�L��g���+h�6D>w�����:��e�����Y������R�f8N=;�����`d4��r����J4B$1�pi�!�w�z���i#j�h�p�z�K}�p�����G��R�R6�
�� �V��Lag[�k$3���o���H��,���l�P;C�
�C�-��NSy����S���(l%�f�Vu��P����p�zN�"��9N=5��"��Vo�v���bC�q��5��n��*�w�cC���D�c�Vu1Z[�)]t������	rt�#C������VaJ���U����W��U�&��T;�f(h���>�7��#C��������U�����A-~h����*�/[���i�C�h��w:N�LO��vh�
7��8u�y��>X����^���W��!4>3����Y��U	
�|i��91�aC��cT��x[������=TgP��n��3�n�^kdh���38u����Z������1+Mra�c:N��[����V%HO�����R
-�3f(��[��	o;����jC�`1���p=�,c���x��z�y_����L��[���'T��1��[�BTr�G���8����D���\v�8�j��Z���8�����&v7�YV�,�n��[7���P��s���N��-��We��z�
Y6j��;V���V�Q6EF<t4@�U���2�^��A��~d�E���^s26����`����q����w����]��V����S��X�C�DK	����d/R���S�Az �@"���m��P<!+�w6��g�]�
�}�St�y���*�K���z�����ky�����%N�~(u��Ss	�C?1uE��(�m����'R�3�Qe�5��n��B��^���(5h�<3�M������<5t�����f"�L���5�s�b�����w�H�t*^�C���t�fb��Z_��e0J��+�?��b.���[���h���c\���Vw6O����S'T"S��fX�9�'P9	#�^3���z�6�8�ZsQ�vo �:2��m�:Nm���'�x����������H����4p?�u�t�z����� [X�����7�.��z��
?�y��?q~6<3���T�`�_��qj+��D\�o���x�#��F��fC��Uq��
5E'���B��[t�n����an�<�3�,��Z,�XD�
]��(u1C�����zj��.h�]N]������~>u���h�94G7�^6�������Um��Or�?OOE�y5{P.r�����4u�f�w������k�7T5����Au�tB��[���������B����Y��
n�X�U����dI9�q��,����3��e�Vu��$�o��W�g��J�� >D�	��cjcic$���	T�J������1�D���J�]X��g��g[�-��XBhUj����l:2��46i�uT�{^�S���J���Vh�l]R�y��%:6,��+]T��"����a�F���H�r����=���G3�P���F��*S�����*��8�%@2m�]-����ze�N+��w����MQ8���u�+�5_�S7�Hvw�i��O�s4`z�mxf����I���K�|�}��M�YO�|�r�)o��e������fc�3�3���%v����
�z���*�&t�8N��6������l���+����)������^_�Y[� ������:�����x�Yv@
rt
zV�[��n�b�j���Z���O=)��!�;���+@��:�P�`9N�YK��Hk�(q�������Nb��
���	��[�!N�'�8~}f����%���]�S��QB�J��N���e��\������0h���rC1,�	��@uN���#���_��n�lX�|O4LwK���P]��Hu��w���S������U�����i���'�Z�; �T���E+�CK�l.��m1���9r��v����:�#2����uh�)z$@roT�#�����e���ddxh��7�L��Z�0w����j^>�
^��puJH"LL�5t9\�xz�w��h����/s�L��q�� =M�������	����p�H
�h��8\���~��[�(^�Z1��6�u{���#��s$v���#O
M�������!
��Pl��V]Xt���O�W��3[0���g�+A�N}p=I���
M60eKC�w�?�
~T
�;=�:���YA�^�����g�@�F�O���T��@�%��|����&����ZYf6��wY���pu):���d�)/����z���!�������Z_�lL	�KOq���Ggqn#9z`���L�N��������m�X	|_���Aye������.#�0�(nY�!Z�i�����K�)�a���EY�����X�V���d����
���!�1���r������,��,���E���'��U�_�]r�����R*!Q7��]�������������n$�,�6���!���4��{��Kgv9s���m�`=;Z>4B��w�+>H�	�r$d
�CH����~�A��,����
}J��9(`$����GB���-�[�\q�%[�g�9\=3��Bin��F��qB���9����[-/Kg,�?2��1uQz���GX�j�(��1;��t�/j�F�����*!�Q92/��4�gA���t��]�p��iW�Z���cu�i����m%l�V�l�}V[�F��5�Qf�w�
U�B������E�:����`��*�jz��BXQ��-B�g:�qj[���:�X��Ga�cG��!�p�����Hl��Zu�@
��:,���������������c�y(�)������!TMIZ�t�4$�����p�
H�	V��P���	s@c�7i���s%�?����;������3�p���Lha���a�����3�0����
�a(KZXO������r� �t��K���QG+��{��:�4\PJ�p
:Z=r��X�5��q�e�&�%_:�V��Fe�*vv����j)1�}��9X=�-B<u�Y�7tZu���]"�2m<�S��8T�]b%���]�'�h���Q�MCZ{�m��vZS�������Da3��<��^���ZXW��9N��qh�=�����S?G)�������^�9�&������:Q�7��,���q������9��Eo�}�V�{��+�0��1;3Xw�12��ao%L��3�HuN9Sz

�������V���A#�r~f����k�X���k�}��L?5���lH�U!s�����5�n4�3���Q����������k�V�����x^�tZ�����V�>��v>�B����-:,N=���CT��	$u��EV+ s[5�f+��)��V^�uQ�]'����0��}�?;�u��P�`+�h��bc'�]�?[����.�8�a��l���L6��:.���N���`�����v�:����=��EX>�"���7c��qSg3;�[��o5~k=�d��"�p��at[����,��O������QSS���u��E��LH�i����y���.��s�����Fb�gFJaI-����"��3�-�YTPg��$��gh���	
�Y��������1@C��'�3�����AC�u$������qJ�4��'Z58���8���f�(����_2��eT��xC���D�=
z�����5]�$ e��?C�/���j���I���g�(�i��H9�Rb��)hX�C ���eEQ!����&��/�)���u��F��"E���m)wn�
�$�3��qW��k�t������
�K	K�h��������fAo�j�$@z�'���&(�C��i�7}a|���x�
�0�chx�d���M��$JX��e��������v�N�dQ�Z9����������3����s���]wk
�T��#@`�9���BH����v��TJ��JaN�o�;�DJ &`ij�RQu�7`f���F��x�n4���T#����G� $,Mx�f8 +
r�4�-�
��SC*(ULS�Re�uC�W��-<��w���T�s�B��#PN�4�����3����r�S�qi��<[U��1� �����
����yGa�P7�c*�W�-�
��P� S���E���(��,W�����3�Q�������~k)�2f#QA�JC_��3�2�ah	#4x��`��.������f�����	mV�`����WY���/-������$������,��c���_*�=�(5ui���y�FAT@g�]gF�HlPsV��g���65m$O����1c^tW��1��_��H���:���Y��;JB�v����L�8�4�g�oBp?��1Dy���E����g��0���A���wJj��y��POCRQJj�S�o�>�J��<Gi�-1���6��bD��pL��e�\D���TYZJ6Sx�C�iI�����$������3���]�����6��0������b��������wWK=n�9�&D�*KZ��cK����|o��O]��+���\�}�w6������^|VUHd�=+�RAnUK��p������W��w��pL�i�+�e,����K���i�V������Yu��]
X*�9���]��?<iy��0���M��1!	���& ��4!�	��l>���i��h��0�c�w������Bi�	;�m�3+Ry�������.�������r���8�bb���k��;��4����)�]_Q}��N�����+��"�
����JJ,�%[�"�����h���*W��9����S��04<�L
�3�[�3�jY�r�>���0��c�}#k�����aL��m�����]��hi�X����f����T=!s`�y����@������O�=A��S,"�{�7�YU��a�o�~k� <�����v*���K�n�S������}�����/�M��~�LG�U����%���a8?f�hZ���.�zh@K�f�w�bH���H���}�]�
����)��w�h�>Un��[u����)�"���kU.��[W4���'+������hd��{Qm�
���9��A�c�p�����3�b~9q��$)����i���z���I�
9�.����_�3*	���GO������	���A�.$��@���za���+�3Q�������^}-�G���N�X�W���H'��������0���cGh��	@��lv�{f�(�EQ:�Y�����
�Je�Es[9 ��+w4`�Sq��w����/���cO�Pi�Q�PP"�2�C��U�YJu�V������g���F���P)�������"yv#`
B���H�L#���Q!F%�O�wT�������;r7-+$����ng9uip���
��n��K�UUF��n1s}��o����=��7��k�4���P)@����������>v�W����0�8�v��n��`�&@���\/��>�������d��z���U|�v�((�}F��V����3�
��Z3�di-�S���H?��'wME�K��:�=�N��D����F9����o����9������_�3�t��b{�L��;r����#}��,U=hty,�-N
�
�}b�F���9@K"�WF���`����v��J�w�x&��XX�w�{����J~��zv�s��������������[mhF���]�jT��32�����e5��sty�g&�
�����9����S���r��+�������X�=8�8��)YB���*U��r$�D�jC�p�!�:�;w�O��jR4;�=�,
���X���]m!`T��j��o����A�]+FF�.�5��N�j}�n����!���:��!A�rl{}%��aC~(!�l|ft,�HL��6��l�7�Q#�������+N�P�^������amKW�
vHP�]	��JX��MD���,�ti�S�>��G�
z�W���g�
m��j����mC�x!�1�hQ�32T�;�(���"�3J�gwC2gQhH����l+�����K�3��_pg�P����g���z��d�^�z���4�7�����X�ev�����W~F�
�����m����t�s-�����zv] 
O�O��yF�w3�_�P�s��zv�����A���_k���,W���M�mh��3���.���d)�F%��w���H��MB=�2����R����dbZ��(a�1��b�\1�/f�A���j
A	=�M6W���b��S~k�,AuYx��n��-y��6����b�eJ�)Odo.�[��������gh	�:5�T��	D��u���4�ZI�	>��F���P��u�&�D��yX�C�����^�@G�m�P��d����
S�
��~��0I���m����c�v�-�!�HK���v�z��1��!w��h��+5�\�&J�.����k�g�K���l�5�j��G�am�.�ZvC"��P?�v�*%P_}�tIk{o�Y�!��v���p��)P���'P����:�O�����l���T
��h	�v�����h�g�+�?nC���b��B\���RY��9�-�
5��W����mS�uz)�u��SM2P����S��1�}�Rr;�q�����L������1����K+T�j���I��a5I�tr9���x�85C�[c��QP_�Xr�xc���x�Y
�v�DD�`�>u�����)�F���~W�|�%Z�k3��.j�uJ!�
�v��=(��R�r�J��3
��xQ��0@eN�U?�)��
)���z��L�!1��E���.ZLd+���CA"���C��������j�e)JV)���ut�RC�?X�y7������O
[*Wj�U�c����r��[Uf��m��N����3|x�=�t���?x�y���:e
��o;��-�������j��iZ'`��L^E��k���w>5������MX�u��VQ;���o;��eE��@����a��lY]�E����GUR�����[�+�~����g+�4	�������S��L�����YXz�� v	�v^����B�yy���������1���b%B���/}�i�������7�JQ��*,1����R~]N����C��J� ���!�ghy�4������f�@u�|a���l	����H\�zX�������O
��6f���VY�8����.�����������z��*���S3�X����zX����5/����y_z]�U�)�[�b��b�-W
�z�o����������!��N�����Q��7�]�D��v�T���K��6uU�b�b�S;c-�r��]y,������g!���}���������k��gh0mgF:A�zU;�����NXk��F
�v�e���lw514lismt�\����amo-me�_U�j���.�������i:�%aOC��k�R�)�0v���o=�7d�U�bh��Q����&�1M���.�F��B6*kXY�1&�:c~-�{�Q`��+���zZGf����#'N�zL)�p�*���	���[@�[�:��C��#�Dq����Q��]���4�X���hV���`�s����U����zKv�{5M'�ao�QT>�1���].6�w.j�`L,B������:�=�E_���`��@��w�LP���n�M�� *^���e�oGUSm��2����BN�K?����U�S9$K
J��$Jp*c�)�%�Rw���ao�b9���}����w��>��nL-�H�L��T�0d�j����Sa�G��.��Q���[��\�����z��rI�]�jM���gw� LEI�ao���U����"��6��b�������w��n�]4��[�S7z�t/x��s��h}�B��ej��cJ���I����d�VlU��YV�O}W3����x�pL���4�����K�����*�X��CZ�k~���Q���Y���1�	����z$K6p��@%�J���3K-X�:E}'Yr�1�*��}�x
E�����wZzv��w9��n����$��C���f.����#=��X�8��5���z$K�*��R,�c}��_��^S�.�&��_�S�o��L���Z�U�f�n�S^�����gj$����q4K���z�o@��G��T��0�=�Y���PV] ~�~F���+���C������~G���4�@x%���"��I���1���zK�gP>�u��wY��/T�D��wM�
vo��������z���d��e�����A�w��r�u�l�~�M��}�3<thOn5'=��)���v��a�{���04�r���"�����,��3��IP!���[�Am���Hs�{U�s
M�p��}/J�_.xaQ�0����++T�����b������N�$�D���cJ�$��I������X~�t��V�(���zSi�q���
�������&�W�v���K`+�M�S���O4}���x���H��Y������9��.��4�4����cJ����<u�� ���������v���R~��jG�d����nTc�v��/��$(�/��~����Mks����������L���)2;���(�U�(������^I�G~)D��^B`�\C��
���o�
7i\I�r�o���z
�V�Hg�hB��J�p��3���nAZad,�Wt76�$��8C^����l��~���[������ �Oe����2�m�t��m�&�MKH�v�������Ll	_p�]��nx�5v�M�=������l�Uk���+�*A�^��V04�Qo�F�l��- ����T�aB�B�^�����,am 6��Nv�#��<`�y��a�f25�������~	`�l���d����B0��2�NL���g�fG�d335JJt��T�8M_�r*�Q+��Yk)����V2�'���)|��Z�E��pj��Q+i�^�Z�]��.|�������r|	�M`7��m%�]p���_��6�w'������TMI�m�'� [�B�d����3P�@@j%�}��P	h�V�)�������I~F������n ����V�LTv���P+���K�a=���
�l��w:�=feC^h�B�9�=�j����e���&	��L�3��p�{���h�n$�[�����H�j�V�7���N���������Z�V_��L(0`��ZI�\1E�g�j%FRUX|W�fs�{,�s��Z�kG�D�P�j���
9�=�������w�T�F��R5:����J>)���������E����E�#et���e�����8����p��8���9�=7�m��`d�&
��.l������s5^,AzQ�9�=�/u(jA<��Z�c"�M;�({��ZIo�@g���P8�=g��X���1��[7��@
�+��9����"��jF��!��CY,w��)���nP����i�~M:�mh���L�pp"u���n?^�l���P+IT�3*���w�W���(5^��nC������������C�O��Xvb:�����3T�Xsq)i�]w({�LVF{i}���ill�{���@���7z��g���N���M�A�a�U)m8����3�m�RS����Ef?���[Z��a�>�PK*���U��������9��,|'t��#��� �2�7�m�p�-��4I���_��U$��g�32��%x�eP�`����
�������_��R�Bg���
u�	-�x_xf�e�2����g�	�����c������7i�������.���X��Md�j}��P�Aa�6�fAV%zL������k�[w��	o��+�k�O���������������!G>����<y�k�|��^|7�&�v�%P�-Q�s6DJ����J��0�=����[���^g�j���!R�x�-��`�-�F�diCMc������<j	�2�}����oj|j�T\-x[��O��^D5���u��I���]�R���;f�����VR�TN����z*k����aC�zRF��:c�{vw����5>���9`m�F��$<�;�6�^���cE�����u�a}��s��.N���T1���vSh��*w�;Xm���y�Y������0���(��j���Lj{�'8Rm���	N8k���5��t!7���=x���|����js�iJ�N=�SkEY�hQaC�	Q�����8���0�>��P���S����=�����qj�T^/�CE"�����u��-���3�P�
`���%8Nm�)��@�PBTw��vbT����l]O��m�Z:�J<���}�/W�o{l�1H�
��NM\S���4��8�s��&�
�0�W]������5x��/�e���6�1 lP5��C�$]u��+�����^W�C��cpE�l��#�H#���63p,;�8Pm�`����#���! ��,*����IS�$y�����d�

f(������fT[g�yj�T%�t��9����v��aCl�����m�j_���M���6�Zwt�[)S5|�T_�T��W�����Q�(��y>�`��h�@�����-���d�r�|���/�$b��c�f�q��pd(��j�T�[:�}���������������a���7���[����s��H*��t�k+]v��H�q�*������-����[�p��6��w�it=�����+���{h7�~��(�����k�:������,|g-�8�#[�L'�j����(r����"uC��5�w�:
������u���%QX��aH�o�~�����0��k��A����-&)<R��HG?��Y��#isu���
!��~bq
�Fc�F�Y	���=�(�6-qM�8eR��:����6��d���ak�+��Ix��;Cy^��s�����[�I��������5�o2������0�
��Tc.X�z�(U�5������J�^G��(�m�md�w���v�����b,1$�#$�w��0:�Hl��V��E\�1I���	;9.�<;<si94<5li�����
�b#i��M$���HcR����yS/U�.�$����=x�kF��A�|1�Q�>:��2~����aM����`R������x��D������'�8��?����8���kx���Ae���L�����9#�P�����:�y��b���t���PI[�D%4@�E����.U�4�������K�S���5h�s����:_��
k��#�m�dc	+�BGidwa��KDg�C�o/��I�[�y�-�z�l�����d�����t|���H3���M�~�I����52/�7������>W�z:_z`&c�E@�P����d��%��jWSU	����V��%�q��P��9����_
%��Chd��B�a��!4���1n4k���������,F���Q��%g���`b�#4��p��%�:��vZ�K��������4���[����v�5������c�`���� 2���V%�� kfMD�u���naK�J�q4�gi��T�SO�o�������\��R�����^�6����!;���	�"|n�{����w��b�v:#i�Z�R"���#��F>��V������km(����tj�V�')���/h���qT��K��;.(���7��^��n���}�V��uRv�(	�Y�JHL
��[S���0o;��G�(�K�
��X�#��V��Ah�3Cg$]�3Zp����T������**��Q����W�&��f��������14li�M�y���C��6��R���W�|����������#�wIb1��1�G��*OS��ths���izY1Cj$]�	�v����I�xR3'(onz��yg6��X�����Wg��z������6*&5/_we��t^���^r��_�b	3��������W��#��T7+���i��v�rp�"be�1--�~c����3�������W��<R#��T��%�egH�dk�.\=4�=��vf]Cl��z3��v#��W0x��HjL��	a7L�GL���]�c�����V���r��o���M����x�cKy��^"3`��[���"��&��[jP]g]aVTk����~Er�y�a��*
�fUk����
��H+���{�Z��^z���z�����7Q��t���CZM]!���H�
�H)�����$3k��_�
��}�g2Q���j��HQ���Kg��j��w��SD��:�3���t��M|w�&U�3r�����42��uV3����}P[��)�"Z��`?n�#�];U��i`ah8��
�R�<���oC:�]���<3��sct���������g*�����'�f��<�SOM-�){��z6�nUIk�qZW���w��;����
[�����\�x��wa5]|f��Ss(�`�o����a�H�b���Pl���=�Gjdk��f
����w��R���!29�^i�J���q���ou�r��o_���.�����"&3�FJC�����T���`
4v(LL��n{�TI7��d�Z#�o��-��Sk��������;u�������p�w�&���S��AI�X��fh��������j�UV�g<3�F(��m#��h:�m�
���_=�9#e�LS&�E������2�w4�$�5��Vl�G8������:�Q�u��a��5��-��-*B�L���z%.�����
��2�*%q�Z�#��u�c�p�h�[���&�����SKp|{
�_J��:5cU2x�
�6�Fj�.�Hj�>{:�=v�@m��r�n�~� �C�����n-���`;2���gh����Nr�r��5R�2�
N��5����5���Gkd��YK�6t�F*`�����g�����[�����F�m
�i�T������������R�q����1��V�Yg9��9�]J@�]�5R���+V�s���,�MVO
���-��C�������������a;+����o%���c�0:^�5���r�����m��He�����9��F��H��k�
����i�?��
���YWU.K���Bk�����3_�Wh�l��u�K��
{��,�����
��[m0��G���0���o�u9�m�������������y!�����~���f�� m�K�H�d��� m#N�#)1�6��m]M�����
���0R6�"���+�F�m�D��1C:Rmp�wE�3�����5p�G�����
�����X�(i[��l<����������e�r����~'�	z�X�\�&^)���]������Lj����wF�,Z����'��H%���w�
5���42��������
	����c�l���L���
-V��5�Z_\��������h�:L�-�YY)�K�2�P[�����dU���PY�;�^���s�unL�!j��e>�Q(b�&�H-�����j�|�&gi��F���/����S��Hv�e�$����'�������
�v��}������n�(29`�n��#_�F"��(�p����"D��:\�JU��>>+��y�/�0�����W��eKj�����A\\D���}��D&�U|�����uG�=��C��X�4xi�\�
�l�u�,��Ir��6������gzV��A���4XY�8p]���,��+I�F\���%���g�G �Y����\�-���	4�p�\��p,^�h�^3��]���_�^��K���!g\���Xd�+mh9p���pD����qQ[�V��~`�<@2��'R��7[\[J �S��r�����,�@P���ga��0�
w��T��r�J���
�z.��),�F�B{���������H�@���RUvz9p���>H�h�e9p���#<\Z_h�l��B������_BAA
��]�=�gfa�oB{����f�`U��_P�=a=C{$���
�1�-5�XM\!��B{d2k�PX�������������s�����Z� �0up����H�>OM1���.|l����������l]|����n
o�;��^��kkl�7Y�5����E�/��9pm�<c��'��Z��l�
�}i���P�B�A�m�����y\]�0C��v7P��_:r�O�	��#[z�
�N�*��n�"��6TEG�"������]��iM���`���������9��z���9I#s�z�3.a������Y���������XM8@X�X���R|+
�vrR->��M9WH��t����9�����-�����`%��d[����->��N��`���5����yk�h\�B�&�:ZYG�#P�Fz���P��B�|��Cj#�Q��	H����B$�0N����������!�[�����lC��PA�cm�*�4�a��e�'-A}|�[�AA��Bw����e}(�FKT�z��`������~�����'8pm�`��>X��p�>h�
��@�w�SE%����k�rm�Za�����@�m�*I;��1�BenE��EP]�A����Au�r�`�
�D���+���.S�T���P�]	����a���e�AU��u�����S�M���0�Q���w8�	r���HW�He=�m�����@&����
����1�m�e����c��?���Q�I�tL��2��nTY;,[0�M�v-���4;p5���������C&{������������m�z+�k�d1T�CO�#1��S��V_]���:��z1���
}Z������:����lVWt�i(��[�NGz`]3�}-��&���*bH"\��8�B){�v���lX�#������J������H#��!����
�B)���f��5��;�3X`��"�^)��E4��3`��A��I��aP�}<,E���t�%4�O=���"��<(���nm�:��^x��`{9���#)R�Ig4������}�AA��"j�Zo�M��3=�-fZ���2��=����_ae���f����	S����%���}����B;=7�0��r���^O�Kf�?�-h�<�5������-(R�fc(g'��k�5T��Z"����%P%(-�4J����p"��H����f-��S�Z�{��RBK$
�a�]�'t���}Yc"�W,\R�L"���z�QK9��34�j������o��k�%�5��0K��j�����K��g��tbz�<'o1D�g�a�1����aF�
A�������V��#y����<���l!KM�34�LT��$L�;�u*��W�L�+�~�W=9*��GD�bS����� �3^�p�.JA��>	Y���d��.�?������R���LV�W�Q���KG���ZU�A;��t@���e�<E&EP����@�I�,�*���3������! �6yBO6����C@$��(�uZ ���Y���r�����H7�VL�H�����U�yb�<I?��$<4\Rm��F1u�G7�'��Ug���]��@�3����.�����"[F����5��WR�U��tqB@d��X)����~���VN�	��
>#��,k�����H�a��}��@��"]��g$�G2����c���$�LM0D�`�jmT�c&G�����d'	Cl�hv�������39��'����UIxf��d�L�T��6�hv�U7�/���wFM��K��#���P(����.\)�}p���(�B��n�#�8��4�3����*�(X��{w��5mh��	�gdD�y�z���+�vH��D;�������b;�7 ��c�����P�=����&>������6���|��Z6�=@��aI���2��t���wj�lQH
F�!�_Q��z4�	�R`(7��H�;+x_����������.�X�1��>���o�)�����
uK���(��n��U[h�i<��#����!�����&��3����o�@��t�C�.����rX�MwGB�
�SC�|�C2�}��d��x���C�2�U��8UK�3�C&V��6\��x�b����p���v��Q�Rr@�sb��HM����p�'vv�����gW���_-�`w]��{R�����5L��
������F�H���Mm/������������}7�v��R��Q��WB<dwrRr����)�Q$�U��g�?�
N`A:�M�[G�'���,����u0�	�)@��r<C<��!2��
�vyl���B��O��'>�������AZ}[2���Pp�!4Z������,��5�4�bc�\q����7c���A��T-�5�}9`l0������H��A�����"������^����fv���qU(�3��~���Hzj���e��]��1����j����d���$�����v�p{�*YG�w�n2vv�wZ��r�����#2L�IF�'N���c}�������$���UL-;~���*�����t�Fz�>���4rK�I�e������EvYZ/=uv�znB����W5�H�kL���LEJ�32D�*b�D��(������|'#}���s�+abe�������)~�!Z�����sq�Q������7�kU�Z�;��0���VD����\�����
V%l�U>�:SG�2���06���
�H�Mf����e�^�i��| g��S+���nC�C|+�v`d��5bjJ�Q�p��m=g����[��0���b��m�n�bm��vBYG�
�Z&aU"��%�LjrU�';h�jA-��x;t:h�2S�F�8M�g�W"@d��N��H�`�tS��jd����Hg1�bG��Z���������2�i@B�Z�}�U�8h�-cU"��M��>�!�g�f�-�q�#��R�0�e��e���bW)?�C�##�~�	�pG�~����.�u�@��3�:�#��~gg�������(�HeDm��Z|����P���w�
]��
�����9Z��a=�,�\s2*D��/W�mh7 ����-��S;T���%U)8R-�����]zx[xj����?�a4�&�S�D9��4y6���n��Te������]�e&��������N�a��6Z!�;E�
��&����^6��"?Z!���w]��
�����\l/��`h��������Th���V���U	���s0i���VH���M|HW%�BS��q<n��Si$��tJ��1�_N��n���%�	���Dtq�Km�q�����;{�kN
���E��\a4�8�����mmJ��F�
M�J���:�k�PM�G�F��-v����W�U���wG<5�S�F��6dd]�:�
��S��v��7���b%����T��8�2�u��7��u�z]��R~��e�����'�gA4�8���7'��	�RY����s�<�k�	���8>p��-�#N�?tD��%j��;C�hB2���e�����"C��[�-D��,�S?� ����s�������_}��7���������W�����������?����������_������~|[J?������|���~���?�����������������������|����U�x��?����_�������������>������y�><���W�~�/�C����q��7�h{��g�}�����������������o?���w����
������U������>�������������������G����~��7?�����a��~�Q?������?~���g.>|z�o>�����>���o����{��g�Oz���������x�����?��W���o�����sib��?|��^����,�_|y�W����W�������?��������?�xM�������o����Q�'��?~��W>�����7�������������:��c���?���o����O�{f�����_�����}��������������1����~����������������~��W�����_�������������z�{�����������?�7��/����~��|�<����g����O��w�W��/���}����K����_������~�����)~�������������r�����_~Z�_~�������/>����?��o������S�~��_����?�����������o����7������������������W?���������?���_��������*�'

#118

Peter Geoghegan

pg@bowt.ie

about 6 years ago

In reply to: Peter Geoghegan (#116)

4 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Tue, Dec 3, 2019 at 12:13 PM Peter Geoghegan <pg@bowt.ie> wrote:

The new criteria/heuristic for unique indexes is very simple: If a
unique index has an existing item that is a duplicate on the incoming
item at the point that we might have to split the page, then apply
deduplication. Otherwise (when the incoming item has no duplicates),
don't apply deduplication at all -- just accept that we'll have to
split the page. We already cache the bounds of our initial binary
search in insert state, so we can reuse that information within
_bt_findinsertloc() when considering deduplication in unique indexes.

Attached is v26, which adds this new criteria/heuristic for unique
indexes. We now seem to consistently get good results with unique
indexes.

Other changes:

* A commit message is now included for the main patch/commit.

* The btree_deduplication GUC is now a boolean, since it is no longer
up to the user to indicate when deduplication is appropriate in unique
indexes (the new heuristic does that instead). The GUC now only
affects non-unique indexes.

* Simplified the user docs. They now only mention deduplication of
unique indexes in passing, in line with the general idea that
deduplication in unique indexes is an internal optimization.

* Fixed bug that made backwards scans that touch posting lists fail to
set LP_DEAD bits when that was possible (i.e. the kill_prior_tuple
optimization wasn't always applied there with posting lists, for no
good reason). Also documented the assumptions made by the new code in
_bt_readpage()/_bt_killitems() -- if that was clearer in the first
place, then the LP_DEAD/kill_prior_tuple bug might never have
happened.

* Fixed some memory leaks in nbtree VACUUM.

Still waiting for some review of the first patch, to get it out of the
way. Anastasia?

--
Peter Geoghegan

Attachments:

v26-0001-Remove-dead-pin-scan-code-from-nbtree-VACUUM.patchapplication/octet-stream; name=v26-0001-Remove-dead-pin-scan-code-from-nbtree-VACUUM.patchDownload

From 3666d8baaca3650d4a458a887933e7a286fe0018 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 20 Nov 2019 16:21:47 -0800
Subject: [PATCH v26 1/4] Remove dead "pin scan" code from nbtree VACUUM.

Finish off the work of commit 3e4b7d87 by completely removing the "pin
scan" code previously used by nbtree VACUUM:

* Don't track lastBlockVacuumed within nbtree.c VACUUM code anymore.

* Remove the lastBlockVacuumed field from xl_btree_vacuum WAL records
(nbtree leaf page VACUUM records).

* Remove the unnecessary extra call to _bt_delitems_vacuum() made
against the last block.  This occurred when VACUUM didn't have index
tuples to kill on the final block in the index, based on the assumption
that a final "pin scan" was still needed.   (Clearly a final pin scan
can never take place here, since the entire pin scan mechanism was
totally disabled by commit 3e4b7d87.)

Also, add a new ndeleted metadata field to xl_btree_vacuum, to replace
the unneeded lastBlockVacuumed field.  This isn't really needed either,
since we could continue to infer the array length in nbtxlog.c by using
the overall record length.  However, it will become useful when the
upcoming deduplication patch needs to add an "items updated" field to go
alongside it (besides, it doesn't seem like a good idea to leave the
xl_btree_vacuum struct without any fields; the C standard says that
that's undefined).

Discussion: https://postgr.es/m/CAH2-Wzn2pSqEOcBDAA40CnO82oEy-EOpE2bNh_XL_cfFoA86jw@mail.gmail.com
---
 src/include/access/nbtree.h           |  3 +-
 src/include/access/nbtxlog.h          | 25 ++-----
 src/backend/access/nbtree/nbtpage.c   | 35 +++++-----
 src/backend/access/nbtree/nbtree.c    | 74 ++-------------------
 src/backend/access/nbtree/nbtxlog.c   | 95 +--------------------------
 src/backend/access/rmgrdesc/nbtdesc.c |  3 +-
 6 files changed, 28 insertions(+), 207 deletions(-)

diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 18a2a3e71c..9833cc10bd 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -779,8 +779,7 @@ extern bool _bt_page_recyclable(Page page);
 extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems, Relation heapRel);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
-								OffsetNumber *itemnos, int nitems,
-								BlockNumber lastBlockVacuumed);
+								OffsetNumber *deletable, int ndeletable);
 extern int	_bt_pagedel(Relation rel, Buffer buf);
 
 /*
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 91b9ee00cf..71435a13b3 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -150,32 +150,17 @@ typedef struct xl_btree_reuse_page
  * The WAL record can represent deletion of any number of index tuples on a
  * single index page when executed by VACUUM.
  *
- * For MVCC scans, lastBlockVacuumed will be set to InvalidBlockNumber.
- * For a non-MVCC index scans there is an additional correctness requirement
- * for applying these changes during recovery, which is that we must do one
- * of these two things for every block in the index:
- *		* lock the block for cleanup and apply any required changes
- *		* EnsureBlockUnpinned()
- * The purpose of this is to ensure that no index scans started before we
- * finish scanning the index are still running by the time we begin to remove
- * heap tuples.
- *
- * Any changes to any one block are registered on just one WAL record. All
- * blocks that we need to run EnsureBlockUnpinned() are listed as a block range
- * starting from the last block vacuumed through until this one. Individual
- * block numbers aren't given.
- *
- * Note that the *last* WAL record in any vacuum of an index is allowed to
- * have a zero length array of offsets. Earlier records must have at least one.
+ * Note that the WAL record in any vacuum of an index must have at least one
+ * item to delete.
  */
 typedef struct xl_btree_vacuum
 {
-	BlockNumber lastBlockVacuumed;
+	uint32		ndeleted;
 
-	/* TARGET OFFSET NUMBERS FOLLOW */
+	/* DELETED TARGET OFFSET NUMBERS FOLLOW */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 268f869a36..66c79623cf 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -968,32 +968,27 @@ _bt_page_recyclable(Page page)
  * deleting the page it points to.
  *
  * This routine assumes that the caller has pinned and locked the buffer.
- * Also, the given itemnos *must* appear in increasing order in the array.
+ * Also, the given deletable array *must* be sorted in ascending order.
  *
- * We record VACUUMs and b-tree deletes differently in WAL. InHotStandby
- * we need to be able to pin all of the blocks in the btree in physical
- * order when replaying the effects of a VACUUM, just as we do for the
- * original VACUUM itself. lastBlockVacuumed allows us to tell whether an
- * intermediate range of blocks has had no changes at all by VACUUM,
- * and so must be scanned anyway during replay. We always write a WAL record
- * for the last block in the index, whether or not it contained any items
- * to be removed. This allows us to scan right up to end of index to
- * ensure correct locking.
+ * We record VACUUMs and b-tree deletes differently in WAL.  Deletes must
+ * generate recovery conflicts by accessing the heap inline, whereas VACUUMs
+ * can rely on the initial heap scan taking care of the problem (pruning would
+ * have generated the conflicts needed for hot standby already).
  */
 void
-_bt_delitems_vacuum(Relation rel, Buffer buf,
-					OffsetNumber *itemnos, int nitems,
-					BlockNumber lastBlockVacuumed)
+_bt_delitems_vacuum(Relation rel, Buffer buf, OffsetNumber *deletable,
+					int ndeletable)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
 
+	Assert(ndeletable > 0);
+
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
 	/* Fix the page */
-	if (nitems > 0)
-		PageIndexMultiDelete(page, itemnos, nitems);
+	PageIndexMultiDelete(page, deletable, ndeletable);
 
 	/*
 	 * We can clear the vacuum cycle ID since this page has certainly been
@@ -1019,7 +1014,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		XLogRecPtr	recptr;
 		xl_btree_vacuum xlrec_vacuum;
 
-		xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+		xlrec_vacuum.ndeleted = ndeletable;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1030,8 +1025,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		 * is.  When XLogInsert stores the whole buffer, the offsets array
 		 * need not be stored too.
 		 */
-		if (nitems > 0)
-			XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
+		XLogRegisterBufData(0, (char *) deletable, ndeletable *
+							sizeof(OffsetNumber));
 
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
@@ -1050,8 +1045,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
  * Also, the given itemnos *must* appear in increasing order in the array.
  *
  * This is nearly the same as _bt_delitems_vacuum as far as what it does to
- * the page, but the WAL logging considerations are quite different.  See
- * comments for _bt_delitems_vacuum.
+ * the page, but it needs to generate its own recovery conflicts by accessing
+ * the heap.  See comments for _bt_delitems_vacuum.
  */
 void
 _bt_delitems_delete(Relation rel, Buffer buf,
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index c67235ab80..bbc1376b0a 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -46,8 +46,6 @@ typedef struct
 	IndexBulkDeleteCallback callback;
 	void	   *callback_state;
 	BTCycleId	cycleid;
-	BlockNumber lastBlockVacuumed;	/* highest blkno actually vacuumed */
-	BlockNumber lastBlockLocked;	/* highest blkno we've cleanup-locked */
 	BlockNumber totFreePages;	/* true total # of free pages */
 	TransactionId oldestBtpoXact;
 	MemoryContext pagedelcontext;
@@ -978,8 +976,6 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	vstate.callback = callback;
 	vstate.callback_state = callback_state;
 	vstate.cycleid = cycleid;
-	vstate.lastBlockVacuumed = BTREE_METAPAGE;	/* Initialise at first block */
-	vstate.lastBlockLocked = BTREE_METAPAGE;
 	vstate.totFreePages = 0;
 	vstate.oldestBtpoXact = InvalidTransactionId;
 
@@ -1040,39 +1036,6 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 		}
 	}
 
-	/*
-	 * Check to see if we need to issue one final WAL record for this index,
-	 * which may be needed for correctness on a hot standby node when non-MVCC
-	 * index scans could take place.
-	 *
-	 * If the WAL is replayed in hot standby, the replay process needs to get
-	 * cleanup locks on all index leaf pages, just as we've been doing here.
-	 * However, we won't issue any WAL records about pages that have no items
-	 * to be deleted.  For pages between pages we've vacuumed, the replay code
-	 * will take locks under the direction of the lastBlockVacuumed fields in
-	 * the XLOG_BTREE_VACUUM WAL records.  To cover pages after the last one
-	 * we vacuum, we need to issue a dummy XLOG_BTREE_VACUUM WAL record
-	 * against the last leaf page in the index, if that one wasn't vacuumed.
-	 */
-	if (XLogStandbyInfoActive() &&
-		vstate.lastBlockVacuumed < vstate.lastBlockLocked)
-	{
-		Buffer		buf;
-
-		/*
-		 * The page should be valid, but we can't use _bt_getbuf() because we
-		 * want to use a nondefault buffer access strategy.  Since we aren't
-		 * going to delete any items, getting cleanup lock again is probably
-		 * overkill, but for consistency do that anyway.
-		 */
-		buf = ReadBufferExtended(rel, MAIN_FORKNUM, vstate.lastBlockLocked,
-								 RBM_NORMAL, info->strategy);
-		LockBufferForCleanup(buf);
-		_bt_checkpage(rel, buf);
-		_bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
-		_bt_relbuf(rel, buf);
-	}
-
 	MemoryContextDelete(vstate.pagedelcontext);
 
 	/*
@@ -1203,13 +1166,6 @@ restart:
 		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 		LockBufferForCleanup(buf);
 
-		/*
-		 * Remember highest leaf page number we've taken cleanup lock on; see
-		 * notes in btvacuumscan
-		 */
-		if (blkno > vstate->lastBlockLocked)
-			vstate->lastBlockLocked = blkno;
-
 		/*
 		 * Check whether we need to recurse back to earlier pages.  What we
 		 * are concerned about is a page split that happened since we started
@@ -1245,9 +1201,9 @@ restart:
 				htup = &(itup->t_tid);
 
 				/*
-				 * During Hot Standby we currently assume that
-				 * XLOG_BTREE_VACUUM records do not produce conflicts. That is
-				 * only true as long as the callback function depends only
+				 * During Hot Standby we currently assume that it's okay that
+				 * XLOG_BTREE_VACUUM records do not produce conflicts. This is
+				 * only safe as long as the callback function depends only
 				 * upon whether the index tuple refers to heap tuples removed
 				 * in the initial heap scan. When vacuum starts it derives a
 				 * value of OldestXmin. Backends taking later snapshots could
@@ -1276,29 +1232,7 @@ restart:
 		 */
 		if (ndeletable > 0)
 		{
-			/*
-			 * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
-			 * all information to the replay code to allow it to get a cleanup
-			 * lock on all pages between the previous lastBlockVacuumed and
-			 * this page. This ensures that WAL replay locks all leaf pages at
-			 * some point, which is important should non-MVCC scans be
-			 * requested. This is currently unused on standby, but we record
-			 * it anyway, so that the WAL contains the required information.
-			 *
-			 * Since we can visit leaf pages out-of-order when recursing,
-			 * replay might end up locking such pages an extra time, but it
-			 * doesn't seem worth the amount of bookkeeping it'd take to avoid
-			 * that.
-			 */
-			_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
-								vstate->lastBlockVacuumed);
-
-			/*
-			 * Remember highest leaf page number we've issued a
-			 * XLOG_BTREE_VACUUM WAL record for.
-			 */
-			if (blkno > vstate->lastBlockVacuumed)
-				vstate->lastBlockVacuumed = blkno;
+			_bt_delitems_vacuum(rel, buf, deletable, ndeletable);
 
 			stats->tuples_removed += ndeletable;
 			/* must recompute maxoff */
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 44f6283950..72a601bb22 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -386,107 +386,16 @@ btree_xlog_vacuum(XLogReaderState *record)
 	Buffer		buffer;
 	Page		page;
 	BTPageOpaque opaque;
-#ifdef UNUSED
 	xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
 
-	/*
-	 * This section of code is thought to be no longer needed, after analysis
-	 * of the calling paths. It is retained to allow the code to be reinstated
-	 * if a flaw is revealed in that thinking.
-	 *
-	 * If we are running non-MVCC scans using this index we need to do some
-	 * additional work to ensure correctness, which is known as a "pin scan"
-	 * described in more detail in next paragraphs. We used to do the extra
-	 * work in all cases, whereas we now avoid that work in most cases. If
-	 * lastBlockVacuumed is set to InvalidBlockNumber then we skip the
-	 * additional work required for the pin scan.
-	 *
-	 * Avoiding this extra work is important since it requires us to touch
-	 * every page in the index, so is an O(N) operation. Worse, it is an
-	 * operation performed in the foreground during redo, so it delays
-	 * replication directly.
-	 *
-	 * If queries might be active then we need to ensure every leaf page is
-	 * unpinned between the lastBlockVacuumed and the current block, if there
-	 * are any.  This prevents replay of the VACUUM from reaching the stage of
-	 * removing heap tuples while there could still be indexscans "in flight"
-	 * to those particular tuples for those scans which could be confused by
-	 * finding new tuples at the old TID locations (see nbtree/README).
-	 *
-	 * It might be worth checking if there are actually any backends running;
-	 * if not, we could just skip this.
-	 *
-	 * Since VACUUM can visit leaf pages out-of-order, it might issue records
-	 * with lastBlockVacuumed >= block; that's not an error, it just means
-	 * nothing to do now.
-	 *
-	 * Note: since we touch all pages in the range, we will lock non-leaf
-	 * pages, and also any empty (all-zero) pages that may be in the index. It
-	 * doesn't seem worth the complexity to avoid that.  But it's important
-	 * that HotStandbyActiveInReplay() will not return true if the database
-	 * isn't yet consistent; so we need not fear reading still-corrupt blocks
-	 * here during crash recovery.
-	 */
-	if (HotStandbyActiveInReplay() && BlockNumberIsValid(xlrec->lastBlockVacuumed))
-	{
-		RelFileNode thisrnode;
-		BlockNumber thisblkno;
-		BlockNumber blkno;
-
-		XLogRecGetBlockTag(record, 0, &thisrnode, NULL, &thisblkno);
-
-		for (blkno = xlrec->lastBlockVacuumed + 1; blkno < thisblkno; blkno++)
-		{
-			/*
-			 * We use RBM_NORMAL_NO_LOG mode because it's not an error
-			 * condition to see all-zero pages.  The original btvacuumpage
-			 * scan would have skipped over all-zero pages, noting them in FSM
-			 * but not bothering to initialize them just yet; so we mustn't
-			 * throw an error here.  (We could skip acquiring the cleanup lock
-			 * if PageIsNew, but it's probably not worth the cycles to test.)
-			 *
-			 * XXX we don't actually need to read the block, we just need to
-			 * confirm it is unpinned. If we had a special call into the
-			 * buffer manager we could optimise this so that if the block is
-			 * not in shared_buffers we confirm it as unpinned. Optimizing
-			 * this is now moot, since in most cases we avoid the scan.
-			 */
-			buffer = XLogReadBufferExtended(thisrnode, MAIN_FORKNUM, blkno,
-											RBM_NORMAL_NO_LOG);
-			if (BufferIsValid(buffer))
-			{
-				LockBufferForCleanup(buffer);
-				UnlockReleaseBuffer(buffer);
-			}
-		}
-	}
-#endif
-
-	/*
-	 * Like in btvacuumpage(), we need to take a cleanup lock on every leaf
-	 * page. See nbtree/README for details.
-	 */
 	if (XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &buffer)
 		== BLK_NEEDS_REDO)
 	{
-		char	   *ptr;
-		Size		len;
-
-		ptr = XLogRecGetBlockData(record, 0, &len);
+		char	   *ptr = XLogRecGetBlockData(record, 0, NULL);
 
 		page = (Page) BufferGetPage(buffer);
 
-		if (len > 0)
-		{
-			OffsetNumber *unused;
-			OffsetNumber *unend;
-
-			unused = (OffsetNumber *) ptr;
-			unend = (OffsetNumber *) ((char *) ptr + len);
-
-			if ((unend - unused) > 0)
-				PageIndexMultiDelete(page, unused, unend - unused);
-		}
+		PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 
 		/*
 		 * Mark the page as not containing any LP_DEAD items --- see comments
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 4ee6d04a68..497f8dc77e 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -46,8 +46,7 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "lastBlockVacuumed %u",
-								 xlrec->lastBlockVacuumed);
+				appendStringInfo(buf, "ndeleted %u", xlrec->ndeleted);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
-- 
2.17.1

v26-0003-Teach-pageinspect-about-nbtree-posting-lists.patchapplication/octet-stream; name=v26-0003-Teach-pageinspect-about-nbtree-posting-lists.patchDownload

From 67f0f0c2c41226e3afc14989c4943de24aaaa3d6 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v26 3/4] Teach pageinspect about nbtree posting lists.

Add a column for posting list TIDs to bt_page_items().  Also add a
column that displays a single heap TID value for each tuple, regardless
of whether or not "ctid" is used for heap TID.  In the case of posting
list tuples, the value is the lowest heap TID in the posting list.
Arguably I should have done this when commit dd299df8 went in, since
that added a pivot tuple representation that could have a heap TID but
didn't use ctid for that purpose.

Also add a boolean column that displays the LP_DEAD bit value for each
non-pivot tuple.

No version bump for the pageinspect extension, since there hasn't been a
stable release since the last version bump (see commit 58b4cb30).
---
 contrib/pageinspect/btreefuncs.c              | 111 +++++++++++++++---
 contrib/pageinspect/expected/btree.out        |   6 +
 contrib/pageinspect/pageinspect--1.7--1.8.sql |  36 ++++++
 doc/src/sgml/pageinspect.sgml                 |  80 +++++++------
 4 files changed, 181 insertions(+), 52 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 78cdc69ec7..17f7ad186e 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -31,9 +31,11 @@
 #include "access/relation.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_am.h"
+#include "catalog/pg_type.h"
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "pageinspect.h"
+#include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
 #include "utils/varlena.h"
@@ -45,6 +47,8 @@ PG_FUNCTION_INFO_V1(bt_page_stats);
 
 #define IS_INDEX(r) ((r)->rd_rel->relkind == RELKIND_INDEX)
 #define IS_BTREE(r) ((r)->rd_rel->relam == BTREE_AM_OID)
+#define DatumGetItemPointer(X)	 ((ItemPointer) DatumGetPointer(X))
+#define ItemPointerGetDatum(X)	 PointerGetDatum(X)
 
 /* note: BlockNumber is unsigned, hence can't be negative */
 #define CHECK_RELATION_BLOCK_RANGE(rel, blkno) { \
@@ -243,6 +247,9 @@ struct user_args
 {
 	Page		page;
 	OffsetNumber offset;
+	bool		leafpage;
+	bool		rightmost;
+	TupleDesc	tupd;
 };
 
 /*-------------------------------------------------------
@@ -252,17 +259,25 @@ struct user_args
  * ------------------------------------------------------
  */
 static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, struct user_args *uargs)
 {
-	char	   *values[6];
+	Page		page = uargs->page;
+	OffsetNumber offset = uargs->offset;
+	bool		leafpage = uargs->leafpage;
+	bool		rightmost = uargs->rightmost;
+	bool		pivotoffset;
+	Datum		values[9];
+	bool		nulls[9];
 	HeapTuple	tuple;
 	ItemId		id;
 	IndexTuple	itup;
 	int			j;
 	int			off;
 	int			dlen;
-	char	   *dump;
+	char	   *dump,
+			   *datacstring;
 	char	   *ptr;
+	ItemPointer htid;
 
 	id = PageGetItemId(page, offset);
 
@@ -272,18 +287,27 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	itup = (IndexTuple) PageGetItem(page, id);
 
 	j = 0;
-	values[j++] = psprintf("%d", offset);
-	values[j++] = psprintf("(%u,%u)",
-						   ItemPointerGetBlockNumberNoCheck(&itup->t_tid),
-						   ItemPointerGetOffsetNumberNoCheck(&itup->t_tid));
-	values[j++] = psprintf("%d", (int) IndexTupleSize(itup));
-	values[j++] = psprintf("%c", IndexTupleHasNulls(itup) ? 't' : 'f');
-	values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
+	memset(nulls, 0, sizeof(nulls));
+	values[j++] = DatumGetInt16(offset);
+	values[j++] = ItemPointerGetDatum(&itup->t_tid);
+	values[j++] = Int32GetDatum((int) IndexTupleSize(itup));
+	values[j++] = BoolGetDatum(IndexTupleHasNulls(itup));
+	values[j++] = BoolGetDatum(IndexTupleHasVarwidths(itup));
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
 	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+
+	/*
+	 * Make sure that "data" column does not include posting list or pivot
+	 * tuple representation of heap TID
+	 */
+	if (BTreeTupleIsPosting(itup))
+		dlen -= IndexTupleSize(itup) - BTreeTupleGetPostingOffset(itup);
+	else if (BTreeTupleIsPivot(itup) && BTreeTupleGetHeapTID(itup) != NULL)
+		dlen -= MAXALIGN(sizeof(ItemPointerData));
+
 	dump = palloc0(dlen * 3 + 1);
-	values[j] = dump;
+	datacstring = dump;
 	for (off = 0; off < dlen; off++)
 	{
 		if (off > 0)
@@ -291,8 +315,57 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 		sprintf(dump, "%02x", *(ptr + off) & 0xff);
 		dump += 2;
 	}
+	values[j++] = CStringGetTextDatum(datacstring);
+	pfree(datacstring);
 
-	tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
+	/*
+	 * Avoid indicating that pivot tuple from !heapkeyspace index (which won't
+	 * have v4+ status bit set) is dead or has a heap TID -- that can only
+	 * happen with non-pivot tuples.  (Most backend code can use the
+	 * heapkeyspace field from the metapage to figure out which representation
+	 * to expect, but we have to be a bit creative here.)
+	 */
+	pivotoffset = (!leafpage || (!rightmost && offset == P_HIKEY));
+
+	/* LP_DEAD status bit */
+	if (!pivotoffset)
+		values[j++] = BoolGetDatum(ItemIdIsDead(id));
+	else
+		nulls[j++] = true;
+
+	htid = BTreeTupleGetHeapTID(itup);
+	if (pivotoffset && !BTreeTupleIsPivot(itup))
+		htid = NULL;
+
+	if (htid)
+		values[j++] = ItemPointerGetDatum(htid);
+	else
+		nulls[j++] = true;
+
+	if (BTreeTupleIsPosting(itup))
+	{
+		/* build an array of item pointers */
+		ItemPointer tids;
+		Datum	   *tids_datum;
+		int			nposting;
+
+		tids = BTreeTupleGetPosting(itup);
+		nposting = BTreeTupleGetNPosting(itup);
+		tids_datum = (Datum *) palloc(nposting * sizeof(Datum));
+		for (int i = 0; i < nposting; i++)
+			tids_datum[i] = ItemPointerGetDatum(&tids[i]);
+		values[j++] = PointerGetDatum(construct_array(tids_datum,
+													  nposting,
+													  TIDOID,
+													  sizeof(ItemPointerData),
+													  false, 's'));
+		pfree(tids_datum);
+	}
+	else
+		nulls[j++] = true;
+
+	/* Build and return the result tuple */
+	tuple = heap_form_tuple(uargs->tupd, values, nulls);
 
 	return HeapTupleGetDatum(tuple);
 }
@@ -378,12 +451,13 @@ bt_page_items(PG_FUNCTION_ARGS)
 			elog(NOTICE, "page is deleted");
 
 		fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+		uargs->leafpage = P_ISLEAF(opaque);
+		uargs->rightmost = P_RIGHTMOST(opaque);
 
 		/* Build a tuple descriptor for our result type */
 		if (get_call_result_type(fcinfo, NULL, &tupleDesc) != TYPEFUNC_COMPOSITE)
 			elog(ERROR, "return type must be a row type");
-
-		fctx->attinmeta = TupleDescGetAttInMetadata(tupleDesc);
+		uargs->tupd = tupleDesc;
 
 		fctx->user_fctx = uargs;
 
@@ -395,7 +469,7 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
@@ -463,12 +537,13 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 			elog(NOTICE, "page is deleted");
 
 		fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+		uargs->leafpage = P_ISLEAF(opaque);
+		uargs->rightmost = P_RIGHTMOST(opaque);
 
 		/* Build a tuple descriptor for our result type */
 		if (get_call_result_type(fcinfo, NULL, &tupleDesc) != TYPEFUNC_COMPOSITE)
 			elog(ERROR, "return type must be a row type");
-
-		fctx->attinmeta = TupleDescGetAttInMetadata(tupleDesc);
+		uargs->tupd = tupleDesc;
 
 		fctx->user_fctx = uargs;
 
@@ -480,7 +555,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..1d45cd5c1e 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -41,6 +41,9 @@ itemlen    | 16
 nulls      | f
 vars       | f
 data       | 01 00 00 00 00 00 00 01
+dead       | f
+htid       | (0,1)
+tids       | 
 
 SELECT * FROM bt_page_items('test1_a_idx', 2);
 ERROR:  block number out of range
@@ -54,6 +57,9 @@ itemlen    | 16
 nulls      | f
 vars       | f
 data       | 01 00 00 00 00 00 00 01
+dead       | f
+htid       | (0,1)
+tids       | 
 
 SELECT * FROM bt_page_items(get_raw_page('test1_a_idx', 2));
 ERROR:  block number 2 is out of range for relation "test1_a_idx"
diff --git a/contrib/pageinspect/pageinspect--1.7--1.8.sql b/contrib/pageinspect/pageinspect--1.7--1.8.sql
index 2a7c4b3516..70f1ab0467 100644
--- a/contrib/pageinspect/pageinspect--1.7--1.8.sql
+++ b/contrib/pageinspect/pageinspect--1.7--1.8.sql
@@ -14,3 +14,39 @@ CREATE FUNCTION heap_tuple_infomask_flags(
 RETURNS record
 AS 'MODULE_PATHNAME', 'heap_tuple_infomask_flags'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items(text, int4)
+--
+DROP FUNCTION bt_page_items(text, int4);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT dead boolean,
+    OUT htid tid,
+    OUT tids tid[])
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items(bytea)
+--
+DROP FUNCTION bt_page_items(bytea);
+CREATE FUNCTION bt_page_items(IN page bytea,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT dead boolean,
+    OUT htid tid,
+    OUT tids tid[])
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items_bytea'
+LANGUAGE C STRICT PARALLEL SAFE;
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 7e2e1487d7..1763e9c6f0 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -329,11 +329,11 @@ test=# SELECT * FROM bt_page_stats('pg_cast_oid_index', 1);
 -[ RECORD 1 ]-+-----
 blkno         | 1
 type          | l
-live_items    | 256
+live_items    | 224
 dead_items    | 0
-avg_item_size | 12
+avg_item_size | 16
 page_size     | 8192
-free_size     | 4056
+free_size     | 3668
 btpo_prev     | 0
 btpo_next     | 0
 btpo          | 0
@@ -356,33 +356,45 @@ btpo_flags    | 3
       <function>bt_page_items</function> returns detailed information about
       all of the items on a B-tree index page.  For example:
 <screen>
-test=# SELECT * FROM bt_page_items('pg_cast_oid_index', 1);
- itemoffset |  ctid   | itemlen | nulls | vars |    data
-------------+---------+---------+-------+------+-------------
-          1 | (0,1)   |      12 | f     | f    | 23 27 00 00
-          2 | (0,2)   |      12 | f     | f    | 24 27 00 00
-          3 | (0,3)   |      12 | f     | f    | 25 27 00 00
-          4 | (0,4)   |      12 | f     | f    | 26 27 00 00
-          5 | (0,5)   |      12 | f     | f    | 27 27 00 00
-          6 | (0,6)   |      12 | f     | f    | 28 27 00 00
-          7 | (0,7)   |      12 | f     | f    | 29 27 00 00
-          8 | (0,8)   |      12 | f     | f    | 2a 27 00 00
+regression=# SELECT * FROM bt_page_items('tenk2_unique1', 5);
+ itemoffset |   ctid   | itemlen | nulls | vars |          data           | dead |   htid   | tids
+------------+----------+---------+-------+------+-------------------------+------+----------+------
+          1 | (40,1)   |      16 | f     | f    | b8 05 00 00 00 00 00 00 |      |          |
+          2 | (58,11)  |      16 | f     | f    | 4a 04 00 00 00 00 00 00 | f    | (58,11)  |
+          3 | (266,4)  |      16 | f     | f    | 4b 04 00 00 00 00 00 00 | f    | (266,4)  |
+          4 | (279,25) |      16 | f     | f    | 4c 04 00 00 00 00 00 00 | f    | (279,25) |
+          5 | (333,11) |      16 | f     | f    | 4d 04 00 00 00 00 00 00 | f    | (333,11) |
+          6 | (87,24)  |      16 | f     | f    | 4e 04 00 00 00 00 00 00 | f    | (87,24)  |
+          7 | (38,22)  |      16 | f     | f    | 4f 04 00 00 00 00 00 00 | f    | (38,22)  |
+          8 | (272,17) |      16 | f     | f    | 50 04 00 00 00 00 00 00 | f    | (272,17) |
 </screen>
-      In a B-tree leaf page, <structfield>ctid</structfield> points to a heap tuple.
-      In an internal page, the block number part of <structfield>ctid</structfield>
-      points to another page in the index itself, while the offset part
-      (the second number) is ignored and is usually 1.
+      In a B-tree leaf page, <structfield>ctid</structfield> usually
+      points to a heap tuple, and <structfield>dead</structfield> may
+      indicate that the item has its <literal>LP_DEAD</literal> bit
+      set.  In an internal page, the block number part of
+      <structfield>ctid</structfield> points to another page in the
+      index itself, while the offset part (the second number) encodes
+      metadata about the tuple.  Posting list tuples on leaf pages
+      also use <structfield>ctid</structfield> for metadata.
+      <structfield>htid</structfield> always shows a single heap TID
+      for the tuple, regardless of how it is represented (internal
+      page tuples may need to store a heap TID when there are many
+      duplicate tuples on descendent leaf pages).
+      <structfield>tids</structfield> is a list of TIDs that is stored
+      within posting list tuples (tuples created by deduplication).
      </para>
      <para>
       Note that the first item on any non-rightmost page (any page with
       a non-zero value in the <structfield>btpo_next</structfield> field) is the
       page's <quote>high key</quote>, meaning its <structfield>data</structfield>
       serves as an upper bound on all items appearing on the page, while
-      its <structfield>ctid</structfield> field is meaningless.  Also, on non-leaf
-      pages, the first real data item (the first item that is not a high
-      key) is a <quote>minus infinity</quote> item, with no actual value
-      in its <structfield>data</structfield> field.  Such an item does have a valid
-      downlink in its <structfield>ctid</structfield> field, however.
+      its <structfield>ctid</structfield> field does not point to
+      another block.  Also, on non-leaf pages, the first real data item
+      (the first item that is not a high key) is a <quote>minus
+      infinity</quote> item, with no actual value in its
+      <structfield>data</structfield> field.  Such an item does have a
+      valid downlink in its <structfield>ctid</structfield> field,
+      however.
      </para>
     </listitem>
    </varlistentry>
@@ -402,17 +414,17 @@ test=# SELECT * FROM bt_page_items('pg_cast_oid_index', 1);
       with <function>get_raw_page</function> should be passed as argument.  So
       the last example could also be rewritten like this:
 <screen>
-test=# SELECT * FROM bt_page_items(get_raw_page('pg_cast_oid_index', 1));
- itemoffset |  ctid   | itemlen | nulls | vars |    data
-------------+---------+---------+-------+------+-------------
-          1 | (0,1)   |      12 | f     | f    | 23 27 00 00
-          2 | (0,2)   |      12 | f     | f    | 24 27 00 00
-          3 | (0,3)   |      12 | f     | f    | 25 27 00 00
-          4 | (0,4)   |      12 | f     | f    | 26 27 00 00
-          5 | (0,5)   |      12 | f     | f    | 27 27 00 00
-          6 | (0,6)   |      12 | f     | f    | 28 27 00 00
-          7 | (0,7)   |      12 | f     | f    | 29 27 00 00
-          8 | (0,8)   |      12 | f     | f    | 2a 27 00 00
+regression=# SELECT * FROM bt_page_items(get_raw_page('tenk2_unique1', 5));
+ itemoffset |   ctid   | itemlen | nulls | vars |          data           | dead |   htid   | tids
+------------+----------+---------+-------+------+-------------------------+------+----------+------
+          1 | (40,1)   |      16 | f     | f    | b8 05 00 00 00 00 00 00 |      |          |
+          2 | (58,11)  |      16 | f     | f    | 4a 04 00 00 00 00 00 00 | f    | (58,11)  |
+          3 | (266,4)  |      16 | f     | f    | 4b 04 00 00 00 00 00 00 | f    | (266,4)  |
+          4 | (279,25) |      16 | f     | f    | 4c 04 00 00 00 00 00 00 | f    | (279,25) |
+          5 | (333,11) |      16 | f     | f    | 4d 04 00 00 00 00 00 00 | f    | (333,11) |
+          6 | (87,24)  |      16 | f     | f    | 4e 04 00 00 00 00 00 00 | f    | (87,24)  |
+          7 | (38,22)  |      16 | f     | f    | 4f 04 00 00 00 00 00 00 | f    | (38,22)  |
+          8 | (272,17) |      16 | f     | f    | 50 04 00 00 00 00 00 00 | f    | (272,17) |
 </screen>
       All the other details are the same as explained in the previous item.
      </para>
-- 
2.17.1

v26-0004-DEBUG-Show-index-values-in-pageinspect.patchapplication/octet-stream; name=v26-0004-DEBUG-Show-index-values-in-pageinspect.patchDownload

From 5a1c94cf3ed296cda240236b044dc73eb3afb694 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Nov 2019 19:35:30 -0800
Subject: [PATCH v26 4/4] DEBUG: Show index values in pageinspect

This is not intended for commit.  It is included as a convenience for
reviewers.
---
 contrib/pageinspect/btreefuncs.c       | 65 ++++++++++++++++++--------
 contrib/pageinspect/expected/btree.out |  2 +-
 2 files changed, 47 insertions(+), 20 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 17f7ad186e..4eab8df098 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -27,6 +27,7 @@
 
 #include "postgres.h"
 
+#include "access/genam.h"
 #include "access/nbtree.h"
 #include "access/relation.h"
 #include "catalog/namespace.h"
@@ -245,6 +246,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
  */
 struct user_args
 {
+	Relation	rel;
 	Page		page;
 	OffsetNumber offset;
 	bool		leafpage;
@@ -261,6 +263,7 @@ struct user_args
 static Datum
 bt_page_print_tuples(FuncCallContext *fctx, struct user_args *uargs)
 {
+	Relation	rel = uargs->rel;
 	Page		page = uargs->page;
 	OffsetNumber offset = uargs->offset;
 	bool		leafpage = uargs->leafpage;
@@ -295,26 +298,48 @@ bt_page_print_tuples(FuncCallContext *fctx, struct user_args *uargs)
 	values[j++] = BoolGetDatum(IndexTupleHasVarwidths(itup));
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
-	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
-
-	/*
-	 * Make sure that "data" column does not include posting list or pivot
-	 * tuple representation of heap TID
-	 */
-	if (BTreeTupleIsPosting(itup))
-		dlen -= IndexTupleSize(itup) - BTreeTupleGetPostingOffset(itup);
-	else if (BTreeTupleIsPivot(itup) && BTreeTupleGetHeapTID(itup) != NULL)
-		dlen -= MAXALIGN(sizeof(ItemPointerData));
-
-	dump = palloc0(dlen * 3 + 1);
-	datacstring = dump;
-	for (off = 0; off < dlen; off++)
+	if (rel)
 	{
-		if (off > 0)
-			*dump++ = ' ';
-		sprintf(dump, "%02x", *(ptr + off) & 0xff);
-		dump += 2;
+		TupleDesc	itupdesc = RelationGetDescr(rel);
+		Datum		datvalues[INDEX_MAX_KEYS];
+		bool		isnull[INDEX_MAX_KEYS];
+		int			natts;
+		int			indnkeyatts = rel->rd_index->indnkeyatts;
+
+		natts = BTreeTupleGetNAtts(itup, rel);
+
+		itupdesc->natts = Min(indnkeyatts, natts);
+		memset(&isnull, 0xFF, sizeof(isnull));
+		index_deform_tuple(itup, itupdesc, datvalues, isnull);
+		rel->rd_index->indnkeyatts = natts;
+		datacstring = BuildIndexValueDescription(rel, datvalues, isnull);
+		itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+		rel->rd_index->indnkeyatts = indnkeyatts;
 	}
+	else
+	{
+		dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+
+		/*
+		 * Make sure that "data" column does not include posting list or pivot
+		 * tuple representation of heap TID
+		 */
+		if (BTreeTupleIsPosting(itup))
+			dlen -= IndexTupleSize(itup) - BTreeTupleGetPostingOffset(itup);
+		else if (BTreeTupleIsPivot(itup) && BTreeTupleGetHeapTID(itup) != NULL)
+			dlen -= MAXALIGN(sizeof(ItemPointerData));
+
+		dump = palloc0(dlen * 3 + 1);
+		datacstring = dump;
+		for (off = 0; off < dlen; off++)
+		{
+			if (off > 0)
+				*dump++ = ' ';
+			sprintf(dump, "%02x", *(ptr + off) & 0xff);
+			dump += 2;
+		}
+	}
+
 	values[j++] = CStringGetTextDatum(datacstring);
 	pfree(datacstring);
 
@@ -437,11 +462,11 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = rel;
 		uargs->page = palloc(BLCKSZ);
 		memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
 
 		UnlockReleaseBuffer(buffer);
-		relation_close(rel, AccessShareLock);
 
 		uargs->offset = FirstOffsetNumber;
 
@@ -475,6 +500,7 @@ bt_page_items(PG_FUNCTION_ARGS)
 	}
 	else
 	{
+		relation_close(uargs->rel, AccessShareLock);
 		pfree(uargs->page);
 		pfree(uargs);
 		SRF_RETURN_DONE(fctx);
@@ -522,6 +548,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = NULL;
 		uargs->page = VARDATA(raw_page);
 
 		uargs->offset = FirstOffsetNumber;
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 1d45cd5c1e..3da5f37c3e 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -40,7 +40,7 @@ ctid       | (0,1)
 itemlen    | 16
 nulls      | f
 vars       | f
-data       | 01 00 00 00 00 00 00 01
+data       | (a)=(72057594037927937)
 dead       | f
 htid       | (0,1)
 tids       | 
-- 
2.17.1

v26-0002-Add-deduplication-to-nbtree.patchapplication/octet-stream; name=v26-0002-Add-deduplication-to-nbtree.patchDownload

From b7182da88b4744e5994a9f6382c66dc3d1857292 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 25 Sep 2019 10:08:53 -0700
Subject: [PATCH v26 2/4] Add deduplication to nbtree.

Deduplication reduces the storage overhead of duplicates in indexes that
use the standard nbtree index access method.  The deduplication process
is applied lazily, after the point where opportunistic deletion of
LP_DEAD-marked index tuples occurs.  Deduplication is only applied at
the point where a leaf page split will be required if deduplication
can't free up enough space.  New "posting list tuples" are formed by
merging together existing duplicate tuples.  The physical representation
of the items on an nbtree leaf page is made more space efficient by
deduplication, but the logical contents of the page are not changed.

Testing has shown that nbtree deduplication can generally make indexes
with about 10 or 15 tuples for each distinct key value about 2.5X - 4X
smaller, even with single column integer indexes (e.g., an index on a
referencing column that accompanies a foreign key).  Much larger
reductions in index size are possible in less common cases, where
individual index tuple keys happen to be large.  The final size of
single column nbtree indexes comes close to the final size of a similar
contrib/btree_gin index, at least in cases where GIN's posting list
compression isn't very effective.

The lazy approach taken by nbtree has significant advantages over a
GIN-style eager approach.  Most individual inserts of index tuples have
exactly the same overhead as before.  The extra overhead of
deduplication is amortized across insertions, just like the overhead of
page splits.  The "key space" of indexes works in the same way as it has
since commit dd299df8 (the commit which made heap TID a tiebreaker
column), since only the physical representation of tuples is changed.
Furthermore, deduplication can be turned on or off as needed, or applied
selectively when required.  The split point choice logic doesn't need to
be changed, since posting list tuples are just tuples with payload, much
like tuples with non-key columns in INCLUDE indexes. (nbtsplitloc.c is
still optimized to make intelligent choices in the presence of posting
list tuples, though only because suffix truncation will routinely make
new high keys far far smaller than the non-pivot tuple they're derived
from).

Unique indexes can also make use of deduplication, though the strategy
used has significantly differences.  The high-level goal is to entirely
prevent "unnecessary" page splits -- splits caused only by a short term
burst of index tuple versions.  This is often a concern with frequently
updated tables where UPDATEs always modify at least one indexed column
(making it impossible for the table am to use an optimization like
heapam's heap-only tuples optimization).  Deduplication in unique
indexes effectively "buys time" for existing nbtree garbage collection
mechanisms to run and prevent these page splits (the LP_DEAD bit setting
performed during the uniqueness check is the most important mechanism
for controlling bloat with affected workloads).

Author: Anastasia Lubennikova, Peter Geoghegan
Reviewed-By: Peter Geoghegan
Discussion:
    https://postgr.es/m/55E4051B.7020209@postgrespro.ru
    https://postgr.es/m/4ab6e2db-bcee-f4cf-0916-3a06e6ccbb55@postgrespro.ru
---
 src/include/access/nbtree.h                   | 390 ++++++++--
 src/include/access/nbtxlog.h                  |  71 +-
 src/include/access/rmgrlist.h                 |   2 +-
 src/backend/access/common/reloptions.c        |   9 +
 src/backend/access/index/genam.c              |   4 +
 src/backend/access/nbtree/Makefile            |   1 +
 src/backend/access/nbtree/README              | 130 +++-
 src/backend/access/nbtree/nbtdedup.c          | 695 ++++++++++++++++++
 src/backend/access/nbtree/nbtinsert.c         | 366 ++++++++-
 src/backend/access/nbtree/nbtpage.c           | 218 +++++-
 src/backend/access/nbtree/nbtree.c            | 171 ++++-
 src/backend/access/nbtree/nbtsearch.c         | 265 ++++++-
 src/backend/access/nbtree/nbtsort.c           | 202 ++++-
 src/backend/access/nbtree/nbtsplitloc.c       |  39 +-
 src/backend/access/nbtree/nbtutils.c          | 211 +++++-
 src/backend/access/nbtree/nbtxlog.c           | 234 +++++-
 src/backend/access/rmgrdesc/nbtdesc.c         |  25 +-
 src/backend/utils/misc/guc.c                  |  10 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/bin/psql/tab-complete.c                   |   4 +-
 contrib/amcheck/verify_nbtree.c               | 185 ++++-
 doc/src/sgml/btree.sgml                       | 115 ++-
 doc/src/sgml/charset.sgml                     |   9 +-
 doc/src/sgml/config.sgml                      |  25 +
 doc/src/sgml/ref/create_index.sgml            |  37 +-
 src/test/regress/expected/btree_index.out     |  16 +
 src/test/regress/sql/btree_index.sql          |  17 +
 27 files changed, 3179 insertions(+), 273 deletions(-)
 create mode 100644 src/backend/access/nbtree/nbtdedup.c

diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 9833cc10bd..1c5ac669b0 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -24,6 +24,9 @@
 #include "storage/bufmgr.h"
 #include "storage/shm_toc.h"
 
+/* GUC parameter */
+extern bool btree_deduplication;
+
 /* There's room for a 16-bit vacuum cycle ID in BTPageOpaqueData */
 typedef uint16 BTCycleId;
 
@@ -108,6 +111,7 @@ typedef struct BTMetaPageData
 										 * pages */
 	float8		btm_last_cleanup_num_heap_tuples;	/* number of heap tuples
 													 * during last cleanup */
+	bool		btm_safededup;	/* deduplication known to be safe? */
 } BTMetaPageData;
 
 #define BTPageGetMeta(p) \
@@ -115,7 +119,8 @@ typedef struct BTMetaPageData
 
 /*
  * The current Btree version is 4.  That's what you'll get when you create
- * a new index.
+ * a new index.  The btm_safededup field can only be set if this happened
+ * on Postgres 13, but it's safe to read with version 3 indexes.
  *
  * Btree version 3 was used in PostgreSQL v11.  It is mostly the same as
  * version 4, but heap TIDs were not part of the keyspace.  Index tuples
@@ -132,8 +137,8 @@ typedef struct BTMetaPageData
 #define BTREE_METAPAGE	0		/* first page is meta */
 #define BTREE_MAGIC		0x053162	/* magic number in metapage */
 #define BTREE_VERSION	4		/* current version number */
-#define BTREE_MIN_VERSION	2	/* minimal supported version number */
-#define BTREE_NOVAC_VERSION	3	/* minimal version with all meta fields */
+#define BTREE_MIN_VERSION	2	/* minimum supported version */
+#define BTREE_NOVAC_VERSION	3	/* version with all meta fields set */
 
 /*
  * Maximum size of a btree index entry, including its tuple header.
@@ -155,6 +160,26 @@ typedef struct BTMetaPageData
 	MAXALIGN_DOWN((PageGetPageSize(page) - \
 				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
 				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
+/*
+ * MaxBTreeIndexTuplesPerPage is an upper bound on the number of "logical"
+ * tuples that may be stored on a btree leaf page.  This is comparable to
+ * the generic/physical MaxIndexTuplesPerPage upper bound.  A separate
+ * upper bound is needed in certain contexts due to posting list tuples,
+ * which only use a single physical page entry to store many logical
+ * tuples.  (MaxBTreeIndexTuplesPerPage is used to size the per-page
+ * temporary buffers used by index scans.)
+ *
+ * Note: we don't bother considering per-physical-tuple overheads here to
+ * keep things simple (value is based on how many elements a single array
+ * of heap TIDs must have to fill the space between the page header and
+ * special area).  The value is slightly higher (i.e. more conservative)
+ * than necessary as a result, which is considered acceptable.  There will
+ * only be three (very large) physical posting list tuples in leaf pages
+ * that have the largest possible number of heap TIDs/logical tuples.
+ */
+#define MaxBTreeIndexTuplesPerPage \
+	(int) ((BLCKSZ - SizeOfPageHeaderData - sizeof(BTPageOpaqueData)) / \
+		   sizeof(ItemPointerData))
 
 /*
  * The leaf-page fillfactor defaults to 90% but is user-adjustable.
@@ -230,16 +255,15 @@ typedef struct BTMetaPageData
  * tuples (non-pivot tuples).  _bt_check_natts() enforces the rules
  * described here.
  *
- * Non-pivot tuple format:
+ * Non-pivot tuple format (plain/non-posting variant):
  *
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
- * All other types of index tuples ("pivot" tuples) only have key columns,
- * since pivot tuples only exist to represent how the key space is
+ * Non-pivot tuples complement pivot tuples, which only have key columns.
+ * The sole purpose of pivot tuples is to represent how the key space is
  * separated.  In general, any B-Tree index that has more than one level
  * (i.e. any index that does not just consist of a metapage and a single
  * leaf root page) must have some number of pivot tuples, since pivot
@@ -283,40 +307,128 @@ typedef struct BTMetaPageData
  * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
  * number of columns/attributes <= INDEX_MAX_KEYS.
  *
+ * Sometimes non-pivot tuples also use a representation that repurposes
+ * t_tid to store metadata rather than a TID.  Postgres 13 introduced a new
+ * non-pivot tuple format to support deduplication: posting list tuples.
+ * Deduplication merges together multiple equal non-pivot tuples into a
+ * logically equivalent, space efficient representation.  A posting list is
+ * an array of ItemPointerData elements.  Regular non-pivot tuples are
+ * merged together to form posting list tuples lazily, at the point where
+ * we'd otherwise have to split a leaf page.
+ *
+ * Posting tuple format (alternative non-pivot tuple representation):
+ *
+ *  t_tid | t_info | key values | posting list (TID array)
+ *
+ * Posting list tuples are recognized as such by having the
+ * INDEX_ALT_TID_MASK status bit set in t_info and the BT_IS_POSTING status
+ * bit set in t_tid.  These flags redefine the content of the posting
+ * tuple's t_tid to store an offset to the posting list, as well as the
+ * total number of posting list array elements.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items present in the tuple, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.  Like any non-pivot tuple, the number of columns stored is
+ * always implicitly the total number in the index (in practice there can
+ * never be non-key columns stored, since deduplication is not supported
+ * with INCLUDE indexes).
+ *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_N_POSTING_OFFSET_MASK	0x0FFF
 #define BT_HEAP_TID_ATTR			0x1000
-
-/* Get/set downlink block number */
-#define BTreeInnerTupleGetDownLink(itup) \
-	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
-#define BTreeInnerTupleSetDownLink(itup, blkno) \
-	ItemPointerSetBlockNumber(&((itup)->t_tid), (blkno))
+#define BT_IS_POSTING				0x2000
 
 /*
- * Get/set leaf page highkey's link. During the second phase of deletion, the
- * target leaf page's high key may point to an ancestor page (at all other
- * times, the leaf level high key's link is not used).  See the nbtree README
- * for full details.
+ * Note: BTreeTupleIsPivot() can have false negatives (but not false
+ * positives) when used with !heapkeyspace indexes
  */
-#define BTreeTupleGetTopParent(itup) \
-	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
-#define BTreeTupleSetTopParent(itup, blkno)	\
-	do { \
-		ItemPointerSetBlockNumber(&((itup)->t_tid), (blkno)); \
-		BTreeTupleSetNAtts((itup), 0); \
-	} while(0)
+static inline bool
+BTreeTupleIsPivot(IndexTuple itup)
+{
+	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+		return false;
+	/* absence of BT_IS_POSTING in offset number indicates pivot tuple */
+	if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) & BT_IS_POSTING) != 0)
+		return false;
+
+	return true;
+}
+
+static inline bool
+BTreeTupleIsPosting(IndexTuple itup)
+{
+	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+		return false;
+	/* presence of BT_IS_POSTING in offset number indicates posting tuple */
+	if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) & BT_IS_POSTING) == 0)
+		return false;
+
+	return true;
+}
+
+static inline void
+BTreeSetPosting(IndexTuple itup, int n, int off)
+{
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	Assert(n > 1 && (n & BT_N_POSTING_OFFSET_MASK) == n);
+	ItemPointerSetOffsetNumber(&itup->t_tid, (n | BT_IS_POSTING));
+	ItemPointerSetBlockNumber(&itup->t_tid, off);
+}
+
+static inline uint16
+BTreeTupleGetNPosting(IndexTuple itup)
+{
+	OffsetNumber existing;
+
+	Assert(BTreeTupleIsPosting(itup));
+
+	existing = ItemPointerGetOffsetNumberNoCheck(&itup->t_tid);
+	return (existing & BT_N_POSTING_OFFSET_MASK);
+}
+
+static inline uint32
+BTreeTupleGetPostingOffset(IndexTuple itup)
+{
+	Assert(BTreeTupleIsPosting(itup));
+
+	return ItemPointerGetBlockNumberNoCheck(&itup->t_tid);
+}
+
+static inline ItemPointer
+BTreeTupleGetPosting(IndexTuple itup)
+{
+	return (ItemPointer) ((char *) itup + BTreeTupleGetPostingOffset(itup));
+}
+
+static inline ItemPointer
+BTreeTupleGetPostingN(IndexTuple itup, int n)
+{
+	return BTreeTupleGetPosting(itup) + n;
+}
+
+/* Get/set downlink block number */
+static inline BlockNumber
+BTreeInnerTupleGetDownLink(IndexTuple itup)
+{
+	return ItemPointerGetBlockNumberNoCheck(&itup->t_tid);
+}
+
+static inline void
+BTreeInnerTupleSetDownLink(IndexTuple itup, BlockNumber blkno)
+{
+	ItemPointerSetBlockNumber(&itup->t_tid, blkno);
+}
 
 /*
  * Get/set number of attributes within B-tree index tuple.
@@ -327,43 +439,100 @@ typedef struct BTMetaPageData
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		(BTreeTupleIsPivot(itup)) ? \
 		( \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
 		IndexRelationGetNumberOfAttributes(rel) \
 	)
-#define BTreeTupleSetNAtts(itup, n) \
-	do { \
-		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
-	} while(0)
+
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int n)
+{
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	Assert((n & BT_N_KEYS_OFFSET_MASK) == n);
+	/* absence of BT_IS_POSTING in offset number indicates pivot tuple */
+	ItemPointerSetOffsetNumber(&itup->t_tid, n);
+}
 
 /*
- * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Get/set leaf page highkey's link. During the second phase of deletion, the
+ * target leaf page's high key may point to an ancestor page (at all other
+ * times, the leaf level high key's link is not used).  See the nbtree README
+ * for full details.
  */
-#define BTreeTupleGetHeapTID(itup) \
-	( \
-	  (itup)->t_info & INDEX_ALT_TID_MASK && \
-	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
-	  ( \
-		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
-					   sizeof(ItemPointerData)) \
-	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
-	)
+static inline BlockNumber
+BTreeTupleGetTopParent(IndexTuple itup)
+{
+	return ItemPointerGetBlockNumberNoCheck(&itup->t_tid);
+}
+
+static inline void
+BTreeTupleSetTopParent(IndexTuple itup, BlockNumber blkno)
+{
+	ItemPointerSetBlockNumber(&itup->t_tid, blkno);
+	BTreeTupleSetNAtts(itup, 0);
+}
+
 /*
- * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * Get tiebreaker heap TID attribute, if any.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
  */
-#define BTreeTupleSetAltHeapTID(itup) \
-	do { \
-		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
-								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
-	} while(0)
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+	if (BTreeTupleIsPivot(itup))
+	{
+		/* Pivot tuple heap TID representation? */
+		if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+			 BT_HEAP_TID_ATTR) != 0)
+			return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+								  sizeof(ItemPointerData));
+
+		/* Heap TID attribute was truncated */
+		return NULL;
+	}
+	else if (BTreeTupleIsPosting(itup))
+		return BTreeTupleGetPosting(itup);
+
+	return &itup->t_tid;
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list tuple.  Works with
+ * non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxHeapTID(IndexTuple itup)
+{
+	Assert(!BTreeTupleIsPivot(itup));
+
+	if (BTreeTupleIsPosting(itup))
+	{
+		uint16		nposting = BTreeTupleGetNPosting(itup);
+
+		return BTreeTupleGetPostingN(itup, nposting - 1);
+	}
+
+	return &itup->t_tid;
+}
+
+/*
+ * Set the heap TID attribute for a pivot tuple
+ */
+static inline void
+BTreeTupleSetAltHeapTID(IndexTuple itup)
+{
+	OffsetNumber existing;
+
+	Assert(BTreeTupleIsPivot(itup));
+
+	existing = ItemPointerGetOffsetNumberNoCheck(&itup->t_tid);
+	ItemPointerSetOffsetNumber(&itup->t_tid, existing | BT_HEAP_TID_ATTR);
+}
 
 /*
  *	Operator strategy numbers for B-tree have been moved to access/stratnum.h,
@@ -435,6 +604,11 @@ typedef BTStackData *BTStack;
  * indexes whose version is >= version 4.  It's convenient to keep this close
  * by, rather than accessing the metapage repeatedly.
  *
+ * safededup is set to indicate that index may use dynamic deduplication
+ * safely (index storage parameter separately indicates if deduplication is
+ * currently in use).  This is also a property of the index relation rather
+ * than an indexscan that is kept around for convenience.
+ *
  * anynullkeys indicates if any of the keys had NULL value when scankey was
  * built from index tuple (note that already-truncated tuple key attributes
  * set NULL as a placeholder key value, which also affects value of
@@ -470,6 +644,7 @@ typedef BTStackData *BTStack;
 typedef struct BTScanInsertData
 {
 	bool		heapkeyspace;
+	bool		safededup;
 	bool		anynullkeys;
 	bool		nextkey;
 	bool		pivotsearch;
@@ -508,10 +683,68 @@ typedef struct BTInsertStateData
 	bool		bounds_valid;
 	OffsetNumber low;
 	OffsetNumber stricthigh;
+
+	/*
+	 * if _bt_binsrch_insert found the location inside existing posting list,
+	 * save the position inside the list.  This will be -1 in rare cases where
+	 * the overlapping posting list is LP_DEAD.
+	 */
+	int			postingoff;
 } BTInsertStateData;
 
 typedef BTInsertStateData *BTInsertState;
 
+/*
+ * State used to representing a pending posting list during deduplication.
+ *
+ * Each entry represents a group of consecutive items from the page, starting
+ * from page offset number 'baseoff', which is the offset number of the "base"
+ * tuple on the page undergoing deduplication.  'nitems' is the total number
+ * of items from the page that will be merged to make a new posting tuple.
+ *
+ * Note: 'nitems' means the number of physical index tuples/line pointers on
+ * the page, starting with and including the item at offset number 'baseoff'
+ * (so nitems should be at least 2 when interval is used).  These existing
+ * tuples may be posting list tuples or regular tuples.
+ */
+typedef struct BTDedupInterval
+{
+	OffsetNumber baseoff;
+	uint16		nitems;
+} BTDedupInterval;
+
+/*
+ * Btree-private state used to deduplicate items on a leaf page
+ */
+typedef struct BTDedupStateData
+{
+	Relation	rel;
+	/* Deduplication status info for entire page/operation */
+	Size		maxitemsize;	/* Limit on size of final tuple */
+	bool		checkingunique; /* Use unique index strategy? */
+	OffsetNumber skippedbase;	/* First offset skipped by checkingunique */
+
+	/* Metadata about current pending posting list */
+	ItemPointer htids;			/* Heap TIDs in pending posting list */
+	int			nhtids;			/* # heap TIDs in nhtids array */
+	int			nitems;			/* See BTDedupInterval definition */
+	Size		alltupsize;		/* Includes line pointer overhead */
+
+	/* Metadata about base tuple of current pending posting list */
+	IndexTuple	base;			/* Use to form new posting list */
+	OffsetNumber baseoff;		/* page offset of base */
+	Size		basetupsize;	/* base size without posting list */
+
+	/*
+	 * Pending posting list.  Contains information about a group of
+	 * consecutive items that will be deduplicated by creating a new posting
+	 * list tuple.
+	 */
+	BTDedupInterval interval;
+} BTDedupStateData;
+
+typedef BTDedupStateData *BTDedupState;
+
 /*
  * BTScanOpaqueData is the btree-private state needed for an indexscan.
  * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
@@ -535,7 +768,10 @@ typedef BTInsertStateData *BTInsertState;
  * If we are doing an index-only scan, we save the entire IndexTuple for each
  * matched item, otherwise only its heap TID and offset.  The IndexTuples go
  * into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array.  Posting list tuples store a "base" tuple once,
+ * allowing the same key to be returned for each logical tuple associated
+ * with the physical posting list tuple (i.e. for each TID from the posting
+ * list).
  */
 
 typedef struct BTScanPosItem	/* what we remember about each match */
@@ -579,7 +815,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxBTreeIndexTuplesPerPage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -687,6 +923,7 @@ typedef struct BTOptions
 	int			fillfactor;		/* page fill factor in percent (0..100) */
 	/* fraction of newly inserted tuples prior to trigger index cleanup */
 	float8		vacuum_cleanup_index_scale_factor;
+	bool		deduplication;	/* Use deduplication where safe? */
 } BTOptions;
 
 #define BTGetFillFactor(relation) \
@@ -695,8 +932,16 @@ typedef struct BTOptions
 	 (relation)->rd_options ? \
 	 ((BTOptions *) (relation)->rd_options)->fillfactor : \
 	 BTREE_DEFAULT_FILLFACTOR)
+#define BTGetUseDedup(relation) \
+	(AssertMacro(relation->rd_rel->relkind == RELKIND_INDEX && \
+				 relation->rd_rel->relam == BTREE_AM_OID), \
+	((relation)->rd_options ? \
+	 ((BTOptions *) (relation)->rd_options)->deduplication : \
+	 BTGetUseDedupGUC(relation)))
 #define BTGetTargetPageFreeSpace(relation) \
 	(BLCKSZ * (100 - BTGetFillFactor(relation)) / 100)
+#define BTGetUseDedupGUC(relation) \
+	(relation->rd_index->indisunique || btree_deduplication)
 
 /*
  * Constant definition for progress reporting.  Phase numbers must match
@@ -743,6 +988,22 @@ extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
 extern void _bt_parallel_done(IndexScanDesc scan);
 extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
 
+/*
+ * prototypes for functions in nbtdedup.c
+ */
+extern void _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+							   IndexTuple newitem, Size newitemsz,
+							   bool checkingunique);
+extern void _bt_dedup_start_pending(BTDedupState state, IndexTuple base,
+									OffsetNumber base_off);
+extern bool _bt_dedup_save_htid(BTDedupState state, IndexTuple itup);
+extern Size _bt_dedup_finish_pending(Buffer buffer, BTDedupState state,
+									 bool need_wal);
+extern IndexTuple _bt_form_posting(IndexTuple tuple, ItemPointer htids,
+								   int nhtids);
+extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
+								   int postingoff);
+
 /*
  * prototypes for functions in nbtinsert.c
  */
@@ -761,14 +1022,16 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
 /*
  * prototypes for functions in nbtpage.c
  */
-extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level);
+extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+							 bool safededup);
 extern void _bt_update_meta_cleanup_info(Relation rel,
 										 TransactionId oldestBtpoXact, float8 numHeapTuples);
 extern void _bt_upgrademetapage(Page page);
 extern Buffer _bt_getroot(Relation rel, int access);
 extern Buffer _bt_gettrueroot(Relation rel);
 extern int	_bt_getrootheight(Relation rel);
-extern bool _bt_heapkeyspace(Relation rel);
+extern void _bt_metaversion(Relation rel, bool *heapkeyspace,
+							bool *safededup);
 extern void _bt_checkpage(Relation rel, Buffer buf);
 extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
 extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -779,7 +1042,9 @@ extern bool _bt_page_recyclable(Page page);
 extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems, Relation heapRel);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
-								OffsetNumber *deletable, int ndeletable);
+								OffsetNumber *deletable, int ndeletable,
+								OffsetNumber *updateitemnos,
+								IndexTuple *updated, int nupdateable);
 extern int	_bt_pagedel(Relation rel, Buffer buf);
 
 /*
@@ -829,6 +1094,7 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern bool _bt_opclasses_support_dedup(Relation index);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 71435a13b3..ba0c3eb1a2 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP_PAGE	0x50	/* deduplicate tuples on leaf page */
+#define XLOG_BTREE_INSERT_POST	0x60	/* add index tuple with posting split */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -53,21 +54,32 @@ typedef struct xl_btree_metadata
 	uint32		fastlevel;
 	TransactionId oldest_btpo_xact;
 	float8		last_cleanup_num_heap_tuples;
+	bool		safededup;
 } xl_btree_metadata;
 
 /*
  * This is what we need to know about simple (without split) insert.
  *
- * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
- * Note that INSERT_META implies it's not a leaf page.
+ * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META, and
+ * INSERT_POST.  Note that INSERT_META and INSERT_UPPER implies it's not a
+ * leaf page, while INSERT_POST and INSERT_LEAF imply that it is.
  *
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page
  * Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
  * Backup Blk 2: xl_btree_metadata, if INSERT_META
+ *
+ * Note: The new tuple is actually the "original" new item in the posting
+ * list split insert case (i.e. the INSERT_POST case).  A split offset for
+ * the posting list is logged before the original new item.  Recovery needs
+ * both, since it must do an in-place update of the existing posting list
+ * that was split as an extra step.  Also, recovery generates a "final"
+ * newitem.  See _bt_swap_posting().
  */
 typedef struct xl_btree_insert
 {
 	OffsetNumber offnum;
+	/* posting split offset (INSERT_POST only) */
+	/* new tuple that was inserted (or orignewitem in INSERT_POST case) */
 } xl_btree_insert;
 
 #define SizeOfBtreeInsert	(offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
@@ -91,9 +103,18 @@ typedef struct xl_btree_insert
  *
  * Backup Blk 0: original page / new left page
  *
- * The left page's data portion contains the new item, if it's the _L variant.
- * An IndexTuple representing the high key of the left page must follow with
- * either variant.
+ * The left page's data portion contains the new item, if it's the _L variant
+ * (though _R variant page split records with a posting list split sometimes
+ * need to include newitem).  An IndexTuple representing the high key of the
+ * left page must follow in all cases.
+ *
+ * The newitem is actually an "original" newitem when a posting list split
+ * occurs that requires than the original posting list be updated in passing.
+ * Recovery recognizes this case when postingoff is set.  This corresponds to
+ * the xl_btree_insert INSERT_POST case.  Note that postingoff will be set to
+ * zero (no posting split) when a posting list split occurs where both
+ * original posting list and newitem go on the right page, since recovery
+ * doesn't need to consider the posting list split at all.
  *
  * Backup Blk 1: new right page
  *
@@ -111,10 +132,26 @@ typedef struct xl_btree_split
 {
 	uint32		level;			/* tree level of page being split */
 	OffsetNumber firstright;	/* first item moved to right page */
-	OffsetNumber newitemoff;	/* new item's offset (useful for _L variant) */
+	OffsetNumber newitemoff;	/* new item's offset */
+	uint16		postingoff;		/* offset inside orig posting tuple */
 } xl_btree_split;
 
-#define SizeOfBtreeSplit	(offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit	(offsetof(xl_btree_split, postingoff) + sizeof(uint16))
+
+/*
+ * When page is deduplicated, consecutive groups of tuples with equal keys are
+ * merged together into posting list tuples.
+ *
+ * The WAL record represents the interval that describes the posing tuple
+ * that should be added to the page.
+ */
+typedef struct xl_btree_dedup
+{
+	OffsetNumber baseoff;
+	uint16		nitems;
+} xl_btree_dedup;
+
+#define SizeOfBtreeDedup 	(offsetof(xl_btree_dedup, nitems) + sizeof(uint16))
 
 /*
  * This is what we need to know about delete of individual leaf index tuples.
@@ -148,19 +185,25 @@ typedef struct xl_btree_reuse_page
 /*
  * This is what we need to know about vacuum of individual leaf index tuples.
  * The WAL record can represent deletion of any number of index tuples on a
- * single index page when executed by VACUUM.
+ * single index page when executed by VACUUM.  It can also support "updates"
+ * of index tuples, which are actually deletions of "logical" tuples contained
+ * in an existing posting list tuple that will still have some remaining
+ * logical tuples once VACUUM finishes.
  *
  * Note that the WAL record in any vacuum of an index must have at least one
- * item to delete.
+ * item to delete or update.
  */
 typedef struct xl_btree_vacuum
 {
-	uint32		ndeleted;
+	uint16		ndeleted;
+	uint16		nupdated;
 
 	/* DELETED TARGET OFFSET NUMBERS FOLLOW */
+	/* UPDATED TARGET OFFSET NUMBERS FOLLOW */
+	/* UPDATED TUPLES TO ADD BACK FOLLOW */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, nupdated) + sizeof(uint16))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.
@@ -241,6 +284,8 @@ typedef struct xl_btree_newroot
 extern void btree_redo(XLogReaderState *record);
 extern void btree_desc(StringInfo buf, XLogReaderState *record);
 extern const char *btree_identify(uint8 info);
+extern void btree_xlog_startup(void);
+extern void btree_xlog_cleanup(void);
 extern void btree_mask(char *pagedata, BlockNumber blkno);
 
 #endif							/* NBTXLOG_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 3c0db2ccf5..2b8c6c7fc8 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -36,7 +36,7 @@ PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL,
 PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
 PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
 PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask)
 PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask)
 PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
 PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 48377ace24..2b37afd9e5 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -158,6 +158,15 @@ static relopt_bool boolRelOpts[] =
 		},
 		true
 	},
+	{
+		{
+			"deduplication",
+			"Enables deduplication on btree index leaf pages",
+			RELOPT_KIND_BTREE,
+			ShareUpdateExclusiveLock
+		},
+		true
+	},
 	/* list terminator */
 	{{NULL}}
 };
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 2599b5d342..6e1dc596e1 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -276,6 +276,10 @@ BuildIndexValueDescription(Relation indexRelation,
 /*
  * Get the latestRemovedXid from the table entries pointed at by the index
  * tuples being deleted.
+ *
+ * Note: index access methods that don't consistently use the standard
+ * IndexTuple + heap TID item pointer representation will need to provide
+ * their own version of this function.
  */
 TransactionId
 index_compute_xid_horizon_for_tuples(Relation irel,
diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index bf245f5dab..d69808e78c 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = \
 	nbtcompare.o \
+	nbtdedup.o \
 	nbtinsert.o \
 	nbtpage.o \
 	nbtree.o \
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6db203e75c..f1d1d59ef4 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
 like a hint bit for a heap tuple), but physically removing tuples requires
 exclusive lock.  In the current code we try to remove LP_DEAD tuples when
 we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already).  Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)
 
 This leaves the index in a state where it has no entry for a dead tuple
 that still exists in the heap.  This is not a problem for the current
@@ -709,6 +712,131 @@ if it must.  When a page that's already full of duplicates must be split,
 the fallback strategy assumes that duplicates are mostly inserted in
 ascending heap TID order.  The page is split in a way that leaves the left
 half of the page mostly full, and the right half of the page mostly empty.
+The overall effect is that leaf page splits gracefully adapt to inserts of
+large groups of duplicates, maximizing space utilization.  Note also that
+"trapping" large groups of duplicates on the same leaf page like this makes
+deduplication more efficient.  Deduplication can be performed infrequently,
+while freeing just as much space.
+
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits (the goals for
+deduplication in unique indexes are rather different; see "Deduplication in
+unique indexes" for details).  Deduplication alters the physical
+representation of tuples without changing the logical contents of the
+index, and without adding overhead to read queries.  Non-pivot tuples are
+merged together into a single physical tuple with a posting list (a simple
+array of heap TIDs with the standard item pointer format).  Deduplication
+is always applied lazily, at the point where it would otherwise be
+necessary to perform a page split.  It occurs only when LP_DEAD items have
+been removed, as our last line of defense against splitting a leaf page.
+We can set the LP_DEAD bit with posting list tuples, though only when all
+table tuples are known dead. (Bitmap scans cannot perform LP_DEAD bit
+setting, and are the common case with indexes that contain lots of
+duplicates, so this downside is considered acceptable.)
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists.  A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple.  An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication.  Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple.  When the incoming
+tuple really does overlap with an existing posting list, a posting list
+split is performed.  Posting list splits work in a way that more or less
+preserves the illusion that all incoming tuples do not need to be merged
+with any existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list.  The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list.  The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list.  We make
+space in the posting list by removing the heap TID that became the new
+item.  The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists.  Also, overall
+space utilization is improved by keeping existing posting lists large.
+
+The representation of posting lists is almost identical to the posting
+lists used by GIN, so it would be straightforward to apply GIN's varbyte
+encoding compression scheme to individual posting lists.  Posting list
+compression would break the assumptions made by posting list splits about
+page space accounting, though, so it's not clear how compression could be
+integrated with nbtree.  Besides, posting list compression does not offer a
+compelling trade-off for nbtree, since in general nbtree is optimized for
+consistent performance with many concurrent readers and writers.  A major
+goal of nbtree's lazy approach to deduplication is to limit the performance
+impact of deduplication with random updates.  Even concurrent append-only
+inserts of the same key value will tend to have inserts of individual index
+tuples in an order that doesn't quite match heap TID order.  In general,
+delaying deduplication avoids many unnecessary posting list splits, and
+minimizes page level fragmentation.
+
+Deduplication in unique indexes
+-------------------------------
+
+Like all index access methods, nbtree does not have direct knowledge of
+versioning or of MVCC; it deals only with physical tuples.  However, unique
+indexes implicitly give nbtree basic information about tuple versioning,
+since by definition zero or one tuples of any given key value can be
+visible to any possible MVCC snapshot (excluding index entries with NULL
+values).  When optimizations such as heapam's Heap-only tuples (HOT) happen
+to be ineffective, nbtree's on-the-fly deletion of tuples in unique indexes
+can be very important with UPDATE-heavy workloads.  Unique checking's
+LP_DEAD bit setting reliably attempts to kill old, equal index tuple
+versions.  Importantly, this prevents (or at least delays) page splits that
+are necessary only because a leaf page must contain multiple physical
+tuples for the same logical row.
+
+Very often, the range of values that can be placed on a given leaf page in
+a unique index is fixed and permanent.  For example, a primary key on an
+identity column is very likely to only have page splits caused by the
+insertion of new logical rows when the rightmost leaf page is split.
+Splitting any other leaf page is unlikely to occur unless there is a
+short-term need to have multiple versions/index tuples for the same logical
+row version.  Splitting a leaf page purely to store multiple versions
+should be considered pathological, since it permanently degrades the index
+structure in order to absorb a temporary burst of duplicates.  (Page
+deletion can never reverse the page split, since it can only be applied to
+leaf pages that are completely empty.)
+
+Deduplication within unique indexes is useful purely as a final line of
+defense against version-driven page splits.  The strategy used is
+significantly different to the "standard" deduplication strategy.  Unique
+index leaf pages only get a deduplication pass when an insertion (that
+might have to split the page) observed an existing duplicate on the page in
+passing.  This is based on the assumption that deduplication will only work
+out when _all_ new insertions are duplicates from UPDATEs.  This may mean
+that we miss an opportunity to delay a page split, but that's okay because
+our ultimate goal is to delay leaf page splits _indefinitely_ (i.e. to
+prevent them altogether); there is little point in trying to delay a split
+that is probably inevitable anyway.  This allows us to avoid the overhead
+of attempting to deduplicate in the common case where a leaf page in a
+unique index cannot ever have any duplicates (e.g. with a unique index on
+an append-only table).
+
+Furthermore, it's particularly important that LP_DEAD bit setting not be
+restricted by deduplication in the case of unique indexes.  This is why
+unique index deduplication is applied in an incremental, cooperative
+fashion (it is "extra lazy").  Besides, there is no possible advantage to
+having unique index deduplication do more than the bare minimum to avoid an
+immediate page split -- that is its only goal.  Deleting items on the page
+is always preferable to deduplication.
+
+Note that the btree_deduplication GUC is not considered when deduplicating
+in a unique index, since unique index deduplication is treated as a
+separate, internal-only optimization that doesn't need to be configured by
+users. (The deduplication storage parameter is still respected, though only
+for debugging purposes.)
 
 Notes About Data Representation
 -------------------------------
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
new file mode 100644
index 0000000000..aea59dbb24
--- /dev/null
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -0,0 +1,695 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtdedup.c
+ *	  Deduplicate items in Lehman and Yao btrees for Postgres.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/nbtree/nbtdedup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/nbtxlog.h"
+#include "miscadmin.h"
+#include "utils/rel.h"
+
+
+/*
+ * Try to deduplicate items to free at least enough space to avoid a page
+ * split.  This function should be called during insertion, only after LP_DEAD
+ * items were removed by _bt_vacuum_one_page() to prevent a page split.
+ * (We'll have to kill LP_DEAD items here when the page's BTP_HAS_GARBAGE hint
+ * was not set, but that should be rare.)
+ *
+ * The strategy for !checkingunique callers is to perform as much
+ * deduplication as possible to free as much space as possible now, since
+ * making it harder to set LP_DEAD bits is considered an acceptable price for
+ * not having to deduplicate the same page many times.  It is unlikely that
+ * the items on the page will have their LP_DEAD bit set in the future, since
+ * that hasn't happened before now (besides, entire posting lists can still
+ * have their LP_DEAD bit set).
+ *
+ * The strategy for checkingunique callers is completely different.
+ * Deduplication works in tandem with garbage collection, especially the
+ * LP_DEAD bit setting that takes place in _bt_check_unique().  We give up as
+ * soon as it becomes clear that enough space has been made available to
+ * insert newitem without needing to split the page.  Also, we merge together
+ * larger groups of duplicate tuples first (merging together two index tuples
+ * usually saves very little space), and avoid merging together existing
+ * posting list tuples.  The goal is to generate posting lists with TIDs that
+ * are "close together in time", in order to maximize the chances of an
+ * LP_DEAD bit being set opportunistically.  See nbtree/README for more
+ * information on deduplication within unique indexes.
+ *
+ * nbtinsert.c caller should call _bt_vacuum_one_page() before calling here.
+ * Note that this routine will delete all items on the page that have their
+ * LP_DEAD bit set, even when page's BTP_HAS_GARBAGE bit is not set (a rare
+ * edge case).  Caller can rely on that to avoid inserting a new tuple that
+ * happens to overlap with an existing posting list tuple with its LP_DEAD bit
+ * set. (Calling here with a newitemsz of 0 will reliably delete the existing
+ * item, making it possible to avoid unsetting the LP_DEAD bit just to insert
+ * the new item.  In general, posting list splits should never have to deal
+ * with a posting list tuple with its LP_DEAD bit set.)
+ *
+ * Note: If newitem contains NULL values in key attributes, caller will be
+ * !checkingunique even when rel is a unique index.  The page in question will
+ * usually have many existing items with NULLs.
+ */
+void
+_bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+				   IndexTuple newitem, Size newitemsz, bool checkingunique)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buffer);
+	BTPageOpaque oopaque;
+	BTDedupState state = NULL;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+	OffsetNumber deletable[MaxIndexTuplesPerPage];
+	bool		minimal = checkingunique;
+	int			ndeletable = 0;
+	Size		pagesaving = 0;
+	int			count = 0;
+	bool		singlevalue = false;
+
+	oopaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	/* init deduplication state needed to build posting tuples */
+	state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+	state->rel = rel;
+
+	state->maxitemsize = Min(BTMaxItemSize(page), INDEX_SIZE_MASK);
+	state->checkingunique = checkingunique;
+	state->skippedbase = InvalidOffsetNumber;
+	/* Metadata about current pending posting list */
+	state->htids = NULL;
+	state->nhtids = 0;
+	state->nitems = 0;
+	state->alltupsize = 0;
+	/* Metadata about based tuple of current pending posting list */
+	state->base = NULL;
+	state->baseoff = InvalidOffsetNumber;
+	state->basetupsize = 0;
+
+	minoff = P_FIRSTDATAKEY(oopaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/*
+	 * Delete dead tuples before deduplication runs, since it seems like a
+	 * good idea to avoid merging together tuples with their LP_DEAD bit set
+	 * (LP_DEAD bits would be correctly unset below if we allowed it, but we
+	 * don't rely on that).  It is possible to have dead tuples without having
+	 * the BTP_HAS_GARBAGE flag set, which is why caller might not have
+	 * already taken care of this in the preferred way (i.e. by calling
+	 * _bt_vacuum_one_page()).
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+
+		if (ItemIdIsDead(itemid))
+			deletable[ndeletable++] = offnum;
+	}
+
+	if (ndeletable > 0)
+	{
+		/*
+		 * Skip duplication in rare cases where there were LP_DEAD items
+		 * encountered here when that frees sufficient space for caller to
+		 * avoid a page split
+		 */
+		_bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+		if (PageGetFreeSpace(page) >= newitemsz)
+		{
+			pfree(state);
+			return;
+		}
+
+		/* Continue with deduplication */
+		minoff = P_FIRSTDATAKEY(oopaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+	}
+
+	/* Make sure that new page won't have garbage flag set */
+	oopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+	newitemsz += sizeof(ItemIdData);
+	/* Conservatively size array */
+	state->htids = palloc(state->maxitemsize);
+
+	/*
+	 * Determine if a "single value" strategy page split is likely to occur
+	 * shortly after deduplication finishes.  It should be possible for the
+	 * single value split to find a split point that packs the left half of
+	 * the split BTREE_SINGLEVAL_FILLFACTOR% full.
+	 */
+	if (!checkingunique)
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+
+		itemid = PageGetItemId(page, minoff);
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		if (_bt_keep_natts_fast(rel, newitem, itup) > natts)
+		{
+			itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
+			itup = (IndexTuple) PageGetItem(page, itemid);
+
+			/*
+			 * Use different strategy if future page split likely to need to
+			 * use "single value" strategy
+			 */
+			if (_bt_keep_natts_fast(rel, newitem, itup) > natts)
+				singlevalue = true;
+		}
+	}
+
+	/*
+	 * Iterate over tuples on the page, try to deduplicate them into posting
+	 * lists and insert into new page.  NOTE: It's essential to reassess the
+	 * max offset on each iteration, since it will change as items are
+	 * deduplicated.
+	 */
+	offnum = minoff;
+retry:
+	while (offnum <= PageGetMaxOffsetNumber(page))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(!ItemIdIsDead(itemid));
+
+		if (state->nitems == 0)
+		{
+			/*
+			 * No previous/base tuple for the data item -- use the data item
+			 * as base tuple of pending posting list
+			 */
+			_bt_dedup_start_pending(state, itup, offnum);
+		}
+		else if (_bt_keep_natts_fast(rel, state->base, itup) > natts &&
+				 _bt_dedup_save_htid(state, itup))
+		{
+			/*
+			 * Tuple is equal to base tuple of pending posting list.  Heap
+			 * TID(s) for itup have been saved in state.  The next iteration
+			 * will also end up here if it's possible to merge the next tuple
+			 * into the same pending posting list.
+			 */
+		}
+		else
+		{
+			/*
+			 * Tuple is not equal to pending posting list tuple, or
+			 * _bt_dedup_save_htid() opted to not merge current item into
+			 * pending posting list for some other reason (e.g., adding more
+			 * TIDs would have caused posting list to exceed BTMaxItemSize()
+			 * limit).
+			 *
+			 * If state contains pending posting list with more than one item,
+			 * form new posting tuple, and update the page.  Otherwise, reset
+			 * the state and move on.
+			 */
+			pagesaving += _bt_dedup_finish_pending(buffer, state,
+												   RelationNeedsWAL(rel));
+
+			count++;
+
+			/*
+			 * When caller is a checkingunique caller and we have deduplicated
+			 * enough to avoid a page split, do minimal deduplication in case
+			 * the remaining items are about to be marked dead within
+			 * _bt_check_unique().
+			 */
+			if (minimal && pagesaving >= newitemsz)
+				break;
+
+			/*
+			 * Consider special steps when a future page split of the leaf
+			 * page is likely to occur using nbtsplitloc.c's "single value"
+			 * strategy
+			 */
+			if (singlevalue)
+			{
+				/*
+				 * Adjust maxitemsize so that there isn't a third and final
+				 * 1/3 of a page width tuple that fills the page to capacity.
+				 * The third tuple produced should be smaller than the first
+				 * two by an amount equal to the free space that nbtsplitloc.c
+				 * is likely to want to leave behind when the page it split.
+				 * When there are 3 posting lists on the page, then we end
+				 * deduplication.  Remaining tuples on the page can be
+				 * deduplicated later, when they're on the new right sibling
+				 * of this page, and the new sibling page needs to be split in
+				 * turn.
+				 *
+				 * Note that it doesn't matter if there are items on the page
+				 * that were already 1/3 of a page during current pass;
+				 * they'll still count as the first two posting list tuples.
+				 */
+				if (count == 2)
+				{
+					Size		leftfree;
+
+					/* This calculation needs to match nbtsplitloc.c */
+					leftfree = PageGetPageSize(page) - SizeOfPageHeaderData -
+						MAXALIGN(sizeof(BTPageOpaqueData));
+					/* Subtract predicted size of new high key */
+					leftfree -= newitemsz + MAXALIGN(sizeof(ItemPointerData));
+
+					/*
+					 * Reduce maxitemsize by an amount equal to target free
+					 * space on left half of page
+					 */
+					state->maxitemsize -= leftfree *
+						((100 - BTREE_SINGLEVAL_FILLFACTOR) / 100.0);
+				}
+				else if (count == 3)
+					break;
+			}
+
+			/*
+			 * Next iteration starts immediately after base tuple offset (this
+			 * will be the next offset on the page when we didn't modify the
+			 * page)
+			 */
+			offnum = state->baseoff;
+		}
+
+		offnum = OffsetNumberNext(offnum);
+	}
+
+	/* Handle the last item when pending posting list is not empty */
+	if (state->nitems != 0)
+	{
+		pagesaving += _bt_dedup_finish_pending(buffer, state,
+											   RelationNeedsWAL(rel));
+		count++;
+	}
+
+	if (pagesaving < newitemsz && state->skippedbase != InvalidOffsetNumber)
+	{
+		/*
+		 * Didn't free enough space for new item in first checkingunique pass.
+		 * Try making a second pass over the page, this time starting from the
+		 * first candidate posting list base offset that was skipped over in
+		 * the first pass (only do a second pass when this actually happened).
+		 *
+		 * The second pass over the page may deduplicate items that were
+		 * initially passed over due to concerns about limiting the
+		 * effectiveness of LP_DEAD bit setting within _bt_check_unique().
+		 * Note that the second pass will still stop deduplicating as soon as
+		 * enough space has been freed to avoid an immediate page split.
+		 */
+		Assert(state->checkingunique);
+		offnum = state->skippedbase;
+
+		state->checkingunique = false;
+		state->skippedbase = InvalidOffsetNumber;
+		state->alltupsize = 0;
+		state->nitems = 0;
+		state->base = NULL;
+		state->baseoff = InvalidOffsetNumber;
+		state->basetupsize = 0;
+		goto retry;
+	}
+
+	/* Local space accounting should agree with page accounting */
+	Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
+
+	/* cannot leak memory here */
+	pfree(state->htids);
+	pfree(state);
+}
+
+/*
+ * Create a new pending posting list tuple based on caller's tuple.
+ *
+ * Every tuple processed by the deduplication routines either becomes the base
+ * tuple for a posting list, or gets its heap TID(s) accepted into a pending
+ * posting list.  A tuple that starts out as the base tuple for a posting list
+ * will only actually be rewritten within _bt_dedup_finish_pending() when
+ * there was at least one successful call to _bt_dedup_save_htid().
+ */
+void
+_bt_dedup_start_pending(BTDedupState state, IndexTuple base,
+						OffsetNumber baseoff)
+{
+	Assert(state->nhtids == 0);
+	Assert(state->nitems == 0);
+
+	/*
+	 * Copy heap TIDs from new base tuple for new candidate posting list into
+	 * ipd array.  Assume that we'll eventually create a new posting tuple by
+	 * merging later tuples with this existing one, though we may not.
+	 */
+	if (!BTreeTupleIsPosting(base))
+	{
+		memcpy(state->htids, base, sizeof(ItemPointerData));
+		state->nhtids = 1;
+		/* Save size of tuple without any posting list */
+		state->basetupsize = IndexTupleSize(base);
+	}
+	else
+	{
+		int			nposting;
+
+		nposting = BTreeTupleGetNPosting(base);
+		memcpy(state->htids, BTreeTupleGetPosting(base),
+			   sizeof(ItemPointerData) * nposting);
+		state->nhtids = nposting;
+		/* Save size of tuple without any posting list */
+		state->basetupsize = BTreeTupleGetPostingOffset(base);
+	}
+
+	/*
+	 * Save new base tuple itself -- it'll be needed if we actually create a
+	 * new posting list from new pending posting list.
+	 *
+	 * Must maintain size of all tuples (including line pointer overhead) to
+	 * calculate space savings on page within _bt_dedup_finish_pending().
+	 * Also, save number of base tuple logical tuples so that we can save
+	 * cycles in the common case where an existing posting list can't or won't
+	 * be merged with other tuples on the page.
+	 */
+	state->nitems = 1;
+	state->base = base;
+	state->baseoff = baseoff;
+	state->alltupsize = MAXALIGN(IndexTupleSize(base)) + sizeof(ItemIdData);
+	/* Also save baseoff in pending state for interval */
+	state->interval.baseoff = state->baseoff;
+}
+
+/*
+ * Save itup heap TID(s) into pending posting list where possible.
+ *
+ * Returns bool indicating if the pending posting list managed by state has
+ * itup's heap TID(s) saved.  When this is false, enlarging the pending
+ * posting list by the required amount would exceed the maxitemsize limit, so
+ * caller must finish the pending posting list tuple.  (Generally itup becomes
+ * the base tuple of caller's new pending posting list).
+ */
+bool
+_bt_dedup_save_htid(BTDedupState state, IndexTuple itup)
+{
+	int			nhtids;
+	ItemPointer htids;
+	Size		mergedtupsz;
+
+	if (!BTreeTupleIsPosting(itup))
+	{
+		nhtids = 1;
+		htids = &itup->t_tid;
+	}
+	else
+	{
+		nhtids = BTreeTupleGetNPosting(itup);
+		htids = BTreeTupleGetPosting(itup);
+	}
+
+	/*
+	 * Don't append (have caller finish pending posting list as-is) if
+	 * appending heap TID(s) from itup would put us over limit
+	 */
+	mergedtupsz = MAXALIGN(state->basetupsize +
+						   (state->nhtids + nhtids) *
+						   sizeof(ItemPointerData));
+
+	if (mergedtupsz > state->maxitemsize)
+		return false;
+
+	/* Don't merge existing posting lists in first checkingunique pass */
+	if (state->checkingunique &&
+		(BTreeTupleIsPosting(state->base) || nhtids > 1))
+	{
+		/* May begin here if second pass over page is required */
+		if (state->skippedbase == InvalidOffsetNumber)
+			state->skippedbase = state->baseoff;
+		return false;
+	}
+
+	/*
+	 * Save heap TIDs to pending posting list tuple -- itup can be merged into
+	 * pending posting list
+	 */
+	state->nitems++;
+	memcpy(state->htids + state->nhtids, htids,
+		   sizeof(ItemPointerData) * nhtids);
+	state->nhtids += nhtids;
+	state->alltupsize += MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
+
+	return true;
+}
+
+/*
+ * Finalize pending posting list tuple, and add it to the page.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * Returns space saving from deduplicating to make a new posting list tuple.
+ * Note that this includes line pointer overhead.  This is zero in the case
+ * where no deduplication was possible.
+ */
+Size
+_bt_dedup_finish_pending(Buffer buffer, BTDedupState state, bool need_wal)
+{
+	Size		spacesaving = 0;
+	Page		page = BufferGetPage(buffer);
+	int			minimum = 2;
+
+	Assert(state->nitems > 0);
+	Assert(state->nitems <= state->nhtids);
+	Assert(state->interval.baseoff == state->baseoff);
+
+	/*
+	 * Only create a posting list when at least 3 heap TIDs will appear in the
+	 * checkingunique case (checkingunique strategy won't merge existing
+	 * posting list tuples, so we know that the number of items here must also
+	 * be the total number of heap TIDs).  Creating a new posting lists with
+	 * only two heap TIDs won't even save enough space to fit another
+	 * duplicate with the same key as the posting list.  This is a bad
+	 * trade-off if there is a chance that the LP_DEAD bit can be set for
+	 * either existing tuple by putting off deduplication.
+	 *
+	 * (Note that a second pass over the page can deduplicate the item if that
+	 * is truly the only way to avoid a page split for checkingunique caller)
+	 */
+	Assert(!state->checkingunique || state->nitems == 1 ||
+		   state->nhtids == state->nitems);
+	if (state->checkingunique)
+	{
+		minimum = 3;
+		/* May begin here if second pass over page is required */
+		if (state->nitems == 2 && state->skippedbase == InvalidOffsetNumber)
+			state->skippedbase = state->baseoff;
+	}
+
+	if (state->nitems >= minimum)
+	{
+		IndexTuple	final;
+		Size		finalsz;
+		OffsetNumber offnum;
+		OffsetNumber deletable[MaxOffsetNumber];
+		int			ndeletable = 0;
+
+		/* find all tuples that will be replaced with this new posting tuple */
+		for (offnum = state->baseoff;
+			 offnum < state->baseoff + state->nitems;
+			 offnum = OffsetNumberNext(offnum))
+			deletable[ndeletable++] = offnum;
+
+		/* Form a tuple with a posting list */
+		final = _bt_form_posting(state->base, state->htids, state->nhtids);
+		finalsz = IndexTupleSize(final);
+		spacesaving = state->alltupsize - (finalsz + sizeof(ItemIdData));
+		/* Must have saved some space */
+		Assert(spacesaving > 0 && spacesaving < BLCKSZ);
+
+		/* Save final number of items for posting list */
+		state->interval.nitems = state->nitems;
+
+		Assert(finalsz <= state->maxitemsize);
+		Assert(finalsz == MAXALIGN(IndexTupleSize(final)));
+
+		START_CRIT_SECTION();
+
+		/* Delete items to replace */
+		PageIndexMultiDelete(page, deletable, ndeletable);
+		/* Insert posting tuple */
+		if (PageAddItem(page, (Item) final, finalsz, state->baseoff, false,
+						false) == InvalidOffsetNumber)
+			elog(ERROR, "deduplication failed to add tuple to page");
+
+		MarkBufferDirty(buffer);
+
+		/* Log deduplicated items */
+		if (need_wal)
+		{
+			XLogRecPtr	recptr;
+			xl_btree_dedup xlrec_dedup;
+
+			xlrec_dedup.baseoff = state->interval.baseoff;
+			xlrec_dedup.nitems = state->interval.nitems;
+
+			XLogBeginInsert();
+			XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+			XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+			recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP_PAGE);
+
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
+
+		pfree(final);
+	}
+
+	/* Reset state for next pending posting list */
+	state->nhtids = 0;
+	state->nitems = 0;
+	state->alltupsize = 0;
+
+	return spacesaving;
+}
+
+/*
+ * Build a posting list tuple from a "base" index tuple and a list of heap
+ * TIDs for posting list.
+ *
+ * Caller's "htids" array must be sorted in ascending order.  Any heap TIDs
+ * from caller's base tuple will not appear in returned posting list.
+ *
+ * If nhtids == 1, builds a non-posting tuple (posting list tuples can never
+ * have a single heap TID).
+ */
+IndexTuple
+_bt_form_posting(IndexTuple tuple, ItemPointer htids, int nhtids)
+{
+	uint32		keysize,
+				newsize = 0;
+	IndexTuple	itup;
+
+	/* We only need key part of the tuple */
+	if (BTreeTupleIsPosting(tuple))
+		keysize = BTreeTupleGetPostingOffset(tuple);
+	else
+		keysize = IndexTupleSize(tuple);
+
+	Assert(nhtids > 0 && nhtids <= PG_UINT16_MAX);
+
+	/* Add space needed for posting list */
+	if (nhtids > 1)
+		newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nhtids;
+	else
+		newsize = keysize;
+
+	Assert(newsize <= INDEX_SIZE_MASK);
+	newsize = MAXALIGN(newsize);
+	itup = palloc0(newsize);
+	memcpy(itup, tuple, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+
+	if (nhtids > 1)
+	{
+		/* Form posting list tuple */
+		BTreeSetPosting(itup, nhtids, SHORTALIGN(keysize));
+		memcpy(BTreeTupleGetPosting(itup), htids,
+			   sizeof(ItemPointerData) * nhtids);
+		Assert(BTreeTupleIsPosting(itup));
+
+#ifdef USE_ASSERT_CHECKING
+		{
+			/* Assert that htid array is sorted and has unique TIDs */
+			ItemPointerData last;
+			ItemPointer current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+				current = BTreeTupleGetPostingN(itup, i);
+				Assert(ItemPointerCompare(current, &last) > 0);
+				ItemPointerCopy(current, &last);
+			}
+		}
+#endif
+	}
+	else
+	{
+		/* To finish building of a non-posting tuple, copy TID from htids */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		ItemPointerCopy(htids, &itup->t_tid);
+	}
+
+	return itup;
+}
+
+/*
+ * Prepare for a posting list split by swapping heap TID in newitem with heap
+ * TID from original posting list (the 'oposting' heap TID located at offset
+ * 'postingoff').
+ *
+ * Returns new posting list tuple, which is palloc()'d in caller's context.
+ * This is guaranteed to be the same size as 'oposting'.  Modified version of
+ * newitem is what caller actually inserts inside the critical section that
+ * also performs an in-place update of posting list.
+ *
+ * Explicit WAL-logging of newitem must use the original version of newitem in
+ * order to make it possible for our nbtxlog.c callers to correctly REDO
+ * original steps.  This approach avoids any explicit WAL-logging of a posting
+ * list tuple.  This is important because posting lists are often much larger
+ * than plain tuples.
+ *
+ * Caller should avoid assuming that the IndexTuple-wise key representation in
+ * newitem is bitwise equal to the representation used within oposting.  Note,
+ * in particular, that one may even be larger than the other.  This could
+ * occur due to differences in TOAST input state, for example.
+ */
+IndexTuple
+_bt_swap_posting(IndexTuple newitem, IndexTuple oposting, int postingoff)
+{
+	int			nhtids;
+	char	   *replacepos;
+	char	   *rightpos;
+	Size		nbytes;
+	IndexTuple	nposting;
+
+	nhtids = BTreeTupleGetNPosting(oposting);
+	Assert(postingoff > 0 && postingoff < nhtids);
+
+	nposting = CopyIndexTuple(oposting);
+	replacepos = (char *) BTreeTupleGetPostingN(nposting, postingoff);
+	rightpos = replacepos + sizeof(ItemPointerData);
+	nbytes = (nhtids - postingoff - 1) * sizeof(ItemPointerData);
+
+	/*
+	 * Move item pointers in posting list to make a gap for the new item's
+	 * heap TID (shift TIDs one place to the right, losing original rightmost
+	 * TID)
+	 */
+	memmove(rightpos, replacepos, nbytes);
+
+	/* Fill the gap with the TID of the new item */
+	ItemPointerCopy(&newitem->t_tid, (ItemPointer) replacepos);
+
+	/* Copy original posting list's rightmost TID into new item */
+	ItemPointerCopy(BTreeTupleGetPostingN(oposting, nhtids - 1),
+					&newitem->t_tid);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(nposting),
+							  BTreeTupleGetHeapTID(newitem)) < 0);
+	Assert(BTreeTupleGetNPosting(oposting) == BTreeTupleGetNPosting(nposting));
+
+	return nposting;
+}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b93b2a0ffd..5072fc514d 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -28,6 +28,8 @@
 /* Minimum tree height for application of fastpath optimization */
 #define BTREE_FASTPATH_MIN_LEVEL	2
 
+/* GUC parameter */
+bool		btree_deduplication = true;
 
 static Buffer _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
 
@@ -47,10 +49,12 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   BTStack stack,
 						   IndexTuple itup,
 						   OffsetNumber newitemoff,
+						   int postingoff,
 						   bool split_only_page);
 static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 						Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
-						IndexTuple newitem);
+						IndexTuple newitem, IndexTuple orignewitem,
+						IndexTuple nposting, uint16 postingoff);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 							  BTStack stack, bool is_root, bool is_only);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
@@ -61,7 +65,8 @@ static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
  *	_bt_doinsert() -- Handle insertion of a single index tuple in the tree.
  *
  *		This routine is called by the public interface routine, btinsert.
- *		By here, itup is filled in, including the TID.
+ *		By here, itup is filled in, including the TID.  Caller should be
+ *		prepared for us to scribble on 'itup'.
  *
  *		If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
  *		will allow duplicates.  Otherwise (UNIQUE_CHECK_YES or
@@ -125,6 +130,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	insertstate.itup_key = itup_key;
 	insertstate.bounds_valid = false;
 	insertstate.buf = InvalidBuffer;
+	insertstate.postingoff = 0;
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -300,7 +306,7 @@ top:
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
 		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
-					   itup, newitemoff, false);
+					   itup, newitemoff, insertstate.postingoff, false);
 	}
 	else
 	{
@@ -353,6 +359,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 	BTPageOpaque opaque;
 	Buffer		nbuf = InvalidBuffer;
 	bool		found = false;
+	bool		inposting = false;
+	bool		prev_all_dead = true;
+	int			curposti = 0;
 
 	/* Assume unique until we find a duplicate */
 	*is_unique = true;
@@ -374,6 +383,11 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 
 	/*
 	 * Scan over all equal tuples, looking for live conflicts.
+	 *
+	 * Note that each iteration of the loop processes one heap TID, not one
+	 * index tuple.  The page offset number won't be advanced for iterations
+	 * which process heap TIDs from posting list tuples until the last such
+	 * heap TID for the posting list (curposti will be advanced instead).
 	 */
 	Assert(!insertstate->bounds_valid || insertstate->low == offset);
 	Assert(!itup_key->anynullkeys);
@@ -435,7 +449,27 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 
 				/* okay, we gotta fetch the heap tuple ... */
 				curitup = (IndexTuple) PageGetItem(page, curitemid);
-				htid = curitup->t_tid;
+
+				/*
+				 * decide if this is the first heap TID in tuple we'll
+				 * process, or if we should continue to process current
+				 * posting list
+				 */
+				if (!BTreeTupleIsPosting(curitup))
+				{
+					htid = curitup->t_tid;
+					inposting = false;
+				}
+				else if (!inposting)
+				{
+					/* First heap TID in posting list */
+					inposting = true;
+					prev_all_dead = true;
+					curposti = 0;
+				}
+
+				if (inposting)
+					htid = *BTreeTupleGetPostingN(curitup, curposti);
 
 				/*
 				 * If we are doing a recheck, we expect to find the tuple we
@@ -511,8 +545,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					 * not part of this chain because it had a different index
 					 * entry.
 					 */
-					htid = itup->t_tid;
-					if (table_index_fetch_tuple_check(heapRel, &htid,
+					if (table_index_fetch_tuple_check(heapRel, &itup->t_tid,
 													  SnapshotSelf, NULL))
 					{
 						/* Normal case --- it's still live */
@@ -570,12 +603,14 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 													RelationGetRelationName(rel))));
 					}
 				}
-				else if (all_dead)
+				else if (all_dead && (!inposting ||
+									  (prev_all_dead &&
+									   curposti == BTreeTupleGetNPosting(curitup) - 1)))
 				{
 					/*
-					 * The conflicting tuple (or whole HOT chain) is dead to
-					 * everyone, so we may as well mark the index entry
-					 * killed.
+					 * The conflicting tuple (or all HOT chains pointed to by
+					 * all posting list TIDs) is dead to everyone, so mark the
+					 * index entry killed.
 					 */
 					ItemIdMarkDead(curitemid);
 					opaque->btpo_flags |= BTP_HAS_GARBAGE;
@@ -589,14 +624,29 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					else
 						MarkBufferDirtyHint(insertstate->buf, true);
 				}
+
+				/*
+				 * Remember if posting list tuple has even a single HOT chain
+				 * whose members are not all dead
+				 */
+				if (!all_dead && inposting)
+					prev_all_dead = false;
 			}
 		}
 
-		/*
-		 * Advance to next tuple to continue checking.
-		 */
-		if (offset < maxoff)
+		if (inposting && curposti < BTreeTupleGetNPosting(curitup) - 1)
+		{
+			/* Advance to next TID in same posting list */
+			curposti++;
+			continue;
+		}
+		else if (offset < maxoff)
+		{
+			/* Advance to next tuple */
+			curposti = 0;
+			inposting = false;
 			offset = OffsetNumberNext(offset);
+		}
 		else
 		{
 			int			highkeycmp;
@@ -621,6 +671,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					elog(ERROR, "fell off the end of index \"%s\"",
 						 RelationGetRelationName(rel));
 			}
+			curposti = 0;
+			inposting = false;
 			maxoff = PageGetMaxOffsetNumber(page);
 			offset = P_FIRSTDATAKEY(opaque);
 			/* Don't invalidate binary search bounds */
@@ -689,6 +741,7 @@ _bt_findinsertloc(Relation rel,
 	BTScanInsert itup_key = insertstate->itup_key;
 	Page		page = BufferGetPage(insertstate->buf);
 	BTPageOpaque lpageop;
+	OffsetNumber location;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -704,6 +757,8 @@ _bt_findinsertloc(Relation rel,
 
 	if (itup_key->heapkeyspace)
 	{
+		bool		dedupunique = false;
+
 		/*
 		 * If we're inserting into a unique index, we may have to walk right
 		 * through leaf pages to find the one leaf page that we must insert on
@@ -717,9 +772,25 @@ _bt_findinsertloc(Relation rel,
 		 * tuple belongs on.  The heap TID attribute for new tuple (scantid)
 		 * could force us to insert on a sibling page, though that should be
 		 * very rare in practice.
+		 *
+		 * checkingunique inserters that encounter a duplicate will apply
+		 * deduplication when it looks like there will be a page split, but
+		 * there is no LP_DEAD garbage on the leaf page to vacuum away (or
+		 * there wasn't enough space freed by LP_DEAD cleanup).  This
+		 * complements the opportunistic LP_DEAD vacuuming mechanism.  The
+		 * high level goal is to avoid page splits caused by new, unchanged
+		 * versions of existing logical rows altogether.  See nbtree/README
+		 * for full details.
 		 */
 		if (checkingunique)
 		{
+			if (insertstate->low < insertstate->stricthigh)
+			{
+				/* Encountered a duplicate in _bt_check_unique() */
+				Assert(insertstate->bounds_valid);
+				dedupunique = true;
+			}
+
 			for (;;)
 			{
 				/*
@@ -746,18 +817,37 @@ _bt_findinsertloc(Relation rel,
 				/* Update local state after stepping right */
 				page = BufferGetPage(insertstate->buf);
 				lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+				/* Assume duplicates (helpful when initial page is empty) */
+				dedupunique = true;
 			}
 		}
 
 		/*
 		 * If the target page is full, see if we can obtain enough space by
-		 * erasing LP_DEAD items
+		 * erasing LP_DEAD items.  If that doesn't work out, try to obtain
+		 * enough free space to avoid a page split by deduplicating existing
+		 * items (if deduplication is safe).
 		 */
-		if (PageGetFreeSpace(page) < insertstate->itemsz &&
-			P_HAS_GARBAGE(lpageop))
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
 		{
-			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
-			insertstate->bounds_valid = false;
+			if (P_HAS_GARBAGE(lpageop))
+			{
+				_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+				insertstate->bounds_valid = false;
+
+				/* Might as well assume duplicates if checkingunique */
+				dedupunique = true;
+			}
+
+			if (itup_key->safededup && BTGetUseDedup(rel) &&
+				PageGetFreeSpace(page) < insertstate->itemsz &&
+				(!checkingunique || dedupunique))
+			{
+				_bt_dedup_one_page(rel, insertstate->buf, heapRel,
+								   insertstate->itup, insertstate->itemsz,
+								   checkingunique);
+				insertstate->bounds_valid = false;
+			}
 		}
 	}
 	else
@@ -839,7 +929,35 @@ _bt_findinsertloc(Relation rel,
 	Assert(P_RIGHTMOST(lpageop) ||
 		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
 
-	return _bt_binsrch_insert(rel, insertstate);
+	location = _bt_binsrch_insert(rel, insertstate);
+
+	/*
+	 * Insertion is not prepared for the case where an LP_DEAD posting list
+	 * tuple must be split.  In the unlikely event that this happens, call
+	 * _bt_dedup_one_page() to force it to kill all LP_DEAD items.
+	 */
+	if (unlikely(insertstate->postingoff == -1))
+	{
+		Assert(insertstate->itup_key->safededup);
+
+		/*
+		 * Don't check if the option is enabled, since no actual deduplication
+		 * will be done, just cleanup.
+		 */
+		_bt_dedup_one_page(rel, insertstate->buf, heapRel, insertstate->itup,
+						   0, checkingunique);
+		Assert(!P_HAS_GARBAGE(lpageop));
+
+		/* Must reset insertstate ahead of new _bt_binsrch_insert() call */
+		insertstate->bounds_valid = false;
+		insertstate->postingoff = 0;
+		location = _bt_binsrch_insert(rel, insertstate);
+
+		/* New insert location cannot be LP_DEAD posting list now */
+		Assert(insertstate->postingoff >= 0);
+	}
+
+	return location;
 }
 
 /*
@@ -905,10 +1023,12 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		This recursive procedure does the following things:
  *
+ *			+  if necessary, splits an existing posting list on page.
+ *			   This is only needed when 'postingoff' is non-zero.
  *			+  if necessary, splits the target page, using 'itup_key' for
  *			   suffix truncation on leaf pages (caller passes NULL for
  *			   non-leaf pages).
- *			+  inserts the tuple.
+ *			+  inserts the new tuple (could be from split posting list).
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
  *			   right using information stored in the parent stack).
@@ -918,7 +1038,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		On entry, we must have the correct buffer in which to do the
  *		insertion, and the buffer must be pinned and write-locked.  On return,
- *		we will have dropped both the pin and the lock on the buffer.
+ *		we will have dropped both the pin and the lock on the buffer.  Caller
+ *		should be prepared for us to scribble on 'itup'.
  *
  *		This routine only performs retail tuple insertions.  'itup' should
  *		always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +1057,15 @@ _bt_insertonpg(Relation rel,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
+			   int postingoff,
 			   bool split_only_page)
 {
 	Page		page;
 	BTPageOpaque lpageop;
 	Size		itemsz;
+	IndexTuple	oposting;
+	IndexTuple	origitup = NULL;
+	IndexTuple	nposting = NULL;
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1079,8 @@ _bt_insertonpg(Relation rel,
 	Assert(P_ISLEAF(lpageop) ||
 		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
+	/* retail insertions of posting list tuples are disallowed */
+	Assert(!BTreeTupleIsPosting(itup));
 
 	/* The caller should've finished any incomplete splits already. */
 	if (P_INCOMPLETE_SPLIT(lpageop))
@@ -964,6 +1091,39 @@ _bt_insertonpg(Relation rel,
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
+	/*
+	 * Do we need to split an existing posting list item?
+	 */
+	if (postingoff != 0)
+	{
+		ItemId		itemid = PageGetItemId(page, newitemoff);
+
+		/*
+		 * The new tuple is a duplicate with a heap TID that falls inside the
+		 * range of an existing posting list tuple on a leaf page.  Prepare to
+		 * split an existing posting list by swapping new item's heap TID with
+		 * the rightmost heap TID from original posting list, and generating a
+		 * new version of the posting list that has new item's heap TID.
+		 *
+		 * Posting list splits work by modifying the overlapping posting list
+		 * as part of the same atomic operation that inserts the "new item".
+		 * The space accounting is kept simple, since it does not need to
+		 * consider posting list splits at all (this is particularly important
+		 * for the case where we also have to split the page).  Overwriting
+		 * the posting list with its post-split version is treated as an extra
+		 * step in either the insert or page split critical section.
+		 */
+		Assert(P_ISLEAF(lpageop) && !ItemIdIsDead(itemid));
+		oposting = (IndexTuple) PageGetItem(page, itemid);
+
+		/* save a copy of itup with unchanged TID for xlog record */
+		origitup = CopyIndexTuple(itup);
+		nposting = _bt_swap_posting(itup, oposting, postingoff);
+
+		/* Alter offset so that it goes after existing posting list */
+		newitemoff = OffsetNumberNext(newitemoff);
+	}
+
 	/*
 	 * Do we need to split the page to fit the item on it?
 	 *
@@ -996,7 +1156,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+						 origitup, nposting, postingoff);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1075,6 +1236,13 @@ _bt_insertonpg(Relation rel,
 			elog(PANIC, "failed to add new item to block %u in index \"%s\"",
 				 itup_blkno, RelationGetRelationName(rel));
 
+		/*
+		 * Posting list split requires an in-place update of the existing
+		 * posting list
+		 */
+		if (nposting)
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
 		MarkBufferDirty(buf);
 
 		if (BufferIsValid(metabuf))
@@ -1120,8 +1288,19 @@ _bt_insertonpg(Relation rel,
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
 
-			if (P_ISLEAF(lpageop))
+			if (P_ISLEAF(lpageop) && postingoff == 0)
+			{
+				/* Simple leaf insert */
 				xlinfo = XLOG_BTREE_INSERT_LEAF;
+			}
+			else if (postingoff != 0)
+			{
+				/*
+				 * Leaf insert with posting list split.  Must include
+				 * postingoff field before newitem/orignewitem.
+				 */
+				xlinfo = XLOG_BTREE_INSERT_POST;
+			}
 			else
 			{
 				/*
@@ -1144,6 +1323,7 @@ _bt_insertonpg(Relation rel,
 				xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 				xlmeta.last_cleanup_num_heap_tuples =
 					metad->btm_last_cleanup_num_heap_tuples;
+				xlmeta.safededup = metad->btm_safededup;
 
 				XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 				XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));
@@ -1152,7 +1332,28 @@ _bt_insertonpg(Relation rel,
 			}
 
 			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
-			XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+
+			/*
+			 * We always write newitem to the page, but when there is an
+			 * original newitem due to a posting list split then we log the
+			 * original item instead.  REDO routine must reconstruct the final
+			 * newitem at the same time it reconstructs nposting.
+			 */
+			if (postingoff == 0)
+				XLogRegisterBufData(0, (char *) itup,
+									IndexTupleSize(itup));
+			else
+			{
+				/*
+				 * Must explicitly log posting off before newitem in case of
+				 * posting list split.
+				 */
+				uint16		upostingoff = postingoff;
+
+				XLogRegisterBufData(0, (char *) &upostingoff, sizeof(uint16));
+				XLogRegisterBufData(0, (char *) origitup,
+									IndexTupleSize(origitup));
+			}
 
 			recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
@@ -1194,6 +1395,13 @@ _bt_insertonpg(Relation rel,
 			_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
 			RelationSetTargetBlock(rel, cachedBlock);
 	}
+
+	/* be tidy */
+	if (postingoff != 0)
+	{
+		pfree(nposting);
+		pfree(origitup);
+	}
 }
 
 /*
@@ -1209,12 +1417,25 @@ _bt_insertonpg(Relation rel,
  *		This function will clear the INCOMPLETE_SPLIT flag on it, and
  *		release the buffer.
  *
+ *		orignewitem, nposting, and postingoff are needed when an insert of
+ *		orignewitem results in both a posting list split and a page split.
+ *		newitem and nposting are replacements for orignewitem and the
+ *		existing posting list on the page respectively.  These extra
+ *		posting list split details are used here in the same way as they
+ *		are used in the more common case where a posting list split does
+ *		not coincide with a page split.  We need to deal with posting list
+ *		splits directly in order to ensure that everything that follows
+ *		from the insert of orignewitem is handled as a single atomic
+ *		operation (though caller's insert of a new pivot/downlink into
+ *		parent page will still be a separate operation).
+ *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
  */
 static Buffer
 _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+		  IndexTuple orignewitem, IndexTuple nposting, uint16 postingoff)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1236,12 +1457,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	OffsetNumber firstright;
 	OffsetNumber maxoff;
 	OffsetNumber i;
+	OffsetNumber replacepostingoff = InvalidOffsetNumber;
 	bool		newitemonleft,
 				isleaf;
 	IndexTuple	lefthikey;
 	int			indnatts = IndexRelationGetNumberOfAttributes(rel);
 	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
+	/*
+	 * Determine offset number of existing posting list on page when a split
+	 * of a posting list needs to take place as the page is split
+	 */
+	if (nposting != NULL)
+	{
+		Assert(itup_key->heapkeyspace);
+		replacepostingoff = OffsetNumberPrev(newitemoff);
+	}
+
 	/*
 	 * origpage is the original page to be split.  leftpage is a temporary
 	 * buffer that receives the left-sibling data, which will be copied back
@@ -1273,6 +1505,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	 * newitemoff == firstright.  In all other cases it's clear which side of
 	 * the split every tuple goes on from context.  newitemonleft is usually
 	 * (but not always) redundant information.
+	 *
+	 * Note: In theory, the split point choice logic should operate against a
+	 * version of the page that already replaced the posting list at offset
+	 * replacepostingoff with nposting where applicable.  We don't bother with
+	 * that, though.  Both versions of the posting list must be the same size,
+	 * and both will have the same base tuple key values, so split point
+	 * choice is never affected.
 	 */
 	firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
 								  newitem, &newitemonleft);
@@ -1340,6 +1579,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, firstright);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		/* Behave as if origpage posting list has already been swapped */
+		if (firstright == replacepostingoff)
+			item = nposting;
 	}
 
 	/*
@@ -1373,6 +1615,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
 			itemid = PageGetItemId(origpage, lastleftoff);
 			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+			/* Behave as if origpage posting list has already been swapped */
+			if (lastleftoff == replacepostingoff)
+				lastleft = nposting;
 		}
 
 		Assert(lastleft != item);
@@ -1388,6 +1633,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	 */
 	leftoff = P_HIKEY;
 
+	Assert(BTreeTupleIsPivot(lefthikey) || !itup_key->heapkeyspace);
 	Assert(BTreeTupleGetNAtts(lefthikey, rel) > 0);
 	Assert(BTreeTupleGetNAtts(lefthikey, rel) <= indnkeyatts);
 	if (PageAddItem(leftpage, (Item) lefthikey, itemsz, leftoff,
@@ -1480,8 +1726,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 
+		/*
+		 * did caller pass new replacement posting list tuple due to posting
+		 * list split?
+		 */
+		if (i == replacepostingoff)
+		{
+			/*
+			 * swap origpage posting list with post-posting-list-split version
+			 * from caller
+			 */
+			Assert(isleaf);
+			Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+			item = nposting;
+		}
+
 		/* does new item belong before this one? */
-		if (i == newitemoff)
+		else if (i == newitemoff)
 		{
 			if (newitemonleft)
 			{
@@ -1650,8 +1911,12 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		XLogRecPtr	recptr;
 
 		xlrec.level = ropaque->btpo.level;
+		/* See comments below on newitem, orignewitem, and posting lists */
 		xlrec.firstright = firstright;
 		xlrec.newitemoff = newitemoff;
+		xlrec.postingoff = 0;
+		if (replacepostingoff < firstright)
+			xlrec.postingoff = postingoff;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1670,11 +1935,45 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		 * because it's included with all the other items on the right page.)
 		 * Show the new item as belonging to the left page buffer, so that it
 		 * is not stored if XLogInsert decides it needs a full-page image of
-		 * the left page.  We store the offset anyway, though, to support
-		 * archive compression of these records.
+		 * the left page.  We always store newitemoff in the record, though.
+		 *
+		 * The details are sometimes slightly different for page splits that
+		 * coincide with a posting list split.  If both the replacement
+		 * posting list and newitem go on the right page, then we don't need
+		 * to log anything extra, just like the simple !newitemonleft
+		 * no-posting-split case (postingoff is set to zero in the WAL record,
+		 * so recovery doesn't need to process a posting list split at all).
+		 * Otherwise, we set postingoff and log orignewitem instead of
+		 * newitem, despite having actually inserted newitem.  Recovery must
+		 * reconstruct nposting and newitem using _bt_swap_posting().
+		 *
+		 * Note: It's possible that our page split point is the point that
+		 * makes the posting list lastleft and newitem firstright.  This is
+		 * the only case where we log orignewitem despite newitem going on the
+		 * right page.  If XLogInsert decides that it can omit orignewitem due
+		 * to logging a full-page image of the left page, everything still
+		 * works out, since recovery only needs to log orignewitem for items
+		 * on the left page (just like the regular newitem-logged case).
 		 */
-		if (newitemonleft)
-			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+		if (newitemonleft || xlrec.postingoff != 0)
+		{
+			if (xlrec.postingoff == 0)
+			{
+				/* Must WAL-log newitem, since it's on left page */
+				Assert(newitemonleft);
+				Assert(orignewitem == NULL && nposting == NULL);
+				XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+			}
+			else
+			{
+				/* Must WAL-log orignewitem following posting list split */
+				Assert(newitemonleft || firstright == newitemoff);
+				Assert(ItemPointerCompare(&orignewitem->t_tid,
+										  &newitem->t_tid) < 0);
+				XLogRegisterBufData(0, (char *) orignewitem,
+									MAXALIGN(IndexTupleSize(orignewitem)));
+			}
+		}
 
 		/* Log the left page's new high key */
 		itemid = PageGetItemId(origpage, P_HIKEY);
@@ -1834,7 +2133,7 @@ _bt_insert_parent(Relation rel,
 
 		/* Recursively insert into the parent */
 		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
-					   new_item, stack->bts_offset + 1,
+					   new_item, stack->bts_offset + 1, 0,
 					   is_only);
 
 		/* be tidy */
@@ -2190,6 +2489,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 		md.fastlevel = metad->btm_level;
 		md.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 		md.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+		md.safededup = metad->btm_safededup;
 
 		XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -2303,6 +2603,6 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * Note: if we didn't find any LP_DEAD items, then the page's
 	 * BTP_HAS_GARBAGE hint bit is falsely set.  We do not bother expending a
 	 * separate write to clear it, however.  We will clear it when we split
-	 * the page.
+	 * the page (or when deduplication runs).
 	 */
 }
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 66c79623cf..3c58a82b3d 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -42,12 +43,18 @@ static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
 								   BlockNumber *target, BlockNumber *rightsib);
 static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
 							   TransactionId latestRemovedXid);
+static TransactionId _bt_compute_xid_horizon_for_tuples(Relation rel,
+														Relation heapRel,
+														Buffer buf,
+														OffsetNumber *itemnos,
+														int nitems);
 
 /*
  *	_bt_initmetapage() -- Fill a page buffer with a correct metapage image
  */
 void
-_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
+_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+				 bool safededup)
 {
 	BTMetaPageData *metad;
 	BTPageOpaque metaopaque;
@@ -63,6 +70,7 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
 	metad->btm_fastlevel = level;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+	metad->btm_safededup = safededup;
 
 	metaopaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	metaopaque->btpo_flags = BTP_META;
@@ -102,6 +110,9 @@ _bt_upgrademetapage(Page page)
 	metad->btm_version = BTREE_NOVAC_VERSION;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+	/* Only a REINDEX can set this field */
+	Assert(!metad->btm_safededup);
+	metad->btm_safededup = false;
 
 	/* Adjust pd_lower (see _bt_initmetapage() for details) */
 	((PageHeader) page)->pd_lower =
@@ -213,6 +224,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 		md.fastlevel = metad->btm_fastlevel;
 		md.oldest_btpo_xact = oldestBtpoXact;
 		md.last_cleanup_num_heap_tuples = numHeapTuples;
+		md.safededup = metad->btm_safededup;
 
 		XLogRegisterBufData(0, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -274,6 +286,8 @@ _bt_getroot(Relation rel, int access)
 		Assert(metad->btm_magic == BTREE_MAGIC);
 		Assert(metad->btm_version >= BTREE_MIN_VERSION);
 		Assert(metad->btm_version <= BTREE_VERSION);
+		Assert(!metad->btm_safededup ||
+			   metad->btm_version > BTREE_NOVAC_VERSION);
 		Assert(metad->btm_root != P_NONE);
 
 		rootblkno = metad->btm_fastroot;
@@ -394,6 +408,7 @@ _bt_getroot(Relation rel, int access)
 			md.fastlevel = 0;
 			md.oldest_btpo_xact = InvalidTransactionId;
 			md.last_cleanup_num_heap_tuples = -1.0;
+			md.safededup = metad->btm_safededup;
 
 			XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -618,22 +633,33 @@ _bt_getrootheight(Relation rel)
 	Assert(metad->btm_magic == BTREE_MAGIC);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
 	Assert(metad->btm_version <= BTREE_VERSION);
+	Assert(!metad->btm_safededup || metad->btm_version > BTREE_NOVAC_VERSION);
 	Assert(metad->btm_fastroot != P_NONE);
 
 	return metad->btm_fastlevel;
 }
 
 /*
- *	_bt_heapkeyspace() -- is heap TID being treated as a key?
+ *	_bt_metaversion() -- Get version/status info from metapage.
+ *
+ *		Sets caller's *heapkeyspace and *safededup arguments using data from
+ *		the B-Tree metapage (could be locally-cached version).  This
+ *		information needs to be stashed in insertion scankey, so we provide a
+ *		single function that fetches both at once.
  *
  *		This is used to determine the rules that must be used to descend a
  *		btree.  Version 4 indexes treat heap TID as a tiebreaker attribute.
  *		pg_upgrade'd version 3 indexes need extra steps to preserve reasonable
  *		performance when inserting a new BTScanInsert-wise duplicate tuple
  *		among many leaf pages already full of such duplicates.
+ *
+ *		Also sets field that indicates to caller whether or not it is safe to
+ *		apply deduplication within index.  Note that we rely on the assumption
+ *		that btm_safededup will be zero'ed on heapkeyspace indexes that were
+ *		pg_upgrade'd from Postgres 12.
  */
-bool
-_bt_heapkeyspace(Relation rel)
+void
+_bt_metaversion(Relation rel, bool *heapkeyspace, bool *safededup)
 {
 	BTMetaPageData *metad;
 
@@ -651,10 +677,11 @@ _bt_heapkeyspace(Relation rel)
 		 */
 		if (metad->btm_root == P_NONE)
 		{
-			uint32		btm_version = metad->btm_version;
+			*heapkeyspace = metad->btm_version > BTREE_NOVAC_VERSION;
+			*safededup = metad->btm_safededup;
 
 			_bt_relbuf(rel, metabuf);
-			return btm_version > BTREE_NOVAC_VERSION;
+			return;
 		}
 
 		/*
@@ -678,9 +705,11 @@ _bt_heapkeyspace(Relation rel)
 	Assert(metad->btm_magic == BTREE_MAGIC);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
 	Assert(metad->btm_version <= BTREE_VERSION);
+	Assert(!metad->btm_safededup || metad->btm_version > BTREE_NOVAC_VERSION);
 	Assert(metad->btm_fastroot != P_NONE);
 
-	return metad->btm_version > BTREE_NOVAC_VERSION;
+	*heapkeyspace = metad->btm_version > BTREE_NOVAC_VERSION;
+	*safededup = metad->btm_safededup;
 }
 
 /*
@@ -968,27 +997,73 @@ _bt_page_recyclable(Page page)
  * deleting the page it points to.
  *
  * This routine assumes that the caller has pinned and locked the buffer.
- * Also, the given deletable array *must* be sorted in ascending order.
+ * Also, the given deletable and updateitemnos arrays *must* be sorted in
+ * ascending order.
  *
  * We record VACUUMs and b-tree deletes differently in WAL.  Deletes must
  * generate recovery conflicts by accessing the heap inline, whereas VACUUMs
  * can rely on the initial heap scan taking care of the problem (pruning would
- * have generated the conflicts needed for hot standby already).
+ * have generated the conflicts needed for hot standby already).  Also,
+ * VACUUMs must deal with the case where posting list tuples have some dead
+ * TIDs, and some remaining TIDs that must not be killed.
  */
 void
-_bt_delitems_vacuum(Relation rel, Buffer buf, OffsetNumber *deletable,
-					int ndeletable)
+_bt_delitems_vacuum(Relation rel, Buffer buf,
+					OffsetNumber *deletable, int ndeletable,
+					OffsetNumber *updateitemnos,
+					IndexTuple *updated, int nupdatable)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	Size		itemsz;
+	Size		updated_sz = 0;
+	char	   *updated_buf = NULL;
 
-	Assert(ndeletable > 0);
+	Assert(ndeletable > 0 || nupdatable > 0);
+
+	/* XLOG stuff, buffer for updated */
+	if (nupdatable > 0 && RelationNeedsWAL(rel))
+	{
+		Size		offset = 0;
+
+		for (int i = 0; i < nupdatable; i++)
+			updated_sz += MAXALIGN(IndexTupleSize(updated[i]));
+
+		updated_buf = palloc(updated_sz);
+		for (int i = 0; i < nupdatable; i++)
+		{
+			itemsz = IndexTupleSize(updated[i]);
+			memcpy(updated_buf + offset, (char *) updated[i], itemsz);
+			offset += MAXALIGN(itemsz);
+		}
+		Assert(offset == updated_sz);
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
+	/* Handle posting tuple updates */
+	for (int i = 0; i < nupdatable; i++)
+	{
+		/*
+		 * Delete the old posting tuple first.  This will also clear the
+		 * LP_DEAD bit. (It would be correct to leave it set, but we're going
+		 * to unset the BTP_HAS_GARBAGE bit anyway.)
+		 */
+		PageIndexTupleDelete(page, updateitemnos[i]);
+
+		itemsz = IndexTupleSize(updated[i]);
+		itemsz = MAXALIGN(itemsz);
+
+		/* Add tuple with updated ItemPointers to the page */
+		if (PageAddItem(page, (Item) updated[i], itemsz, updateitemnos[i],
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to rewrite posting list item in index while doing vacuum");
+	}
+
 	/* Fix the page */
-	PageIndexMultiDelete(page, deletable, ndeletable);
+	if (ndeletable > 0)
+		PageIndexMultiDelete(page, deletable, ndeletable);
 
 	/*
 	 * We can clear the vacuum cycle ID since this page has certainly been
@@ -1015,6 +1090,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf, OffsetNumber *deletable,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.ndeleted = ndeletable;
+		xlrec_vacuum.nupdated = nupdatable;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1025,8 +1101,22 @@ _bt_delitems_vacuum(Relation rel, Buffer buf, OffsetNumber *deletable,
 		 * is.  When XLogInsert stores the whole buffer, the offsets array
 		 * need not be stored too.
 		 */
-		XLogRegisterBufData(0, (char *) deletable, ndeletable *
-							sizeof(OffsetNumber));
+		if (ndeletable > 0)
+			XLogRegisterBufData(0, (char *) deletable,
+								ndeletable * sizeof(OffsetNumber));
+
+		/*
+		 * Here we should save offnums and updated tuples themselves. It's
+		 * important to restore them in correct order. At first, we must
+		 * handle updated tuples and only after that other deleted items.
+		 */
+		if (nupdatable > 0)
+		{
+			Assert(updated_buf != NULL);
+			XLogRegisterBufData(0, (char *) updateitemnos,
+								nupdatable * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, updated_buf, updated_sz);
+		}
 
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
@@ -1034,6 +1124,96 @@ _bt_delitems_vacuum(Relation rel, Buffer buf, OffsetNumber *deletable,
 	}
 
 	END_CRIT_SECTION();
+
+	/* can't leak memory here */
+	if (updated_buf != NULL)
+		pfree(updated_buf);
+}
+
+/*
+ * Get the latestRemovedXid from the table entries pointed at by the index
+ * tuples being deleted.
+ *
+ * This is a version of index_compute_xid_horizon_for_tuples() specialized to
+ * nbtree, which can handle posting lists.
+ */
+static TransactionId
+_bt_compute_xid_horizon_for_tuples(Relation rel, Relation heapRel,
+								   Buffer buf, OffsetNumber *itemnos,
+								   int nitems)
+{
+	ItemPointer htids;
+	TransactionId latestRemovedXid = InvalidTransactionId;
+	Page		page = BufferGetPage(buf);
+	int			arraynitems;
+	int			finalnitems;
+
+	/*
+	 * Initial size of array can fit everything when it turns out that are no
+	 * posting lists
+	 */
+	arraynitems = nitems;
+	htids = (ItemPointer) palloc(sizeof(ItemPointerData) * arraynitems);
+
+	finalnitems = 0;
+	/* identify what the index tuples about to be deleted point to */
+	for (int i = 0; i < nitems; i++)
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+
+		itemid = PageGetItemId(page, itemnos[i]);
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(ItemIdIsDead(itemid));
+
+		if (!BTreeTupleIsPosting(itup))
+		{
+			/* Make sure that we have space for additional heap TID */
+			if (finalnitems + 1 > arraynitems)
+			{
+				arraynitems = arraynitems * 2;
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			Assert(ItemPointerIsValid(&itup->t_tid));
+			ItemPointerCopy(&itup->t_tid, &htids[finalnitems]);
+			finalnitems++;
+		}
+		else
+		{
+			int			nposting = BTreeTupleGetNPosting(itup);
+
+			/* Make sure that we have space for additional heap TIDs */
+			if (finalnitems + nposting > arraynitems)
+			{
+				arraynitems = Max(arraynitems * 2, finalnitems + nposting);
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			for (int j = 0; j < nposting; j++)
+			{
+				ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+				Assert(ItemPointerIsValid(htid));
+				ItemPointerCopy(htid, &htids[finalnitems]);
+				finalnitems++;
+			}
+		}
+	}
+
+	Assert(finalnitems >= nitems);
+
+	/* determine the actual xid horizon */
+	latestRemovedXid =
+		table_compute_xid_horizon_for_tuples(heapRel, htids, finalnitems);
+
+	/* be tidy */
+	pfree(htids);
+
+	return latestRemovedXid;
 }
 
 /*
@@ -1046,7 +1226,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf, OffsetNumber *deletable,
  *
  * This is nearly the same as _bt_delitems_vacuum as far as what it does to
  * the page, but it needs to generate its own recovery conflicts by accessing
- * the heap.  See comments for _bt_delitems_vacuum.
+ * the heap, and doesn't handle updating posting list tuples.  See comments
+ * for _bt_delitems_vacuum.
  */
 void
 _bt_delitems_delete(Relation rel, Buffer buf,
@@ -1062,8 +1243,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 
 	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
 		latestRemovedXid =
-			index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
-												 itemnos, nitems);
+			_bt_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+											   itemnos, nitems);
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
@@ -2061,6 +2242,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 			xlmeta.fastlevel = metad->btm_fastlevel;
 			xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 			xlmeta.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+			xlmeta.safededup = metad->btm_safededup;
 
 			XLogRegisterBufData(4, (char *) &xlmeta, sizeof(xl_btree_metadata));
 			xlinfo = XLOG_BTREE_UNLINK_PAGE_META;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index bbc1376b0a..809fddcfd7 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -95,6 +95,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static ItemPointer btreevacuumposting(BTVacState *vstate, IndexTuple itup,
+									  int *nremaining);
 
 
 /*
@@ -158,7 +160,7 @@ btbuildempty(Relation index)
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, P_NONE, 0);
+	_bt_initmetapage(metapage, P_NONE, 0, _bt_opclasses_support_dedup(index));
 
 	/*
 	 * Write the page and log it.  It might seem that an immediate sync would
@@ -261,8 +263,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 				 */
 				if (so->killedItems == NULL)
 					so->killedItems = (int *)
-						palloc(MaxIndexTuplesPerPage * sizeof(int));
-				if (so->numKilled < MaxIndexTuplesPerPage)
+						palloc(MaxBTreeIndexTuplesPerPage * sizeof(int));
+				if (so->numKilled < MaxBTreeIndexTuplesPerPage)
 					so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 			}
 
@@ -1151,8 +1153,17 @@ restart:
 	}
 	else if (P_ISLEAF(opaque))
 	{
+		/* Deletable item state */
 		OffsetNumber deletable[MaxOffsetNumber];
 		int			ndeletable;
+		int			nhtidsdead;
+		int			nhtidslive;
+
+		/* Updatable item state (for posting lists) */
+		IndexTuple	updated[MaxOffsetNumber];
+		OffsetNumber updatable[MaxOffsetNumber];
+		int			nupdatable;
+
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
@@ -1185,6 +1196,10 @@ restart:
 		 * callback function.
 		 */
 		ndeletable = 0;
+		nupdatable = 0;
+		/* Maintain stats counters for index tuple versions/heap TIDs */
+		nhtidsdead = 0;
+		nhtidslive = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
 		if (callback)
@@ -1194,11 +1209,9 @@ restart:
 				 offnum = OffsetNumberNext(offnum))
 			{
 				IndexTuple	itup;
-				ItemPointer htup;
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
 				/*
 				 * During Hot Standby we currently assume that it's okay that
@@ -1221,8 +1234,71 @@ restart:
 				 * applies to *any* type of index that marks index tuples as
 				 * killed.
 				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Regular tuple, standard heap TID representation */
+					ItemPointer htid = &(itup->t_tid);
+
+					if (callback(htid, callback_state))
+					{
+						deletable[ndeletable++] = offnum;
+						nhtidsdead++;
+					}
+					else
+						nhtidslive++;
+				}
+				else
+				{
+					ItemPointer newhtids;
+					int			nremaining;
+
+					/*
+					 * Posting list tuple, a physical tuple that represents
+					 * two or more logical tuples, any of which could be an
+					 * index row version that must be removed
+					 */
+					newhtids = btreevacuumposting(vstate, itup, &nremaining);
+					if (newhtids == NULL)
+					{
+						/*
+						 * All TIDs/logical tuples from the posting tuple
+						 * remain, so no update or delete required
+						 */
+						Assert(nremaining == BTreeTupleGetNPosting(itup));
+					}
+					else if (nremaining > 0)
+					{
+						IndexTuple	updatedtuple;
+
+						/*
+						 * Form new tuple that contains only remaining TIDs.
+						 * Remember this tuple and the offset of the old tuple
+						 * for when we update it in place
+						 */
+						Assert(nremaining < BTreeTupleGetNPosting(itup));
+						updatedtuple = _bt_form_posting(itup, newhtids,
+														nremaining);
+						updated[nupdatable] = updatedtuple;
+						updatable[nupdatable++] = offnum;
+						nhtidsdead += BTreeTupleGetNPosting(itup) - nremaining;
+						pfree(newhtids);
+					}
+					else
+					{
+						/*
+						 * All TIDs/logical tuples from the posting list must
+						 * be deleted.  We'll delete the physical tuple
+						 * completely.
+						 */
+						deletable[ndeletable++] = offnum;
+						nhtidsdead += BTreeTupleGetNPosting(itup);
+
+						/* Free empty array of live items */
+						pfree(newhtids);
+					}
+
+					nhtidslive += nremaining;
+				}
 			}
 		}
 
@@ -1230,13 +1306,18 @@ restart:
 		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
 		 * call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nupdatable > 0)
 		{
-			_bt_delitems_vacuum(rel, buf, deletable, ndeletable);
+			_bt_delitems_vacuum(rel, buf, deletable, ndeletable, updatable,
+								updated, nupdatable);
 
-			stats->tuples_removed += ndeletable;
+			stats->tuples_removed += nhtidsdead;
 			/* must recompute maxoff */
 			maxoff = PageGetMaxOffsetNumber(page);
+
+			/* free memory */
+			for (int i = 0; i < nupdatable; i++)
+				pfree(updated[i]);
 		}
 		else
 		{
@@ -1249,6 +1330,7 @@ restart:
 			 * We treat this like a hint-bit update because there's no need to
 			 * WAL-log it.
 			 */
+			Assert(nhtidsdead == 0);
 			if (vstate->cycleid != 0 &&
 				opaque->btpo_cycleid == vstate->cycleid)
 			{
@@ -1258,15 +1340,16 @@ restart:
 		}
 
 		/*
-		 * If it's now empty, try to delete; else count the live tuples. We
-		 * don't delete when recursing, though, to avoid putting entries into
+		 * If it's now empty, try to delete; else count the live tuples (live
+		 * heap TIDs in posting lists are counted as live tuples).  We don't
+		 * delete when recursing, though, to avoid putting entries into
 		 * freePages out-of-order (doesn't seem worth any extra code to handle
 		 * the case).
 		 */
 		if (minoff > maxoff)
 			delete_now = (blkno == orig_blkno);
 		else
-			stats->num_index_tuples += maxoff - minoff + 1;
+			stats->num_index_tuples += nhtidslive;
 	}
 
 	if (delete_now)
@@ -1309,6 +1392,68 @@ restart:
 	}
 }
 
+/*
+ * btreevacuumposting() -- determines which logical tuples must remain when
+ * VACUUMing a posting list tuple.
+ *
+ * Returns new palloc'd array of item pointers needed to build replacement
+ * posting list without the index row versions that are to be deleted.
+ *
+ * Note that returned array is NULL in the common case where there is nothing
+ * to delete in caller's posting list tuple.  The number of TIDs that should
+ * remain in the posting list tuple is set for caller in *nremaining.  This is
+ * also the size of the returned array (though only when array isn't just
+ * NULL).
+ */
+static ItemPointer
+btreevacuumposting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+	int			live = 0;
+	int			nitem = BTreeTupleGetNPosting(itup);
+	ItemPointer tmpitems = NULL,
+				items = BTreeTupleGetPosting(itup);
+
+	Assert(BTreeTupleIsPosting(itup));
+
+	/*
+	 * Check each tuple in the posting list.  Save live tuples into tmpitems,
+	 * though try to avoid memory allocation as an optimization.
+	 */
+	for (int i = 0; i < nitem; i++)
+	{
+		if (!vstate->callback(items + i, vstate->callback_state))
+		{
+			/*
+			 * Live heap TID.
+			 *
+			 * Only save live TID when we know that we're going to have to
+			 * kill at least one TID, and have already allocated memory.
+			 */
+			if (tmpitems)
+				tmpitems[live] = items[i];
+			live++;
+		}
+
+		/* Dead heap TID */
+		else if (tmpitems == NULL)
+		{
+			/*
+			 * Turns out we need to delete one or more dead heap TIDs, so
+			 * start maintaining an array of live TIDs for caller to
+			 * reconstruct smaller replacement posting list tuple
+			 */
+			tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+			/* Copy live heap TIDs from previous loop iterations */
+			if (live > 0)
+				memcpy(tmpitems, items, sizeof(ItemPointerData) * live);
+		}
+	}
+
+	*nremaining = live;
+	return tmpitems;
+}
+
 /*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8e512461a0..fe3eed86ca 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int	_bt_binsrch_posting(BTScanInsert key, Page page,
+								OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static int _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+								 OffsetNumber offnum, ItemPointer heapTid,
+								 IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+									   OffsetNumber offnum,
+									   ItemPointer heapTid, int tupleOffset);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -434,7 +442,10 @@ _bt_binsrch(Relation rel,
  * low) makes bounds invalid.
  *
  * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
  */
 OffsetNumber
 _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +464,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 	Assert(P_ISLEAF(opaque));
 	Assert(!key->nextkey);
+	Assert(insertstate->postingoff == 0);
 
 	if (!insertstate->bounds_valid)
 	{
@@ -509,6 +521,16 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 			if (result != 0)
 				stricthigh = high;
 		}
+
+		/*
+		 * If tuple at offset located by binary search is a posting list whose
+		 * TID range overlaps with caller's scantid, perform posting list
+		 * binary search to set postingoff for caller.  Caller must split the
+		 * posting list when postingoff is set.  This should happen
+		 * infrequently.
+		 */
+		if (unlikely(result == 0 && key->scantid != NULL))
+			insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
 	}
 
 	/*
@@ -528,6 +550,68 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	return low;
 }
 
+/*----------
+ *	_bt_binsrch_posting() -- posting list binary search.
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+	IndexTuple	itup;
+	ItemId		itemid;
+	int			low,
+				high,
+				mid,
+				res;
+
+	/*
+	 * If this isn't a posting tuple, then the index must be corrupt (if it is
+	 * an ordinary non-pivot tuple then there must be an existing tuple with a
+	 * heap TID that equals inserter's new heap TID/scantid).  Defensively
+	 * check that tuple is a posting list tuple whose posting list range
+	 * includes caller's scantid.
+	 *
+	 * (This is also needed because contrib/amcheck's rootdescend option needs
+	 * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+	 */
+	itemid = PageGetItemId(page, offnum);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+	if (!BTreeTupleIsPosting(itup))
+		return 0;
+
+	/*
+	 * In the unlikely event that posting list tuple has LP_DEAD bit set,
+	 * signal to caller that it should kill the item and restart its binary
+	 * search.
+	 */
+	if (ItemIdIsDead(itemid))
+		return -1;
+
+	/* "high" is past end of posting list for loop invariant */
+	low = 0;
+	high = BTreeTupleGetNPosting(itup);
+	Assert(high >= 2);
+
+	while (high > low)
+	{
+		mid = low + ((high - low) / 2);
+		res = ItemPointerCompare(key->scantid,
+								 BTreeTupleGetPostingN(itup, mid));
+
+		if (res > 0)
+			low = mid + 1;
+		else if (res < 0)
+			high = mid;
+		else
+			return mid;
+	}
+
+	/* Exact match not found */
+	return low;
+}
+
 /*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
@@ -537,9 +621,14 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
  *			<0 if scankey < tuple at offnum;
  *			 0 if scankey == tuple at offnum;
  *			>0 if scankey > tuple at offnum.
- *		NULLs in the keys are treated as sortable values.  Therefore
- *		"equality" does not necessarily mean that the item should be
- *		returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values.  Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key.  Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid.  There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
  *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
@@ -563,6 +652,7 @@ _bt_compare(Relation rel,
 	ScanKey		scankey;
 	int			ncmpkey;
 	int			ntupatts;
+	int32		result;
 
 	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +687,6 @@ _bt_compare(Relation rel,
 	{
 		Datum		datum;
 		bool		isNull;
-		int32		result;
 
 		datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
 
@@ -713,8 +802,25 @@ _bt_compare(Relation rel,
 	if (heapTid == NULL)
 		return 1;
 
+	/*
+	 * scankey must be treated as equal to a posting list tuple if its scantid
+	 * value falls within the range of the posting list.  In all other cases
+	 * there can only be a single heap TID value, which is compared directly
+	 * as a simple scalar value.
+	 */
 	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	return ItemPointerCompare(key->scantid, heapTid);
+	result = ItemPointerCompare(key->scantid, heapTid);
+	if (result <= 0 || !BTreeTupleIsPosting(itup))
+		return result;
+	else
+	{
+		result = ItemPointerCompare(key->scantid,
+									BTreeTupleGetMaxHeapTID(itup));
+		if (result > 0)
+			return 1;
+	}
+
+	return 0;
 }
 
 /*
@@ -1229,7 +1335,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	}
 
 	/* Initialize remaining insertion scan key fields */
-	inskey.heapkeyspace = _bt_heapkeyspace(rel);
+	_bt_metaversion(rel, &inskey.heapkeyspace, &inskey.safededup);
 	inskey.anynullkeys = false; /* unused */
 	inskey.nextkey = nextkey;
 	inskey.pivotsearch = false;
@@ -1484,9 +1590,35 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
-				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				/* tuple passes all scan key conditions */
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Remember it */
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+				else
+				{
+					int			tupleOffset;
+
+					/*
+					 * Set up state to return posting list, and remember first
+					 * "logical" tuple
+					 */
+					tupleOffset =
+						_bt_setuppostingitems(so, itemIndex, offnum,
+											  BTreeTupleGetPostingN(itup, 0),
+											  itup);
+					itemIndex++;
+					/* Remember additional logical tuples */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											tupleOffset);
+						itemIndex++;
+					}
+				}
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1519,7 +1651,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxBTreeIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1527,7 +1659,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxBTreeIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1568,9 +1700,41 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 										 &continuescan);
 			if (passes_quals && tuple_alive)
 			{
-				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				/* tuple passes all scan key conditions */
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Remember it */
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+				else
+				{
+					int			tupleOffset;
+
+					/*
+					 * Set up state to return posting list, and remember first
+					 * "logical" tuple.
+					 *
+					 * Note that we deliberately save/return items from
+					 * posting lists in ascending heap TID order for backwards
+					 * scans.  This allows _bt_killitems() to make a
+					 * consistent assumption about the order of items
+					 * associated with the same posting list tuple.
+					 */
+					itemIndex--;
+					tupleOffset =
+						_bt_setuppostingitems(so, itemIndex, offnum,
+											  BTreeTupleGetPostingN(itup, 0),
+											  itup);
+					/* Remember additional logical tuples */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						itemIndex--;
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											tupleOffset);
+					}
+				}
 			}
 			if (!continuescan)
 			{
@@ -1584,8 +1748,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxBTreeIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxBTreeIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1762,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1610,6 +1776,71 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 	}
 }
 
+/*
+ * Setup state to save posting items from a single posting list tuple.  Saves
+ * the logical tuple that will be returned to scan first in passing.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for logical tuple
+ * that is returned to scan first.  Second or subsequent heap TID for posting
+ * list should be saved by calling _bt_savepostingitem().
+ *
+ * Returns an offset into tuple storage space that main tuple is stored at if
+ * needed.
+ */
+static int
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					  ItemPointer heapTid, IndexTuple itup)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	Assert(BTreeTupleIsPosting(itup));
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+	if (so->currTuples)
+	{
+		/* Save base IndexTuple (truncate posting list) */
+		IndexTuple	base;
+		Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+		itupsz = MAXALIGN(itupsz);
+		currItem->tupleOffset = so->currPos.nextTupleOffset;
+		base = (IndexTuple) (so->currTuples + so->currPos.nextTupleOffset);
+		memcpy(base, itup, itupsz);
+		/* Defensively reduce work area index tuple header size */
+		base->t_info &= ~INDEX_SIZE_MASK;
+		base->t_info |= itupsz;
+		so->currPos.nextTupleOffset += itupsz;
+
+		return currItem->tupleOffset;
+	}
+
+	return 0;
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer heapTid, int tupleOffset)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+
+	/*
+	 * Have index-only scans return the same base IndexTuple for every logical
+	 * tuple that originates from the same posting list
+	 */
+	if (so->currTuples)
+		currItem->tupleOffset = tupleOffset;
+}
+
 /*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 1dd39a9535..3e8e4ac012 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -243,6 +243,7 @@ typedef struct BTPageState
 	BlockNumber btps_blkno;		/* block # to write this page at */
 	IndexTuple	btps_lowkey;	/* page's strict lower bound pivot tuple */
 	OffsetNumber btps_lastoff;	/* last item offset loaded */
+	Size		btps_lastextra; /* last item's extra posting list space */
 	uint32		btps_level;		/* tree level (0 = leaf) */
 	Size		btps_full;		/* "full" if less than this much free space */
 	struct BTPageState *btps_next;	/* link to parent level, if any */
@@ -277,7 +278,10 @@ static void _bt_slideleft(Page page);
 static void _bt_sortaddtup(Page page, Size itemsize,
 						   IndexTuple itup, OffsetNumber itup_off);
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
-						 IndexTuple itup);
+						 IndexTuple itup, Size truncextra);
+static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
+										  BTPageState *state,
+										  BTDedupState dstate);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
@@ -711,6 +715,7 @@ _bt_pagestate(BTWriteState *wstate, uint32 level)
 	state->btps_lowkey = NULL;
 	/* initialize lastoff so first item goes into P_FIRSTKEY */
 	state->btps_lastoff = P_HIKEY;
+	state->btps_lastextra = 0;
 	state->btps_level = level;
 	/* set "full" threshold based on level.  See notes at head of file. */
 	if (level > 0)
@@ -789,7 +794,8 @@ _bt_sortaddtup(Page page,
 }
 
 /*----------
- * Add an item to a disk page from the sort output.
+ * Add an item to a disk page from the sort output (or add a posting list
+ * item formed from the sort output).
  *
  * We must be careful to observe the page layout conventions of nbtsearch.c:
  * - rightmost pages start data items at P_HIKEY instead of at P_FIRSTKEY.
@@ -821,14 +827,27 @@ _bt_sortaddtup(Page page,
  * the truncated high key at offset 1.
  *
  * 'last' pointer indicates the last offset added to the page.
+ *
+ * 'truncextra' is the size of the posting list in itup, if any.  This
+ * information is stashed for the next call here, when we may benefit
+ * from considering the impact of truncating away the posting list on
+ * the page before deciding to finish the page off.  Posting lists are
+ * often relatively large, so it is worth going to the trouble of
+ * accounting for the saving from truncating away the posting list of
+ * the tuple that becomes the high key (that may be the only way to
+ * get close to target free space on the page).  Note that this is
+ * only used for the soft fillfactor-wise limit, not the critical hard
+ * limit.
  *----------
  */
 static void
-_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
+_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
+			 Size truncextra)
 {
 	Page		npage;
 	BlockNumber nblkno;
 	OffsetNumber last_off;
+	Size		last_truncextra;
 	Size		pgspc;
 	Size		itupsz;
 	bool		isleaf;
@@ -842,6 +861,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	npage = state->btps_page;
 	nblkno = state->btps_blkno;
 	last_off = state->btps_lastoff;
+	last_truncextra = state->btps_lastextra;
+	state->btps_lastextra = truncextra;
 
 	pgspc = PageGetFreeSpace(npage);
 	itupsz = IndexTupleSize(itup);
@@ -883,10 +904,10 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	 * page.  Disregard fillfactor and insert on "full" current page if we
 	 * don't have the minimum number of items yet.  (Note that we deliberately
 	 * assume that suffix truncation neither enlarges nor shrinks new high key
-	 * when applying soft limit.)
+	 * when applying soft limit, except when last tuple had a posting list.)
 	 */
 	if (pgspc < itupsz + (isleaf ? MAXALIGN(sizeof(ItemPointerData)) : 0) ||
-		(pgspc < state->btps_full && last_off > P_FIRSTKEY))
+		(pgspc + last_truncextra < state->btps_full && last_off > P_FIRSTKEY))
 	{
 		/*
 		 * Finish off the page and write it out.
@@ -944,11 +965,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * We don't try to bias our choice of split point to make it more
 			 * likely that _bt_truncate() can truncate away more attributes,
 			 * whereas the split point used within _bt_split() is chosen much
-			 * more delicately.  Suffix truncation is mostly useful because it
-			 * improves space utilization for workloads with random
-			 * insertions.  It doesn't seem worthwhile to add logic for
-			 * choosing a split point here for a benefit that is bound to be
-			 * much smaller.
+			 * more delicately.  On the other hand, non-unique index builds
+			 * usually deduplicate, which often results in every "physical"
+			 * tuple on the page having distinct key values.  When that
+			 * happens, _bt_truncate() will never need to include a heap TID
+			 * in the new high key.
 			 *
 			 * Overwrite the old item with new truncated high key directly.
 			 * oitup is already located at the physical beginning of tuple
@@ -983,7 +1004,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(BTreeTupleGetNAtts(state->btps_lowkey, wstate->index) == 0 ||
 			   !P_LEFTMOST((BTPageOpaque) PageGetSpecialPointer(opage)));
 		BTreeInnerTupleSetDownLink(state->btps_lowkey, oblkno);
-		_bt_buildadd(wstate, state->btps_next, state->btps_lowkey);
+		_bt_buildadd(wstate, state->btps_next, state->btps_lowkey, 0);
 		pfree(state->btps_lowkey);
 
 		/*
@@ -1045,6 +1066,47 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	state->btps_lastoff = last_off;
 }
 
+/*
+ * Finalize pending posting list tuple, and add it to the index.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * This is almost like _bt_dedup_finish_pending(), but it adds a new tuple
+ * using _bt_buildadd() and does not maintain the intervals array.
+ */
+static void
+_bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
+							  BTDedupState dstate)
+{
+	IndexTuple	final;
+	Size		truncextra;
+
+	Assert(dstate->nitems > 0);
+	truncextra = 0;
+	if (dstate->nitems == 1)
+		final = dstate->base;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = _bt_form_posting(dstate->base,
+										dstate->htids,
+										dstate->nhtids);
+		final = postingtuple;
+		/* Determine size of posting list */
+		truncextra = IndexTupleSize(final) -
+			BTreeTupleGetPostingOffset(final);
+	}
+
+	_bt_buildadd(wstate, state, final, truncextra);
+
+	if (dstate->nitems > 1)
+		pfree(final);
+	/* Don't maintain  dedup_intervals array, or alltupsize */
+	dstate->nhtids = 0;
+	dstate->nitems = 0;
+}
+
 /*
  * Finish writing out the completed btree.
  */
@@ -1090,7 +1152,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 			Assert(BTreeTupleGetNAtts(s->btps_lowkey, wstate->index) == 0 ||
 				   !P_LEFTMOST(opaque));
 			BTreeInnerTupleSetDownLink(s->btps_lowkey, blkno);
-			_bt_buildadd(wstate, s->btps_next, s->btps_lowkey);
+			_bt_buildadd(wstate, s->btps_next, s->btps_lowkey, 0);
 			pfree(s->btps_lowkey);
 			s->btps_lowkey = NULL;
 		}
@@ -1111,7 +1173,8 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 	 * by filling in a valid magic number in the metapage.
 	 */
 	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, rootblkno, rootlevel);
+	_bt_initmetapage(metapage, rootblkno, rootlevel,
+					 wstate->inskey->safededup);
 	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
 }
 
@@ -1132,6 +1195,9 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		deduplicate;
+
+	deduplicate = wstate->inskey->safededup && BTGetUseDedup(wstate->index);
 
 	if (merge)
 	{
@@ -1228,12 +1294,12 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 
 			if (load1)
 			{
-				_bt_buildadd(wstate, state, itup);
+				_bt_buildadd(wstate, state, itup, 0);
 				itup = tuplesort_getindextuple(btspool->sortstate, true);
 			}
 			else
 			{
-				_bt_buildadd(wstate, state, itup2);
+				_bt_buildadd(wstate, state, itup2, 0);
 				itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
 			}
 
@@ -1243,9 +1309,111 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		}
 		pfree(sortKeys);
 	}
+	else if (deduplicate)
+	{
+		/* merge is unnecessary, deduplicate into posting lists */
+		BTDedupState dstate;
+		IndexTuple	newbase;
+
+		dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
+		dstate->maxitemsize = 0;	/* set later */
+		dstate->checkingunique = false; /* unused */
+		dstate->skippedbase = InvalidOffsetNumber;
+		/* Metadata about current pending posting list */
+		dstate->htids = NULL;
+		dstate->nhtids = 0;
+		dstate->nitems = 0;
+		dstate->alltupsize = 0; /* unused */
+		/* Metadata about based tuple of current pending posting list */
+		dstate->base = NULL;
+		dstate->baseoff = InvalidOffsetNumber;	/* unused */
+		dstate->basetupsize = 0;
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate,
+											   true)) != NULL)
+		{
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+			{
+				state = _bt_pagestate(wstate, 0);
+
+				/*
+				 * Limit size of posting list tuples to the size of the free
+				 * space we want to leave behind on the page, plus space for
+				 * final item's line pointer (but make sure that posting list
+				 * tuple size won't exceed the generic 1/3 of a page limit).
+				 *
+				 * This is more conservative than the approach taken in the
+				 * retail insert path, but it allows us to get most of the
+				 * space savings deduplication provides without noticeably
+				 * impacting how much free space is left behind on each leaf
+				 * page.
+				 */
+				dstate->maxitemsize =
+					Min(Min(BTMaxItemSize(state->btps_page), INDEX_SIZE_MASK),
+						MAXALIGN_DOWN(state->btps_full) - sizeof(ItemIdData));
+				/* Minimum posting tuple size used here is arbitrary: */
+				dstate->maxitemsize = Max(dstate->maxitemsize, 100);
+				dstate->htids = palloc(dstate->maxitemsize);
+
+				/*
+				 * No previous/base tuple, since itup is the  first item
+				 * returned by the tuplesort -- use itup as base tuple of
+				 * first pending posting list for entire index build
+				 */
+				newbase = CopyIndexTuple(itup);
+				_bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+			}
+			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
+										 itup) > keysz &&
+					 _bt_dedup_save_htid(dstate, itup))
+			{
+				/*
+				 * Tuple is equal to base tuple of pending posting list, and
+				 * merging itup into pending posting list won't exceed the
+				 * maxitemsize limit.  Heap TID(s) for itup have been saved in
+				 * state.  The next iteration will also end up here if it's
+				 * possible to merge the next tuple into the same pending
+				 * posting list.
+				 */
+			}
+			else
+			{
+				/*
+				 * Tuple is not equal to pending posting list tuple, or
+				 * maxitemsize limit was reached
+				 */
+				_bt_sort_dedup_finish_pending(wstate, state, dstate);
+				/* Base tuple is always a copy */
+				pfree(dstate->base);
+
+				/* itup starts new pending posting list */
+				newbase = CopyIndexTuple(itup);
+				_bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+			}
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+
+		/*
+		 * Handle the last item (there must be a last item when the tuplesort
+		 * returned one or more tuples)
+		 */
+		if (state)
+		{
+			_bt_sort_dedup_finish_pending(wstate, state, dstate);
+			/* Base tuple is always a copy */
+			pfree(dstate->base);
+			pfree(dstate->htids);
+		}
+
+		pfree(dstate);
+	}
 	else
 	{
-		/* merge is unnecessary */
+		/* merging and deduplication are both unnecessary */
 		while ((itup = tuplesort_getindextuple(btspool->sortstate,
 											   true)) != NULL)
 		{
@@ -1253,7 +1421,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			if (state == NULL)
 				state = _bt_pagestate(wstate, 0);
 
-			_bt_buildadd(wstate, state, itup);
+			_bt_buildadd(wstate, state, itup, 0);
 
 			/* Report progress */
 			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 29167f1ef5..950c6d7673 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
 	state.minfirstrightsz = SIZE_MAX;
 	state.newitemoff = newitemoff;
 
+	/* newitem cannot be a posting list item */
+	Assert(!BTreeTupleIsPosting(newitem));
+
 	/*
 	 * maxsplits should never exceed maxoff because there will be at most as
 	 * many candidate split points as there are points _between_ tuples, once
@@ -459,6 +462,7 @@ _bt_recsplitloc(FindSplitData *state,
 	int16		leftfree,
 				rightfree;
 	Size		firstrightitemsz;
+	Size		postingsz = 0;
 	bool		newitemisfirstonright;
 
 	/* Is the new item going to be the first item on the right page? */
@@ -468,8 +472,30 @@ _bt_recsplitloc(FindSplitData *state,
 	if (newitemisfirstonright)
 		firstrightitemsz = state->newitemsz;
 	else
+	{
 		firstrightitemsz = firstoldonrightsz;
 
+		/*
+		 * Calculate suffix truncation space saving when firstright is a
+		 * posting list tuple, though only when the firstright is over 64
+		 * bytes including line pointer overhead (arbitrary).  This avoids
+		 * accessing the tuple in cases where its posting list must be very
+		 * small (if firstright has one at all).
+		 */
+		if (state->is_leaf && firstrightitemsz > 64)
+		{
+			ItemId		itemid;
+			IndexTuple	newhighkey;
+
+			itemid = PageGetItemId(state->page, firstoldonright);
+			newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+			if (BTreeTupleIsPosting(newhighkey))
+				postingsz = IndexTupleSize(newhighkey) -
+					BTreeTupleGetPostingOffset(newhighkey);
+		}
+	}
+
 	/* Account for all the old tuples */
 	leftfree = state->leftspace - olddataitemstoleft;
 	rightfree = state->rightspace -
@@ -491,11 +517,17 @@ _bt_recsplitloc(FindSplitData *state,
 	 * If we are on the leaf level, assume that suffix truncation cannot avoid
 	 * adding a heap TID to the left half's new high key when splitting at the
 	 * leaf level.  In practice the new high key will often be smaller and
-	 * will rarely be larger, but conservatively assume the worst case.
+	 * will rarely be larger, but conservatively assume the worst case.  We do
+	 * go to the trouble of subtracting away posting list overhead, though
+	 * only when it looks like it will make an appreciable difference.
+	 * (Posting lists are the only case where truncation will typically make
+	 * the final high key far smaller than firstright, so being a bit more
+	 * precise there noticeably improves the balance of free space.)
 	 */
 	if (state->is_leaf)
 		leftfree -= (int16) (firstrightitemsz +
-							 MAXALIGN(sizeof(ItemPointerData)));
+							 MAXALIGN(sizeof(ItemPointerData)) -
+							 postingsz);
 	else
 		leftfree -= (int16) firstrightitemsz;
 
@@ -691,7 +723,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
 	tup = (IndexTuple) PageGetItem(state->page, itemid);
 	/* Do cheaper test first */
-	if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+	if (BTreeTupleIsPosting(tup) ||
+		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
 	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index ee972a1465..086f8bd1b9 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -20,6 +20,7 @@
 #include "access/nbtree.h"
 #include "access/reloptions.h"
 #include "access/relscan.h"
+#include "catalog/catalog.h"
 #include "commands/progress.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -107,7 +108,13 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	 */
 	key = palloc(offsetof(BTScanInsertData, scankeys) +
 				 sizeof(ScanKeyData) * indnkeyatts);
-	key->heapkeyspace = itup == NULL || _bt_heapkeyspace(rel);
+	if (itup)
+		_bt_metaversion(rel, &key->heapkeyspace, &key->safededup);
+	else
+	{
+		key->heapkeyspace = true;
+		key->safededup = _bt_opclasses_support_dedup(rel);
+	}
 	key->anynullkeys = false;	/* initial assumption */
 	key->nextkey = false;
 	key->pivotsearch = false;
@@ -1373,6 +1380,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			continue;
 		}
 
@@ -1534,6 +1542,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			cmpresult = 0;
 			if (subkey->sk_flags & SK_ROW_END)
 				break;
@@ -1773,10 +1782,65 @@ _bt_killitems(IndexScanDesc scan)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
+			bool		killtuple = false;
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			if (BTreeTupleIsPosting(ituple))
 			{
-				/* found the item */
+				int			pi = i + 1;
+				int			nposting = BTreeTupleGetNPosting(ituple);
+				int			j;
+
+				/*
+				 * Note that we rely on the assumption that heap TIDs in the
+				 * scanpos items array are always in ascending heap TID order
+				 * within a posting list
+				 */
+				for (j = 0; j < nposting; j++)
+				{
+					ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+					if (!ItemPointerEquals(item, &kitem->heapTid))
+						break;	/* out of posting list loop */
+
+					/* kitem must have matching offnum when heap TIDs match */
+					Assert(kitem->indexOffset == offnum);
+
+					/*
+					 * Read-ahead to later kitems here.
+					 *
+					 * We rely on the assumption that not advancing kitem here
+					 * will prevent us from considering the posting list tuple
+					 * fully dead by not matching its next heap TID in next
+					 * loop iteration.
+					 *
+					 * If, on the other hand, this is the final heap TID in
+					 * the posting list tuple, then tuple gets killed
+					 * regardless (i.e. we handle the case where the last
+					 * kitem is also the last heap TID in the last index tuple
+					 * correctly -- posting tuple still gets killed).
+					 */
+					if (pi < numKilled)
+						kitem = &so->currPos.items[so->killedItems[pi++]];
+				}
+
+				/*
+				 * Don't bother advancing the outermost loop's int iterator to
+				 * avoid processing killed items that relate to the same
+				 * offnum/posting list tuple.  This micro-optimization hardly
+				 * seems worth it.  (Further iterations of the outermost loop
+				 * will fail to match on this same posting list's first heap
+				 * TID instead, so we'll advance to the next offnum/index
+				 * tuple pretty quickly.)
+				 */
+				if (j == nposting)
+					killtuple = true;
+			}
+			else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+				killtuple = true;
+
+			if (killtuple)
+			{
+				/* found the item/all posting list items */
 				ItemIdMarkDead(iid);
 				killedsomething = true;
 				break;			/* out of inner search loop */
@@ -2017,7 +2081,9 @@ btoptions(Datum reloptions, bool validate)
 	static const relopt_parse_elt tab[] = {
 		{"fillfactor", RELOPT_TYPE_INT, offsetof(BTOptions, fillfactor)},
 		{"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
-		offsetof(BTOptions, vacuum_cleanup_index_scale_factor)}
+		offsetof(BTOptions, vacuum_cleanup_index_scale_factor)},
+		{"deduplication", RELOPT_TYPE_BOOL,
+		offsetof(BTOptions, deduplication)}
 
 	};
 
@@ -2138,6 +2204,19 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(pivot))
+		{
+			/*
+			 * index_truncate_tuple() just returns a straight copy of
+			 * firstright when it has no key attributes to truncate.  We need
+			 * to truncate away the posting list ourselves.
+			 */
+			Assert(keepnatts == nkeyatts);
+			Assert(natts == nkeyatts);
+			pivot->t_info &= ~INDEX_SIZE_MASK;
+			pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+		}
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2154,6 +2233,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * attribute to the new pivot tuple.
 		 */
 		Assert(natts != nkeyatts);
+		Assert(!BTreeTupleIsPosting(lastleft) &&
+			   !BTreeTupleIsPosting(firstright));
 		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
 		tidpivot = palloc0(newsize);
 		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2171,6 +2252,18 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
 		pivot = palloc0(newsize);
 		memcpy(pivot, firstright, IndexTupleSize(firstright));
+
+		if (BTreeTupleIsPosting(pivot))
+		{
+			/*
+			 * New pivot tuple was copied from firstright, which happens to be
+			 * a posting list tuple.  We will have to include a lastleft heap
+			 * TID in the final pivot, but we can remove the posting list now.
+			 * (Pivot tuples should never contain a posting list.)
+			 */
+			newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+				MAXALIGN(sizeof(ItemPointerData));
+		}
 	}
 
 	/*
@@ -2198,7 +2291,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2209,9 +2302,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
+							  BTreeTupleGetHeapTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #else
 
 	/*
@@ -2224,7 +2320,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2233,7 +2329,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2314,13 +2411,16 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
  * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal after detoasting.
+ * unless they're bitwise equal after detoasting.  When an index is considered
+ * deduplication-safe by _bt_opclasses_support_dedup, routine is guaranteed to
+ * give the same result as _bt_keep_natts would.
  *
- * These issues must be acceptable to callers, typically because they're only
- * concerned about making suffix truncation as effective as possible without
- * leaving excessive amounts of free space on either side of page split.
- * Callers can rely on the fact that attributes considered equal here are
- * definitely also equal according to _bt_keep_natts.
+ * Suffix truncation callers can rely on the fact that attributes considered
+ * equal here are definitely also equal according to _bt_keep_natts, even when
+ * the index uses an opclass or collation that is not deduplication-safe.
+ * This weaker guarantee is good enough for these callers, since false
+ * negatives generally only have the effect of making leaf page splits use a
+ * more balanced split point.
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2398,22 +2498,30 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
+	/* !heapkeyspace indexes do not support deduplication */
+	if (!heapkeyspace && BTreeTupleIsPosting(itup))
+		return false;
+
+	/* INCLUDE indexes do not support deduplication */
+	if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+		return false;
+
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
 			/*
-			 * Non-pivot tuples currently never use alternative heap TID
-			 * representation -- even those within heapkeyspace indexes
+			 * Non-pivot tuple should never be explicitly marked as a pivot
+			 * tuple
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated.  (Note that tupnatts must have been
-			 * inferred, rather than coming from an explicit on-disk
-			 * representation.)
+			 * inferred, even with a posting list tuple, because only pivot
+			 * tuples store tupnatts directly.)
 			 */
 			return tupnatts == natts;
 		}
@@ -2457,12 +2565,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * non-zero, or when there is no explicit representation and the
 			 * tuple is evidently not a pre-pg_upgrade tuple.
 			 *
-			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
-			 * that to decide if the tuple is a pre-v11 tuple.
+			 * Prior to v11, downlinks always had P_HIKEY as their offset.
+			 * Accept that as an alternative indication of a valid
+			 * !heapkeyspace negative infinity tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
-				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+				ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
 		}
 		else
 		{
@@ -2488,7 +2596,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
+		return false;
+
+	/* Pivot tuple should not use posting list representation (redundant) */
+	if (BTreeTupleIsPosting(itup))
 		return false;
 
 	/*
@@ -2558,11 +2670,54 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					BTMaxItemSizeNoHeapTid(page),
 					RelationGetRelationName(rel)),
 			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
-					   ItemPointerGetBlockNumber(&newtup->t_tid),
-					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+					   ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
 					   RelationGetRelationName(heap)),
 			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
 					 "Consider a function index of an MD5 hash of the value, "
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Is it safe to perform deduplication for an index, given the opclasses and
+ * collations used?
+ *
+ * Returned value is stored in index metapage during index builds.  Function
+ * does not account for incompatibilities caused by index being on an earlier
+ * nbtree version.
+ */
+bool
+_bt_opclasses_support_dedup(Relation index)
+{
+	/* INCLUDE indexes don't support deduplication */
+	if (IndexRelationGetNumberOfAttributes(index) !=
+		IndexRelationGetNumberOfKeyAttributes(index))
+		return false;
+
+	/*
+	 * There is no reason why deduplication cannot be used with system catalog
+	 * indexes.  However, we deem it generally unsafe because it's not clear
+	 * how it could be disabled.  (ALTER INDEX is not supported with system
+	 * catalog indexes, so users have no way to set the "deduplicate" storage
+	 * parameter.)
+	 */
+	if (IsCatalogRelation(index))
+		return false;
+
+	for (int i = 0; i < IndexRelationGetNumberOfKeyAttributes(index); i++)
+	{
+		Oid			opfamily = index->rd_opfamily[i];
+		Oid			collation = index->rd_indcollation[i];
+
+		/* TODO add adequate check of opclasses and collations */
+		elog(DEBUG4, "index %s column i %d opfamilyOid %u collationOid %u",
+			 RelationGetRelationName(index), i, opfamily, collation);
+
+		/* NUMERIC btree opfamily OID is 1988 */
+		if (opfamily == 1988)
+			return false;
+	}
+
+	return true;
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 72a601bb22..73bb03bc6b 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -22,6 +22,9 @@
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/procarray.h"
+#include "utils/memutils.h"
+
+static MemoryContext opCtx;		/* working memory for operations */
 
 /*
  * _bt_restore_page -- re-enter all the index tuples on a page
@@ -111,6 +114,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
 	Assert(md->btm_version >= BTREE_NOVAC_VERSION);
 	md->btm_oldest_btpo_xact = xlrec->oldest_btpo_xact;
 	md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
+	md->btm_safededup = xlrec->safededup;
 
 	pageop = (BTPageOpaque) PageGetSpecialPointer(metapg);
 	pageop->btpo_flags = BTP_META;
@@ -156,7 +160,8 @@ _bt_clear_incomplete_split(XLogReaderState *record, uint8 block_id)
 }
 
 static void
-btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
+btree_xlog_insert(bool isleaf, bool ismeta, bool posting,
+				  XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	xl_btree_insert *xlrec = (xl_btree_insert *) XLogRecGetData(record);
@@ -181,9 +186,52 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 
 		page = BufferGetPage(buffer);
 
-		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
-						false, false) == InvalidOffsetNumber)
-			elog(PANIC, "btree_xlog_insert: failed to add item");
+		if (likely(!posting))
+		{
+			/* Simple retail insertion */
+			if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+		}
+		else
+		{
+			ItemId		itemid;
+			IndexTuple	oposting,
+						newitem,
+						nposting;
+			uint16		postingoff;
+
+			/*
+			 * A posting list split occurred during leaf page insertion.  WAL
+			 * record data will start with an offset number representing the
+			 * point in an existing posting list that a split occurs at.
+			 *
+			 * Use _bt_swap_posting() to repeat posting list split steps from
+			 * primary.  Note that newitem from WAL record is 'orignewitem',
+			 * not the final version of newitem that is actually inserted on
+			 * page.
+			 */
+			postingoff = *((uint16 *) datapos);
+			datapos += sizeof(uint16);
+			datalen -= sizeof(uint16);
+
+			itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+			oposting = (IndexTuple) PageGetItem(page, itemid);
+
+			/* newitem must be mutable copy for _bt_swap_posting() */
+			Assert(isleaf && postingoff > 0);
+			newitem = CopyIndexTuple((IndexTuple) datapos);
+			nposting = _bt_swap_posting(newitem, oposting, postingoff);
+
+			/* Replace existing posting list with post-split version */
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+			/* insert "final" new item (not orignewitem from WAL stream) */
+			Assert(IndexTupleSize(newitem) == datalen);
+			if (PageAddItem(page, (Item) newitem, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add posting split new item");
+		}
 
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
@@ -265,20 +313,38 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 		BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
 		OffsetNumber off;
 		IndexTuple	newitem = NULL,
-					left_hikey = NULL;
+					left_hikey = NULL,
+					nposting = NULL;
 		Size		newitemsz = 0,
 					left_hikeysz = 0;
 		Page		newlpage;
-		OffsetNumber leftoff;
+		OffsetNumber leftoff,
+					replacepostingoff = InvalidOffsetNumber;
 
 		datapos = XLogRecGetBlockData(record, 0, &datalen);
 
-		if (onleft)
+		if (onleft || xlrec->postingoff != 0)
 		{
 			newitem = (IndexTuple) datapos;
 			newitemsz = MAXALIGN(IndexTupleSize(newitem));
 			datapos += newitemsz;
 			datalen -= newitemsz;
+
+			if (xlrec->postingoff != 0)
+			{
+				ItemId		itemid;
+				IndexTuple	oposting;
+
+				/* Posting list must be at offset number before new item's */
+				replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+				/* newitem must be mutable copy for _bt_swap_posting() */
+				newitem = CopyIndexTuple(newitem);
+				itemid = PageGetItemId(lpage, replacepostingoff);
+				oposting = (IndexTuple) PageGetItem(lpage, itemid);
+				nposting = _bt_swap_posting(newitem, oposting,
+											xlrec->postingoff);
+			}
 		}
 
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
@@ -304,8 +370,20 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 			Size		itemsz;
 			IndexTuple	item;
 
+			/* Add replacement posting list when required */
+			if (off == replacepostingoff)
+			{
+				Assert(onleft || xlrec->firstright == xlrec->newitemoff);
+				if (PageAddItem(newlpage, (Item) nposting,
+								MAXALIGN(IndexTupleSize(nposting)), leftoff,
+								false, false) == InvalidOffsetNumber)
+					elog(ERROR, "failed to add new posting list item to left page after split");
+				leftoff = OffsetNumberNext(leftoff);
+				continue;
+			}
+
 			/* add the new item if it was inserted on left page */
-			if (onleft && off == xlrec->newitemoff)
+			else if (onleft && off == xlrec->newitemoff)
 			{
 				if (PageAddItem(newlpage, (Item) newitem, newitemsz, leftoff,
 								false, false) == InvalidOffsetNumber)
@@ -379,6 +457,82 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 	}
 }
 
+static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		buf;
+	xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+
+	if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Initialize a temporary empty page and copy all the items to that in
+		 * item number order.
+		 */
+		Page		page = (Page) BufferGetPage(buf);
+		OffsetNumber offnum;
+		BTDedupState state;
+
+		state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+
+		state->maxitemsize = Min(BTMaxItemSize(page), INDEX_SIZE_MASK);
+		state->checkingunique = false;	/* unused */
+		state->skippedbase = InvalidOffsetNumber;
+		/* Metadata about current pending posting list */
+		state->htids = NULL;
+		state->nhtids = 0;
+		state->nitems = 0;
+		state->alltupsize = 0;
+		/* Metadata about based tuple of current pending posting list */
+		state->base = NULL;
+		state->baseoff = InvalidOffsetNumber;
+		state->basetupsize = 0;
+
+		/* Conservatively size array */
+		state->htids = palloc(state->maxitemsize);
+
+		/*
+		 * Iterate over tuples on the page belonging to the interval to
+		 * deduplicate them into a posting list.
+		 */
+		for (offnum = xlrec->baseoff;
+			 offnum < xlrec->baseoff + xlrec->nitems;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemid = PageGetItemId(page, offnum);
+			IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+			Assert(!ItemIdIsDead(itemid));
+
+			if (offnum == xlrec->baseoff)
+			{
+				/*
+				 * No previous/base tuple for first data item -- use first
+				 * data item as base tuple of first pending posting list
+				 */
+				_bt_dedup_start_pending(state, itup, offnum);
+			}
+			else
+			{
+				/* Heap TID(s) for itup will be saved in state */
+				if (!_bt_dedup_save_htid(state, itup))
+					elog(ERROR, "could not add heap tid to pending posting list");
+			}
+		}
+
+		Assert(state->nitems == xlrec->nitems);
+		/* Handle the last item */
+		_bt_dedup_finish_pending(buf, state, false);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buf);
+	}
+
+	if (BufferIsValid(buf))
+		UnlockReleaseBuffer(buf);
+}
+
 static void
 btree_xlog_vacuum(XLogReaderState *record)
 {
@@ -395,7 +549,38 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		page = (Page) BufferGetPage(buffer);
 
-		PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
+		/*
+		 * Must update posting list tuples before deleting whole items, since
+		 * offset numbers are based on original page contents
+		 */
+		if (xlrec->nupdated > 0)
+		{
+			OffsetNumber *updatedoffsets;
+			IndexTuple	updated;
+			Size		itemsz;
+
+			updatedoffsets = (OffsetNumber *)
+				(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+			updated = (IndexTuple) ((char *) updatedoffsets +
+									xlrec->nupdated * sizeof(OffsetNumber));
+
+			/* Handle posting tuples */
+			for (int i = 0; i < xlrec->nupdated; i++)
+			{
+				PageIndexTupleDelete(page, updatedoffsets[i]);
+
+				itemsz = MAXALIGN(IndexTupleSize(updated));
+
+				if (PageAddItem(page, (Item) updated, itemsz, updatedoffsets[i],
+								false, false) == InvalidOffsetNumber)
+					elog(PANIC, "failed to add updated posting list item");
+
+				updated = (IndexTuple) ((char *) updated + itemsz);
+			}
+		}
+
+		if (xlrec->ndeleted)
+			PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 
 		/*
 		 * Mark the page as not containing any LP_DEAD items --- see comments
@@ -729,17 +914,22 @@ void
 btree_redo(XLogReaderState *record)
 {
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	MemoryContext oldCtx;
 
+	oldCtx = MemoryContextSwitchTo(opCtx);
 	switch (info)
 	{
 		case XLOG_BTREE_INSERT_LEAF:
-			btree_xlog_insert(true, false, record);
+			btree_xlog_insert(true, false, false, record);
 			break;
 		case XLOG_BTREE_INSERT_UPPER:
-			btree_xlog_insert(false, false, record);
+			btree_xlog_insert(false, false, false, record);
 			break;
 		case XLOG_BTREE_INSERT_META:
-			btree_xlog_insert(false, true, record);
+			btree_xlog_insert(false, true, false, record);
+			break;
+		case XLOG_BTREE_INSERT_POST:
+			btree_xlog_insert(true, false, true, record);
 			break;
 		case XLOG_BTREE_SPLIT_L:
 			btree_xlog_split(true, record);
@@ -747,6 +937,9 @@ btree_redo(XLogReaderState *record)
 		case XLOG_BTREE_SPLIT_R:
 			btree_xlog_split(false, record);
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			btree_xlog_dedup(record);
+			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
 			break;
@@ -772,6 +965,23 @@ btree_redo(XLogReaderState *record)
 		default:
 			elog(PANIC, "btree_redo: unknown op code %u", info);
 	}
+	MemoryContextSwitchTo(oldCtx);
+	MemoryContextReset(opCtx);
+}
+
+void
+btree_xlog_startup(void)
+{
+	opCtx = AllocSetContextCreate(CurrentMemoryContext,
+								  "Btree recovery temporary context",
+								  ALLOCSET_DEFAULT_SIZES);
+}
+
+void
+btree_xlog_cleanup(void)
+{
+	MemoryContextDelete(opCtx);
+	opCtx = NULL;
 }
 
 /*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 497f8dc77e..23e951aa9e 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -27,6 +27,7 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 		case XLOG_BTREE_INSERT_LEAF:
 		case XLOG_BTREE_INSERT_UPPER:
 		case XLOG_BTREE_INSERT_META:
+		case XLOG_BTREE_INSERT_POST:
 			{
 				xl_btree_insert *xlrec = (xl_btree_insert *) rec;
 
@@ -38,15 +39,27 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
-				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d",
-								 xlrec->level, xlrec->firstright, xlrec->newitemoff);
+				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, postingoff %d",
+								 xlrec->level,
+								 xlrec->firstright,
+								 xlrec->newitemoff,
+								 xlrec->postingoff);
+				break;
+			}
+		case XLOG_BTREE_DEDUP_PAGE:
+			{
+				xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+				appendStringInfo(buf, "baseoff %u; nitems %u",
+								 xlrec->baseoff, xlrec->nitems);
 				break;
 			}
 		case XLOG_BTREE_VACUUM:
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "ndeleted %u", xlrec->ndeleted);
+				appendStringInfo(buf, "ndeleted %u; nupdated %u",
+								 xlrec->ndeleted, xlrec->nupdated);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
@@ -130,6 +143,12 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			id = "DEDUPLICATE";
+			break;
+		case XLOG_BTREE_INSERT_POST:
+			id = "INSERT_POST";
+			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 8d951ce404..9560df7a7c 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -28,6 +28,7 @@
 
 #include "access/commit_ts.h"
 #include "access/gin.h"
+#include "access/nbtree.h"
 #include "access/rmgr.h"
 #include "access/tableam.h"
 #include "access/transam.h"
@@ -1091,6 +1092,15 @@ static struct config_bool ConfigureNamesBool[] =
 		false,
 		check_bonjour, NULL, NULL
 	},
+	{
+		{"btree_deduplication", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Enables B-tree index deduplication optimization."),
+			NULL
+		},
+		&btree_deduplication,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"track_commit_timestamp", PGC_POSTMASTER, REPLICATION,
 			gettext_noop("Collects transaction commit time."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 087190ce63..739676b9d0 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -651,6 +651,7 @@
 #vacuum_cleanup_index_scale_factor = 0.1	# fraction of total number of tuples
 						# before index cleanup, 0 always performs
 						# index cleanup
+#btree_deduplication = on
 #bytea_output = 'hex'			# hex, escape
 #xmlbinary = 'base64'
 #xmloption = 'content'
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index df26826993..7e55c0ff90 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1677,14 +1677,14 @@ psql_completion(const char *text, int start, int end)
 	/* ALTER INDEX <foo> SET|RESET ( */
 	else if (Matches("ALTER", "INDEX", MatchAny, "RESET", "("))
 		COMPLETE_WITH("fillfactor",
-					  "vacuum_cleanup_index_scale_factor",	/* BTREE */
+					  "vacuum_cleanup_index_scale_factor", "deduplication",	/* BTREE */
 					  "fastupdate", "gin_pending_list_limit",	/* GIN */
 					  "buffering",	/* GiST */
 					  "pages_per_range", "autosummarize"	/* BRIN */
 			);
 	else if (Matches("ALTER", "INDEX", MatchAny, "SET", "("))
 		COMPLETE_WITH("fillfactor =",
-					  "vacuum_cleanup_index_scale_factor =",	/* BTREE */
+					  "vacuum_cleanup_index_scale_factor =", "deduplication =",	/* BTREE */
 					  "fastupdate =", "gin_pending_list_limit =",	/* GIN */
 					  "buffering =",	/* GiST */
 					  "pages_per_range =", "autosummarize ="	/* BRIN */
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 3542545de5..43ae7de199 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -145,6 +145,7 @@ static void bt_tuple_present_callback(Relation index, ItemPointer tid,
 									  bool tupleIsAlive, void *checkstate);
 static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 									 IndexTuple itup);
+static inline IndexTuple bt_posting_logical_tuple(IndexTuple itup, int n);
 static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 											   OffsetNumber offset);
@@ -278,7 +279,8 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
 
 	if (btree_index_mainfork_expected(indrel))
 	{
-		bool	heapkeyspace;
+		bool		heapkeyspace,
+					safededup;
 
 		RelationOpenSmgr(indrel);
 		if (!smgrexists(indrel->rd_smgr, MAIN_FORKNUM))
@@ -288,7 +290,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
 							RelationGetRelationName(indrel))));
 
 		/* Check index, possibly against table it is an index on */
-		heapkeyspace = _bt_heapkeyspace(indrel);
+		_bt_metaversion(indrel, &heapkeyspace, &safededup);
 		bt_check_every_level(indrel, heaprel, heapkeyspace, parentcheck,
 							 heapallindexed, rootdescend);
 	}
@@ -419,12 +421,13 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		/*
 		 * Size Bloom filter based on estimated number of tuples in index,
 		 * while conservatively assuming that each block must contain at least
-		 * MaxIndexTuplesPerPage / 5 non-pivot tuples.  (Non-leaf pages cannot
-		 * contain non-pivot tuples.  That's okay because they generally make
-		 * up no more than about 1% of all pages in the index.)
+		 * MaxBTreeIndexTuplesPerPage / 3 "logical" tuples.  heapallindexed
+		 * verification fingerprints posting list heap TIDs as plain non-pivot
+		 * tuples, complete with index keys.  This allows its heap scan to
+		 * behave as if posting lists do not exist.
 		 */
 		total_pages = RelationGetNumberOfBlocks(rel);
-		total_elems = Max(total_pages * (MaxIndexTuplesPerPage / 5),
+		total_elems = Max(total_pages * (MaxBTreeIndexTuplesPerPage / 3),
 						  (int64) state->rel->rd_rel->reltuples);
 		/* Random seed relies on backend srandom() call to avoid repetition */
 		seed = random();
@@ -924,6 +927,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer	scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -994,29 +998,72 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
+			ItemPointer tid = BTreeTupleGetHeapTID(itup);
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
-			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+			htid = psprintf("(%u,%u)", ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("could not find tuple using search from root page in index \"%s\"",
 							RelationGetRelationName(state->rel)),
-					 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+					 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
 										itid, htid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is actually a posting list, make sure posting list TIDs
+		 * are in order.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer		current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+				current = BTreeTupleGetPostingN(itup, i);
+
+				if (ItemPointerCompare(current, &last) <= 0)
+				{
+					char	   *itid,
+							   *htid;
+
+					itid = psprintf("(%u,%u)", state->targetblock, offset);
+					htid = psprintf("(%u,%u)",
+									ItemPointerGetBlockNumberNoCheck(current),
+									ItemPointerGetOffsetNumberNoCheck(current));
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+												itid, htid,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1074,12 +1121,32 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				/* Fingerprint all elements as distinct "logical" tuples */
+				for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					IndexTuple	logtuple;
+
+					logtuple = bt_posting_logical_tuple(itup, i);
+					norm = bt_normalize_tuple(state, logtuple);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != logtuple)
+						pfree(norm);
+					pfree(logtuple);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
@@ -1087,7 +1154,8 @@ bt_target_page_check(BtreeCheckState *state)
 		 *
 		 * If there is a high key (if this is not the rightmost page on its
 		 * entire level), check that high key actually is upper bound on all
-		 * page items.
+		 * page items.  If this is a posting list tuple, we'll need to set
+		 * scantid to be highest TID in posting list.
 		 *
 		 * We prefer to check all items against high key rather than checking
 		 * just the last and trusting that the operator class obeys the
@@ -1127,6 +1195,9 @@ bt_target_page_check(BtreeCheckState *state)
 		 * tuple. (See also: "Notes About Data Representation" in the nbtree
 		 * README.)
 		 */
+		scantid = skey->scantid;
+		if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+			skey->scantid = BTreeTupleGetMaxHeapTID(itup);
 		if (!P_RIGHTMOST(topaque) &&
 			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
 			  invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1221,7 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
+		skey->scantid = scantid;
 
 		/*
 		 * * Item order check *
@@ -1160,15 +1232,17 @@ bt_target_page_check(BtreeCheckState *state)
 		if (OffsetNumberNext(offset) <= max &&
 			!invariant_l_offset(state, skey, OffsetNumberNext(offset)))
 		{
+			ItemPointer tid;
 			char	   *itid,
 					   *htid,
 					   *nitid,
 					   *nhtid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetHeapTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 			nitid = psprintf("(%u,%u)", state->targetblock,
 							 OffsetNumberNext(offset));
 
@@ -1177,9 +1251,11 @@ bt_target_page_check(BtreeCheckState *state)
 										  state->target,
 										  OffsetNumberNext(offset));
 			itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+			tid = BTreeTupleGetHeapTID(itup);
 			nhtid = psprintf("(%u,%u)",
-							 ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							 ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							 ItemPointerGetBlockNumberNoCheck(tid),
+							 ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1265,10 @@ bt_target_page_check(BtreeCheckState *state)
 										"higher index tid=%s (points to %s tid=%s) "
 										"page lsn=%X/%X.",
 										itid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										htid,
 										nitid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										nhtid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
@@ -1953,10 +2029,10 @@ bt_tuple_present_callback(Relation index, ItemPointer tid, Datum *values,
  * verification.  In particular, it won't try to normalize opclass-equal
  * datums with potentially distinct representations (e.g., btree/numeric_ops
  * index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though.  For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation.  Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
  */
 static IndexTuple
 bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -1969,6 +2045,9 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	IndexTuple	reformed;
 	int			i;
 
+	/* Caller should only pass "logical" non-pivot tuples here */
+	Assert(!BTreeTupleIsPosting(itup) && !BTreeTupleIsPivot(itup));
+
 	/* Easy case: It's immediately clear that tuple has no varlena datums */
 	if (!IndexTupleHasVarwidths(itup))
 		return itup;
@@ -2031,6 +2110,30 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	return reformed;
 }
 
+/*
+ * Produce palloc()'d "logical" tuple for nth posting list entry.
+ *
+ * In general, deduplication is not supposed to change the logical contents of
+ * an index.  Multiple logical index tuples are merged together into one
+ * physical posting list index tuple when convenient.
+ *
+ * heapallindexed verification must normalize-away this variation in
+ * representation by converting posting list tuples into two or more "logical"
+ * tuples.  Each logical tuple must be fingerprinted separately -- there must
+ * be one logical tuple for each corresponding Bloom filter probe during the
+ * heap scan.
+ *
+ * Note: Caller needs to call bt_normalize_tuple() with returned tuple.
+ */
+static inline IndexTuple
+bt_posting_logical_tuple(IndexTuple itup, int n)
+{
+	Assert(BTreeTupleIsPosting(itup));
+
+	/* Returns non-posting-list tuple */
+	return _bt_form_posting(itup, BTreeTupleGetPostingN(itup, n), 1);
+}
+
 /*
  * Search for itup in index, starting from fast root page.  itup must be a
  * non-pivot tuple.  This is only supported with heapkeyspace indexes, since
@@ -2087,6 +2190,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		insertstate.itup = itup;
 		insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 		insertstate.itup_key = key;
+		insertstate.postingoff = 0;
 		insertstate.bounds_valid = false;
 		insertstate.buf = lbuf;
 
@@ -2094,7 +2198,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		offnum = _bt_binsrch_insert(state->rel, &insertstate);
 		/* Compare first >= matching item on leaf page, if any */
 		page = BufferGetPage(lbuf);
+		/* Should match on first heap TID when tuple has a posting list */
 		if (offnum <= PageGetMaxOffsetNumber(page) &&
+			insertstate.postingoff <= 0 &&
 			_bt_compare(state->rel, key, page, offnum) == 0)
 			exists = true;
 		_bt_relbuf(state->rel, lbuf);
@@ -2548,26 +2654,25 @@ PageGetItemIdCareful(BtreeCheckState *state, BlockNumber block, Page page,
 }
 
 /*
- * BTreeTupleGetHeapTID() wrapper that lets caller enforce that a heap TID must
- * be present in cases where that is mandatory.
- *
- * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
- * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
- * It may become more useful in the future, when non-pivot tuples support their
- * own alternative INDEX_ALT_TID_MASK representation.
+ * BTreeTupleGetHeapTID() wrapper that enforces that a heap TID is present in
+ * cases where that is mandatory (i.e. for non-pivot tuples).
  */
 static inline ItemPointer
 BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
 							bool nonpivot)
 {
-	ItemPointer result = BTreeTupleGetHeapTID(itup);
-	BlockNumber targetblock = state->targetblock;
+	Assert(state->heapkeyspace);
 
-	if (result == NULL && nonpivot)
+	/*
+	 * Make sure that tuple type (pivot vs non-pivot) matches caller's
+	 * expectation
+	 */
+	if (BTreeTupleIsPivot(itup) == nonpivot)
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
-						targetblock, RelationGetRelationName(state->rel))));
+						state->targetblock,
+						RelationGetRelationName(state->rel))));
 
-	return result;
+	return BTreeTupleGetHeapTID(itup);
 }
diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
index 5881ea5dd6..059477be1e 100644
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@@ -433,11 +433,122 @@ returns bool
 <sect1 id="btree-implementation">
  <title>Implementation</title>
 
+ <para>
+  Internally, a B-tree index consists of a tree structure with leaf
+  pages.  Each leaf page contains tuples that point to table entries
+  using a heap item pointer. Each tuple's key is considered unique
+  internally, since the item pointer is treated as part of the key.
+ </para>
+ <para>
+  An introduction to the btree index implementation can be found in
+  <filename>src/backend/access/nbtree/README</filename>.
+ </para>
+
+ <sect2 id="btree-deduplication">
+  <title>Deduplication</title>
   <para>
-   An introduction to the btree index implementation can be found in
-   <filename>src/backend/access/nbtree/README</filename>.
+   B-Tree supports <firstterm>deduplication</firstterm>.  Existing
+   leaf page tuples with fully equal keys (equal prior to the heap
+   item pointer) are merged together into a single <quote>posting
+   list</quote> tuple.  The keys appear only once in this
+   representation.  A simple array of heap item pointers follows.
+   Posting lists are formed <quote>lazily</quote>, when a new item is
+   inserted that cannot fit on an existing leaf page.  The immediate
+   goal of the deduplication process is to at least free enough space
+   to fit the new item; otherwise a leaf page split occurs, which
+   allocates a new leaf page.  The <firstterm>key space</firstterm>
+   covered by the original leaf page is shared among the original page,
+   and its new right sibling page.
+  </para>
+  <para>
+   Deduplication can greatly increase index space efficiency with data
+   sets where each distinct key appears at least a few times on
+   average.  It can also reduce the cost of subsequent index scans,
+   especially when many leaf pages must be accessed.  For example, an
+   index on a simple <type>integer</type> column that uses
+   deduplication will have a storage size that is only about 65% of an
+   equivalent unoptimized index when each distinct
+   <type>integer</type> value appears three times.  If each distinct
+   <type>integer</type> value appears six times, the storage overhead
+   can be as low as 50% of baseline.  With hundreds of duplicates per
+   distinct value (or with larger <quote>base</quote> key values) a
+   storage size of about <emphasis>one third</emphasis> of the
+   unoptimized case is expected.  There is often a direct benefit for
+   queries, as well as an indirect benefit due to reduced I/O during
+   routine vacuuming.
+  </para>
+  <para>
+   Cases that don't benefit due to having no duplicate values will
+   incur a small performance penalty with mixed read-write workloads.
+   There is no performance penalty with read-only workloads, since
+   reading from posting lists is at least as efficient as reading the
+   standard index tuple representation.
+  </para>
+ </sect2>
+
+ <sect2 id="btree-deduplication-configure">
+  <title>Configuring Deduplication</title>
+
+  <para>
+   The <xref linkend="guc-btree-deduplication"/> configuration
+   parameter controls deduplication.  By default, deduplication is
+   enabled.  The <literal>deduplication</literal> storage parameter
+   can be used to override the configuration paramater for individual
+   indexes.  See <xref linkend="sql-createindex-storage-parameters"/>
+   from the <command>CREATE INDEX</command> documentation for details.
+  </para>
+ </sect2>
+
+ <sect2 id="btree-deduplication-restrictions">
+  <title>Restrictions</title>
+
+  <para>
+   Deduplication can only be used with indexes that use B-Tree
+   operator classes that were declared <literal>BITWISE</literal>.  In
+   practice almost all datatypes support deduplication, though
+   <type>numeric</type> is a notable exception (the <quote>display
+   scale</quote> feature makes it impossible to enable deduplication
+   without losing useful information about equal <type>numeric</type>
+   datums).  Deduplication is not supported with nondeterministic
+   collations, nor is it supported with <literal>INCLUDE</literal>
+   indexes.
+  </para>
+  <para>
+   Note that a multicolumn index is only considered to have duplicates
+   when there are index entries that repeat entire
+   <emphasis>combinations</emphasis> of values (the values stored in
+   each and every column must be equal).
   </para>
 
+ </sect2>
+
+ <sect2 id="btree-deduplication-unique">
+  <title>Internal use of Deduplication in unique indexes</title>
+
+  <para>
+   Page splits that occur due to inserting multiple physical versions
+   (rather than inserting new logical rows) tend to degrade the
+   structure of indexes, especially in the case of unique indexes.
+   Unique indexes use deduplication <emphasis>internally</emphasis>
+   and <emphasis>selectively</emphasis> to delay (and ideally to
+   prevent) these <quote>unnecessary</quote> page splits.
+   <command>VACUUM</command> or autovacuum will eventually remove dead
+   versions of tuples from every index in any case, but usually cannot
+   reverse page splits (in general, the page must be completely empty
+   before <command>VACUUM</command> can <quote>delete</quote> it).
+  </para>
+  <para>
+   The <xref linkend="guc-btree-deduplication"/> configuration
+   parameter does not affect whether or not deduplication is used
+   within unique indexes.  The internal use of deduplication for
+   unique indexes is subject to all of the same restrictions as
+   deduplication in general.  The <literal>deduplication</literal>
+   storage parameter can be set to <literal>OFF</literal> to disable
+   deduplication in unique indexes, but this is intended only as a
+   debugging option for developers.
+  </para>
+
+ </sect2>
 </sect1>
 
 </chapter>
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 55669b5cad..9f371d3e3a 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -928,10 +928,11 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
      nondeterministic collations give a more <quote>correct</quote> behavior,
      especially when considering the full power of Unicode and its many
      special cases, they also have some drawbacks.  Foremost, their use leads
-     to a performance penalty.  Also, certain operations are not possible with
-     nondeterministic collations, such as pattern matching operations.
-     Therefore, they should be used only in cases where they are specifically
-     wanted.
+     to a performance penalty.  Note, in particular, that B-tree cannot use
+     deduplication with indexes that use a nondeterministic collation.  Also,
+     certain operations are not possible with nondeterministic collations,
+     such as pattern matching operations.  Therefore, they should be used
+     only in cases where they are specifically wanted.
     </para>
    </sect3>
   </sect2>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5d1c90282f..05f442d57a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8021,6 +8021,31 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-btree-deduplication" xreflabel="btree_deduplication">
+      <term><varname>btree_deduplication</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>btree_deduplication</varname></primary>
+       <secondary>configuration parameter</secondary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Controls whether deduplication should be used within B-Tree
+        indexes.  Deduplication is an optimization that reduces the
+        storage size of indexes by storing equal index keys only once.
+        See <xref linkend="btree-deduplication"/> for more
+        information.
+       </para>
+
+       <para>
+        This setting can be overridden for individual B-Tree indexes
+        by changing index storage parameters.  See <xref
+        linkend="sql-createindex-storage-parameters"/> from the
+        <command>CREATE INDEX</command> documentation for details.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-bytea-output" xreflabel="bytea_output">
       <term><varname>bytea_output</varname> (<type>enum</type>)
       <indexterm>
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 629a31ef79..e6cdba4c29 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -166,6 +166,8 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
         maximum size allowed for the index type, data insertion will fail.
         In any case, non-key columns duplicate data from the index's table
         and bloat the size of the index, thus potentially slowing searches.
+        Moreover, B-tree deduplication is never used with indexes that
+        have a non-key column.
        </para>
 
        <para>
@@ -388,10 +390,39 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    </variablelist>
 
    <para>
-    B-tree indexes additionally accept this parameter:
+    B-tree indexes also accept these parameters:
    </para>
 
    <variablelist>
+   <varlistentry id="index-reloption-deduplication" xreflabel="deduplication">
+    <term><literal>deduplication</literal>
+     <indexterm>
+      <primary><varname>deduplication</varname></primary>
+      <secondary>storage parameter</secondary>
+     </indexterm>
+    </term>
+    <listitem>
+    <para>
+      Per-index value for <xref linkend="guc-btree-deduplication"/>.
+      Controls usage of the B-tree deduplication technique described
+      in <xref linkend="btree-deduplication"/>.  Set to
+      <literal>ON</literal> or <literal>OFF</literal> to override GUC.
+      (Alternative spellings of <literal>ON</literal> and
+      <literal>OFF</literal> are allowed as described in <xref
+      linkend="config-setting"/>.)
+    </para>
+
+    <note>
+     <para>
+      Turning <literal>deduplication</literal> off via <command>ALTER
+      INDEX</command> prevents future insertions from triggering
+      deduplication, but does not in itself make existing posting list
+      tuples use the standard tuple representation.
+     </para>
+    </note>
+    </listitem>
+   </varlistentry>
+
    <varlistentry id="index-reloption-vacuum-cleanup-index-scale-factor" xreflabel="vacuum_cleanup_index_scale_factor">
     <term><literal>vacuum_cleanup_index_scale_factor</literal>
      <indexterm>
@@ -446,9 +477,7 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
      This setting controls usage of the fast update technique described in
      <xref linkend="gin-fast-update"/>.  It is a Boolean parameter:
      <literal>ON</literal> enables fast update, <literal>OFF</literal> disables it.
-     (Alternative spellings of <literal>ON</literal> and <literal>OFF</literal> are
-     allowed as described in <xref linkend="config-setting"/>.)  The
-     default is <literal>ON</literal>.
+     The default is <literal>ON</literal>.
     </para>
 
     <note>
diff --git a/src/test/regress/expected/btree_index.out b/src/test/regress/expected/btree_index.out
index f567117a46..e32c8fa826 100644
--- a/src/test/regress/expected/btree_index.out
+++ b/src/test/regress/expected/btree_index.out
@@ -266,6 +266,22 @@ select * from btree_bpchar where f1::bpchar like 'foo%';
  fool
 (2 rows)
 
+--
+-- Perform unique checking, with and without the use of deduplication
+--
+CREATE TABLE dedup_unique_test_table (a int) WITH (autovacuum_enabled=false);
+CREATE UNIQUE INDEX dedup_unique ON dedup_unique_test_table (a) WITH (deduplication=on);
+CREATE UNIQUE INDEX plain_unique ON dedup_unique_test_table (a) WITH (deduplication=off);
+-- Generate enough garbage tuples in index to ensure that even the unique index
+-- with deduplication enabled has to check multiple leaf pages during unique
+-- checking (at least with a BLCKSZ of 8192 or less)
+DO $$
+BEGIN
+    FOR r IN 1..1350 LOOP
+        DELETE FROM dedup_unique_test_table;
+        INSERT INTO dedup_unique_test_table SELECT 1;
+    END LOOP;
+END$$;
 --
 -- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
diff --git a/src/test/regress/sql/btree_index.sql b/src/test/regress/sql/btree_index.sql
index 558dcae0ec..627ba80bc1 100644
--- a/src/test/regress/sql/btree_index.sql
+++ b/src/test/regress/sql/btree_index.sql
@@ -103,6 +103,23 @@ explain (costs off)
 select * from btree_bpchar where f1::bpchar like 'foo%';
 select * from btree_bpchar where f1::bpchar like 'foo%';
 
+--
+-- Perform unique checking, with and without the use of deduplication
+--
+CREATE TABLE dedup_unique_test_table (a int) WITH (autovacuum_enabled=false);
+CREATE UNIQUE INDEX dedup_unique ON dedup_unique_test_table (a) WITH (deduplication=on);
+CREATE UNIQUE INDEX plain_unique ON dedup_unique_test_table (a) WITH (deduplication=off);
+-- Generate enough garbage tuples in index to ensure that even the unique index
+-- with deduplication enabled has to check multiple leaf pages during unique
+-- checking (at least with a BLCKSZ of 8192 or less)
+DO $$
+BEGIN
+    FOR r IN 1..1350 LOOP
+        DELETE FROM dedup_unique_test_table;
+        INSERT INTO dedup_unique_test_table SELECT 1;
+    END LOOP;
+END$$;
+
 --
 -- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
-- 
2.17.1

#119

Bruce Momjian

bruce@momjian.us

about 6 years ago

In reply to: Peter Geoghegan (#118)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Thu, Dec 12, 2019 at 06:21:20PM -0800, Peter Geoghegan wrote:

On Tue, Dec 3, 2019 at 12:13 PM Peter Geoghegan <pg@bowt.ie> wrote:

The new criteria/heuristic for unique indexes is very simple: If a
unique index has an existing item that is a duplicate on the incoming
item at the point that we might have to split the page, then apply
deduplication. Otherwise (when the incoming item has no duplicates),
don't apply deduplication at all -- just accept that we'll have to
split the page. We already cache the bounds of our initial binary
search in insert state, so we can reuse that information within
_bt_findinsertloc() when considering deduplication in unique indexes.

Attached is v26, which adds this new criteria/heuristic for unique
indexes. We now seem to consistently get good results with unique
indexes.

In the past we tried to increase the number of cases where HOT updates
can happen but were unable to. Would this help with non-HOT updates?
Do we have any benchmarks where non-HOT updates cause slowdowns that we
can test on this?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +

#120

Peter Geoghegan

pg@bowt.ie

about 6 years ago

In reply to: Bruce Momjian (#119)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Tue, Dec 17, 2019 at 1:58 PM Bruce Momjian <bruce@momjian.us> wrote:

Attached is v26, which adds this new criteria/heuristic for unique
indexes. We now seem to consistently get good results with unique
indexes.

In the past we tried to increase the number of cases where HOT updates
can happen but were unable to.

Right -- the WARM project.

The Z-heap project won't change the fundamentals here. It isn't going
to solve the fundamental problem of requiring that the index AM create
a new set of physical index tuples in at least *some* cases. A heap
tuple cannot be updated in-place when even one indexed column changes
-- you're not much better off than you were with the classic heapam,
because indexes get bloated in a way that wouldn't happen with Oracle.
(Even still, Z-heap isn't sensitive to when and how opportunistic heap
pruning takes place, and doesn't have the same issue with having to
fit the heap tuple on the same page or create a new HOT chain. This
will make things much better with some workloads.)

Would this help with non-HOT updates?

Definitely, yes. The strategy used with unique indexes is specifically
designed to avoid "unnecessary" page splits altogether -- it only
makes sense because of the possibility of non-HOT UPDATEs with
mostly-unchanged index tuples. Thinking about what's going on here
from first principles is what drove the unique index deduplication
design:

With many real world unique indexes, the true reason behind most or
all B-Tree page splits is "version churn". I view these page splits as
a permanent solution to a temporary problem -- we *permanently*
degrade the index structure in order to deal with a *temporary* burst
in versions that need to be stored. That's really bad.

Consider a classic pgbench workload, for example. The smaller indexes
on the smaller tables (pgbench_tellers_pkey and pgbench_branches_pkey)
have leaf pages that will almost certainly be split a few minutes in,
even though the UPDATEs on the underlying tables never modify indexed
columns (i.e. even though HOT is as effective as it possibly could be
with this unthrottled workload). Actually, even the resulting split
pages will themselves usually be split again, and maybe even once more
after that. We started out with leaf pages that stored just under 370
items on each leaf page (with fillfactor 90 + 8KiB BLCKSZ), and end up
with leaf pages that often have less than 50 items (sometimes as few
as 10). Even though the "logical contents" of the index are *totally*
unchanged. This could almost be considered pathological by users.

Of course, it's easy to imagine a case where it matters a lot more
than classic pgbench (pgbench_tellers_pkey and pgbench_branches_pkey
are always small, so it's easy to see the effect, which is why I went
with that example). For example, you could have a long running
transaction, which would probably have the effect of significantly
bloating even the large pgbench index (pgbench_accounts_pkey) --
typically you won't see that with classic pgbench until you do
something to frustrate VACUUM (and opportunistic cleanup). (I have
mostly been using non-HOT UPDATEs to test the patch, though.)

In theory we could go even further than this by having some kind of
version store for indexes, and using this to stash old versions rather
than performing a page split. Then you wouldn't have any page splits
in the pgbench indexes; VACUUM would eventually be able to return the
index to its "pristine" state. The trade-off with that design would be
that index scans would have to access two index pages for a while (a
leaf page, plus its subsidiary old version page). Maybe we can
actually go that far in the future -- there are various database
research papers that describe designs like this (the designs described
within these papers do things like determine whether a "version split"
or a "value split" should be performed).

What we have now is an incremental improvement, that doesn't have any
apparent downside with unique indexes -- the way that deduplication is
triggered for unique indexes is almost certain to be a win. When
deduplication isn't triggered, everything works in the same way as
before -- it's "zero overhead" for unique indexes that don't benefit.
The design augments existing garbage collection mechanisms,
particularly the way in which we set LP_DEAD bits within
_bt_check_unique().

Do we have any benchmarks where non-HOT updates cause slowdowns that we
can test on this?

AFAICT, any workload that has lots of non-HOT updates will benefit at
least a little bit -- indexes will finish up smaller, there will be
higher throughput, and there will be a reduction in latency for
queries.

With the right distribution of values, it's not that hard to mostly
control bloat in an index that doubles in size without the
optimization, which is much more significant. I have already reported
on this [1]/messages/by-id/CAH2-WzkXHhjhmUYfVvu6afbojU97MST8RUT1U=hLd2W-GC5FNA@mail.gmail.com -- Peter Geoghegan. I've also been able to observe increases of 15%-20% in
TPS with similar workloads (with commensurate reductions in query
latency) more recently. This was with a simple gaussian distribution
for pgbench_accounts.aid, and a non-unique index with deduplication
enabled on pgbench_accounts.abalance. (The patch helps control the
size of both indexes, especially the extra non-unique one.)

[1]: /messages/by-id/CAH2-WzkXHhjhmUYfVvu6afbojU97MST8RUT1U=hLd2W-GC5FNA@mail.gmail.com -- Peter Geoghegan
--
Peter Geoghegan

#121

Bruce Momjian

bruce@momjian.us

about 6 years ago

In reply to: Peter Geoghegan (#120)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Tue, Dec 17, 2019 at 03:30:33PM -0800, Peter Geoghegan wrote:

With many real world unique indexes, the true reason behind most or
all B-Tree page splits is "version churn". I view these page splits as
a permanent solution to a temporary problem -- we *permanently*
degrade the index structure in order to deal with a *temporary* burst
in versions that need to be stored. That's really bad.

Yes, I was thinking why do we need to optimize duplicates in a unique
index but then remembered is a version problem.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +

#122

Peter Geoghegan

pg@bowt.ie

about 6 years ago

In reply to: Bruce Momjian (#121)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Tue, Dec 17, 2019 at 5:18 PM Bruce Momjian <bruce@momjian.us> wrote:

On Tue, Dec 17, 2019 at 03:30:33PM -0800, Peter Geoghegan wrote:

With many real world unique indexes, the true reason behind most or
all B-Tree page splits is "version churn". I view these page splits as
a permanent solution to a temporary problem -- we *permanently*
degrade the index structure in order to deal with a *temporary* burst
in versions that need to be stored. That's really bad.

Yes, I was thinking why do we need to optimize duplicates in a unique
index but then remembered is a version problem.

The whole idea of deduplication in unique indexes is hard to explain.
It just sounds odd. Also, it works using the same infrastructure as
regular deduplication, while having rather different goals.
Fortunately, it seems like we don't really have to tell users about it
in order for them to see a benefit -- there will be no choice for them
to make there (they just get it).

The regular deduplication stuff isn't confusing at all, though. It has
some noticeable though small downside, so it will be documented and
configurable. (I'm optimistic that it can be enabled by default,
because even with high cardinality non-unique indexes the downside is
rather small -- we waste some CPU cycles just before a page is split.)

--
Peter Geoghegan

#123

Peter Geoghegan

pg@bowt.ie

about 6 years ago

In reply to: Peter Geoghegan (#118)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Thu, Dec 12, 2019 at 6:21 PM Peter Geoghegan <pg@bowt.ie> wrote:

Still waiting for some review of the first patch, to get it out of the
way. Anastasia?

I plan to commit this first patch [1]/messages/by-id/CAH2-WzkWLRDzCaxsGvA_pZoaix_2AC9S6=-D6JMLkQYhqrJuEg@mail.gmail.com -- Peter Geoghegan in the next day or two, barring
any objections.

It's clear that the nbtree "pin scan" VACUUM code is totally
unnecessary -- it really should have been fully removed by commit
3e4b7d87 back in 2016.

[1]: /messages/by-id/CAH2-WzkWLRDzCaxsGvA_pZoaix_2AC9S6=-D6JMLkQYhqrJuEg@mail.gmail.com -- Peter Geoghegan
--
Peter Geoghegan

#124

Peter Geoghegan

pg@bowt.ie

about 6 years ago

In reply to: Peter Geoghegan (#123)

3 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Tue, Dec 17, 2019 at 7:27 PM Peter Geoghegan <pg@bowt.ie> wrote:

I plan to commit this first patch [1] in the next day or two, barring
any objections.

I pushed this earlier today -- it became commit 9f83468b. Attached is
v27, which fixes the bitrot against the master branch.

Other changes:

* Updated _bt_form_posting() to consistently MAXALIGN(). No behavioral
changes here. The defensive SHORTALIGN()s we had in v26 should have
been defensive MAXALIGN()s -- this has been fixed. Also, we now also
explain our precise assumptions around alignment.

* Cleared up the situation around _bt_dedup_one_page()'s
responsibilities as far as LP_DEAD items go.

* Fixed bug in 32 KiB BLCKSZ builds. We now apply an additional
INDEX_SIZE_MASK cap on posting list tuple size.

--
Peter Geoghegan

Attachments:

v27-0002-Teach-pageinspect-about-nbtree-posting-lists.patchapplication/octet-stream; name=v27-0002-Teach-pageinspect-about-nbtree-posting-lists.patchDownload

From ee67a907f9a609d3106f9de82b4d7d6f056e582d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v27 2/3] Teach pageinspect about nbtree posting lists.

Add a column for posting list TIDs to bt_page_items().  Also add a
column that displays a single heap TID value for each tuple, regardless
of whether or not "ctid" is used for heap TID.  In the case of posting
list tuples, the value is the lowest heap TID in the posting list.
Arguably I should have done this when commit dd299df8 went in, since
that added a pivot tuple representation that could have a heap TID but
didn't use ctid for that purpose.

Also add a boolean column that displays the LP_DEAD bit value for each
non-pivot tuple.

No version bump for the pageinspect extension, since there hasn't been a
stable release since the last version bump (see commit 58b4cb30).
---
 contrib/pageinspect/btreefuncs.c              | 111 +++++++++++++++---
 contrib/pageinspect/expected/btree.out        |   6 +
 contrib/pageinspect/pageinspect--1.7--1.8.sql |  36 ++++++
 doc/src/sgml/pageinspect.sgml                 |  80 +++++++------
 4 files changed, 181 insertions(+), 52 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 78cdc69ec7..17f7ad186e 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -31,9 +31,11 @@
 #include "access/relation.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_am.h"
+#include "catalog/pg_type.h"
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "pageinspect.h"
+#include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
 #include "utils/varlena.h"
@@ -45,6 +47,8 @@ PG_FUNCTION_INFO_V1(bt_page_stats);
 
 #define IS_INDEX(r) ((r)->rd_rel->relkind == RELKIND_INDEX)
 #define IS_BTREE(r) ((r)->rd_rel->relam == BTREE_AM_OID)
+#define DatumGetItemPointer(X)	 ((ItemPointer) DatumGetPointer(X))
+#define ItemPointerGetDatum(X)	 PointerGetDatum(X)
 
 /* note: BlockNumber is unsigned, hence can't be negative */
 #define CHECK_RELATION_BLOCK_RANGE(rel, blkno) { \
@@ -243,6 +247,9 @@ struct user_args
 {
 	Page		page;
 	OffsetNumber offset;
+	bool		leafpage;
+	bool		rightmost;
+	TupleDesc	tupd;
 };
 
 /*-------------------------------------------------------
@@ -252,17 +259,25 @@ struct user_args
  * ------------------------------------------------------
  */
 static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, struct user_args *uargs)
 {
-	char	   *values[6];
+	Page		page = uargs->page;
+	OffsetNumber offset = uargs->offset;
+	bool		leafpage = uargs->leafpage;
+	bool		rightmost = uargs->rightmost;
+	bool		pivotoffset;
+	Datum		values[9];
+	bool		nulls[9];
 	HeapTuple	tuple;
 	ItemId		id;
 	IndexTuple	itup;
 	int			j;
 	int			off;
 	int			dlen;
-	char	   *dump;
+	char	   *dump,
+			   *datacstring;
 	char	   *ptr;
+	ItemPointer htid;
 
 	id = PageGetItemId(page, offset);
 
@@ -272,18 +287,27 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	itup = (IndexTuple) PageGetItem(page, id);
 
 	j = 0;
-	values[j++] = psprintf("%d", offset);
-	values[j++] = psprintf("(%u,%u)",
-						   ItemPointerGetBlockNumberNoCheck(&itup->t_tid),
-						   ItemPointerGetOffsetNumberNoCheck(&itup->t_tid));
-	values[j++] = psprintf("%d", (int) IndexTupleSize(itup));
-	values[j++] = psprintf("%c", IndexTupleHasNulls(itup) ? 't' : 'f');
-	values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
+	memset(nulls, 0, sizeof(nulls));
+	values[j++] = DatumGetInt16(offset);
+	values[j++] = ItemPointerGetDatum(&itup->t_tid);
+	values[j++] = Int32GetDatum((int) IndexTupleSize(itup));
+	values[j++] = BoolGetDatum(IndexTupleHasNulls(itup));
+	values[j++] = BoolGetDatum(IndexTupleHasVarwidths(itup));
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
 	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+
+	/*
+	 * Make sure that "data" column does not include posting list or pivot
+	 * tuple representation of heap TID
+	 */
+	if (BTreeTupleIsPosting(itup))
+		dlen -= IndexTupleSize(itup) - BTreeTupleGetPostingOffset(itup);
+	else if (BTreeTupleIsPivot(itup) && BTreeTupleGetHeapTID(itup) != NULL)
+		dlen -= MAXALIGN(sizeof(ItemPointerData));
+
 	dump = palloc0(dlen * 3 + 1);
-	values[j] = dump;
+	datacstring = dump;
 	for (off = 0; off < dlen; off++)
 	{
 		if (off > 0)
@@ -291,8 +315,57 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 		sprintf(dump, "%02x", *(ptr + off) & 0xff);
 		dump += 2;
 	}
+	values[j++] = CStringGetTextDatum(datacstring);
+	pfree(datacstring);
 
-	tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
+	/*
+	 * Avoid indicating that pivot tuple from !heapkeyspace index (which won't
+	 * have v4+ status bit set) is dead or has a heap TID -- that can only
+	 * happen with non-pivot tuples.  (Most backend code can use the
+	 * heapkeyspace field from the metapage to figure out which representation
+	 * to expect, but we have to be a bit creative here.)
+	 */
+	pivotoffset = (!leafpage || (!rightmost && offset == P_HIKEY));
+
+	/* LP_DEAD status bit */
+	if (!pivotoffset)
+		values[j++] = BoolGetDatum(ItemIdIsDead(id));
+	else
+		nulls[j++] = true;
+
+	htid = BTreeTupleGetHeapTID(itup);
+	if (pivotoffset && !BTreeTupleIsPivot(itup))
+		htid = NULL;
+
+	if (htid)
+		values[j++] = ItemPointerGetDatum(htid);
+	else
+		nulls[j++] = true;
+
+	if (BTreeTupleIsPosting(itup))
+	{
+		/* build an array of item pointers */
+		ItemPointer tids;
+		Datum	   *tids_datum;
+		int			nposting;
+
+		tids = BTreeTupleGetPosting(itup);
+		nposting = BTreeTupleGetNPosting(itup);
+		tids_datum = (Datum *) palloc(nposting * sizeof(Datum));
+		for (int i = 0; i < nposting; i++)
+			tids_datum[i] = ItemPointerGetDatum(&tids[i]);
+		values[j++] = PointerGetDatum(construct_array(tids_datum,
+													  nposting,
+													  TIDOID,
+													  sizeof(ItemPointerData),
+													  false, 's'));
+		pfree(tids_datum);
+	}
+	else
+		nulls[j++] = true;
+
+	/* Build and return the result tuple */
+	tuple = heap_form_tuple(uargs->tupd, values, nulls);
 
 	return HeapTupleGetDatum(tuple);
 }
@@ -378,12 +451,13 @@ bt_page_items(PG_FUNCTION_ARGS)
 			elog(NOTICE, "page is deleted");
 
 		fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+		uargs->leafpage = P_ISLEAF(opaque);
+		uargs->rightmost = P_RIGHTMOST(opaque);
 
 		/* Build a tuple descriptor for our result type */
 		if (get_call_result_type(fcinfo, NULL, &tupleDesc) != TYPEFUNC_COMPOSITE)
 			elog(ERROR, "return type must be a row type");
-
-		fctx->attinmeta = TupleDescGetAttInMetadata(tupleDesc);
+		uargs->tupd = tupleDesc;
 
 		fctx->user_fctx = uargs;
 
@@ -395,7 +469,7 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
@@ -463,12 +537,13 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 			elog(NOTICE, "page is deleted");
 
 		fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+		uargs->leafpage = P_ISLEAF(opaque);
+		uargs->rightmost = P_RIGHTMOST(opaque);
 
 		/* Build a tuple descriptor for our result type */
 		if (get_call_result_type(fcinfo, NULL, &tupleDesc) != TYPEFUNC_COMPOSITE)
 			elog(ERROR, "return type must be a row type");
-
-		fctx->attinmeta = TupleDescGetAttInMetadata(tupleDesc);
+		uargs->tupd = tupleDesc;
 
 		fctx->user_fctx = uargs;
 
@@ -480,7 +555,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..1d45cd5c1e 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -41,6 +41,9 @@ itemlen    | 16
 nulls      | f
 vars       | f
 data       | 01 00 00 00 00 00 00 01
+dead       | f
+htid       | (0,1)
+tids       | 
 
 SELECT * FROM bt_page_items('test1_a_idx', 2);
 ERROR:  block number out of range
@@ -54,6 +57,9 @@ itemlen    | 16
 nulls      | f
 vars       | f
 data       | 01 00 00 00 00 00 00 01
+dead       | f
+htid       | (0,1)
+tids       | 
 
 SELECT * FROM bt_page_items(get_raw_page('test1_a_idx', 2));
 ERROR:  block number 2 is out of range for relation "test1_a_idx"
diff --git a/contrib/pageinspect/pageinspect--1.7--1.8.sql b/contrib/pageinspect/pageinspect--1.7--1.8.sql
index 2a7c4b3516..70f1ab0467 100644
--- a/contrib/pageinspect/pageinspect--1.7--1.8.sql
+++ b/contrib/pageinspect/pageinspect--1.7--1.8.sql
@@ -14,3 +14,39 @@ CREATE FUNCTION heap_tuple_infomask_flags(
 RETURNS record
 AS 'MODULE_PATHNAME', 'heap_tuple_infomask_flags'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items(text, int4)
+--
+DROP FUNCTION bt_page_items(text, int4);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT dead boolean,
+    OUT htid tid,
+    OUT tids tid[])
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items(bytea)
+--
+DROP FUNCTION bt_page_items(bytea);
+CREATE FUNCTION bt_page_items(IN page bytea,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT dead boolean,
+    OUT htid tid,
+    OUT tids tid[])
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items_bytea'
+LANGUAGE C STRICT PARALLEL SAFE;
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 7e2e1487d7..1763e9c6f0 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -329,11 +329,11 @@ test=# SELECT * FROM bt_page_stats('pg_cast_oid_index', 1);
 -[ RECORD 1 ]-+-----
 blkno         | 1
 type          | l
-live_items    | 256
+live_items    | 224
 dead_items    | 0
-avg_item_size | 12
+avg_item_size | 16
 page_size     | 8192
-free_size     | 4056
+free_size     | 3668
 btpo_prev     | 0
 btpo_next     | 0
 btpo          | 0
@@ -356,33 +356,45 @@ btpo_flags    | 3
       <function>bt_page_items</function> returns detailed information about
       all of the items on a B-tree index page.  For example:
 <screen>
-test=# SELECT * FROM bt_page_items('pg_cast_oid_index', 1);
- itemoffset |  ctid   | itemlen | nulls | vars |    data
-------------+---------+---------+-------+------+-------------
-          1 | (0,1)   |      12 | f     | f    | 23 27 00 00
-          2 | (0,2)   |      12 | f     | f    | 24 27 00 00
-          3 | (0,3)   |      12 | f     | f    | 25 27 00 00
-          4 | (0,4)   |      12 | f     | f    | 26 27 00 00
-          5 | (0,5)   |      12 | f     | f    | 27 27 00 00
-          6 | (0,6)   |      12 | f     | f    | 28 27 00 00
-          7 | (0,7)   |      12 | f     | f    | 29 27 00 00
-          8 | (0,8)   |      12 | f     | f    | 2a 27 00 00
+regression=# SELECT * FROM bt_page_items('tenk2_unique1', 5);
+ itemoffset |   ctid   | itemlen | nulls | vars |          data           | dead |   htid   | tids
+------------+----------+---------+-------+------+-------------------------+------+----------+------
+          1 | (40,1)   |      16 | f     | f    | b8 05 00 00 00 00 00 00 |      |          |
+          2 | (58,11)  |      16 | f     | f    | 4a 04 00 00 00 00 00 00 | f    | (58,11)  |
+          3 | (266,4)  |      16 | f     | f    | 4b 04 00 00 00 00 00 00 | f    | (266,4)  |
+          4 | (279,25) |      16 | f     | f    | 4c 04 00 00 00 00 00 00 | f    | (279,25) |
+          5 | (333,11) |      16 | f     | f    | 4d 04 00 00 00 00 00 00 | f    | (333,11) |
+          6 | (87,24)  |      16 | f     | f    | 4e 04 00 00 00 00 00 00 | f    | (87,24)  |
+          7 | (38,22)  |      16 | f     | f    | 4f 04 00 00 00 00 00 00 | f    | (38,22)  |
+          8 | (272,17) |      16 | f     | f    | 50 04 00 00 00 00 00 00 | f    | (272,17) |
 </screen>
-      In a B-tree leaf page, <structfield>ctid</structfield> points to a heap tuple.
-      In an internal page, the block number part of <structfield>ctid</structfield>
-      points to another page in the index itself, while the offset part
-      (the second number) is ignored and is usually 1.
+      In a B-tree leaf page, <structfield>ctid</structfield> usually
+      points to a heap tuple, and <structfield>dead</structfield> may
+      indicate that the item has its <literal>LP_DEAD</literal> bit
+      set.  In an internal page, the block number part of
+      <structfield>ctid</structfield> points to another page in the
+      index itself, while the offset part (the second number) encodes
+      metadata about the tuple.  Posting list tuples on leaf pages
+      also use <structfield>ctid</structfield> for metadata.
+      <structfield>htid</structfield> always shows a single heap TID
+      for the tuple, regardless of how it is represented (internal
+      page tuples may need to store a heap TID when there are many
+      duplicate tuples on descendent leaf pages).
+      <structfield>tids</structfield> is a list of TIDs that is stored
+      within posting list tuples (tuples created by deduplication).
      </para>
      <para>
       Note that the first item on any non-rightmost page (any page with
       a non-zero value in the <structfield>btpo_next</structfield> field) is the
       page's <quote>high key</quote>, meaning its <structfield>data</structfield>
       serves as an upper bound on all items appearing on the page, while
-      its <structfield>ctid</structfield> field is meaningless.  Also, on non-leaf
-      pages, the first real data item (the first item that is not a high
-      key) is a <quote>minus infinity</quote> item, with no actual value
-      in its <structfield>data</structfield> field.  Such an item does have a valid
-      downlink in its <structfield>ctid</structfield> field, however.
+      its <structfield>ctid</structfield> field does not point to
+      another block.  Also, on non-leaf pages, the first real data item
+      (the first item that is not a high key) is a <quote>minus
+      infinity</quote> item, with no actual value in its
+      <structfield>data</structfield> field.  Such an item does have a
+      valid downlink in its <structfield>ctid</structfield> field,
+      however.
      </para>
     </listitem>
    </varlistentry>
@@ -402,17 +414,17 @@ test=# SELECT * FROM bt_page_items('pg_cast_oid_index', 1);
       with <function>get_raw_page</function> should be passed as argument.  So
       the last example could also be rewritten like this:
 <screen>
-test=# SELECT * FROM bt_page_items(get_raw_page('pg_cast_oid_index', 1));
- itemoffset |  ctid   | itemlen | nulls | vars |    data
-------------+---------+---------+-------+------+-------------
-          1 | (0,1)   |      12 | f     | f    | 23 27 00 00
-          2 | (0,2)   |      12 | f     | f    | 24 27 00 00
-          3 | (0,3)   |      12 | f     | f    | 25 27 00 00
-          4 | (0,4)   |      12 | f     | f    | 26 27 00 00
-          5 | (0,5)   |      12 | f     | f    | 27 27 00 00
-          6 | (0,6)   |      12 | f     | f    | 28 27 00 00
-          7 | (0,7)   |      12 | f     | f    | 29 27 00 00
-          8 | (0,8)   |      12 | f     | f    | 2a 27 00 00
+regression=# SELECT * FROM bt_page_items(get_raw_page('tenk2_unique1', 5));
+ itemoffset |   ctid   | itemlen | nulls | vars |          data           | dead |   htid   | tids
+------------+----------+---------+-------+------+-------------------------+------+----------+------
+          1 | (40,1)   |      16 | f     | f    | b8 05 00 00 00 00 00 00 |      |          |
+          2 | (58,11)  |      16 | f     | f    | 4a 04 00 00 00 00 00 00 | f    | (58,11)  |
+          3 | (266,4)  |      16 | f     | f    | 4b 04 00 00 00 00 00 00 | f    | (266,4)  |
+          4 | (279,25) |      16 | f     | f    | 4c 04 00 00 00 00 00 00 | f    | (279,25) |
+          5 | (333,11) |      16 | f     | f    | 4d 04 00 00 00 00 00 00 | f    | (333,11) |
+          6 | (87,24)  |      16 | f     | f    | 4e 04 00 00 00 00 00 00 | f    | (87,24)  |
+          7 | (38,22)  |      16 | f     | f    | 4f 04 00 00 00 00 00 00 | f    | (38,22)  |
+          8 | (272,17) |      16 | f     | f    | 50 04 00 00 00 00 00 00 | f    | (272,17) |
 </screen>
       All the other details are the same as explained in the previous item.
      </para>
-- 
2.17.1

v27-0003-DEBUG-Show-index-values-in-pageinspect.patchapplication/octet-stream; name=v27-0003-DEBUG-Show-index-values-in-pageinspect.patchDownload

From 8bc913006d05c39d3d687e024123b568b042a0d7 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Nov 2019 19:35:30 -0800
Subject: [PATCH v27 3/3] DEBUG: Show index values in pageinspect

This is not intended for commit.  It is included as a convenience for
reviewers.
---
 contrib/pageinspect/btreefuncs.c       | 65 ++++++++++++++++++--------
 contrib/pageinspect/expected/btree.out |  2 +-
 2 files changed, 47 insertions(+), 20 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 17f7ad186e..4eab8df098 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -27,6 +27,7 @@
 
 #include "postgres.h"
 
+#include "access/genam.h"
 #include "access/nbtree.h"
 #include "access/relation.h"
 #include "catalog/namespace.h"
@@ -245,6 +246,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
  */
 struct user_args
 {
+	Relation	rel;
 	Page		page;
 	OffsetNumber offset;
 	bool		leafpage;
@@ -261,6 +263,7 @@ struct user_args
 static Datum
 bt_page_print_tuples(FuncCallContext *fctx, struct user_args *uargs)
 {
+	Relation	rel = uargs->rel;
 	Page		page = uargs->page;
 	OffsetNumber offset = uargs->offset;
 	bool		leafpage = uargs->leafpage;
@@ -295,26 +298,48 @@ bt_page_print_tuples(FuncCallContext *fctx, struct user_args *uargs)
 	values[j++] = BoolGetDatum(IndexTupleHasVarwidths(itup));
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
-	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
-
-	/*
-	 * Make sure that "data" column does not include posting list or pivot
-	 * tuple representation of heap TID
-	 */
-	if (BTreeTupleIsPosting(itup))
-		dlen -= IndexTupleSize(itup) - BTreeTupleGetPostingOffset(itup);
-	else if (BTreeTupleIsPivot(itup) && BTreeTupleGetHeapTID(itup) != NULL)
-		dlen -= MAXALIGN(sizeof(ItemPointerData));
-
-	dump = palloc0(dlen * 3 + 1);
-	datacstring = dump;
-	for (off = 0; off < dlen; off++)
+	if (rel)
 	{
-		if (off > 0)
-			*dump++ = ' ';
-		sprintf(dump, "%02x", *(ptr + off) & 0xff);
-		dump += 2;
+		TupleDesc	itupdesc = RelationGetDescr(rel);
+		Datum		datvalues[INDEX_MAX_KEYS];
+		bool		isnull[INDEX_MAX_KEYS];
+		int			natts;
+		int			indnkeyatts = rel->rd_index->indnkeyatts;
+
+		natts = BTreeTupleGetNAtts(itup, rel);
+
+		itupdesc->natts = Min(indnkeyatts, natts);
+		memset(&isnull, 0xFF, sizeof(isnull));
+		index_deform_tuple(itup, itupdesc, datvalues, isnull);
+		rel->rd_index->indnkeyatts = natts;
+		datacstring = BuildIndexValueDescription(rel, datvalues, isnull);
+		itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+		rel->rd_index->indnkeyatts = indnkeyatts;
 	}
+	else
+	{
+		dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+
+		/*
+		 * Make sure that "data" column does not include posting list or pivot
+		 * tuple representation of heap TID
+		 */
+		if (BTreeTupleIsPosting(itup))
+			dlen -= IndexTupleSize(itup) - BTreeTupleGetPostingOffset(itup);
+		else if (BTreeTupleIsPivot(itup) && BTreeTupleGetHeapTID(itup) != NULL)
+			dlen -= MAXALIGN(sizeof(ItemPointerData));
+
+		dump = palloc0(dlen * 3 + 1);
+		datacstring = dump;
+		for (off = 0; off < dlen; off++)
+		{
+			if (off > 0)
+				*dump++ = ' ';
+			sprintf(dump, "%02x", *(ptr + off) & 0xff);
+			dump += 2;
+		}
+	}
+
 	values[j++] = CStringGetTextDatum(datacstring);
 	pfree(datacstring);
 
@@ -437,11 +462,11 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = rel;
 		uargs->page = palloc(BLCKSZ);
 		memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
 
 		UnlockReleaseBuffer(buffer);
-		relation_close(rel, AccessShareLock);
 
 		uargs->offset = FirstOffsetNumber;
 
@@ -475,6 +500,7 @@ bt_page_items(PG_FUNCTION_ARGS)
 	}
 	else
 	{
+		relation_close(uargs->rel, AccessShareLock);
 		pfree(uargs->page);
 		pfree(uargs);
 		SRF_RETURN_DONE(fctx);
@@ -522,6 +548,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = NULL;
 		uargs->page = VARDATA(raw_page);
 
 		uargs->offset = FirstOffsetNumber;
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 1d45cd5c1e..3da5f37c3e 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -40,7 +40,7 @@ ctid       | (0,1)
 itemlen    | 16
 nulls      | f
 vars       | f
-data       | 01 00 00 00 00 00 00 01
+data       | (a)=(72057594037927937)
 dead       | f
 htid       | (0,1)
 tids       | 
-- 
2.17.1

v27-0001-Add-deduplication-to-nbtree.patchapplication/octet-stream; name=v27-0001-Add-deduplication-to-nbtree.patchDownload

From df53f49dde13732b44958c181a43ce6ce502caf3 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 25 Sep 2019 10:08:53 -0700
Subject: [PATCH v27 1/3] Add deduplication to nbtree.

Deduplication reduces the storage overhead of duplicates in indexes that
use the standard nbtree index access method.  The deduplication process
is applied lazily, after the point where opportunistic deletion of
LP_DEAD-marked index tuples occurs.  Deduplication is only applied at
the point where a leaf page split will be required if deduplication
can't free up enough space.  New "posting list tuples" are formed by
merging together existing duplicate tuples.  The physical representation
of the items on an nbtree leaf page is made more space efficient by
deduplication, but the logical contents of the page are not changed.

Testing has shown that nbtree deduplication can generally make indexes
with about 10 or 15 tuples for each distinct key value about 2.5X - 4X
smaller, even with single column integer indexes (e.g., an index on a
referencing column that accompanies a foreign key).  Much larger
reductions in index size are possible in less common cases, where
individual index tuple keys happen to be large.  The final size of
single column nbtree indexes comes close to the final size of a similar
contrib/btree_gin index, at least in cases where GIN's posting list
compression isn't very effective.

The lazy approach taken by nbtree has significant advantages over a
GIN-style eager approach.  Most individual inserts of index tuples have
exactly the same overhead as before.  The extra overhead of
deduplication is amortized across insertions, just like the overhead of
page splits.  The "key space" of indexes works in the same way as it has
since commit dd299df8 (the commit which made heap TID a tiebreaker
column), since only the physical representation of tuples is changed.
Furthermore, deduplication can be turned on or off as needed, or applied
selectively when required.  The split point choice logic doesn't need to
be changed, since posting list tuples are just tuples with payload, much
like tuples with non-key columns in INCLUDE indexes. (nbtsplitloc.c is
still optimized to make intelligent choices in the presence of posting
list tuples, though only because suffix truncation will routinely make
new high keys far far smaller than the non-pivot tuple they're derived
from).

Unique indexes can also make use of deduplication, though the strategy
used has significantly differences.  The high-level goal is to entirely
prevent "unnecessary" page splits -- splits caused only by a short term
burst of index tuple versions.  This is often a concern with frequently
updated tables where UPDATEs always modify at least one indexed column
(making it impossible for the table am to use an optimization like
heapam's heap-only tuples optimization).  Deduplication in unique
indexes effectively "buys time" for existing nbtree garbage collection
mechanisms to run and prevent these page splits (the LP_DEAD bit setting
performed during the uniqueness check is the most important mechanism
for controlling bloat with affected workloads).

Author: Anastasia Lubennikova, Peter Geoghegan
Reviewed-By: Peter Geoghegan
Discussion:
    https://postgr.es/m/55E4051B.7020209@postgrespro.ru
    https://postgr.es/m/4ab6e2db-bcee-f4cf-0916-3a06e6ccbb55@postgrespro.ru
---
 src/include/access/nbtree.h                   | 404 ++++++++--
 src/include/access/nbtxlog.h                  |  75 +-
 src/include/access/rmgrlist.h                 |   2 +-
 src/backend/access/common/reloptions.c        |   9 +
 src/backend/access/index/genam.c              |   4 +
 src/backend/access/nbtree/Makefile            |   1 +
 src/backend/access/nbtree/README              | 130 +++-
 src/backend/access/nbtree/nbtdedup.c          | 735 ++++++++++++++++++
 src/backend/access/nbtree/nbtinsert.c         | 360 ++++++++-
 src/backend/access/nbtree/nbtpage.c           | 219 +++++-
 src/backend/access/nbtree/nbtree.c            | 178 ++++-
 src/backend/access/nbtree/nbtsearch.c         | 270 ++++++-
 src/backend/access/nbtree/nbtsort.c           | 202 ++++-
 src/backend/access/nbtree/nbtsplitloc.c       |  39 +-
 src/backend/access/nbtree/nbtutils.c          | 217 +++++-
 src/backend/access/nbtree/nbtxlog.c           | 232 +++++-
 src/backend/access/rmgrdesc/nbtdesc.c         |  25 +-
 src/backend/utils/misc/guc.c                  |  10 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/bin/psql/tab-complete.c                   |   4 +-
 contrib/amcheck/verify_nbtree.c               | 217 +++++-
 doc/src/sgml/btree.sgml                       | 115 ++-
 doc/src/sgml/charset.sgml                     |   9 +-
 doc/src/sgml/config.sgml                      |  25 +
 doc/src/sgml/ref/create_index.sgml            |  37 +-
 src/test/regress/expected/btree_index.out     |  16 +
 src/test/regress/sql/btree_index.sql          |  17 +
 27 files changed, 3270 insertions(+), 283 deletions(-)
 create mode 100644 src/backend/access/nbtree/nbtdedup.c

diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index ef1eba0602..af9d099f69 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -24,6 +24,9 @@
 #include "storage/bufmgr.h"
 #include "storage/shm_toc.h"
 
+/* GUC parameter */
+extern bool btree_deduplication;
+
 /* There's room for a 16-bit vacuum cycle ID in BTPageOpaqueData */
 typedef uint16 BTCycleId;
 
@@ -108,6 +111,7 @@ typedef struct BTMetaPageData
 										 * pages */
 	float8		btm_last_cleanup_num_heap_tuples;	/* number of heap tuples
 													 * during last cleanup */
+	bool		btm_safededup;	/* deduplication known to be safe? */
 } BTMetaPageData;
 
 #define BTPageGetMeta(p) \
@@ -115,7 +119,8 @@ typedef struct BTMetaPageData
 
 /*
  * The current Btree version is 4.  That's what you'll get when you create
- * a new index.
+ * a new index.  The btm_safededup field can only be set if this happened
+ * on Postgres 13, but it's safe to read with version 3 indexes.
  *
  * Btree version 3 was used in PostgreSQL v11.  It is mostly the same as
  * version 4, but heap TIDs were not part of the keyspace.  Index tuples
@@ -132,8 +137,8 @@ typedef struct BTMetaPageData
 #define BTREE_METAPAGE	0		/* first page is meta */
 #define BTREE_MAGIC		0x053162	/* magic number in metapage */
 #define BTREE_VERSION	4		/* current version number */
-#define BTREE_MIN_VERSION	2	/* minimal supported version number */
-#define BTREE_NOVAC_VERSION	3	/* minimal version with all meta fields */
+#define BTREE_MIN_VERSION	2	/* minimum supported version */
+#define BTREE_NOVAC_VERSION	3	/* version with all meta fields set */
 
 /*
  * Maximum size of a btree index entry, including its tuple header.
@@ -156,6 +161,27 @@ typedef struct BTMetaPageData
 				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
 				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
 
+/*
+ * MaxBTreeIndexTuplesPerPage is an upper bound on the number of "logical"
+ * tuples that may be stored on a btree leaf page.  This is comparable to
+ * the generic/physical MaxIndexTuplesPerPage upper bound.  A separate
+ * upper bound is needed in certain contexts due to posting list tuples,
+ * which only use a single physical page entry to store many logical
+ * tuples.  (MaxBTreeIndexTuplesPerPage is used to size the per-page
+ * temporary buffers used by index scans.)
+ *
+ * Note: we don't bother considering per-physical-tuple overheads here to
+ * keep things simple (value is based on how many elements a single array
+ * of heap TIDs must have to fill the space between the page header and
+ * special area).  The value is slightly higher (i.e. more conservative)
+ * than necessary as a result, which is considered acceptable.  There will
+ * only be three (very large) physical posting list tuples in leaf pages
+ * that have the largest possible number of heap TIDs/logical tuples.
+ */
+#define MaxBTreeIndexTuplesPerPage \
+	(int) ((BLCKSZ - SizeOfPageHeaderData - sizeof(BTPageOpaqueData)) / \
+		   sizeof(ItemPointerData))
+
 /*
  * The leaf-page fillfactor defaults to 90% but is user-adjustable.
  * For pages above the leaf level, we use a fixed 70% fillfactor.
@@ -230,16 +256,15 @@ typedef struct BTMetaPageData
  * tuples (non-pivot tuples).  _bt_check_natts() enforces the rules
  * described here.
  *
- * Non-pivot tuple format:
+ * Non-pivot tuple format (plain/non-posting variant):
  *
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
- * All other types of index tuples ("pivot" tuples) only have key columns,
- * since pivot tuples only exist to represent how the key space is
+ * Non-pivot tuples complement pivot tuples, which only have key columns.
+ * The sole purpose of pivot tuples is to represent how the key space is
  * separated.  In general, any B-Tree index that has more than one level
  * (i.e. any index that does not just consist of a metapage and a single
  * leaf root page) must have some number of pivot tuples, since pivot
@@ -263,9 +288,9 @@ typedef struct BTMetaPageData
  * offset field only stores the number of columns/attributes when the
  * INDEX_ALT_TID_MASK bit is set, which doesn't count the trailing heap
  * TID column sometimes stored in pivot tuples -- that's represented by
- * the presence of BT_HEAP_TID_ATTR.  The INDEX_ALT_TID_MASK bit in t_info
- * is always set on BTREE_VERSION 4.  BT_HEAP_TID_ATTR can only be set on
- * BTREE_VERSION 4.
+ * the presence of BT_PIVOT_HEAP_TID_ATTR.  The INDEX_ALT_TID_MASK bit in
+ * t_info is always set on BTREE_VERSION 4.  BT_PIVOT_HEAP_TID_ATTR can
+ * only be set on BTREE_VERSION 4.
  *
  * In version 3 indexes, the INDEX_ALT_TID_MASK flag might not be set in
  * pivot tuples.  In that case, the number of key columns is implicitly
@@ -283,40 +308,136 @@ typedef struct BTMetaPageData
  * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
  * number of columns/attributes <= INDEX_MAX_KEYS.
  *
+ * Sometimes non-pivot tuples also use a representation that repurposes
+ * t_tid to store metadata rather than a TID.  Postgres 13 introduced a new
+ * non-pivot tuple format to support deduplication: posting list tuples.
+ * Deduplication merges together multiple equal non-pivot tuples into a
+ * logically equivalent, space efficient representation.  A posting list is
+ * an array of ItemPointerData elements.  Regular non-pivot tuples are
+ * merged together to form posting list tuples lazily, at the point where
+ * we'd otherwise have to split a leaf page.
+ *
+ * Posting tuple format (alternative non-pivot tuple representation):
+ *
+ *  t_tid | t_info | key values | posting list (TID array)
+ *
+ * Posting list tuples are recognized as such by having the
+ * INDEX_ALT_TID_MASK status bit set in t_info and the BT_IS_POSTING status
+ * bit set in t_tid.  These flags redefine the content of the posting
+ * tuple's t_tid to store an offset to the posting list, as well as the
+ * total number of posting list array elements.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items present in the tuple, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.  Like any non-pivot tuple, the number of columns stored is
+ * always implicitly the total number in the index (in practice there can
+ * never be non-key columns stored, since deduplication is not supported
+ * with INCLUDE indexes).
+ *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
-#define BT_HEAP_TID_ATTR			0x1000
-
-/* Get/set downlink block number in pivot tuple */
-#define BTreeTupleGetDownLink(itup) \
-	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
-#define BTreeTupleSetDownLink(itup, blkno) \
-	ItemPointerSetBlockNumber(&((itup)->t_tid), (blkno))
+#define BT_N_POSTING_OFFSET_MASK	0x0FFF
+#define BT_PIVOT_HEAP_TID_ATTR		0x1000
+#define BT_IS_POSTING				0x2000
 
 /*
- * Get/set leaf page highkey's link. During the second phase of deletion, the
- * target leaf page's high key may point to an ancestor page (at all other
- * times, the leaf level high key's link is not used).  See the nbtree README
- * for full details.
+ * Note: BTreeTupleIsPivot() can have false negatives (but not false
+ * positives) when used with !heapkeyspace indexes
  */
-#define BTreeTupleGetTopParent(itup) \
-	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
-#define BTreeTupleSetTopParent(itup, blkno)	\
-	do { \
-		ItemPointerSetBlockNumber(&((itup)->t_tid), (blkno)); \
-		BTreeTupleSetNAtts((itup), 0); \
-	} while(0)
+static inline bool
+BTreeTupleIsPivot(IndexTuple itup)
+{
+	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+		return false;
+	/* absence of BT_IS_POSTING in offset number indicates pivot tuple */
+	if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) & BT_IS_POSTING) != 0)
+		return false;
+
+	return true;
+}
+
+static inline bool
+BTreeTupleIsPosting(IndexTuple itup)
+{
+	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+		return false;
+	/* presence of BT_IS_POSTING in offset number indicates posting tuple */
+	if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) & BT_IS_POSTING) == 0)
+		return false;
+
+	return true;
+}
+
+static inline void
+BTreeTupleSetPosting(IndexTuple itup, int nhtids, int postingoffset)
+{
+	Assert(nhtids > 1 && (nhtids & BT_N_POSTING_OFFSET_MASK) == nhtids);
+	Assert(postingoffset == MAXALIGN(postingoffset));
+	Assert(postingoffset < INDEX_SIZE_MASK);
+
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	ItemPointerSetOffsetNumber(&itup->t_tid, (nhtids | BT_IS_POSTING));
+	ItemPointerSetBlockNumber(&itup->t_tid, postingoffset);
+}
+
+static inline uint16
+BTreeTupleGetNPosting(IndexTuple itup)
+{
+	OffsetNumber existing;
+
+	Assert(BTreeTupleIsPosting(itup));
+
+	existing = ItemPointerGetOffsetNumberNoCheck(&itup->t_tid);
+	return (existing & BT_N_POSTING_OFFSET_MASK);
+}
+
+static inline uint32
+BTreeTupleGetPostingOffset(IndexTuple itup)
+{
+	Assert(BTreeTupleIsPosting(itup));
+
+	return ItemPointerGetBlockNumberNoCheck(&itup->t_tid);
+}
+
+static inline ItemPointer
+BTreeTupleGetPosting(IndexTuple itup)
+{
+	return (ItemPointer) ((char *) itup + BTreeTupleGetPostingOffset(itup));
+}
+
+static inline ItemPointer
+BTreeTupleGetPostingN(IndexTuple itup, int n)
+{
+	return BTreeTupleGetPosting(itup) + n;
+}
+
+/*
+ * Get/set downlink block number in pivot tuple.
+ *
+ * Note: Cannot assert that itup is a pivot tuple.  If we did so then
+ * !heapkeyspace indexes would exhibit false positive assertion failures.
+ */
+static inline BlockNumber
+BTreeTupleGetDownLink(IndexTuple itup)
+{
+	return ItemPointerGetBlockNumberNoCheck(&itup->t_tid);
+}
+
+static inline void
+BTreeTupleSetDownLink(IndexTuple itup, BlockNumber blkno)
+{
+	ItemPointerSetBlockNumber(&itup->t_tid, blkno);
+}
 
 /*
  * Get/set number of attributes within B-tree index tuple.
@@ -324,46 +445,108 @@ typedef struct BTMetaPageData
  * Note that this does not include an implicit tiebreaker heap TID
  * attribute, if any.  Note also that the number of key attributes must be
  * explicitly represented in all heapkeyspace pivot tuples.
+ *
+ * Note: This is defined at a macro rather than an inline function to
+ * avoid including rel.h.
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		(BTreeTupleIsPivot(itup)) ? \
 		( \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
 		IndexRelationGetNumberOfAttributes(rel) \
 	)
-#define BTreeTupleSetNAtts(itup, n) \
-	do { \
-		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
-	} while(0)
+
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int natts)
+{
+	Assert(natts <= INDEX_MAX_KEYS);
+
+	/* absence of BT_IS_POSTING in offset number indicates pivot tuple */
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	ItemPointerSetOffsetNumber(&itup->t_tid, natts);
+}
 
 /*
- * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Get/set leaf page highkey's link. During the second phase of deletion, the
+ * target leaf page's high key may point to an ancestor page (at all other
+ * times, the leaf level high key's link is not used).  See the nbtree README
+ * for full details.
  */
-#define BTreeTupleGetHeapTID(itup) \
-	( \
-	  (itup)->t_info & INDEX_ALT_TID_MASK && \
-	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
-	  ( \
-		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
-					   sizeof(ItemPointerData)) \
-	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
-	)
+static inline BlockNumber
+BTreeTupleGetTopParent(IndexTuple itup)
+{
+	return ItemPointerGetBlockNumberNoCheck(&itup->t_tid);
+}
+
+static inline void
+BTreeTupleSetTopParent(IndexTuple itup, BlockNumber blkno)
+{
+	ItemPointerSetBlockNumber(&itup->t_tid, blkno);
+	BTreeTupleSetNAtts(itup, 0);
+}
+
 /*
- * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * Get tiebreaker heap TID attribute, if any.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
  */
-#define BTreeTupleSetAltHeapTID(itup) \
-	do { \
-		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
-								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
-	} while(0)
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+	if (BTreeTupleIsPivot(itup))
+	{
+		/* Pivot tuple heap TID representation? */
+		if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+			 BT_PIVOT_HEAP_TID_ATTR) != 0)
+			return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+								  sizeof(ItemPointerData));
+
+		/* Heap TID attribute was truncated */
+		return NULL;
+	}
+	else if (BTreeTupleIsPosting(itup))
+		return BTreeTupleGetPosting(itup);
+
+	return &itup->t_tid;
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list tuple.  Works with
+ * non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxHeapTID(IndexTuple itup)
+{
+	Assert(!BTreeTupleIsPivot(itup));
+
+	if (BTreeTupleIsPosting(itup))
+	{
+		uint16		nposting = BTreeTupleGetNPosting(itup);
+
+		return BTreeTupleGetPostingN(itup, nposting - 1);
+	}
+
+	return &itup->t_tid;
+}
+
+/*
+ * Set the heap TID attribute for a pivot tuple
+ */
+static inline void
+BTreeTupleSetAltHeapTID(IndexTuple itup)
+{
+	OffsetNumber existing;
+
+	Assert(BTreeTupleIsPivot(itup));
+
+	existing = ItemPointerGetOffsetNumberNoCheck(&itup->t_tid);
+	ItemPointerSetOffsetNumber(&itup->t_tid,
+							   existing | BT_PIVOT_HEAP_TID_ATTR);
+}
 
 /*
  *	Operator strategy numbers for B-tree have been moved to access/stratnum.h,
@@ -435,6 +618,11 @@ typedef BTStackData *BTStack;
  * indexes whose version is >= version 4.  It's convenient to keep this close
  * by, rather than accessing the metapage repeatedly.
  *
+ * safededup is set to indicate that index may use dynamic deduplication
+ * safely (index storage parameter separately indicates if deduplication is
+ * currently in use).  This is also a property of the index relation rather
+ * than an indexscan that is kept around for convenience.
+ *
  * anynullkeys indicates if any of the keys had NULL value when scankey was
  * built from index tuple (note that already-truncated tuple key attributes
  * set NULL as a placeholder key value, which also affects value of
@@ -470,6 +658,7 @@ typedef BTStackData *BTStack;
 typedef struct BTScanInsertData
 {
 	bool		heapkeyspace;
+	bool		safededup;
 	bool		anynullkeys;
 	bool		nextkey;
 	bool		pivotsearch;
@@ -508,10 +697,60 @@ typedef struct BTInsertStateData
 	bool		bounds_valid;
 	OffsetNumber low;
 	OffsetNumber stricthigh;
+
+	/*
+	 * if _bt_binsrch_insert found the location inside existing posting list,
+	 * save the position inside the list.  This will be -1 in rare cases where
+	 * the overlapping posting list is LP_DEAD.
+	 */
+	int			postingoff;
 } BTInsertStateData;
 
 typedef BTInsertStateData *BTInsertState;
 
+/*
+ * BTDedupStateData is a working area used during deduplication.
+ *
+ * The status info fields track the state of a whole-page deduplication pass.
+ * State about the current pending posting list is also tracked.
+ *
+ * A pending posting list is comprised of a contiguous group of equal physical
+ * items from the page, starting from page offset number 'baseoff'.  This is
+ * the offset number of the "base" tuple for new posting list.  'nitems' is
+ * the current total number of existing items from the page that will be
+ * merged to make a new posting list tuple, including the base tuple item.
+ * (Existing physical items may themselves be posting list tuples, or regular
+ * non-pivot tuples.)
+ *
+ * Note that when deduplication merges together existing physical tuples, the
+ * page is modified eagerly.  This makes tracking the details of more than a
+ * single pending posting list at a time unnecessary.  The total size of the
+ * existing tuples to be freed when pending posting list is processed gets
+ * tracked by 'phystupsize'.  This information allows deduplication to
+ * calculate the space saving for each new posting list tuple, and for the
+ * entire pass over the page as a whole.
+ */
+typedef struct BTDedupStateData
+{
+	/* Deduplication status info for entire pass over page */
+	Size		maxitemsize;	/* Limit on size of final tuple */
+	bool		checkingunique; /* Use unique index strategy? */
+	OffsetNumber skippedbase;	/* First offset skipped by checkingunique */
+
+	/* Metadata about base tuple of current pending posting list */
+	IndexTuple	base;			/* Use to form new posting list */
+	OffsetNumber baseoff;		/* page offset of base */
+	Size		basetupsize;	/* base size without posting list */
+
+	/* Other metadata about pending posting list */
+	ItemPointer htids;			/* Heap TIDs in pending posting list */
+	int			nhtids;			/* Number of heap TIDs in nhtids array */
+	int			nitems;			/* Number of existing physical tuples */
+	Size		phystupsize;	/* Includes line pointer overhead */
+} BTDedupStateData;
+
+typedef BTDedupStateData *BTDedupState;
+
 /*
  * BTScanOpaqueData is the btree-private state needed for an indexscan.
  * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
@@ -535,7 +774,10 @@ typedef BTInsertStateData *BTInsertState;
  * If we are doing an index-only scan, we save the entire IndexTuple for each
  * matched item, otherwise only its heap TID and offset.  The IndexTuples go
  * into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array.  Posting list tuples store a "base" tuple once,
+ * allowing the same key to be returned for each logical tuple associated
+ * with the physical posting list tuple (i.e. for each TID from the posting
+ * list).
  */
 
 typedef struct BTScanPosItem	/* what we remember about each match */
@@ -579,7 +821,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxBTreeIndexTuplesPerPage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -687,6 +929,7 @@ typedef struct BTOptions
 	int			fillfactor;		/* page fill factor in percent (0..100) */
 	/* fraction of newly inserted tuples prior to trigger index cleanup */
 	float8		vacuum_cleanup_index_scale_factor;
+	bool		deduplication;	/* Use deduplication where safe? */
 } BTOptions;
 
 #define BTGetFillFactor(relation) \
@@ -695,8 +938,16 @@ typedef struct BTOptions
 	 (relation)->rd_options ? \
 	 ((BTOptions *) (relation)->rd_options)->fillfactor : \
 	 BTREE_DEFAULT_FILLFACTOR)
+#define BTGetUseDedup(relation) \
+	(AssertMacro(relation->rd_rel->relkind == RELKIND_INDEX && \
+				 relation->rd_rel->relam == BTREE_AM_OID), \
+	((relation)->rd_options ? \
+	 ((BTOptions *) (relation)->rd_options)->deduplication : \
+	 BTGetUseDedupGUC(relation)))
 #define BTGetTargetPageFreeSpace(relation) \
 	(BLCKSZ * (100 - BTGetFillFactor(relation)) / 100)
+#define BTGetUseDedupGUC(relation) \
+	(relation->rd_index->indisunique || btree_deduplication)
 
 /*
  * Constant definition for progress reporting.  Phase numbers must match
@@ -743,6 +994,22 @@ extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
 extern void _bt_parallel_done(IndexScanDesc scan);
 extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
 
+/*
+ * prototypes for functions in nbtdedup.c
+ */
+extern void _bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
+							   IndexTuple newitem, Size newitemsz,
+							   bool checkingunique);
+extern void _bt_dedup_start_pending(BTDedupState state, IndexTuple base,
+									OffsetNumber base_off);
+extern bool _bt_dedup_save_htid(BTDedupState state, IndexTuple itup);
+extern Size _bt_dedup_finish_pending(Buffer buf, BTDedupState state,
+									 bool need_wal);
+extern IndexTuple _bt_form_posting(IndexTuple tuple, ItemPointer htids,
+								   int nhtids);
+extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
+								   int postingoff);
+
 /*
  * prototypes for functions in nbtinsert.c
  */
@@ -761,14 +1028,16 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
 /*
  * prototypes for functions in nbtpage.c
  */
-extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level);
+extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+							 bool safededup);
 extern void _bt_update_meta_cleanup_info(Relation rel,
 										 TransactionId oldestBtpoXact, float8 numHeapTuples);
 extern void _bt_upgrademetapage(Page page);
 extern Buffer _bt_getroot(Relation rel, int access);
 extern Buffer _bt_gettrueroot(Relation rel);
 extern int	_bt_getrootheight(Relation rel);
-extern bool _bt_heapkeyspace(Relation rel);
+extern void _bt_metaversion(Relation rel, bool *heapkeyspace,
+							bool *safededup);
 extern void _bt_checkpage(Relation rel, Buffer buf);
 extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
 extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -779,7 +1048,9 @@ extern bool _bt_page_recyclable(Page page);
 extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems, Relation heapRel);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
-								OffsetNumber *deletable, int ndeletable);
+								OffsetNumber *deletable, int ndeletable,
+								OffsetNumber *updatable, IndexTuple *updated,
+								int nupdatable);
 extern int	_bt_pagedel(Relation rel, Buffer buf);
 
 /*
@@ -829,6 +1100,7 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern bool _bt_opclasses_support_dedup(Relation index);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 260d4af85c..2b721c6cf5 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP_PAGE	0x50	/* deduplicate tuples on leaf page */
+#define XLOG_BTREE_INSERT_POST	0x60	/* add index tuple with posting split */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -53,21 +54,33 @@ typedef struct xl_btree_metadata
 	uint32		fastlevel;
 	TransactionId oldest_btpo_xact;
 	float8		last_cleanup_num_heap_tuples;
+	bool		safededup;
 } xl_btree_metadata;
 
 /*
  * This is what we need to know about simple (without split) insert.
  *
- * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
- * Note that INSERT_META implies it's not a leaf page.
+ * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META, and
+ * INSERT_POST.  Note that INSERT_META and INSERT_UPPER implies it's not a
+ * leaf page, while INSERT_POST and INSERT_LEAF imply that it is.
  *
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page
  * Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
  * Backup Blk 2: xl_btree_metadata, if INSERT_META
+ *
+ * Note: The new tuple is actually the "original" new item in the posting
+ * list split insert case (i.e. the INSERT_POST case).  A split offset for
+ * the posting list is logged before the original new item.  Recovery needs
+ * both, since it must do an in-place update of the existing posting list
+ * that was split as an extra step.  Also, recovery generates a "final"
+ * newitem.  See _bt_swap_posting().
  */
 typedef struct xl_btree_insert
 {
 	OffsetNumber offnum;
+
+	/* POSTING SPLIT OFFSET FOLLOWS (INSERT_POST case) */
+	/* NEW TUPLE ALWAYS FOLLOWS AT THE END */
 } xl_btree_insert;
 
 #define SizeOfBtreeInsert	(offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
@@ -91,9 +104,18 @@ typedef struct xl_btree_insert
  *
  * Backup Blk 0: original page / new left page
  *
- * The left page's data portion contains the new item, if it's the _L variant.
- * An IndexTuple representing the high key of the left page must follow with
- * either variant.
+ * The left page's data portion contains the new item, if it's the _L variant
+ * (though _R variant page split records with a posting list split sometimes
+ * need to include newitem).  An IndexTuple representing the high key of the
+ * left page must follow in all cases.
+ *
+ * The newitem is actually an "original" newitem when a posting list split
+ * occurs that requires than the original posting list be updated in passing.
+ * Recovery recognizes this case when postingoff is set.  This corresponds to
+ * the xl_btree_insert INSERT_POST case.  Note that postingoff will be set to
+ * zero (no posting split) when a posting list split occurs where both
+ * original posting list and newitem go on the right page, since recovery
+ * doesn't need to consider the posting list split at all.
  *
  * Backup Blk 1: new right page
  *
@@ -111,15 +133,32 @@ typedef struct xl_btree_split
 {
 	uint32		level;			/* tree level of page being split */
 	OffsetNumber firstright;	/* first item moved to right page */
-	OffsetNumber newitemoff;	/* new item's offset (useful for _L variant) */
+	OffsetNumber newitemoff;	/* new item's offset */
+	uint16		postingoff;		/* offset inside orig posting tuple */
 } xl_btree_split;
 
-#define SizeOfBtreeSplit	(offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit	(offsetof(xl_btree_split, postingoff) + sizeof(uint16))
+
+/*
+ * When page is deduplicated, consecutive groups of tuples with equal keys are
+ * merged together into posting list tuples.
+ *
+ * The WAL record represents the interval that describes the posing tuple
+ * that should be added to the page.
+ */
+typedef struct xl_btree_dedup
+{
+	OffsetNumber baseoff;
+	uint16		nitems;
+} xl_btree_dedup;
+
+#define SizeOfBtreeDedup 	(offsetof(xl_btree_dedup, nitems) + sizeof(uint16))
 
 /*
  * This is what we need to know about delete of individual leaf index tuples.
  * The WAL record can represent deletion of any number of index tuples on a
- * single index page when *not* executed by VACUUM.
+ * single index page when *not* executed by VACUUM.  Deletion of a subset of
+ * "logical" tuples within a posting list tuple is not supported.
  *
  * Backup Blk 0: index page
  */
@@ -152,19 +191,25 @@ typedef struct xl_btree_reuse_page
 /*
  * This is what we need to know about vacuum of individual leaf index tuples.
  * The WAL record can represent deletion of any number of index tuples on a
- * single index page when executed by VACUUM.
+ * single index page when executed by VACUUM.  It can also support "updates"
+ * of index tuples, which are actually deletions of "logical" tuples contained
+ * in an existing posting list tuple that will still have some remaining
+ * logical tuples once VACUUM finishes.
  *
  * Note that the WAL record in any vacuum of an index must have at least one
- * item to delete.
+ * item to delete or update.
  */
 typedef struct xl_btree_vacuum
 {
-	uint32		ndeleted;
+	uint16		ndeleted;
+	uint16		nupdated;
 
 	/* DELETED TARGET OFFSET NUMBERS FOLLOW */
+	/* UPDATED TARGET OFFSET NUMBERS FOLLOW */
+	/* UPDATED TUPLES TO ADD BACK FOLLOW */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, nupdated) + sizeof(uint16))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.
@@ -245,6 +290,8 @@ typedef struct xl_btree_newroot
 extern void btree_redo(XLogReaderState *record);
 extern void btree_desc(StringInfo buf, XLogReaderState *record);
 extern const char *btree_identify(uint8 info);
+extern void btree_xlog_startup(void);
+extern void btree_xlog_cleanup(void);
 extern void btree_mask(char *pagedata, BlockNumber blkno);
 
 #endif							/* NBTXLOG_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 3c0db2ccf5..2b8c6c7fc8 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -36,7 +36,7 @@ PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL,
 PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
 PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
 PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask)
 PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask)
 PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
 PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 48377ace24..2b37afd9e5 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -158,6 +158,15 @@ static relopt_bool boolRelOpts[] =
 		},
 		true
 	},
+	{
+		{
+			"deduplication",
+			"Enables deduplication on btree index leaf pages",
+			RELOPT_KIND_BTREE,
+			ShareUpdateExclusiveLock
+		},
+		true
+	},
 	/* list terminator */
 	{{NULL}}
 };
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 2599b5d342..6e1dc596e1 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -276,6 +276,10 @@ BuildIndexValueDescription(Relation indexRelation,
 /*
  * Get the latestRemovedXid from the table entries pointed at by the index
  * tuples being deleted.
+ *
+ * Note: index access methods that don't consistently use the standard
+ * IndexTuple + heap TID item pointer representation will need to provide
+ * their own version of this function.
  */
 TransactionId
 index_compute_xid_horizon_for_tuples(Relation irel,
diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index bf245f5dab..d69808e78c 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = \
 	nbtcompare.o \
+	nbtdedup.o \
 	nbtinsert.o \
 	nbtpage.o \
 	nbtree.o \
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 334ef76e89..8821b6ccba 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
 like a hint bit for a heap tuple), but physically removing tuples requires
 exclusive lock.  In the current code we try to remove LP_DEAD tuples when
 we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already).  Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)
 
 This leaves the index in a state where it has no entry for a dead tuple
 that still exists in the heap.  This is not a problem for the current
@@ -726,6 +729,131 @@ if it must.  When a page that's already full of duplicates must be split,
 the fallback strategy assumes that duplicates are mostly inserted in
 ascending heap TID order.  The page is split in a way that leaves the left
 half of the page mostly full, and the right half of the page mostly empty.
+The overall effect is that leaf page splits gracefully adapt to inserts of
+large groups of duplicates, maximizing space utilization.  Note also that
+"trapping" large groups of duplicates on the same leaf page like this makes
+deduplication more efficient.  Deduplication can be performed infrequently,
+while freeing just as much space.
+
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits (the goals for
+deduplication in unique indexes are rather different; see "Deduplication in
+unique indexes" for details).  Deduplication alters the physical
+representation of tuples without changing the logical contents of the
+index, and without adding overhead to read queries.  Non-pivot tuples are
+merged together into a single physical tuple with a posting list (a simple
+array of heap TIDs with the standard item pointer format).  Deduplication
+is always applied lazily, at the point where it would otherwise be
+necessary to perform a page split.  It occurs only when LP_DEAD items have
+been removed, as our last line of defense against splitting a leaf page.
+We can set the LP_DEAD bit with posting list tuples, though only when all
+table tuples are known dead. (Bitmap scans cannot perform LP_DEAD bit
+setting, and are the common case with indexes that contain lots of
+duplicates, so this downside is considered acceptable.)
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists.  A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple.  An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication.  Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple.  When the incoming
+tuple really does overlap with an existing posting list, a posting list
+split is performed.  Posting list splits work in a way that more or less
+preserves the illusion that all incoming tuples do not need to be merged
+with any existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list.  The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list.  The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list.  We make
+space in the posting list by removing the heap TID that became the new
+item.  The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists.  Also, overall
+space utilization is improved by keeping existing posting lists large.
+
+The representation of posting lists is almost identical to the posting
+lists used by GIN, so it would be straightforward to apply GIN's varbyte
+encoding compression scheme to individual posting lists.  Posting list
+compression would break the assumptions made by posting list splits about
+page space accounting, though, so it's not clear how compression could be
+integrated with nbtree.  Besides, posting list compression does not offer a
+compelling trade-off for nbtree, since in general nbtree is optimized for
+consistent performance with many concurrent readers and writers.  A major
+goal of nbtree's lazy approach to deduplication is to limit the performance
+impact of deduplication with random updates.  Even concurrent append-only
+inserts of the same key value will tend to have inserts of individual index
+tuples in an order that doesn't quite match heap TID order.  In general,
+delaying deduplication avoids many unnecessary posting list splits, and
+minimizes page level fragmentation.
+
+Deduplication in unique indexes
+-------------------------------
+
+Like all index access methods, nbtree does not have direct knowledge of
+versioning or of MVCC; it deals only with physical tuples.  However, unique
+indexes implicitly give nbtree basic information about tuple versioning,
+since by definition zero or one tuples of any given key value can be
+visible to any possible MVCC snapshot (excluding index entries with NULL
+values).  When optimizations such as heapam's Heap-only tuples (HOT) happen
+to be ineffective, nbtree's on-the-fly deletion of tuples in unique indexes
+can be very important with UPDATE-heavy workloads.  Unique checking's
+LP_DEAD bit setting reliably attempts to kill old, equal index tuple
+versions.  Importantly, this prevents (or at least delays) page splits that
+are necessary only because a leaf page must contain multiple physical
+tuples for the same logical row.
+
+Very often, the range of values that can be placed on a given leaf page in
+a unique index is fixed and permanent.  For example, a primary key on an
+identity column is very likely to only have page splits caused by the
+insertion of new logical rows when the rightmost leaf page is split.
+Splitting any other leaf page is unlikely to occur unless there is a
+short-term need to have multiple versions/index tuples for the same logical
+row version.  Splitting a leaf page purely to store multiple versions
+should be considered pathological, since it permanently degrades the index
+structure in order to absorb a temporary burst of duplicates.  (Page
+deletion can never reverse the page split, since it can only be applied to
+leaf pages that are completely empty.)
+
+Deduplication within unique indexes is useful purely as a final line of
+defense against version-driven page splits.  The strategy used is
+significantly different to the "standard" deduplication strategy.  Unique
+index leaf pages only get a deduplication pass when an insertion (that
+might have to split the page) observed an existing duplicate on the page in
+passing.  This is based on the assumption that deduplication will only work
+out when _all_ new insertions are duplicates from UPDATEs.  This may mean
+that we miss an opportunity to delay a page split, but that's okay because
+our ultimate goal is to delay leaf page splits _indefinitely_ (i.e. to
+prevent them altogether); there is little point in trying to delay a split
+that is probably inevitable anyway.  This allows us to avoid the overhead
+of attempting to deduplicate in the common case where a leaf page in a
+unique index cannot ever have any duplicates (e.g. with a unique index on
+an append-only table).
+
+Furthermore, it's particularly important that LP_DEAD bit setting not be
+restricted by deduplication in the case of unique indexes.  This is why
+unique index deduplication is applied in an incremental, cooperative
+fashion (it is "extra lazy").  Besides, there is no possible advantage to
+having unique index deduplication do more than the bare minimum to avoid an
+immediate page split -- that is its only goal.  Deleting items on the page
+is always preferable to deduplication.
+
+Note that the btree_deduplication GUC is not considered when deduplicating
+in a unique index, since unique index deduplication is treated as a
+separate, internal-only optimization that doesn't need to be configured by
+users. (The deduplication storage parameter is still respected, though only
+for debugging purposes.)
 
 Notes About Data Representation
 -------------------------------
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
new file mode 100644
index 0000000000..6311ef9311
--- /dev/null
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -0,0 +1,735 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtdedup.c
+ *	  Deduplicate items in Lehman and Yao btrees for Postgres.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/nbtree/nbtdedup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/nbtxlog.h"
+#include "miscadmin.h"
+#include "utils/rel.h"
+
+
+/*
+ * Try to deduplicate items to free at least enough space to avoid a page
+ * split.  This function should be called during insertion, only after LP_DEAD
+ * items were removed by _bt_vacuum_one_page() to prevent a page split.
+ * (We'll have to kill LP_DEAD items here when the page's BTP_HAS_GARBAGE hint
+ * was not set, but that should be rare.)
+ *
+ * The strategy for !checkingunique callers is to perform as much
+ * deduplication as possible to free as much space as possible now, since
+ * making it harder to set LP_DEAD bits is considered an acceptable price for
+ * not having to deduplicate the same page many times.  It is unlikely that
+ * the items on the page will have their LP_DEAD bit set in the future, since
+ * that hasn't happened before now (besides, entire posting lists can still
+ * have their LP_DEAD bit set).
+ *
+ * The strategy for checkingunique callers is completely different.
+ * Deduplication works in tandem with garbage collection, especially the
+ * LP_DEAD bit setting that takes place in _bt_check_unique().  We give up as
+ * soon as it becomes clear that enough space has been made available to
+ * insert newitem without needing to split the page.  Also, we merge together
+ * larger groups of duplicate tuples first (merging together two index tuples
+ * usually saves very little space), and avoid merging together existing
+ * posting list tuples.  The goal is to generate posting lists with TIDs that
+ * are "close together in time", in order to maximize the chances of an
+ * LP_DEAD bit being set opportunistically.  See nbtree/README for more
+ * information on deduplication within unique indexes.
+ *
+ * nbtinsert.c caller should call _bt_vacuum_one_page() before calling here.
+ * Note that this routine will delete all items on the page that have their
+ * LP_DEAD bit set, even when page's BTP_HAS_GARBAGE bit is not set (a rare
+ * edge case).  Caller can rely on that to avoid inserting a new tuple that
+ * happens to overlap with an existing posting list tuple with its LP_DEAD bit
+ * set. (Calling here with a newitemsz of 0 will reliably delete the existing
+ * item, making it possible to avoid unsetting the LP_DEAD bit just to insert
+ * the new item.  In general, posting list splits should never have to deal
+ * with a posting list tuple with its LP_DEAD bit set.)
+ *
+ * Note: If newitem contains NULL values in key attributes, caller will be
+ * !checkingunique even when rel is a unique index.  The page in question will
+ * usually have many existing items with NULLs.
+ */
+void
+_bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
+				   IndexTuple newitem, Size newitemsz, bool checkingunique)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buf);
+	BTPageOpaque opaque;
+	BTDedupState state = NULL;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+	OffsetNumber deletable[MaxIndexTuplesPerPage];
+	bool		minimal = checkingunique;
+	int			ndeletable = 0;
+	Size		pagesaving = 0;
+	int			count = 0;
+	bool		singlevalue = false;
+
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	/* init deduplication state needed to build posting tuples */
+	state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+	state->maxitemsize = Min(BTMaxItemSize(page), INDEX_SIZE_MASK);
+	state->checkingunique = checkingunique;
+	state->skippedbase = InvalidOffsetNumber;
+	/* Metadata about base tuple of current pending posting list */
+	state->base = NULL;
+	state->baseoff = InvalidOffsetNumber;
+	state->basetupsize = 0;
+	/* Metadata about current pending posting list TIDs */
+	state->htids = NULL;
+	state->nhtids = 0;
+	state->nitems = 0;
+	/* Size of all physical tuples to be replaced by pending posting list */
+	state->phystupsize = 0;
+
+	minoff = P_FIRSTDATAKEY(opaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/*
+	 * Caller should call _bt_vacuum_one_page() before calling here when it
+	 * looked like there were LP_DEAD items on the page.  However, we can't
+	 * assume that there are no LP_DEAD items (for one thing, VACUUM will
+	 * clear the BTP_HAS_GARBAGE hint without reliably removing items that are
+	 * marked LP_DEAD).  We must be careful to clear all LP_DEAD items because
+	 * posting list splits cannot go ahead if an existing posting list item
+	 * has its LP_DEAD bit set. (Also, we don't want to unnecessarily unset
+	 * LP_DEAD bits when deduplicating items on the page below, though that
+	 * should be harmless.)
+	 *
+	 * The opposite problem is also possible: _bt_vacuum_one_page() won't
+	 * clear the BTP_HAS_GARBAGE bit when it is falsely set (i.e. when there
+	 * are no LP_DEAD bits).  This probably doesn't matter in practice, since
+	 * it's only a hint, and VACUUM will clear it at some point anyway.  Even
+	 * still, we clear the BTP_HAS_GARBAGE hint reliably here. (Seems like a
+	 * good idea for deduplication to only begin when we unambiguously have no
+	 * LP_DEAD items.)
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+
+		if (ItemIdIsDead(itemid))
+			deletable[ndeletable++] = offnum;
+	}
+
+	if (ndeletable > 0)
+	{
+		_bt_delitems_delete(rel, buf, deletable, ndeletable, heapRel);
+
+		/*
+		 * Return when a split will be avoided.  This is equivalent to
+		 * avoiding a split by following the usual _bt_vacuum_one_page() path.
+		 */
+		if (PageGetFreeSpace(page) >= newitemsz)
+		{
+			pfree(state);
+			return;
+		}
+
+		/*
+		 * Reconsider number of items on page, in case _bt_delitems_delete()
+		 * managed to delete an item or two
+		 */
+		minoff = P_FIRSTDATAKEY(opaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+	}
+	else if (P_HAS_GARBAGE(opaque))
+	{
+		opaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+		MarkBufferDirtyHint(buf, true);
+	}
+
+	/*
+	 * Return early in case where caller just wants us to kill an existing
+	 * LP_DEAD posting list tuple
+	 */
+	Assert(!P_HAS_GARBAGE(opaque));
+	if (newitemsz == 0)
+	{
+		pfree(state);
+		return;
+	}
+
+	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+	newitemsz += sizeof(ItemIdData);
+	/* Conservatively size array */
+	state->htids = palloc(state->maxitemsize);
+
+	/*
+	 * Determine if a "single value" strategy page split is likely to occur
+	 * shortly after deduplication finishes.  It should be possible for the
+	 * single value split to find a split point that packs the left half of
+	 * the split BTREE_SINGLEVAL_FILLFACTOR% full.
+	 */
+	if (!checkingunique)
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+
+		itemid = PageGetItemId(page, minoff);
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		if (_bt_keep_natts_fast(rel, newitem, itup) > natts)
+		{
+			itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
+			itup = (IndexTuple) PageGetItem(page, itemid);
+
+			/*
+			 * Use different strategy if future page split likely to need to
+			 * use "single value" strategy
+			 */
+			if (_bt_keep_natts_fast(rel, newitem, itup) > natts)
+				singlevalue = true;
+		}
+	}
+
+	/*
+	 * Iterate over tuples on the page, try to deduplicate them into posting
+	 * lists and insert into new page.  NOTE: It's essential to reassess the
+	 * max offset on each iteration, since it will change as items are
+	 * deduplicated.
+	 */
+	offnum = minoff;
+retry:
+	while (offnum <= PageGetMaxOffsetNumber(page))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(!ItemIdIsDead(itemid));
+
+		if (state->nitems == 0)
+		{
+			/*
+			 * No previous/base tuple for the data item -- use the data item
+			 * as base tuple of pending posting list
+			 */
+			_bt_dedup_start_pending(state, itup, offnum);
+		}
+		else if (_bt_keep_natts_fast(rel, state->base, itup) > natts &&
+				 _bt_dedup_save_htid(state, itup))
+		{
+			/*
+			 * Tuple is equal to base tuple of pending posting list.  Heap
+			 * TID(s) for itup have been saved in state.  The next iteration
+			 * will also end up here if it's possible to merge the next tuple
+			 * into the same pending posting list.
+			 */
+		}
+		else
+		{
+			/*
+			 * Tuple is not equal to pending posting list tuple, or
+			 * _bt_dedup_save_htid() opted to not merge current item into
+			 * pending posting list for some other reason (e.g., adding more
+			 * TIDs would have caused posting list to exceed BTMaxItemSize()
+			 * limit).
+			 *
+			 * If state contains pending posting list with more than one item,
+			 * form new posting tuple, and update the page.  Otherwise, reset
+			 * the state and move on.
+			 */
+			pagesaving += _bt_dedup_finish_pending(buf, state,
+												   RelationNeedsWAL(rel));
+
+			count++;
+
+			/*
+			 * When caller is a checkingunique caller and we have deduplicated
+			 * enough to avoid a page split, do minimal deduplication in case
+			 * the remaining items are about to be marked dead within
+			 * _bt_check_unique().
+			 */
+			if (minimal && pagesaving >= newitemsz)
+				break;
+
+			/*
+			 * Consider special steps when a future page split of the leaf
+			 * page is likely to occur using nbtsplitloc.c's "single value"
+			 * strategy
+			 */
+			if (singlevalue)
+			{
+				/*
+				 * Adjust maxitemsize so that there isn't a third and final
+				 * 1/3 of a page width tuple that fills the page to capacity.
+				 * The third tuple produced should be smaller than the first
+				 * two by an amount equal to the free space that nbtsplitloc.c
+				 * is likely to want to leave behind when the page it split.
+				 * When there are 3 posting lists on the page, then we end
+				 * deduplication.  Remaining tuples on the page can be
+				 * deduplicated later, when they're on the new right sibling
+				 * of this page, and the new sibling page needs to be split in
+				 * turn.
+				 *
+				 * Note that it doesn't matter if there are items on the page
+				 * that were already 1/3 of a page during current pass;
+				 * they'll still count as the first two posting list tuples.
+				 */
+				if (count == 2)
+				{
+					Size		leftfree;
+
+					/* This calculation needs to match nbtsplitloc.c */
+					leftfree = PageGetPageSize(page) - SizeOfPageHeaderData -
+						MAXALIGN(sizeof(BTPageOpaqueData));
+					/* Subtract predicted size of new high key */
+					leftfree -= newitemsz + MAXALIGN(sizeof(ItemPointerData));
+
+					/*
+					 * Reduce maxitemsize by an amount equal to target free
+					 * space on left half of page
+					 */
+					state->maxitemsize -= leftfree *
+						((100 - BTREE_SINGLEVAL_FILLFACTOR) / 100.0);
+				}
+				else if (count == 3)
+					break;
+			}
+
+			/*
+			 * Next iteration starts immediately after base tuple offset (this
+			 * will be the next offset on the page when we didn't modify the
+			 * page)
+			 */
+			offnum = state->baseoff;
+		}
+
+		offnum = OffsetNumberNext(offnum);
+	}
+
+	/* Handle the last item when pending posting list is not empty */
+	if (state->nitems != 0)
+	{
+		pagesaving += _bt_dedup_finish_pending(buf, state,
+											   RelationNeedsWAL(rel));
+		count++;
+	}
+
+	if (pagesaving < newitemsz && state->skippedbase != InvalidOffsetNumber)
+	{
+		/*
+		 * Didn't free enough space for new item in first checkingunique pass.
+		 * Try making a second pass over the page, this time starting from the
+		 * first candidate posting list base offset that was skipped over in
+		 * the first pass (only do a second pass when this actually happened).
+		 *
+		 * The second pass over the page may deduplicate items that were
+		 * initially passed over due to concerns about limiting the
+		 * effectiveness of LP_DEAD bit setting within _bt_check_unique().
+		 * Note that the second pass will still stop deduplicating as soon as
+		 * enough space has been freed to avoid an immediate page split.
+		 */
+		Assert(state->checkingunique);
+		offnum = state->skippedbase;
+
+		state->checkingunique = false;
+		state->skippedbase = InvalidOffsetNumber;
+		state->phystupsize = 0;
+		state->nitems = 0;
+		state->base = NULL;
+		state->baseoff = InvalidOffsetNumber;
+		state->basetupsize = 0;
+		goto retry;
+	}
+
+	/* Local space accounting should agree with page accounting */
+	Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
+
+	/* cannot leak memory here */
+	pfree(state->htids);
+	pfree(state);
+}
+
+/*
+ * Create a new pending posting list tuple based on caller's tuple.
+ *
+ * Every tuple processed by the deduplication routines either becomes the base
+ * tuple for a posting list, or gets its heap TID(s) accepted into a pending
+ * posting list.  A tuple that starts out as the base tuple for a posting list
+ * will only actually be rewritten within _bt_dedup_finish_pending() when
+ * there was at least one successful call to _bt_dedup_save_htid().
+ */
+void
+_bt_dedup_start_pending(BTDedupState state, IndexTuple base,
+						OffsetNumber baseoff)
+{
+	Assert(state->nhtids == 0);
+	Assert(state->nitems == 0);
+
+	/*
+	 * Copy heap TIDs from new base tuple for new candidate posting list into
+	 * ipd array.  Assume that we'll eventually create a new posting tuple by
+	 * merging later tuples with this existing one, though we may not.
+	 */
+	if (!BTreeTupleIsPosting(base))
+	{
+		memcpy(state->htids, base, sizeof(ItemPointerData));
+		state->nhtids = 1;
+		/* Save size of tuple without any posting list */
+		state->basetupsize = IndexTupleSize(base);
+	}
+	else
+	{
+		int			nposting;
+
+		nposting = BTreeTupleGetNPosting(base);
+		memcpy(state->htids, BTreeTupleGetPosting(base),
+			   sizeof(ItemPointerData) * nposting);
+		state->nhtids = nposting;
+		/* Save size of tuple without any posting list */
+		state->basetupsize = BTreeTupleGetPostingOffset(base);
+	}
+
+	/*
+	 * Save new base tuple itself -- it'll be needed if we actually create a
+	 * new posting list from new pending posting list.
+	 *
+	 * Must maintain size of all tuples (including line pointer overhead) to
+	 * calculate space savings on page within _bt_dedup_finish_pending().
+	 * Also, save number of base tuple logical tuples so that we can save
+	 * cycles in the common case where an existing posting list can't or won't
+	 * be merged with other tuples on the page.
+	 */
+	state->nitems = 1;
+	state->base = base;
+	state->baseoff = baseoff;
+	state->phystupsize = MAXALIGN(IndexTupleSize(base)) + sizeof(ItemIdData);
+}
+
+/*
+ * Save itup heap TID(s) into pending posting list where possible.
+ *
+ * Returns bool indicating if the pending posting list managed by state has
+ * itup's heap TID(s) saved.  When this is false, enlarging the pending
+ * posting list by the required amount would exceed the maxitemsize limit, so
+ * caller must finish the pending posting list tuple.  (Generally itup becomes
+ * the base tuple of caller's new pending posting list).
+ */
+bool
+_bt_dedup_save_htid(BTDedupState state, IndexTuple itup)
+{
+	int			nhtids;
+	ItemPointer htids;
+	Size		mergedtupsz;
+
+	if (!BTreeTupleIsPosting(itup))
+	{
+		nhtids = 1;
+		htids = &itup->t_tid;
+	}
+	else
+	{
+		nhtids = BTreeTupleGetNPosting(itup);
+		htids = BTreeTupleGetPosting(itup);
+	}
+
+	/*
+	 * Don't append (have caller finish pending posting list as-is) if
+	 * appending heap TID(s) from itup would put us over limit.
+	 *
+	 * This calculation needs to match the accounting code used within
+	 * _bt_form_posting() for new posting lists.
+	 */
+	mergedtupsz = MAXALIGN(state->basetupsize +
+						   (state->nhtids + nhtids) * sizeof(ItemPointerData));
+
+	if (mergedtupsz > state->maxitemsize)
+		return false;
+
+	/* Don't merge existing posting lists in first checkingunique pass */
+	if (state->checkingunique &&
+		(BTreeTupleIsPosting(state->base) || nhtids > 1))
+	{
+		/* May begin here if second pass over page is required */
+		if (state->skippedbase == InvalidOffsetNumber)
+			state->skippedbase = state->baseoff;
+		return false;
+	}
+
+	/*
+	 * Save heap TIDs to pending posting list tuple -- itup can be merged into
+	 * pending posting list
+	 */
+	state->nitems++;
+	memcpy(state->htids + state->nhtids, htids,
+		   sizeof(ItemPointerData) * nhtids);
+	state->nhtids += nhtids;
+	state->phystupsize += MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
+
+	return true;
+}
+
+/*
+ * Finalize pending posting list tuple, and add it to the page.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * Returns space saving from deduplicating to make a new posting list tuple.
+ * Note that this includes line pointer overhead.  This is zero in the case
+ * where no deduplication was possible.
+ */
+Size
+_bt_dedup_finish_pending(Buffer buf, BTDedupState state, bool need_wal)
+{
+	Size		spacesaving = 0;
+	Page		page = BufferGetPage(buf);
+	int			minimum = 2;
+
+	Assert(state->nitems > 0);
+	Assert(state->nitems <= state->nhtids);
+
+	/*
+	 * Only create a posting list when at least 3 heap TIDs will appear in the
+	 * checkingunique case (checkingunique strategy won't merge existing
+	 * posting list tuples, so we know that the number of items here must also
+	 * be the total number of heap TIDs).  Creating a new posting lists with
+	 * only two heap TIDs won't even save enough space to fit another
+	 * duplicate with the same key as the posting list.  This is a bad
+	 * trade-off if there is a chance that the LP_DEAD bit can be set for
+	 * either existing tuple by putting off deduplication.
+	 *
+	 * (Note that a second pass over the page can deduplicate the item if that
+	 * is truly the only way to avoid a page split for checkingunique caller)
+	 */
+	Assert(!state->checkingunique || state->nitems == 1 ||
+		   state->nhtids == state->nitems);
+	if (state->checkingunique)
+	{
+		minimum = 3;
+		/* May begin here if second pass over page is required */
+		if (state->nitems == 2 && state->skippedbase == InvalidOffsetNumber)
+			state->skippedbase = state->baseoff;
+	}
+
+	if (state->nitems >= minimum)
+	{
+		IndexTuple	final;
+		Size		finalsz;
+		OffsetNumber offnum;
+		OffsetNumber deletable[MaxOffsetNumber];
+		int			ndeletable = 0;
+
+		/* find all tuples that will be replaced with this new posting tuple */
+		for (offnum = state->baseoff;
+			 offnum < state->baseoff + state->nitems;
+			 offnum = OffsetNumberNext(offnum))
+			deletable[ndeletable++] = offnum;
+
+		/* Form a tuple with a posting list */
+		final = _bt_form_posting(state->base, state->htids, state->nhtids);
+		finalsz = IndexTupleSize(final);
+		spacesaving = state->phystupsize - (finalsz + sizeof(ItemIdData));
+		/* Must save some space, and must not exceed tuple limits */
+		Assert(spacesaving > 0 && spacesaving < BLCKSZ);
+		Assert(finalsz <= state->maxitemsize);
+		Assert(finalsz == MAXALIGN(IndexTupleSize(final)));
+
+		START_CRIT_SECTION();
+
+		/* Delete original items */
+		PageIndexMultiDelete(page, deletable, ndeletable);
+		/* Insert posting tuple, replacing original items */
+		if (PageAddItem(page, (Item) final, finalsz, state->baseoff, false,
+						false) == InvalidOffsetNumber)
+			elog(ERROR, "deduplication failed to add tuple to page");
+
+		MarkBufferDirty(buf);
+
+		/* Log deduplicated items */
+		if (need_wal)
+		{
+			XLogRecPtr	recptr;
+			xl_btree_dedup xlrec_dedup;
+
+			xlrec_dedup.baseoff = state->baseoff;
+			xlrec_dedup.nitems = state->nitems;
+
+			XLogBeginInsert();
+			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+			XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+			recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP_PAGE);
+
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
+
+		pfree(final);
+	}
+
+	/* Reset state for next pending posting list */
+	state->nhtids = 0;
+	state->nitems = 0;
+	state->phystupsize = 0;
+
+	return spacesaving;
+}
+
+/*
+ * Build a posting list tuple based on caller's "base" index tuple and list of
+ * heap TIDs.  When nhtids == 1, builds a standard non-pivot tuple without a
+ * posting list. (Posting list tuples can never have a single heap TID, partly
+ * because that ensures that deduplication always reduces final MAXALIGN()'d
+ * size of entire tuple.)
+ *
+ * Convention is that posting list starts at a MAXALIGN()'d offset (rather
+ * than a SHORTALIGN()'d offset), in line with the approach taken when
+ * appending a heap TID to new pivot tuple/high key during suffix truncation.
+ * This sometimes wastes a little space that was only needed as alignment
+ * padding in the original tuple.  Following this convention simplifies the
+ * space accounting used when deduplicating a page (the same convention
+ * simplifies the accounting for choosing a point to split a page at).
+ *
+ * Note: Caller's "htids" array must be unique and already in ascending TID
+ * order
+ */
+IndexTuple
+_bt_form_posting(IndexTuple tuple, ItemPointer htids, int nhtids)
+{
+	uint32		keysize,
+				newsize;
+	IndexTuple	itup;
+
+	/* We only need key part of the tuple */
+	if (BTreeTupleIsPosting(tuple))
+		keysize = BTreeTupleGetPostingOffset(tuple);
+	else
+		keysize = IndexTupleSize(tuple);
+
+	Assert(nhtids > 0 && nhtids <= PG_UINT16_MAX);
+	Assert(keysize == MAXALIGN(keysize));
+
+	/*
+	 * Add extra space needed for posting list.
+	 *
+	 * This calculation needs to match the accounting code used within
+	 * _bt_dedup_save_htid().
+	 */
+	if (nhtids > 1)
+		newsize = MAXALIGN(keysize +
+						   nhtids * sizeof(ItemPointerData));
+	else
+		newsize = keysize;
+
+	Assert(newsize <= INDEX_SIZE_MASK);
+	Assert(newsize == MAXALIGN(newsize));
+
+	/* resize */
+	itup = palloc0(newsize);
+	memcpy(itup, tuple, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+	if (nhtids > 1)
+	{
+		/* Form posting list tuple */
+		BTreeTupleSetPosting(itup, nhtids, keysize);
+		memcpy(BTreeTupleGetPosting(itup), htids,
+			   sizeof(ItemPointerData) * nhtids);
+		Assert(BTreeTupleIsPosting(itup));
+
+#ifdef USE_ASSERT_CHECKING
+		{
+			/* Assert that htid array is sorted and has unique TIDs */
+			ItemPointerData last;
+			ItemPointer current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+				current = BTreeTupleGetPostingN(itup, i);
+				Assert(ItemPointerCompare(current, &last) > 0);
+				ItemPointerCopy(current, &last);
+			}
+		}
+#endif
+	}
+	else
+	{
+		/*
+		 * Copy only TID in htids array to header field (i.e. create standard
+		 * non-pivot representation)
+		 */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		ItemPointerCopy(htids, &itup->t_tid);
+	}
+
+	return itup;
+}
+
+/*
+ * Prepare for a posting list split by swapping heap TID in newitem with heap
+ * TID from original posting list (the 'oposting' heap TID located at offset
+ * 'postingoff').
+ *
+ * Returns new posting list tuple, which is palloc()'d in caller's context.
+ * This is guaranteed to be the same size as 'oposting'.  Modified version of
+ * newitem is what caller actually inserts inside the critical section that
+ * also performs an in-place update of posting list.
+ *
+ * Explicit WAL-logging of newitem must use the original version of newitem in
+ * order to make it possible for our nbtxlog.c callers to correctly REDO
+ * original steps.  This approach avoids any explicit WAL-logging of a posting
+ * list tuple.  This is important because posting lists are often much larger
+ * than plain tuples.
+ *
+ * Caller should avoid assuming that the IndexTuple-wise key representation in
+ * newitem is bitwise equal to the representation used within oposting.  Note,
+ * in particular, that one may even be larger than the other.  This could
+ * occur due to differences in TOAST input state, for example.
+ */
+IndexTuple
+_bt_swap_posting(IndexTuple newitem, IndexTuple oposting, int postingoff)
+{
+	int			nhtids;
+	char	   *replacepos;
+	char	   *rightpos;
+	Size		nbytes;
+	IndexTuple	nposting;
+
+	nhtids = BTreeTupleGetNPosting(oposting);
+	Assert(postingoff > 0 && postingoff < nhtids);
+
+	nposting = CopyIndexTuple(oposting);
+	replacepos = (char *) BTreeTupleGetPostingN(nposting, postingoff);
+	rightpos = replacepos + sizeof(ItemPointerData);
+	nbytes = (nhtids - postingoff - 1) * sizeof(ItemPointerData);
+
+	/*
+	 * Move item pointers in posting list to make a gap for the new item's
+	 * heap TID (shift TIDs one place to the right, losing original rightmost
+	 * TID)
+	 */
+	memmove(rightpos, replacepos, nbytes);
+
+	/* Fill the gap with the TID of the new item */
+	ItemPointerCopy(&newitem->t_tid, (ItemPointer) replacepos);
+
+	/* Copy original posting list's rightmost TID into new item */
+	ItemPointerCopy(BTreeTupleGetPostingN(oposting, nhtids - 1),
+					&newitem->t_tid);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(nposting),
+							  BTreeTupleGetHeapTID(newitem)) < 0);
+	Assert(BTreeTupleGetNPosting(oposting) == BTreeTupleGetNPosting(nposting));
+
+	return nposting;
+}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 2e8e60cd0c..9885041521 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -28,6 +28,8 @@
 /* Minimum tree height for application of fastpath optimization */
 #define BTREE_FASTPATH_MIN_LEVEL	2
 
+/* GUC parameter */
+bool		btree_deduplication = true;
 
 static Buffer _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
 
@@ -47,10 +49,12 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   BTStack stack,
 						   IndexTuple itup,
 						   OffsetNumber newitemoff,
+						   int postingoff,
 						   bool split_only_page);
 static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 						Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
-						IndexTuple newitem);
+						IndexTuple newitem, IndexTuple orignewitem,
+						IndexTuple nposting, uint16 postingoff);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 							  BTStack stack, bool is_root, bool is_only);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
@@ -61,7 +65,8 @@ static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
  *	_bt_doinsert() -- Handle insertion of a single index tuple in the tree.
  *
  *		This routine is called by the public interface routine, btinsert.
- *		By here, itup is filled in, including the TID.
+ *		By here, itup is filled in, including the TID.  Caller should be
+ *		prepared for us to scribble on 'itup'.
  *
  *		If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
  *		will allow duplicates.  Otherwise (UNIQUE_CHECK_YES or
@@ -125,6 +130,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	insertstate.itup_key = itup_key;
 	insertstate.bounds_valid = false;
 	insertstate.buf = InvalidBuffer;
+	insertstate.postingoff = 0;
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -300,7 +306,7 @@ top:
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
 		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
-					   itup, newitemoff, false);
+					   itup, newitemoff, insertstate.postingoff, false);
 	}
 	else
 	{
@@ -353,6 +359,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 	BTPageOpaque opaque;
 	Buffer		nbuf = InvalidBuffer;
 	bool		found = false;
+	bool		inposting = false;
+	bool		prev_all_dead = true;
+	int			curposti = 0;
 
 	/* Assume unique until we find a duplicate */
 	*is_unique = true;
@@ -374,6 +383,11 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 
 	/*
 	 * Scan over all equal tuples, looking for live conflicts.
+	 *
+	 * Note that each iteration of the loop processes one heap TID, not one
+	 * index tuple.  The page offset number won't be advanced for iterations
+	 * which process heap TIDs from posting list tuples until the last such
+	 * heap TID for the posting list (curposti will be advanced instead).
 	 */
 	Assert(!insertstate->bounds_valid || insertstate->low == offset);
 	Assert(!itup_key->anynullkeys);
@@ -435,7 +449,27 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 
 				/* okay, we gotta fetch the heap tuple ... */
 				curitup = (IndexTuple) PageGetItem(page, curitemid);
-				htid = curitup->t_tid;
+
+				/*
+				 * decide if this is the first heap TID in tuple we'll
+				 * process, or if we should continue to process current
+				 * posting list
+				 */
+				if (!BTreeTupleIsPosting(curitup))
+				{
+					htid = curitup->t_tid;
+					inposting = false;
+				}
+				else if (!inposting)
+				{
+					/* First heap TID in posting list */
+					inposting = true;
+					prev_all_dead = true;
+					curposti = 0;
+				}
+
+				if (inposting)
+					htid = *BTreeTupleGetPostingN(curitup, curposti);
 
 				/*
 				 * If we are doing a recheck, we expect to find the tuple we
@@ -511,8 +545,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					 * not part of this chain because it had a different index
 					 * entry.
 					 */
-					htid = itup->t_tid;
-					if (table_index_fetch_tuple_check(heapRel, &htid,
+					if (table_index_fetch_tuple_check(heapRel, &itup->t_tid,
 													  SnapshotSelf, NULL))
 					{
 						/* Normal case --- it's still live */
@@ -570,12 +603,14 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 													RelationGetRelationName(rel))));
 					}
 				}
-				else if (all_dead)
+				else if (all_dead && (!inposting ||
+									  (prev_all_dead &&
+									   curposti == BTreeTupleGetNPosting(curitup) - 1)))
 				{
 					/*
-					 * The conflicting tuple (or whole HOT chain) is dead to
-					 * everyone, so we may as well mark the index entry
-					 * killed.
+					 * The conflicting tuple (or all HOT chains pointed to by
+					 * all posting list TIDs) is dead to everyone, so mark the
+					 * index entry killed.
 					 */
 					ItemIdMarkDead(curitemid);
 					opaque->btpo_flags |= BTP_HAS_GARBAGE;
@@ -589,14 +624,29 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					else
 						MarkBufferDirtyHint(insertstate->buf, true);
 				}
+
+				/*
+				 * Remember if posting list tuple has even a single HOT chain
+				 * whose members are not all dead
+				 */
+				if (!all_dead && inposting)
+					prev_all_dead = false;
 			}
 		}
 
-		/*
-		 * Advance to next tuple to continue checking.
-		 */
-		if (offset < maxoff)
+		if (inposting && curposti < BTreeTupleGetNPosting(curitup) - 1)
+		{
+			/* Advance to next TID in same posting list */
+			curposti++;
+			continue;
+		}
+		else if (offset < maxoff)
+		{
+			/* Advance to next tuple */
+			curposti = 0;
+			inposting = false;
 			offset = OffsetNumberNext(offset);
+		}
 		else
 		{
 			int			highkeycmp;
@@ -621,6 +671,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					elog(ERROR, "fell off the end of index \"%s\"",
 						 RelationGetRelationName(rel));
 			}
+			curposti = 0;
+			inposting = false;
 			maxoff = PageGetMaxOffsetNumber(page);
 			offset = P_FIRSTDATAKEY(opaque);
 			/* Don't invalidate binary search bounds */
@@ -689,6 +741,7 @@ _bt_findinsertloc(Relation rel,
 	BTScanInsert itup_key = insertstate->itup_key;
 	Page		page = BufferGetPage(insertstate->buf);
 	BTPageOpaque lpageop;
+	OffsetNumber location;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -704,6 +757,8 @@ _bt_findinsertloc(Relation rel,
 
 	if (itup_key->heapkeyspace)
 	{
+		bool		dedupunique = false;
+
 		/*
 		 * If we're inserting into a unique index, we may have to walk right
 		 * through leaf pages to find the one leaf page that we must insert on
@@ -717,9 +772,25 @@ _bt_findinsertloc(Relation rel,
 		 * tuple belongs on.  The heap TID attribute for new tuple (scantid)
 		 * could force us to insert on a sibling page, though that should be
 		 * very rare in practice.
+		 *
+		 * checkingunique inserters that encounter a duplicate will apply
+		 * deduplication when it looks like there will be a page split, but
+		 * there is no LP_DEAD garbage on the leaf page to vacuum away (or
+		 * there wasn't enough space freed by LP_DEAD cleanup).  This
+		 * complements the opportunistic LP_DEAD vacuuming mechanism.  The
+		 * high level goal is to avoid page splits caused by new, unchanged
+		 * versions of existing logical rows altogether.  See nbtree/README
+		 * for full details.
 		 */
 		if (checkingunique)
 		{
+			if (insertstate->low < insertstate->stricthigh)
+			{
+				/* Encountered a duplicate in _bt_check_unique() */
+				Assert(insertstate->bounds_valid);
+				dedupunique = true;
+			}
+
 			for (;;)
 			{
 				/*
@@ -746,18 +817,37 @@ _bt_findinsertloc(Relation rel,
 				/* Update local state after stepping right */
 				page = BufferGetPage(insertstate->buf);
 				lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+				/* Assume duplicates (helpful when initial page is empty) */
+				dedupunique = true;
 			}
 		}
 
 		/*
 		 * If the target page is full, see if we can obtain enough space by
-		 * erasing LP_DEAD items
+		 * erasing LP_DEAD items.  If that doesn't work out, try to obtain
+		 * enough free space to avoid a page split by deduplicating existing
+		 * items (if deduplication is safe).
 		 */
-		if (PageGetFreeSpace(page) < insertstate->itemsz &&
-			P_HAS_GARBAGE(lpageop))
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
 		{
-			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
-			insertstate->bounds_valid = false;
+			if (P_HAS_GARBAGE(lpageop))
+			{
+				_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+				insertstate->bounds_valid = false;
+
+				/* Might as well assume duplicates if checkingunique */
+				dedupunique = true;
+			}
+
+			if (itup_key->safededup && BTGetUseDedup(rel) &&
+				PageGetFreeSpace(page) < insertstate->itemsz &&
+				(!checkingunique || dedupunique))
+			{
+				_bt_dedup_one_page(rel, insertstate->buf, heapRel,
+								   insertstate->itup, insertstate->itemsz,
+								   checkingunique);
+				insertstate->bounds_valid = false;
+			}
 		}
 	}
 	else
@@ -839,7 +929,29 @@ _bt_findinsertloc(Relation rel,
 	Assert(P_RIGHTMOST(lpageop) ||
 		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
 
-	return _bt_binsrch_insert(rel, insertstate);
+	location = _bt_binsrch_insert(rel, insertstate);
+
+	/*
+	 * Insertion is not prepared for the case where an LP_DEAD posting list
+	 * tuple must be split.  In the event that this happens, kill all LP_DEAD
+	 * items by calling _bt_dedup_one_page() (this won't actually dedup).
+	 */
+	if (insertstate->postingoff == -1)
+	{
+		_bt_dedup_one_page(rel, insertstate->buf, heapRel,
+						   insertstate->itup, 0, true);
+
+		/*
+		 * Do new binary search, having killed LP_DEAD items.  New insert
+		 * location cannot overlap with any posting list now.
+		 */
+		insertstate->bounds_valid = false;
+		insertstate->postingoff = 0;
+		location = _bt_binsrch_insert(rel, insertstate);
+		Assert(insertstate->postingoff == 0);
+	}
+
+	return location;
 }
 
 /*
@@ -905,10 +1017,12 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		This recursive procedure does the following things:
  *
+ *			+  if necessary, splits an existing posting list on page.
+ *			   This is only needed when 'postingoff' is non-zero.
  *			+  if necessary, splits the target page, using 'itup_key' for
  *			   suffix truncation on leaf pages (caller passes NULL for
  *			   non-leaf pages).
- *			+  inserts the tuple.
+ *			+  inserts the new tuple (could be from split posting list).
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
  *			   right using information stored in the parent stack).
@@ -918,7 +1032,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		On entry, we must have the correct buffer in which to do the
  *		insertion, and the buffer must be pinned and write-locked.  On return,
- *		we will have dropped both the pin and the lock on the buffer.
+ *		we will have dropped both the pin and the lock on the buffer.  Caller
+ *		should be prepared for us to scribble on 'itup'.
  *
  *		This routine only performs retail tuple insertions.  'itup' should
  *		always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +1051,15 @@ _bt_insertonpg(Relation rel,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
+			   int postingoff,
 			   bool split_only_page)
 {
 	Page		page;
 	BTPageOpaque lpageop;
 	Size		itemsz;
+	IndexTuple	oposting;
+	IndexTuple	origitup = NULL;
+	IndexTuple	nposting = NULL;
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1073,8 @@ _bt_insertonpg(Relation rel,
 	Assert(P_ISLEAF(lpageop) ||
 		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
+	/* retail insertions of posting list tuples are disallowed */
+	Assert(!BTreeTupleIsPosting(itup));
 
 	/* The caller should've finished any incomplete splits already. */
 	if (P_INCOMPLETE_SPLIT(lpageop))
@@ -964,6 +1085,39 @@ _bt_insertonpg(Relation rel,
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
+	/*
+	 * Do we need to split an existing posting list item?
+	 */
+	if (postingoff != 0)
+	{
+		ItemId		itemid = PageGetItemId(page, newitemoff);
+
+		/*
+		 * The new tuple is a duplicate with a heap TID that falls inside the
+		 * range of an existing posting list tuple on a leaf page.  Prepare to
+		 * split an existing posting list by swapping new item's heap TID with
+		 * the rightmost heap TID from original posting list, and generating a
+		 * new version of the posting list that has new item's heap TID.
+		 *
+		 * Posting list splits work by modifying the overlapping posting list
+		 * as part of the same atomic operation that inserts the "new item".
+		 * The space accounting is kept simple, since it does not need to
+		 * consider posting list splits at all (this is particularly important
+		 * for the case where we also have to split the page).  Overwriting
+		 * the posting list with its post-split version is treated as an extra
+		 * step in either the insert or page split critical section.
+		 */
+		Assert(P_ISLEAF(lpageop) && !ItemIdIsDead(itemid));
+		oposting = (IndexTuple) PageGetItem(page, itemid);
+
+		/* save a copy of itup with unchanged TID for xlog record */
+		origitup = CopyIndexTuple(itup);
+		nposting = _bt_swap_posting(itup, oposting, postingoff);
+
+		/* Alter offset so that it goes after existing posting list */
+		newitemoff = OffsetNumberNext(newitemoff);
+	}
+
 	/*
 	 * Do we need to split the page to fit the item on it?
 	 *
@@ -996,7 +1150,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+						 origitup, nposting, postingoff);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1075,6 +1230,13 @@ _bt_insertonpg(Relation rel,
 			elog(PANIC, "failed to add new item to block %u in index \"%s\"",
 				 itup_blkno, RelationGetRelationName(rel));
 
+		/*
+		 * Posting list split requires an in-place update of the existing
+		 * posting list
+		 */
+		if (nposting)
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
 		MarkBufferDirty(buf);
 
 		if (BufferIsValid(metabuf))
@@ -1120,8 +1282,19 @@ _bt_insertonpg(Relation rel,
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
 
-			if (P_ISLEAF(lpageop))
+			if (P_ISLEAF(lpageop) && postingoff == 0)
+			{
+				/* Simple leaf insert */
 				xlinfo = XLOG_BTREE_INSERT_LEAF;
+			}
+			else if (postingoff != 0)
+			{
+				/*
+				 * Leaf insert with posting list split.  Must include
+				 * postingoff field before newitem/orignewitem.
+				 */
+				xlinfo = XLOG_BTREE_INSERT_POST;
+			}
 			else
 			{
 				/*
@@ -1144,6 +1317,7 @@ _bt_insertonpg(Relation rel,
 				xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 				xlmeta.last_cleanup_num_heap_tuples =
 					metad->btm_last_cleanup_num_heap_tuples;
+				xlmeta.safededup = metad->btm_safededup;
 
 				XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 				XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));
@@ -1152,7 +1326,28 @@ _bt_insertonpg(Relation rel,
 			}
 
 			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
-			XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+
+			/*
+			 * We always write newitem to the page, but when there is an
+			 * original newitem due to a posting list split then we log the
+			 * original item instead.  REDO routine must reconstruct the final
+			 * newitem at the same time it reconstructs nposting.
+			 */
+			if (postingoff == 0)
+				XLogRegisterBufData(0, (char *) itup,
+									IndexTupleSize(itup));
+			else
+			{
+				/*
+				 * Must explicitly log posting off before newitem in case of
+				 * posting list split.
+				 */
+				uint16		upostingoff = postingoff;
+
+				XLogRegisterBufData(0, (char *) &upostingoff, sizeof(uint16));
+				XLogRegisterBufData(0, (char *) origitup,
+									IndexTupleSize(origitup));
+			}
 
 			recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
@@ -1194,6 +1389,13 @@ _bt_insertonpg(Relation rel,
 			_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
 			RelationSetTargetBlock(rel, cachedBlock);
 	}
+
+	/* be tidy */
+	if (postingoff != 0)
+	{
+		pfree(nposting);
+		pfree(origitup);
+	}
 }
 
 /*
@@ -1209,12 +1411,25 @@ _bt_insertonpg(Relation rel,
  *		This function will clear the INCOMPLETE_SPLIT flag on it, and
  *		release the buffer.
  *
+ *		orignewitem, nposting, and postingoff are needed when an insert of
+ *		orignewitem results in both a posting list split and a page split.
+ *		newitem and nposting are replacements for orignewitem and the
+ *		existing posting list on the page respectively.  These extra
+ *		posting list split details are used here in the same way as they
+ *		are used in the more common case where a posting list split does
+ *		not coincide with a page split.  We need to deal with posting list
+ *		splits directly in order to ensure that everything that follows
+ *		from the insert of orignewitem is handled as a single atomic
+ *		operation (though caller's insert of a new pivot/downlink into
+ *		parent page will still be a separate operation).
+ *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
  */
 static Buffer
 _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+		  IndexTuple orignewitem, IndexTuple nposting, uint16 postingoff)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1236,12 +1451,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	OffsetNumber firstright;
 	OffsetNumber maxoff;
 	OffsetNumber i;
+	OffsetNumber replacepostingoff = InvalidOffsetNumber;
 	bool		newitemonleft,
 				isleaf;
 	IndexTuple	lefthikey;
 	int			indnatts = IndexRelationGetNumberOfAttributes(rel);
 	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
+	/*
+	 * Determine offset number of existing posting list on page when a split
+	 * of a posting list needs to take place as the page is split
+	 */
+	if (nposting != NULL)
+	{
+		Assert(itup_key->heapkeyspace);
+		replacepostingoff = OffsetNumberPrev(newitemoff);
+	}
+
 	/*
 	 * origpage is the original page to be split.  leftpage is a temporary
 	 * buffer that receives the left-sibling data, which will be copied back
@@ -1273,6 +1499,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	 * newitemoff == firstright.  In all other cases it's clear which side of
 	 * the split every tuple goes on from context.  newitemonleft is usually
 	 * (but not always) redundant information.
+	 *
+	 * Note: In theory, the split point choice logic should operate against a
+	 * version of the page that already replaced the posting list at offset
+	 * replacepostingoff with nposting where applicable.  We don't bother with
+	 * that, though.  Both versions of the posting list must be the same size,
+	 * and both will have the same base tuple key values, so split point
+	 * choice is never affected.
 	 */
 	firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
 								  newitem, &newitemonleft);
@@ -1340,6 +1573,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, firstright);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		/* Behave as if origpage posting list has already been swapped */
+		if (firstright == replacepostingoff)
+			item = nposting;
 	}
 
 	/*
@@ -1373,6 +1609,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
 			itemid = PageGetItemId(origpage, lastleftoff);
 			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+			/* Behave as if origpage posting list has already been swapped */
+			if (lastleftoff == replacepostingoff)
+				lastleft = nposting;
 		}
 
 		Assert(lastleft != item);
@@ -1388,6 +1627,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	 */
 	leftoff = P_HIKEY;
 
+	Assert(BTreeTupleIsPivot(lefthikey) || !itup_key->heapkeyspace);
 	Assert(BTreeTupleGetNAtts(lefthikey, rel) > 0);
 	Assert(BTreeTupleGetNAtts(lefthikey, rel) <= indnkeyatts);
 	if (PageAddItem(leftpage, (Item) lefthikey, itemsz, leftoff,
@@ -1480,8 +1720,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 
+		/*
+		 * did caller pass new replacement posting list tuple due to posting
+		 * list split?
+		 */
+		if (i == replacepostingoff)
+		{
+			/*
+			 * swap origpage posting list with post-posting-list-split version
+			 * from caller
+			 */
+			Assert(isleaf);
+			Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+			item = nposting;
+		}
+
 		/* does new item belong before this one? */
-		if (i == newitemoff)
+		else if (i == newitemoff)
 		{
 			if (newitemonleft)
 			{
@@ -1650,8 +1905,12 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		XLogRecPtr	recptr;
 
 		xlrec.level = ropaque->btpo.level;
+		/* See comments below on newitem, orignewitem, and posting lists */
 		xlrec.firstright = firstright;
 		xlrec.newitemoff = newitemoff;
+		xlrec.postingoff = 0;
+		if (replacepostingoff < firstright)
+			xlrec.postingoff = postingoff;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1670,11 +1929,45 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		 * because it's included with all the other items on the right page.)
 		 * Show the new item as belonging to the left page buffer, so that it
 		 * is not stored if XLogInsert decides it needs a full-page image of
-		 * the left page.  We store the offset anyway, though, to support
-		 * archive compression of these records.
+		 * the left page.  We always store newitemoff in the record, though.
+		 *
+		 * The details are sometimes slightly different for page splits that
+		 * coincide with a posting list split.  If both the replacement
+		 * posting list and newitem go on the right page, then we don't need
+		 * to log anything extra, just like the simple !newitemonleft
+		 * no-posting-split case (postingoff is set to zero in the WAL record,
+		 * so recovery doesn't need to process a posting list split at all).
+		 * Otherwise, we set postingoff and log orignewitem instead of
+		 * newitem, despite having actually inserted newitem.  Recovery must
+		 * reconstruct nposting and newitem using _bt_swap_posting().
+		 *
+		 * Note: It's possible that our page split point is the point that
+		 * makes the posting list lastleft and newitem firstright.  This is
+		 * the only case where we log orignewitem despite newitem going on the
+		 * right page.  If XLogInsert decides that it can omit orignewitem due
+		 * to logging a full-page image of the left page, everything still
+		 * works out, since recovery only needs to log orignewitem for items
+		 * on the left page (just like the regular newitem-logged case).
 		 */
-		if (newitemonleft)
-			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+		if (newitemonleft || xlrec.postingoff != 0)
+		{
+			if (xlrec.postingoff == 0)
+			{
+				/* Must WAL-log newitem, since it's on left page */
+				Assert(newitemonleft);
+				Assert(orignewitem == NULL && nposting == NULL);
+				XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+			}
+			else
+			{
+				/* Must WAL-log orignewitem following posting list split */
+				Assert(newitemonleft || firstright == newitemoff);
+				Assert(ItemPointerCompare(&orignewitem->t_tid,
+										  &newitem->t_tid) < 0);
+				XLogRegisterBufData(0, (char *) orignewitem,
+									MAXALIGN(IndexTupleSize(orignewitem)));
+			}
+		}
 
 		/* Log the left page's new high key */
 		itemid = PageGetItemId(origpage, P_HIKEY);
@@ -1834,7 +2127,7 @@ _bt_insert_parent(Relation rel,
 
 		/* Recursively insert into the parent */
 		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
-					   new_item, stack->bts_offset + 1,
+					   new_item, stack->bts_offset + 1, 0,
 					   is_only);
 
 		/* be tidy */
@@ -2190,6 +2483,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 		md.fastlevel = metad->btm_level;
 		md.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 		md.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+		md.safededup = metad->btm_safededup;
 
 		XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -2303,6 +2597,6 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * Note: if we didn't find any LP_DEAD items, then the page's
 	 * BTP_HAS_GARBAGE hint bit is falsely set.  We do not bother expending a
 	 * separate write to clear it, however.  We will clear it when we split
-	 * the page.
+	 * the page (or when deduplication runs).
 	 */
 }
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 404bad7da2..f1d0999840 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -42,12 +43,18 @@ static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
 								   BlockNumber *target, BlockNumber *rightsib);
 static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
 							   TransactionId latestRemovedXid);
+static TransactionId _bt_compute_xid_horizon_for_tuples(Relation rel,
+														Relation heapRel,
+														Buffer buf,
+														OffsetNumber *itemnos,
+														int nitems);
 
 /*
  *	_bt_initmetapage() -- Fill a page buffer with a correct metapage image
  */
 void
-_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
+_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+				 bool safededup)
 {
 	BTMetaPageData *metad;
 	BTPageOpaque metaopaque;
@@ -63,6 +70,7 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
 	metad->btm_fastlevel = level;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+	metad->btm_safededup = safededup;
 
 	metaopaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	metaopaque->btpo_flags = BTP_META;
@@ -102,6 +110,9 @@ _bt_upgrademetapage(Page page)
 	metad->btm_version = BTREE_NOVAC_VERSION;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+	/* Only a REINDEX can set this field */
+	Assert(!metad->btm_safededup);
+	metad->btm_safededup = false;
 
 	/* Adjust pd_lower (see _bt_initmetapage() for details) */
 	((PageHeader) page)->pd_lower =
@@ -213,6 +224,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 		md.fastlevel = metad->btm_fastlevel;
 		md.oldest_btpo_xact = oldestBtpoXact;
 		md.last_cleanup_num_heap_tuples = numHeapTuples;
+		md.safededup = metad->btm_safededup;
 
 		XLogRegisterBufData(0, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -274,6 +286,8 @@ _bt_getroot(Relation rel, int access)
 		Assert(metad->btm_magic == BTREE_MAGIC);
 		Assert(metad->btm_version >= BTREE_MIN_VERSION);
 		Assert(metad->btm_version <= BTREE_VERSION);
+		Assert(!metad->btm_safededup ||
+			   metad->btm_version > BTREE_NOVAC_VERSION);
 		Assert(metad->btm_root != P_NONE);
 
 		rootblkno = metad->btm_fastroot;
@@ -394,6 +408,7 @@ _bt_getroot(Relation rel, int access)
 			md.fastlevel = 0;
 			md.oldest_btpo_xact = InvalidTransactionId;
 			md.last_cleanup_num_heap_tuples = -1.0;
+			md.safededup = metad->btm_safededup;
 
 			XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -618,22 +633,33 @@ _bt_getrootheight(Relation rel)
 	Assert(metad->btm_magic == BTREE_MAGIC);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
 	Assert(metad->btm_version <= BTREE_VERSION);
+	Assert(!metad->btm_safededup || metad->btm_version > BTREE_NOVAC_VERSION);
 	Assert(metad->btm_fastroot != P_NONE);
 
 	return metad->btm_fastlevel;
 }
 
 /*
- *	_bt_heapkeyspace() -- is heap TID being treated as a key?
+ *	_bt_metaversion() -- Get version/status info from metapage.
+ *
+ *		Sets caller's *heapkeyspace and *safededup arguments using data from
+ *		the B-Tree metapage (could be locally-cached version).  This
+ *		information needs to be stashed in insertion scankey, so we provide a
+ *		single function that fetches both at once.
  *
  *		This is used to determine the rules that must be used to descend a
  *		btree.  Version 4 indexes treat heap TID as a tiebreaker attribute.
  *		pg_upgrade'd version 3 indexes need extra steps to preserve reasonable
  *		performance when inserting a new BTScanInsert-wise duplicate tuple
  *		among many leaf pages already full of such duplicates.
+ *
+ *		Also sets field that indicates to caller whether or not it is safe to
+ *		apply deduplication within index.  Note that we rely on the assumption
+ *		that btm_safededup will be zero'ed on heapkeyspace indexes that were
+ *		pg_upgrade'd from Postgres 12.
  */
-bool
-_bt_heapkeyspace(Relation rel)
+void
+_bt_metaversion(Relation rel, bool *heapkeyspace, bool *safededup)
 {
 	BTMetaPageData *metad;
 
@@ -651,10 +677,11 @@ _bt_heapkeyspace(Relation rel)
 		 */
 		if (metad->btm_root == P_NONE)
 		{
-			uint32		btm_version = metad->btm_version;
+			*heapkeyspace = metad->btm_version > BTREE_NOVAC_VERSION;
+			*safededup = metad->btm_safededup;
 
 			_bt_relbuf(rel, metabuf);
-			return btm_version > BTREE_NOVAC_VERSION;
+			return;
 		}
 
 		/*
@@ -678,9 +705,11 @@ _bt_heapkeyspace(Relation rel)
 	Assert(metad->btm_magic == BTREE_MAGIC);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
 	Assert(metad->btm_version <= BTREE_VERSION);
+	Assert(!metad->btm_safededup || metad->btm_version > BTREE_NOVAC_VERSION);
 	Assert(metad->btm_fastroot != P_NONE);
 
-	return metad->btm_version > BTREE_NOVAC_VERSION;
+	*heapkeyspace = metad->btm_version > BTREE_NOVAC_VERSION;
+	*safededup = metad->btm_safededup;
 }
 
 /*
@@ -968,28 +997,78 @@ _bt_page_recyclable(Page page)
  * deleting the page it points to.
  *
  * This routine assumes that the caller has pinned and locked the buffer.
- * Also, the given deletable array *must* be sorted in ascending order.
+ * Also, the given deletable and updatable arrays *must* be sorted in
+ * ascending order.
  *
  * We record VACUUMs and b-tree deletes differently in WAL.  Deletes must
  * generate recovery conflicts by accessing the heap inline, whereas VACUUMs
  * can rely on the initial heap scan taking care of the problem (pruning would
- * have generated the conflicts needed for hot standby already).
+ * have generated the conflicts needed for hot standby already).  Also,
+ * VACUUMs must deal with the case where posting list tuples have some dead
+ * TIDs, and some remaining TIDs that must not be killed.
  */
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
-					OffsetNumber *deletable, int ndeletable)
+					OffsetNumber *deletable, int ndeletable,
+					OffsetNumber *updatable, IndexTuple *updated,
+					int nupdatable)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	Size		itemsz;
+	Size		updated_sz = 0;
+	char	   *updated_buf = NULL;
 
 	/* Shouldn't be called unless there's something to do */
-	Assert(ndeletable > 0);
+	Assert(ndeletable > 0 || nupdatable > 0);
+
+	/* XLOG stuff, buffer for updated */
+	if (nupdatable > 0 && RelationNeedsWAL(rel))
+	{
+		Size		offset = 0;
+
+		for (int i = 0; i < nupdatable; i++)
+			updated_sz += MAXALIGN(IndexTupleSize(updated[i]));
+
+		updated_buf = palloc(updated_sz);
+		for (int i = 0; i < nupdatable; i++)
+		{
+			itemsz = IndexTupleSize(updated[i]);
+			memcpy(updated_buf + offset, (char *) updated[i], itemsz);
+			offset += MAXALIGN(itemsz);
+		}
+		Assert(offset == updated_sz);
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
+	/* Handle posting tuple updates */
+	for (int i = 0; i < nupdatable; i++)
+	{
+		/*
+		 * Delete the old posting tuple first.
+		 *
+		 * This will also clear the LP_DEAD bit, which can legitimately be set
+		 * by a backend even when VACUUM doesn't consider all the logical
+		 * tuples dead.  (We do this to be consistent.  It would be correct to
+		 * leave it set, but we're going to unset the BTP_HAS_GARBAGE bit
+		 * anyway.)
+		 */
+		PageIndexTupleDelete(page, updatable[i]);
+
+		itemsz = IndexTupleSize(updated[i]);
+		itemsz = MAXALIGN(itemsz);
+
+		/* Add tuple with updated ItemPointers to the page */
+		if (PageAddItem(page, (Item) updated[i], itemsz, updatable[i], false,
+						false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to rewrite posting list item in index while doing vacuum");
+	}
+
 	/* Fix the page */
-	PageIndexMultiDelete(page, deletable, ndeletable);
+	if (ndeletable > 0)
+		PageIndexMultiDelete(page, deletable, ndeletable);
 
 	/*
 	 * We can clear the vacuum cycle ID since this page has certainly been
@@ -1016,6 +1095,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.ndeleted = ndeletable;
+		xlrec_vacuum.nupdated = nupdatable;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1026,8 +1106,21 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		 * is.  When XLogInsert stores the whole buffer, the offsets array
 		 * need not be stored too.
 		 */
-		XLogRegisterBufData(0, (char *) deletable,
-							ndeletable * sizeof(OffsetNumber));
+		if (ndeletable > 0)
+			XLogRegisterBufData(0, (char *) deletable,
+								ndeletable * sizeof(OffsetNumber));
+
+		/*
+		 * Here we should save offnums and updated tuples themselves. It's
+		 * important to restore them in correct order. At first, we must
+		 * handle updated tuples and only after that other deleted items.
+		 */
+		if (nupdatable > 0)
+		{
+			XLogRegisterBufData(0, (char *) updatable,
+								nupdatable * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, updated_buf, updated_sz);
+		}
 
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
@@ -1035,6 +1128,10 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	}
 
 	END_CRIT_SECTION();
+
+	/* can't leak memory here */
+	if (updated_buf != NULL)
+		pfree(updated_buf);
 }
 
 /*
@@ -1047,7 +1144,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
  *
  * This is nearly the same as _bt_delitems_vacuum as far as what it does to
  * the page, but it needs to generate its own recovery conflicts by accessing
- * the heap.  See comments for _bt_delitems_vacuum.
+ * the heap, and doesn't handle updating posting list tuples.  See comments
+ * for _bt_delitems_vacuum.
  */
 void
 _bt_delitems_delete(Relation rel, Buffer buf,
@@ -1063,8 +1161,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 
 	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
 		latestRemovedXid =
-			index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
-												 itemnos, nitems);
+			_bt_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+											   itemnos, nitems);
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
@@ -1117,6 +1215,92 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 	END_CRIT_SECTION();
 }
 
+/*
+ * Get the latestRemovedXid from the table entries pointed at by the index
+ * tuples being deleted.
+ *
+ * This is a version of index_compute_xid_horizon_for_tuples() specialized to
+ * nbtree, which can handle posting lists.
+ */
+static TransactionId
+_bt_compute_xid_horizon_for_tuples(Relation rel, Relation heapRel,
+								   Buffer buf, OffsetNumber *itemnos,
+								   int nitems)
+{
+	ItemPointer htids;
+	TransactionId latestRemovedXid = InvalidTransactionId;
+	Page		page = BufferGetPage(buf);
+	int			arraynitems;
+	int			finalnitems;
+
+	/*
+	 * Initial size of array can fit everything when it turns out that are no
+	 * posting lists
+	 */
+	arraynitems = nitems;
+	htids = (ItemPointer) palloc(sizeof(ItemPointerData) * arraynitems);
+
+	finalnitems = 0;
+	/* identify what the index tuples about to be deleted point to */
+	for (int i = 0; i < nitems; i++)
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+
+		itemid = PageGetItemId(page, itemnos[i]);
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(ItemIdIsDead(itemid));
+
+		if (!BTreeTupleIsPosting(itup))
+		{
+			/* Make sure that we have space for additional heap TID */
+			if (finalnitems + 1 > arraynitems)
+			{
+				arraynitems = arraynitems * 2;
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			Assert(ItemPointerIsValid(&itup->t_tid));
+			ItemPointerCopy(&itup->t_tid, &htids[finalnitems]);
+			finalnitems++;
+		}
+		else
+		{
+			int			nposting = BTreeTupleGetNPosting(itup);
+
+			/* Make sure that we have space for additional heap TIDs */
+			if (finalnitems + nposting > arraynitems)
+			{
+				arraynitems = Max(arraynitems * 2, finalnitems + nposting);
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			for (int j = 0; j < nposting; j++)
+			{
+				ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+				Assert(ItemPointerIsValid(htid));
+				ItemPointerCopy(htid, &htids[finalnitems]);
+				finalnitems++;
+			}
+		}
+	}
+
+	Assert(finalnitems >= nitems);
+
+	/* determine the actual xid horizon */
+	latestRemovedXid =
+		table_compute_xid_horizon_for_tuples(heapRel, htids, finalnitems);
+
+	/* be tidy */
+	pfree(htids);
+
+	return latestRemovedXid;
+}
+
 /*
  * Returns true, if the given block has the half-dead flag set.
  */
@@ -2062,6 +2246,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 			xlmeta.fastlevel = metad->btm_fastlevel;
 			xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 			xlmeta.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+			xlmeta.safededup = metad->btm_safededup;
 
 			XLogRegisterBufData(4, (char *) &xlmeta, sizeof(xl_btree_metadata));
 			xlinfo = XLOG_BTREE_UNLINK_PAGE_META;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 065b5290b0..6ef992ac02 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -95,6 +95,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static ItemPointer btreevacuumposting(BTVacState *vstate, IndexTuple itup,
+									  int *nremaining);
 
 
 /*
@@ -158,7 +160,7 @@ btbuildempty(Relation index)
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, P_NONE, 0);
+	_bt_initmetapage(metapage, P_NONE, 0, _bt_opclasses_support_dedup(index));
 
 	/*
 	 * Write the page and log it.  It might seem that an immediate sync would
@@ -261,8 +263,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 				 */
 				if (so->killedItems == NULL)
 					so->killedItems = (int *)
-						palloc(MaxIndexTuplesPerPage * sizeof(int));
-				if (so->numKilled < MaxIndexTuplesPerPage)
+						palloc(MaxBTreeIndexTuplesPerPage * sizeof(int));
+				if (so->numKilled < MaxBTreeIndexTuplesPerPage)
 					so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 			}
 
@@ -1151,8 +1153,17 @@ restart:
 	}
 	else if (P_ISLEAF(opaque))
 	{
+		/* Deletable item state */
 		OffsetNumber deletable[MaxOffsetNumber];
 		int			ndeletable;
+		int			nhtidsdead;
+		int			nhtidslive;
+
+		/* Updatable item state (for posting lists) */
+		IndexTuple	updated[MaxOffsetNumber];
+		OffsetNumber updatable[MaxOffsetNumber];
+		int			nupdatable;
+
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
@@ -1187,6 +1198,10 @@ restart:
 		 * point using callback.
 		 */
 		ndeletable = 0;
+		nupdatable = 0;
+		/* Maintain whole-page stat counters for live/dead heap TIDs */
+		nhtidslive = 0;
+		nhtidsdead = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
 		if (callback)
@@ -1196,11 +1211,9 @@ restart:
 				 offnum = OffsetNumberNext(offnum))
 			{
 				IndexTuple	itup;
-				ItemPointer htup;
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
 				/*
 				 * Hot Standby assumes that it's okay that XLOG_BTREE_VACUUM
@@ -1222,8 +1235,71 @@ restart:
 				 * simple, and allows us to always avoid generating our own
 				 * conflicts.
 				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				Assert(!BTreeTupleIsPivot(itup));
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Regular tuple, standard heap TID representation */
+					ItemPointer htid = &(itup->t_tid);
+
+					if (callback(htid, callback_state))
+					{
+						deletable[ndeletable++] = offnum;
+						nhtidsdead++;
+					}
+					else
+						nhtidslive++;
+				}
+				else
+				{
+					ItemPointer newhtids;
+					int			nremaining;
+
+					/*
+					 * Posting list tuple, a physical tuple that represents
+					 * two or more logical tuples, any of which could be an
+					 * index row version that must be removed
+					 */
+					newhtids = btreevacuumposting(vstate, itup, &nremaining);
+					if (newhtids == NULL)
+					{
+						/*
+						 * All TIDs/logical tuples from the posting tuple
+						 * remain, so no update or delete required
+						 */
+						Assert(nremaining == BTreeTupleGetNPosting(itup));
+					}
+					else if (nremaining > 0)
+					{
+						IndexTuple	updatedtuple;
+
+						/*
+						 * Form new tuple that contains only remaining TIDs.
+						 * Remember this tuple and the offset of the old tuple
+						 * for when we update it in place
+						 */
+						Assert(nremaining < BTreeTupleGetNPosting(itup));
+						updatedtuple = _bt_form_posting(itup, newhtids,
+														nremaining);
+						updated[nupdatable] = updatedtuple;
+						updatable[nupdatable++] = offnum;
+						nhtidsdead += BTreeTupleGetNPosting(itup) - nremaining;
+						pfree(newhtids);
+					}
+					else
+					{
+						/*
+						 * All TIDs/logical tuples from the posting list must
+						 * be deleted.  We'll delete the physical tuple
+						 * completely.
+						 */
+						Assert(nremaining == 0);
+						deletable[ndeletable++] = offnum;
+						nhtidsdead += BTreeTupleGetNPosting(itup);
+						pfree(newhtids);
+					}
+
+					nhtidslive += nremaining;
+				}
 			}
 		}
 
@@ -1231,13 +1307,18 @@ restart:
 		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
 		 * call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nupdatable > 0)
 		{
-			_bt_delitems_vacuum(rel, buf, deletable, ndeletable);
+			_bt_delitems_vacuum(rel, buf, deletable, ndeletable, updatable,
+								updated, nupdatable);
 
-			stats->tuples_removed += ndeletable;
+			stats->tuples_removed += nhtidsdead;
 			/* must recompute maxoff */
 			maxoff = PageGetMaxOffsetNumber(page);
+
+			/* can't leak memory here */
+			for (int i = 0; i < nupdatable; i++)
+				pfree(updated[i]);
 		}
 		else
 		{
@@ -1250,6 +1331,7 @@ restart:
 			 * We treat this like a hint-bit update because there's no need to
 			 * WAL-log it.
 			 */
+			Assert(nhtidsdead == 0);
 			if (vstate->cycleid != 0 &&
 				opaque->btpo_cycleid == vstate->cycleid)
 			{
@@ -1259,15 +1341,16 @@ restart:
 		}
 
 		/*
-		 * If it's now empty, try to delete; else count the live tuples. We
-		 * don't delete when recursing, though, to avoid putting entries into
+		 * If it's now empty, try to delete; else count the live tuples (live
+		 * heap TIDs in posting lists are counted as live tuples).  We don't
+		 * delete when recursing, though, to avoid putting entries into
 		 * freePages out-of-order (doesn't seem worth any extra code to handle
 		 * the case).
 		 */
 		if (minoff > maxoff)
 			delete_now = (blkno == orig_blkno);
 		else
-			stats->num_index_tuples += maxoff - minoff + 1;
+			stats->num_index_tuples += nhtidslive;
 	}
 
 	if (delete_now)
@@ -1299,9 +1382,10 @@ restart:
 	/*
 	 * This is really tail recursion, but if the compiler is too stupid to
 	 * optimize it as such, we'd eat an uncomfortably large amount of stack
-	 * space per recursion level (due to the deletable[] array). A failure is
-	 * improbable since the number of levels isn't likely to be large ... but
-	 * just in case, let's hand-optimize into a loop.
+	 * space per recursion level (due to the arrays used to track details of
+	 * deletable/updatable items).  A failure is improbable since the number
+	 * of levels isn't likely to be large ...  but just in case, let's
+	 * hand-optimize into a loop.
 	 */
 	if (recurse_to != P_NONE)
 	{
@@ -1310,6 +1394,68 @@ restart:
 	}
 }
 
+/*
+ * btreevacuumposting() -- determines which logical tuples must remain when
+ * VACUUMing a posting list tuple.
+ *
+ * Returns new palloc'd array of item pointers needed to build replacement
+ * posting list without the index row versions that are to be deleted.
+ *
+ * Note that returned array is NULL in the common case where there is nothing
+ * to delete in caller's posting list tuple.  The number of TIDs that should
+ * remain in the posting list tuple is set for caller in *nremaining.  This is
+ * also the size of the returned array (though only when array isn't just
+ * NULL).
+ */
+static ItemPointer
+btreevacuumposting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+	int			live = 0;
+	int			nitem = BTreeTupleGetNPosting(itup);
+	ItemPointer tmpitems = NULL,
+				items = BTreeTupleGetPosting(itup);
+
+	Assert(BTreeTupleIsPosting(itup));
+
+	/*
+	 * Check each tuple in the posting list.  Save live tuples into tmpitems,
+	 * though try to avoid memory allocation as an optimization.
+	 */
+	for (int i = 0; i < nitem; i++)
+	{
+		if (!vstate->callback(items + i, vstate->callback_state))
+		{
+			/*
+			 * Live heap TID.
+			 *
+			 * Only save live TID when we know that we're going to have to
+			 * kill at least one TID, and have already allocated memory.
+			 */
+			if (tmpitems)
+				tmpitems[live] = items[i];
+			live++;
+		}
+
+		/* Dead heap TID */
+		else if (tmpitems == NULL)
+		{
+			/*
+			 * Turns out we need to delete one or more dead heap TIDs, so
+			 * start maintaining an array of live TIDs for caller to
+			 * reconstruct smaller replacement posting list tuple
+			 */
+			tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+			/* Copy live heap TIDs from previous loop iterations */
+			if (live > 0)
+				memcpy(tmpitems, items, sizeof(ItemPointerData) * live);
+		}
+	}
+
+	*nremaining = live;
+	return tmpitems;
+}
+
 /*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index b62648d935..def890e4d7 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int	_bt_binsrch_posting(BTScanInsert key, Page page,
+								OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static int	_bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+								  OffsetNumber offnum, ItemPointer heapTid,
+								  IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+									   OffsetNumber offnum,
+									   ItemPointer heapTid, int tupleOffset);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -434,7 +442,10 @@ _bt_binsrch(Relation rel,
  * low) makes bounds invalid.
  *
  * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
  */
 OffsetNumber
 _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +464,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 	Assert(P_ISLEAF(opaque));
 	Assert(!key->nextkey);
+	Assert(insertstate->postingoff == 0);
 
 	if (!insertstate->bounds_valid)
 	{
@@ -509,6 +521,16 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 			if (result != 0)
 				stricthigh = high;
 		}
+
+		/*
+		 * If tuple at offset located by binary search is a posting list whose
+		 * TID range overlaps with caller's scantid, perform posting list
+		 * binary search to set postingoff for caller.  Caller must split the
+		 * posting list when postingoff is set.  This should happen
+		 * infrequently.
+		 */
+		if (unlikely(result == 0 && key->scantid != NULL))
+			insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
 	}
 
 	/*
@@ -528,6 +550,73 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	return low;
 }
 
+/*----------
+ *	_bt_binsrch_posting() -- posting list binary search.
+ *
+ * Helper routine for _bt_binsrch_insert().
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+	IndexTuple	itup;
+	ItemId		itemid;
+	int			low,
+				high,
+				mid,
+				res;
+
+	/*
+	 * If this isn't a posting tuple, then the index must be corrupt (if it is
+	 * an ordinary non-pivot tuple then there must be an existing tuple with a
+	 * heap TID that equals inserter's new heap TID/scantid).  Defensively
+	 * check that tuple is a posting list tuple whose posting list range
+	 * includes caller's scantid.
+	 *
+	 * (This is also needed because contrib/amcheck's rootdescend option needs
+	 * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+	 */
+	itemid = PageGetItemId(page, offnum);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+	if (!BTreeTupleIsPosting(itup))
+		return 0;
+
+	Assert(key->safededup);
+
+	/*
+	 * In the event that posting list tuple has LP_DEAD bit set, indicate this
+	 * to _bt_binsrch_insert() caller by returning -1, a sentinel value.  A
+	 * second call to _bt_binsrch_insert() can take place when its caller has
+	 * removed the dead item.
+	 */
+	if (ItemIdIsDead(itemid))
+		return -1;
+
+	/* "high" is past end of posting list for loop invariant */
+	low = 0;
+	high = BTreeTupleGetNPosting(itup);
+	Assert(high >= 2);
+
+	while (high > low)
+	{
+		mid = low + ((high - low) / 2);
+		res = ItemPointerCompare(key->scantid,
+								 BTreeTupleGetPostingN(itup, mid));
+
+		if (res > 0)
+			low = mid + 1;
+		else if (res < 0)
+			high = mid;
+		else
+			return mid;
+	}
+
+	/* Exact match not found */
+	return low;
+}
+
 /*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
@@ -537,9 +626,14 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
  *			<0 if scankey < tuple at offnum;
  *			 0 if scankey == tuple at offnum;
  *			>0 if scankey > tuple at offnum.
- *		NULLs in the keys are treated as sortable values.  Therefore
- *		"equality" does not necessarily mean that the item should be
- *		returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values.  Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key.  Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid.  There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
  *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
@@ -563,6 +657,7 @@ _bt_compare(Relation rel,
 	ScanKey		scankey;
 	int			ncmpkey;
 	int			ntupatts;
+	int32		result;
 
 	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +692,6 @@ _bt_compare(Relation rel,
 	{
 		Datum		datum;
 		bool		isNull;
-		int32		result;
 
 		datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
 
@@ -713,8 +807,25 @@ _bt_compare(Relation rel,
 	if (heapTid == NULL)
 		return 1;
 
+	/*
+	 * scankey must be treated as equal to a posting list tuple if its scantid
+	 * value falls within the range of the posting list.  In all other cases
+	 * there can only be a single heap TID value, which is compared directly
+	 * as a simple scalar value.
+	 */
 	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	return ItemPointerCompare(key->scantid, heapTid);
+	result = ItemPointerCompare(key->scantid, heapTid);
+	if (result <= 0 || !BTreeTupleIsPosting(itup))
+		return result;
+	else
+	{
+		result = ItemPointerCompare(key->scantid,
+									BTreeTupleGetMaxHeapTID(itup));
+		if (result > 0)
+			return 1;
+	}
+
+	return 0;
 }
 
 /*
@@ -1229,7 +1340,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	}
 
 	/* Initialize remaining insertion scan key fields */
-	inskey.heapkeyspace = _bt_heapkeyspace(rel);
+	_bt_metaversion(rel, &inskey.heapkeyspace, &inskey.safededup);
 	inskey.anynullkeys = false; /* unused */
 	inskey.nextkey = nextkey;
 	inskey.pivotsearch = false;
@@ -1484,9 +1595,35 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
-				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				/* tuple passes all scan key conditions */
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Remember it */
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+				else
+				{
+					int			tupleOffset;
+
+					/*
+					 * Set up state to return posting list, and remember first
+					 * "logical" tuple
+					 */
+					tupleOffset =
+						_bt_setuppostingitems(so, itemIndex, offnum,
+											  BTreeTupleGetPostingN(itup, 0),
+											  itup);
+					itemIndex++;
+					/* Remember additional logical tuples */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											tupleOffset);
+						itemIndex++;
+					}
+				}
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1519,7 +1656,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxBTreeIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1527,7 +1664,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxBTreeIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1568,9 +1705,41 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 										 &continuescan);
 			if (passes_quals && tuple_alive)
 			{
-				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				/* tuple passes all scan key conditions */
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Remember it */
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+				else
+				{
+					int			tupleOffset;
+
+					/*
+					 * Set up state to return posting list, and remember first
+					 * "logical" tuple.
+					 *
+					 * Note that we deliberately save/return items from
+					 * posting lists in ascending heap TID order for backwards
+					 * scans.  This allows _bt_killitems() to make a
+					 * consistent assumption about the order of items
+					 * associated with the same posting list tuple.
+					 */
+					itemIndex--;
+					tupleOffset =
+						_bt_setuppostingitems(so, itemIndex, offnum,
+											  BTreeTupleGetPostingN(itup, 0),
+											  itup);
+					/* Remember additional logical tuples */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						itemIndex--;
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											tupleOffset);
+					}
+				}
 			}
 			if (!continuescan)
 			{
@@ -1584,8 +1753,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxBTreeIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxBTreeIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1767,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPosting(itup) && !BTreeTupleIsPivot(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1610,6 +1781,71 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 	}
 }
 
+/*
+ * Setup state to save posting items from a single posting list tuple.  Saves
+ * the logical tuple that will be returned to scan first in passing.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for logical tuple
+ * that is returned to scan first.  Second or subsequent heap TID for posting
+ * list should be saved by calling _bt_savepostingitem().
+ *
+ * Returns an offset into tuple storage space that main tuple is stored at if
+ * needed.
+ */
+static int
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					  ItemPointer heapTid, IndexTuple itup)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	Assert(BTreeTupleIsPosting(itup));
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+	if (so->currTuples)
+	{
+		/* Save base IndexTuple (truncate posting list) */
+		IndexTuple	base;
+		Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+		itupsz = MAXALIGN(itupsz);
+		currItem->tupleOffset = so->currPos.nextTupleOffset;
+		base = (IndexTuple) (so->currTuples + so->currPos.nextTupleOffset);
+		memcpy(base, itup, itupsz);
+		/* Defensively reduce work area index tuple header size */
+		base->t_info &= ~INDEX_SIZE_MASK;
+		base->t_info |= itupsz;
+		so->currPos.nextTupleOffset += itupsz;
+
+		return currItem->tupleOffset;
+	}
+
+	return 0;
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer heapTid, int tupleOffset)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+
+	/*
+	 * Have index-only scans return the same base IndexTuple for every logical
+	 * tuple that originates from the same posting list
+	 */
+	if (so->currTuples)
+		currItem->tupleOffset = tupleOffset;
+}
+
 /*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index c8110a130a..c56e2f8f03 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -243,6 +243,7 @@ typedef struct BTPageState
 	BlockNumber btps_blkno;		/* block # to write this page at */
 	IndexTuple	btps_lowkey;	/* page's strict lower bound pivot tuple */
 	OffsetNumber btps_lastoff;	/* last item offset loaded */
+	Size		btps_lastextra; /* last item's extra posting list space */
 	uint32		btps_level;		/* tree level (0 = leaf) */
 	Size		btps_full;		/* "full" if less than this much free space */
 	struct BTPageState *btps_next;	/* link to parent level, if any */
@@ -277,7 +278,10 @@ static void _bt_slideleft(Page page);
 static void _bt_sortaddtup(Page page, Size itemsize,
 						   IndexTuple itup, OffsetNumber itup_off);
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
-						 IndexTuple itup);
+						 IndexTuple itup, Size truncextra);
+static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
+										  BTPageState *state,
+										  BTDedupState dstate);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
@@ -711,6 +715,7 @@ _bt_pagestate(BTWriteState *wstate, uint32 level)
 	state->btps_lowkey = NULL;
 	/* initialize lastoff so first item goes into P_FIRSTKEY */
 	state->btps_lastoff = P_HIKEY;
+	state->btps_lastextra = 0;
 	state->btps_level = level;
 	/* set "full" threshold based on level.  See notes at head of file. */
 	if (level > 0)
@@ -789,7 +794,8 @@ _bt_sortaddtup(Page page,
 }
 
 /*----------
- * Add an item to a disk page from the sort output.
+ * Add an item to a disk page from the sort output (or add a posting list
+ * item formed from the sort output).
  *
  * We must be careful to observe the page layout conventions of nbtsearch.c:
  * - rightmost pages start data items at P_HIKEY instead of at P_FIRSTKEY.
@@ -821,14 +827,27 @@ _bt_sortaddtup(Page page,
  * the truncated high key at offset 1.
  *
  * 'last' pointer indicates the last offset added to the page.
+ *
+ * 'truncextra' is the size of the posting list in itup, if any.  This
+ * information is stashed for the next call here, when we may benefit
+ * from considering the impact of truncating away the posting list on
+ * the page before deciding to finish the page off.  Posting lists are
+ * often relatively large, so it is worth going to the trouble of
+ * accounting for the saving from truncating away the posting list of
+ * the tuple that becomes the high key (that may be the only way to
+ * get close to target free space on the page).  Note that this is
+ * only used for the soft fillfactor-wise limit, not the critical hard
+ * limit.
  *----------
  */
 static void
-_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
+_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
+			 Size truncextra)
 {
 	Page		npage;
 	BlockNumber nblkno;
 	OffsetNumber last_off;
+	Size		last_truncextra;
 	Size		pgspc;
 	Size		itupsz;
 	bool		isleaf;
@@ -842,6 +861,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	npage = state->btps_page;
 	nblkno = state->btps_blkno;
 	last_off = state->btps_lastoff;
+	last_truncextra = state->btps_lastextra;
+	state->btps_lastextra = truncextra;
 
 	pgspc = PageGetFreeSpace(npage);
 	itupsz = IndexTupleSize(itup);
@@ -883,10 +904,10 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	 * page.  Disregard fillfactor and insert on "full" current page if we
 	 * don't have the minimum number of items yet.  (Note that we deliberately
 	 * assume that suffix truncation neither enlarges nor shrinks new high key
-	 * when applying soft limit.)
+	 * when applying soft limit, except when last tuple had a posting list.)
 	 */
 	if (pgspc < itupsz + (isleaf ? MAXALIGN(sizeof(ItemPointerData)) : 0) ||
-		(pgspc < state->btps_full && last_off > P_FIRSTKEY))
+		(pgspc + last_truncextra < state->btps_full && last_off > P_FIRSTKEY))
 	{
 		/*
 		 * Finish off the page and write it out.
@@ -944,11 +965,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * We don't try to bias our choice of split point to make it more
 			 * likely that _bt_truncate() can truncate away more attributes,
 			 * whereas the split point used within _bt_split() is chosen much
-			 * more delicately.  Suffix truncation is mostly useful because it
-			 * improves space utilization for workloads with random
-			 * insertions.  It doesn't seem worthwhile to add logic for
-			 * choosing a split point here for a benefit that is bound to be
-			 * much smaller.
+			 * more delicately.  On the other hand, non-unique index builds
+			 * usually deduplicate, which often results in every "physical"
+			 * tuple on the page having distinct key values.  When that
+			 * happens, _bt_truncate() will never need to include a heap TID
+			 * in the new high key.
 			 *
 			 * Overwrite the old item with new truncated high key directly.
 			 * oitup is already located at the physical beginning of tuple
@@ -983,7 +1004,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(BTreeTupleGetNAtts(state->btps_lowkey, wstate->index) == 0 ||
 			   !P_LEFTMOST((BTPageOpaque) PageGetSpecialPointer(opage)));
 		BTreeTupleSetDownLink(state->btps_lowkey, oblkno);
-		_bt_buildadd(wstate, state->btps_next, state->btps_lowkey);
+		_bt_buildadd(wstate, state->btps_next, state->btps_lowkey, 0);
 		pfree(state->btps_lowkey);
 
 		/*
@@ -1045,6 +1066,47 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	state->btps_lastoff = last_off;
 }
 
+/*
+ * Finalize pending posting list tuple, and add it to the index.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * This is almost like _bt_dedup_finish_pending(), but it adds a new tuple
+ * using _bt_buildadd().
+ */
+static void
+_bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
+							  BTDedupState dstate)
+{
+	IndexTuple	final;
+	Size		truncextra;
+
+	Assert(dstate->nitems > 0);
+	truncextra = 0;
+	if (dstate->nitems == 1)
+		final = dstate->base;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = _bt_form_posting(dstate->base,
+										dstate->htids,
+										dstate->nhtids);
+		final = postingtuple;
+		/* Determine size of posting list */
+		truncextra = IndexTupleSize(final) -
+			BTreeTupleGetPostingOffset(final);
+	}
+
+	_bt_buildadd(wstate, state, final, truncextra);
+
+	if (dstate->nitems > 1)
+		pfree(final);
+	dstate->nhtids = 0;
+	dstate->nitems = 0;
+	dstate->phystupsize = 0;
+}
+
 /*
  * Finish writing out the completed btree.
  */
@@ -1090,7 +1152,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 			Assert(BTreeTupleGetNAtts(s->btps_lowkey, wstate->index) == 0 ||
 				   !P_LEFTMOST(opaque));
 			BTreeTupleSetDownLink(s->btps_lowkey, blkno);
-			_bt_buildadd(wstate, s->btps_next, s->btps_lowkey);
+			_bt_buildadd(wstate, s->btps_next, s->btps_lowkey, 0);
 			pfree(s->btps_lowkey);
 			s->btps_lowkey = NULL;
 		}
@@ -1111,7 +1173,8 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 	 * by filling in a valid magic number in the metapage.
 	 */
 	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, rootblkno, rootlevel);
+	_bt_initmetapage(metapage, rootblkno, rootlevel,
+					 wstate->inskey->safededup);
 	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
 }
 
@@ -1132,6 +1195,9 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		deduplicate;
+
+	deduplicate = wstate->inskey->safededup && BTGetUseDedup(wstate->index);
 
 	if (merge)
 	{
@@ -1228,12 +1294,12 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 
 			if (load1)
 			{
-				_bt_buildadd(wstate, state, itup);
+				_bt_buildadd(wstate, state, itup, 0);
 				itup = tuplesort_getindextuple(btspool->sortstate, true);
 			}
 			else
 			{
-				_bt_buildadd(wstate, state, itup2);
+				_bt_buildadd(wstate, state, itup2, 0);
 				itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
 			}
 
@@ -1243,9 +1309,111 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		}
 		pfree(sortKeys);
 	}
+	else if (deduplicate)
+	{
+		/* merge is unnecessary, deduplicate into posting lists */
+		BTDedupState dstate;
+		IndexTuple	newbase;
+
+		dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
+		dstate->maxitemsize = 0;	/* set later */
+		dstate->checkingunique = false; /* unused */
+		dstate->skippedbase = InvalidOffsetNumber;
+		/* Metadata about base tuple of current pending posting list */
+		dstate->base = NULL;
+		dstate->baseoff = InvalidOffsetNumber;	/* unused */
+		dstate->basetupsize = 0;
+		/* Metadata about current pending posting list TIDs */
+		dstate->htids = NULL;
+		dstate->nhtids = 0;
+		dstate->nitems = 0;
+		dstate->phystupsize = 0;	/* unused */
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate,
+											   true)) != NULL)
+		{
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+			{
+				state = _bt_pagestate(wstate, 0);
+
+				/*
+				 * Limit size of posting list tuples to the size of the free
+				 * space we want to leave behind on the page, plus space for
+				 * final item's line pointer (but make sure that posting list
+				 * tuple size won't exceed the generic 1/3 of a page limit).
+				 *
+				 * This is more conservative than the approach taken in the
+				 * retail insert path, but it allows us to get most of the
+				 * space savings deduplication provides without noticeably
+				 * impacting how much free space is left behind on each leaf
+				 * page.
+				 */
+				dstate->maxitemsize =
+					Min(Min(BTMaxItemSize(state->btps_page), INDEX_SIZE_MASK),
+						MAXALIGN_DOWN(state->btps_full) - sizeof(ItemIdData));
+				/* Minimum posting tuple size used here is arbitrary: */
+				dstate->maxitemsize = Max(dstate->maxitemsize, 100);
+				dstate->htids = palloc(dstate->maxitemsize);
+
+				/*
+				 * No previous/base tuple, since itup is the  first item
+				 * returned by the tuplesort -- use itup as base tuple of
+				 * first pending posting list for entire index build
+				 */
+				newbase = CopyIndexTuple(itup);
+				_bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+			}
+			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
+										 itup) > keysz &&
+					 _bt_dedup_save_htid(dstate, itup))
+			{
+				/*
+				 * Tuple is equal to base tuple of pending posting list, and
+				 * merging itup into pending posting list won't exceed the
+				 * maxitemsize limit.  Heap TID(s) for itup have been saved in
+				 * state.  The next iteration will also end up here if it's
+				 * possible to merge the next tuple into the same pending
+				 * posting list.
+				 */
+			}
+			else
+			{
+				/*
+				 * Tuple is not equal to pending posting list tuple, or
+				 * maxitemsize limit was reached
+				 */
+				_bt_sort_dedup_finish_pending(wstate, state, dstate);
+				/* Base tuple is always a copy */
+				pfree(dstate->base);
+
+				/* itup starts new pending posting list */
+				newbase = CopyIndexTuple(itup);
+				_bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+			}
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+
+		/*
+		 * Handle the last item (there must be a last item when the tuplesort
+		 * returned one or more tuples)
+		 */
+		if (state)
+		{
+			_bt_sort_dedup_finish_pending(wstate, state, dstate);
+			/* Base tuple is always a copy */
+			pfree(dstate->base);
+			pfree(dstate->htids);
+		}
+
+		pfree(dstate);
+	}
 	else
 	{
-		/* merge is unnecessary */
+		/* merging and deduplication are both unnecessary */
 		while ((itup = tuplesort_getindextuple(btspool->sortstate,
 											   true)) != NULL)
 		{
@@ -1253,7 +1421,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			if (state == NULL)
 				state = _bt_pagestate(wstate, 0);
 
-			_bt_buildadd(wstate, state, itup);
+			_bt_buildadd(wstate, state, itup, 0);
 
 			/* Report progress */
 			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 29167f1ef5..950c6d7673 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
 	state.minfirstrightsz = SIZE_MAX;
 	state.newitemoff = newitemoff;
 
+	/* newitem cannot be a posting list item */
+	Assert(!BTreeTupleIsPosting(newitem));
+
 	/*
 	 * maxsplits should never exceed maxoff because there will be at most as
 	 * many candidate split points as there are points _between_ tuples, once
@@ -459,6 +462,7 @@ _bt_recsplitloc(FindSplitData *state,
 	int16		leftfree,
 				rightfree;
 	Size		firstrightitemsz;
+	Size		postingsz = 0;
 	bool		newitemisfirstonright;
 
 	/* Is the new item going to be the first item on the right page? */
@@ -468,8 +472,30 @@ _bt_recsplitloc(FindSplitData *state,
 	if (newitemisfirstonright)
 		firstrightitemsz = state->newitemsz;
 	else
+	{
 		firstrightitemsz = firstoldonrightsz;
 
+		/*
+		 * Calculate suffix truncation space saving when firstright is a
+		 * posting list tuple, though only when the firstright is over 64
+		 * bytes including line pointer overhead (arbitrary).  This avoids
+		 * accessing the tuple in cases where its posting list must be very
+		 * small (if firstright has one at all).
+		 */
+		if (state->is_leaf && firstrightitemsz > 64)
+		{
+			ItemId		itemid;
+			IndexTuple	newhighkey;
+
+			itemid = PageGetItemId(state->page, firstoldonright);
+			newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+			if (BTreeTupleIsPosting(newhighkey))
+				postingsz = IndexTupleSize(newhighkey) -
+					BTreeTupleGetPostingOffset(newhighkey);
+		}
+	}
+
 	/* Account for all the old tuples */
 	leftfree = state->leftspace - olddataitemstoleft;
 	rightfree = state->rightspace -
@@ -491,11 +517,17 @@ _bt_recsplitloc(FindSplitData *state,
 	 * If we are on the leaf level, assume that suffix truncation cannot avoid
 	 * adding a heap TID to the left half's new high key when splitting at the
 	 * leaf level.  In practice the new high key will often be smaller and
-	 * will rarely be larger, but conservatively assume the worst case.
+	 * will rarely be larger, but conservatively assume the worst case.  We do
+	 * go to the trouble of subtracting away posting list overhead, though
+	 * only when it looks like it will make an appreciable difference.
+	 * (Posting lists are the only case where truncation will typically make
+	 * the final high key far smaller than firstright, so being a bit more
+	 * precise there noticeably improves the balance of free space.)
 	 */
 	if (state->is_leaf)
 		leftfree -= (int16) (firstrightitemsz +
-							 MAXALIGN(sizeof(ItemPointerData)));
+							 MAXALIGN(sizeof(ItemPointerData)) -
+							 postingsz);
 	else
 		leftfree -= (int16) firstrightitemsz;
 
@@ -691,7 +723,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
 	tup = (IndexTuple) PageGetItem(state->page, itemid);
 	/* Do cheaper test first */
-	if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+	if (BTreeTupleIsPosting(tup) ||
+		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
 	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index ee972a1465..55dbdac1f0 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -20,6 +20,7 @@
 #include "access/nbtree.h"
 #include "access/reloptions.h"
 #include "access/relscan.h"
+#include "catalog/catalog.h"
 #include "commands/progress.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -107,7 +108,13 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	 */
 	key = palloc(offsetof(BTScanInsertData, scankeys) +
 				 sizeof(ScanKeyData) * indnkeyatts);
-	key->heapkeyspace = itup == NULL || _bt_heapkeyspace(rel);
+	if (itup)
+		_bt_metaversion(rel, &key->heapkeyspace, &key->safededup);
+	else
+	{
+		key->heapkeyspace = true;
+		key->safededup = _bt_opclasses_support_dedup(rel);
+	}
 	key->anynullkeys = false;	/* initial assumption */
 	key->nextkey = false;
 	key->pivotsearch = false;
@@ -1373,6 +1380,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			continue;
 		}
 
@@ -1534,6 +1542,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			cmpresult = 0;
 			if (subkey->sk_flags & SK_ROW_END)
 				break;
@@ -1773,10 +1782,65 @@ _bt_killitems(IndexScanDesc scan)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
+			bool		killtuple = false;
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			if (BTreeTupleIsPosting(ituple))
 			{
-				/* found the item */
+				int			pi = i + 1;
+				int			nposting = BTreeTupleGetNPosting(ituple);
+				int			j;
+
+				/*
+				 * Note that we rely on the assumption that heap TIDs in the
+				 * scanpos items array are always in ascending heap TID order
+				 * within a posting list
+				 */
+				for (j = 0; j < nposting; j++)
+				{
+					ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+					if (!ItemPointerEquals(item, &kitem->heapTid))
+						break;	/* out of posting list loop */
+
+					/* kitem must have matching offnum when heap TIDs match */
+					Assert(kitem->indexOffset == offnum);
+
+					/*
+					 * Read-ahead to later kitems here.
+					 *
+					 * We rely on the assumption that not advancing kitem here
+					 * will prevent us from considering the posting list tuple
+					 * fully dead by not matching its next heap TID in next
+					 * loop iteration.
+					 *
+					 * If, on the other hand, this is the final heap TID in
+					 * the posting list tuple, then tuple gets killed
+					 * regardless (i.e. we handle the case where the last
+					 * kitem is also the last heap TID in the last index tuple
+					 * correctly -- posting tuple still gets killed).
+					 */
+					if (pi < numKilled)
+						kitem = &so->currPos.items[so->killedItems[pi++]];
+				}
+
+				/*
+				 * Don't bother advancing the outermost loop's int iterator to
+				 * avoid processing killed items that relate to the same
+				 * offnum/posting list tuple.  This micro-optimization hardly
+				 * seems worth it.  (Further iterations of the outermost loop
+				 * will fail to match on this same posting list's first heap
+				 * TID instead, so we'll advance to the next offnum/index
+				 * tuple pretty quickly.)
+				 */
+				if (j == nposting)
+					killtuple = true;
+			}
+			else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+				killtuple = true;
+
+			if (killtuple)
+			{
+				/* found the item/all posting list items */
 				ItemIdMarkDead(iid);
 				killedsomething = true;
 				break;			/* out of inner search loop */
@@ -2017,7 +2081,9 @@ btoptions(Datum reloptions, bool validate)
 	static const relopt_parse_elt tab[] = {
 		{"fillfactor", RELOPT_TYPE_INT, offsetof(BTOptions, fillfactor)},
 		{"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
-		offsetof(BTOptions, vacuum_cleanup_index_scale_factor)}
+		offsetof(BTOptions, vacuum_cleanup_index_scale_factor)},
+		{"deduplication", RELOPT_TYPE_BOOL,
+		offsetof(BTOptions, deduplication)}
 
 	};
 
@@ -2138,6 +2204,19 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(pivot))
+		{
+			/*
+			 * index_truncate_tuple() just returns a straight copy of
+			 * firstright when it has no key attributes to truncate.  We need
+			 * to truncate away the posting list ourselves.
+			 */
+			Assert(keepnatts == nkeyatts);
+			Assert(natts == nkeyatts);
+			pivot->t_info &= ~INDEX_SIZE_MASK;
+			pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+		}
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2154,6 +2233,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * attribute to the new pivot tuple.
 		 */
 		Assert(natts != nkeyatts);
+		Assert(!BTreeTupleIsPosting(lastleft) &&
+			   !BTreeTupleIsPosting(firstright));
 		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
 		tidpivot = palloc0(newsize);
 		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2171,6 +2252,18 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
 		pivot = palloc0(newsize);
 		memcpy(pivot, firstright, IndexTupleSize(firstright));
+
+		if (BTreeTupleIsPosting(pivot))
+		{
+			/*
+			 * New pivot tuple was copied from firstright, which happens to be
+			 * a posting list tuple.  We will have to include a lastleft heap
+			 * TID in the final pivot, but we can remove the posting list now.
+			 * (Pivot tuples should never contain a posting list.)
+			 */
+			newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+				MAXALIGN(sizeof(ItemPointerData));
+		}
 	}
 
 	/*
@@ -2198,7 +2291,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2209,9 +2302,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
+							  BTreeTupleGetHeapTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #else
 
 	/*
@@ -2224,7 +2320,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2233,7 +2329,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2314,13 +2411,16 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
  * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal after detoasting.
+ * unless they're bitwise equal after detoasting.  When an index is considered
+ * deduplication-safe by _bt_opclasses_support_dedup, routine is guaranteed to
+ * give the same result as _bt_keep_natts would.
  *
- * These issues must be acceptable to callers, typically because they're only
- * concerned about making suffix truncation as effective as possible without
- * leaving excessive amounts of free space on either side of page split.
- * Callers can rely on the fact that attributes considered equal here are
- * definitely also equal according to _bt_keep_natts.
+ * Suffix truncation callers can rely on the fact that attributes considered
+ * equal here are definitely also equal according to _bt_keep_natts, even when
+ * the index uses an opclass or collation that is not deduplication-safe.
+ * This weaker guarantee is good enough for these callers, since false
+ * negatives generally only have the effect of making leaf page splits use a
+ * more balanced split point.
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2398,22 +2498,36 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
+	/* !heapkeyspace indexes do not support deduplication */
+	if (!heapkeyspace && BTreeTupleIsPosting(itup))
+		return false;
+
+	/* Posting list tuples should never have "pivot heap TID" bit set */
+	if (BTreeTupleIsPosting(itup) &&
+		(ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+		 BT_PIVOT_HEAP_TID_ATTR) != 0)
+		return false;
+
+	/* INCLUDE indexes do not support deduplication */
+	if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+		return false;
+
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
 			/*
-			 * Non-pivot tuples currently never use alternative heap TID
-			 * representation -- even those within heapkeyspace indexes
+			 * Non-pivot tuple should never be explicitly marked as a pivot
+			 * tuple
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated.  (Note that tupnatts must have been
-			 * inferred, rather than coming from an explicit on-disk
-			 * representation.)
+			 * inferred, even with a posting list tuple, because only pivot
+			 * tuples store tupnatts directly.)
 			 */
 			return tupnatts == natts;
 		}
@@ -2457,12 +2571,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * non-zero, or when there is no explicit representation and the
 			 * tuple is evidently not a pre-pg_upgrade tuple.
 			 *
-			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
-			 * that to decide if the tuple is a pre-v11 tuple.
+			 * Prior to v11, downlinks always had P_HIKEY as their offset.
+			 * Accept that as an alternative indication of a valid
+			 * !heapkeyspace negative infinity tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
-				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+				ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
 		}
 		else
 		{
@@ -2488,7 +2602,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
+		return false;
+
+	/* Pivot tuple should not use posting list representation (redundant) */
+	if (BTreeTupleIsPosting(itup))
 		return false;
 
 	/*
@@ -2558,11 +2676,54 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					BTMaxItemSizeNoHeapTid(page),
 					RelationGetRelationName(rel)),
 			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
-					   ItemPointerGetBlockNumber(&newtup->t_tid),
-					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+					   ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
 					   RelationGetRelationName(heap)),
 			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
 					 "Consider a function index of an MD5 hash of the value, "
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Is it safe to perform deduplication for an index, given the opclasses and
+ * collations used?
+ *
+ * Returned value is stored in index metapage during index builds.  Function
+ * does not account for incompatibilities caused by index being on an earlier
+ * nbtree version.
+ */
+bool
+_bt_opclasses_support_dedup(Relation index)
+{
+	/* INCLUDE indexes don't support deduplication */
+	if (IndexRelationGetNumberOfAttributes(index) !=
+		IndexRelationGetNumberOfKeyAttributes(index))
+		return false;
+
+	/*
+	 * There is no reason why deduplication cannot be used with system catalog
+	 * indexes.  However, we deem it generally unsafe because it's not clear
+	 * how it could be disabled.  (ALTER INDEX is not supported with system
+	 * catalog indexes, so users have no way to set the "deduplicate" storage
+	 * parameter.)
+	 */
+	if (IsCatalogRelation(index))
+		return false;
+
+	for (int i = 0; i < IndexRelationGetNumberOfKeyAttributes(index); i++)
+	{
+		Oid			opfamily = index->rd_opfamily[i];
+		Oid			collation = index->rd_indcollation[i];
+
+		/* TODO add adequate check of opclasses and collations */
+		elog(DEBUG4, "index %s column i %d opfamilyOid %u collationOid %u",
+			 RelationGetRelationName(index), i, opfamily, collation);
+
+		/* NUMERIC btree opfamily OID is 1988 */
+		if (opfamily == 1988)
+			return false;
+	}
+
+	return true;
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 234b0e0596..56b5f91027 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -22,6 +22,9 @@
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/procarray.h"
+#include "utils/memutils.h"
+
+static MemoryContext opCtx;		/* working memory for operations */
 
 /*
  * _bt_restore_page -- re-enter all the index tuples on a page
@@ -111,6 +114,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
 	Assert(md->btm_version >= BTREE_NOVAC_VERSION);
 	md->btm_oldest_btpo_xact = xlrec->oldest_btpo_xact;
 	md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
+	md->btm_safededup = xlrec->safededup;
 
 	pageop = (BTPageOpaque) PageGetSpecialPointer(metapg);
 	pageop->btpo_flags = BTP_META;
@@ -156,7 +160,8 @@ _bt_clear_incomplete_split(XLogReaderState *record, uint8 block_id)
 }
 
 static void
-btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
+btree_xlog_insert(bool isleaf, bool ismeta, bool posting,
+				  XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	xl_btree_insert *xlrec = (xl_btree_insert *) XLogRecGetData(record);
@@ -181,9 +186,52 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 
 		page = BufferGetPage(buffer);
 
-		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
-						false, false) == InvalidOffsetNumber)
-			elog(PANIC, "btree_xlog_insert: failed to add item");
+		if (!posting)
+		{
+			/* Simple retail insertion */
+			if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+		}
+		else
+		{
+			ItemId		itemid;
+			IndexTuple	oposting,
+						newitem,
+						nposting;
+			uint16		postingoff;
+
+			/*
+			 * A posting list split occurred during leaf page insertion.  WAL
+			 * record data will start with an offset number representing the
+			 * point in an existing posting list that a split occurs at.
+			 *
+			 * Use _bt_swap_posting() to repeat posting list split steps from
+			 * primary.  Note that newitem from WAL record is 'orignewitem',
+			 * not the final version of newitem that is actually inserted on
+			 * page.
+			 */
+			postingoff = *((uint16 *) datapos);
+			datapos += sizeof(uint16);
+			datalen -= sizeof(uint16);
+
+			itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+			oposting = (IndexTuple) PageGetItem(page, itemid);
+
+			/* newitem must be mutable copy for _bt_swap_posting() */
+			Assert(isleaf && postingoff > 0);
+			newitem = CopyIndexTuple((IndexTuple) datapos);
+			nposting = _bt_swap_posting(newitem, oposting, postingoff);
+
+			/* Replace existing posting list with post-split version */
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+			/* insert "final" new item (not orignewitem from WAL stream) */
+			Assert(IndexTupleSize(newitem) == datalen);
+			if (PageAddItem(page, (Item) newitem, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add posting split new item");
+		}
 
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
@@ -265,20 +313,38 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 		BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
 		OffsetNumber off;
 		IndexTuple	newitem = NULL,
-					left_hikey = NULL;
+					left_hikey = NULL,
+					nposting = NULL;
 		Size		newitemsz = 0,
 					left_hikeysz = 0;
 		Page		newlpage;
-		OffsetNumber leftoff;
+		OffsetNumber leftoff,
+					replacepostingoff = InvalidOffsetNumber;
 
 		datapos = XLogRecGetBlockData(record, 0, &datalen);
 
-		if (onleft)
+		if (onleft || xlrec->postingoff != 0)
 		{
 			newitem = (IndexTuple) datapos;
 			newitemsz = MAXALIGN(IndexTupleSize(newitem));
 			datapos += newitemsz;
 			datalen -= newitemsz;
+
+			if (xlrec->postingoff != 0)
+			{
+				ItemId		itemid;
+				IndexTuple	oposting;
+
+				/* Posting list must be at offset number before new item's */
+				replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+				/* newitem must be mutable copy for _bt_swap_posting() */
+				newitem = CopyIndexTuple(newitem);
+				itemid = PageGetItemId(lpage, replacepostingoff);
+				oposting = (IndexTuple) PageGetItem(lpage, itemid);
+				nposting = _bt_swap_posting(newitem, oposting,
+											xlrec->postingoff);
+			}
 		}
 
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
@@ -304,8 +370,20 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 			Size		itemsz;
 			IndexTuple	item;
 
+			/* Add replacement posting list when required */
+			if (off == replacepostingoff)
+			{
+				Assert(onleft || xlrec->firstright == xlrec->newitemoff);
+				if (PageAddItem(newlpage, (Item) nposting,
+								MAXALIGN(IndexTupleSize(nposting)), leftoff,
+								false, false) == InvalidOffsetNumber)
+					elog(ERROR, "failed to add new posting list item to left page after split");
+				leftoff = OffsetNumberNext(leftoff);
+				continue;		/* don't insert oposting */
+			}
+
 			/* add the new item if it was inserted on left page */
-			if (onleft && off == xlrec->newitemoff)
+			else if (onleft && off == xlrec->newitemoff)
 			{
 				if (PageAddItem(newlpage, (Item) newitem, newitemsz, leftoff,
 								false, false) == InvalidOffsetNumber)
@@ -379,6 +457,80 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 	}
 }
 
+static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+	Buffer		buf;
+
+	if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Initialize a temporary empty page and copy all the items to that in
+		 * item number order.
+		 */
+		Page		page = (Page) BufferGetPage(buf);
+		OffsetNumber offnum;
+		BTDedupState state;
+
+		state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+
+		state->maxitemsize = Min(BTMaxItemSize(page), INDEX_SIZE_MASK);
+		state->checkingunique = false;	/* unused */
+		state->skippedbase = InvalidOffsetNumber;
+		state->base = NULL;
+		state->baseoff = InvalidOffsetNumber;
+		state->basetupsize = 0;
+		state->htids = NULL;
+		state->nhtids = 0;
+		state->nitems = 0;
+		state->phystupsize = 0;
+
+		/* Conservatively size array */
+		state->htids = palloc(state->maxitemsize);
+
+		/*
+		 * Merge existing physical tuples, starting with the base physical
+		 * tuple
+		 */
+		for (offnum = xlrec->baseoff;
+			 offnum < xlrec->baseoff + xlrec->nitems;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemid = PageGetItemId(page, offnum);
+			IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+			Assert(!ItemIdIsDead(itemid));
+
+			if (offnum == xlrec->baseoff)
+			{
+				/*
+				 * No previous/base tuple for first data item -- use first
+				 * data item as base tuple of first pending posting list
+				 */
+				_bt_dedup_start_pending(state, itup, offnum);
+			}
+			else
+			{
+				/* Heap TID(s) for itup will be saved in state */
+				if (!_bt_dedup_save_htid(state, itup))
+					elog(ERROR, "could not add heap tid to pending posting list");
+			}
+		}
+
+		Assert(state->nitems == xlrec->nitems);
+		/* Handle the last item */
+		_bt_dedup_finish_pending(buf, state, false);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buf);
+	}
+
+	if (BufferIsValid(buf))
+		UnlockReleaseBuffer(buf);
+}
+
 static void
 btree_xlog_vacuum(XLogReaderState *record)
 {
@@ -401,7 +553,38 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		page = (Page) BufferGetPage(buffer);
 
-		PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
+		/*
+		 * Must update posting list tuples before deleting whole items, since
+		 * offset numbers are based on original page contents
+		 */
+		if (xlrec->nupdated > 0)
+		{
+			OffsetNumber *updatedoffsets;
+			IndexTuple	updated;
+			Size		itemsz;
+
+			updatedoffsets = (OffsetNumber *)
+				(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+			updated = (IndexTuple) ((char *) updatedoffsets +
+									xlrec->nupdated * sizeof(OffsetNumber));
+
+			/* Handle posting tuple updates */
+			for (int i = 0; i < xlrec->nupdated; i++)
+			{
+				PageIndexTupleDelete(page, updatedoffsets[i]);
+
+				itemsz = MAXALIGN(IndexTupleSize(updated));
+
+				if (PageAddItem(page, (Item) updated, itemsz, updatedoffsets[i],
+								false, false) == InvalidOffsetNumber)
+					elog(PANIC, "failed to add updated posting list item");
+
+				updated = (IndexTuple) ((char *) updated + itemsz);
+			}
+		}
+
+		if (xlrec->ndeleted > 0)
+			PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 
 		/*
 		 * Mark the page as not containing any LP_DEAD items --- see comments
@@ -735,17 +918,22 @@ void
 btree_redo(XLogReaderState *record)
 {
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	MemoryContext oldCtx;
 
+	oldCtx = MemoryContextSwitchTo(opCtx);
 	switch (info)
 	{
 		case XLOG_BTREE_INSERT_LEAF:
-			btree_xlog_insert(true, false, record);
+			btree_xlog_insert(true, false, false, record);
 			break;
 		case XLOG_BTREE_INSERT_UPPER:
-			btree_xlog_insert(false, false, record);
+			btree_xlog_insert(false, false, false, record);
 			break;
 		case XLOG_BTREE_INSERT_META:
-			btree_xlog_insert(false, true, record);
+			btree_xlog_insert(false, true, false, record);
+			break;
+		case XLOG_BTREE_INSERT_POST:
+			btree_xlog_insert(true, false, true, record);
 			break;
 		case XLOG_BTREE_SPLIT_L:
 			btree_xlog_split(true, record);
@@ -753,6 +941,9 @@ btree_redo(XLogReaderState *record)
 		case XLOG_BTREE_SPLIT_R:
 			btree_xlog_split(false, record);
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			btree_xlog_dedup(record);
+			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
 			break;
@@ -778,6 +969,23 @@ btree_redo(XLogReaderState *record)
 		default:
 			elog(PANIC, "btree_redo: unknown op code %u", info);
 	}
+	MemoryContextSwitchTo(oldCtx);
+	MemoryContextReset(opCtx);
+}
+
+void
+btree_xlog_startup(void)
+{
+	opCtx = AllocSetContextCreate(CurrentMemoryContext,
+								  "Btree recovery temporary context",
+								  ALLOCSET_DEFAULT_SIZES);
+}
+
+void
+btree_xlog_cleanup(void)
+{
+	MemoryContextDelete(opCtx);
+	opCtx = NULL;
 }
 
 /*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 497f8dc77e..23e951aa9e 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -27,6 +27,7 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 		case XLOG_BTREE_INSERT_LEAF:
 		case XLOG_BTREE_INSERT_UPPER:
 		case XLOG_BTREE_INSERT_META:
+		case XLOG_BTREE_INSERT_POST:
 			{
 				xl_btree_insert *xlrec = (xl_btree_insert *) rec;
 
@@ -38,15 +39,27 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
-				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d",
-								 xlrec->level, xlrec->firstright, xlrec->newitemoff);
+				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, postingoff %d",
+								 xlrec->level,
+								 xlrec->firstright,
+								 xlrec->newitemoff,
+								 xlrec->postingoff);
+				break;
+			}
+		case XLOG_BTREE_DEDUP_PAGE:
+			{
+				xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+				appendStringInfo(buf, "baseoff %u; nitems %u",
+								 xlrec->baseoff, xlrec->nitems);
 				break;
 			}
 		case XLOG_BTREE_VACUUM:
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "ndeleted %u", xlrec->ndeleted);
+				appendStringInfo(buf, "ndeleted %u; nupdated %u",
+								 xlrec->ndeleted, xlrec->nupdated);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
@@ -130,6 +143,12 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			id = "DEDUPLICATE";
+			break;
+		case XLOG_BTREE_INSERT_POST:
+			id = "INSERT_POST";
+			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 8d951ce404..9560df7a7c 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -28,6 +28,7 @@
 
 #include "access/commit_ts.h"
 #include "access/gin.h"
+#include "access/nbtree.h"
 #include "access/rmgr.h"
 #include "access/tableam.h"
 #include "access/transam.h"
@@ -1091,6 +1092,15 @@ static struct config_bool ConfigureNamesBool[] =
 		false,
 		check_bonjour, NULL, NULL
 	},
+	{
+		{"btree_deduplication", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Enables B-tree index deduplication optimization."),
+			NULL
+		},
+		&btree_deduplication,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"track_commit_timestamp", PGC_POSTMASTER, REPLICATION,
 			gettext_noop("Collects transaction commit time."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 087190ce63..739676b9d0 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -651,6 +651,7 @@
 #vacuum_cleanup_index_scale_factor = 0.1	# fraction of total number of tuples
 						# before index cleanup, 0 always performs
 						# index cleanup
+#btree_deduplication = on
 #bytea_output = 'hex'			# hex, escape
 #xmlbinary = 'base64'
 #xmloption = 'content'
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 5e0db3515d..3ded74dc1c 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1679,14 +1679,14 @@ psql_completion(const char *text, int start, int end)
 	/* ALTER INDEX <foo> SET|RESET ( */
 	else if (Matches("ALTER", "INDEX", MatchAny, "RESET", "("))
 		COMPLETE_WITH("fillfactor",
-					  "vacuum_cleanup_index_scale_factor",	/* BTREE */
+					  "vacuum_cleanup_index_scale_factor", "deduplication",	/* BTREE */
 					  "fastupdate", "gin_pending_list_limit",	/* GIN */
 					  "buffering",	/* GiST */
 					  "pages_per_range", "autosummarize"	/* BRIN */
 			);
 	else if (Matches("ALTER", "INDEX", MatchAny, "SET", "("))
 		COMPLETE_WITH("fillfactor =",
-					  "vacuum_cleanup_index_scale_factor =",	/* BTREE */
+					  "vacuum_cleanup_index_scale_factor =", "deduplication =",	/* BTREE */
 					  "fastupdate =", "gin_pending_list_limit =",	/* GIN */
 					  "buffering =",	/* GiST */
 					  "pages_per_range =", "autosummarize ="	/* BRIN */
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 194c93fd3a..82a23d86c6 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -145,6 +145,7 @@ static void bt_tuple_present_callback(Relation index, ItemPointer tid,
 									  bool tupleIsAlive, void *checkstate);
 static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 									 IndexTuple itup);
+static inline IndexTuple bt_posting_logical_tuple(IndexTuple itup, int n);
 static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 											   OffsetNumber offset);
@@ -167,6 +168,7 @@ static ItemId PageGetItemIdCareful(BtreeCheckState *state, BlockNumber block,
 								   Page page, OffsetNumber offset);
 static inline ItemPointer BTreeTupleGetHeapTIDCareful(BtreeCheckState *state,
 													  IndexTuple itup, bool nonpivot);
+static inline ItemPointer BTreeTupleGetPointsToTID(IndexTuple itup);
 
 /*
  * bt_index_check(index regclass, heapallindexed boolean)
@@ -278,7 +280,8 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
 
 	if (btree_index_mainfork_expected(indrel))
 	{
-		bool	heapkeyspace;
+		bool		heapkeyspace,
+					safededup;
 
 		RelationOpenSmgr(indrel);
 		if (!smgrexists(indrel->rd_smgr, MAIN_FORKNUM))
@@ -288,7 +291,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
 							RelationGetRelationName(indrel))));
 
 		/* Check index, possibly against table it is an index on */
-		heapkeyspace = _bt_heapkeyspace(indrel);
+		_bt_metaversion(indrel, &heapkeyspace, &safededup);
 		bt_check_every_level(indrel, heaprel, heapkeyspace, parentcheck,
 							 heapallindexed, rootdescend);
 	}
@@ -419,12 +422,13 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		/*
 		 * Size Bloom filter based on estimated number of tuples in index,
 		 * while conservatively assuming that each block must contain at least
-		 * MaxIndexTuplesPerPage / 5 non-pivot tuples.  (Non-leaf pages cannot
-		 * contain non-pivot tuples.  That's okay because they generally make
-		 * up no more than about 1% of all pages in the index.)
+		 * MaxBTreeIndexTuplesPerPage / 3 "logical" tuples.  heapallindexed
+		 * verification fingerprints posting list heap TIDs as plain non-pivot
+		 * tuples, complete with index keys.  This allows its heap scan to
+		 * behave as if posting lists do not exist.
 		 */
 		total_pages = RelationGetNumberOfBlocks(rel);
-		total_elems = Max(total_pages * (MaxIndexTuplesPerPage / 5),
+		total_elems = Max(total_pages * (MaxBTreeIndexTuplesPerPage / 3),
 						  (int64) state->rel->rd_rel->reltuples);
 		/* Random seed relies on backend srandom() call to avoid repetition */
 		seed = random();
@@ -924,6 +928,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer	scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -954,13 +959,15 @@ bt_target_page_check(BtreeCheckState *state)
 		if (!_bt_check_natts(state->rel, state->heapkeyspace, state->target,
 							 offset))
 		{
+			ItemPointer tid;
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetPointsToTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -994,18 +1001,21 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
+			ItemPointer tid = BTreeTupleGetPointsToTID(itup);
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
-			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+			htid = psprintf("(%u,%u)", ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1017,6 +1027,40 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is actually a posting list, make sure posting list TIDs
+		 * are in order.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer		current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+				current = BTreeTupleGetPostingN(itup, i);
+
+				if (ItemPointerCompare(current, &last) <= 0)
+				{
+					char	   *itid = psprintf("(%u,%u)", state->targetblock, offset);
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s posting list offset=%d page lsn=%X/%X.",
+												itid, i,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1049,13 +1093,14 @@ bt_target_page_check(BtreeCheckState *state)
 		if (tupsize > (lowersizelimit ? BTMaxItemSize(state->target) :
 					   BTMaxItemSizeNoHeapTid(state->target)))
 		{
+			ItemPointer tid = BTreeTupleGetPointsToTID(itup);
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1074,12 +1119,32 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				/* Fingerprint all elements as distinct "logical" tuples */
+				for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					IndexTuple	logtuple;
+
+					logtuple = bt_posting_logical_tuple(itup, i);
+					norm = bt_normalize_tuple(state, logtuple);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != logtuple)
+						pfree(norm);
+					pfree(logtuple);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
@@ -1087,7 +1152,8 @@ bt_target_page_check(BtreeCheckState *state)
 		 *
 		 * If there is a high key (if this is not the rightmost page on its
 		 * entire level), check that high key actually is upper bound on all
-		 * page items.
+		 * page items.  If this is a posting list tuple, we'll need to set
+		 * scantid to be highest TID in posting list.
 		 *
 		 * We prefer to check all items against high key rather than checking
 		 * just the last and trusting that the operator class obeys the
@@ -1127,17 +1193,22 @@ bt_target_page_check(BtreeCheckState *state)
 		 * tuple. (See also: "Notes About Data Representation" in the nbtree
 		 * README.)
 		 */
+		scantid = skey->scantid;
+		if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+			skey->scantid = BTreeTupleGetMaxHeapTID(itup);
+
 		if (!P_RIGHTMOST(topaque) &&
 			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
 			  invariant_l_offset(state, skey, P_HIKEY)))
 		{
+			ItemPointer tid = BTreeTupleGetPointsToTID(itup);
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1150,6 +1221,7 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
+		skey->scantid = scantid;
 
 		/*
 		 * * Item order check *
@@ -1160,15 +1232,17 @@ bt_target_page_check(BtreeCheckState *state)
 		if (OffsetNumberNext(offset) <= max &&
 			!invariant_l_offset(state, skey, OffsetNumberNext(offset)))
 		{
+			ItemPointer tid;
 			char	   *itid,
 					   *htid,
 					   *nitid,
 					   *nhtid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetPointsToTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 			nitid = psprintf("(%u,%u)", state->targetblock,
 							 OffsetNumberNext(offset));
 
@@ -1177,9 +1251,10 @@ bt_target_page_check(BtreeCheckState *state)
 										  state->target,
 										  OffsetNumberNext(offset));
 			itup = (IndexTuple) PageGetItem(state->target, itemid);
+			tid = BTreeTupleGetPointsToTID(itup);
 			nhtid = psprintf("(%u,%u)",
-							 ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							 ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							 ItemPointerGetBlockNumberNoCheck(tid),
+							 ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1953,10 +2028,10 @@ bt_tuple_present_callback(Relation index, ItemPointer tid, Datum *values,
  * verification.  In particular, it won't try to normalize opclass-equal
  * datums with potentially distinct representations (e.g., btree/numeric_ops
  * index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though.  For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation.  Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
  */
 static IndexTuple
 bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -1969,6 +2044,9 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	IndexTuple	reformed;
 	int			i;
 
+	/* Caller should only pass "logical" non-pivot tuples here */
+	Assert(!BTreeTupleIsPosting(itup) && !BTreeTupleIsPivot(itup));
+
 	/* Easy case: It's immediately clear that tuple has no varlena datums */
 	if (!IndexTupleHasVarwidths(itup))
 		return itup;
@@ -2031,6 +2109,30 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	return reformed;
 }
 
+/*
+ * Produce palloc()'d "logical" tuple for nth posting list entry.
+ *
+ * In general, deduplication is not supposed to change the logical contents of
+ * an index.  Multiple logical index tuples are merged together into one
+ * physical posting list index tuple when convenient.
+ *
+ * heapallindexed verification must normalize-away this variation in
+ * representation by converting posting list tuples into two or more "logical"
+ * tuples.  Each logical tuple must be fingerprinted separately -- there must
+ * be one logical tuple for each corresponding Bloom filter probe during the
+ * heap scan.
+ *
+ * Note: Caller needs to call bt_normalize_tuple() with returned tuple.
+ */
+static inline IndexTuple
+bt_posting_logical_tuple(IndexTuple itup, int n)
+{
+	Assert(BTreeTupleIsPosting(itup));
+
+	/* Returns non-posting-list tuple */
+	return _bt_form_posting(itup, BTreeTupleGetPostingN(itup, n), 1);
+}
+
 /*
  * Search for itup in index, starting from fast root page.  itup must be a
  * non-pivot tuple.  This is only supported with heapkeyspace indexes, since
@@ -2087,6 +2189,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		insertstate.itup = itup;
 		insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 		insertstate.itup_key = key;
+		insertstate.postingoff = 0;
 		insertstate.bounds_valid = false;
 		insertstate.buf = lbuf;
 
@@ -2094,7 +2197,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		offnum = _bt_binsrch_insert(state->rel, &insertstate);
 		/* Compare first >= matching item on leaf page, if any */
 		page = BufferGetPage(lbuf);
+		/* Should match on first heap TID when tuple has a posting list */
 		if (offnum <= PageGetMaxOffsetNumber(page) &&
+			insertstate.postingoff <= 0 &&
 			_bt_compare(state->rel, key, page, offnum) == 0)
 			exists = true;
 		_bt_relbuf(state->rel, lbuf);
@@ -2548,26 +2653,52 @@ PageGetItemIdCareful(BtreeCheckState *state, BlockNumber block, Page page,
 }
 
 /*
- * BTreeTupleGetHeapTID() wrapper that lets caller enforce that a heap TID must
- * be present in cases where that is mandatory.
- *
- * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
- * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
- * It may become more useful in the future, when non-pivot tuples support their
- * own alternative INDEX_ALT_TID_MASK representation.
+ * BTreeTupleGetHeapTID() wrapper that enforces that a heap TID is present in
+ * cases where that is mandatory (i.e. for non-pivot tuples).
  */
 static inline ItemPointer
 BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
 							bool nonpivot)
 {
-	ItemPointer result = BTreeTupleGetHeapTID(itup);
-	BlockNumber targetblock = state->targetblock;
+	Assert(state->heapkeyspace);
 
-	if (result == NULL && nonpivot)
+	/*
+	 * Make sure that tuple type (pivot vs non-pivot) matches caller's
+	 * expectation
+	 */
+	if (BTreeTupleIsPivot(itup) == nonpivot)
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
-						targetblock, RelationGetRelationName(state->rel))));
+						state->targetblock,
+						RelationGetRelationName(state->rel))));
 
-	return result;
+	return BTreeTupleGetHeapTID(itup);
+}
+
+/*
+ * Return the "pointed to" TID for itup, which is used to generate a
+ * descriptive error message.  itup must be a "data item" tuple (it wouldn't
+ * make much sense to call here with a high key tuple, since there won't be a
+ * valid downlink/block number to display).
+ *
+ * Returns either a heap TID (which will be the first heap TID in posting list
+ * if itup is posting list tuple), or a TID that contains downlink block
+ * number, plus some encoded metadata (e.g., the number of attributes present
+ * in itup).
+ */
+static inline ItemPointer
+BTreeTupleGetPointsToTID(IndexTuple itup)
+{
+	/*
+	 * Rely on the assumption that !heapkeyspace internal page data items will
+	 * correctly return TID with downlink here -- BTreeTupleGetHeapTID() won't
+	 * recognize it as a pivot tuple, but everything still works out because
+	 * the t_tid field is still returned
+	 */
+	if (!BTreeTupleIsPivot(itup))
+		return BTreeTupleGetHeapTID(itup);
+
+	/* Pivot tuple returns TID with downlink block (heapkeyspace variant) */
+	return &itup->t_tid;
 }
diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
index 5881ea5dd6..059477be1e 100644
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@@ -433,11 +433,122 @@ returns bool
 <sect1 id="btree-implementation">
  <title>Implementation</title>
 
+ <para>
+  Internally, a B-tree index consists of a tree structure with leaf
+  pages.  Each leaf page contains tuples that point to table entries
+  using a heap item pointer. Each tuple's key is considered unique
+  internally, since the item pointer is treated as part of the key.
+ </para>
+ <para>
+  An introduction to the btree index implementation can be found in
+  <filename>src/backend/access/nbtree/README</filename>.
+ </para>
+
+ <sect2 id="btree-deduplication">
+  <title>Deduplication</title>
   <para>
-   An introduction to the btree index implementation can be found in
-   <filename>src/backend/access/nbtree/README</filename>.
+   B-Tree supports <firstterm>deduplication</firstterm>.  Existing
+   leaf page tuples with fully equal keys (equal prior to the heap
+   item pointer) are merged together into a single <quote>posting
+   list</quote> tuple.  The keys appear only once in this
+   representation.  A simple array of heap item pointers follows.
+   Posting lists are formed <quote>lazily</quote>, when a new item is
+   inserted that cannot fit on an existing leaf page.  The immediate
+   goal of the deduplication process is to at least free enough space
+   to fit the new item; otherwise a leaf page split occurs, which
+   allocates a new leaf page.  The <firstterm>key space</firstterm>
+   covered by the original leaf page is shared among the original page,
+   and its new right sibling page.
+  </para>
+  <para>
+   Deduplication can greatly increase index space efficiency with data
+   sets where each distinct key appears at least a few times on
+   average.  It can also reduce the cost of subsequent index scans,
+   especially when many leaf pages must be accessed.  For example, an
+   index on a simple <type>integer</type> column that uses
+   deduplication will have a storage size that is only about 65% of an
+   equivalent unoptimized index when each distinct
+   <type>integer</type> value appears three times.  If each distinct
+   <type>integer</type> value appears six times, the storage overhead
+   can be as low as 50% of baseline.  With hundreds of duplicates per
+   distinct value (or with larger <quote>base</quote> key values) a
+   storage size of about <emphasis>one third</emphasis> of the
+   unoptimized case is expected.  There is often a direct benefit for
+   queries, as well as an indirect benefit due to reduced I/O during
+   routine vacuuming.
+  </para>
+  <para>
+   Cases that don't benefit due to having no duplicate values will
+   incur a small performance penalty with mixed read-write workloads.
+   There is no performance penalty with read-only workloads, since
+   reading from posting lists is at least as efficient as reading the
+   standard index tuple representation.
+  </para>
+ </sect2>
+
+ <sect2 id="btree-deduplication-configure">
+  <title>Configuring Deduplication</title>
+
+  <para>
+   The <xref linkend="guc-btree-deduplication"/> configuration
+   parameter controls deduplication.  By default, deduplication is
+   enabled.  The <literal>deduplication</literal> storage parameter
+   can be used to override the configuration paramater for individual
+   indexes.  See <xref linkend="sql-createindex-storage-parameters"/>
+   from the <command>CREATE INDEX</command> documentation for details.
+  </para>
+ </sect2>
+
+ <sect2 id="btree-deduplication-restrictions">
+  <title>Restrictions</title>
+
+  <para>
+   Deduplication can only be used with indexes that use B-Tree
+   operator classes that were declared <literal>BITWISE</literal>.  In
+   practice almost all datatypes support deduplication, though
+   <type>numeric</type> is a notable exception (the <quote>display
+   scale</quote> feature makes it impossible to enable deduplication
+   without losing useful information about equal <type>numeric</type>
+   datums).  Deduplication is not supported with nondeterministic
+   collations, nor is it supported with <literal>INCLUDE</literal>
+   indexes.
+  </para>
+  <para>
+   Note that a multicolumn index is only considered to have duplicates
+   when there are index entries that repeat entire
+   <emphasis>combinations</emphasis> of values (the values stored in
+   each and every column must be equal).
   </para>
 
+ </sect2>
+
+ <sect2 id="btree-deduplication-unique">
+  <title>Internal use of Deduplication in unique indexes</title>
+
+  <para>
+   Page splits that occur due to inserting multiple physical versions
+   (rather than inserting new logical rows) tend to degrade the
+   structure of indexes, especially in the case of unique indexes.
+   Unique indexes use deduplication <emphasis>internally</emphasis>
+   and <emphasis>selectively</emphasis> to delay (and ideally to
+   prevent) these <quote>unnecessary</quote> page splits.
+   <command>VACUUM</command> or autovacuum will eventually remove dead
+   versions of tuples from every index in any case, but usually cannot
+   reverse page splits (in general, the page must be completely empty
+   before <command>VACUUM</command> can <quote>delete</quote> it).
+  </para>
+  <para>
+   The <xref linkend="guc-btree-deduplication"/> configuration
+   parameter does not affect whether or not deduplication is used
+   within unique indexes.  The internal use of deduplication for
+   unique indexes is subject to all of the same restrictions as
+   deduplication in general.  The <literal>deduplication</literal>
+   storage parameter can be set to <literal>OFF</literal> to disable
+   deduplication in unique indexes, but this is intended only as a
+   debugging option for developers.
+  </para>
+
+ </sect2>
 </sect1>
 
 </chapter>
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 55669b5cad..9f371d3e3a 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -928,10 +928,11 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
      nondeterministic collations give a more <quote>correct</quote> behavior,
      especially when considering the full power of Unicode and its many
      special cases, they also have some drawbacks.  Foremost, their use leads
-     to a performance penalty.  Also, certain operations are not possible with
-     nondeterministic collations, such as pattern matching operations.
-     Therefore, they should be used only in cases where they are specifically
-     wanted.
+     to a performance penalty.  Note, in particular, that B-tree cannot use
+     deduplication with indexes that use a nondeterministic collation.  Also,
+     certain operations are not possible with nondeterministic collations,
+     such as pattern matching operations.  Therefore, they should be used
+     only in cases where they are specifically wanted.
     </para>
    </sect3>
   </sect2>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5d1c90282f..05f442d57a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8021,6 +8021,31 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-btree-deduplication" xreflabel="btree_deduplication">
+      <term><varname>btree_deduplication</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>btree_deduplication</varname></primary>
+       <secondary>configuration parameter</secondary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Controls whether deduplication should be used within B-Tree
+        indexes.  Deduplication is an optimization that reduces the
+        storage size of indexes by storing equal index keys only once.
+        See <xref linkend="btree-deduplication"/> for more
+        information.
+       </para>
+
+       <para>
+        This setting can be overridden for individual B-Tree indexes
+        by changing index storage parameters.  See <xref
+        linkend="sql-createindex-storage-parameters"/> from the
+        <command>CREATE INDEX</command> documentation for details.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-bytea-output" xreflabel="bytea_output">
       <term><varname>bytea_output</varname> (<type>enum</type>)
       <indexterm>
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 629a31ef79..e6cdba4c29 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -166,6 +166,8 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
         maximum size allowed for the index type, data insertion will fail.
         In any case, non-key columns duplicate data from the index's table
         and bloat the size of the index, thus potentially slowing searches.
+        Moreover, B-tree deduplication is never used with indexes that
+        have a non-key column.
        </para>
 
        <para>
@@ -388,10 +390,39 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    </variablelist>
 
    <para>
-    B-tree indexes additionally accept this parameter:
+    B-tree indexes also accept these parameters:
    </para>
 
    <variablelist>
+   <varlistentry id="index-reloption-deduplication" xreflabel="deduplication">
+    <term><literal>deduplication</literal>
+     <indexterm>
+      <primary><varname>deduplication</varname></primary>
+      <secondary>storage parameter</secondary>
+     </indexterm>
+    </term>
+    <listitem>
+    <para>
+      Per-index value for <xref linkend="guc-btree-deduplication"/>.
+      Controls usage of the B-tree deduplication technique described
+      in <xref linkend="btree-deduplication"/>.  Set to
+      <literal>ON</literal> or <literal>OFF</literal> to override GUC.
+      (Alternative spellings of <literal>ON</literal> and
+      <literal>OFF</literal> are allowed as described in <xref
+      linkend="config-setting"/>.)
+    </para>
+
+    <note>
+     <para>
+      Turning <literal>deduplication</literal> off via <command>ALTER
+      INDEX</command> prevents future insertions from triggering
+      deduplication, but does not in itself make existing posting list
+      tuples use the standard tuple representation.
+     </para>
+    </note>
+    </listitem>
+   </varlistentry>
+
    <varlistentry id="index-reloption-vacuum-cleanup-index-scale-factor" xreflabel="vacuum_cleanup_index_scale_factor">
     <term><literal>vacuum_cleanup_index_scale_factor</literal>
      <indexterm>
@@ -446,9 +477,7 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
      This setting controls usage of the fast update technique described in
      <xref linkend="gin-fast-update"/>.  It is a Boolean parameter:
      <literal>ON</literal> enables fast update, <literal>OFF</literal> disables it.
-     (Alternative spellings of <literal>ON</literal> and <literal>OFF</literal> are
-     allowed as described in <xref linkend="config-setting"/>.)  The
-     default is <literal>ON</literal>.
+     The default is <literal>ON</literal>.
     </para>
 
     <note>
diff --git a/src/test/regress/expected/btree_index.out b/src/test/regress/expected/btree_index.out
index f567117a46..e32c8fa826 100644
--- a/src/test/regress/expected/btree_index.out
+++ b/src/test/regress/expected/btree_index.out
@@ -266,6 +266,22 @@ select * from btree_bpchar where f1::bpchar like 'foo%';
  fool
 (2 rows)
 
+--
+-- Perform unique checking, with and without the use of deduplication
+--
+CREATE TABLE dedup_unique_test_table (a int) WITH (autovacuum_enabled=false);
+CREATE UNIQUE INDEX dedup_unique ON dedup_unique_test_table (a) WITH (deduplication=on);
+CREATE UNIQUE INDEX plain_unique ON dedup_unique_test_table (a) WITH (deduplication=off);
+-- Generate enough garbage tuples in index to ensure that even the unique index
+-- with deduplication enabled has to check multiple leaf pages during unique
+-- checking (at least with a BLCKSZ of 8192 or less)
+DO $$
+BEGIN
+    FOR r IN 1..1350 LOOP
+        DELETE FROM dedup_unique_test_table;
+        INSERT INTO dedup_unique_test_table SELECT 1;
+    END LOOP;
+END$$;
 --
 -- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
diff --git a/src/test/regress/sql/btree_index.sql b/src/test/regress/sql/btree_index.sql
index 558dcae0ec..627ba80bc1 100644
--- a/src/test/regress/sql/btree_index.sql
+++ b/src/test/regress/sql/btree_index.sql
@@ -103,6 +103,23 @@ explain (costs off)
 select * from btree_bpchar where f1::bpchar like 'foo%';
 select * from btree_bpchar where f1::bpchar like 'foo%';
 
+--
+-- Perform unique checking, with and without the use of deduplication
+--
+CREATE TABLE dedup_unique_test_table (a int) WITH (autovacuum_enabled=false);
+CREATE UNIQUE INDEX dedup_unique ON dedup_unique_test_table (a) WITH (deduplication=on);
+CREATE UNIQUE INDEX plain_unique ON dedup_unique_test_table (a) WITH (deduplication=off);
+-- Generate enough garbage tuples in index to ensure that even the unique index
+-- with deduplication enabled has to check multiple leaf pages during unique
+-- checking (at least with a BLCKSZ of 8192 or less)
+DO $$
+BEGIN
+    FOR r IN 1..1350 LOOP
+        DELETE FROM dedup_unique_test_table;
+        INSERT INTO dedup_unique_test_table SELECT 1;
+    END LOOP;
+END$$;
+
 --
 -- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
-- 
2.17.1

#125

Peter Geoghegan

pg@bowt.ie

about 6 years ago

In reply to: Peter Geoghegan (#124)

3 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Thu, Dec 19, 2019 at 6:55 PM Peter Geoghegan <pg@bowt.ie> wrote:

I pushed this earlier today -- it became commit 9f83468b. Attached is
v27, which fixes the bitrot against the master branch.

Attached is v28, which fixes bitrot from my recent commits to refactor
VACUUM-related code in nbtpage.c.

Other changes:

* A big overhaul of the nbtree README changes -- "posting list splits"
now becomes its own section.

I tried to get the general idea across about posting lists in this new
section without repeating myself too much. Posting list splits are
probably the most subtle part of the overall design of the patch.
Posting lists piggy-back on a standard atomic action (insertion into a
leaf page, or leaf page split) on the one hand. On the other hand,
they're a separate and independent step at the conceptual level.

Hopefully the general idea comes across as clearly as possible. Some
feedback on that would be good.

* PageIndexTupleOverwrite() is now used for VACUUM's "updates", and
has been taught to not unset an LP_DEAD bit that happens to already be
set.

As the comments added by my recent commit 4b25f5d0 now mention, it's
important that VACUUM not unset LP_DEAD bits accidentally. VACUUM will
falsely unset the BTP_HAS_GARBAGE page flag at times, which isn't
ideal. Even still, unsetting LP_DEAD bits themselves is much worse
(even though BTP_HAS_GARBAGE exists purely to hint that one or more
LP_DEAD bits are set on the page).

Maybe we should go further here, and reconsider whether or not VACUUM
should *ever* unset BTP_HAS_GARBAGE. AFAICT, the only advantage of
nbtree VACUUM clearing it is that doing so might save a backend a
useless scan of the line pointer array to check for the LP_DEAD bits
directly. But the backend will have to split the page when that
happens anyway, which is a far greater cost. It's probably not even
noticeable, since we're already doing lots of stuff with the page when
it happens.

The BTP_HAS_GARBAGE hint probably mattered back when the "getting
tired" mechanism was used (i.e. prior to commit dd299df8). VACUUM
sometimes had a choice to make about which page to use, so quickly
getting an idea about LP_DEAD bits made a certain amount of
sense...but that's not how it works anymore. (Granted, we still do it
that way with pg_upgrade'd indexes from before Postgres 12, but I
don't think that that needs to be given any weight now.)

Thoughts on this?
--
Peter Geoghegan

Attachments:

v28-0003-DEBUG-Show-index-values-in-pageinspect.patchapplication/x-patch; name=v28-0003-DEBUG-Show-index-values-in-pageinspect.patchDownload

From 200899a4dfcdcdacc9a30cf87a1b311c97bd2000 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Nov 2019 19:35:30 -0800
Subject: [PATCH v28 3/3] DEBUG: Show index values in pageinspect

This is not intended for commit.  It is included as a convenience for
reviewers.
---
 contrib/pageinspect/btreefuncs.c       | 65 ++++++++++++++++++--------
 contrib/pageinspect/expected/btree.out |  2 +-
 2 files changed, 47 insertions(+), 20 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 337047ff9d..c51e2b3665 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -27,6 +27,7 @@
 
 #include "postgres.h"
 
+#include "access/genam.h"
 #include "access/nbtree.h"
 #include "access/relation.h"
 #include "catalog/namespace.h"
@@ -245,6 +246,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
  */
 struct user_args
 {
+	Relation	rel;
 	Page		page;
 	OffsetNumber offset;
 	bool		leafpage;
@@ -261,6 +263,7 @@ struct user_args
 static Datum
 bt_page_print_tuples(FuncCallContext *fctx, struct user_args *uargs)
 {
+	Relation	rel = uargs->rel;
 	Page		page = uargs->page;
 	OffsetNumber offset = uargs->offset;
 	bool		leafpage = uargs->leafpage;
@@ -295,26 +298,48 @@ bt_page_print_tuples(FuncCallContext *fctx, struct user_args *uargs)
 	values[j++] = BoolGetDatum(IndexTupleHasVarwidths(itup));
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
-	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
-
-	/*
-	 * Make sure that "data" column does not include posting list or pivot
-	 * tuple representation of heap TID
-	 */
-	if (BTreeTupleIsPosting(itup))
-		dlen -= IndexTupleSize(itup) - BTreeTupleGetPostingOffset(itup);
-	else if (BTreeTupleIsPivot(itup) && BTreeTupleGetHeapTID(itup) != NULL)
-		dlen -= MAXALIGN(sizeof(ItemPointerData));
-
-	dump = palloc0(dlen * 3 + 1);
-	datacstring = dump;
-	for (off = 0; off < dlen; off++)
+	if (rel)
 	{
-		if (off > 0)
-			*dump++ = ' ';
-		sprintf(dump, "%02x", *(ptr + off) & 0xff);
-		dump += 2;
+		TupleDesc	itupdesc = RelationGetDescr(rel);
+		Datum		datvalues[INDEX_MAX_KEYS];
+		bool		isnull[INDEX_MAX_KEYS];
+		int			natts;
+		int			indnkeyatts = rel->rd_index->indnkeyatts;
+
+		natts = BTreeTupleGetNAtts(itup, rel);
+
+		itupdesc->natts = Min(indnkeyatts, natts);
+		memset(&isnull, 0xFF, sizeof(isnull));
+		index_deform_tuple(itup, itupdesc, datvalues, isnull);
+		rel->rd_index->indnkeyatts = natts;
+		datacstring = BuildIndexValueDescription(rel, datvalues, isnull);
+		itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+		rel->rd_index->indnkeyatts = indnkeyatts;
 	}
+	else
+	{
+		dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+
+		/*
+		 * Make sure that "data" column does not include posting list or pivot
+		 * tuple representation of heap TID
+		 */
+		if (BTreeTupleIsPosting(itup))
+			dlen -= IndexTupleSize(itup) - BTreeTupleGetPostingOffset(itup);
+		else if (BTreeTupleIsPivot(itup) && BTreeTupleGetHeapTID(itup) != NULL)
+			dlen -= MAXALIGN(sizeof(ItemPointerData));
+
+		dump = palloc0(dlen * 3 + 1);
+		datacstring = dump;
+		for (off = 0; off < dlen; off++)
+		{
+			if (off > 0)
+				*dump++ = ' ';
+			sprintf(dump, "%02x", *(ptr + off) & 0xff);
+			dump += 2;
+		}
+	}
+
 	values[j++] = CStringGetTextDatum(datacstring);
 	pfree(datacstring);
 
@@ -437,11 +462,11 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = rel;
 		uargs->page = palloc(BLCKSZ);
 		memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
 
 		UnlockReleaseBuffer(buffer);
-		relation_close(rel, AccessShareLock);
 
 		uargs->offset = FirstOffsetNumber;
 
@@ -475,6 +500,7 @@ bt_page_items(PG_FUNCTION_ARGS)
 	}
 	else
 	{
+		relation_close(uargs->rel, AccessShareLock);
 		pfree(uargs->page);
 		pfree(uargs);
 		SRF_RETURN_DONE(fctx);
@@ -522,6 +548,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = NULL;
 		uargs->page = VARDATA(raw_page);
 
 		uargs->offset = FirstOffsetNumber;
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 92d5c59654..fc6794ef65 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -41,7 +41,7 @@ ctid       | (0,1)
 itemlen    | 16
 nulls      | f
 vars       | f
-data       | 01 00 00 00 00 00 00 01
+data       | (a)=(72057594037927937)
 dead       | f
 htid       | (0,1)
 tids       | 
-- 
2.17.1

v28-0001-Add-deduplication-to-nbtree.patchapplication/x-patch; name=v28-0001-Add-deduplication-to-nbtree.patchDownload

From 013cda0a1edb18f4315b6ce16f1cd372188ba399 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 25 Sep 2019 10:08:53 -0700
Subject: [PATCH v28 1/3] Add deduplication to nbtree.

Deduplication reduces the storage overhead of duplicates in indexes that
use the standard nbtree index access method.  The deduplication process
is applied lazily, after the point where opportunistic deletion of
LP_DEAD-marked index tuples occurs.  Deduplication is only applied at
the point where a leaf page split will be required if deduplication
can't free up enough space.  New "posting list tuples" are formed by
merging together existing duplicate tuples.  The physical representation
of the items on an nbtree leaf page is made more space efficient by
deduplication, but the logical contents of the page are not changed.

Testing has shown that nbtree deduplication can generally make indexes
with about 10 or 15 tuples for each distinct key value about 2.5X - 4X
smaller, even with single column integer indexes (e.g., an index on a
referencing column that accompanies a foreign key).  Much larger
reductions in index size are possible in less common cases, where
individual index tuple keys happen to be large.  The final size of
single column nbtree indexes comes close to the final size of a similar
contrib/btree_gin index, at least in cases where GIN's posting list
compression isn't very effective.

The lazy approach taken by nbtree has significant advantages over a
GIN-style eager approach.  Most individual inserts of index tuples have
exactly the same overhead as before.  The extra overhead of
deduplication is amortized across insertions, just like the overhead of
page splits.  The "key space" of indexes works in the same way as it has
since commit dd299df8 (the commit which made heap TID a tiebreaker
column), since only the physical representation of tuples is changed.
Furthermore, deduplication can be turned on or off as needed, or applied
selectively when required.  The split point choice logic doesn't need to
be changed, since posting list tuples are just tuples with payload, much
like tuples with non-key columns in INCLUDE indexes. (nbtsplitloc.c is
still optimized to make intelligent choices in the presence of posting
list tuples, though only because suffix truncation will routinely make
new high keys far far smaller than the non-pivot tuple they're derived
from).

Unique indexes can also make use of deduplication, though the strategy
used has significantly differences.  The high-level goal is to entirely
prevent "unnecessary" page splits -- splits caused only by a short term
burst of index tuple versions.  This is often a concern with frequently
updated tables where UPDATEs always modify at least one indexed column
(making it impossible for the table am to use an optimization like
heapam's heap-only tuples optimization).  Deduplication in unique
indexes effectively "buys time" for existing nbtree garbage collection
mechanisms to run and prevent these page splits (the LP_DEAD bit setting
performed during the uniqueness check is the most important mechanism
for controlling bloat with affected workloads).

Bump XLOG_PAGE_MAGIC because xl_btree_vacuum changed.

Author: Anastasia Lubennikova, Peter Geoghegan
Reviewed-By: Peter Geoghegan
Discussion:
    https://postgr.es/m/55E4051B.7020209@postgrespro.ru
    https://postgr.es/m/4ab6e2db-bcee-f4cf-0916-3a06e6ccbb55@postgrespro.ru
---
 src/include/access/itup.h                     |   5 +
 src/include/access/nbtree.h                   | 413 ++++++++--
 src/include/access/nbtxlog.h                  |  96 ++-
 src/include/access/rmgrlist.h                 |   2 +-
 src/backend/access/common/reloptions.c        |   9 +
 src/backend/access/index/genam.c              |   4 +
 src/backend/access/nbtree/Makefile            |   1 +
 src/backend/access/nbtree/README              | 151 +++-
 src/backend/access/nbtree/nbtdedup.c          | 738 ++++++++++++++++++
 src/backend/access/nbtree/nbtinsert.c         | 342 +++++++-
 src/backend/access/nbtree/nbtpage.c           | 227 +++++-
 src/backend/access/nbtree/nbtree.c            | 180 ++++-
 src/backend/access/nbtree/nbtsearch.c         | 271 ++++++-
 src/backend/access/nbtree/nbtsort.c           | 202 ++++-
 src/backend/access/nbtree/nbtsplitloc.c       |  39 +-
 src/backend/access/nbtree/nbtutils.c          | 224 +++++-
 src/backend/access/nbtree/nbtxlog.c           | 201 ++++-
 src/backend/access/rmgrdesc/nbtdesc.c         |  23 +-
 src/backend/storage/page/bufpage.c            |   9 +-
 src/backend/utils/misc/guc.c                  |  10 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/bin/psql/tab-complete.c                   |   4 +-
 contrib/amcheck/verify_nbtree.c               | 234 +++++-
 doc/src/sgml/btree.sgml                       | 115 ++-
 doc/src/sgml/charset.sgml                     |   9 +-
 doc/src/sgml/config.sgml                      |  25 +
 doc/src/sgml/ref/create_index.sgml            |  37 +-
 src/test/regress/expected/btree_index.out     |  16 +
 src/test/regress/sql/btree_index.sql          |  17 +
 29 files changed, 3306 insertions(+), 299 deletions(-)
 create mode 100644 src/backend/access/nbtree/nbtdedup.c

diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index b9c41d3455..223f2e7eff 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -141,6 +141,11 @@ typedef IndexAttributeBitMapData * IndexAttributeBitMap;
  * On such a page, N tuples could take one MAXALIGN quantum less space than
  * estimated here, seemingly allowing one more tuple than estimated here.
  * But such a page always has at least MAXALIGN special space, so we're safe.
+ *
+ * Note: MaxIndexTuplesPerPage is a limit on the number of tuples on a page
+ * that consume a line pointer -- "physical" tuples.  Some index AMs can store
+ * a greater number of "logical" tuples, though (e.g., btree leaf pages with
+ * posting list tuples).
  */
 #define MaxIndexTuplesPerPage	\
 	((int) ((BLCKSZ - SizeOfPageHeaderData) / \
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index f90ee3a0e0..6fae1ec079 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -24,6 +24,9 @@
 #include "storage/bufmgr.h"
 #include "storage/shm_toc.h"
 
+/* GUC parameter */
+extern bool btree_deduplication;
+
 /* There's room for a 16-bit vacuum cycle ID in BTPageOpaqueData */
 typedef uint16 BTCycleId;
 
@@ -108,6 +111,7 @@ typedef struct BTMetaPageData
 										 * pages */
 	float8		btm_last_cleanup_num_heap_tuples;	/* number of heap tuples
 													 * during last cleanup */
+	bool		btm_safededup;	/* deduplication known to be safe? */
 } BTMetaPageData;
 
 #define BTPageGetMeta(p) \
@@ -115,7 +119,8 @@ typedef struct BTMetaPageData
 
 /*
  * The current Btree version is 4.  That's what you'll get when you create
- * a new index.
+ * a new index.  The btm_safededup field can only be set if this happened
+ * on Postgres 13, but it's safe to read with version 3 indexes.
  *
  * Btree version 3 was used in PostgreSQL v11.  It is mostly the same as
  * version 4, but heap TIDs were not part of the keyspace.  Index tuples
@@ -132,8 +137,8 @@ typedef struct BTMetaPageData
 #define BTREE_METAPAGE	0		/* first page is meta */
 #define BTREE_MAGIC		0x053162	/* magic number in metapage */
 #define BTREE_VERSION	4		/* current version number */
-#define BTREE_MIN_VERSION	2	/* minimal supported version number */
-#define BTREE_NOVAC_VERSION	3	/* minimal version with all meta fields */
+#define BTREE_MIN_VERSION	2	/* minimum supported version */
+#define BTREE_NOVAC_VERSION	3	/* version with all meta fields set */
 
 /*
  * Maximum size of a btree index entry, including its tuple header.
@@ -156,6 +161,27 @@ typedef struct BTMetaPageData
 				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
 				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
 
+/*
+ * MaxBTreeIndexTuplesPerPage is an upper bound on the number of "logical"
+ * tuples that may be stored on a btree leaf page.  This is comparable to
+ * the generic/physical MaxIndexTuplesPerPage upper bound.  A separate
+ * upper bound is needed in certain contexts due to posting list tuples,
+ * which only use a single physical page entry to store many logical
+ * tuples.  (MaxBTreeIndexTuplesPerPage is used to size the per-page
+ * temporary buffers used by index scans.)
+ *
+ * Note: we don't bother considering per-physical-tuple overheads here to
+ * keep things simple (value is based on how many elements a single array
+ * of heap TIDs must have to fill the space between the page header and
+ * special area).  The value is slightly higher (i.e. more conservative)
+ * than necessary as a result, which is considered acceptable.  There will
+ * only be three (very large) physical posting list tuples in leaf pages
+ * that have the largest possible number of heap TIDs/logical tuples.
+ */
+#define MaxBTreeIndexTuplesPerPage \
+	(int) ((BLCKSZ - SizeOfPageHeaderData - sizeof(BTPageOpaqueData)) / \
+		   sizeof(ItemPointerData))
+
 /*
  * The leaf-page fillfactor defaults to 90% but is user-adjustable.
  * For pages above the leaf level, we use a fixed 70% fillfactor.
@@ -230,16 +256,15 @@ typedef struct BTMetaPageData
  * tuples (non-pivot tuples).  _bt_check_natts() enforces the rules
  * described here.
  *
- * Non-pivot tuple format:
+ * Non-pivot tuple format (plain/non-posting variant):
  *
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
- * All other types of index tuples ("pivot" tuples) only have key columns,
- * since pivot tuples only exist to represent how the key space is
+ * Non-pivot tuples complement pivot tuples, which only have key columns.
+ * The sole purpose of pivot tuples is to represent how the key space is
  * separated.  In general, any B-Tree index that has more than one level
  * (i.e. any index that does not just consist of a metapage and a single
  * leaf root page) must have some number of pivot tuples, since pivot
@@ -263,9 +288,9 @@ typedef struct BTMetaPageData
  * offset field only stores the number of columns/attributes when the
  * INDEX_ALT_TID_MASK bit is set, which doesn't count the trailing heap
  * TID column sometimes stored in pivot tuples -- that's represented by
- * the presence of BT_HEAP_TID_ATTR.  The INDEX_ALT_TID_MASK bit in t_info
- * is always set on BTREE_VERSION 4.  BT_HEAP_TID_ATTR can only be set on
- * BTREE_VERSION 4.
+ * the presence of BT_PIVOT_HEAP_TID_ATTR.  The INDEX_ALT_TID_MASK bit in
+ * t_info is always set on BTREE_VERSION 4.  BT_PIVOT_HEAP_TID_ATTR can
+ * only be set on BTREE_VERSION 4.
  *
  * In version 3 indexes, the INDEX_ALT_TID_MASK flag might not be set in
  * pivot tuples.  In that case, the number of key columns is implicitly
@@ -283,87 +308,252 @@ typedef struct BTMetaPageData
  * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
  * number of columns/attributes <= INDEX_MAX_KEYS.
  *
+ * Sometimes non-pivot tuples also use a representation that repurposes
+ * t_tid to store metadata rather than a TID.  Postgres 13 introduced a new
+ * non-pivot tuple format to support deduplication: posting list tuples.
+ * Deduplication merges together multiple equal non-pivot tuples into a
+ * logically equivalent, space efficient representation.  A posting list is
+ * an array of ItemPointerData elements.  Regular non-pivot tuples are
+ * merged together to form posting list tuples lazily, at the point where
+ * we'd otherwise have to split a leaf page.
+ *
+ * Posting tuple format (alternative non-pivot tuple representation):
+ *
+ *  t_tid | t_info | key values | posting list (TID array)
+ *
+ * Posting list tuples are recognized as such by having the
+ * INDEX_ALT_TID_MASK status bit set in t_info and the BT_IS_POSTING status
+ * bit set in t_tid.  These flags redefine the content of the posting
+ * tuple's t_tid to store an offset to the posting list, as well as the
+ * total number of posting list array elements.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items present in the tuple, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.  Like any non-pivot tuple, the number of columns stored is
+ * always implicitly the total number in the index (in practice there can
+ * never be non-key columns stored, since deduplication is not supported
+ * with INCLUDE indexes).
+ *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
-#define BT_HEAP_TID_ATTR			0x1000
-
-/* Get/set downlink block number in pivot tuple */
-#define BTreeTupleGetDownLink(itup) \
-	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
-#define BTreeTupleSetDownLink(itup, blkno) \
-	ItemPointerSetBlockNumber(&((itup)->t_tid), (blkno))
+#define BT_N_POSTING_OFFSET_MASK	0x0FFF
+#define BT_PIVOT_HEAP_TID_ATTR		0x1000
+#define BT_IS_POSTING				0x2000
 
 /*
- * Get/set leaf page highkey's link. During the second phase of deletion, the
- * target leaf page's high key may point to an ancestor page (at all other
- * times, the leaf level high key's link is not used).  See the nbtree README
- * for full details.
+ * Note: BTreeTupleIsPivot() can have false negatives (but not false
+ * positives) when used with !heapkeyspace indexes
  */
-#define BTreeTupleGetTopParent(itup) \
-	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
-#define BTreeTupleSetTopParent(itup, blkno)	\
-	do { \
-		ItemPointerSetBlockNumber(&((itup)->t_tid), (blkno)); \
-		BTreeTupleSetNAtts((itup), 0); \
-	} while(0)
+static inline bool
+BTreeTupleIsPivot(IndexTuple itup)
+{
+	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+		return false;
+	/* absence of BT_IS_POSTING in offset number indicates pivot tuple */
+	if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) & BT_IS_POSTING) != 0)
+		return false;
+
+	return true;
+}
+
+static inline bool
+BTreeTupleIsPosting(IndexTuple itup)
+{
+	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+		return false;
+	/* presence of BT_IS_POSTING in offset number indicates posting tuple */
+	if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) & BT_IS_POSTING) == 0)
+		return false;
+
+	return true;
+}
+
+static inline void
+BTreeTupleSetPosting(IndexTuple itup, int nhtids, int postingoffset)
+{
+	Assert(nhtids > 1 && (nhtids & BT_N_POSTING_OFFSET_MASK) == nhtids);
+	Assert(postingoffset == MAXALIGN(postingoffset));
+	Assert(postingoffset < INDEX_SIZE_MASK);
+
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	ItemPointerSetOffsetNumber(&itup->t_tid, (nhtids | BT_IS_POSTING));
+	ItemPointerSetBlockNumber(&itup->t_tid, postingoffset);
+}
+
+static inline uint16
+BTreeTupleGetNPosting(IndexTuple posting)
+{
+	OffsetNumber existing;
+
+	Assert(BTreeTupleIsPosting(posting));
+
+	existing = ItemPointerGetOffsetNumberNoCheck(&posting->t_tid);
+	return (existing & BT_N_POSTING_OFFSET_MASK);
+}
+
+static inline uint32
+BTreeTupleGetPostingOffset(IndexTuple posting)
+{
+	Assert(BTreeTupleIsPosting(posting));
+
+	return ItemPointerGetBlockNumberNoCheck(&posting->t_tid);
+}
+
+static inline ItemPointer
+BTreeTupleGetPosting(IndexTuple posting)
+{
+	return (ItemPointer) ((char *) posting +
+						  BTreeTupleGetPostingOffset(posting));
+}
+
+static inline ItemPointer
+BTreeTupleGetPostingN(IndexTuple posting, int n)
+{
+	return BTreeTupleGetPosting(posting) + n;
+}
 
 /*
- * Get/set number of attributes within B-tree index tuple.
+ * Get/set downlink block number in pivot tuple.
+ *
+ * Note: Cannot assert that tuple is a pivot tuple.  If we did so then
+ * !heapkeyspace indexes would exhibit false positive assertion failures.
+ */
+static inline BlockNumber
+BTreeTupleGetDownLink(IndexTuple pivot)
+{
+	return ItemPointerGetBlockNumberNoCheck(&pivot->t_tid);
+}
+
+static inline void
+BTreeTupleSetDownLink(IndexTuple pivot, BlockNumber blkno)
+{
+	ItemPointerSetBlockNumber(&pivot->t_tid, blkno);
+}
+
+/*
+ * Get number of attributes within tuple.
  *
  * Note that this does not include an implicit tiebreaker heap TID
  * attribute, if any.  Note also that the number of key attributes must be
  * explicitly represented in all heapkeyspace pivot tuples.
+ *
+ * Note: This is defined at a macro rather than an inline function to
+ * avoid including rel.h.
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		(BTreeTupleIsPivot(itup)) ? \
 		( \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
 		IndexRelationGetNumberOfAttributes(rel) \
 	)
-#define BTreeTupleSetNAtts(itup, n) \
-	do { \
-		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
-	} while(0)
 
 /*
- * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Set number of attributes in tuple, making it into a pivot tuple
  */
-#define BTreeTupleGetHeapTID(itup) \
-	( \
-	  (itup)->t_info & INDEX_ALT_TID_MASK && \
-	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
-	  ( \
-		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
-					   sizeof(ItemPointerData)) \
-	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
-	)
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int natts)
+{
+	Assert(natts <= INDEX_MAX_KEYS);
+
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	/* BT_IS_POSTING bit may be unset -- tuple always becomes a pivot tuple */
+	ItemPointerSetOffsetNumber(&itup->t_tid, natts);
+	Assert(BTreeTupleIsPivot(itup));
+}
+
 /*
- * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * Set the bit indicating heap TID attribute present in pivot tuple
  */
-#define BTreeTupleSetAltHeapTID(itup) \
-	do { \
-		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
-								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
-	} while(0)
+static inline void
+BTreeTupleSetAltHeapTID(IndexTuple pivot)
+{
+	OffsetNumber existing;
+
+	Assert(BTreeTupleIsPivot(pivot));
+
+	existing = ItemPointerGetOffsetNumberNoCheck(&pivot->t_tid);
+	ItemPointerSetOffsetNumber(&pivot->t_tid,
+							   existing | BT_PIVOT_HEAP_TID_ATTR);
+}
+
+/*
+ * Get/set leaf page's "top parent" link from its high key.  Used during page
+ * deletion.
+ *
+ * Note: Cannot assert that tuple is a pivot tuple.  If we did so then
+ * !heapkeyspace indexes would exhibit false positive assertion failures.
+ */
+static inline BlockNumber
+BTreeTupleGetTopParent(IndexTuple leafhikey)
+{
+	return ItemPointerGetBlockNumberNoCheck(&leafhikey->t_tid);
+}
+
+static inline void
+BTreeTupleSetTopParent(IndexTuple leafhikey, BlockNumber blkno)
+{
+	ItemPointerSetBlockNumber(&leafhikey->t_tid, blkno);
+	BTreeTupleSetNAtts(leafhikey, 0);
+}
+
+/*
+ * Get tiebreaker heap TID attribute, if any.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
+ */
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+	if (BTreeTupleIsPivot(itup))
+	{
+		/* Pivot tuple heap TID representation? */
+		if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+			 BT_PIVOT_HEAP_TID_ATTR) != 0)
+			return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+								  sizeof(ItemPointerData));
+
+		/* Heap TID attribute was truncated */
+		return NULL;
+	}
+	else if (BTreeTupleIsPosting(itup))
+		return BTreeTupleGetPosting(itup);
+
+	return &itup->t_tid;
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list tuple.
+ *
+ * Works with non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxHeapTID(IndexTuple itup)
+{
+	Assert(!BTreeTupleIsPivot(itup));
+
+	if (BTreeTupleIsPosting(itup))
+	{
+		uint16		nposting = BTreeTupleGetNPosting(itup);
+
+		return BTreeTupleGetPostingN(itup, nposting - 1);
+	}
+
+	return &itup->t_tid;
+}
 
 /*
  *	Operator strategy numbers for B-tree have been moved to access/stratnum.h,
@@ -435,6 +625,11 @@ typedef BTStackData *BTStack;
  * indexes whose version is >= version 4.  It's convenient to keep this close
  * by, rather than accessing the metapage repeatedly.
  *
+ * safededup is set to indicate that index may use dynamic deduplication
+ * safely (index storage parameter separately indicates if deduplication is
+ * currently in use).  This is also a property of the index relation rather
+ * than an indexscan that is kept around for convenience.
+ *
  * anynullkeys indicates if any of the keys had NULL value when scankey was
  * built from index tuple (note that already-truncated tuple key attributes
  * set NULL as a placeholder key value, which also affects value of
@@ -470,6 +665,7 @@ typedef BTStackData *BTStack;
 typedef struct BTScanInsertData
 {
 	bool		heapkeyspace;
+	bool		safededup;
 	bool		anynullkeys;
 	bool		nextkey;
 	bool		pivotsearch;
@@ -508,10 +704,60 @@ typedef struct BTInsertStateData
 	bool		bounds_valid;
 	OffsetNumber low;
 	OffsetNumber stricthigh;
+
+	/*
+	 * if _bt_binsrch_insert found the location inside existing posting list,
+	 * save the position inside the list.  -1 sentinel value indicates overlap
+	 * with an existing posting list tuple that has its LP_DEAD bit set.
+	 */
+	int			postingoff;
 } BTInsertStateData;
 
 typedef BTInsertStateData *BTInsertState;
 
+/*
+ * BTDedupStateData is a working area used during deduplication.
+ *
+ * The status info fields track the state of a whole-page deduplication pass.
+ * State about the current pending posting list is also tracked.
+ *
+ * A pending posting list is comprised of a contiguous group of equal physical
+ * items from the page, starting from page offset number 'baseoff'.  This is
+ * the offset number of the "base" tuple for new posting list.  'nitems' is
+ * the current total number of existing items from the page that will be
+ * merged to make a new posting list tuple, including the base tuple item.
+ * (Existing physical items may themselves be posting list tuples, or regular
+ * non-pivot tuples.)
+ *
+ * Note that when deduplication merges together existing physical tuples, the
+ * page is modified eagerly.  This makes tracking the details of more than a
+ * single pending posting list at a time unnecessary.  The total size of the
+ * existing tuples to be freed when pending posting list is processed gets
+ * tracked by 'phystupsize'.  This information allows deduplication to
+ * calculate the space saving for each new posting list tuple, and for the
+ * entire pass over the page as a whole.
+ */
+typedef struct BTDedupStateData
+{
+	/* Deduplication status info for entire pass over page */
+	Size		maxitemsize;	/* Limit on size of final tuple */
+	bool		checkingunique; /* Use unique index strategy? */
+	OffsetNumber skippedbase;	/* First offset skipped by checkingunique */
+
+	/* Metadata about base tuple of current pending posting list */
+	IndexTuple	base;			/* Use to form new posting list */
+	OffsetNumber baseoff;		/* page offset of base */
+	Size		basetupsize;	/* base size without posting list */
+
+	/* Other metadata about pending posting list */
+	ItemPointer htids;			/* Heap TIDs in pending posting list */
+	int			nhtids;			/* Number of heap TIDs in nhtids array */
+	int			nitems;			/* Number of existing physical tuples */
+	Size		phystupsize;	/* Includes line pointer overhead */
+} BTDedupStateData;
+
+typedef BTDedupStateData *BTDedupState;
+
 /*
  * BTScanOpaqueData is the btree-private state needed for an indexscan.
  * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
@@ -535,7 +781,10 @@ typedef BTInsertStateData *BTInsertState;
  * If we are doing an index-only scan, we save the entire IndexTuple for each
  * matched item, otherwise only its heap TID and offset.  The IndexTuples go
  * into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array.  Posting list tuples store a "base" tuple once,
+ * allowing the same key to be returned for each logical tuple associated
+ * with the physical posting list tuple (i.e. for each TID from the posting
+ * list).
  */
 
 typedef struct BTScanPosItem	/* what we remember about each match */
@@ -579,7 +828,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxBTreeIndexTuplesPerPage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -687,6 +936,7 @@ typedef struct BTOptions
 	int			fillfactor;		/* page fill factor in percent (0..100) */
 	/* fraction of newly inserted tuples prior to trigger index cleanup */
 	float8		vacuum_cleanup_index_scale_factor;
+	bool		deduplication;	/* Use deduplication where safe? */
 } BTOptions;
 
 #define BTGetFillFactor(relation) \
@@ -695,8 +945,16 @@ typedef struct BTOptions
 	 (relation)->rd_options ? \
 	 ((BTOptions *) (relation)->rd_options)->fillfactor : \
 	 BTREE_DEFAULT_FILLFACTOR)
+#define BTGetUseDedup(relation) \
+	(AssertMacro(relation->rd_rel->relkind == RELKIND_INDEX && \
+				 relation->rd_rel->relam == BTREE_AM_OID), \
+	((relation)->rd_options ? \
+	 ((BTOptions *) (relation)->rd_options)->deduplication : \
+	 BTGetUseDedupGUC(relation)))
 #define BTGetTargetPageFreeSpace(relation) \
 	(BLCKSZ * (100 - BTGetFillFactor(relation)) / 100)
+#define BTGetUseDedupGUC(relation) \
+	(relation->rd_index->indisunique || btree_deduplication)
 
 /*
  * Constant definition for progress reporting.  Phase numbers must match
@@ -743,6 +1001,22 @@ extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
 extern void _bt_parallel_done(IndexScanDesc scan);
 extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
 
+/*
+ * prototypes for functions in nbtdedup.c
+ */
+extern void _bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
+							   IndexTuple newitem, Size newitemsz,
+							   bool checkingunique);
+extern void _bt_dedup_start_pending(BTDedupState state, IndexTuple base,
+									OffsetNumber base_off);
+extern bool _bt_dedup_save_htid(BTDedupState state, IndexTuple itup);
+extern Size _bt_dedup_finish_pending(Buffer buf, BTDedupState state,
+									 bool logged);
+extern IndexTuple _bt_form_posting(IndexTuple base, ItemPointer htids,
+								   int nhtids);
+extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
+								   int postingoff);
+
 /*
  * prototypes for functions in nbtinsert.c
  */
@@ -761,14 +1035,16 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
 /*
  * prototypes for functions in nbtpage.c
  */
-extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level);
+extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+							 bool safededup);
 extern void _bt_update_meta_cleanup_info(Relation rel,
 										 TransactionId oldestBtpoXact, float8 numHeapTuples);
 extern void _bt_upgrademetapage(Page page);
 extern Buffer _bt_getroot(Relation rel, int access);
 extern Buffer _bt_gettrueroot(Relation rel);
 extern int	_bt_getrootheight(Relation rel);
-extern bool _bt_heapkeyspace(Relation rel);
+extern void _bt_metaversion(Relation rel, bool *heapkeyspace,
+							bool *safededup);
 extern void _bt_checkpage(Relation rel, Buffer buf);
 extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
 extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -777,7 +1053,9 @@ extern void _bt_relbuf(Relation rel, Buffer buf);
 extern void _bt_pageinit(Page page, Size size);
 extern bool _bt_page_recyclable(Page page);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
-								OffsetNumber *deletable, int ndeletable);
+								OffsetNumber *deletable, int ndeletable,
+								OffsetNumber *updatable, IndexTuple *updated,
+								int nupdatable);
 extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *deletable, int ndeletable,
 								Relation heapRel);
@@ -830,6 +1108,7 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern bool _bt_opclasses_support_dedup(Relation index);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 776a9bd723..de855efbba 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP_PAGE	0x50	/* deduplicate tuples on leaf page */
+#define XLOG_BTREE_INSERT_POST	0x60	/* add index tuple with posting split */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -53,21 +54,34 @@ typedef struct xl_btree_metadata
 	uint32		fastlevel;
 	TransactionId oldest_btpo_xact;
 	float8		last_cleanup_num_heap_tuples;
+	bool		safededup;
 } xl_btree_metadata;
 
 /*
  * This is what we need to know about simple (without split) insert.
  *
- * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
- * Note that INSERT_META implies it's not a leaf page.
+ * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META, and
+ * INSERT_POST.  Note that INSERT_META and INSERT_UPPER implies it's not a
+ * leaf page, while INSERT_POST and INSERT_LEAF imply that it must be a leaf
+ * page.
  *
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page
  * Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
  * Backup Blk 2: xl_btree_metadata, if INSERT_META
+ *
+ * Note: The new tuple is actually the "original" new item in the posting
+ * list split insert case (i.e. the INSERT_POST case).  A split offset for
+ * the posting list is logged before the original new item.  Recovery needs
+ * both, since it must do an in-place update of the existing posting list
+ * that was split as an extra step.  Also, recovery generates a "final"
+ * newitem.  See _bt_swap_posting() for details on posting list splits.
  */
 typedef struct xl_btree_insert
 {
 	OffsetNumber offnum;
+
+	/* POSTING SPLIT OFFSET FOLLOWS (INSERT_POST case) */
+	/* NEW TUPLE ALWAYS FOLLOWS AT THE END */
 } xl_btree_insert;
 
 #define SizeOfBtreeInsert	(offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
@@ -92,8 +106,37 @@ typedef struct xl_btree_insert
  * Backup Blk 0: original page / new left page
  *
  * The left page's data portion contains the new item, if it's the _L variant.
- * An IndexTuple representing the high key of the left page must follow with
- * either variant.
+ * _R variant split records generally do not have a newitem (_R variant leaf
+ * page split records that must deal with a posting list split will include an
+ * explicit newitem, though it is never used on the right page -- it is
+ * actually an orignewitem needed to update existing posting list).  The new
+ * high key of the left/original page appears last of all (and must always be
+ * present).
+ *
+ * Page split records that need the REDO routine to deal with a posting list
+ * split directly will have an explicit newitem, which is actually an
+ * orignewitem (the newitem as it was before the posting list split, not
+ * after).  A posting list split always has a newitem that comes immediately
+ * after the posting list being split (which would have overlapped with
+ * orignewitem prior to split).  Usually REDO must deal with posting list
+ * splits with an _L variant page split record, and usually both the new
+ * posting list and the final newitem go on the left page (the existing
+ * posting list will be inserted instead of the old, and the final newitem
+ * will be inserted next to that).  However, _R variant split records will
+ * include an orignewitem when the split point for the page happens to have a
+ * lastleft tuple that is also the posting list being split (leaving newitem
+ * as the page split's firstright tuple).  The existence of this corner case
+ * does not change the basic fact about newitem/orignewitem for the REDO
+ * routine: it is always state used for the left page alone.  (This is why the
+ * record's postingoff field isn't a reliable indicator of whether or not a
+ * posting list split occurred during the page split; a non-zero value merely
+ * indicates that the REDO routine must reconstruct a new posting list tuple
+ * that is needed for the left page.)
+ *
+ * This posting list split handling is equivalent to the xl_btree_insert REDO
+ * routine's INSERT_POST handling.  While the details are more complicated
+ * here, the concept and goals are exactly the same.  See _bt_swap_posting()
+ * for details on posting list splits.
  *
  * Backup Blk 1: new right page
  *
@@ -111,15 +154,32 @@ typedef struct xl_btree_split
 {
 	uint32		level;			/* tree level of page being split */
 	OffsetNumber firstright;	/* first item moved to right page */
-	OffsetNumber newitemoff;	/* new item's offset (useful for _L variant) */
+	OffsetNumber newitemoff;	/* new item's offset */
+	uint16		postingoff;		/* offset inside orig posting tuple */
 } xl_btree_split;
 
-#define SizeOfBtreeSplit	(offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit	(offsetof(xl_btree_split, postingoff) + sizeof(uint16))
+
+/*
+ * When page is deduplicated, consecutive groups of tuples with equal keys are
+ * merged together into posting list tuples.
+ *
+ * The WAL record represents the interval that describes the posing tuple
+ * that should be added to the page.
+ */
+typedef struct xl_btree_dedup
+{
+	OffsetNumber baseoff;
+	uint16		nitems;
+} xl_btree_dedup;
+
+#define SizeOfBtreeDedup 	(offsetof(xl_btree_dedup, nitems) + sizeof(uint16))
 
 /*
  * This is what we need to know about delete of individual leaf index tuples.
  * The WAL record can represent deletion of any number of index tuples on a
- * single index page when *not* executed by VACUUM.
+ * single index page when *not* executed by VACUUM.  Deletion of a subset of
+ * "logical" tuples within a posting list tuple is not supported.
  *
  * Backup Blk 0: index page
  */
@@ -152,19 +212,23 @@ typedef struct xl_btree_reuse_page
 /*
  * This is what we need to know about vacuum of individual leaf index tuples.
  * The WAL record can represent deletion of any number of index tuples on a
- * single index page when executed by VACUUM.
- *
- * Note that the WAL record in any vacuum of an index must have at least one
- * item to delete.
+ * single index page when executed by VACUUM.  It can also support "updates"
+ * of index tuples, which are how deletes of "logical" tuples contained in an
+ * existing posting list tuple are implemented. (Updates are only used when
+ * there will be some remaining logical tuples once VACUUM finishes; otherwise
+ * the physical posting list tuple can just be deleted).
  */
 typedef struct xl_btree_vacuum
 {
-	uint32		ndeleted;
+	uint16		ndeleted;
+	uint16		nupdated;
 
 	/* DELETED TARGET OFFSET NUMBERS FOLLOW */
+	/* UPDATED TARGET OFFSET NUMBERS FOLLOW */
+	/* UPDATED TUPLES FOR OVERWRITES FOLLOW */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, nupdated) + sizeof(uint16))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.
@@ -245,6 +309,8 @@ typedef struct xl_btree_newroot
 extern void btree_redo(XLogReaderState *record);
 extern void btree_desc(StringInfo buf, XLogReaderState *record);
 extern const char *btree_identify(uint8 info);
+extern void btree_xlog_startup(void);
+extern void btree_xlog_cleanup(void);
 extern void btree_mask(char *pagedata, BlockNumber blkno);
 
 #endif							/* NBTXLOG_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index c88dccfb8d..6c15df7e70 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -36,7 +36,7 @@ PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL,
 PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
 PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
 PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask)
 PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask)
 PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
 PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 79430d2b7b..18b1bf5e20 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -158,6 +158,15 @@ static relopt_bool boolRelOpts[] =
 		},
 		true
 	},
+	{
+		{
+			"deduplication",
+			"Enables deduplication on btree index leaf pages",
+			RELOPT_KIND_BTREE,
+			ShareUpdateExclusiveLock
+		},
+		true
+	},
 	/* list terminator */
 	{{NULL}}
 };
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index c16eb05416..dfba5ae39a 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -276,6 +276,10 @@ BuildIndexValueDescription(Relation indexRelation,
 /*
  * Get the latestRemovedXid from the table entries pointed at by the index
  * tuples being deleted.
+ *
+ * Note: index access methods that don't consistently use the standard
+ * IndexTuple + heap TID item pointer representation will need to provide
+ * their own version of this function.
  */
 TransactionId
 index_compute_xid_horizon_for_tuples(Relation irel,
diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index bf245f5dab..d69808e78c 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = \
 	nbtcompare.o \
+	nbtdedup.o \
 	nbtinsert.o \
 	nbtpage.o \
 	nbtree.o \
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index c60a4d0d9e..e235750597 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
 like a hint bit for a heap tuple), but physically removing tuples requires
 exclusive lock.  In the current code we try to remove LP_DEAD tuples when
 we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already).  Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)
 
 This leaves the index in a state where it has no entry for a dead tuple
 that still exists in the heap.  This is not a problem for the current
@@ -726,6 +729,152 @@ if it must.  When a page that's already full of duplicates must be split,
 the fallback strategy assumes that duplicates are mostly inserted in
 ascending heap TID order.  The page is split in a way that leaves the left
 half of the page mostly full, and the right half of the page mostly empty.
+The overall effect is that leaf page splits gracefully adapt to inserts of
+large groups of duplicates, maximizing space utilization.  Note also that
+"trapping" large groups of duplicates on the same leaf page like this makes
+deduplication more efficient.  Deduplication can be performed infrequently,
+while freeing just as much space.
+
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid (or at least delay) page splits.  Note that the
+goals for deduplication in unique indexes are rather different; see later
+section for details.  Deduplication alters the physical representation of
+tuples without changing the logical contents of the index, and without
+adding overhead to read queries.  Non-pivot tuples are merged together
+into a single physical tuple with a posting list (a simple array of heap
+TIDs with the standard item pointer format).  Deduplication is always
+applied lazily, at the point where it would otherwise be necessary to
+perform a page split.  It occurs only when LP_DEAD items have been
+removed, as our last line of defense against splitting a leaf page.  We
+can set the LP_DEAD bit with posting list tuples, though only when all
+TIDs are known dead.
+
+Our lazy approach to deduplication allows the page space accounting used
+during page splits to have absolutely minimal special case logic for
+posting lists.  Posting lists can be thought of as extra payload that
+suffix truncation will reliably truncate away as needed during page
+splits, just like non-key columns from an INCLUDE index tuple.
+Incoming/new tuples can generally be treated as non-overlapping plain
+items (though see section on posting list splits for information about how
+overlapping new/incoming items are really handled).
+
+The representation of posting lists is almost identical to the posting
+lists used by GIN, so it would be straightforward to apply GIN's varbyte
+encoding compression scheme to individual posting lists.  Posting list
+compression would break the assumptions made by posting list splits about
+page space accounting (see later section), so it's not clear how
+compression could be integrated with nbtree.  Besides, posting list
+compression does not offer a compelling trade-off for nbtree, since in
+general nbtree is optimized for consistent performance with many
+concurrent readers and writers.
+
+A major goal of our lazy approach to deduplication is to limit the
+performance impact of deduplication with random updates.  Even concurrent
+append-only inserts of the same key value will tend to have inserts of
+individual index tuples in an order that doesn't quite match heap TID
+order.  Delaying deduplication minimizes page level fragmentation.
+
+Deduplication in unique indexes
+-------------------------------
+
+Very often, the range of values that can be placed on a given leaf page in
+a unique index is fixed and permanent.  For example, a primary key on an
+identity column will usually only have page splits caused by the insertion
+of new logical rows within the rightmost leaf page.  If there is a split
+of a non-rightmost leaf page, then the split must have been triggered by
+inserts associated with an UPDATE of an existing logical row.  Splitting a
+leaf page purely to store multiple versions should be considered
+pathological, since it permanently degrades the index structure in order
+to absorb a temporary burst of duplicates.  Deduplication in unique
+indexes helps to prevent these pathological page splits.
+
+Like all index access methods, nbtree does not have direct knowledge of
+versioning or of MVCC; it deals only with physical tuples.  However, unique
+indexes implicitly give nbtree basic information about tuple versioning,
+since by definition zero or one tuples of any given key value can be
+visible to any possible MVCC snapshot (excluding index entries with NULL
+values).  When optimizations such as heapam's Heap-only tuples (HOT) happen
+to be ineffective, nbtree's on-the-fly deletion of tuples in unique indexes
+can be very important with UPDATE-heavy workloads.  Unique checking's
+LP_DEAD bit setting reliably attempts to kill old, equal index tuple
+versions.  This prevents (or at least delays) page splits that are
+necessary only because a leaf page must contain multiple physical tuples
+for the same logical row.  Deduplication in unique indexes must cooperate
+with this mechanism.  Deleting items on the page is always preferable to
+deduplication.
+
+The strategy used during a deduplication pass has significant differences
+to the strategy used for indexes that can have multiple logical rows with
+the same key value.  We're not really trying to store duplicates in a
+space efficient manner, since in the long run there won't be any
+duplicates anyway.  Rather, we're buying time for garbage collection
+mechanisms to run before a page split is needed.
+
+Unique index leaf pages only get a deduplication pass when an insertion
+(that might have to split the page) observed an existing duplicate on the
+page in passing.  This is based on the assumption that deduplication will
+only work out when _all_ new insertions are duplicates from UPDATEs.  This
+may mean that we miss an opportunity to delay a page split, but that's
+okay because our ultimate goal is to delay leaf page splits _indefinitely_
+(i.e. to prevent them altogether).  There is little point in trying to
+delay a split that is probably inevitable anyway.  This allows us to avoid
+the overhead of attempting to deduplicate with unique indexes that always
+have few or no duplicates.
+
+Posting list splits
+-------------------
+
+When the incoming tuple happens to overlap with an existing posting list,
+a posting list split is performed.  Like a page split, a posting list
+split resolves the situation where a new/incoming item "won't fit", while
+inserting the incoming item in passing (i.e. as part of the same atomic
+action).  It's possible (though not particularly likely) that an insert of
+a new item on to an almost-full page will overlap with a posting list,
+resulting in both a posting list split and a page split.  Even then, the
+atomic action that splits the posting list also inserts the new item
+(since page splits always insert the new item in passing).  Including the
+posting list split in the same atomic action as the insert avoids problems
+caused by concurrent inserts into the same posting list --  the exact
+details of how we change the posting list depend upon the new item, and
+vice-versa.  A single atomic action also minimizes the volume of extra
+WAL required for a posting list split, since we don't have to explicitly
+WAL-log the original posting list tuple.
+
+Despite piggy-backing on the same atomic action that inserts a new tuple,
+posting list splits can be thought of as a separate, extra action to the
+insert itself (or to the page split itself).  Posting list splits
+conceptually "rewrite" an insert that overlaps with an existing posting
+list into an insert that adds its final new item just to the right of
+posting list instead.  The size of the posting list won't change, and so
+page space accounting code does not need to care about posting list splits
+at all.  This is an important upside of our design; the page split point
+choice logic is very subtle even without it needing to deal with posting
+list splits.
+
+Only a few isolated extra steps are required to preserve the illusion that
+the new item never overlapped with an existing posting list in the first
+place: the heap TID of the incoming tuple is swapped with the rightmost
+heap TID from the existing/originally overlapping posting list.  Also, the
+posting-split-with-page-split case must generate a new high key based on
+an imaginary version of the original page that has both the final new item
+and the after-list-split posting tuple (page splits usually just operate
+against an imaginary version that contains the new item/item that won't
+fit).
+
+This approach avoids inventing an "eager" atomic posting split operation
+that splits the posting list without simultaneously finishing the insert
+of the incoming item.  This alternative design might seem cleaner, but it
+creates subtle problems for page space accounting.  In general, there
+might not be enough free space on the page to split a posting list such
+that the incoming/new item no longer overlaps with either posting list
+half --- the operation could fail before the actual retail insert of the
+new item even begins.  We'd end up having to handle posting list splits
+that need a page split anyway.  Besides, supporting variable "split points"
+while splitting posting lists won't actually improve overall space
+utilization.
 
 Notes About Data Representation
 -------------------------------
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
new file mode 100644
index 0000000000..d9f4e9db38
--- /dev/null
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -0,0 +1,738 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtdedup.c
+ *	  Deduplicate items in Postgres btrees.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/nbtree/nbtdedup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/nbtxlog.h"
+#include "miscadmin.h"
+#include "utils/rel.h"
+
+
+/*
+ * Try to deduplicate items to free at least enough space to avoid a page
+ * split.  This function should be called during insertion, only after LP_DEAD
+ * items were removed by _bt_vacuum_one_page() to prevent a page split.
+ * (We'll have to kill LP_DEAD items here when the page's BTP_HAS_GARBAGE hint
+ * was not set, but that should be rare.)
+ *
+ * The strategy for !checkingunique callers is to perform as much
+ * deduplication as possible to free as much space as possible now, since
+ * making it harder to set LP_DEAD bits is considered an acceptable price for
+ * not having to deduplicate the same page many times.  It is unlikely that
+ * the items on the page will have their LP_DEAD bit set in the future, since
+ * that hasn't happened before now (besides, entire posting lists can still
+ * have their LP_DEAD bit set).
+ *
+ * The strategy for checkingunique callers is completely different.
+ * Deduplication works in tandem with garbage collection, especially the
+ * LP_DEAD bit setting that takes place in _bt_check_unique().  We give up as
+ * soon as it becomes clear that enough space has been made available to
+ * insert newitem without needing to split the page.  Also, we merge together
+ * larger groups of duplicate tuples first (merging together two index tuples
+ * usually saves very little space), and avoid merging together existing
+ * posting list tuples.  The goal is to generate posting lists with TIDs that
+ * are "close together in time", in order to maximize the chances of an
+ * LP_DEAD bit being set opportunistically.  See nbtree/README for more
+ * information on deduplication within unique indexes.
+ *
+ * nbtinsert.c caller should call _bt_vacuum_one_page() before calling here.
+ * Note that this routine will delete all items on the page that have their
+ * LP_DEAD bit set, even when page's BTP_HAS_GARBAGE bit is not set (a rare
+ * edge case).  Caller can rely on that to avoid inserting a new tuple that
+ * happens to overlap with an existing posting list tuple with its LP_DEAD bit
+ * set. (Calling here with a newitemsz of 0 will reliably delete the existing
+ * item, making it possible to avoid unsetting the LP_DEAD bit just to insert
+ * the new item.  In general, posting list splits should never have to deal
+ * with a posting list tuple with its LP_DEAD bit set.)
+ *
+ * Note: If newitem contains NULL values in key attributes, caller will be
+ * !checkingunique even when rel is a unique index.  The page in question will
+ * usually have many existing items with NULLs.
+ */
+void
+_bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
+				   IndexTuple newitem, Size newitemsz, bool checkingunique)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buf);
+	BTPageOpaque opaque;
+	OffsetNumber deletable[MaxIndexTuplesPerPage];
+	BTDedupState state;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+	bool		minimal = checkingunique;
+	int			ndeletable = 0;
+	Size		pagesaving = 0;
+	int			count = 0;
+	bool		singlevalue = false;
+
+	/*
+	 * Caller should call _bt_vacuum_one_page() before calling here when it
+	 * looked like there were LP_DEAD items on the page.  However, we can't
+	 * assume that there are no LP_DEAD items (for one thing, VACUUM will
+	 * clear the BTP_HAS_GARBAGE hint without reliably removing items that are
+	 * marked LP_DEAD).  We must be careful to clear all LP_DEAD items because
+	 * posting list splits cannot go ahead if an existing posting list item
+	 * has its LP_DEAD bit set. (Also, we don't want to unnecessarily unset
+	 * LP_DEAD bits when deduplicating items on the page below, though that
+	 * should be harmless.)
+	 *
+	 * The opposite problem is also possible: _bt_vacuum_one_page() won't
+	 * clear the BTP_HAS_GARBAGE bit when it is falsely set (i.e. when there
+	 * are no LP_DEAD bits).  This probably doesn't matter in practice, since
+	 * it's only a hint, and VACUUM will clear it at some point anyway.  Even
+	 * still, we clear the BTP_HAS_GARBAGE hint reliably here. (Seems like a
+	 * good idea for deduplication to only begin when we unambiguously have no
+	 * LP_DEAD items.)
+	 */
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	minoff = P_FIRSTDATAKEY(opaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+
+		if (ItemIdIsDead(itemid))
+			deletable[ndeletable++] = offnum;
+	}
+
+	if (ndeletable > 0)
+	{
+		_bt_delitems_delete(rel, buf, deletable, ndeletable, heapRel);
+
+		/*
+		 * Return when a split will be avoided.  This is equivalent to
+		 * avoiding a split by following the usual _bt_vacuum_one_page() path.
+		 */
+		if (PageGetFreeSpace(page) >= newitemsz)
+			return;
+
+		/*
+		 * Reconsider number of items on page, in case _bt_delitems_delete()
+		 * managed to delete an item or two
+		 */
+		minoff = P_FIRSTDATAKEY(opaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+	}
+	else if (P_HAS_GARBAGE(opaque))
+	{
+		opaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+		MarkBufferDirtyHint(buf, true);
+	}
+
+	/*
+	 * Return early in case where caller just wants us to kill an existing
+	 * LP_DEAD posting list tuple
+	 */
+	Assert(!P_HAS_GARBAGE(opaque));
+	if (newitemsz == 0)
+		return;
+
+	/*
+	 * By here, it's clear that deduplication will definitely be attempted.
+	 * Initialize deduplication state.
+	 */
+	state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+	state->maxitemsize = Min(BTMaxItemSize(page), INDEX_SIZE_MASK);
+	state->checkingunique = checkingunique;
+	state->skippedbase = InvalidOffsetNumber;
+	/* Metadata about base tuple of current pending posting list */
+	state->base = NULL;
+	state->baseoff = InvalidOffsetNumber;
+	state->basetupsize = 0;
+	/* Metadata about current pending posting list TIDs */
+	state->htids = NULL;
+	state->nhtids = 0;
+	state->nitems = 0;
+	/* Size of all physical tuples to be replaced by pending posting list */
+	state->phystupsize = 0;
+
+	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+	newitemsz += sizeof(ItemIdData);
+	/* Conservatively size array */
+	state->htids = palloc(state->maxitemsize);
+
+	/*
+	 * Determine if a "single value" strategy page split is likely to occur
+	 * shortly after deduplication finishes.  It should be possible for the
+	 * single value split to find a split point that packs the left half of
+	 * the split BTREE_SINGLEVAL_FILLFACTOR% full.
+	 */
+	if (!checkingunique)
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+
+		itemid = PageGetItemId(page, minoff);
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		if (_bt_keep_natts_fast(rel, newitem, itup) > natts)
+		{
+			itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
+			itup = (IndexTuple) PageGetItem(page, itemid);
+
+			/*
+			 * Use different strategy if future page split likely to need to
+			 * use "single value" strategy
+			 */
+			if (_bt_keep_natts_fast(rel, newitem, itup) > natts)
+				singlevalue = true;
+		}
+	}
+
+	/*
+	 * Iterate over tuples on the page, try to deduplicate them into posting
+	 * lists and insert into new page.  NOTE: We must reassess the max offset
+	 * on each iteration, since the number of items on the page goes down as
+	 * existing items are deduplicated.
+	 */
+	offnum = minoff;
+retry:
+	while (offnum <= PageGetMaxOffsetNumber(page))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(!ItemIdIsDead(itemid));
+
+		if (state->nitems == 0)
+		{
+			/*
+			 * No previous/base tuple for the data item -- use the data item
+			 * as base tuple of pending posting list
+			 */
+			_bt_dedup_start_pending(state, itup, offnum);
+		}
+		else if (_bt_keep_natts_fast(rel, state->base, itup) > natts &&
+				 _bt_dedup_save_htid(state, itup))
+		{
+			/*
+			 * Tuple is equal to base tuple of pending posting list.  Heap
+			 * TID(s) for itup have been saved in state.  The next iteration
+			 * will also end up here if it's possible to merge the next tuple
+			 * into the same pending posting list.
+			 */
+		}
+		else
+		{
+			/*
+			 * Tuple is not equal to pending posting list tuple, or
+			 * _bt_dedup_save_htid() opted to not merge current item into
+			 * pending posting list for some other reason (e.g., adding more
+			 * TIDs would have caused posting list to exceed BTMaxItemSize()
+			 * limit).
+			 *
+			 * If state contains pending posting list with more than one item,
+			 * form new posting tuple, and update the page.  Otherwise, reset
+			 * the state and move on.
+			 */
+			pagesaving += _bt_dedup_finish_pending(buf, state,
+												   RelationNeedsWAL(rel));
+			count++;
+
+			/*
+			 * When caller is a checkingunique caller and we have deduplicated
+			 * enough to avoid a page split, do minimal deduplication in case
+			 * the remaining items are about to be marked dead within
+			 * _bt_check_unique().
+			 */
+			if (minimal && pagesaving >= newitemsz)
+				break;
+
+			/*
+			 * Consider special steps when a future page split of the leaf
+			 * page is likely to occur using nbtsplitloc.c's "single value"
+			 * strategy
+			 */
+			if (singlevalue)
+			{
+				/*
+				 * Adjust maxitemsize so that there isn't a third and final
+				 * 1/3 of a page width tuple that fills the page to capacity.
+				 * The third tuple produced should be smaller than the first
+				 * two by an amount equal to the free space that nbtsplitloc.c
+				 * is likely to want to leave behind when the page it split.
+				 * When there are 3 posting lists on the page, then we end
+				 * deduplication.  Remaining tuples on the page can be
+				 * deduplicated later, when they're on the new right sibling
+				 * of this page, and the new sibling page needs to be split in
+				 * turn.
+				 *
+				 * Note that it doesn't matter if there are items on the page
+				 * that were already 1/3 of a page during current pass;
+				 * they'll still count as the first two posting list tuples.
+				 */
+				if (count == 2)
+				{
+					Size		leftfree;
+
+					/* This calculation needs to match nbtsplitloc.c */
+					leftfree = PageGetPageSize(page) - SizeOfPageHeaderData -
+						MAXALIGN(sizeof(BTPageOpaqueData));
+					/* Subtract predicted size of new high key */
+					leftfree -= newitemsz + MAXALIGN(sizeof(ItemPointerData));
+
+					/*
+					 * Reduce maxitemsize by an amount equal to target free
+					 * space on left half of page
+					 */
+					state->maxitemsize -= leftfree *
+						((100 - BTREE_SINGLEVAL_FILLFACTOR) / 100.0);
+				}
+				else if (count == 3)
+					break;
+			}
+
+			/*
+			 * Next iteration starts immediately after base tuple offset (this
+			 * will be the next offset on the page when we didn't modify the
+			 * page)
+			 */
+			offnum = state->baseoff;
+		}
+
+		offnum = OffsetNumberNext(offnum);
+	}
+
+	/* Handle the last item when pending posting list is not empty */
+	if (state->nitems != 0)
+	{
+		pagesaving += _bt_dedup_finish_pending(buf, state,
+											   RelationNeedsWAL(rel));
+		count++;
+	}
+
+	if (pagesaving < newitemsz && state->skippedbase != InvalidOffsetNumber)
+	{
+		/*
+		 * Didn't free enough space for new item in first checkingunique pass.
+		 * Try making a second pass over the page, this time starting from the
+		 * first candidate posting list base offset that was skipped over in
+		 * the first pass (only do a second pass when this actually happened).
+		 *
+		 * The second pass over the page may deduplicate items that were
+		 * initially passed over due to concerns about limiting the
+		 * effectiveness of LP_DEAD bit setting within _bt_check_unique().
+		 * Note that the second pass will still stop deduplicating as soon as
+		 * enough space has been freed to avoid an immediate page split.
+		 */
+		Assert(state->checkingunique);
+		offnum = state->skippedbase;
+
+		state->checkingunique = false;
+		state->skippedbase = InvalidOffsetNumber;
+		state->phystupsize = 0;
+		state->nitems = 0;
+		state->base = NULL;
+		state->baseoff = InvalidOffsetNumber;
+		state->basetupsize = 0;
+		goto retry;
+	}
+
+	/* Local space accounting should agree with page accounting */
+	Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
+
+	/* cannot leak memory here */
+	pfree(state->htids);
+	pfree(state);
+}
+
+/*
+ * Create a new pending posting list tuple based on caller's tuple.
+ *
+ * Every tuple processed by the deduplication routines either becomes the base
+ * tuple for a posting list, or gets its heap TID(s) accepted into a pending
+ * posting list.  A tuple that starts out as the base tuple for a posting list
+ * will only actually be rewritten within _bt_dedup_finish_pending() when
+ * there was at least one successful call to _bt_dedup_save_htid().
+ */
+void
+_bt_dedup_start_pending(BTDedupState state, IndexTuple base,
+						OffsetNumber baseoff)
+{
+	Assert(state->nhtids == 0);
+	Assert(state->nitems == 0);
+	Assert(!BTreeTupleIsPivot(base));
+
+	/*
+	 * Copy heap TIDs from new base tuple for new candidate posting list into
+	 * ipd array.  Assume that we'll eventually create a new posting tuple by
+	 * merging later tuples with this existing one, though we may not.
+	 */
+	if (!BTreeTupleIsPosting(base))
+	{
+		memcpy(state->htids, base, sizeof(ItemPointerData));
+		state->nhtids = 1;
+		/* Save size of tuple without any posting list */
+		state->basetupsize = IndexTupleSize(base);
+	}
+	else
+	{
+		int			nposting;
+
+		nposting = BTreeTupleGetNPosting(base);
+		memcpy(state->htids, BTreeTupleGetPosting(base),
+			   sizeof(ItemPointerData) * nposting);
+		state->nhtids = nposting;
+		/* Save size of tuple without any posting list */
+		state->basetupsize = BTreeTupleGetPostingOffset(base);
+	}
+
+	/*
+	 * Save new base tuple itself -- it'll be needed if we actually create a
+	 * new posting list from new pending posting list.
+	 *
+	 * Must maintain size of all tuples (including line pointer overhead) to
+	 * calculate space savings on page within _bt_dedup_finish_pending().
+	 * Also, save number of base tuple logical tuples so that we can save
+	 * cycles in the common case where an existing posting list can't or won't
+	 * be merged with other tuples on the page.
+	 */
+	state->nitems = 1;
+	state->base = base;
+	state->baseoff = baseoff;
+	state->phystupsize = MAXALIGN(IndexTupleSize(base)) + sizeof(ItemIdData);
+}
+
+/*
+ * Save itup heap TID(s) into pending posting list where possible.
+ *
+ * Returns bool indicating if the pending posting list managed by state has
+ * itup's heap TID(s) saved.  When this is false, enlarging the pending
+ * posting list by the required amount would exceed the maxitemsize limit, so
+ * caller must finish the pending posting list tuple.  (Generally itup becomes
+ * the base tuple of caller's new pending posting list).
+ */
+bool
+_bt_dedup_save_htid(BTDedupState state, IndexTuple itup)
+{
+	int			nhtids;
+	ItemPointer htids;
+	Size		mergedtupsz;
+
+	Assert(!BTreeTupleIsPivot(itup));
+
+	if (!BTreeTupleIsPosting(itup))
+	{
+		nhtids = 1;
+		htids = &itup->t_tid;
+	}
+	else
+	{
+		nhtids = BTreeTupleGetNPosting(itup);
+		htids = BTreeTupleGetPosting(itup);
+	}
+
+	/*
+	 * Don't append (have caller finish pending posting list as-is) if
+	 * appending heap TID(s) from itup would put us over limit.
+	 *
+	 * This calculation needs to match the code used within _bt_form_posting()
+	 * for new posting list tuples.
+	 */
+	mergedtupsz = MAXALIGN(state->basetupsize +
+						   (state->nhtids + nhtids) * sizeof(ItemPointerData));
+
+	if (mergedtupsz > state->maxitemsize)
+		return false;
+
+	/* Don't merge existing posting lists in first checkingunique pass */
+	if (state->checkingunique &&
+		(BTreeTupleIsPosting(state->base) || nhtids > 1))
+	{
+		/* May begin here if second pass over page is required */
+		if (state->skippedbase == InvalidOffsetNumber)
+			state->skippedbase = state->baseoff;
+		return false;
+	}
+
+	/*
+	 * Save heap TIDs to pending posting list tuple -- itup can be merged into
+	 * pending posting list
+	 */
+	state->nitems++;
+	memcpy(state->htids + state->nhtids, htids,
+		   sizeof(ItemPointerData) * nhtids);
+	state->nhtids += nhtids;
+	state->phystupsize += MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
+
+	return true;
+}
+
+/*
+ * Finalize pending posting list tuple, and add it to the page.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * Returns space saving from deduplicating to make a new posting list tuple.
+ * Note that this includes line pointer overhead.  This is zero in the case
+ * where no deduplication was possible.
+ */
+Size
+_bt_dedup_finish_pending(Buffer buf, BTDedupState state, bool logged)
+{
+	Size		spacesaving = 0;
+	Page		page = BufferGetPage(buf);
+	int			minimum = 2;
+
+	Assert(state->nitems > 0);
+	Assert(state->nitems <= state->nhtids);
+
+	/*
+	 * Only create a posting list when at least 3 heap TIDs will appear in the
+	 * checkingunique case (checkingunique strategy won't merge existing
+	 * posting list tuples, so we know that the number of items here must also
+	 * be the total number of heap TIDs).  Creating a new posting lists with
+	 * only two heap TIDs won't even save enough space to fit another
+	 * duplicate with the same key as the posting list.  This is a bad
+	 * trade-off if there is a chance that the LP_DEAD bit can be set for
+	 * either existing tuple by putting off deduplication.
+	 *
+	 * (Note that a second pass over the page can deduplicate the item if that
+	 * is truly the only way to avoid a page split for checkingunique caller.)
+	 */
+	Assert(!state->checkingunique || state->nitems == 1 ||
+		   state->nhtids == state->nitems);
+	if (state->checkingunique)
+	{
+		minimum = 3;
+		/* May begin here if second pass over page is required */
+		if (state->nitems == 2 && state->skippedbase == InvalidOffsetNumber)
+			state->skippedbase = state->baseoff;
+	}
+
+	if (state->nitems >= minimum)
+	{
+		IndexTuple	final;
+		Size		finalsz;
+		OffsetNumber offnum;
+		OffsetNumber deletable[MaxOffsetNumber];
+		int			ndeletable = 0;
+
+		/* find all tuples that will be replaced with this new posting tuple */
+		for (offnum = state->baseoff;
+			 offnum < state->baseoff + state->nitems;
+			 offnum = OffsetNumberNext(offnum))
+			deletable[ndeletable++] = offnum;
+
+		/* Form a tuple with a posting list */
+		final = _bt_form_posting(state->base, state->htids, state->nhtids);
+		finalsz = IndexTupleSize(final);
+		spacesaving = state->phystupsize - (finalsz + sizeof(ItemIdData));
+		/* Must save some space, and must not exceed tuple limits */
+		Assert(spacesaving > 0 && spacesaving < BLCKSZ);
+		Assert(finalsz <= state->maxitemsize);
+		Assert(finalsz == MAXALIGN(IndexTupleSize(final)));
+
+		START_CRIT_SECTION();
+
+		/* Delete original items */
+		PageIndexMultiDelete(page, deletable, ndeletable);
+		/* Insert posting tuple, replacing original items */
+		if (PageAddItem(page, (Item) final, finalsz, state->baseoff, false,
+						false) == InvalidOffsetNumber)
+			elog(ERROR, "deduplication failed to add tuple to page");
+
+		MarkBufferDirty(buf);
+
+		/* Log deduplicated items */
+		if (logged)
+		{
+			XLogRecPtr	recptr;
+			xl_btree_dedup xlrec_dedup;
+
+			xlrec_dedup.baseoff = state->baseoff;
+			xlrec_dedup.nitems = state->nitems;
+
+			XLogBeginInsert();
+			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+			XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+			recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP_PAGE);
+
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
+
+		pfree(final);
+	}
+
+	/* Reset state for next pending posting list */
+	state->nhtids = 0;
+	state->nitems = 0;
+	state->phystupsize = 0;
+
+	return spacesaving;
+}
+
+/*
+ * Build a posting list tuple based on caller's "base" index tuple and list of
+ * heap TIDs.  When nhtids == 1, builds a standard non-pivot tuple without a
+ * posting list. (Posting list tuples can never have a single heap TID, partly
+ * because that ensures that deduplication always reduces final MAXALIGN()'d
+ * size of entire tuple.)
+ *
+ * Convention is that posting list starts at a MAXALIGN()'d offset (rather
+ * than a SHORTALIGN()'d offset), in line with the approach taken when
+ * appending a heap TID to new pivot tuple/high key during suffix truncation.
+ * This sometimes wastes a little space that was only needed as alignment
+ * padding in the original tuple.  Following this convention simplifies the
+ * space accounting used when deduplicating a page (the same convention
+ * simplifies the accounting for choosing a point to split a page at).
+ *
+ * Note: Caller's "htids" array must be unique and already in ascending TID
+ * order.  Any existing heap TIDs from "base" won't automatically appear in
+ * returned posting list tuple (they must be included in item pointer array as
+ * required.)
+ */
+IndexTuple
+_bt_form_posting(IndexTuple base, ItemPointer htids, int nhtids)
+{
+	uint32		keysize,
+				newsize;
+	IndexTuple	itup;
+
+	/* We only use the key from the base tuple */
+	if (BTreeTupleIsPosting(base))
+		keysize = BTreeTupleGetPostingOffset(base);
+	else
+		keysize = IndexTupleSize(base);
+
+	Assert(!BTreeTupleIsPivot(base));
+	Assert(nhtids > 0 && nhtids <= PG_UINT16_MAX);
+	Assert(keysize == MAXALIGN(keysize));
+
+	/*
+	 * Determine final size of new tuple.
+	 *
+	 * The calculation used when new tuple has a posting list needs to match
+	 * the code used within _bt_dedup_save_htid().
+	 */
+	if (nhtids > 1)
+		newsize = MAXALIGN(keysize +
+						   nhtids * sizeof(ItemPointerData));
+	else
+		newsize = keysize;
+
+	Assert(newsize <= INDEX_SIZE_MASK);
+	Assert(newsize == MAXALIGN(newsize));
+
+	/* Allocate memory using palloc0() to match index_form_tuple() */
+	itup = palloc0(newsize);
+	memcpy(itup, base, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+	if (nhtids > 1)
+	{
+		/* Form posting list tuple */
+		BTreeTupleSetPosting(itup, nhtids, keysize);
+		memcpy(BTreeTupleGetPosting(itup), htids,
+			   sizeof(ItemPointerData) * nhtids);
+		Assert(BTreeTupleIsPosting(itup));
+
+#ifdef USE_ASSERT_CHECKING
+		{
+			/* Verify posting list invariants with assertions */
+			ItemPointerData last;
+			ItemPointer htid;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+				htid = BTreeTupleGetPostingN(itup, i);
+
+				Assert(ItemPointerIsValid(htid));
+				Assert(ItemPointerCompare(htid, &last) > 0);
+				ItemPointerCopy(htid, &last);
+			}
+		}
+#endif
+	}
+	else
+	{
+		/*
+		 * Copy only TID in htids array to header field (i.e. create standard
+		 * non-pivot representation)
+		 */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		Assert(ItemPointerIsValid(&itup->t_tid));
+		ItemPointerCopy(htids, &itup->t_tid);
+	}
+
+	return itup;
+}
+
+/*
+ * Prepare for a posting list split by swapping heap TID in newitem with heap
+ * TID from original posting list (the 'oposting' heap TID located at offset
+ * 'postingoff').  Modifies newitem, so caller should probably pass their own
+ * private copy that can safely be modified.
+ *
+ * Returns new posting list tuple, which is palloc()'d in caller's context.
+ * This is guaranteed to be the same size as 'oposting'.  Modified newitem is
+ * what caller actually inserts. (This generally happens inside the same
+ * critical section that performs an in-place update of old posting list using
+ * new posting list returned here).
+ *
+ * Caller should avoid assuming that the IndexTuple-wise key representation in
+ * newitem is bitwise equal to the representation used within oposting.  Note,
+ * in particular, that one may even be larger than the other.  This could
+ * occur due to differences in TOAST input state, for example.
+ *
+ * See nbtree/README for details on the design of posting list splits.
+ */
+IndexTuple
+_bt_swap_posting(IndexTuple newitem, IndexTuple oposting, int postingoff)
+{
+	int			nhtids;
+	char	   *replacepos;
+	char	   *rightpos;
+	Size		nbytes;
+	IndexTuple	nposting;
+
+	Assert(!BTreeTupleIsPivot(newitem));
+
+	nhtids = BTreeTupleGetNPosting(oposting);
+	Assert(postingoff > 0 && postingoff < nhtids);
+
+	nposting = CopyIndexTuple(oposting);
+	replacepos = (char *) BTreeTupleGetPostingN(nposting, postingoff);
+	rightpos = replacepos + sizeof(ItemPointerData);
+	nbytes = (nhtids - postingoff - 1) * sizeof(ItemPointerData);
+
+	/*
+	 * Move item pointers in posting list to make a gap for the new item's
+	 * heap TID (shift TIDs one place to the right, losing original rightmost
+	 * TID)
+	 */
+	memmove(rightpos, replacepos, nbytes);
+
+	/* Fill the gap with the TID of the new item */
+	ItemPointerCopy(&newitem->t_tid, (ItemPointer) replacepos);
+
+	/* Copy original posting list's rightmost TID into new item */
+	ItemPointerCopy(BTreeTupleGetPostingN(oposting, nhtids - 1),
+					&newitem->t_tid);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(nposting),
+							  BTreeTupleGetHeapTID(newitem)) < 0);
+	Assert(BTreeTupleGetNPosting(oposting) == BTreeTupleGetNPosting(nposting));
+
+	return nposting;
+}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 144d339e8d..51468e0455 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -28,6 +28,8 @@
 /* Minimum tree height for application of fastpath optimization */
 #define BTREE_FASTPATH_MIN_LEVEL	2
 
+/* GUC parameter */
+bool		btree_deduplication = true;
 
 static Buffer _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
 
@@ -47,10 +49,12 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   BTStack stack,
 						   IndexTuple itup,
 						   OffsetNumber newitemoff,
+						   int postingoff,
 						   bool split_only_page);
 static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 						Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
-						IndexTuple newitem);
+						IndexTuple newitem, IndexTuple orignewitem,
+						IndexTuple nposting, uint16 postingoff);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 							  BTStack stack, bool is_root, bool is_only);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
@@ -125,6 +129,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	insertstate.itup_key = itup_key;
 	insertstate.bounds_valid = false;
 	insertstate.buf = InvalidBuffer;
+	insertstate.postingoff = 0;
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -300,7 +305,7 @@ top:
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
 		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
-					   itup, newitemoff, false);
+					   itup, newitemoff, insertstate.postingoff, false);
 	}
 	else
 	{
@@ -353,6 +358,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 	BTPageOpaque opaque;
 	Buffer		nbuf = InvalidBuffer;
 	bool		found = false;
+	bool		inposting = false;
+	bool		prevalldead = true;
+	int			curposti = 0;
 
 	/* Assume unique until we find a duplicate */
 	*is_unique = true;
@@ -374,6 +382,11 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 
 	/*
 	 * Scan over all equal tuples, looking for live conflicts.
+	 *
+	 * Note that each iteration of the loop processes one heap TID, not one
+	 * index tuple.  The page offset number won't be advanced for iterations
+	 * which process heap TIDs from posting list tuples until the last such
+	 * heap TID for the posting list (curposti will be advanced instead).
 	 */
 	Assert(!insertstate->bounds_valid || insertstate->low == offset);
 	Assert(!itup_key->anynullkeys);
@@ -435,7 +448,28 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 
 				/* okay, we gotta fetch the heap tuple ... */
 				curitup = (IndexTuple) PageGetItem(page, curitemid);
-				htid = curitup->t_tid;
+
+				/*
+				 * decide if this is the first heap TID in tuple we'll
+				 * process, or if we should continue to process current
+				 * posting list
+				 */
+				Assert(!BTreeTupleIsPivot(curitup));
+				if (!BTreeTupleIsPosting(curitup))
+				{
+					htid = curitup->t_tid;
+					inposting = false;
+				}
+				else if (!inposting)
+				{
+					/* First heap TID in posting list */
+					inposting = true;
+					prevalldead = true;
+					curposti = 0;
+				}
+
+				if (inposting)
+					htid = *BTreeTupleGetPostingN(curitup, curposti);
 
 				/*
 				 * If we are doing a recheck, we expect to find the tuple we
@@ -511,8 +545,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					 * not part of this chain because it had a different index
 					 * entry.
 					 */
-					htid = itup->t_tid;
-					if (table_index_fetch_tuple_check(heapRel, &htid,
+					if (table_index_fetch_tuple_check(heapRel, &itup->t_tid,
 													  SnapshotSelf, NULL))
 					{
 						/* Normal case --- it's still live */
@@ -570,12 +603,14 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 													RelationGetRelationName(rel))));
 					}
 				}
-				else if (all_dead)
+				else if (all_dead && (!inposting ||
+									  (prevalldead &&
+									   curposti == BTreeTupleGetNPosting(curitup) - 1)))
 				{
 					/*
-					 * The conflicting tuple (or whole HOT chain) is dead to
-					 * everyone, so we may as well mark the index entry
-					 * killed.
+					 * The conflicting tuple (or all HOT chains pointed to by
+					 * all posting list TIDs) is dead to everyone, so mark the
+					 * index entry killed.
 					 */
 					ItemIdMarkDead(curitemid);
 					opaque->btpo_flags |= BTP_HAS_GARBAGE;
@@ -589,14 +624,29 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					else
 						MarkBufferDirtyHint(insertstate->buf, true);
 				}
+
+				/*
+				 * Remember if posting list tuple has even a single HOT chain
+				 * whose members are not all dead
+				 */
+				if (!all_dead && inposting)
+					prevalldead = false;
 			}
 		}
 
-		/*
-		 * Advance to next tuple to continue checking.
-		 */
-		if (offset < maxoff)
+		if (inposting && curposti < BTreeTupleGetNPosting(curitup) - 1)
+		{
+			/* Advance to next TID in same posting list */
+			curposti++;
+			continue;
+		}
+		else if (offset < maxoff)
+		{
+			/* Advance to next tuple */
+			curposti = 0;
+			inposting = false;
 			offset = OffsetNumberNext(offset);
+		}
 		else
 		{
 			int			highkeycmp;
@@ -621,6 +671,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					elog(ERROR, "fell off the end of index \"%s\"",
 						 RelationGetRelationName(rel));
 			}
+			curposti = 0;
+			inposting = false;
 			maxoff = PageGetMaxOffsetNumber(page);
 			offset = P_FIRSTDATAKEY(opaque);
 			/* Don't invalidate binary search bounds */
@@ -689,6 +741,7 @@ _bt_findinsertloc(Relation rel,
 	BTScanInsert itup_key = insertstate->itup_key;
 	Page		page = BufferGetPage(insertstate->buf);
 	BTPageOpaque lpageop;
+	OffsetNumber newitemoff;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -704,6 +757,8 @@ _bt_findinsertloc(Relation rel,
 
 	if (itup_key->heapkeyspace)
 	{
+		bool		dedupunique = false;
+
 		/*
 		 * If we're inserting into a unique index, we may have to walk right
 		 * through leaf pages to find the one leaf page that we must insert on
@@ -717,9 +772,25 @@ _bt_findinsertloc(Relation rel,
 		 * tuple belongs on.  The heap TID attribute for new tuple (scantid)
 		 * could force us to insert on a sibling page, though that should be
 		 * very rare in practice.
+		 *
+		 * checkingunique inserters that encounter a duplicate will apply
+		 * deduplication when it looks like there will be a page split, but
+		 * there is no LP_DEAD garbage on the leaf page to vacuum away (or
+		 * there wasn't enough space freed by LP_DEAD cleanup).  This
+		 * complements the opportunistic LP_DEAD vacuuming mechanism.  The
+		 * high level goal is to avoid page splits caused by new, unchanged
+		 * versions of existing logical rows altogether.  See nbtree/README
+		 * for full details.
 		 */
 		if (checkingunique)
 		{
+			if (insertstate->low < insertstate->stricthigh)
+			{
+				/* Encountered a duplicate in _bt_check_unique() */
+				Assert(insertstate->bounds_valid);
+				dedupunique = true;
+			}
+
 			for (;;)
 			{
 				/*
@@ -746,18 +817,37 @@ _bt_findinsertloc(Relation rel,
 				/* Update local state after stepping right */
 				page = BufferGetPage(insertstate->buf);
 				lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+				/* Assume duplicates (helpful when initial page is empty) */
+				dedupunique = true;
 			}
 		}
 
 		/*
 		 * If the target page is full, see if we can obtain enough space by
-		 * erasing LP_DEAD items
+		 * erasing LP_DEAD items.  If that doesn't work out, try to obtain
+		 * enough free space to avoid a page split by deduplicating existing
+		 * items (if deduplication is safe).
 		 */
-		if (PageGetFreeSpace(page) < insertstate->itemsz &&
-			P_HAS_GARBAGE(lpageop))
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
 		{
-			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
-			insertstate->bounds_valid = false;
+			if (P_HAS_GARBAGE(lpageop))
+			{
+				_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+				insertstate->bounds_valid = false;
+
+				/* Might as well assume duplicates if checkingunique */
+				dedupunique = true;
+			}
+
+			if (itup_key->safededup && BTGetUseDedup(rel) &&
+				PageGetFreeSpace(page) < insertstate->itemsz &&
+				(!checkingunique || dedupunique))
+			{
+				_bt_dedup_one_page(rel, insertstate->buf, heapRel,
+								   insertstate->itup, insertstate->itemsz,
+								   checkingunique);
+				insertstate->bounds_valid = false;
+			}
 		}
 	}
 	else
@@ -839,7 +929,36 @@ _bt_findinsertloc(Relation rel,
 	Assert(P_RIGHTMOST(lpageop) ||
 		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
 
-	return _bt_binsrch_insert(rel, insertstate);
+	newitemoff = _bt_binsrch_insert(rel, insertstate);
+
+	if (insertstate->postingoff == -1)
+	{
+		/*
+		 * There is an overlapping posting list tuple with its LP_DEAD bit
+		 * set.  _bt_insertonpg() cannot handle this, so delete all LP_DEAD
+		 * items early.  This is the only case where LP_DEAD deletes happen
+		 * even though a page split wouldn't take place if we went straight to
+		 * the _bt_insertonpg() call.
+		 *
+		 * Call _bt_dedup_one_page() instead of _bt_vacuum_one_page() to force
+		 * deletes (this avoids relying on the BTP_HAS_GARBAGE hint flag,
+		 * which might be falsely unset).  Call can't actually dedup items,
+		 * since we pass a newitemsz of 0.
+		 */
+		_bt_dedup_one_page(rel, insertstate->buf, heapRel,
+						   insertstate->itup, 0, true);
+
+		/*
+		 * Do new binary search, having killed LP_DEAD items.  New insert
+		 * location cannot overlap with any posting list now.
+		 */
+		insertstate->bounds_valid = false;
+		insertstate->postingoff = 0;
+		newitemoff = _bt_binsrch_insert(rel, insertstate);
+		Assert(insertstate->postingoff == 0);
+	}
+
+	return newitemoff;
 }
 
 /*
@@ -905,10 +1024,12 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		This recursive procedure does the following things:
  *
+ *			+  if postingoff != 0, splits existing posting list tuple
+ *			   (since it overlaps with new 'itup' tuple).
  *			+  if necessary, splits the target page, using 'itup_key' for
  *			   suffix truncation on leaf pages (caller passes NULL for
  *			   non-leaf pages).
- *			+  inserts the tuple.
+ *			+  inserts the new tuple (might be split from posting list).
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
  *			   right using information stored in the parent stack).
@@ -936,11 +1057,15 @@ _bt_insertonpg(Relation rel,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
+			   int postingoff,
 			   bool split_only_page)
 {
 	Page		page;
 	BTPageOpaque lpageop;
 	Size		itemsz;
+	IndexTuple	oposting;
+	IndexTuple	origitup = NULL;
+	IndexTuple	nposting = NULL;
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1079,7 @@ _bt_insertonpg(Relation rel,
 	Assert(P_ISLEAF(lpageop) ||
 		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
+	Assert(!BTreeTupleIsPosting(itup));
 
 	/* The caller should've finished any incomplete splits already. */
 	if (P_INCOMPLETE_SPLIT(lpageop))
@@ -964,6 +1090,34 @@ _bt_insertonpg(Relation rel,
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
+	/*
+	 * Do we need to split an existing posting list item?
+	 */
+	if (postingoff != 0)
+	{
+		ItemId		itemid = PageGetItemId(page, newitemoff);
+
+		/*
+		 * The new tuple is a duplicate with a heap TID that falls inside the
+		 * range of an existing posting list tuple on a leaf page.  Prepare to
+		 * split an existing posting list.  Overwriting the posting list with
+		 * its post-split version is treated as an extra step in either the
+		 * insert or page split critical section.
+		 */
+		Assert(P_ISLEAF(lpageop) && !ItemIdIsDead(itemid));
+		Assert(itup_key->heapkeyspace && itup_key->safededup);
+		oposting = (IndexTuple) PageGetItem(page, itemid);
+
+		/* use a mutable copy of itup as our itup from here on */
+		origitup = itup;
+		itup = CopyIndexTuple(origitup);
+		nposting = _bt_swap_posting(itup, oposting, postingoff);
+		/* itup now contains rightmost TID from oposting */
+
+		/* Alter offset so that newitem goes after posting list */
+		newitemoff = OffsetNumberNext(newitemoff);
+	}
+
 	/*
 	 * Do we need to split the page to fit the item on it?
 	 *
@@ -996,7 +1150,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+						 origitup, nposting, postingoff);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1071,6 +1226,9 @@ _bt_insertonpg(Relation rel,
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
+		if (postingoff != 0)
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
 		if (!_bt_pgaddtup(page, itemsz, itup, newitemoff))
 			elog(PANIC, "failed to add new item to block %u in index \"%s\"",
 				 itup_blkno, RelationGetRelationName(rel));
@@ -1120,8 +1278,19 @@ _bt_insertonpg(Relation rel,
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
 
-			if (P_ISLEAF(lpageop))
+			if (P_ISLEAF(lpageop) && postingoff == 0)
+			{
+				/* Simple leaf insert */
 				xlinfo = XLOG_BTREE_INSERT_LEAF;
+			}
+			else if (postingoff != 0)
+			{
+				/*
+				 * Leaf insert with posting list split.  Must include
+				 * postingoff field before newitem/orignewitem.
+				 */
+				xlinfo = XLOG_BTREE_INSERT_POST;
+			}
 			else
 			{
 				/*
@@ -1144,6 +1313,7 @@ _bt_insertonpg(Relation rel,
 				xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 				xlmeta.last_cleanup_num_heap_tuples =
 					metad->btm_last_cleanup_num_heap_tuples;
+				xlmeta.safededup = metad->btm_safededup;
 
 				XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 				XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));
@@ -1152,7 +1322,27 @@ _bt_insertonpg(Relation rel,
 			}
 
 			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
-			XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+			if (postingoff == 0)
+			{
+				/* Simple, common case -- log itup from caller */
+				XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+			}
+			else
+			{
+				/*
+				 * Insert with posting list split (XLOG_BTREE_INSERT_POST
+				 * record) case.
+				 *
+				 * Log postingoff.  Also log origitup, not itup.  REDO routine
+				 * must reconstruct final itup (as well as nposting) using
+				 * _bt_swap_posting().
+				 */
+				uint16		upostingoff = postingoff;
+
+				XLogRegisterBufData(0, (char *) &upostingoff, sizeof(uint16));
+				XLogRegisterBufData(0, (char *) origitup,
+									IndexTupleSize(origitup));
+			}
 
 			recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
@@ -1194,6 +1384,14 @@ _bt_insertonpg(Relation rel,
 			_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
 			RelationSetTargetBlock(rel, cachedBlock);
 	}
+
+	/* be tidy */
+	if (postingoff != 0)
+	{
+		/* itup is actually a modified copy of caller's original */
+		pfree(nposting);
+		pfree(itup);
+	}
 }
 
 /*
@@ -1209,12 +1407,24 @@ _bt_insertonpg(Relation rel,
  *		This function will clear the INCOMPLETE_SPLIT flag on it, and
  *		release the buffer.
  *
+ *		orignewitem, nposting, and postingoff are needed when an insert of
+ *		orignewitem results in both a posting list split and a page split.
+ *		These extra posting list split details are used here in the same
+ *		way as they are used in the more common case where a posting list
+ *		split does not coincide with a page split.  We need to deal with
+ *		posting list splits directly in order to ensure that everything
+ *		that follows from the insert of orignewitem is handled as a single
+ *		atomic operation (though caller's insert of a new pivot/downlink
+ *		into parent page will still be a separate operation).  See
+ *		nbtree/README for details on the design of posting list splits.
+ *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
  */
 static Buffer
 _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+		  IndexTuple orignewitem, IndexTuple nposting, uint16 postingoff)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1234,6 +1444,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	OffsetNumber leftoff,
 				rightoff;
 	OffsetNumber firstright;
+	OffsetNumber origpagepostingoff;
 	OffsetNumber maxoff;
 	OffsetNumber i;
 	bool		newitemonleft,
@@ -1303,6 +1514,34 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	PageSetLSN(leftpage, PageGetLSN(origpage));
 	isleaf = P_ISLEAF(oopaque);
 
+	/*
+	 * Determine page offset number of existing overlapped-with-orignewitem
+	 * posting list when it is necessary to perform a posting list split in
+	 * passing.  Note that newitem was already changed by caller (newitem no
+	 * longer has the orignewitem TID).
+	 *
+	 * This page offset number (origpagepostingoff) will be used to pretend
+	 * that the posting split has already taken place, even though the
+	 * required modifications to origpage won't occur until we reach the
+	 * critical section.  The lastleft and firstright tuples of our page split
+	 * point should, in effect, come from an imaginary version of origpage
+	 * that has the nposting tuple instead of the original posting list tuple.
+	 *
+	 * Note: _bt_findsplitloc() should have compensated for coinciding posting
+	 * list splits in just the same way, at least in theory.  It doesn't
+	 * bother with that, though.  In practice it won't affect its choice of
+	 * split point.
+	 */
+	origpagepostingoff = InvalidOffsetNumber;
+	if (postingoff != 0)
+	{
+		Assert(isleaf);
+		Assert(ItemPointerCompare(&orignewitem->t_tid,
+								  &newitem->t_tid) < 0);
+		Assert(BTreeTupleIsPosting(nposting));
+		origpagepostingoff = OffsetNumberPrev(newitemoff);
+	}
+
 	/*
 	 * The "high key" for the new left page will be the first key that's going
 	 * to go into the new right page, or a truncated version if this is a leaf
@@ -1340,6 +1579,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, firstright);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		if (firstright == origpagepostingoff)
+			item = nposting;
 	}
 
 	/*
@@ -1373,6 +1614,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
 			itemid = PageGetItemId(origpage, lastleftoff);
 			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+			if (lastleftoff == origpagepostingoff)
+				lastleft = nposting;
 		}
 
 		Assert(lastleft != item);
@@ -1388,6 +1631,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	 */
 	leftoff = P_HIKEY;
 
+	Assert(BTreeTupleIsPivot(lefthikey) || !itup_key->heapkeyspace);
 	Assert(BTreeTupleGetNAtts(lefthikey, rel) > 0);
 	Assert(BTreeTupleGetNAtts(lefthikey, rel) <= indnkeyatts);
 	if (PageAddItem(leftpage, (Item) lefthikey, itemsz, leftoff,
@@ -1452,6 +1696,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		Assert(BTreeTupleIsPivot(item) || !itup_key->heapkeyspace);
 		Assert(BTreeTupleGetNAtts(item, rel) > 0);
 		Assert(BTreeTupleGetNAtts(item, rel) <= indnkeyatts);
 		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
@@ -1480,8 +1725,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 
+		/* replace original item with nposting due to posting split? */
+		if (i == origpagepostingoff)
+		{
+			Assert(BTreeTupleIsPosting(item));
+			Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+			item = nposting;
+		}
+
 		/* does new item belong before this one? */
-		if (i == newitemoff)
+		else if (i == newitemoff)
 		{
 			if (newitemonleft)
 			{
@@ -1650,8 +1903,12 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		XLogRecPtr	recptr;
 
 		xlrec.level = ropaque->btpo.level;
+		/* See comments below on newitem, orignewitem, and posting lists */
 		xlrec.firstright = firstright;
 		xlrec.newitemoff = newitemoff;
+		xlrec.postingoff = 0;
+		if (postingoff != 0 && origpagepostingoff < firstright)
+			xlrec.postingoff = postingoff;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1670,11 +1927,35 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		 * because it's included with all the other items on the right page.)
 		 * Show the new item as belonging to the left page buffer, so that it
 		 * is not stored if XLogInsert decides it needs a full-page image of
-		 * the left page.  We store the offset anyway, though, to support
-		 * archive compression of these records.
+		 * the left page.  We always store newitemoff in the record, though.
+		 *
+		 * The details are sometimes slightly different for page splits that
+		 * coincide with a posting list split.  If both the replacement
+		 * posting list and newitem go on the right page, then we don't need
+		 * to log anything extra, just like the simple !newitemonleft
+		 * no-posting-split case (postingoff is set to zero in the WAL record,
+		 * so recovery doesn't need to process a posting list split at all).
+		 * Otherwise, we set postingoff and log orignewitem instead of
+		 * newitem, despite having actually inserted newitem.  REDO routine
+		 * must reconstruct nposting and newitem using _bt_swap_posting().
+		 *
+		 * Note: It's possible that our page split point is the point that
+		 * makes the posting list lastleft and newitem firstright.  This is
+		 * the only case where we log orignewitem/newitem despite newitem
+		 * going on the right page.  If XLogInsert decides that it can omit
+		 * orignewitem due to logging a full-page image of the left page,
+		 * everything still works out, since recovery only needs to log
+		 * orignewitem for items on the left page (just like the regular
+		 * newitem-logged case).
 		 */
-		if (newitemonleft)
+		if (newitemonleft && xlrec.postingoff == 0)
 			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+		else if (xlrec.postingoff != 0)
+		{
+			Assert(newitemonleft || firstright == newitemoff);
+			Assert(MAXALIGN(newitemsz) == IndexTupleSize(orignewitem));
+			XLogRegisterBufData(0, (char *) orignewitem, MAXALIGN(newitemsz));
+		}
 
 		/* Log the left page's new high key */
 		itemid = PageGetItemId(origpage, P_HIKEY);
@@ -1834,7 +2115,7 @@ _bt_insert_parent(Relation rel,
 
 		/* Recursively insert into the parent */
 		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
-					   new_item, stack->bts_offset + 1,
+					   new_item, stack->bts_offset + 1, 0,
 					   is_only);
 
 		/* be tidy */
@@ -2190,6 +2471,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 		md.fastlevel = metad->btm_level;
 		md.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 		md.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+		md.safededup = metad->btm_safededup;
 
 		XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -2303,6 +2585,6 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * Note: if we didn't find any LP_DEAD items, then the page's
 	 * BTP_HAS_GARBAGE hint bit is falsely set.  We do not bother expending a
 	 * separate write to clear it, however.  We will clear it when we split
-	 * the page.
+	 * the page, or when deduplication runs.
 	 */
 }
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index f05cbe7467..23ab30fa9b 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -37,6 +38,8 @@ static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
 static bool _bt_mark_page_halfdead(Relation rel, Buffer buf, BTStack stack);
 static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
 									 bool *rightsib_empty);
+static TransactionId _bt_xid_horizon(Relation rel, Relation heapRel, Page page,
+									 OffsetNumber *deletable, int ndeletable);
 static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
 								   BTStack stack, Buffer *topparent, OffsetNumber *topoff,
 								   BlockNumber *target, BlockNumber *rightsib);
@@ -47,7 +50,8 @@ static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
  *	_bt_initmetapage() -- Fill a page buffer with a correct metapage image
  */
 void
-_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
+_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+				 bool safededup)
 {
 	BTMetaPageData *metad;
 	BTPageOpaque metaopaque;
@@ -63,6 +67,7 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
 	metad->btm_fastlevel = level;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+	metad->btm_safededup = safededup;
 
 	metaopaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	metaopaque->btpo_flags = BTP_META;
@@ -102,6 +107,9 @@ _bt_upgrademetapage(Page page)
 	metad->btm_version = BTREE_NOVAC_VERSION;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+	/* Only a REINDEX can set this field */
+	Assert(!metad->btm_safededup);
+	metad->btm_safededup = false;
 
 	/* Adjust pd_lower (see _bt_initmetapage() for details) */
 	((PageHeader) page)->pd_lower =
@@ -213,6 +221,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 		md.fastlevel = metad->btm_fastlevel;
 		md.oldest_btpo_xact = oldestBtpoXact;
 		md.last_cleanup_num_heap_tuples = numHeapTuples;
+		md.safededup = metad->btm_safededup;
 
 		XLogRegisterBufData(0, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -274,6 +283,8 @@ _bt_getroot(Relation rel, int access)
 		Assert(metad->btm_magic == BTREE_MAGIC);
 		Assert(metad->btm_version >= BTREE_MIN_VERSION);
 		Assert(metad->btm_version <= BTREE_VERSION);
+		Assert(!metad->btm_safededup ||
+			   metad->btm_version > BTREE_NOVAC_VERSION);
 		Assert(metad->btm_root != P_NONE);
 
 		rootblkno = metad->btm_fastroot;
@@ -394,6 +405,7 @@ _bt_getroot(Relation rel, int access)
 			md.fastlevel = 0;
 			md.oldest_btpo_xact = InvalidTransactionId;
 			md.last_cleanup_num_heap_tuples = -1.0;
+			md.safededup = metad->btm_safededup;
 
 			XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -618,22 +630,33 @@ _bt_getrootheight(Relation rel)
 	Assert(metad->btm_magic == BTREE_MAGIC);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
 	Assert(metad->btm_version <= BTREE_VERSION);
+	Assert(!metad->btm_safededup || metad->btm_version > BTREE_NOVAC_VERSION);
 	Assert(metad->btm_fastroot != P_NONE);
 
 	return metad->btm_fastlevel;
 }
 
 /*
- *	_bt_heapkeyspace() -- is heap TID being treated as a key?
+ *	_bt_metaversion() -- Get version/status info from metapage.
+ *
+ *		Sets caller's *heapkeyspace and *safededup arguments using data from
+ *		the B-Tree metapage (could be locally-cached version).  This
+ *		information needs to be stashed in insertion scankey, so we provide a
+ *		single function that fetches both at once.
  *
  *		This is used to determine the rules that must be used to descend a
  *		btree.  Version 4 indexes treat heap TID as a tiebreaker attribute.
  *		pg_upgrade'd version 3 indexes need extra steps to preserve reasonable
  *		performance when inserting a new BTScanInsert-wise duplicate tuple
  *		among many leaf pages already full of such duplicates.
+ *
+ *		Also sets field that indicates to caller whether or not it is safe to
+ *		apply deduplication within index.  Note that we rely on the assumption
+ *		that btm_safededup will be zero'ed on heapkeyspace indexes that were
+ *		pg_upgrade'd from Postgres 12.
  */
-bool
-_bt_heapkeyspace(Relation rel)
+void
+_bt_metaversion(Relation rel, bool *heapkeyspace, bool *safededup)
 {
 	BTMetaPageData *metad;
 
@@ -651,10 +674,11 @@ _bt_heapkeyspace(Relation rel)
 		 */
 		if (metad->btm_root == P_NONE)
 		{
-			uint32		btm_version = metad->btm_version;
+			*heapkeyspace = metad->btm_version > BTREE_NOVAC_VERSION;
+			*safededup = metad->btm_safededup;
 
 			_bt_relbuf(rel, metabuf);
-			return btm_version > BTREE_NOVAC_VERSION;
+			return;
 		}
 
 		/*
@@ -678,9 +702,11 @@ _bt_heapkeyspace(Relation rel)
 	Assert(metad->btm_magic == BTREE_MAGIC);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
 	Assert(metad->btm_version <= BTREE_VERSION);
+	Assert(!metad->btm_safededup || metad->btm_version > BTREE_NOVAC_VERSION);
 	Assert(metad->btm_fastroot != P_NONE);
 
-	return metad->btm_version > BTREE_NOVAC_VERSION;
+	*heapkeyspace = metad->btm_version > BTREE_NOVAC_VERSION;
+	*safededup = metad->btm_safededup;
 }
 
 /*
@@ -964,28 +990,90 @@ _bt_page_recyclable(Page page)
  * Delete item(s) from a btree leaf page during VACUUM.
  *
  * This routine assumes that the caller has a super-exclusive write lock on
- * the buffer.  Also, the given deletable array *must* be sorted in ascending
- * order.
+ * the buffer.  Also, the given deletable and updatable arrays *must* be
+ * sorted in ascending order.
+ *
+ * Routine deals with "deleting logical tuples" when some (but not all) of the
+ * heap TIDs in an existing posting list item are to be removed by VACUUM.
+ * This works by updating/overwriting an existing item with caller's new
+ * version of the item (a version that lacks the TIDs that are to be deleted).
  *
  * We record VACUUMs and b-tree deletes differently in WAL.  Deletes must
  * generate their own latestRemovedXid by accessing the heap directly, whereas
- * VACUUMs rely on the initial heap scan taking care of it indirectly.
+ * VACUUMs rely on the initial heap scan taking care of it indirectly.  Also,
+ * only VACUUM can perform granular deletes of individual TIDs in posting list
+ * tuples.
  */
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
-					OffsetNumber *deletable, int ndeletable)
+					OffsetNumber *deletable, int ndeletable,
+					OffsetNumber *updatable, IndexTuple *updated,
+					int nupdatable)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	IndexTuple	itup;
+	Size		itemsz;
+	char	   *updatedbuf = NULL;
+	Size		updatedbuflen;
 
 	/* Shouldn't be called unless there's something to do */
-	Assert(ndeletable > 0);
+	Assert(ndeletable > 0 || nupdatable > 0);
+
+	/* XLOG stuff -- allocate and fill buffer before critical section */
+	if (nupdatable > 0 && RelationNeedsWAL(rel))
+	{
+		Size		offset;
+
+		updatedbuflen = 0;
+		for (int i = 0; i < nupdatable; i++)
+		{
+			itup = updated[i];
+			itemsz = MAXALIGN(IndexTupleSize(itup));
+			updatedbuflen += itemsz;
+		}
+
+		updatedbuf = palloc(updatedbuflen);
+		offset = 0;
+		for (int i = 0; i < nupdatable; i++)
+		{
+			itup = updated[i];
+			itemsz = MAXALIGN(IndexTupleSize(itup));
+			memcpy(updatedbuf + offset, itup, itemsz);
+			offset += itemsz;
+		}
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
-	/* Fix the page */
-	PageIndexMultiDelete(page, deletable, ndeletable);
+	/*
+	 * Handle posting tuple updates.
+	 *
+	 * Deliberately do this before handling simple deletes.  If we did it the
+	 * other way around (i.e. WAL record order -- simple deletes before
+	 * updates) then we'd have to make compensating changes to the 'updatable'
+	 * array of offset numbers.
+	 *
+	 * PageIndexTupleOverwrite() won't unset each item's LP_DEAD bit when it
+	 * happens to already be set.  Although we unset the BTP_HAS_GARBAGE page
+	 * level flag, unsetting individual LP_DEAD bits should still be avoided.
+	 */
+	for (int i = 0; i < nupdatable; i++)
+	{
+		OffsetNumber offnum = updatable[i];
+
+		itup = updated[i];
+		itemsz = MAXALIGN(IndexTupleSize(itup));
+
+		if (!PageIndexTupleOverwrite(page, offnum, (Item) itup, itemsz))
+			elog(PANIC, "could not update partially dead item in block %u of index \"%s\"",
+				 BufferGetBlockNumber(buf), RelationGetRelationName(rel));
+	}
+
+	/* Now handle simple deletes of entire physical tuples */
+	if (ndeletable > 0)
+		PageIndexMultiDelete(page, deletable, ndeletable);
 
 	/*
 	 * We can clear the vacuum cycle ID since this page has certainly been
@@ -1006,7 +1094,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	 * limited, since we never falsely unset an LP_DEAD bit.  Workloads that
 	 * are particularly dependent on LP_DEAD bits being set quickly will
 	 * usually manage to set the BTP_HAS_GARBAGE flag before the page fills up
-	 * again anyway.
+	 * again anyway.  Furthermore, attempting a deduplication pass will remove
+	 * all LP_DEAD items, regardless of whether the BTP_HAS_GARBAGE hint bit
+	 * is set or not.
 	 */
 	opaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 
@@ -1019,18 +1109,22 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.ndeleted = ndeletable;
+		xlrec_vacuum.nupdated = nupdatable;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
 		XLogRegisterData((char *) &xlrec_vacuum, SizeOfBtreeVacuum);
 
-		/*
-		 * The deletable array is not in the buffer, but pretend that it is.
-		 * When XLogInsert stores the whole buffer, the array need not be
-		 * stored too.
-		 */
-		XLogRegisterBufData(0, (char *) deletable,
-							ndeletable * sizeof(OffsetNumber));
+		if (ndeletable > 0)
+			XLogRegisterBufData(0, (char *) deletable,
+								ndeletable * sizeof(OffsetNumber));
+
+		if (nupdatable > 0)
+		{
+			XLogRegisterBufData(0, (char *) updatable,
+								nupdatable * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, updatedbuf, updatedbuflen);
+		}
 
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
@@ -1038,6 +1132,10 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	}
 
 	END_CRIT_SECTION();
+
+	/* can't leak memory here */
+	if (updatedbuf != NULL)
+		pfree(updatedbuf);
 }
 
 /*
@@ -1050,6 +1148,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
  * This is nearly the same as _bt_delitems_vacuum as far as what it does to
  * the page, but it needs to generate its own latestRemovedXid by accessing
  * the heap.  This is used by the REDO routine to generate recovery conflicts.
+ * Also, it doesn't handle posting list tuples unless the entire physical
+ * tuple can be deleted as a whole (since there is only one LP_DEAD bit per
+ * line pointer).
  */
 void
 _bt_delitems_delete(Relation rel, Buffer buf,
@@ -1065,8 +1166,7 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 
 	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
 		latestRemovedXid =
-			index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
-												 deletable, ndeletable);
+			_bt_xid_horizon(rel, heapRel, page, deletable, ndeletable);
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
@@ -1113,6 +1213,84 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 	END_CRIT_SECTION();
 }
 
+/*
+ * Get the latestRemovedXid from the table entries pointed to by the non-pivot
+ * tuples being deleted.
+ *
+ * This is a specialized version of index_compute_xid_horizon_for_tuples().
+ * It's needed because btree tuples don't always store table TID using the
+ * standard index tuple header field.
+ */
+static TransactionId
+_bt_xid_horizon(Relation rel, Relation heapRel, Page page,
+				OffsetNumber *deletable, int ndeletable)
+{
+	TransactionId latestRemovedXid = InvalidTransactionId;
+	int			spacenhtids;
+	int			nhtids;
+	ItemPointer htids;
+
+	/* Array will grow iff there are posting list tuples to consider */
+	spacenhtids = ndeletable;
+	nhtids = 0;
+	htids = (ItemPointer) palloc(sizeof(ItemPointerData) * spacenhtids);
+	for (int i = 0; i < ndeletable; i++)
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+
+		itemid = PageGetItemId(page, deletable[i]);
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(ItemIdIsDead(itemid));
+		Assert(!BTreeTupleIsPivot(itup));
+
+		if (!BTreeTupleIsPosting(itup))
+		{
+			if (nhtids + 1 > spacenhtids)
+			{
+				spacenhtids *= 2;
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * spacenhtids);
+			}
+
+			Assert(ItemPointerIsValid(&itup->t_tid));
+			ItemPointerCopy(&itup->t_tid, &htids[nhtids]);
+			nhtids++;
+		}
+		else
+		{
+			int			nposting = BTreeTupleGetNPosting(itup);
+
+			if (nhtids + nposting > spacenhtids)
+			{
+				spacenhtids = Max(spacenhtids * 2, nhtids + nposting);
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * spacenhtids);
+			}
+
+			for (int j = 0; j < nposting; j++)
+			{
+				ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+				Assert(ItemPointerIsValid(htid));
+				ItemPointerCopy(htid, &htids[nhtids]);
+				nhtids++;
+			}
+		}
+	}
+
+	Assert(nhtids >= ndeletable);
+
+	latestRemovedXid =
+		table_compute_xid_horizon_for_tuples(heapRel, htids, nhtids);
+
+	/* be tidy */
+	pfree(htids);
+
+	return latestRemovedXid;
+}
+
 /*
  * Returns true, if the given block has the half-dead flag set.
  */
@@ -2058,6 +2236,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 			xlmeta.fastlevel = metad->btm_fastlevel;
 			xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 			xlmeta.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+			xlmeta.safededup = metad->btm_safededup;
 
 			XLogRegisterBufData(4, (char *) &xlmeta, sizeof(xl_btree_metadata));
 			xlinfo = XLOG_BTREE_UNLINK_PAGE_META;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 8376a5e6b7..eabb839f43 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -95,6 +95,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static ItemPointer btreevacuumposting(BTVacState *vstate, IndexTuple posting,
+									  int *nremaining);
 
 
 /*
@@ -158,7 +160,7 @@ btbuildempty(Relation index)
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, P_NONE, 0);
+	_bt_initmetapage(metapage, P_NONE, 0, _bt_opclasses_support_dedup(index));
 
 	/*
 	 * Write the page and log it.  It might seem that an immediate sync would
@@ -261,8 +263,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 				 */
 				if (so->killedItems == NULL)
 					so->killedItems = (int *)
-						palloc(MaxIndexTuplesPerPage * sizeof(int));
-				if (so->numKilled < MaxIndexTuplesPerPage)
+						palloc(MaxBTreeIndexTuplesPerPage * sizeof(int));
+				if (so->numKilled < MaxBTreeIndexTuplesPerPage)
 					so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 			}
 
@@ -1151,11 +1153,16 @@ restart:
 	}
 	else if (P_ISLEAF(opaque))
 	{
-		OffsetNumber deletable[MaxOffsetNumber];
+		OffsetNumber deletable[MaxIndexTuplesPerPage];
 		int			ndeletable;
+		IndexTuple	updated[MaxIndexTuplesPerPage];
+		OffsetNumber updatable[MaxIndexTuplesPerPage];
+		int			nupdatable;
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
+		int			nhtidsdead,
+					nhtidslive;
 
 		/*
 		 * Trade in the initial read lock for a super-exclusive write lock on
@@ -1187,8 +1194,11 @@ restart:
 		 * point using callback.
 		 */
 		ndeletable = 0;
+		nupdatable = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
+		nhtidsdead = 0;
+		nhtidslive = 0;
 		if (callback)
 		{
 			for (offnum = minoff;
@@ -1196,11 +1206,9 @@ restart:
 				 offnum = OffsetNumberNext(offnum))
 			{
 				IndexTuple	itup;
-				ItemPointer htup;
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
 				/*
 				 * Hot Standby assumes that it's okay that XLOG_BTREE_VACUUM
@@ -1223,22 +1231,90 @@ restart:
 				 * simple, and allows us to always avoid generating our own
 				 * conflicts.
 				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				Assert(!BTreeTupleIsPivot(itup));
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Regular tuple, standard table TID representation */
+					if (callback(&itup->t_tid, callback_state))
+					{
+						deletable[ndeletable++] = offnum;
+						nhtidsdead++;
+					}
+					else
+						nhtidslive++;
+				}
+				else
+				{
+					ItemPointer newhtids;
+					int			nremaining;
+
+					/*
+					 * Posting list tuple, a physical tuple that represents
+					 * two or more logical tuples, any of which could be an
+					 * index row version that must be removed
+					 */
+					newhtids = btreevacuumposting(vstate, itup, &nremaining);
+					if (newhtids == NULL)
+					{
+						/*
+						 * All table TIDs/logical tuples from the posting
+						 * tuple remain, so no delete or update required
+						 */
+						Assert(nremaining == BTreeTupleGetNPosting(itup));
+					}
+					else if (nremaining > 0)
+					{
+						IndexTuple	updatedtuple;
+
+						/*
+						 * Form new tuple that contains only remaining TIDs.
+						 * Remember this new tuple and the offset of the tuple
+						 * to be updated for the page's _bt_delitems_vacuum()
+						 * call.
+						 */
+						Assert(nremaining < BTreeTupleGetNPosting(itup));
+						updatedtuple = _bt_form_posting(itup, newhtids,
+														nremaining);
+						updated[nupdatable] = updatedtuple;
+						updatable[nupdatable++] = offnum;
+						nhtidsdead += BTreeTupleGetNPosting(itup) - nremaining;
+						pfree(newhtids);
+					}
+					else
+					{
+						/*
+						 * All table TIDs/logical tuples from the posting list
+						 * must be deleted.  We'll delete the physical index
+						 * tuple completely (no update).
+						 */
+						Assert(nremaining == 0);
+						deletable[ndeletable++] = offnum;
+						nhtidsdead += BTreeTupleGetNPosting(itup);
+						pfree(newhtids);
+					}
+
+					nhtidslive += nremaining;
+				}
 			}
 		}
 
 		/*
-		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
-		 * call per page, so as to minimize WAL traffic.
+		 * Apply any needed deletes or updates.  We issue just one
+		 * _bt_delitems_vacuum() call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nupdatable > 0)
 		{
-			_bt_delitems_vacuum(rel, buf, deletable, ndeletable);
+			Assert(nhtidsdead >= Max(ndeletable, 1));
+			_bt_delitems_vacuum(rel, buf, deletable, ndeletable, updatable,
+								updated, nupdatable);
 
-			stats->tuples_removed += ndeletable;
+			stats->tuples_removed += nhtidsdead;
 			/* must recompute maxoff */
 			maxoff = PageGetMaxOffsetNumber(page);
+
+			/* can't leak memory here */
+			for (int i = 0; i < nupdatable; i++)
+				pfree(updated[i]);
 		}
 		else
 		{
@@ -1251,6 +1327,7 @@ restart:
 			 * We treat this like a hint-bit update because there's no need to
 			 * WAL-log it.
 			 */
+			Assert(nhtidsdead == 0);
 			if (vstate->cycleid != 0 &&
 				opaque->btpo_cycleid == vstate->cycleid)
 			{
@@ -1260,15 +1337,18 @@ restart:
 		}
 
 		/*
-		 * If it's now empty, try to delete; else count the live tuples. We
-		 * don't delete when recursing, though, to avoid putting entries into
+		 * If it's now empty, try to delete; else count the live tuples (live
+		 * table TIDs in posting lists are counted as live tuples).  We don't
+		 * delete when recursing, though, to avoid putting entries into
 		 * freePages out-of-order (doesn't seem worth any extra code to handle
 		 * the case).
 		 */
 		if (minoff > maxoff)
 			delete_now = (blkno == orig_blkno);
 		else
-			stats->num_index_tuples += maxoff - minoff + 1;
+			stats->num_index_tuples += nhtidslive;
+
+		Assert(!delete_now || nhtidslive == 0);
 	}
 
 	if (delete_now)
@@ -1300,9 +1380,10 @@ restart:
 	/*
 	 * This is really tail recursion, but if the compiler is too stupid to
 	 * optimize it as such, we'd eat an uncomfortably large amount of stack
-	 * space per recursion level (due to the deletable[] array). A failure is
-	 * improbable since the number of levels isn't likely to be large ... but
-	 * just in case, let's hand-optimize into a loop.
+	 * space per recursion level (due to the arrays used to track details of
+	 * deletable/updatable items).  A failure is improbable since the number
+	 * of levels isn't likely to be large ...  but just in case, let's
+	 * hand-optimize into a loop.
 	 */
 	if (recurse_to != P_NONE)
 	{
@@ -1311,6 +1392,67 @@ restart:
 	}
 }
 
+/*
+ * btreevacuumposting --- determine TIDs still needed in posting list
+ *
+ * Returns new palloc'd array of item pointers needed to build
+ * replacement posting list tuple without the TIDs that VACUUM needs to
+ * delete.  Returned value is NULL in the common case no changes are
+ * needed in caller's posting list tuple (we avoid allocating memory
+ * here as an optimization).
+ *
+ * The number of TIDs that should remain in the posting list tuple is
+ * set for caller in *nremaining.  This indicates the number of elements
+ * in the returned array (assuming that return value isn't just NULL).
+ */
+static ItemPointer
+btreevacuumposting(BTVacState *vstate, IndexTuple posting, int *nremaining)
+{
+	int			live = 0;
+	int			nitem = BTreeTupleGetNPosting(posting);
+	ItemPointer tmpitems = NULL,
+				items = BTreeTupleGetPosting(posting);
+
+	for (int i = 0; i < nitem; i++)
+	{
+		if (!vstate->callback(items + i, vstate->callback_state))
+		{
+			/*
+			 * Live table TID.
+			 *
+			 * Only save live TID when we already know that we're going to
+			 * have to kill at least one TID, and have already allocated
+			 * memory.
+			 */
+			if (tmpitems)
+				tmpitems[live] = items[i];
+			live++;
+		}
+		else if (tmpitems == NULL)
+		{
+			/*
+			 * First dead table TID encountered.
+			 *
+			 * It's now clear that we need to delete one or more dead table
+			 * TIDs, so start maintaining an array of live TIDs for caller to
+			 * reconstruct smaller replacement posting list tuple
+			 */
+			tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+			/* Copy live TIDs skipped in previous iterations, if any */
+			if (live > 0)
+				memcpy(tmpitems, items, sizeof(ItemPointerData) * live);
+		}
+		else
+		{
+			/* Second or subsequent dead table TID */
+		}
+	}
+
+	*nremaining = live;
+	return tmpitems;
+}
+
 /*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c573814f01..362e9d9efa 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int	_bt_binsrch_posting(BTScanInsert key, Page page,
+								OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static int	_bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+								  OffsetNumber offnum, ItemPointer heapTid,
+								  IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+									   OffsetNumber offnum,
+									   ItemPointer heapTid, int tupleOffset);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -142,6 +150,7 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
 		offnum = _bt_binsrch(rel, key, *bufP);
 		itemid = PageGetItemId(page, offnum);
 		itup = (IndexTuple) PageGetItem(page, itemid);
+		Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
 		blkno = BTreeTupleGetDownLink(itup);
 		par_blkno = BufferGetBlockNumber(*bufP);
 
@@ -434,7 +443,10 @@ _bt_binsrch(Relation rel,
  * low) makes bounds invalid.
  *
  * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by a posting
+ * list split).
  */
 OffsetNumber
 _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +465,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 	Assert(P_ISLEAF(opaque));
 	Assert(!key->nextkey);
+	Assert(insertstate->postingoff == 0);
 
 	if (!insertstate->bounds_valid)
 	{
@@ -509,6 +522,16 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 			if (result != 0)
 				stricthigh = high;
 		}
+
+		/*
+		 * If tuple at offset located by binary search is a posting list whose
+		 * TID range overlaps with caller's scantid, perform posting list
+		 * binary search to set postingoff for caller.  Caller must split the
+		 * posting list when postingoff is set.  This should happen
+		 * infrequently.
+		 */
+		if (unlikely(result == 0 && key->scantid != NULL))
+			insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
 	}
 
 	/*
@@ -528,6 +551,73 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	return low;
 }
 
+/*----------
+ *	_bt_binsrch_posting() -- posting list binary search.
+ *
+ * Helper routine for _bt_binsrch_insert().
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+	IndexTuple	itup;
+	ItemId		itemid;
+	int			low,
+				high,
+				mid,
+				res;
+
+	/*
+	 * If this isn't a posting tuple, then the index must be corrupt (if it is
+	 * an ordinary non-pivot tuple then there must be an existing tuple with a
+	 * heap TID that equals inserter's new heap TID/scantid).  Defensively
+	 * check that tuple is a posting list tuple whose posting list range
+	 * includes caller's scantid.
+	 *
+	 * (This is also needed because contrib/amcheck's rootdescend option needs
+	 * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+	 */
+	itemid = PageGetItemId(page, offnum);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+	if (!BTreeTupleIsPosting(itup))
+		return 0;
+
+	Assert(key->heapkeyspace && key->safededup);
+
+	/*
+	 * In the event that posting list tuple has LP_DEAD bit set, indicate this
+	 * to _bt_binsrch_insert() caller by returning -1, a sentinel value.  A
+	 * second call to _bt_binsrch_insert() can take place when its caller has
+	 * removed the dead item.
+	 */
+	if (ItemIdIsDead(itemid))
+		return -1;
+
+	/* "high" is past end of posting list for loop invariant */
+	low = 0;
+	high = BTreeTupleGetNPosting(itup);
+	Assert(high >= 2);
+
+	while (high > low)
+	{
+		mid = low + ((high - low) / 2);
+		res = ItemPointerCompare(key->scantid,
+								 BTreeTupleGetPostingN(itup, mid));
+
+		if (res > 0)
+			low = mid + 1;
+		else if (res < 0)
+			high = mid;
+		else
+			return mid;
+	}
+
+	/* Exact match not found */
+	return low;
+}
+
 /*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
@@ -537,9 +627,14 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
  *			<0 if scankey < tuple at offnum;
  *			 0 if scankey == tuple at offnum;
  *			>0 if scankey > tuple at offnum.
- *		NULLs in the keys are treated as sortable values.  Therefore
- *		"equality" does not necessarily mean that the item should be
- *		returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values.  Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key.  Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid.  There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
  *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
@@ -563,6 +658,7 @@ _bt_compare(Relation rel,
 	ScanKey		scankey;
 	int			ncmpkey;
 	int			ntupatts;
+	int32		result;
 
 	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +693,6 @@ _bt_compare(Relation rel,
 	{
 		Datum		datum;
 		bool		isNull;
-		int32		result;
 
 		datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
 
@@ -713,8 +808,25 @@ _bt_compare(Relation rel,
 	if (heapTid == NULL)
 		return 1;
 
+	/*
+	 * scankey must be treated as equal to a posting list tuple if its scantid
+	 * value falls within the range of the posting list.  In all other cases
+	 * there can only be a single heap TID value, which is compared directly
+	 * as a simple scalar value.
+	 */
 	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	return ItemPointerCompare(key->scantid, heapTid);
+	result = ItemPointerCompare(key->scantid, heapTid);
+	if (result <= 0 || !BTreeTupleIsPosting(itup))
+		return result;
+	else
+	{
+		result = ItemPointerCompare(key->scantid,
+									BTreeTupleGetMaxHeapTID(itup));
+		if (result > 0)
+			return 1;
+	}
+
+	return 0;
 }
 
 /*
@@ -1229,7 +1341,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	}
 
 	/* Initialize remaining insertion scan key fields */
-	inskey.heapkeyspace = _bt_heapkeyspace(rel);
+	_bt_metaversion(rel, &inskey.heapkeyspace, &inskey.safededup);
 	inskey.anynullkeys = false; /* unused */
 	inskey.nextkey = nextkey;
 	inskey.pivotsearch = false;
@@ -1484,9 +1596,35 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
-				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				/* tuple passes all scan key conditions */
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Remember it */
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+				else
+				{
+					int			tupleOffset;
+
+					/*
+					 * Set up state to return posting list, and remember first
+					 * "logical" tuple
+					 */
+					tupleOffset =
+						_bt_setuppostingitems(so, itemIndex, offnum,
+											  BTreeTupleGetPostingN(itup, 0),
+											  itup);
+					itemIndex++;
+					/* Remember additional logical tuples */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											tupleOffset);
+						itemIndex++;
+					}
+				}
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1519,7 +1657,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxBTreeIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1527,7 +1665,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxBTreeIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1568,9 +1706,41 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 										 &continuescan);
 			if (passes_quals && tuple_alive)
 			{
-				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				/* tuple passes all scan key conditions */
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Remember it */
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+				else
+				{
+					int			tupleOffset;
+
+					/*
+					 * Set up state to return posting list, and remember first
+					 * "logical" tuple.
+					 *
+					 * Note that we deliberately save/return items from
+					 * posting lists in ascending heap TID order for backwards
+					 * scans.  This allows _bt_killitems() to make a
+					 * consistent assumption about the order of items
+					 * associated with the same posting list tuple.
+					 */
+					itemIndex--;
+					tupleOffset =
+						_bt_setuppostingitems(so, itemIndex, offnum,
+											  BTreeTupleGetPostingN(itup, 0),
+											  itup);
+					/* Remember additional logical tuples */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						itemIndex--;
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											tupleOffset);
+					}
+				}
 			}
 			if (!continuescan)
 			{
@@ -1584,8 +1754,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxBTreeIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxBTreeIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1768,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPivot(itup) && !BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1610,6 +1782,71 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 	}
 }
 
+/*
+ * Setup state to save posting items from a single posting list tuple.  Saves
+ * the logical tuple that will be returned to scan first in passing.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for logical tuple
+ * that is returned to scan first.  Second or subsequent heap TID for posting
+ * list should be saved by calling _bt_savepostingitem().
+ *
+ * Returns an offset into tuple storage space that main tuple is stored at if
+ * needed.
+ */
+static int
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					  ItemPointer heapTid, IndexTuple itup)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	Assert(BTreeTupleIsPosting(itup));
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+	if (so->currTuples)
+	{
+		/* Save base IndexTuple (truncate posting list) */
+		IndexTuple	base;
+		Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+		itupsz = MAXALIGN(itupsz);
+		currItem->tupleOffset = so->currPos.nextTupleOffset;
+		base = (IndexTuple) (so->currTuples + so->currPos.nextTupleOffset);
+		memcpy(base, itup, itupsz);
+		/* Defensively reduce work area index tuple header size */
+		base->t_info &= ~INDEX_SIZE_MASK;
+		base->t_info |= itupsz;
+		so->currPos.nextTupleOffset += itupsz;
+
+		return currItem->tupleOffset;
+	}
+
+	return 0;
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer heapTid, int tupleOffset)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+
+	/*
+	 * Have index-only scans return the same base IndexTuple for every logical
+	 * tuple that originates from the same posting list
+	 */
+	if (so->currTuples)
+		currItem->tupleOffset = tupleOffset;
+}
+
 /*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f163491d60..129fe8668a 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -243,6 +243,7 @@ typedef struct BTPageState
 	BlockNumber btps_blkno;		/* block # to write this page at */
 	IndexTuple	btps_lowkey;	/* page's strict lower bound pivot tuple */
 	OffsetNumber btps_lastoff;	/* last item offset loaded */
+	Size		btps_lastextra; /* last item's extra posting list space */
 	uint32		btps_level;		/* tree level (0 = leaf) */
 	Size		btps_full;		/* "full" if less than this much free space */
 	struct BTPageState *btps_next;	/* link to parent level, if any */
@@ -277,7 +278,10 @@ static void _bt_slideleft(Page page);
 static void _bt_sortaddtup(Page page, Size itemsize,
 						   IndexTuple itup, OffsetNumber itup_off);
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
-						 IndexTuple itup);
+						 IndexTuple itup, Size truncextra);
+static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
+										  BTPageState *state,
+										  BTDedupState dstate);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
@@ -711,6 +715,7 @@ _bt_pagestate(BTWriteState *wstate, uint32 level)
 	state->btps_lowkey = NULL;
 	/* initialize lastoff so first item goes into P_FIRSTKEY */
 	state->btps_lastoff = P_HIKEY;
+	state->btps_lastextra = 0;
 	state->btps_level = level;
 	/* set "full" threshold based on level.  See notes at head of file. */
 	if (level > 0)
@@ -789,7 +794,8 @@ _bt_sortaddtup(Page page,
 }
 
 /*----------
- * Add an item to a disk page from the sort output.
+ * Add an item to a disk page from the sort output (or add a posting list
+ * item formed from the sort output).
  *
  * We must be careful to observe the page layout conventions of nbtsearch.c:
  * - rightmost pages start data items at P_HIKEY instead of at P_FIRSTKEY.
@@ -821,14 +827,27 @@ _bt_sortaddtup(Page page,
  * the truncated high key at offset 1.
  *
  * 'last' pointer indicates the last offset added to the page.
+ *
+ * 'truncextra' is the size of the posting list in itup, if any.  This
+ * information is stashed for the next call here, when we may benefit
+ * from considering the impact of truncating away the posting list on
+ * the page before deciding to finish the page off.  Posting lists are
+ * often relatively large, so it is worth going to the trouble of
+ * accounting for the saving from truncating away the posting list of
+ * the tuple that becomes the high key (that may be the only way to
+ * get close to target free space on the page).  Note that this is
+ * only used for the soft fillfactor-wise limit, not the critical hard
+ * limit.
  *----------
  */
 static void
-_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
+_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
+			 Size truncextra)
 {
 	Page		npage;
 	BlockNumber nblkno;
 	OffsetNumber last_off;
+	Size		last_truncextra;
 	Size		pgspc;
 	Size		itupsz;
 	bool		isleaf;
@@ -842,6 +861,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	npage = state->btps_page;
 	nblkno = state->btps_blkno;
 	last_off = state->btps_lastoff;
+	last_truncextra = state->btps_lastextra;
+	state->btps_lastextra = truncextra;
 
 	pgspc = PageGetFreeSpace(npage);
 	itupsz = IndexTupleSize(itup);
@@ -883,10 +904,10 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	 * page.  Disregard fillfactor and insert on "full" current page if we
 	 * don't have the minimum number of items yet.  (Note that we deliberately
 	 * assume that suffix truncation neither enlarges nor shrinks new high key
-	 * when applying soft limit.)
+	 * when applying soft limit, except when last tuple had a posting list.)
 	 */
 	if (pgspc < itupsz + (isleaf ? MAXALIGN(sizeof(ItemPointerData)) : 0) ||
-		(pgspc < state->btps_full && last_off > P_FIRSTKEY))
+		(pgspc + last_truncextra < state->btps_full && last_off > P_FIRSTKEY))
 	{
 		/*
 		 * Finish off the page and write it out.
@@ -944,11 +965,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * We don't try to bias our choice of split point to make it more
 			 * likely that _bt_truncate() can truncate away more attributes,
 			 * whereas the split point used within _bt_split() is chosen much
-			 * more delicately.  Suffix truncation is mostly useful because it
-			 * improves space utilization for workloads with random
-			 * insertions.  It doesn't seem worthwhile to add logic for
-			 * choosing a split point here for a benefit that is bound to be
-			 * much smaller.
+			 * more delicately.  On the other hand, non-unique index builds
+			 * usually deduplicate, which often results in every "physical"
+			 * tuple on the page having distinct key values.  When that
+			 * happens, _bt_truncate() will never need to include a heap TID
+			 * in the new high key.
 			 *
 			 * Overwrite the old item with new truncated high key directly.
 			 * oitup is already located at the physical beginning of tuple
@@ -983,7 +1004,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(BTreeTupleGetNAtts(state->btps_lowkey, wstate->index) == 0 ||
 			   !P_LEFTMOST((BTPageOpaque) PageGetSpecialPointer(opage)));
 		BTreeTupleSetDownLink(state->btps_lowkey, oblkno);
-		_bt_buildadd(wstate, state->btps_next, state->btps_lowkey);
+		_bt_buildadd(wstate, state->btps_next, state->btps_lowkey, 0);
 		pfree(state->btps_lowkey);
 
 		/*
@@ -1045,6 +1066,47 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	state->btps_lastoff = last_off;
 }
 
+/*
+ * Finalize pending posting list tuple, and add it to the index.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * This is almost like _bt_dedup_finish_pending(), but it adds a new tuple
+ * using _bt_buildadd().
+ */
+static void
+_bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
+							  BTDedupState dstate)
+{
+	IndexTuple	final;
+	Size		truncextra;
+
+	Assert(dstate->nitems > 0);
+	truncextra = 0;
+	if (dstate->nitems == 1)
+		final = dstate->base;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = _bt_form_posting(dstate->base,
+										dstate->htids,
+										dstate->nhtids);
+		final = postingtuple;
+		/* Determine size of posting list */
+		truncextra = IndexTupleSize(final) -
+			BTreeTupleGetPostingOffset(final);
+	}
+
+	_bt_buildadd(wstate, state, final, truncextra);
+
+	if (dstate->nitems > 1)
+		pfree(final);
+	dstate->nhtids = 0;
+	dstate->nitems = 0;
+	dstate->phystupsize = 0;
+}
+
 /*
  * Finish writing out the completed btree.
  */
@@ -1090,7 +1152,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 			Assert(BTreeTupleGetNAtts(s->btps_lowkey, wstate->index) == 0 ||
 				   !P_LEFTMOST(opaque));
 			BTreeTupleSetDownLink(s->btps_lowkey, blkno);
-			_bt_buildadd(wstate, s->btps_next, s->btps_lowkey);
+			_bt_buildadd(wstate, s->btps_next, s->btps_lowkey, 0);
 			pfree(s->btps_lowkey);
 			s->btps_lowkey = NULL;
 		}
@@ -1111,7 +1173,8 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 	 * by filling in a valid magic number in the metapage.
 	 */
 	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, rootblkno, rootlevel);
+	_bt_initmetapage(metapage, rootblkno, rootlevel,
+					 wstate->inskey->safededup);
 	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
 }
 
@@ -1132,6 +1195,9 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		deduplicate;
+
+	deduplicate = wstate->inskey->safededup && BTGetUseDedup(wstate->index);
 
 	if (merge)
 	{
@@ -1228,12 +1294,12 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 
 			if (load1)
 			{
-				_bt_buildadd(wstate, state, itup);
+				_bt_buildadd(wstate, state, itup, 0);
 				itup = tuplesort_getindextuple(btspool->sortstate, true);
 			}
 			else
 			{
-				_bt_buildadd(wstate, state, itup2);
+				_bt_buildadd(wstate, state, itup2, 0);
 				itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
 			}
 
@@ -1243,9 +1309,111 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		}
 		pfree(sortKeys);
 	}
+	else if (deduplicate)
+	{
+		/* merge is unnecessary, deduplicate into posting lists */
+		BTDedupState dstate;
+		IndexTuple	newbase;
+
+		dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
+		dstate->maxitemsize = 0;	/* set later */
+		dstate->checkingunique = false; /* unused */
+		dstate->skippedbase = InvalidOffsetNumber;
+		/* Metadata about base tuple of current pending posting list */
+		dstate->base = NULL;
+		dstate->baseoff = InvalidOffsetNumber;	/* unused */
+		dstate->basetupsize = 0;
+		/* Metadata about current pending posting list TIDs */
+		dstate->htids = NULL;
+		dstate->nhtids = 0;
+		dstate->nitems = 0;
+		dstate->phystupsize = 0;	/* unused */
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate,
+											   true)) != NULL)
+		{
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+			{
+				state = _bt_pagestate(wstate, 0);
+
+				/*
+				 * Limit size of posting list tuples to the size of the free
+				 * space we want to leave behind on the page, plus space for
+				 * final item's line pointer (but make sure that posting list
+				 * tuple size won't exceed the generic 1/3 of a page limit).
+				 *
+				 * This is more conservative than the approach taken in the
+				 * retail insert path, but it allows us to get most of the
+				 * space savings deduplication provides without noticeably
+				 * impacting how much free space is left behind on each leaf
+				 * page.
+				 */
+				dstate->maxitemsize =
+					Min(Min(BTMaxItemSize(state->btps_page), INDEX_SIZE_MASK),
+						MAXALIGN_DOWN(state->btps_full) - sizeof(ItemIdData));
+				/* Minimum posting tuple size used here is arbitrary: */
+				dstate->maxitemsize = Max(dstate->maxitemsize, 100);
+				dstate->htids = palloc(dstate->maxitemsize);
+
+				/*
+				 * No previous/base tuple, since itup is the  first item
+				 * returned by the tuplesort -- use itup as base tuple of
+				 * first pending posting list for entire index build
+				 */
+				newbase = CopyIndexTuple(itup);
+				_bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+			}
+			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
+										 itup) > keysz &&
+					 _bt_dedup_save_htid(dstate, itup))
+			{
+				/*
+				 * Tuple is equal to base tuple of pending posting list, and
+				 * merging itup into pending posting list won't exceed the
+				 * maxitemsize limit.  Heap TID(s) for itup have been saved in
+				 * state.  The next iteration will also end up here if it's
+				 * possible to merge the next tuple into the same pending
+				 * posting list.
+				 */
+			}
+			else
+			{
+				/*
+				 * Tuple is not equal to pending posting list tuple, or
+				 * maxitemsize limit was reached
+				 */
+				_bt_sort_dedup_finish_pending(wstate, state, dstate);
+				/* Base tuple is always a copy */
+				pfree(dstate->base);
+
+				/* itup starts new pending posting list */
+				newbase = CopyIndexTuple(itup);
+				_bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+			}
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+
+		/*
+		 * Handle the last item (there must be a last item when the tuplesort
+		 * returned one or more tuples)
+		 */
+		if (state)
+		{
+			_bt_sort_dedup_finish_pending(wstate, state, dstate);
+			/* Base tuple is always a copy */
+			pfree(dstate->base);
+			pfree(dstate->htids);
+		}
+
+		pfree(dstate);
+	}
 	else
 	{
-		/* merge is unnecessary */
+		/* merging and deduplication are both unnecessary */
 		while ((itup = tuplesort_getindextuple(btspool->sortstate,
 											   true)) != NULL)
 		{
@@ -1253,7 +1421,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			if (state == NULL)
 				state = _bt_pagestate(wstate, 0);
 
-			_bt_buildadd(wstate, state, itup);
+			_bt_buildadd(wstate, state, itup, 0);
 
 			/* Report progress */
 			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 76c2d945c8..8ba055be9e 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
 	state.minfirstrightsz = SIZE_MAX;
 	state.newitemoff = newitemoff;
 
+	/* newitem cannot be a posting list item */
+	Assert(!BTreeTupleIsPosting(newitem));
+
 	/*
 	 * maxsplits should never exceed maxoff because there will be at most as
 	 * many candidate split points as there are points _between_ tuples, once
@@ -459,6 +462,7 @@ _bt_recsplitloc(FindSplitData *state,
 	int16		leftfree,
 				rightfree;
 	Size		firstrightitemsz;
+	Size		postingsz = 0;
 	bool		newitemisfirstonright;
 
 	/* Is the new item going to be the first item on the right page? */
@@ -468,8 +472,30 @@ _bt_recsplitloc(FindSplitData *state,
 	if (newitemisfirstonright)
 		firstrightitemsz = state->newitemsz;
 	else
+	{
 		firstrightitemsz = firstoldonrightsz;
 
+		/*
+		 * Calculate suffix truncation space saving when firstright is a
+		 * posting list tuple, though only when the firstright is over 64
+		 * bytes including line pointer overhead (arbitrary).  This avoids
+		 * accessing the tuple in cases where its posting list must be very
+		 * small (if firstright has one at all).
+		 */
+		if (state->is_leaf && firstrightitemsz > 64)
+		{
+			ItemId		itemid;
+			IndexTuple	newhighkey;
+
+			itemid = PageGetItemId(state->page, firstoldonright);
+			newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+			if (BTreeTupleIsPosting(newhighkey))
+				postingsz = IndexTupleSize(newhighkey) -
+					BTreeTupleGetPostingOffset(newhighkey);
+		}
+	}
+
 	/* Account for all the old tuples */
 	leftfree = state->leftspace - olddataitemstoleft;
 	rightfree = state->rightspace -
@@ -491,11 +517,17 @@ _bt_recsplitloc(FindSplitData *state,
 	 * If we are on the leaf level, assume that suffix truncation cannot avoid
 	 * adding a heap TID to the left half's new high key when splitting at the
 	 * leaf level.  In practice the new high key will often be smaller and
-	 * will rarely be larger, but conservatively assume the worst case.
+	 * will rarely be larger, but conservatively assume the worst case.  We do
+	 * go to the trouble of subtracting away posting list overhead, though
+	 * only when it looks like it will make an appreciable difference.
+	 * (Posting lists are the only case where truncation will typically make
+	 * the final high key far smaller than firstright, so being a bit more
+	 * precise there noticeably improves the balance of free space.)
 	 */
 	if (state->is_leaf)
 		leftfree -= (int16) (firstrightitemsz +
-							 MAXALIGN(sizeof(ItemPointerData)));
+							 MAXALIGN(sizeof(ItemPointerData)) -
+							 postingsz);
 	else
 		leftfree -= (int16) firstrightitemsz;
 
@@ -691,7 +723,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
 	tup = (IndexTuple) PageGetItem(state->page, itemid);
 	/* Do cheaper test first */
-	if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+	if (BTreeTupleIsPosting(tup) ||
+		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
 	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 5ab4e712f1..27299c3f75 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -20,6 +20,7 @@
 #include "access/nbtree.h"
 #include "access/reloptions.h"
 #include "access/relscan.h"
+#include "catalog/catalog.h"
 #include "commands/progress.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -107,7 +108,13 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	 */
 	key = palloc(offsetof(BTScanInsertData, scankeys) +
 				 sizeof(ScanKeyData) * indnkeyatts);
-	key->heapkeyspace = itup == NULL || _bt_heapkeyspace(rel);
+	if (itup)
+		_bt_metaversion(rel, &key->heapkeyspace, &key->safededup);
+	else
+	{
+		key->heapkeyspace = true;
+		key->safededup = _bt_opclasses_support_dedup(rel);
+	}
 	key->anynullkeys = false;	/* initial assumption */
 	key->nextkey = false;
 	key->pivotsearch = false;
@@ -1373,6 +1380,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			continue;
 		}
 
@@ -1534,6 +1542,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			cmpresult = 0;
 			if (subkey->sk_flags & SK_ROW_END)
 				break;
@@ -1773,10 +1782,65 @@ _bt_killitems(IndexScanDesc scan)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
+			bool		killtuple = false;
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			if (BTreeTupleIsPosting(ituple))
 			{
-				/* found the item */
+				int			pi = i + 1;
+				int			nposting = BTreeTupleGetNPosting(ituple);
+				int			j;
+
+				/*
+				 * Note that we rely on the assumption that heap TIDs in the
+				 * scanpos items array are always in ascending heap TID order
+				 * within a posting list
+				 */
+				for (j = 0; j < nposting; j++)
+				{
+					ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+					if (!ItemPointerEquals(item, &kitem->heapTid))
+						break;	/* out of posting list loop */
+
+					/* kitem must have matching offnum when heap TIDs match */
+					Assert(kitem->indexOffset == offnum);
+
+					/*
+					 * Read-ahead to later kitems here.
+					 *
+					 * We rely on the assumption that not advancing kitem here
+					 * will prevent us from considering the posting list tuple
+					 * fully dead by not matching its next heap TID in next
+					 * loop iteration.
+					 *
+					 * If, on the other hand, this is the final heap TID in
+					 * the posting list tuple, then tuple gets killed
+					 * regardless (i.e. we handle the case where the last
+					 * kitem is also the last heap TID in the last index tuple
+					 * correctly -- posting tuple still gets killed).
+					 */
+					if (pi < numKilled)
+						kitem = &so->currPos.items[so->killedItems[pi++]];
+				}
+
+				/*
+				 * Don't bother advancing the outermost loop's int iterator to
+				 * avoid processing killed items that relate to the same
+				 * offnum/posting list tuple.  This micro-optimization hardly
+				 * seems worth it.  (Further iterations of the outermost loop
+				 * will fail to match on this same posting list's first heap
+				 * TID instead, so we'll advance to the next offnum/index
+				 * tuple pretty quickly.)
+				 */
+				if (j == nposting)
+					killtuple = true;
+			}
+			else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+				killtuple = true;
+
+			if (killtuple)
+			{
+				/* found the item/all posting list items */
 				ItemIdMarkDead(iid);
 				killedsomething = true;
 				break;			/* out of inner search loop */
@@ -2017,7 +2081,9 @@ btoptions(Datum reloptions, bool validate)
 	static const relopt_parse_elt tab[] = {
 		{"fillfactor", RELOPT_TYPE_INT, offsetof(BTOptions, fillfactor)},
 		{"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
-		offsetof(BTOptions, vacuum_cleanup_index_scale_factor)}
+		offsetof(BTOptions, vacuum_cleanup_index_scale_factor)},
+		{"deduplication", RELOPT_TYPE_BOOL,
+		offsetof(BTOptions, deduplication)}
 
 	};
 
@@ -2118,11 +2184,10 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	Size		newsize;
 
 	/*
-	 * We should only ever truncate leaf index tuples.  It's never okay to
-	 * truncate a second time.
+	 * We should only ever truncate non-pivot tuples from leaf pages.  It's
+	 * never okay to truncate when splitting an internal page.
 	 */
-	Assert(BTreeTupleGetNAtts(lastleft, rel) == natts);
-	Assert(BTreeTupleGetNAtts(firstright, rel) == natts);
+	Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
 
 	/* Determine how many attributes must be kept in truncated tuple */
 	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
@@ -2138,6 +2203,19 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(pivot))
+		{
+			/*
+			 * index_truncate_tuple() just returns a straight copy of
+			 * firstright when it has no key attributes to truncate.  We need
+			 * to truncate away the posting list ourselves.
+			 */
+			Assert(keepnatts == nkeyatts);
+			Assert(natts == nkeyatts);
+			pivot->t_info &= ~INDEX_SIZE_MASK;
+			pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+		}
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2154,6 +2232,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * attribute to the new pivot tuple.
 		 */
 		Assert(natts != nkeyatts);
+		Assert(!BTreeTupleIsPosting(lastleft) &&
+			   !BTreeTupleIsPosting(firstright));
 		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
 		tidpivot = palloc0(newsize);
 		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2171,6 +2251,18 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
 		pivot = palloc0(newsize);
 		memcpy(pivot, firstright, IndexTupleSize(firstright));
+
+		if (BTreeTupleIsPosting(pivot))
+		{
+			/*
+			 * New pivot tuple was copied from firstright, which happens to be
+			 * a posting list tuple.  We will have to include a lastleft heap
+			 * TID in the final pivot, but we can remove the posting list now.
+			 * (Pivot tuples should never contain a posting list.)
+			 */
+			newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+				MAXALIGN(sizeof(ItemPointerData));
+		}
 	}
 
 	/*
@@ -2198,7 +2290,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2209,9 +2301,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
+							  BTreeTupleGetHeapTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #else
 
 	/*
@@ -2224,7 +2319,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2233,7 +2328,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2314,13 +2410,16 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
  * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal after detoasting.
+ * unless they're bitwise equal after detoasting.  When an index is considered
+ * deduplication-safe by _bt_opclasses_support_dedup, routine is guaranteed to
+ * give the same result as _bt_keep_natts would.
  *
- * These issues must be acceptable to callers, typically because they're only
- * concerned about making suffix truncation as effective as possible without
- * leaving excessive amounts of free space on either side of page split.
- * Callers can rely on the fact that attributes considered equal here are
- * definitely also equal according to _bt_keep_natts.
+ * Suffix truncation callers can rely on the fact that attributes considered
+ * equal here are definitely also equal according to _bt_keep_natts, even when
+ * the index uses an opclass or collation that is not deduplication-safe.
+ * This weaker guarantee is good enough for these callers, since false
+ * negatives generally only have the effect of making leaf page splits use a
+ * more balanced split point.
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2398,22 +2497,36 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
+	/* !heapkeyspace indexes do not support deduplication */
+	if (!heapkeyspace && BTreeTupleIsPosting(itup))
+		return false;
+
+	/* Posting list tuples should never have "pivot heap TID" bit set */
+	if (BTreeTupleIsPosting(itup) &&
+		(ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+		 BT_PIVOT_HEAP_TID_ATTR) != 0)
+		return false;
+
+	/* INCLUDE indexes do not support deduplication */
+	if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+		return false;
+
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
 			/*
-			 * Non-pivot tuples currently never use alternative heap TID
-			 * representation -- even those within heapkeyspace indexes
+			 * Non-pivot tuple should never be explicitly marked as a pivot
+			 * tuple
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated.  (Note that tupnatts must have been
-			 * inferred, rather than coming from an explicit on-disk
-			 * representation.)
+			 * inferred, even with a posting list tuple, because only pivot
+			 * tuples store tupnatts directly.)
 			 */
 			return tupnatts == natts;
 		}
@@ -2457,12 +2570,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * non-zero, or when there is no explicit representation and the
 			 * tuple is evidently not a pre-pg_upgrade tuple.
 			 *
-			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
-			 * that to decide if the tuple is a pre-v11 tuple.
+			 * Prior to v11, downlinks always had P_HIKEY as their offset.
+			 * Accept that as an alternative indication of a valid
+			 * !heapkeyspace negative infinity tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
-				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+				ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
 		}
 		else
 		{
@@ -2488,7 +2601,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
+		return false;
+
+	/* Pivot tuple should not use posting list representation (redundant) */
+	if (BTreeTupleIsPosting(itup))
 		return false;
 
 	/*
@@ -2558,11 +2675,54 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					BTMaxItemSizeNoHeapTid(page),
 					RelationGetRelationName(rel)),
 			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
-					   ItemPointerGetBlockNumber(&newtup->t_tid),
-					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+					   ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
 					   RelationGetRelationName(heap)),
 			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
 					 "Consider a function index of an MD5 hash of the value, "
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Is it safe to perform deduplication for an index, given the opclasses and
+ * collations used?
+ *
+ * Returned value is stored in index metapage during index builds.  Function
+ * does not account for incompatibilities caused by index being on an earlier
+ * nbtree version.
+ */
+bool
+_bt_opclasses_support_dedup(Relation index)
+{
+	/* INCLUDE indexes don't support deduplication */
+	if (IndexRelationGetNumberOfAttributes(index) !=
+		IndexRelationGetNumberOfKeyAttributes(index))
+		return false;
+
+	/*
+	 * There is no reason why deduplication cannot be used with system catalog
+	 * indexes.  However, we deem it generally unsafe because it's not clear
+	 * how it could be disabled.  (ALTER INDEX is not supported with system
+	 * catalog indexes, so users have no way to set the "deduplicate" storage
+	 * parameter.)
+	 */
+	if (IsCatalogRelation(index))
+		return false;
+
+	for (int i = 0; i < IndexRelationGetNumberOfKeyAttributes(index); i++)
+	{
+		Oid			opfamily = index->rd_opfamily[i];
+		Oid			collation = index->rd_indcollation[i];
+
+		/* TODO add adequate check of opclasses and collations */
+		elog(DEBUG4, "index %s column i %d opfamilyOid %u collationOid %u",
+			 RelationGetRelationName(index), i, opfamily, collation);
+
+		/* NUMERIC btree opfamily OID is 1988 */
+		if (opfamily == 1988)
+			return false;
+	}
+
+	return true;
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 2e5202c2d6..9a4b522950 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -22,6 +22,9 @@
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/procarray.h"
+#include "utils/memutils.h"
+
+static MemoryContext opCtx;		/* working memory for operations */
 
 /*
  * _bt_restore_page -- re-enter all the index tuples on a page
@@ -111,6 +114,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
 	Assert(md->btm_version >= BTREE_NOVAC_VERSION);
 	md->btm_oldest_btpo_xact = xlrec->oldest_btpo_xact;
 	md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
+	md->btm_safededup = xlrec->safededup;
 
 	pageop = (BTPageOpaque) PageGetSpecialPointer(metapg);
 	pageop->btpo_flags = BTP_META;
@@ -156,7 +160,8 @@ _bt_clear_incomplete_split(XLogReaderState *record, uint8 block_id)
 }
 
 static void
-btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
+btree_xlog_insert(bool isleaf, bool ismeta, bool posting,
+				  XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	xl_btree_insert *xlrec = (xl_btree_insert *) XLogRecGetData(record);
@@ -181,9 +186,52 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 
 		page = BufferGetPage(buffer);
 
-		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
-						false, false) == InvalidOffsetNumber)
-			elog(PANIC, "btree_xlog_insert: failed to add item");
+		if (!posting)
+		{
+			/* Simple retail insertion */
+			if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+		}
+		else
+		{
+			ItemId		itemid;
+			IndexTuple	oposting,
+						newitem,
+						nposting;
+			uint16		postingoff;
+
+			/*
+			 * A posting list split occurred during leaf page insertion.  WAL
+			 * record data will start with an offset number representing the
+			 * point in an existing posting list that a split occurs at.
+			 *
+			 * Use _bt_swap_posting() to repeat posting list split steps from
+			 * primary.  Note that newitem from WAL record is 'orignewitem',
+			 * not the final version of newitem that is actually inserted on
+			 * page.
+			 */
+			postingoff = *((uint16 *) datapos);
+			datapos += sizeof(uint16);
+			datalen -= sizeof(uint16);
+
+			itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+			oposting = (IndexTuple) PageGetItem(page, itemid);
+
+			/* Use mutable, aligned newitem copy in _bt_swap_posting() */
+			Assert(isleaf && postingoff > 0);
+			newitem = CopyIndexTuple((IndexTuple) datapos);
+			nposting = _bt_swap_posting(newitem, oposting, postingoff);
+
+			/* Replace existing posting list with post-split version */
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+			/* Insert "final" new item (not orignewitem from WAL stream) */
+			Assert(IndexTupleSize(newitem) == datalen);
+			if (PageAddItem(page, (Item) newitem, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add posting split new item");
+		}
 
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
@@ -265,20 +313,38 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 		BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
 		OffsetNumber off;
 		IndexTuple	newitem = NULL,
-					left_hikey = NULL;
+					left_hikey = NULL,
+					nposting = NULL;
 		Size		newitemsz = 0,
 					left_hikeysz = 0;
 		Page		newlpage;
-		OffsetNumber leftoff;
+		OffsetNumber leftoff,
+					replacepostingoff = InvalidOffsetNumber;
 
 		datapos = XLogRecGetBlockData(record, 0, &datalen);
 
-		if (onleft)
+		if (onleft || xlrec->postingoff != 0)
 		{
 			newitem = (IndexTuple) datapos;
 			newitemsz = MAXALIGN(IndexTupleSize(newitem));
 			datapos += newitemsz;
 			datalen -= newitemsz;
+
+			if (xlrec->postingoff != 0)
+			{
+				ItemId		itemid;
+				IndexTuple	oposting;
+
+				/* Posting list must be at offset number before new item's */
+				replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+				/* Use mutable, aligned newitem copy in _bt_swap_posting() */
+				newitem = CopyIndexTuple(newitem);
+				itemid = PageGetItemId(lpage, replacepostingoff);
+				oposting = (IndexTuple) PageGetItem(lpage, itemid);
+				nposting = _bt_swap_posting(newitem, oposting,
+											xlrec->postingoff);
+			}
 		}
 
 		/*
@@ -308,8 +374,20 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 			Size		itemsz;
 			IndexTuple	item;
 
+			/* Add replacement posting list when required */
+			if (off == replacepostingoff)
+			{
+				Assert(onleft || xlrec->firstright == xlrec->newitemoff);
+				if (PageAddItem(newlpage, (Item) nposting,
+								MAXALIGN(IndexTupleSize(nposting)), leftoff,
+								false, false) == InvalidOffsetNumber)
+					elog(ERROR, "failed to add new posting list item to left page after split");
+				leftoff = OffsetNumberNext(leftoff);
+				continue;		/* don't insert oposting */
+			}
+
 			/* add the new item if it was inserted on left page */
-			if (onleft && off == xlrec->newitemoff)
+			else if (onleft && off == xlrec->newitemoff)
 			{
 				if (PageAddItem(newlpage, (Item) newitem, newitemsz, leftoff,
 								false, false) == InvalidOffsetNumber)
@@ -383,6 +461,56 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 	}
 }
 
+static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+	Buffer		buf;
+
+	if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+	{
+		Page		page = (Page) BufferGetPage(buf);
+		OffsetNumber offnum;
+		BTDedupState state;
+
+		state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+		state->maxitemsize = Min(BTMaxItemSize(page), INDEX_SIZE_MASK);
+		state->checkingunique = false;	/* unused */
+		state->skippedbase = InvalidOffsetNumber;
+		state->base = NULL;
+		state->baseoff = InvalidOffsetNumber;
+		state->basetupsize = 0;
+		state->htids = NULL;
+		state->nhtids = 0;
+		state->nitems = 0;
+		state->phystupsize = 0;
+		state->htids = palloc(state->maxitemsize);
+
+		for (offnum = xlrec->baseoff;
+			 offnum < xlrec->baseoff + xlrec->nitems;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemid = PageGetItemId(page, offnum);
+			IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+			if (offnum == xlrec->baseoff)
+				_bt_dedup_start_pending(state, itup, offnum);
+			else if (!_bt_dedup_save_htid(state, itup))
+				elog(ERROR, "could not add heap tid to pending posting list");
+		}
+
+		Assert(state->nitems == xlrec->nitems);
+		_bt_dedup_finish_pending(buf, state, false);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buf);
+	}
+
+	if (BufferIsValid(buf))
+		UnlockReleaseBuffer(buf);
+}
+
 static void
 btree_xlog_vacuum(XLogReaderState *record)
 {
@@ -405,7 +533,31 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		page = (Page) BufferGetPage(buffer);
 
-		PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
+		if (xlrec->nupdated > 0)
+		{
+			OffsetNumber *updatedoffsets;
+			IndexTuple	updated;
+			Size		itemsz;
+
+			updatedoffsets = (OffsetNumber *)
+				(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+			updated = (IndexTuple) ((char *) updatedoffsets +
+									xlrec->nupdated * sizeof(OffsetNumber));
+
+			for (int i = 0; i < xlrec->nupdated; i++)
+			{
+				itemsz = MAXALIGN(IndexTupleSize(updated));
+
+				if (!PageIndexTupleOverwrite(page, updatedoffsets[i],
+											 (Item) updated, itemsz))
+					elog(PANIC, "could not update partially dead item");
+
+				updated = (IndexTuple) ((char *) updated + itemsz);
+			}
+		}
+
+		if (xlrec->ndeleted > 0)
+			PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 
 		/*
 		 * Mark the page as not containing any LP_DEAD items --- see comments
@@ -724,17 +876,22 @@ void
 btree_redo(XLogReaderState *record)
 {
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	MemoryContext oldCtx;
 
+	oldCtx = MemoryContextSwitchTo(opCtx);
 	switch (info)
 	{
 		case XLOG_BTREE_INSERT_LEAF:
-			btree_xlog_insert(true, false, record);
+			btree_xlog_insert(true, false, false, record);
 			break;
 		case XLOG_BTREE_INSERT_UPPER:
-			btree_xlog_insert(false, false, record);
+			btree_xlog_insert(false, false, false, record);
 			break;
 		case XLOG_BTREE_INSERT_META:
-			btree_xlog_insert(false, true, record);
+			btree_xlog_insert(false, true, false, record);
+			break;
+		case XLOG_BTREE_INSERT_POST:
+			btree_xlog_insert(true, false, true, record);
 			break;
 		case XLOG_BTREE_SPLIT_L:
 			btree_xlog_split(true, record);
@@ -742,6 +899,9 @@ btree_redo(XLogReaderState *record)
 		case XLOG_BTREE_SPLIT_R:
 			btree_xlog_split(false, record);
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			btree_xlog_dedup(record);
+			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
 			break;
@@ -767,6 +927,23 @@ btree_redo(XLogReaderState *record)
 		default:
 			elog(PANIC, "btree_redo: unknown op code %u", info);
 	}
+	MemoryContextSwitchTo(oldCtx);
+	MemoryContextReset(opCtx);
+}
+
+void
+btree_xlog_startup(void)
+{
+	opCtx = AllocSetContextCreate(CurrentMemoryContext,
+								  "Btree recovery temporary context",
+								  ALLOCSET_DEFAULT_SIZES);
+}
+
+void
+btree_xlog_cleanup(void)
+{
+	MemoryContextDelete(opCtx);
+	opCtx = NULL;
 }
 
 /*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 7d63a7124e..68fad1c91f 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -27,6 +27,7 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 		case XLOG_BTREE_INSERT_LEAF:
 		case XLOG_BTREE_INSERT_UPPER:
 		case XLOG_BTREE_INSERT_META:
+		case XLOG_BTREE_INSERT_POST:
 			{
 				xl_btree_insert *xlrec = (xl_btree_insert *) rec;
 
@@ -38,15 +39,25 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
-				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d",
-								 xlrec->level, xlrec->firstright, xlrec->newitemoff);
+				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, postingoff %d",
+								 xlrec->level, xlrec->firstright,
+								 xlrec->newitemoff, xlrec->postingoff);
+				break;
+			}
+		case XLOG_BTREE_DEDUP_PAGE:
+			{
+				xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+				appendStringInfo(buf, "baseoff %u; nitems %u",
+								 xlrec->baseoff, xlrec->nitems);
 				break;
 			}
 		case XLOG_BTREE_VACUUM:
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "ndeleted %u", xlrec->ndeleted);
+				appendStringInfo(buf, "ndeleted %u; nupdated %u",
+								 xlrec->ndeleted, xlrec->nupdated);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
@@ -130,6 +141,12 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			id = "DEDUPLICATE";
+			break;
+		case XLOG_BTREE_INSERT_POST:
+			id = "INSERT_POST";
+			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index f47176753d..32ff03b3e4 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1055,8 +1055,10 @@ PageIndexTupleDeleteNoCompact(Page page, OffsetNumber offnum)
  * This is better than deleting and reinserting the tuple, because it
  * avoids any data shifting when the tuple size doesn't change; and
  * even when it does, we avoid moving the line pointers around.
- * Conceivably this could also be of use to an index AM that cares about
- * the physical order of tuples as well as their ItemId order.
+ * This could be used by an index AM that doesn't want to unset the
+ * LP_DEAD bit when it happens to be set.  It could conceivably also be
+ * used by an index AM that cares about the physical order of tuples as
+ * well as their logical/ItemId order.
  *
  * If there's insufficient space for the new tuple, return false.  Other
  * errors represent data-corruption problems, so we just elog.
@@ -1142,7 +1144,8 @@ PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
 	}
 
 	/* Update the item's tuple length (other fields shouldn't change) */
-	ItemIdSetNormal(tupid, offset + size_diff, newsize);
+	tupid->lp_off = offset + size_diff;
+	tupid->lp_len = newsize;
 
 	/* Copy new tuple data onto page */
 	memcpy(PageGetItem(page, tupid), newtup, newsize);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index a4e5d0886a..f44f2ce93f 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -28,6 +28,7 @@
 
 #include "access/commit_ts.h"
 #include "access/gin.h"
+#include "access/nbtree.h"
 #include "access/rmgr.h"
 #include "access/tableam.h"
 #include "access/transam.h"
@@ -1091,6 +1092,15 @@ static struct config_bool ConfigureNamesBool[] =
 		false,
 		check_bonjour, NULL, NULL
 	},
+	{
+		{"btree_deduplication", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Enables B-tree index deduplication optimization."),
+			NULL
+		},
+		&btree_deduplication,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"track_commit_timestamp", PGC_POSTMASTER, REPLICATION,
 			gettext_noop("Collects transaction commit time."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 087190ce63..739676b9d0 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -651,6 +651,7 @@
 #vacuum_cleanup_index_scale_factor = 0.1	# fraction of total number of tuples
 						# before index cleanup, 0 always performs
 						# index cleanup
+#btree_deduplication = on
 #bytea_output = 'hex'			# hex, escape
 #xmlbinary = 'base64'
 #xmloption = 'content'
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 2fd88866c9..c374c64e2a 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1685,14 +1685,14 @@ psql_completion(const char *text, int start, int end)
 	/* ALTER INDEX <foo> SET|RESET ( */
 	else if (Matches("ALTER", "INDEX", MatchAny, "RESET", "("))
 		COMPLETE_WITH("fillfactor",
-					  "vacuum_cleanup_index_scale_factor",	/* BTREE */
+					  "vacuum_cleanup_index_scale_factor", "deduplication",	/* BTREE */
 					  "fastupdate", "gin_pending_list_limit",	/* GIN */
 					  "buffering",	/* GiST */
 					  "pages_per_range", "autosummarize"	/* BRIN */
 			);
 	else if (Matches("ALTER", "INDEX", MatchAny, "SET", "("))
 		COMPLETE_WITH("fillfactor =",
-					  "vacuum_cleanup_index_scale_factor =",	/* BTREE */
+					  "vacuum_cleanup_index_scale_factor =", "deduplication =",	/* BTREE */
 					  "fastupdate =", "gin_pending_list_limit =",	/* GIN */
 					  "buffering =",	/* GiST */
 					  "pages_per_range =", "autosummarize ="	/* BRIN */
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 6a058ccdac..b533a99300 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -145,6 +145,7 @@ static void bt_tuple_present_callback(Relation index, ItemPointer tid,
 									  bool tupleIsAlive, void *checkstate);
 static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 									 IndexTuple itup);
+static inline IndexTuple bt_posting_logical_tuple(IndexTuple itup, int n);
 static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 											   OffsetNumber offset);
@@ -167,6 +168,7 @@ static ItemId PageGetItemIdCareful(BtreeCheckState *state, BlockNumber block,
 								   Page page, OffsetNumber offset);
 static inline ItemPointer BTreeTupleGetHeapTIDCareful(BtreeCheckState *state,
 													  IndexTuple itup, bool nonpivot);
+static inline ItemPointer BTreeTupleGetPointsToTID(IndexTuple itup);
 
 /*
  * bt_index_check(index regclass, heapallindexed boolean)
@@ -278,7 +280,8 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
 
 	if (btree_index_mainfork_expected(indrel))
 	{
-		bool	heapkeyspace;
+		bool		heapkeyspace,
+					safededup;
 
 		RelationOpenSmgr(indrel);
 		if (!smgrexists(indrel->rd_smgr, MAIN_FORKNUM))
@@ -288,7 +291,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
 							RelationGetRelationName(indrel))));
 
 		/* Check index, possibly against table it is an index on */
-		heapkeyspace = _bt_heapkeyspace(indrel);
+		_bt_metaversion(indrel, &heapkeyspace, &safededup);
 		bt_check_every_level(indrel, heaprel, heapkeyspace, parentcheck,
 							 heapallindexed, rootdescend);
 	}
@@ -419,12 +422,13 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		/*
 		 * Size Bloom filter based on estimated number of tuples in index,
 		 * while conservatively assuming that each block must contain at least
-		 * MaxIndexTuplesPerPage / 5 non-pivot tuples.  (Non-leaf pages cannot
-		 * contain non-pivot tuples.  That's okay because they generally make
-		 * up no more than about 1% of all pages in the index.)
+		 * MaxBTreeIndexTuplesPerPage / 3 "logical" tuples.  heapallindexed
+		 * verification fingerprints posting list heap TIDs as plain non-pivot
+		 * tuples, complete with index keys.  This allows its heap scan to
+		 * behave as if posting lists do not exist.
 		 */
 		total_pages = RelationGetNumberOfBlocks(rel);
-		total_elems = Max(total_pages * (MaxIndexTuplesPerPage / 5),
+		total_elems = Max(total_pages * (MaxBTreeIndexTuplesPerPage / 3),
 						  (int64) state->rel->rd_rel->reltuples);
 		/* Random seed relies on backend srandom() call to avoid repetition */
 		seed = random();
@@ -924,6 +928,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -954,13 +959,15 @@ bt_target_page_check(BtreeCheckState *state)
 		if (!_bt_check_natts(state->rel, state->heapkeyspace, state->target,
 							 offset))
 		{
+			ItemPointer tid;
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetPointsToTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -994,18 +1001,21 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
+			ItemPointer tid = BTreeTupleGetPointsToTID(itup);
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
-			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+			htid = psprintf("(%u,%u)", ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1017,6 +1027,40 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is actually a posting list, make sure posting list TIDs
+		 * are in order.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+				current = BTreeTupleGetPostingN(itup, i);
+
+				if (ItemPointerCompare(current, &last) <= 0)
+				{
+					char	   *itid = psprintf("(%u,%u)", state->targetblock, offset);
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s posting list offset=%d page lsn=%X/%X.",
+												itid, i,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1049,13 +1093,14 @@ bt_target_page_check(BtreeCheckState *state)
 		if (tupsize > (lowersizelimit ? BTMaxItemSize(state->target) :
 					   BTMaxItemSizeNoHeapTid(state->target)))
 		{
+			ItemPointer tid = BTreeTupleGetPointsToTID(itup);
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1074,12 +1119,32 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				/* Fingerprint all elements as distinct "logical" tuples */
+				for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					IndexTuple	logtuple;
+
+					logtuple = bt_posting_logical_tuple(itup, i);
+					norm = bt_normalize_tuple(state, logtuple);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != logtuple)
+						pfree(norm);
+					pfree(logtuple);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
@@ -1087,7 +1152,8 @@ bt_target_page_check(BtreeCheckState *state)
 		 *
 		 * If there is a high key (if this is not the rightmost page on its
 		 * entire level), check that high key actually is upper bound on all
-		 * page items.
+		 * page items.  If this is a posting list tuple, we'll need to set
+		 * scantid to be highest TID in posting list.
 		 *
 		 * We prefer to check all items against high key rather than checking
 		 * just the last and trusting that the operator class obeys the
@@ -1127,17 +1193,22 @@ bt_target_page_check(BtreeCheckState *state)
 		 * tuple. (See also: "Notes About Data Representation" in the nbtree
 		 * README.)
 		 */
+		scantid = skey->scantid;
+		if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+			skey->scantid = BTreeTupleGetMaxHeapTID(itup);
+
 		if (!P_RIGHTMOST(topaque) &&
 			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
 			  invariant_l_offset(state, skey, P_HIKEY)))
 		{
+			ItemPointer tid = BTreeTupleGetPointsToTID(itup);
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1150,6 +1221,7 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
+		skey->scantid = scantid;
 
 		/*
 		 * * Item order check *
@@ -1160,15 +1232,17 @@ bt_target_page_check(BtreeCheckState *state)
 		if (OffsetNumberNext(offset) <= max &&
 			!invariant_l_offset(state, skey, OffsetNumberNext(offset)))
 		{
+			ItemPointer tid;
 			char	   *itid,
 					   *htid,
 					   *nitid,
 					   *nhtid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetPointsToTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 			nitid = psprintf("(%u,%u)", state->targetblock,
 							 OffsetNumberNext(offset));
 
@@ -1177,9 +1251,10 @@ bt_target_page_check(BtreeCheckState *state)
 										  state->target,
 										  OffsetNumberNext(offset));
 			itup = (IndexTuple) PageGetItem(state->target, itemid);
+			tid = BTreeTupleGetPointsToTID(itup);
 			nhtid = psprintf("(%u,%u)",
-							 ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							 ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							 ItemPointerGetBlockNumberNoCheck(tid),
+							 ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1953,10 +2028,10 @@ bt_tuple_present_callback(Relation index, ItemPointer tid, Datum *values,
  * verification.  In particular, it won't try to normalize opclass-equal
  * datums with potentially distinct representations (e.g., btree/numeric_ops
  * index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though.  For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation.  Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
  */
 static IndexTuple
 bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -1969,6 +2044,9 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	IndexTuple	reformed;
 	int			i;
 
+	/* Caller should only pass "logical" non-pivot tuples here */
+	Assert(!BTreeTupleIsPosting(itup) && !BTreeTupleIsPivot(itup));
+
 	/* Easy case: It's immediately clear that tuple has no varlena datums */
 	if (!IndexTupleHasVarwidths(itup))
 		return itup;
@@ -2031,6 +2109,30 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	return reformed;
 }
 
+/*
+ * Produce palloc()'d "logical" tuple for nth posting list entry.
+ *
+ * In general, deduplication is not supposed to change the logical contents of
+ * an index.  Multiple logical index tuples are merged together into one
+ * physical posting list index tuple when convenient.
+ *
+ * heapallindexed verification must normalize-away this variation in
+ * representation by converting posting list tuples into two or more "logical"
+ * tuples.  Each logical tuple must be fingerprinted separately -- there must
+ * be one logical tuple for each corresponding Bloom filter probe during the
+ * heap scan.
+ *
+ * Note: Caller needs to call bt_normalize_tuple() with returned tuple.
+ */
+static inline IndexTuple
+bt_posting_logical_tuple(IndexTuple itup, int n)
+{
+	Assert(BTreeTupleIsPosting(itup));
+
+	/* Returns non-posting-list tuple */
+	return _bt_form_posting(itup, BTreeTupleGetPostingN(itup, n), 1);
+}
+
 /*
  * Search for itup in index, starting from fast root page.  itup must be a
  * non-pivot tuple.  This is only supported with heapkeyspace indexes, since
@@ -2087,6 +2189,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		insertstate.itup = itup;
 		insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 		insertstate.itup_key = key;
+		insertstate.postingoff = 0;
 		insertstate.bounds_valid = false;
 		insertstate.buf = lbuf;
 
@@ -2094,7 +2197,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		offnum = _bt_binsrch_insert(state->rel, &insertstate);
 		/* Compare first >= matching item on leaf page, if any */
 		page = BufferGetPage(lbuf);
+		/* Should match on first heap TID when tuple has a posting list */
 		if (offnum <= PageGetMaxOffsetNumber(page) &&
+			insertstate.postingoff <= 0 &&
 			_bt_compare(state->rel, key, page, offnum) == 0)
 			exists = true;
 		_bt_relbuf(state->rel, lbuf);
@@ -2548,26 +2653,69 @@ PageGetItemIdCareful(BtreeCheckState *state, BlockNumber block, Page page,
 }
 
 /*
- * BTreeTupleGetHeapTID() wrapper that lets caller enforce that a heap TID must
- * be present in cases where that is mandatory.
- *
- * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
- * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
- * It may become more useful in the future, when non-pivot tuples support their
- * own alternative INDEX_ALT_TID_MASK representation.
+ * BTreeTupleGetHeapTID() wrapper that enforces that a heap TID is present in
+ * cases where that is mandatory (i.e. for non-pivot tuples)
  */
 static inline ItemPointer
 BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
 							bool nonpivot)
 {
-	ItemPointer result = BTreeTupleGetHeapTID(itup);
-	BlockNumber targetblock = state->targetblock;
+	ItemPointer htid;
 
-	if (result == NULL && nonpivot)
+	/*
+	 * Caller determines whether this is supposed to be a pivot or non-pivot
+	 * tuple using page type and item offset number.  Verify that tuple
+	 * metadata agrees with this.
+	 */
+	Assert(state->heapkeyspace);
+	if (BTreeTupleIsPivot(itup) && nonpivot)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg_internal("block %u or its right sibling block or child block in index \"%s\" has unexpected pivot tuple",
+								 state->targetblock,
+								 RelationGetRelationName(state->rel))));
+
+	if (!BTreeTupleIsPivot(itup) && !nonpivot)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg_internal("block %u or its right sibling block or child block in index \"%s\" has unexpected non-pivot tuple",
+								 state->targetblock,
+								 RelationGetRelationName(state->rel))));
+
+	htid = BTreeTupleGetHeapTID(itup);
+	if (!ItemPointerIsValid(htid) && nonpivot)
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
-						targetblock, RelationGetRelationName(state->rel))));
+						state->targetblock,
+						RelationGetRelationName(state->rel))));
 
-	return result;
+	return htid;
+}
+
+/*
+ * Return the "pointed to" TID for itup, which is used to generate a
+ * descriptive error message.  itup must be a "data item" tuple (it wouldn't
+ * make much sense to call here with a high key tuple, since there won't be a
+ * valid downlink/block number to display).
+ *
+ * Returns either a heap TID (which will be the first heap TID in posting list
+ * if itup is posting list tuple), or a TID that contains downlink block
+ * number, plus some encoded metadata (e.g., the number of attributes present
+ * in itup).
+ */
+static inline ItemPointer
+BTreeTupleGetPointsToTID(IndexTuple itup)
+{
+	/*
+	 * Rely on the assumption that !heapkeyspace internal page data items will
+	 * correctly return TID with downlink here -- BTreeTupleGetHeapTID() won't
+	 * recognize it as a pivot tuple, but everything still works out because
+	 * the t_tid field is still returned
+	 */
+	if (!BTreeTupleIsPivot(itup))
+		return BTreeTupleGetHeapTID(itup);
+
+	/* Pivot tuple returns TID with downlink block (heapkeyspace variant) */
+	return &itup->t_tid;
 }
diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
index 5881ea5dd6..059477be1e 100644
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@@ -433,11 +433,122 @@ returns bool
 <sect1 id="btree-implementation">
  <title>Implementation</title>
 
+ <para>
+  Internally, a B-tree index consists of a tree structure with leaf
+  pages.  Each leaf page contains tuples that point to table entries
+  using a heap item pointer. Each tuple's key is considered unique
+  internally, since the item pointer is treated as part of the key.
+ </para>
+ <para>
+  An introduction to the btree index implementation can be found in
+  <filename>src/backend/access/nbtree/README</filename>.
+ </para>
+
+ <sect2 id="btree-deduplication">
+  <title>Deduplication</title>
   <para>
-   An introduction to the btree index implementation can be found in
-   <filename>src/backend/access/nbtree/README</filename>.
+   B-Tree supports <firstterm>deduplication</firstterm>.  Existing
+   leaf page tuples with fully equal keys (equal prior to the heap
+   item pointer) are merged together into a single <quote>posting
+   list</quote> tuple.  The keys appear only once in this
+   representation.  A simple array of heap item pointers follows.
+   Posting lists are formed <quote>lazily</quote>, when a new item is
+   inserted that cannot fit on an existing leaf page.  The immediate
+   goal of the deduplication process is to at least free enough space
+   to fit the new item; otherwise a leaf page split occurs, which
+   allocates a new leaf page.  The <firstterm>key space</firstterm>
+   covered by the original leaf page is shared among the original page,
+   and its new right sibling page.
+  </para>
+  <para>
+   Deduplication can greatly increase index space efficiency with data
+   sets where each distinct key appears at least a few times on
+   average.  It can also reduce the cost of subsequent index scans,
+   especially when many leaf pages must be accessed.  For example, an
+   index on a simple <type>integer</type> column that uses
+   deduplication will have a storage size that is only about 65% of an
+   equivalent unoptimized index when each distinct
+   <type>integer</type> value appears three times.  If each distinct
+   <type>integer</type> value appears six times, the storage overhead
+   can be as low as 50% of baseline.  With hundreds of duplicates per
+   distinct value (or with larger <quote>base</quote> key values) a
+   storage size of about <emphasis>one third</emphasis> of the
+   unoptimized case is expected.  There is often a direct benefit for
+   queries, as well as an indirect benefit due to reduced I/O during
+   routine vacuuming.
+  </para>
+  <para>
+   Cases that don't benefit due to having no duplicate values will
+   incur a small performance penalty with mixed read-write workloads.
+   There is no performance penalty with read-only workloads, since
+   reading from posting lists is at least as efficient as reading the
+   standard index tuple representation.
+  </para>
+ </sect2>
+
+ <sect2 id="btree-deduplication-configure">
+  <title>Configuring Deduplication</title>
+
+  <para>
+   The <xref linkend="guc-btree-deduplication"/> configuration
+   parameter controls deduplication.  By default, deduplication is
+   enabled.  The <literal>deduplication</literal> storage parameter
+   can be used to override the configuration paramater for individual
+   indexes.  See <xref linkend="sql-createindex-storage-parameters"/>
+   from the <command>CREATE INDEX</command> documentation for details.
+  </para>
+ </sect2>
+
+ <sect2 id="btree-deduplication-restrictions">
+  <title>Restrictions</title>
+
+  <para>
+   Deduplication can only be used with indexes that use B-Tree
+   operator classes that were declared <literal>BITWISE</literal>.  In
+   practice almost all datatypes support deduplication, though
+   <type>numeric</type> is a notable exception (the <quote>display
+   scale</quote> feature makes it impossible to enable deduplication
+   without losing useful information about equal <type>numeric</type>
+   datums).  Deduplication is not supported with nondeterministic
+   collations, nor is it supported with <literal>INCLUDE</literal>
+   indexes.
+  </para>
+  <para>
+   Note that a multicolumn index is only considered to have duplicates
+   when there are index entries that repeat entire
+   <emphasis>combinations</emphasis> of values (the values stored in
+   each and every column must be equal).
   </para>
 
+ </sect2>
+
+ <sect2 id="btree-deduplication-unique">
+  <title>Internal use of Deduplication in unique indexes</title>
+
+  <para>
+   Page splits that occur due to inserting multiple physical versions
+   (rather than inserting new logical rows) tend to degrade the
+   structure of indexes, especially in the case of unique indexes.
+   Unique indexes use deduplication <emphasis>internally</emphasis>
+   and <emphasis>selectively</emphasis> to delay (and ideally to
+   prevent) these <quote>unnecessary</quote> page splits.
+   <command>VACUUM</command> or autovacuum will eventually remove dead
+   versions of tuples from every index in any case, but usually cannot
+   reverse page splits (in general, the page must be completely empty
+   before <command>VACUUM</command> can <quote>delete</quote> it).
+  </para>
+  <para>
+   The <xref linkend="guc-btree-deduplication"/> configuration
+   parameter does not affect whether or not deduplication is used
+   within unique indexes.  The internal use of deduplication for
+   unique indexes is subject to all of the same restrictions as
+   deduplication in general.  The <literal>deduplication</literal>
+   storage parameter can be set to <literal>OFF</literal> to disable
+   deduplication in unique indexes, but this is intended only as a
+   debugging option for developers.
+  </para>
+
+ </sect2>
 </sect1>
 
 </chapter>
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 55669b5cad..9f371d3e3a 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -928,10 +928,11 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
      nondeterministic collations give a more <quote>correct</quote> behavior,
      especially when considering the full power of Unicode and its many
      special cases, they also have some drawbacks.  Foremost, their use leads
-     to a performance penalty.  Also, certain operations are not possible with
-     nondeterministic collations, such as pattern matching operations.
-     Therefore, they should be used only in cases where they are specifically
-     wanted.
+     to a performance penalty.  Note, in particular, that B-tree cannot use
+     deduplication with indexes that use a nondeterministic collation.  Also,
+     certain operations are not possible with nondeterministic collations,
+     such as pattern matching operations.  Therefore, they should be used
+     only in cases where they are specifically wanted.
     </para>
    </sect3>
   </sect2>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5d1c90282f..05f442d57a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8021,6 +8021,31 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-btree-deduplication" xreflabel="btree_deduplication">
+      <term><varname>btree_deduplication</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>btree_deduplication</varname></primary>
+       <secondary>configuration parameter</secondary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Controls whether deduplication should be used within B-Tree
+        indexes.  Deduplication is an optimization that reduces the
+        storage size of indexes by storing equal index keys only once.
+        See <xref linkend="btree-deduplication"/> for more
+        information.
+       </para>
+
+       <para>
+        This setting can be overridden for individual B-Tree indexes
+        by changing index storage parameters.  See <xref
+        linkend="sql-createindex-storage-parameters"/> from the
+        <command>CREATE INDEX</command> documentation for details.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-bytea-output" xreflabel="bytea_output">
       <term><varname>bytea_output</varname> (<type>enum</type>)
       <indexterm>
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 629a31ef79..e6cdba4c29 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -166,6 +166,8 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
         maximum size allowed for the index type, data insertion will fail.
         In any case, non-key columns duplicate data from the index's table
         and bloat the size of the index, thus potentially slowing searches.
+        Moreover, B-tree deduplication is never used with indexes that
+        have a non-key column.
        </para>
 
        <para>
@@ -388,10 +390,39 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    </variablelist>
 
    <para>
-    B-tree indexes additionally accept this parameter:
+    B-tree indexes also accept these parameters:
    </para>
 
    <variablelist>
+   <varlistentry id="index-reloption-deduplication" xreflabel="deduplication">
+    <term><literal>deduplication</literal>
+     <indexterm>
+      <primary><varname>deduplication</varname></primary>
+      <secondary>storage parameter</secondary>
+     </indexterm>
+    </term>
+    <listitem>
+    <para>
+      Per-index value for <xref linkend="guc-btree-deduplication"/>.
+      Controls usage of the B-tree deduplication technique described
+      in <xref linkend="btree-deduplication"/>.  Set to
+      <literal>ON</literal> or <literal>OFF</literal> to override GUC.
+      (Alternative spellings of <literal>ON</literal> and
+      <literal>OFF</literal> are allowed as described in <xref
+      linkend="config-setting"/>.)
+    </para>
+
+    <note>
+     <para>
+      Turning <literal>deduplication</literal> off via <command>ALTER
+      INDEX</command> prevents future insertions from triggering
+      deduplication, but does not in itself make existing posting list
+      tuples use the standard tuple representation.
+     </para>
+    </note>
+    </listitem>
+   </varlistentry>
+
    <varlistentry id="index-reloption-vacuum-cleanup-index-scale-factor" xreflabel="vacuum_cleanup_index_scale_factor">
     <term><literal>vacuum_cleanup_index_scale_factor</literal>
      <indexterm>
@@ -446,9 +477,7 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
      This setting controls usage of the fast update technique described in
      <xref linkend="gin-fast-update"/>.  It is a Boolean parameter:
      <literal>ON</literal> enables fast update, <literal>OFF</literal> disables it.
-     (Alternative spellings of <literal>ON</literal> and <literal>OFF</literal> are
-     allowed as described in <xref linkend="config-setting"/>.)  The
-     default is <literal>ON</literal>.
+     The default is <literal>ON</literal>.
     </para>
 
     <note>
diff --git a/src/test/regress/expected/btree_index.out b/src/test/regress/expected/btree_index.out
index f567117a46..e32c8fa826 100644
--- a/src/test/regress/expected/btree_index.out
+++ b/src/test/regress/expected/btree_index.out
@@ -266,6 +266,22 @@ select * from btree_bpchar where f1::bpchar like 'foo%';
  fool
 (2 rows)
 
+--
+-- Perform unique checking, with and without the use of deduplication
+--
+CREATE TABLE dedup_unique_test_table (a int) WITH (autovacuum_enabled=false);
+CREATE UNIQUE INDEX dedup_unique ON dedup_unique_test_table (a) WITH (deduplication=on);
+CREATE UNIQUE INDEX plain_unique ON dedup_unique_test_table (a) WITH (deduplication=off);
+-- Generate enough garbage tuples in index to ensure that even the unique index
+-- with deduplication enabled has to check multiple leaf pages during unique
+-- checking (at least with a BLCKSZ of 8192 or less)
+DO $$
+BEGIN
+    FOR r IN 1..1350 LOOP
+        DELETE FROM dedup_unique_test_table;
+        INSERT INTO dedup_unique_test_table SELECT 1;
+    END LOOP;
+END$$;
 --
 -- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
diff --git a/src/test/regress/sql/btree_index.sql b/src/test/regress/sql/btree_index.sql
index 558dcae0ec..627ba80bc1 100644
--- a/src/test/regress/sql/btree_index.sql
+++ b/src/test/regress/sql/btree_index.sql
@@ -103,6 +103,23 @@ explain (costs off)
 select * from btree_bpchar where f1::bpchar like 'foo%';
 select * from btree_bpchar where f1::bpchar like 'foo%';
 
+--
+-- Perform unique checking, with and without the use of deduplication
+--
+CREATE TABLE dedup_unique_test_table (a int) WITH (autovacuum_enabled=false);
+CREATE UNIQUE INDEX dedup_unique ON dedup_unique_test_table (a) WITH (deduplication=on);
+CREATE UNIQUE INDEX plain_unique ON dedup_unique_test_table (a) WITH (deduplication=off);
+-- Generate enough garbage tuples in index to ensure that even the unique index
+-- with deduplication enabled has to check multiple leaf pages during unique
+-- checking (at least with a BLCKSZ of 8192 or less)
+DO $$
+BEGIN
+    FOR r IN 1..1350 LOOP
+        DELETE FROM dedup_unique_test_table;
+        INSERT INTO dedup_unique_test_table SELECT 1;
+    END LOOP;
+END$$;
+
 --
 -- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
-- 
2.17.1

v28-0002-Teach-pageinspect-about-nbtree-posting-lists.patchapplication/x-patch; name=v28-0002-Teach-pageinspect-about-nbtree-posting-lists.patchDownload

From e720ff6cc8faf9697d428074d9922bcc5c6c0f76 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v28 2/3] Teach pageinspect about nbtree posting lists.

Add a column for posting list TIDs to bt_page_items().  Also add a
column that displays a single heap TID value for each tuple, regardless
of whether or not "ctid" is used for heap TID.  In the case of posting
list tuples, the value is the lowest heap TID in the posting list.
Arguably I should have done this when commit dd299df8 went in, since
that added a pivot tuple representation that could have a heap TID but
didn't use ctid for that purpose.

Also add a boolean column that displays the LP_DEAD bit value for each
non-pivot tuple.

No version bump for the pageinspect extension, since there hasn't been a
stable release since the last version bump (see commit 58b4cb30).
---
 contrib/pageinspect/btreefuncs.c              | 116 +++++++++++++++---
 contrib/pageinspect/expected/btree.out        |   7 ++
 contrib/pageinspect/pageinspect--1.7--1.8.sql |  53 ++++++++
 doc/src/sgml/pageinspect.sgml                 |  83 +++++++------
 4 files changed, 205 insertions(+), 54 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 78cdc69ec7..337047ff9d 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -31,9 +31,11 @@
 #include "access/relation.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_am.h"
+#include "catalog/pg_type.h"
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "pageinspect.h"
+#include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
 #include "utils/varlena.h"
@@ -45,6 +47,8 @@ PG_FUNCTION_INFO_V1(bt_page_stats);
 
 #define IS_INDEX(r) ((r)->rd_rel->relkind == RELKIND_INDEX)
 #define IS_BTREE(r) ((r)->rd_rel->relam == BTREE_AM_OID)
+#define DatumGetItemPointer(X)	 ((ItemPointer) DatumGetPointer(X))
+#define ItemPointerGetDatum(X)	 PointerGetDatum(X)
 
 /* note: BlockNumber is unsigned, hence can't be negative */
 #define CHECK_RELATION_BLOCK_RANGE(rel, blkno) { \
@@ -243,6 +247,9 @@ struct user_args
 {
 	Page		page;
 	OffsetNumber offset;
+	bool		leafpage;
+	bool		rightmost;
+	TupleDesc	tupd;
 };
 
 /*-------------------------------------------------------
@@ -252,17 +259,25 @@ struct user_args
  * ------------------------------------------------------
  */
 static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, struct user_args *uargs)
 {
-	char	   *values[6];
+	Page		page = uargs->page;
+	OffsetNumber offset = uargs->offset;
+	bool		leafpage = uargs->leafpage;
+	bool		rightmost = uargs->rightmost;
+	bool		pivotoffset;
+	Datum		values[9];
+	bool		nulls[9];
 	HeapTuple	tuple;
 	ItemId		id;
 	IndexTuple	itup;
 	int			j;
 	int			off;
 	int			dlen;
-	char	   *dump;
+	char	   *dump,
+			   *datacstring;
 	char	   *ptr;
+	ItemPointer htid;
 
 	id = PageGetItemId(page, offset);
 
@@ -272,18 +287,27 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	itup = (IndexTuple) PageGetItem(page, id);
 
 	j = 0;
-	values[j++] = psprintf("%d", offset);
-	values[j++] = psprintf("(%u,%u)",
-						   ItemPointerGetBlockNumberNoCheck(&itup->t_tid),
-						   ItemPointerGetOffsetNumberNoCheck(&itup->t_tid));
-	values[j++] = psprintf("%d", (int) IndexTupleSize(itup));
-	values[j++] = psprintf("%c", IndexTupleHasNulls(itup) ? 't' : 'f');
-	values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
+	memset(nulls, 0, sizeof(nulls));
+	values[j++] = DatumGetInt16(offset);
+	values[j++] = ItemPointerGetDatum(&itup->t_tid);
+	values[j++] = Int32GetDatum((int) IndexTupleSize(itup));
+	values[j++] = BoolGetDatum(IndexTupleHasNulls(itup));
+	values[j++] = BoolGetDatum(IndexTupleHasVarwidths(itup));
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
 	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+
+	/*
+	 * Make sure that "data" column does not include posting list or pivot
+	 * tuple representation of heap TID
+	 */
+	if (BTreeTupleIsPosting(itup))
+		dlen -= IndexTupleSize(itup) - BTreeTupleGetPostingOffset(itup);
+	else if (BTreeTupleIsPivot(itup) && BTreeTupleGetHeapTID(itup) != NULL)
+		dlen -= MAXALIGN(sizeof(ItemPointerData));
+
 	dump = palloc0(dlen * 3 + 1);
-	values[j] = dump;
+	datacstring = dump;
 	for (off = 0; off < dlen; off++)
 	{
 		if (off > 0)
@@ -291,8 +315,57 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 		sprintf(dump, "%02x", *(ptr + off) & 0xff);
 		dump += 2;
 	}
+	values[j++] = CStringGetTextDatum(datacstring);
+	pfree(datacstring);
 
-	tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
+	/*
+	 * Avoid indicating that pivot tuple from !heapkeyspace index (which won't
+	 * have v4+ status bit set) is dead or has a heap TID -- that can only
+	 * happen with non-pivot tuples.  (Most backend code can use the
+	 * heapkeyspace field from the metapage to figure out which representation
+	 * to expect, but we have to be a bit creative here.)
+	 */
+	pivotoffset = (!leafpage || (!rightmost && offset == P_HIKEY));
+
+	/* LP_DEAD status bit */
+	if (!pivotoffset)
+		values[j++] = BoolGetDatum(ItemIdIsDead(id));
+	else
+		nulls[j++] = true;
+
+	htid = BTreeTupleGetHeapTID(itup);
+	if (pivotoffset && !BTreeTupleIsPivot(itup))
+		htid = NULL;
+
+	if (htid)
+		values[j++] = ItemPointerGetDatum(htid);
+	else
+		nulls[j++] = true;
+
+	if (BTreeTupleIsPosting(itup))
+	{
+		/* build an array of item pointers */
+		ItemPointer tids;
+		Datum	   *tids_datum;
+		int			nposting;
+
+		tids = BTreeTupleGetPosting(itup);
+		nposting = BTreeTupleGetNPosting(itup);
+		tids_datum = (Datum *) palloc(nposting * sizeof(Datum));
+		for (int i = 0; i < nposting; i++)
+			tids_datum[i] = ItemPointerGetDatum(&tids[i]);
+		values[j++] = PointerGetDatum(construct_array(tids_datum,
+													  nposting,
+													  TIDOID,
+													  sizeof(ItemPointerData),
+													  false, 's'));
+		pfree(tids_datum);
+	}
+	else
+		nulls[j++] = true;
+
+	/* Build and return the result tuple */
+	tuple = heap_form_tuple(uargs->tupd, values, nulls);
 
 	return HeapTupleGetDatum(tuple);
 }
@@ -378,12 +451,13 @@ bt_page_items(PG_FUNCTION_ARGS)
 			elog(NOTICE, "page is deleted");
 
 		fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+		uargs->leafpage = P_ISLEAF(opaque);
+		uargs->rightmost = P_RIGHTMOST(opaque);
 
 		/* Build a tuple descriptor for our result type */
 		if (get_call_result_type(fcinfo, NULL, &tupleDesc) != TYPEFUNC_COMPOSITE)
 			elog(ERROR, "return type must be a row type");
-
-		fctx->attinmeta = TupleDescGetAttInMetadata(tupleDesc);
+		uargs->tupd = tupleDesc;
 
 		fctx->user_fctx = uargs;
 
@@ -395,7 +469,7 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
@@ -463,12 +537,13 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 			elog(NOTICE, "page is deleted");
 
 		fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+		uargs->leafpage = P_ISLEAF(opaque);
+		uargs->rightmost = P_RIGHTMOST(opaque);
 
 		/* Build a tuple descriptor for our result type */
 		if (get_call_result_type(fcinfo, NULL, &tupleDesc) != TYPEFUNC_COMPOSITE)
 			elog(ERROR, "return type must be a row type");
-
-		fctx->attinmeta = TupleDescGetAttInMetadata(tupleDesc);
+		uargs->tupd = tupleDesc;
 
 		fctx->user_fctx = uargs;
 
@@ -480,7 +555,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
@@ -557,17 +632,20 @@ bt_metap(PG_FUNCTION_ARGS)
 
 	/*
 	 * Get values of extended metadata if available, use default values
-	 * otherwise.
+	 * otherwise.  Note that we rely on the assumption that btm_safededup is
+	 * initialized to zero on databases that were initdb'd before Postgres 13.
 	 */
 	if (metad->btm_version >= BTREE_NOVAC_VERSION)
 	{
 		values[j++] = psprintf("%u", metad->btm_oldest_btpo_xact);
 		values[j++] = psprintf("%f", metad->btm_last_cleanup_num_heap_tuples);
+		values[j++] = metad->btm_safededup ? "t" : "f";
 	}
 	else
 	{
 		values[j++] = "0";
 		values[j++] = "-1";
+		values[j++] = "f";
 	}
 
 	tuple = BuildTupleFromCStrings(TupleDescGetAttInMetadata(tupleDesc),
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..92d5c59654 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -12,6 +12,7 @@ fastroot                | 1
 fastlevel               | 0
 oldest_xact             | 0
 last_cleanup_num_tuples | -1
+safededup               | t
 
 SELECT * FROM bt_page_stats('test1_a_idx', 0);
 ERROR:  block 0 is a meta page
@@ -41,6 +42,9 @@ itemlen    | 16
 nulls      | f
 vars       | f
 data       | 01 00 00 00 00 00 00 01
+dead       | f
+htid       | (0,1)
+tids       | 
 
 SELECT * FROM bt_page_items('test1_a_idx', 2);
 ERROR:  block number out of range
@@ -54,6 +58,9 @@ itemlen    | 16
 nulls      | f
 vars       | f
 data       | 01 00 00 00 00 00 00 01
+dead       | f
+htid       | (0,1)
+tids       | 
 
 SELECT * FROM bt_page_items(get_raw_page('test1_a_idx', 2));
 ERROR:  block number 2 is out of range for relation "test1_a_idx"
diff --git a/contrib/pageinspect/pageinspect--1.7--1.8.sql b/contrib/pageinspect/pageinspect--1.7--1.8.sql
index 2a7c4b3516..93ea37cde3 100644
--- a/contrib/pageinspect/pageinspect--1.7--1.8.sql
+++ b/contrib/pageinspect/pageinspect--1.7--1.8.sql
@@ -14,3 +14,56 @@ CREATE FUNCTION heap_tuple_infomask_flags(
 RETURNS record
 AS 'MODULE_PATHNAME', 'heap_tuple_infomask_flags'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_metap()
+--
+DROP FUNCTION bt_metap(text);
+CREATE FUNCTION bt_metap(IN relname text,
+    OUT magic int4,
+    OUT version int4,
+    OUT root int4,
+    OUT level int4,
+    OUT fastroot int4,
+    OUT fastlevel int4,
+    OUT oldest_xact int4,
+    OUT last_cleanup_num_tuples real,
+    OUT safededup boolean)
+AS 'MODULE_PATHNAME', 'bt_metap'
+LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items(text, int4)
+--
+DROP FUNCTION bt_page_items(text, int4);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT dead boolean,
+    OUT htid tid,
+    OUT tids tid[])
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items(bytea)
+--
+DROP FUNCTION bt_page_items(bytea);
+CREATE FUNCTION bt_page_items(IN page bytea,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT dead boolean,
+    OUT htid tid,
+    OUT tids tid[])
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items_bytea'
+LANGUAGE C STRICT PARALLEL SAFE;
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 7e2e1487d7..b527daf6ca 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -300,13 +300,14 @@ test=# SELECT t_ctid, raw_flags, combined_flags
 test=# SELECT * FROM bt_metap('pg_cast_oid_index');
 -[ RECORD 1 ]-----------+-------
 magic                   | 340322
-version                 | 3
+version                 | 4
 root                    | 1
 level                   | 0
 fastroot                | 1
 fastlevel               | 0
 oldest_xact             | 582
 last_cleanup_num_tuples | 1000
+safededup               | f
 </screen>
      </para>
     </listitem>
@@ -329,11 +330,11 @@ test=# SELECT * FROM bt_page_stats('pg_cast_oid_index', 1);
 -[ RECORD 1 ]-+-----
 blkno         | 1
 type          | l
-live_items    | 256
+live_items    | 224
 dead_items    | 0
-avg_item_size | 12
+avg_item_size | 16
 page_size     | 8192
-free_size     | 4056
+free_size     | 3668
 btpo_prev     | 0
 btpo_next     | 0
 btpo          | 0
@@ -356,33 +357,45 @@ btpo_flags    | 3
       <function>bt_page_items</function> returns detailed information about
       all of the items on a B-tree index page.  For example:
 <screen>
-test=# SELECT * FROM bt_page_items('pg_cast_oid_index', 1);
- itemoffset |  ctid   | itemlen | nulls | vars |    data
-------------+---------+---------+-------+------+-------------
-          1 | (0,1)   |      12 | f     | f    | 23 27 00 00
-          2 | (0,2)   |      12 | f     | f    | 24 27 00 00
-          3 | (0,3)   |      12 | f     | f    | 25 27 00 00
-          4 | (0,4)   |      12 | f     | f    | 26 27 00 00
-          5 | (0,5)   |      12 | f     | f    | 27 27 00 00
-          6 | (0,6)   |      12 | f     | f    | 28 27 00 00
-          7 | (0,7)   |      12 | f     | f    | 29 27 00 00
-          8 | (0,8)   |      12 | f     | f    | 2a 27 00 00
+regression=# SELECT * FROM bt_page_items('tenk2_unique1', 5);
+ itemoffset |   ctid   | itemlen | nulls | vars |          data           | dead |   htid   | tids
+------------+----------+---------+-------+------+-------------------------+------+----------+------
+          1 | (40,1)   |      16 | f     | f    | b8 05 00 00 00 00 00 00 |      |          |
+          2 | (58,11)  |      16 | f     | f    | 4a 04 00 00 00 00 00 00 | f    | (58,11)  |
+          3 | (266,4)  |      16 | f     | f    | 4b 04 00 00 00 00 00 00 | f    | (266,4)  |
+          4 | (279,25) |      16 | f     | f    | 4c 04 00 00 00 00 00 00 | f    | (279,25) |
+          5 | (333,11) |      16 | f     | f    | 4d 04 00 00 00 00 00 00 | f    | (333,11) |
+          6 | (87,24)  |      16 | f     | f    | 4e 04 00 00 00 00 00 00 | f    | (87,24)  |
+          7 | (38,22)  |      16 | f     | f    | 4f 04 00 00 00 00 00 00 | f    | (38,22)  |
+          8 | (272,17) |      16 | f     | f    | 50 04 00 00 00 00 00 00 | f    | (272,17) |
 </screen>
-      In a B-tree leaf page, <structfield>ctid</structfield> points to a heap tuple.
-      In an internal page, the block number part of <structfield>ctid</structfield>
-      points to another page in the index itself, while the offset part
-      (the second number) is ignored and is usually 1.
+      In a B-tree leaf page, <structfield>ctid</structfield> usually
+      points to a heap tuple, and <structfield>dead</structfield> may
+      indicate that the item has its <literal>LP_DEAD</literal> bit
+      set.  In an internal page, the block number part of
+      <structfield>ctid</structfield> points to another page in the
+      index itself, while the offset part (the second number) encodes
+      metadata about the tuple.  Posting list tuples on leaf pages
+      also use <structfield>ctid</structfield> for metadata.
+      <structfield>htid</structfield> always shows a single heap TID
+      for the tuple, regardless of how it is represented (internal
+      page tuples may need to store a heap TID when there are many
+      duplicate tuples on descendent leaf pages).
+      <structfield>tids</structfield> is a list of TIDs that is stored
+      within posting list tuples (tuples created by deduplication).
      </para>
      <para>
       Note that the first item on any non-rightmost page (any page with
       a non-zero value in the <structfield>btpo_next</structfield> field) is the
       page's <quote>high key</quote>, meaning its <structfield>data</structfield>
       serves as an upper bound on all items appearing on the page, while
-      its <structfield>ctid</structfield> field is meaningless.  Also, on non-leaf
-      pages, the first real data item (the first item that is not a high
-      key) is a <quote>minus infinity</quote> item, with no actual value
-      in its <structfield>data</structfield> field.  Such an item does have a valid
-      downlink in its <structfield>ctid</structfield> field, however.
+      its <structfield>ctid</structfield> field does not point to
+      another block.  Also, on non-leaf pages, the first real data item
+      (the first item that is not a high key) is a <quote>minus
+      infinity</quote> item, with no actual value in its
+      <structfield>data</structfield> field.  Such an item does have a
+      valid downlink in its <structfield>ctid</structfield> field,
+      however.
      </para>
     </listitem>
    </varlistentry>
@@ -402,17 +415,17 @@ test=# SELECT * FROM bt_page_items('pg_cast_oid_index', 1);
       with <function>get_raw_page</function> should be passed as argument.  So
       the last example could also be rewritten like this:
 <screen>
-test=# SELECT * FROM bt_page_items(get_raw_page('pg_cast_oid_index', 1));
- itemoffset |  ctid   | itemlen | nulls | vars |    data
-------------+---------+---------+-------+------+-------------
-          1 | (0,1)   |      12 | f     | f    | 23 27 00 00
-          2 | (0,2)   |      12 | f     | f    | 24 27 00 00
-          3 | (0,3)   |      12 | f     | f    | 25 27 00 00
-          4 | (0,4)   |      12 | f     | f    | 26 27 00 00
-          5 | (0,5)   |      12 | f     | f    | 27 27 00 00
-          6 | (0,6)   |      12 | f     | f    | 28 27 00 00
-          7 | (0,7)   |      12 | f     | f    | 29 27 00 00
-          8 | (0,8)   |      12 | f     | f    | 2a 27 00 00
+regression=# SELECT * FROM bt_page_items(get_raw_page('tenk2_unique1', 5));
+ itemoffset |   ctid   | itemlen | nulls | vars |          data           | dead |   htid   | tids
+------------+----------+---------+-------+------+-------------------------+------+----------+------
+          1 | (40,1)   |      16 | f     | f    | b8 05 00 00 00 00 00 00 |      |          |
+          2 | (58,11)  |      16 | f     | f    | 4a 04 00 00 00 00 00 00 | f    | (58,11)  |
+          3 | (266,4)  |      16 | f     | f    | 4b 04 00 00 00 00 00 00 | f    | (266,4)  |
+          4 | (279,25) |      16 | f     | f    | 4c 04 00 00 00 00 00 00 | f    | (279,25) |
+          5 | (333,11) |      16 | f     | f    | 4d 04 00 00 00 00 00 00 | f    | (333,11) |
+          6 | (87,24)  |      16 | f     | f    | 4e 04 00 00 00 00 00 00 | f    | (87,24)  |
+          7 | (38,22)  |      16 | f     | f    | 4f 04 00 00 00 00 00 00 | f    | (38,22)  |
+          8 | (272,17) |      16 | f     | f    | 50 04 00 00 00 00 00 00 | f    | (272,17) |
 </screen>
       All the other details are the same as explained in the previous item.
      </para>
-- 
2.17.1

#126

Heikki Linnakangas

hlinnaka@iki.fi

about 6 years ago

In reply to: Peter Geoghegan (#125)

1 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On 04/01/2020 03:47, Peter Geoghegan wrote:

Attached is v28, which fixes bitrot from my recent commits to refactor
VACUUM-related code in nbtpage.c.

I started to read through this gigantic patch. I got about 1/3 way
through. I wrote minor comments directly in the attached patch file,
search for "HEIKKI:". I wrote them as I read the patch from beginning to
end, so it's possible that some of my questions are answered later in
the patch. I didn't have the stamina to read through the whole patch
yet, I'll continue later.

One major design question here is about the LP_DEAD tuples. There's
quite a lot of logic and heuristics and explanations related to unique
indexes. To make them behave differently from non-unique indexes, to
keep the LP_DEAD optimization effective. What if we had a separate
LP_DEAD flag for every item in a posting list, instead? I think we
wouldn't need to treat unique indexes differently from non-unique
indexes, then. I tried to search this thread to see if that had been
discussed already, but I didn't see anyone proposing that approach.

Another important decision here is the on-disk format of these tuples.
The format of IndexTuples on a b-tree page has become really
complicated. The v12 changes to store TIDs in order did a lot of that,
but this makes it even more complicated. I know there are strong
backwards-compatibility reasons for the current format, but
nevertheless, if we were to design this from scratch, what would the
B-tree page and tuple format be like?

- Heikki

Attachments:

v28-0001-Add-deduplication-to-nbtree.patch-with-heikki-commentstext/plain; charset=UTF-8; name=v28-0001-Add-deduplication-to-nbtree.patch-with-heikki-commentsDownload

From 013cda0a1edb18f4315b6ce16f1cd372188ba399 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 25 Sep 2019 10:08:53 -0700
Subject: [PATCH v28 1/3] Add deduplication to nbtree.

Deduplication reduces the storage overhead of duplicates in indexes that
use the standard nbtree index access method.  The deduplication process
is applied lazily, after the point where opportunistic deletion of
LP_DEAD-marked index tuples occurs.  Deduplication is only applied at
the point where a leaf page split will be required if deduplication
can't free up enough space.  New "posting list tuples" are formed by
merging together existing duplicate tuples.  The physical representation
of the items on an nbtree leaf page is made more space efficient by
deduplication, but the logical contents of the page are not changed.

Testing has shown that nbtree deduplication can generally make indexes
with about 10 or 15 tuples for each distinct key value about 2.5X - 4X
smaller, even with single column integer indexes (e.g., an index on a
referencing column that accompanies a foreign key).  Much larger
reductions in index size are possible in less common cases, where
individual index tuple keys happen to be large.  The final size of
single column nbtree indexes comes close to the final size of a similar
contrib/btree_gin index, at least in cases where GIN's posting list
compression isn't very effective.

HEIKKI: seems obvious that the gain can be even better. No need to sell the feature in the commit message.

The lazy approach taken by nbtree has significant advantages over a
GIN-style eager approach.  Most individual inserts of index tuples have
exactly the same overhead as before.  The extra overhead of
deduplication is amortized across insertions, just like the overhead of
page splits.  The "key space" of indexes works in the same way as it has
since commit dd299df8 (the commit which made heap TID a tiebreaker
column), since only the physical representation of tuples is changed.
Furthermore, deduplication can be turned on or off as needed, or applied    HEIKKI: When would it be needed?
selectively when required.  The split point choice logic doesn't need to
be changed, since posting list tuples are just tuples with payload, much
like tuples with non-key columns in INCLUDE indexes. (nbtsplitloc.c is
still optimized to make intelligent choices in the presence of posting
list tuples, though only because suffix truncation will routinely make
new high keys far far smaller than the non-pivot tuple they're derived
from).

Unique indexes can also make use of deduplication, though the strategy
used has significantly differences.  The high-level goal is to entirely   HEIKKI: why "entirely"? It's good to avoid pages, even if you can't eliminate them entirely, right?
prevent "unnecessary" page splits -- splits caused only by a short term
burst of index tuple versions.  This is often a concern with frequently
updated tables where UPDATEs always modify at least one indexed column
(making it impossible for the table am to use an optimization like
heapam's heap-only tuples optimization).  Deduplication in unique
indexes effectively "buys time" for existing nbtree garbage collection
mechanisms to run and prevent these page splits (the LP_DEAD bit setting
performed during the uniqueness check is the most important mechanism
for controlling bloat with affected workloads).

HEIKKI: Mention that even a unique index can have duplicates, as long as they're not visible to same snapshot. That's not immediately obvious, and if you don't realize that, deduplicating a unique index seems like an oxymoron.

HEIKKI: How do LP_DEAD work on posting list tuples?

Bump XLOG_PAGE_MAGIC because xl_btree_vacuum changed.

Author: Anastasia Lubennikova, Peter Geoghegan
Reviewed-By: Peter Geoghegan
Discussion:
    https://postgr.es/m/55E4051B.7020209@postgrespro.ru
    https://postgr.es/m/4ab6e2db-bcee-f4cf-0916-3a06e6ccbb55@postgrespro.ru
---
 src/include/access/itup.h                     |   5 +
 src/include/access/nbtree.h                   | 413 ++++++++--
 src/include/access/nbtxlog.h                  |  96 ++-
 src/include/access/rmgrlist.h                 |   2 +-
 src/backend/access/common/reloptions.c        |   9 +
 src/backend/access/index/genam.c              |   4 +
 src/backend/access/nbtree/Makefile            |   1 +
 src/backend/access/nbtree/README              | 151 +++-
 src/backend/access/nbtree/nbtdedup.c          | 738 ++++++++++++++++++
 src/backend/access/nbtree/nbtinsert.c         | 342 +++++++-
 src/backend/access/nbtree/nbtpage.c           | 227 +++++-
 src/backend/access/nbtree/nbtree.c            | 180 ++++-
 src/backend/access/nbtree/nbtsearch.c         | 271 ++++++-
 src/backend/access/nbtree/nbtsort.c           | 202 ++++-
 src/backend/access/nbtree/nbtsplitloc.c       |  39 +-
 src/backend/access/nbtree/nbtutils.c          | 224 +++++-
 src/backend/access/nbtree/nbtxlog.c           | 201 ++++-
 src/backend/access/rmgrdesc/nbtdesc.c         |  23 +-
 src/backend/storage/page/bufpage.c            |   9 +-
 src/backend/utils/misc/guc.c                  |  10 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/bin/psql/tab-complete.c                   |   4 +-
 contrib/amcheck/verify_nbtree.c               | 234 +++++-
 doc/src/sgml/btree.sgml                       | 115 ++-
 doc/src/sgml/charset.sgml                     |   9 +-
 doc/src/sgml/config.sgml                      |  25 +
 doc/src/sgml/ref/create_index.sgml            |  37 +-
 src/test/regress/expected/btree_index.out     |  16 +
 src/test/regress/sql/btree_index.sql          |  17 +
 29 files changed, 3306 insertions(+), 299 deletions(-)
 create mode 100644 src/backend/access/nbtree/nbtdedup.c

diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index b9c41d3455..223f2e7eff 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -141,6 +141,11 @@ typedef IndexAttributeBitMapData * IndexAttributeBitMap;
  * On such a page, N tuples could take one MAXALIGN quantum less space than
  * estimated here, seemingly allowing one more tuple than estimated here.
  * But such a page always has at least MAXALIGN special space, so we're safe.
+ *
+ * Note: MaxIndexTuplesPerPage is a limit on the number of tuples on a page
+ * that consume a line pointer -- "physical" tuples.  Some index AMs can store
+ * a greater number of "logical" tuples, though (e.g., btree leaf pages with
+ * posting list tuples).
  */
 #define MaxIndexTuplesPerPage	\
 	((int) ((BLCKSZ - SizeOfPageHeaderData) / \

HEIKKI: I'm not a fan of the name "logical tuples". Maybe call them TIDs or heap tuple pointers or something?

diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index f90ee3a0e0..6fae1ec079 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -24,6 +24,9 @@
 #include "storage/bufmgr.h"
 #include "storage/shm_toc.h"
 
+/* GUC parameter */
+extern bool btree_deduplication;
+

HEIKKI: Not a fan of this name. deduplicate_btree_items?

 /* There's room for a 16-bit vacuum cycle ID in BTPageOpaqueData */
 typedef uint16 BTCycleId;
 
@@ -108,4 +111,5 @@ typedef struct BTMetaPageData
 										 * pages */
 	float8		btm_last_cleanup_num_heap_tuples;	/* number of heap tuples
 													 * during last cleanup */
+	bool		btm_safededup;	/* deduplication known to be safe? */
 } BTMetaPageData;

HEIKKI: When is it not safe?

 #define BTPageGetMeta(p) \
@@ -115,7 +119,8 @@ typedef struct BTMetaPageData
 
 /*
  * The current Btree version is 4.  That's what you'll get when you create
- * a new index.
+ * a new index.  The btm_safededup field can only be set if this happened
+ * on Postgres 13, but it's safe to read with version 3 indexes.
  *

HEIKKI: Why is it safe to read on version 3 indexes? Because unused space is set to zeros?
HEIKKI: Do we need it as a separate flag, isn't it always safe with version 4 indexes, and never with version 3?

  * Btree version 3 was used in PostgreSQL v11.  It is mostly the same as
  * version 4, but heap TIDs were not part of the keyspace.  Index tuples
@@ -132,5 +137,5 @@ typedef struct BTMetaPageData
 #define BTREE_METAPAGE	0		/* first page is meta */
 #define BTREE_MAGIC		0x053162	/* magic number in metapage */
 #define BTREE_VERSION	4		/* current version number */
-#define BTREE_MIN_VERSION	2	/* minimal supported version number */
-#define BTREE_NOVAC_VERSION	3	/* minimal version with all meta fields */
+#define BTREE_MIN_VERSION	2	/* minimum supported version */
+#define BTREE_NOVAC_VERSION	3	/* version with all meta fields set */

HEIKKI: I like these comments tweaks, regardless of the rest of the patch. Commit separately?

 /*
  * Maximum size of a btree index entry, including its tuple header.
@@ -156,9 +161,30 @@ typedef struct BTMetaPageData
 				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
 				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
 
+/*
+ * MaxBTreeIndexTuplesPerPage is an upper bound on the number of "logical"
+ * tuples that may be stored on a btree leaf page.  This is comparable to
+ * the generic/physical MaxIndexTuplesPerPage upper bound.  A separate
+ * upper bound is needed in certain contexts due to posting list tuples,
+ * which only use a single physical page entry to store many logical
+ * tuples.  (MaxBTreeIndexTuplesPerPage is used to size the per-page
+ * temporary buffers used by index scans.)
+ *
+ * Note: we don't bother considering per-physical-tuple overheads here to
+ * keep things simple (value is based on how many elements a single array
+ * of heap TIDs must have to fill the space between the page header and
+ * special area).  The value is slightly higher (i.e. more conservative)
+ * than necessary as a result, which is considered acceptable.  There will
+ * only be three (very large) physical posting list tuples in leaf pages
+ * that have the largest possible number of heap TIDs/logical tuples.
+ */
+#define MaxBTreeIndexTuplesPerPage \
+	(int) ((BLCKSZ - SizeOfPageHeaderData - sizeof(BTPageOpaqueData)) / \
+		   sizeof(ItemPointerData))
+

HEIKKI: I find this name confusing. There's "IndexTuples" in the name, which makes me think of the IndexTuple struct. But this is explicitly *not* about the number of IndexTuples that fit on a page.
HEIKKI: Maybe "MaxTIDsPerBtreePage" or something ?

 /*
  * The leaf-page fillfactor defaults to 90% but is user-adjustable.
  * For pages above the leaf level, we use a fixed 70% fillfactor.
@@ -230,16 +256,15 @@ typedef struct BTMetaPageData
  * tuples (non-pivot tuples).  _bt_check_natts() enforces the rules
  * described here.
  *
- * Non-pivot tuple format:
+ * Non-pivot tuple format (plain/non-posting variant):
  *
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
- * All other types of index tuples ("pivot" tuples) only have key columns,
- * since pivot tuples only exist to represent how the key space is
+ * Non-pivot tuples complement pivot tuples, which only have key columns.     HEIKKI: What does it mean that they complement pivot tuples?
+ * The sole purpose of pivot tuples is to represent how the key space is
  * separated.  In general, any B-Tree index that has more than one level
  * (i.e. any index that does not just consist of a metapage and a single
  * leaf root page) must have some number of pivot tuples, since pivot
@@ -263,9 +288,9 @@ typedef struct BTMetaPageData
  * offset field only stores the number of columns/attributes when the
  * INDEX_ALT_TID_MASK bit is set, which doesn't count the trailing heap
  * TID column sometimes stored in pivot tuples -- that's represented by
- * the presence of BT_HEAP_TID_ATTR.  The INDEX_ALT_TID_MASK bit in t_info
- * is always set on BTREE_VERSION 4.  BT_HEAP_TID_ATTR can only be set on
- * BTREE_VERSION 4.
+ * the presence of BT_PIVOT_HEAP_TID_ATTR.  The INDEX_ALT_TID_MASK bit in    HEIKKI: That renaming seems useful, regardless of the rest of the patch
+ * t_info is always set on BTREE_VERSION 4.  BT_PIVOT_HEAP_TID_ATTR can
+ * only be set on BTREE_VERSION 4.
  *
  * In version 3 indexes, the INDEX_ALT_TID_MASK flag might not be set in
  * pivot tuples.  In that case, the number of key columns is implicitly
@@ -283,12 +308,37 @@ typedef struct BTMetaPageData
  * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
  * number of columns/attributes <= INDEX_MAX_KEYS.
  *
+ * Sometimes non-pivot tuples also use a representation that repurposes
+ * t_tid to store metadata rather than a TID.  Postgres 13 introduced a new
+ * non-pivot tuple format to support deduplication: posting list tuples.
+ * Deduplication merges together multiple equal non-pivot tuples into a
+ * logically equivalent, space efficient representation.  A posting list is
+ * an array of ItemPointerData elements.  Regular non-pivot tuples are
+ * merged together to form posting list tuples lazily, at the point where
+ * we'd otherwise have to split a leaf page.
+ *
+ * Posting tuple format (alternative non-pivot tuple representation):
+ *
+ *  t_tid | t_info | key values | posting list (TID array)
+ *
+ * Posting list tuples are recognized as such by having the
+ * INDEX_ALT_TID_MASK status bit set in t_info and the BT_IS_POSTING status
+ * bit set in t_tid.  These flags redefine the content of the posting
+ * tuple's t_tid to store an offset to the posting list, as well as the
+ * total number of posting list array elements.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items present in the tuple, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.  Like any non-pivot tuple, the number of columns stored is
+ * always implicitly the total number in the index (in practice there can
+ * never be non-key columns stored, since deduplication is not supported
+ * with INCLUDE indexes).
+ *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT

HEIKKI: I must say, this B-tree index tuple format has become really complicated. I don't like it, but I'm not sure what to do about it.
HEIKKI: I know there are backwards-compatibility reasons for why it's the way it is, but still..

 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
-#define BT_HEAP_TID_ATTR			0x1000
-
-/* Get/set downlink block number in pivot tuple */
-#define BTreeTupleGetDownLink(itup) \
-	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
-#define BTreeTupleSetDownLink(itup, blkno) \
-	ItemPointerSetBlockNumber(&((itup)->t_tid), (blkno))
+#define BT_N_POSTING_OFFSET_MASK	0x0FFF
+#define BT_PIVOT_HEAP_TID_ATTR		0x1000
+#define BT_IS_POSTING				0x2000
 
 /*
- * Get/set leaf page highkey's link. During the second phase of deletion, the
- * target leaf page's high key may point to an ancestor page (at all other
- * times, the leaf level high key's link is not used).  See the nbtree README
- * for full details.
+ * Note: BTreeTupleIsPivot() can have false negatives (but not false
+ * positives) when used with !heapkeyspace indexes
  */
-#define BTreeTupleGetTopParent(itup) \
-	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
-#define BTreeTupleSetTopParent(itup, blkno)	\
-	do { \
-		ItemPointerSetBlockNumber(&((itup)->t_tid), (blkno)); \
-		BTreeTupleSetNAtts((itup), 0); \
-	} while(0)
+static inline bool
+BTreeTupleIsPivot(IndexTuple itup)
+{
+	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+		return false;
+	/* absence of BT_IS_POSTING in offset number indicates pivot tuple */
+	if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) & BT_IS_POSTING) != 0)
+		return false;
+
+	return true;
+}
+
+static inline bool
+BTreeTupleIsPosting(IndexTuple itup)
+{
+	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+		return false;
+	/* presence of BT_IS_POSTING in offset number indicates posting tuple */
+	if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) & BT_IS_POSTING) == 0)
+		return false;
+
+	return true;
+}
+
+static inline void
+BTreeTupleSetPosting(IndexTuple itup, int nhtids, int postingoffset)
+{
+	Assert(nhtids > 1 && (nhtids & BT_N_POSTING_OFFSET_MASK) == nhtids);
+	Assert(postingoffset == MAXALIGN(postingoffset));
+	Assert(postingoffset < INDEX_SIZE_MASK);
+
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	ItemPointerSetOffsetNumber(&itup->t_tid, (nhtids | BT_IS_POSTING));
+	ItemPointerSetBlockNumber(&itup->t_tid, postingoffset);
+}
+
+static inline uint16
+BTreeTupleGetNPosting(IndexTuple posting)
+{
+	OffsetNumber existing;
+
+	Assert(BTreeTupleIsPosting(posting));
+
+	existing = ItemPointerGetOffsetNumberNoCheck(&posting->t_tid);
+	return (existing & BT_N_POSTING_OFFSET_MASK);
+}
+
+static inline uint32
+BTreeTupleGetPostingOffset(IndexTuple posting)
+{
+	Assert(BTreeTupleIsPosting(posting));
+
+	return ItemPointerGetBlockNumberNoCheck(&posting->t_tid);
+}
+
+static inline ItemPointer
+BTreeTupleGetPosting(IndexTuple posting)
+{
+	return (ItemPointer) ((char *) posting +
+						  BTreeTupleGetPostingOffset(posting));
+}
+
+static inline ItemPointer
+BTreeTupleGetPostingN(IndexTuple posting, int n)
+{
+	return BTreeTupleGetPosting(posting) + n;
+}
 
 /*
- * Get/set number of attributes within B-tree index tuple.
+ * Get/set downlink block number in pivot tuple.
+ *
+ * Note: Cannot assert that tuple is a pivot tuple.  If we did so then
+ * !heapkeyspace indexes would exhibit false positive assertion failures.
+ */
+static inline BlockNumber
+BTreeTupleGetDownLink(IndexTuple pivot)
+{
+	return ItemPointerGetBlockNumberNoCheck(&pivot->t_tid);
+}
+
+static inline void
+BTreeTupleSetDownLink(IndexTuple pivot, BlockNumber blkno)
+{
+	ItemPointerSetBlockNumber(&pivot->t_tid, blkno);
+}
+
+/*
+ * Get number of attributes within tuple.
  *
  * Note that this does not include an implicit tiebreaker heap TID
  * attribute, if any.  Note also that the number of key attributes must be
  * explicitly represented in all heapkeyspace pivot tuples.
+ *
+ * Note: This is defined at a macro rather than an inline function to    HEIKKI: typo, should be "as a macro"
+ * avoid including rel.h.
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		(BTreeTupleIsPivot(itup)) ? \
 		( \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
 		IndexRelationGetNumberOfAttributes(rel) \
 	)
-#define BTreeTupleSetNAtts(itup, n) \
-	do { \
-		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
-	} while(0)
 
 /*
- * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Set number of attributes in tuple, making it into a pivot tuple
  */
-#define BTreeTupleGetHeapTID(itup) \
-	( \
-	  (itup)->t_info & INDEX_ALT_TID_MASK && \
-	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
-	  ( \
-		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
-					   sizeof(ItemPointerData)) \
-	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
-	)
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int natts)
+{
+	Assert(natts <= INDEX_MAX_KEYS);
+
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	/* BT_IS_POSTING bit may be unset -- tuple always becomes a pivot tuple */
+	ItemPointerSetOffsetNumber(&itup->t_tid, natts);
+	Assert(BTreeTupleIsPivot(itup));
+}
+
 /*
- * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * Set the bit indicating heap TID attribute present in pivot tuple
  */
-#define BTreeTupleSetAltHeapTID(itup) \
-	do { \
-		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
-								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
-	} while(0)
+static inline void
+BTreeTupleSetAltHeapTID(IndexTuple pivot)
+{
+	OffsetNumber existing;
+
+	Assert(BTreeTupleIsPivot(pivot));
+
+	existing = ItemPointerGetOffsetNumberNoCheck(&pivot->t_tid);
+	ItemPointerSetOffsetNumber(&pivot->t_tid,
+							   existing | BT_PIVOT_HEAP_TID_ATTR);
+}
+
+/*
+ * Get/set leaf page's "top parent" link from its high key.  Used during page
+ * deletion.
+ *
+ * Note: Cannot assert that tuple is a pivot tuple.  If we did so then
+ * !heapkeyspace indexes would exhibit false positive assertion failures.
+ */
+static inline BlockNumber
+BTreeTupleGetTopParent(IndexTuple leafhikey)
+{
+	return ItemPointerGetBlockNumberNoCheck(&leafhikey->t_tid);
+}
+
+static inline void
+BTreeTupleSetTopParent(IndexTuple leafhikey, BlockNumber blkno)
+{
+	ItemPointerSetBlockNumber(&leafhikey->t_tid, blkno);
+	BTreeTupleSetNAtts(leafhikey, 0);
+}
+
+/*
+ * Get tiebreaker heap TID attribute, if any.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
+ */
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+	if (BTreeTupleIsPivot(itup))
+	{
+		/* Pivot tuple heap TID representation? */
+		if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+			 BT_PIVOT_HEAP_TID_ATTR) != 0)
+			return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+								  sizeof(ItemPointerData));
+
+		/* Heap TID attribute was truncated */
+		return NULL;
+	}
+	else if (BTreeTupleIsPosting(itup))
+		return BTreeTupleGetPosting(itup);
+
+	return &itup->t_tid;
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list tuple.
+ *
+ * Works with non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxHeapTID(IndexTuple itup)
+{
+	Assert(!BTreeTupleIsPivot(itup));
+
+	if (BTreeTupleIsPosting(itup))
+	{
+		uint16		nposting = BTreeTupleGetNPosting(itup);
+
+		return BTreeTupleGetPostingN(itup, nposting - 1);
+	}
+
+	return &itup->t_tid;
+}
 
 /*
  *	Operator strategy numbers for B-tree have been moved to access/stratnum.h,
@@ -435,6 +625,11 @@ typedef BTStackData *BTStack;
  * indexes whose version is >= version 4.  It's convenient to keep this close
  * by, rather than accessing the metapage repeatedly.
  *
+ * safededup is set to indicate that index may use dynamic deduplication
+ * safely (index storage parameter separately indicates if deduplication is   HEIKKI: Is there really an "index storage parameter" for that? What is that, something in the WITH clause?
+ * currently in use).  This is also a property of the index relation rather
+ * than an indexscan that is kept around for convenience.
+ *
  * anynullkeys indicates if any of the keys had NULL value when scankey was
  * built from index tuple (note that already-truncated tuple key attributes
  * set NULL as a placeholder key value, which also affects value of
@@ -470,6 +665,7 @@ typedef BTStackData *BTStack;
 typedef struct BTScanInsertData
 {
 	bool		heapkeyspace;
+	bool		safededup;
 	bool		anynullkeys;
 	bool		nextkey;
 	bool		pivotsearch;
@@ -508,10 +704,60 @@ typedef struct BTInsertStateData
 	bool		bounds_valid;
 	OffsetNumber low;
 	OffsetNumber stricthigh;
+
+	/*
+	 * if _bt_binsrch_insert found the location inside existing posting list,
+	 * save the position inside the list.  -1 sentinel value indicates overlap
+	 * with an existing posting list tuple that has its LP_DEAD bit set.
+	 */
+	int			postingoff;
 } BTInsertStateData;
 
 typedef BTInsertStateData *BTInsertState;
 
+/*
+ * BTDedupStateData is a working area used during deduplication.
+ *
+ * The status info fields track the state of a whole-page deduplication pass.
+ * State about the current pending posting list is also tracked.
+ *
+ * A pending posting list is comprised of a contiguous group of equal physical
+ * items from the page, starting from page offset number 'baseoff'.  This is
+ * the offset number of the "base" tuple for new posting list.  'nitems' is
+ * the current total number of existing items from the page that will be
+ * merged to make a new posting list tuple, including the base tuple item.
+ * (Existing physical items may themselves be posting list tuples, or regular
+ * non-pivot tuples.)
+ *
+ * Note that when deduplication merges together existing physical tuples, the
+ * page is modified eagerly.  This makes tracking the details of more than a
+ * single pending posting list at a time unnecessary.  The total size of the
+ * existing tuples to be freed when pending posting list is processed gets
+ * tracked by 'phystupsize'.  This information allows deduplication to
+ * calculate the space saving for each new posting list tuple, and for the
+ * entire pass over the page as a whole.
+ */
+typedef struct BTDedupStateData
+{
+	/* Deduplication status info for entire pass over page */
+	Size		maxitemsize;	/* Limit on size of final tuple */
+	bool		checkingunique; /* Use unique index strategy? */
+	OffsetNumber skippedbase;	/* First offset skipped by checkingunique */
+
+	/* Metadata about base tuple of current pending posting list */
+	IndexTuple	base;			/* Use to form new posting list */
+	OffsetNumber baseoff;		/* page offset of base */
+	Size		basetupsize;	/* base size without posting list */
+
+	/* Other metadata about pending posting list */
+	ItemPointer htids;			/* Heap TIDs in pending posting list */
+	int			nhtids;			/* Number of heap TIDs in nhtids array */
+	int			nitems;			/* Number of existing physical tuples */
+	Size		phystupsize;	/* Includes line pointer overhead */
+} BTDedupStateData;
+
+typedef BTDedupStateData *BTDedupState;
+
 /*
  * BTScanOpaqueData is the btree-private state needed for an indexscan.
  * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
@@ -535,7 +781,10 @@ typedef BTInsertStateData *BTInsertState;
  * If we are doing an index-only scan, we save the entire IndexTuple for each
  * matched item, otherwise only its heap TID and offset.  The IndexTuples go
  * into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array.  Posting list tuples store a "base" tuple once,
+ * allowing the same key to be returned for each logical tuple associated
+ * with the physical posting list tuple (i.e. for each TID from the posting
+ * list).
  */
 
 typedef struct BTScanPosItem	/* what we remember about each match */
@@ -579,5 +828,5 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxBTreeIndexTuplesPerPage];	/* MUST BE LAST */
 } BTScanPosData;

HEIKKI: How much memory does this need now? Should we consider pallocing this separately?

 typedef BTScanPosData *BTScanPos;
@@ -687,6 +936,7 @@ typedef struct BTOptions
 	int			fillfactor;		/* page fill factor in percent (0..100) */
 	/* fraction of newly inserted tuples prior to trigger index cleanup */
 	float8		vacuum_cleanup_index_scale_factor;
+	bool		deduplication;	/* Use deduplication where safe? */
 } BTOptions;
 
 #define BTGetFillFactor(relation) \
@@ -695,8 +945,16 @@ typedef struct BTOptions
 	 (relation)->rd_options ? \
 	 ((BTOptions *) (relation)->rd_options)->fillfactor : \
 	 BTREE_DEFAULT_FILLFACTOR)
+#define BTGetUseDedup(relation) \
+	(AssertMacro(relation->rd_rel->relkind == RELKIND_INDEX && \
+				 relation->rd_rel->relam == BTREE_AM_OID), \
+	((relation)->rd_options ? \
+	 ((BTOptions *) (relation)->rd_options)->deduplication : \
+	 BTGetUseDedupGUC(relation)))
 #define BTGetTargetPageFreeSpace(relation) \
 	(BLCKSZ * (100 - BTGetFillFactor(relation)) / 100)
+#define BTGetUseDedupGUC(relation) \
+	(relation->rd_index->indisunique || btree_deduplication)
 
 /*
  * Constant definition for progress reporting.  Phase numbers must match
@@ -743,6 +1001,22 @@ extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
 extern void _bt_parallel_done(IndexScanDesc scan);
 extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
 
+/*
+ * prototypes for functions in nbtdedup.c
+ */
+extern void _bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
+							   IndexTuple newitem, Size newitemsz,
+							   bool checkingunique);
+extern void _bt_dedup_start_pending(BTDedupState state, IndexTuple base,
+									OffsetNumber base_off);
+extern bool _bt_dedup_save_htid(BTDedupState state, IndexTuple itup);
+extern Size _bt_dedup_finish_pending(Buffer buf, BTDedupState state,
+									 bool logged);
+extern IndexTuple _bt_form_posting(IndexTuple base, ItemPointer htids,
+								   int nhtids);
+extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
+								   int postingoff);
+
 /*
  * prototypes for functions in nbtinsert.c
  */
@@ -761,14 +1035,16 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
 /*
  * prototypes for functions in nbtpage.c
  */
-extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level);
+extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+							 bool safededup);
 extern void _bt_update_meta_cleanup_info(Relation rel,
 										 TransactionId oldestBtpoXact, float8 numHeapTuples);
 extern void _bt_upgrademetapage(Page page);
 extern Buffer _bt_getroot(Relation rel, int access);
 extern Buffer _bt_gettrueroot(Relation rel);
 extern int	_bt_getrootheight(Relation rel);
-extern bool _bt_heapkeyspace(Relation rel);
+extern void _bt_metaversion(Relation rel, bool *heapkeyspace,
+							bool *safededup);
 extern void _bt_checkpage(Relation rel, Buffer buf);
 extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
 extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -777,7 +1053,9 @@ extern void _bt_relbuf(Relation rel, Buffer buf);
 extern void _bt_pageinit(Page page, Size size);
 extern bool _bt_page_recyclable(Page page);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
-								OffsetNumber *deletable, int ndeletable);
+								OffsetNumber *deletable, int ndeletable,
+								OffsetNumber *updatable, IndexTuple *updated,
+								int nupdatable);
 extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *deletable, int ndeletable,
 								Relation heapRel);
@@ -830,6 +1108,7 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern bool _bt_opclasses_support_dedup(Relation index);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 776a9bd723..de855efbba 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP_PAGE	0x50	/* deduplicate tuples on leaf page */
+#define XLOG_BTREE_INSERT_POST	0x60	/* add index tuple with posting split */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -53,19 +54,32 @@ typedef struct xl_btree_metadata
 	uint32		fastlevel;
 	TransactionId oldest_btpo_xact;
 	float8		last_cleanup_num_heap_tuples;
+	bool		safededup;
 } xl_btree_metadata;
 
 /*
  * This is what we need to know about simple (without split) insert.
  *
- * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
- * Note that INSERT_META implies it's not a leaf page.
+ * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META, and
+ * INSERT_POST.  Note that INSERT_META and INSERT_UPPER implies it's not a
+ * leaf page, while INSERT_POST and INSERT_LEAF imply that it must be a leaf
+ * page.
  *
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page
  * Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
  * Backup Blk 2: xl_btree_metadata, if INSERT_META
+ *
+ * Note: The new tuple is actually the "original" new item in the posting
+ * list split insert case (i.e. the INSERT_POST case).  A split offset for
+ * the posting list is logged before the original new item.  Recovery needs
+ * both, since it must do an in-place update of the existing posting list
+ * that was split as an extra step.  Also, recovery generates a "final"
+ * newitem.  See _bt_swap_posting() for details on posting list splits.
  */
 typedef struct xl_btree_insert
 {
 	OffsetNumber offnum;
+
+	/* POSTING SPLIT OFFSET FOLLOWS (INSERT_POST case) */
+	/* NEW TUPLE ALWAYS FOLLOWS AT THE END */
 } xl_btree_insert;

HEIKKI: Would it be more clear to have a separate struct for the posting list split case?

 #define SizeOfBtreeInsert	(offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
@@ -92,8 +106,37 @@ typedef struct xl_btree_insert
  * Backup Blk 0: original page / new left page
  *
  * The left page's data portion contains the new item, if it's the _L variant.
- * An IndexTuple representing the high key of the left page must follow with
- * either variant.
+ * _R variant split records generally do not have a newitem (_R variant leaf
+ * page split records that must deal with a posting list split will include an
+ * explicit newitem, though it is never used on the right page -- it is
+ * actually an orignewitem needed to update existing posting list).  The new
+ * high key of the left/original page appears last of all (and must always be
+ * present).
+ *
+ * Page split records that need the REDO routine to deal with a posting list
+ * split directly will have an explicit newitem, which is actually an
+ * orignewitem (the newitem as it was before the posting list split, not
+ * after).  A posting list split always has a newitem that comes immediately
+ * after the posting list being split (which would have overlapped with
+ * orignewitem prior to split).  Usually REDO must deal with posting list
+ * splits with an _L variant page split record, and usually both the new
+ * posting list and the final newitem go on the left page (the existing
+ * posting list will be inserted instead of the old, and the final newitem
+ * will be inserted next to that).  However, _R variant split records will
+ * include an orignewitem when the split point for the page happens to have a
+ * lastleft tuple that is also the posting list being split (leaving newitem
+ * as the page split's firstright tuple).  The existence of this corner case
+ * does not change the basic fact about newitem/orignewitem for the REDO
+ * routine: it is always state used for the left page alone.  (This is why the
+ * record's postingoff field isn't a reliable indicator of whether or not a
+ * posting list split occurred during the page split; a non-zero value merely
+ * indicates that the REDO routine must reconstruct a new posting list tuple
+ * that is needed for the left page.)
+ *
+ * This posting list split handling is equivalent to the xl_btree_insert REDO
+ * routine's INSERT_POST handling.  While the details are more complicated
+ * here, the concept and goals are exactly the same.  See _bt_swap_posting()
+ * for details on posting list splits.
  *
  * Backup Blk 1: new right page
  *
@@ -111,7 +154,23 @@ typedef struct xl_btree_split
 {
 	uint32		level;			/* tree level of page being split */
 	OffsetNumber firstright;	/* first item moved to right page */
-	OffsetNumber newitemoff;	/* new item's offset (useful for _L variant) */
+	OffsetNumber newitemoff;	/* new item's offset */
+	uint16		postingoff;		/* offset inside orig posting tuple */
 } xl_btree_split;
 
-#define SizeOfBtreeSplit	(offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit	(offsetof(xl_btree_split, postingoff) + sizeof(uint16))
+
+/*
+ * When page is deduplicated, consecutive groups of tuples with equal keys are
+ * merged together into posting list tuples.
+ *
+ * The WAL record represents the interval that describes the posing tuple    HEIKKI: typo: "posing tuple"
+ * that should be added to the page.
+ */
+typedef struct xl_btree_dedup
+{
+	OffsetNumber baseoff;
+	uint16		nitems;
+} xl_btree_dedup;
+
+#define SizeOfBtreeDedup 	(offsetof(xl_btree_dedup, nitems) + sizeof(uint16))

HEIKKI: Do we only generate one posting list in one WAL record? I would assume it's better to deduplicate everything on the page, since we're modifying it anyway.

 /*
  * This is what we need to know about delete of individual leaf index tuples.
  * The WAL record can represent deletion of any number of index tuples on a
- * single index page when *not* executed by VACUUM.
+ * single index page when *not* executed by VACUUM.  Deletion of a subset of
+ * "logical" tuples within a posting list tuple is not supported.
  *
  * Backup Blk 0: index page
  */
@@ -152,14 +212,18 @@ typedef struct xl_btree_reuse_page
 /*
  * This is what we need to know about vacuum of individual leaf index tuples.
  * The WAL record can represent deletion of any number of index tuples on a
- * single index page when executed by VACUUM.
- *
- * Note that the WAL record in any vacuum of an index must have at least one
- * item to delete.
+ * single index page when executed by VACUUM.  It can also support "updates"
+ * of index tuples, which are how deletes of "logical" tuples contained in an
+ * existing posting list tuple are implemented. (Updates are only used when
+ * there will be some remaining logical tuples once VACUUM finishes; otherwise
+ * the physical posting list tuple can just be deleted).
  */
 typedef struct xl_btree_vacuum
 {
-	uint32		ndeleted;
+	uint16		ndeleted;
+	uint16		nupdated;
 
 	/* DELETED TARGET OFFSET NUMBERS FOLLOW */
+	/* UPDATED TARGET OFFSET NUMBERS FOLLOW */
+	/* UPDATED TUPLES FOR OVERWRITES FOLLOW */
 } xl_btree_vacuum;

HEIKKI: Does this store a whole copy of the remaining posting list on an updated tuple? Wouldn't it be simpler and more space-efficient to store just the deleted TIDs?

-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, nupdated) + sizeof(uint16))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.
@@ -245,6 +309,8 @@ typedef struct xl_btree_newroot
 extern void btree_redo(XLogReaderState *record);
 extern void btree_desc(StringInfo buf, XLogReaderState *record);
 extern const char *btree_identify(uint8 info);
+extern void btree_xlog_startup(void);
+extern void btree_xlog_cleanup(void);
 extern void btree_mask(char *pagedata, BlockNumber blkno);
 
 #endif							/* NBTXLOG_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index c88dccfb8d..6c15df7e70 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -36,7 +36,7 @@ PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL,
 PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
 PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
 PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask)
 PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask)
 PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
 PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 79430d2b7b..18b1bf5e20 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -158,6 +158,15 @@ static relopt_bool boolRelOpts[] =
 		},
 		true
 	},
+	{
+		{
+			"deduplication",
+			"Enables deduplication on btree index leaf pages",
+			RELOPT_KIND_BTREE,
+			ShareUpdateExclusiveLock
+		},
+		true
+	},
 	/* list terminator */
 	{{NULL}}
 };
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index c16eb05416..dfba5ae39a 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -276,6 +276,10 @@ BuildIndexValueDescription(Relation indexRelation,
 /*
  * Get the latestRemovedXid from the table entries pointed at by the index
  * tuples being deleted.
+ *
+ * Note: index access methods that don't consistently use the standard
+ * IndexTuple + heap TID item pointer representation will need to provide
+ * their own version of this function.
  */
 TransactionId
 index_compute_xid_horizon_for_tuples(Relation irel,
diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index bf245f5dab..d69808e78c 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = \
 	nbtcompare.o \
+	nbtdedup.o \
 	nbtinsert.o \
 	nbtpage.o \
 	nbtree.o \
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index c60a4d0d9e..e235750597 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,4 +432,7 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
 like a hint bit for a heap tuple), but physically removing tuples requires
 exclusive lock.  In the current code we try to remove LP_DEAD tuples when
 we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already).  Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)

HEIKKI: Do we ever do that? Do we ever set the LP_DEAD bit on a posting list tuple?
 
 This leaves the index in a state where it has no entry for a dead tuple
 that still exists in the heap.  This is not a problem for the current
@@ -726,9 +729,155 @@ if it must.  When a page that's already full of duplicates must be split,
 the fallback strategy assumes that duplicates are mostly inserted in
 ascending heap TID order.  The page is split in a way that leaves the left
 half of the page mostly full, and the right half of the page mostly empty.
+The overall effect is that leaf page splits gracefully adapt to inserts of
+large groups of duplicates, maximizing space utilization.  Note also that
+"trapping" large groups of duplicates on the same leaf page like this makes
+deduplication more efficient.  Deduplication can be performed infrequently,
+while freeing just as much space.

HEIKKI: I don't understand what the last sentence means. Just as much space as what?

+
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid (or at least delay) page splits.  Note that the
+goals for deduplication in unique indexes are rather different; see later
+section for details.  Deduplication alters the physical representation of
+tuples without changing the logical contents of the index, and without
+adding overhead to read queries.  Non-pivot tuples are merged together
+into a single physical tuple with a posting list (a simple array of heap
+TIDs with the standard item pointer format).  Deduplication is always
+applied lazily, at the point where it would otherwise be necessary to
+perform a page split.  It occurs only when LP_DEAD items have been
+removed, as our last line of defense against splitting a leaf page.  We
+can set the LP_DEAD bit with posting list tuples, though only when all
+TIDs are known dead.
+
+Our lazy approach to deduplication allows the page space accounting used
+during page splits to have absolutely minimal special case logic for
+posting lists.  Posting lists can be thought of as extra payload that
+suffix truncation will reliably truncate away as needed during page
+splits, just like non-key columns from an INCLUDE index tuple.
+Incoming/new tuples can generally be treated as non-overlapping plain
+items (though see section on posting list splits for information about how
+overlapping new/incoming items are really handled).
+
+The representation of posting lists is almost identical to the posting
+lists used by GIN, so it would be straightforward to apply GIN's varbyte
+encoding compression scheme to individual posting lists.  Posting list
+compression would break the assumptions made by posting list splits about
+page space accounting (see later section), so it's not clear how
+compression could be integrated with nbtree.  Besides, posting list
+compression does not offer a compelling trade-off for nbtree, since in
+general nbtree is optimized for consistent performance with many
+concurrent readers and writers.

HEIKKI: Well, it's optimized for that today, but if it was compressed, a btree would be useful in more situations...

+
+A major goal of our lazy approach to deduplication is to limit the
+performance impact of deduplication with random updates.  Even concurrent
+append-only inserts of the same key value will tend to have inserts of
+individual index tuples in an order that doesn't quite match heap TID
+order.  Delaying deduplication minimizes page level fragmentation.
+
+Deduplication in unique indexes
+-------------------------------
+
+Very often, the range of values that can be placed on a given leaf page in
+a unique index is fixed and permanent.  For example, a primary key on an
+identity column will usually only have page splits caused by the insertion
+of new logical rows within the rightmost leaf page.  If there is a split
+of a non-rightmost leaf page, then the split must have been triggered by
+inserts associated with an UPDATE of an existing logical row.  Splitting a
+leaf page purely to store multiple versions should be considered
+pathological, since it permanently degrades the index structure in order
+to absorb a temporary burst of duplicates.  Deduplication in unique
+indexes helps to prevent these pathological page splits.
+
+Like all index access methods, nbtree does not have direct knowledge of
+versioning or of MVCC; it deals only with physical tuples.  However, unique
+indexes implicitly give nbtree basic information about tuple versioning,
+since by definition zero or one tuples of any given key value can be
+visible to any possible MVCC snapshot (excluding index entries with NULL
+values).  When optimizations such as heapam's Heap-only tuples (HOT) happen
+to be ineffective, nbtree's on-the-fly deletion of tuples in unique indexes
+can be very important with UPDATE-heavy workloads.  Unique checking's
+LP_DEAD bit setting reliably attempts to kill old, equal index tuple
+versions.  This prevents (or at least delays) page splits that are
+necessary only because a leaf page must contain multiple physical tuples
+for the same logical row.  Deduplication in unique indexes must cooperate
+with this mechanism.  Deleting items on the page is always preferable to
+deduplication.
+
+The strategy used during a deduplication pass has significant differences
+to the strategy used for indexes that can have multiple logical rows with
+the same key value.  We're not really trying to store duplicates in a
+space efficient manner, since in the long run there won't be any
+duplicates anyway.  Rather, we're buying time for garbage collection
+mechanisms to run before a page split is needed.
+
+Unique index leaf pages only get a deduplication pass when an insertion
+(that might have to split the page) observed an existing duplicate on the
+page in passing.  This is based on the assumption that deduplication will
+only work out when _all_ new insertions are duplicates from UPDATEs.  This
+may mean that we miss an opportunity to delay a page split, but that's
+okay because our ultimate goal is to delay leaf page splits _indefinitely_
+(i.e. to prevent them altogether).  There is little point in trying to
+delay a split that is probably inevitable anyway.  This allows us to avoid
+the overhead of attempting to deduplicate with unique indexes that always
+have few or no duplicates.
+
+Posting list splits
+-------------------
+
+When the incoming tuple happens to overlap with an existing posting list,
+a posting list split is performed.  Like a page split, a posting list
+split resolves the situation where a new/incoming item "won't fit", while
+inserting the incoming item in passing (i.e. as part of the same atomic
+action).  It's possible (though not particularly likely) that an insert of
+a new item on to an almost-full page will overlap with a posting list,
+resulting in both a posting list split and a page split.  Even then, the
+atomic action that splits the posting list also inserts the new item
+(since page splits always insert the new item in passing).  Including the
+posting list split in the same atomic action as the insert avoids problems
+caused by concurrent inserts into the same posting list --  the exact
+details of how we change the posting list depend upon the new item, and
+vice-versa.  A single atomic action also minimizes the volume of extra
+WAL required for a posting list split, since we don't have to explicitly
+WAL-log the original posting list tuple.
+
+Despite piggy-backing on the same atomic action that inserts a new tuple,
+posting list splits can be thought of as a separate, extra action to the
+insert itself (or to the page split itself).  Posting list splits
+conceptually "rewrite" an insert that overlaps with an existing posting
+list into an insert that adds its final new item just to the right of
+posting list instead.  The size of the posting list won't change, and so
+page space accounting code does not need to care about posting list splits
+at all.  This is an important upside of our design; the page split point
+choice logic is very subtle even without it needing to deal with posting
+list splits.
+
+Only a few isolated extra steps are required to preserve the illusion that
+the new item never overlapped with an existing posting list in the first
+place: the heap TID of the incoming tuple is swapped with the rightmost
+heap TID from the existing/originally overlapping posting list.  Also, the
+posting-split-with-page-split case must generate a new high key based on
+an imaginary version of the original page that has both the final new item
+and the after-list-split posting tuple (page splits usually just operate
+against an imaginary version that contains the new item/item that won't
+fit).
+
+This approach avoids inventing an "eager" atomic posting split operation
+that splits the posting list without simultaneously finishing the insert
+of the incoming item.  This alternative design might seem cleaner, but it
+creates subtle problems for page space accounting.  In general, there
+might not be enough free space on the page to split a posting list such
+that the incoming/new item no longer overlaps with either posting list
+half --- the operation could fail before the actual retail insert of the
+new item even begins.  We'd end up having to handle posting list splits
+that need a page split anyway.  Besides, supporting variable "split points"
+while splitting posting lists won't actually improve overall space
+utilization.
 
 Notes About Data Representation
 -------------------------------
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
new file mode 100644
index 0000000000..d9f4e9db38
--- /dev/null
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -0,0 +1,738 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtdedup.c
+ *	  Deduplicate items in Postgres btrees.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/nbtree/nbtdedup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/nbtxlog.h"
+#include "miscadmin.h"
+#include "utils/rel.h"
+
+
+/*
+ * Try to deduplicate items to free at least enough space to avoid a page
+ * split.  This function should be called during insertion, only after LP_DEAD
+ * items were removed by _bt_vacuum_one_page() to prevent a page split.
+ * (We'll have to kill LP_DEAD items here when the page's BTP_HAS_GARBAGE hint
+ * was not set, but that should be rare.)
+ *
+ * The strategy for !checkingunique callers is to perform as much
+ * deduplication as possible to free as much space as possible now, since
+ * making it harder to set LP_DEAD bits is considered an acceptable price for
+ * not having to deduplicate the same page many times.  It is unlikely that
+ * the items on the page will have their LP_DEAD bit set in the future, since
+ * that hasn't happened before now (besides, entire posting lists can still
+ * have their LP_DEAD bit set).
+ *
+ * The strategy for checkingunique callers is completely different.
+ * Deduplication works in tandem with garbage collection, especially the
+ * LP_DEAD bit setting that takes place in _bt_check_unique().  We give up as
+ * soon as it becomes clear that enough space has been made available to
+ * insert newitem without needing to split the page.  Also, we merge together
+ * larger groups of duplicate tuples first (merging together two index tuples
+ * usually saves very little space), and avoid merging together existing
+ * posting list tuples.  The goal is to generate posting lists with TIDs that
+ * are "close together in time", in order to maximize the chances of an
+ * LP_DEAD bit being set opportunistically.  See nbtree/README for more
+ * information on deduplication within unique indexes.
+ *
+ * nbtinsert.c caller should call _bt_vacuum_one_page() before calling here.
+ * Note that this routine will delete all items on the page that have their
+ * LP_DEAD bit set, even when page's BTP_HAS_GARBAGE bit is not set (a rare
+ * edge case).  Caller can rely on that to avoid inserting a new tuple that
+ * happens to overlap with an existing posting list tuple with its LP_DEAD bit
+ * set. (Calling here with a newitemsz of 0 will reliably delete the existing
+ * item, making it possible to avoid unsetting the LP_DEAD bit just to insert
+ * the new item.  In general, posting list splits should never have to deal
+ * with a posting list tuple with its LP_DEAD bit set.)
+ *
+ * Note: If newitem contains NULL values in key attributes, caller will be
+ * !checkingunique even when rel is a unique index.  The page in question will
+ * usually have many existing items with NULLs.
+ */
+void
+_bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
+				   IndexTuple newitem, Size newitemsz, bool checkingunique)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buf);
+	BTPageOpaque opaque;
+	OffsetNumber deletable[MaxIndexTuplesPerPage];
+	BTDedupState state;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+	bool		minimal = checkingunique;
+	int			ndeletable = 0;
+	Size		pagesaving = 0;
+	int			count = 0;
+	bool		singlevalue = false;
+
+	/*
+	 * Caller should call _bt_vacuum_one_page() before calling here when it
+	 * looked like there were LP_DEAD items on the page.  However, we can't
+	 * assume that there are no LP_DEAD items (for one thing, VACUUM will
+	 * clear the BTP_HAS_GARBAGE hint without reliably removing items that are
+	 * marked LP_DEAD).  We must be careful to clear all LP_DEAD items because
+	 * posting list splits cannot go ahead if an existing posting list item
+	 * has its LP_DEAD bit set. (Also, we don't want to unnecessarily unset
+	 * LP_DEAD bits when deduplicating items on the page below, though that
+	 * should be harmless.)
+	 *
+	 * The opposite problem is also possible: _bt_vacuum_one_page() won't
+	 * clear the BTP_HAS_GARBAGE bit when it is falsely set (i.e. when there
+	 * are no LP_DEAD bits).  This probably doesn't matter in practice, since
+	 * it's only a hint, and VACUUM will clear it at some point anyway.  Even
+	 * still, we clear the BTP_HAS_GARBAGE hint reliably here. (Seems like a
+	 * good idea for deduplication to only begin when we unambiguously have no
+	 * LP_DEAD items.)
+	 */
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	minoff = P_FIRSTDATAKEY(opaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+
+		if (ItemIdIsDead(itemid))
+			deletable[ndeletable++] = offnum;
+	}
+
+	if (ndeletable > 0)
+	{
+		_bt_delitems_delete(rel, buf, deletable, ndeletable, heapRel);
+
+		/*
+		 * Return when a split will be avoided.  This is equivalent to
+		 * avoiding a split by following the usual _bt_vacuum_one_page() path.
+		 */
+		if (PageGetFreeSpace(page) >= newitemsz)
+			return;
+
+		/*
+		 * Reconsider number of items on page, in case _bt_delitems_delete()
+		 * managed to delete an item or two
+		 */
+		minoff = P_FIRSTDATAKEY(opaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+	}
+	else if (P_HAS_GARBAGE(opaque))
+	{
+		opaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+		MarkBufferDirtyHint(buf, true);
+	}
+
+	/*
+	 * Return early in case where caller just wants us to kill an existing
+	 * LP_DEAD posting list tuple
+	 */
+	Assert(!P_HAS_GARBAGE(opaque));
+	if (newitemsz == 0)
+		return;
+
+	/*
+	 * By here, it's clear that deduplication will definitely be attempted.
+	 * Initialize deduplication state.
+	 */
+	state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+	state->maxitemsize = Min(BTMaxItemSize(page), INDEX_SIZE_MASK);
+	state->checkingunique = checkingunique;
+	state->skippedbase = InvalidOffsetNumber;
+	/* Metadata about base tuple of current pending posting list */
+	state->base = NULL;
+	state->baseoff = InvalidOffsetNumber;
+	state->basetupsize = 0;
+	/* Metadata about current pending posting list TIDs */
+	state->htids = NULL;
+	state->nhtids = 0;
+	state->nitems = 0;
+	/* Size of all physical tuples to be replaced by pending posting list */
+	state->phystupsize = 0;
+
+	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+	newitemsz += sizeof(ItemIdData);
+	/* Conservatively size array */
+	state->htids = palloc(state->maxitemsize);
+
+	/*
+	 * Determine if a "single value" strategy page split is likely to occur
+	 * shortly after deduplication finishes.  It should be possible for the
+	 * single value split to find a split point that packs the left half of
+	 * the split BTREE_SINGLEVAL_FILLFACTOR% full.
+	 */
+	if (!checkingunique)
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+
+		itemid = PageGetItemId(page, minoff);
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		if (_bt_keep_natts_fast(rel, newitem, itup) > natts)
+		{
+			itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
+			itup = (IndexTuple) PageGetItem(page, itemid);
+
+			/*
+			 * Use different strategy if future page split likely to need to
+			 * use "single value" strategy
+			 */
+			if (_bt_keep_natts_fast(rel, newitem, itup) > natts)
+				singlevalue = true;
+		}
+	}
+
+	/*
+	 * Iterate over tuples on the page, try to deduplicate them into posting
+	 * lists and insert into new page.  NOTE: We must reassess the max offset
+	 * on each iteration, since the number of items on the page goes down as
+	 * existing items are deduplicated.
+	 */
+	offnum = minoff;
+retry:
+	while (offnum <= PageGetMaxOffsetNumber(page))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(!ItemIdIsDead(itemid));
+
+		if (state->nitems == 0)
+		{
+			/*
+			 * No previous/base tuple for the data item -- use the data item
+			 * as base tuple of pending posting list
+			 */
+			_bt_dedup_start_pending(state, itup, offnum);
+		}
+		else if (_bt_keep_natts_fast(rel, state->base, itup) > natts &&
+				 _bt_dedup_save_htid(state, itup))
+		{
+			/*
+			 * Tuple is equal to base tuple of pending posting list.  Heap
+			 * TID(s) for itup have been saved in state.  The next iteration
+			 * will also end up here if it's possible to merge the next tuple
+			 * into the same pending posting list.
+			 */
+		}
+		else
+		{
+			/*
+			 * Tuple is not equal to pending posting list tuple, or
+			 * _bt_dedup_save_htid() opted to not merge current item into
+			 * pending posting list for some other reason (e.g., adding more
+			 * TIDs would have caused posting list to exceed BTMaxItemSize()
+			 * limit).
+			 *
+			 * If state contains pending posting list with more than one item,
+			 * form new posting tuple, and update the page.  Otherwise, reset
+			 * the state and move on.
+			 */
+			pagesaving += _bt_dedup_finish_pending(buf, state,
+												   RelationNeedsWAL(rel));
+			count++;
+
+			/*
+			 * When caller is a checkingunique caller and we have deduplicated
+			 * enough to avoid a page split, do minimal deduplication in case
+			 * the remaining items are about to be marked dead within
+			 * _bt_check_unique().
+			 */
+			if (minimal && pagesaving >= newitemsz)
+				break;
+
+			/*
+			 * Consider special steps when a future page split of the leaf
+			 * page is likely to occur using nbtsplitloc.c's "single value"
+			 * strategy
+			 */
+			if (singlevalue)
+			{
+				/*
+				 * Adjust maxitemsize so that there isn't a third and final
+				 * 1/3 of a page width tuple that fills the page to capacity.
+				 * The third tuple produced should be smaller than the first
+				 * two by an amount equal to the free space that nbtsplitloc.c
+				 * is likely to want to leave behind when the page it split.
+				 * When there are 3 posting lists on the page, then we end
+				 * deduplication.  Remaining tuples on the page can be
+				 * deduplicated later, when they're on the new right sibling
+				 * of this page, and the new sibling page needs to be split in
+				 * turn.
+				 *
+				 * Note that it doesn't matter if there are items on the page
+				 * that were already 1/3 of a page during current pass;
+				 * they'll still count as the first two posting list tuples.
+				 */
+				if (count == 2)
+				{
+					Size		leftfree;
+
+					/* This calculation needs to match nbtsplitloc.c */
+					leftfree = PageGetPageSize(page) - SizeOfPageHeaderData -
+						MAXALIGN(sizeof(BTPageOpaqueData));
+					/* Subtract predicted size of new high key */
+					leftfree -= newitemsz + MAXALIGN(sizeof(ItemPointerData));
+
+					/*
+					 * Reduce maxitemsize by an amount equal to target free
+					 * space on left half of page
+					 */
+					state->maxitemsize -= leftfree *
+						((100 - BTREE_SINGLEVAL_FILLFACTOR) / 100.0);
+				}
+				else if (count == 3)
+					break;
+			}
+
+			/*
+			 * Next iteration starts immediately after base tuple offset (this
+			 * will be the next offset on the page when we didn't modify the
+			 * page)
+			 */
+			offnum = state->baseoff;
+		}
+
+		offnum = OffsetNumberNext(offnum);
+	}
+
+	/* Handle the last item when pending posting list is not empty */
+	if (state->nitems != 0)
+	{
+		pagesaving += _bt_dedup_finish_pending(buf, state,
+											   RelationNeedsWAL(rel));
+		count++;
+	}
+
+	if (pagesaving < newitemsz && state->skippedbase != InvalidOffsetNumber)
+	{
+		/*
+		 * Didn't free enough space for new item in first checkingunique pass.
+		 * Try making a second pass over the page, this time starting from the
+		 * first candidate posting list base offset that was skipped over in
+		 * the first pass (only do a second pass when this actually happened).
+		 *
+		 * The second pass over the page may deduplicate items that were
+		 * initially passed over due to concerns about limiting the
+		 * effectiveness of LP_DEAD bit setting within _bt_check_unique().
+		 * Note that the second pass will still stop deduplicating as soon as
+		 * enough space has been freed to avoid an immediate page split.
+		 */
+		Assert(state->checkingunique);
+		offnum = state->skippedbase;
+
+		state->checkingunique = false;
+		state->skippedbase = InvalidOffsetNumber;
+		state->phystupsize = 0;
+		state->nitems = 0;
+		state->base = NULL;
+		state->baseoff = InvalidOffsetNumber;
+		state->basetupsize = 0;
+		goto retry;
+	}
+
+	/* Local space accounting should agree with page accounting */
+	Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
+
+	/* cannot leak memory here */
+	pfree(state->htids);
+	pfree(state);
+}
+
+/*
+ * Create a new pending posting list tuple based on caller's tuple.
+ *
+ * Every tuple processed by the deduplication routines either becomes the base
+ * tuple for a posting list, or gets its heap TID(s) accepted into a pending
+ * posting list.  A tuple that starts out as the base tuple for a posting list
+ * will only actually be rewritten within _bt_dedup_finish_pending() when
+ * there was at least one successful call to _bt_dedup_save_htid().
+ */
+void
+_bt_dedup_start_pending(BTDedupState state, IndexTuple base,
+						OffsetNumber baseoff)
+{
+	Assert(state->nhtids == 0);
+	Assert(state->nitems == 0);
+	Assert(!BTreeTupleIsPivot(base));
+
+	/*
+	 * Copy heap TIDs from new base tuple for new candidate posting list into
+	 * ipd array.  Assume that we'll eventually create a new posting tuple by
+	 * merging later tuples with this existing one, though we may not.
+	 */
+	if (!BTreeTupleIsPosting(base))
+	{
+		memcpy(state->htids, base, sizeof(ItemPointerData));
+		state->nhtids = 1;
+		/* Save size of tuple without any posting list */
+		state->basetupsize = IndexTupleSize(base);
+	}
+	else
+	{
+		int			nposting;
+
+		nposting = BTreeTupleGetNPosting(base);
+		memcpy(state->htids, BTreeTupleGetPosting(base),
+			   sizeof(ItemPointerData) * nposting);
+		state->nhtids = nposting;
+		/* Save size of tuple without any posting list */
+		state->basetupsize = BTreeTupleGetPostingOffset(base);
+	}
+
+	/*
+	 * Save new base tuple itself -- it'll be needed if we actually create a
+	 * new posting list from new pending posting list.
+	 *
+	 * Must maintain size of all tuples (including line pointer overhead) to
+	 * calculate space savings on page within _bt_dedup_finish_pending().
+	 * Also, save number of base tuple logical tuples so that we can save
+	 * cycles in the common case where an existing posting list can't or won't
+	 * be merged with other tuples on the page.
+	 */
+	state->nitems = 1;
+	state->base = base;
+	state->baseoff = baseoff;
+	state->phystupsize = MAXALIGN(IndexTupleSize(base)) + sizeof(ItemIdData);
+}
+
+/*
+ * Save itup heap TID(s) into pending posting list where possible.
+ *
+ * Returns bool indicating if the pending posting list managed by state has
+ * itup's heap TID(s) saved.  When this is false, enlarging the pending
+ * posting list by the required amount would exceed the maxitemsize limit, so
+ * caller must finish the pending posting list tuple.  (Generally itup becomes
+ * the base tuple of caller's new pending posting list).
+ */
+bool
+_bt_dedup_save_htid(BTDedupState state, IndexTuple itup)
+{
+	int			nhtids;
+	ItemPointer htids;
+	Size		mergedtupsz;
+
+	Assert(!BTreeTupleIsPivot(itup));
+
+	if (!BTreeTupleIsPosting(itup))
+	{
+		nhtids = 1;
+		htids = &itup->t_tid;
+	}
+	else
+	{
+		nhtids = BTreeTupleGetNPosting(itup);
+		htids = BTreeTupleGetPosting(itup);
+	}
+
+	/*
+	 * Don't append (have caller finish pending posting list as-is) if
+	 * appending heap TID(s) from itup would put us over limit.
+	 *
+	 * This calculation needs to match the code used within _bt_form_posting()
+	 * for new posting list tuples.
+	 */
+	mergedtupsz = MAXALIGN(state->basetupsize +
+						   (state->nhtids + nhtids) * sizeof(ItemPointerData));
+
+	if (mergedtupsz > state->maxitemsize)
+		return false;
+
+	/* Don't merge existing posting lists in first checkingunique pass */
+	if (state->checkingunique &&
+		(BTreeTupleIsPosting(state->base) || nhtids > 1))
+	{
+		/* May begin here if second pass over page is required */
+		if (state->skippedbase == InvalidOffsetNumber)
+			state->skippedbase = state->baseoff;
+		return false;
+	}
+
+	/*
+	 * Save heap TIDs to pending posting list tuple -- itup can be merged into
+	 * pending posting list
+	 */
+	state->nitems++;
+	memcpy(state->htids + state->nhtids, htids,
+		   sizeof(ItemPointerData) * nhtids);
+	state->nhtids += nhtids;
+	state->phystupsize += MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
+
+	return true;
+}
+
+/*
+ * Finalize pending posting list tuple, and add it to the page.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * Returns space saving from deduplicating to make a new posting list tuple.
+ * Note that this includes line pointer overhead.  This is zero in the case
+ * where no deduplication was possible.
+ */
+Size
+_bt_dedup_finish_pending(Buffer buf, BTDedupState state, bool logged)
+{
+	Size		spacesaving = 0;
+	Page		page = BufferGetPage(buf);
+	int			minimum = 2;
+
+	Assert(state->nitems > 0);
+	Assert(state->nitems <= state->nhtids);
+
+	/*
+	 * Only create a posting list when at least 3 heap TIDs will appear in the
+	 * checkingunique case (checkingunique strategy won't merge existing
+	 * posting list tuples, so we know that the number of items here must also
+	 * be the total number of heap TIDs).  Creating a new posting lists with
+	 * only two heap TIDs won't even save enough space to fit another
+	 * duplicate with the same key as the posting list.  This is a bad
+	 * trade-off if there is a chance that the LP_DEAD bit can be set for
+	 * either existing tuple by putting off deduplication.
+	 *
+	 * (Note that a second pass over the page can deduplicate the item if that
+	 * is truly the only way to avoid a page split for checkingunique caller.)
+	 */
+	Assert(!state->checkingunique || state->nitems == 1 ||
+		   state->nhtids == state->nitems);
+	if (state->checkingunique)
+	{
+		minimum = 3;
+		/* May begin here if second pass over page is required */
+		if (state->nitems == 2 && state->skippedbase == InvalidOffsetNumber)
+			state->skippedbase = state->baseoff;
+	}
+
+	if (state->nitems >= minimum)
+	{
+		IndexTuple	final;
+		Size		finalsz;
+		OffsetNumber offnum;
+		OffsetNumber deletable[MaxOffsetNumber];
+		int			ndeletable = 0;
+
+		/* find all tuples that will be replaced with this new posting tuple */
+		for (offnum = state->baseoff;
+			 offnum < state->baseoff + state->nitems;
+			 offnum = OffsetNumberNext(offnum))
+			deletable[ndeletable++] = offnum;
+
+		/* Form a tuple with a posting list */
+		final = _bt_form_posting(state->base, state->htids, state->nhtids);
+		finalsz = IndexTupleSize(final);
+		spacesaving = state->phystupsize - (finalsz + sizeof(ItemIdData));
+		/* Must save some space, and must not exceed tuple limits */
+		Assert(spacesaving > 0 && spacesaving < BLCKSZ);
+		Assert(finalsz <= state->maxitemsize);
+		Assert(finalsz == MAXALIGN(IndexTupleSize(final)));
+
+		START_CRIT_SECTION();
+
+		/* Delete original items */
+		PageIndexMultiDelete(page, deletable, ndeletable);
+		/* Insert posting tuple, replacing original items */
+		if (PageAddItem(page, (Item) final, finalsz, state->baseoff, false,
+						false) == InvalidOffsetNumber)
+			elog(ERROR, "deduplication failed to add tuple to page");
+
+		MarkBufferDirty(buf);
+
+		/* Log deduplicated items */
+		if (logged)
+		{
+			XLogRecPtr	recptr;
+			xl_btree_dedup xlrec_dedup;
+
+			xlrec_dedup.baseoff = state->baseoff;
+			xlrec_dedup.nitems = state->nitems;
+
+			XLogBeginInsert();
+			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+			XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+			recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP_PAGE);
+
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
+
+		pfree(final);
+	}
+
+	/* Reset state for next pending posting list */
+	state->nhtids = 0;
+	state->nitems = 0;
+	state->phystupsize = 0;
+
+	return spacesaving;
+}
+
+/*
+ * Build a posting list tuple based on caller's "base" index tuple and list of
+ * heap TIDs.  When nhtids == 1, builds a standard non-pivot tuple without a
+ * posting list. (Posting list tuples can never have a single heap TID, partly
+ * because that ensures that deduplication always reduces final MAXALIGN()'d
+ * size of entire tuple.)
+ *
+ * Convention is that posting list starts at a MAXALIGN()'d offset (rather
+ * than a SHORTALIGN()'d offset), in line with the approach taken when
+ * appending a heap TID to new pivot tuple/high key during suffix truncation.
+ * This sometimes wastes a little space that was only needed as alignment
+ * padding in the original tuple.  Following this convention simplifies the
+ * space accounting used when deduplicating a page (the same convention
+ * simplifies the accounting for choosing a point to split a page at).
+ *
+ * Note: Caller's "htids" array must be unique and already in ascending TID
+ * order.  Any existing heap TIDs from "base" won't automatically appear in
+ * returned posting list tuple (they must be included in item pointer array as
+ * required.)
+ */
+IndexTuple
+_bt_form_posting(IndexTuple base, ItemPointer htids, int nhtids)
+{
+	uint32		keysize,
+				newsize;
+	IndexTuple	itup;
+
+	/* We only use the key from the base tuple */
+	if (BTreeTupleIsPosting(base))
+		keysize = BTreeTupleGetPostingOffset(base);
+	else
+		keysize = IndexTupleSize(base);
+
+	Assert(!BTreeTupleIsPivot(base));
+	Assert(nhtids > 0 && nhtids <= PG_UINT16_MAX);
+	Assert(keysize == MAXALIGN(keysize));
+
+	/*
+	 * Determine final size of new tuple.
+	 *
+	 * The calculation used when new tuple has a posting list needs to match
+	 * the code used within _bt_dedup_save_htid().
+	 */
+	if (nhtids > 1)
+		newsize = MAXALIGN(keysize +
+						   nhtids * sizeof(ItemPointerData));
+	else
+		newsize = keysize;
+
+	Assert(newsize <= INDEX_SIZE_MASK);
+	Assert(newsize == MAXALIGN(newsize));
+
+	/* Allocate memory using palloc0() to match index_form_tuple() */
+	itup = palloc0(newsize);
+	memcpy(itup, base, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+	if (nhtids > 1)
+	{
+		/* Form posting list tuple */
+		BTreeTupleSetPosting(itup, nhtids, keysize);
+		memcpy(BTreeTupleGetPosting(itup), htids,
+			   sizeof(ItemPointerData) * nhtids);
+		Assert(BTreeTupleIsPosting(itup));
+
+#ifdef USE_ASSERT_CHECKING
+		{
+			/* Verify posting list invariants with assertions */
+			ItemPointerData last;
+			ItemPointer htid;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+				htid = BTreeTupleGetPostingN(itup, i);
+
+				Assert(ItemPointerIsValid(htid));
+				Assert(ItemPointerCompare(htid, &last) > 0);
+				ItemPointerCopy(htid, &last);
+			}
+		}
+#endif
+	}
+	else
+	{
+		/*
+		 * Copy only TID in htids array to header field (i.e. create standard
+		 * non-pivot representation)
+		 */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		Assert(ItemPointerIsValid(&itup->t_tid));
+		ItemPointerCopy(htids, &itup->t_tid);
+	}
+
+	return itup;
+}
+
+/*
+ * Prepare for a posting list split by swapping heap TID in newitem with heap
+ * TID from original posting list (the 'oposting' heap TID located at offset
+ * 'postingoff').  Modifies newitem, so caller should probably pass their own
+ * private copy that can safely be modified.
+ *
+ * Returns new posting list tuple, which is palloc()'d in caller's context.
+ * This is guaranteed to be the same size as 'oposting'.  Modified newitem is
+ * what caller actually inserts. (This generally happens inside the same
+ * critical section that performs an in-place update of old posting list using
+ * new posting list returned here).
+ *
+ * Caller should avoid assuming that the IndexTuple-wise key representation in
+ * newitem is bitwise equal to the representation used within oposting.  Note,
+ * in particular, that one may even be larger than the other.  This could
+ * occur due to differences in TOAST input state, for example.
+ *
+ * See nbtree/README for details on the design of posting list splits.
+ */
+IndexTuple
+_bt_swap_posting(IndexTuple newitem, IndexTuple oposting, int postingoff)
+{
+	int			nhtids;
+	char	   *replacepos;
+	char	   *rightpos;
+	Size		nbytes;
+	IndexTuple	nposting;
+
+	Assert(!BTreeTupleIsPivot(newitem));
+
+	nhtids = BTreeTupleGetNPosting(oposting);
+	Assert(postingoff > 0 && postingoff < nhtids);
+
+	nposting = CopyIndexTuple(oposting);
+	replacepos = (char *) BTreeTupleGetPostingN(nposting, postingoff);
+	rightpos = replacepos + sizeof(ItemPointerData);
+	nbytes = (nhtids - postingoff - 1) * sizeof(ItemPointerData);
+
+	/*
+	 * Move item pointers in posting list to make a gap for the new item's
+	 * heap TID (shift TIDs one place to the right, losing original rightmost
+	 * TID)
+	 */
+	memmove(rightpos, replacepos, nbytes);
+
+	/* Fill the gap with the TID of the new item */
+	ItemPointerCopy(&newitem->t_tid, (ItemPointer) replacepos);
+
+	/* Copy original posting list's rightmost TID into new item */
+	ItemPointerCopy(BTreeTupleGetPostingN(oposting, nhtids - 1),
+					&newitem->t_tid);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(nposting),
+							  BTreeTupleGetHeapTID(newitem)) < 0);
+	Assert(BTreeTupleGetNPosting(oposting) == BTreeTupleGetNPosting(nposting));
+
+	return nposting;
+}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 144d339e8d..51468e0455 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -28,6 +28,8 @@
 /* Minimum tree height for application of fastpath optimization */
 #define BTREE_FASTPATH_MIN_LEVEL	2
 
+/* GUC parameter */
+bool		btree_deduplication = true;
 
 static Buffer _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
 
@@ -47,10 +49,12 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   BTStack stack,
 						   IndexTuple itup,
 						   OffsetNumber newitemoff,
+						   int postingoff,
 						   bool split_only_page);
 static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 						Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
-						IndexTuple newitem);
+						IndexTuple newitem, IndexTuple orignewitem,
+						IndexTuple nposting, uint16 postingoff);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 							  BTStack stack, bool is_root, bool is_only);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
@@ -125,6 +129,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	insertstate.itup_key = itup_key;
 	insertstate.bounds_valid = false;
 	insertstate.buf = InvalidBuffer;
+	insertstate.postingoff = 0;
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -300,7 +305,7 @@ top:
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
 		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
-					   itup, newitemoff, false);
+					   itup, newitemoff, insertstate.postingoff, false);
 	}
 	else
 	{
@@ -353,6 +358,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 	BTPageOpaque opaque;
 	Buffer		nbuf = InvalidBuffer;
 	bool		found = false;
+	bool		inposting = false;
+	bool		prevalldead = true;
+	int			curposti = 0;
 
 	/* Assume unique until we find a duplicate */
 	*is_unique = true;
@@ -374,6 +382,11 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 
 	/*
 	 * Scan over all equal tuples, looking for live conflicts.
+	 *
+	 * Note that each iteration of the loop processes one heap TID, not one
+	 * index tuple.  The page offset number won't be advanced for iterations
+	 * which process heap TIDs from posting list tuples until the last such
+	 * heap TID for the posting list (curposti will be advanced instead).
 	 */
 	Assert(!insertstate->bounds_valid || insertstate->low == offset);
 	Assert(!itup_key->anynullkeys);
@@ -435,7 +448,28 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 
 				/* okay, we gotta fetch the heap tuple ... */
 				curitup = (IndexTuple) PageGetItem(page, curitemid);
-				htid = curitup->t_tid;
+
+				/*
+				 * decide if this is the first heap TID in tuple we'll
+				 * process, or if we should continue to process current
+				 * posting list
+				 */
+				Assert(!BTreeTupleIsPivot(curitup));
+				if (!BTreeTupleIsPosting(curitup))
+				{
+					htid = curitup->t_tid;
+					inposting = false;
+				}
+				else if (!inposting)
+				{
+					/* First heap TID in posting list */
+					inposting = true;
+					prevalldead = true;
+					curposti = 0;
+				}
+
+				if (inposting)
+					htid = *BTreeTupleGetPostingN(curitup, curposti);
 
 				/*
 				 * If we are doing a recheck, we expect to find the tuple we
@@ -511,8 +545,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					 * not part of this chain because it had a different index
 					 * entry.
 					 */
-					htid = itup->t_tid;
-					if (table_index_fetch_tuple_check(heapRel, &htid,
+					if (table_index_fetch_tuple_check(heapRel, &itup->t_tid,
 													  SnapshotSelf, NULL))
 					{
 						/* Normal case --- it's still live */
@@ -570,12 +603,14 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 													RelationGetRelationName(rel))));
 					}
 				}
-				else if (all_dead)
+				else if (all_dead && (!inposting ||
+									  (prevalldead &&
+									   curposti == BTreeTupleGetNPosting(curitup) - 1)))
 				{
 					/*
-					 * The conflicting tuple (or whole HOT chain) is dead to
-					 * everyone, so we may as well mark the index entry
-					 * killed.
+					 * The conflicting tuple (or all HOT chains pointed to by
+					 * all posting list TIDs) is dead to everyone, so mark the
+					 * index entry killed.
 					 */
 					ItemIdMarkDead(curitemid);
 					opaque->btpo_flags |= BTP_HAS_GARBAGE;
@@ -589,14 +624,29 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					else
 						MarkBufferDirtyHint(insertstate->buf, true);
 				}
+
+				/*
+				 * Remember if posting list tuple has even a single HOT chain
+				 * whose members are not all dead
+				 */
+				if (!all_dead && inposting)
+					prevalldead = false;
 			}
 		}
 
-		/*
-		 * Advance to next tuple to continue checking.
-		 */
-		if (offset < maxoff)
+		if (inposting && curposti < BTreeTupleGetNPosting(curitup) - 1)
+		{
+			/* Advance to next TID in same posting list */
+			curposti++;
+			continue;
+		}
+		else if (offset < maxoff)
+		{
+			/* Advance to next tuple */
+			curposti = 0;
+			inposting = false;
 			offset = OffsetNumberNext(offset);
+		}
 		else
 		{
 			int			highkeycmp;
@@ -621,6 +671,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					elog(ERROR, "fell off the end of index \"%s\"",
 						 RelationGetRelationName(rel));
 			}
+			curposti = 0;
+			inposting = false;
 			maxoff = PageGetMaxOffsetNumber(page);
 			offset = P_FIRSTDATAKEY(opaque);
 			/* Don't invalidate binary search bounds */
@@ -689,6 +741,7 @@ _bt_findinsertloc(Relation rel,
 	BTScanInsert itup_key = insertstate->itup_key;
 	Page		page = BufferGetPage(insertstate->buf);
 	BTPageOpaque lpageop;
+	OffsetNumber newitemoff;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -704,6 +757,8 @@ _bt_findinsertloc(Relation rel,
 
 	if (itup_key->heapkeyspace)
 	{
+		bool		dedupunique = false;
+
 		/*
 		 * If we're inserting into a unique index, we may have to walk right
 		 * through leaf pages to find the one leaf page that we must insert on
@@ -717,9 +772,25 @@ _bt_findinsertloc(Relation rel,
 		 * tuple belongs on.  The heap TID attribute for new tuple (scantid)
 		 * could force us to insert on a sibling page, though that should be
 		 * very rare in practice.
+		 *
+		 * checkingunique inserters that encounter a duplicate will apply
+		 * deduplication when it looks like there will be a page split, but
+		 * there is no LP_DEAD garbage on the leaf page to vacuum away (or
+		 * there wasn't enough space freed by LP_DEAD cleanup).  This
+		 * complements the opportunistic LP_DEAD vacuuming mechanism.  The
+		 * high level goal is to avoid page splits caused by new, unchanged
+		 * versions of existing logical rows altogether.  See nbtree/README
+		 * for full details.
 		 */
 		if (checkingunique)
 		{
+			if (insertstate->low < insertstate->stricthigh)
+			{
+				/* Encountered a duplicate in _bt_check_unique() */
+				Assert(insertstate->bounds_valid);
+				dedupunique = true;
+			}
+
 			for (;;)
 			{
 				/*
@@ -746,18 +817,37 @@ _bt_findinsertloc(Relation rel,
 				/* Update local state after stepping right */
 				page = BufferGetPage(insertstate->buf);
 				lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+				/* Assume duplicates (helpful when initial page is empty) */
+				dedupunique = true;
 			}
 		}
 
 		/*
 		 * If the target page is full, see if we can obtain enough space by
-		 * erasing LP_DEAD items
+		 * erasing LP_DEAD items.  If that doesn't work out, try to obtain
+		 * enough free space to avoid a page split by deduplicating existing
+		 * items (if deduplication is safe).
 		 */
-		if (PageGetFreeSpace(page) < insertstate->itemsz &&
-			P_HAS_GARBAGE(lpageop))
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
 		{
-			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
-			insertstate->bounds_valid = false;
+			if (P_HAS_GARBAGE(lpageop))
+			{
+				_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+				insertstate->bounds_valid = false;
+
+				/* Might as well assume duplicates if checkingunique */
+				dedupunique = true;
+			}
+
+			if (itup_key->safededup && BTGetUseDedup(rel) &&
+				PageGetFreeSpace(page) < insertstate->itemsz &&
+				(!checkingunique || dedupunique))
+			{
+				_bt_dedup_one_page(rel, insertstate->buf, heapRel,
+								   insertstate->itup, insertstate->itemsz,
+								   checkingunique);
+				insertstate->bounds_valid = false;
+			}
 		}
 	}
 	else
@@ -839,7 +929,36 @@ _bt_findinsertloc(Relation rel,
 	Assert(P_RIGHTMOST(lpageop) ||
 		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
 
-	return _bt_binsrch_insert(rel, insertstate);
+	newitemoff = _bt_binsrch_insert(rel, insertstate);
+
+	if (insertstate->postingoff == -1)
+	{
+		/*
+		 * There is an overlapping posting list tuple with its LP_DEAD bit
+		 * set.  _bt_insertonpg() cannot handle this, so delete all LP_DEAD
+		 * items early.  This is the only case where LP_DEAD deletes happen
+		 * even though a page split wouldn't take place if we went straight to
+		 * the _bt_insertonpg() call.
+		 *
+		 * Call _bt_dedup_one_page() instead of _bt_vacuum_one_page() to force
+		 * deletes (this avoids relying on the BTP_HAS_GARBAGE hint flag,
+		 * which might be falsely unset).  Call can't actually dedup items,
+		 * since we pass a newitemsz of 0.
+		 */
+		_bt_dedup_one_page(rel, insertstate->buf, heapRel,
+						   insertstate->itup, 0, true);
+
+		/*
+		 * Do new binary search, having killed LP_DEAD items.  New insert
+		 * location cannot overlap with any posting list now.
+		 */
+		insertstate->bounds_valid = false;
+		insertstate->postingoff = 0;
+		newitemoff = _bt_binsrch_insert(rel, insertstate);
+		Assert(insertstate->postingoff == 0);
+	}
+
+	return newitemoff;
 }
 
 /*
@@ -905,10 +1024,12 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		This recursive procedure does the following things:
  *
+ *			+  if postingoff != 0, splits existing posting list tuple
+ *			   (since it overlaps with new 'itup' tuple).
  *			+  if necessary, splits the target page, using 'itup_key' for
  *			   suffix truncation on leaf pages (caller passes NULL for
  *			   non-leaf pages).
- *			+  inserts the tuple.
+ *			+  inserts the new tuple (might be split from posting list).
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
  *			   right using information stored in the parent stack).
@@ -936,11 +1057,15 @@ _bt_insertonpg(Relation rel,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
+			   int postingoff,
 			   bool split_only_page)
 {
 	Page		page;
 	BTPageOpaque lpageop;
 	Size		itemsz;
+	IndexTuple	oposting;
+	IndexTuple	origitup = NULL;
+	IndexTuple	nposting = NULL;
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1079,7 @@ _bt_insertonpg(Relation rel,
 	Assert(P_ISLEAF(lpageop) ||
 		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
+	Assert(!BTreeTupleIsPosting(itup));
 
 	/* The caller should've finished any incomplete splits already. */
 	if (P_INCOMPLETE_SPLIT(lpageop))
@@ -964,6 +1090,34 @@ _bt_insertonpg(Relation rel,
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
+	/*
+	 * Do we need to split an existing posting list item?
+	 */
+	if (postingoff != 0)
+	{
+		ItemId		itemid = PageGetItemId(page, newitemoff);
+
+		/*
+		 * The new tuple is a duplicate with a heap TID that falls inside the
+		 * range of an existing posting list tuple on a leaf page.  Prepare to
+		 * split an existing posting list.  Overwriting the posting list with
+		 * its post-split version is treated as an extra step in either the
+		 * insert or page split critical section.
+		 */
+		Assert(P_ISLEAF(lpageop) && !ItemIdIsDead(itemid));
+		Assert(itup_key->heapkeyspace && itup_key->safededup);
+		oposting = (IndexTuple) PageGetItem(page, itemid);
+
+		/* use a mutable copy of itup as our itup from here on */
+		origitup = itup;
+		itup = CopyIndexTuple(origitup);
+		nposting = _bt_swap_posting(itup, oposting, postingoff);
+		/* itup now contains rightmost TID from oposting */
+
+		/* Alter offset so that newitem goes after posting list */
+		newitemoff = OffsetNumberNext(newitemoff);
+	}
+
 	/*
 	 * Do we need to split the page to fit the item on it?
 	 *
@@ -996,7 +1150,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+						 origitup, nposting, postingoff);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1071,6 +1226,9 @@ _bt_insertonpg(Relation rel,
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
+		if (postingoff != 0)
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
 		if (!_bt_pgaddtup(page, itemsz, itup, newitemoff))
 			elog(PANIC, "failed to add new item to block %u in index \"%s\"",
 				 itup_blkno, RelationGetRelationName(rel));
@@ -1120,8 +1278,19 @@ _bt_insertonpg(Relation rel,
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
 
-			if (P_ISLEAF(lpageop))
+			if (P_ISLEAF(lpageop) && postingoff == 0)
+			{
+				/* Simple leaf insert */
 				xlinfo = XLOG_BTREE_INSERT_LEAF;
+			}
+			else if (postingoff != 0)
+			{
+				/*
+				 * Leaf insert with posting list split.  Must include
+				 * postingoff field before newitem/orignewitem.
+				 */
+				xlinfo = XLOG_BTREE_INSERT_POST;
+			}
 			else
 			{
 				/*
@@ -1144,6 +1313,7 @@ _bt_insertonpg(Relation rel,
 				xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 				xlmeta.last_cleanup_num_heap_tuples =
 					metad->btm_last_cleanup_num_heap_tuples;
+				xlmeta.safededup = metad->btm_safededup;
 
 				XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 				XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));
@@ -1152,7 +1322,27 @@ _bt_insertonpg(Relation rel,
 			}
 
 			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
-			XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+			if (postingoff == 0)
+			{
+				/* Simple, common case -- log itup from caller */
+				XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+			}
+			else
+			{
+				/*
+				 * Insert with posting list split (XLOG_BTREE_INSERT_POST
+				 * record) case.
+				 *
+				 * Log postingoff.  Also log origitup, not itup.  REDO routine
+				 * must reconstruct final itup (as well as nposting) using
+				 * _bt_swap_posting().
+				 */
+				uint16		upostingoff = postingoff;
+
+				XLogRegisterBufData(0, (char *) &upostingoff, sizeof(uint16));
+				XLogRegisterBufData(0, (char *) origitup,
+									IndexTupleSize(origitup));
+			}
 
 			recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
@@ -1194,6 +1384,14 @@ _bt_insertonpg(Relation rel,
 			_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
 			RelationSetTargetBlock(rel, cachedBlock);
 	}
+
+	/* be tidy */
+	if (postingoff != 0)
+	{
+		/* itup is actually a modified copy of caller's original */
+		pfree(nposting);
+		pfree(itup);
+	}
 }
 
 /*
@@ -1209,12 +1407,24 @@ _bt_insertonpg(Relation rel,
  *		This function will clear the INCOMPLETE_SPLIT flag on it, and
  *		release the buffer.
  *
+ *		orignewitem, nposting, and postingoff are needed when an insert of
+ *		orignewitem results in both a posting list split and a page split.
+ *		These extra posting list split details are used here in the same
+ *		way as they are used in the more common case where a posting list
+ *		split does not coincide with a page split.  We need to deal with
+ *		posting list splits directly in order to ensure that everything
+ *		that follows from the insert of orignewitem is handled as a single
+ *		atomic operation (though caller's insert of a new pivot/downlink
+ *		into parent page will still be a separate operation).  See
+ *		nbtree/README for details on the design of posting list splits.
+ *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
  */
 static Buffer
 _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+		  IndexTuple orignewitem, IndexTuple nposting, uint16 postingoff)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1234,6 +1444,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	OffsetNumber leftoff,
 				rightoff;
 	OffsetNumber firstright;
+	OffsetNumber origpagepostingoff;
 	OffsetNumber maxoff;
 	OffsetNumber i;
 	bool		newitemonleft,
@@ -1303,6 +1514,34 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	PageSetLSN(leftpage, PageGetLSN(origpage));
 	isleaf = P_ISLEAF(oopaque);
 
+	/*
+	 * Determine page offset number of existing overlapped-with-orignewitem
+	 * posting list when it is necessary to perform a posting list split in
+	 * passing.  Note that newitem was already changed by caller (newitem no
+	 * longer has the orignewitem TID).
+	 *
+	 * This page offset number (origpagepostingoff) will be used to pretend
+	 * that the posting split has already taken place, even though the
+	 * required modifications to origpage won't occur until we reach the
+	 * critical section.  The lastleft and firstright tuples of our page split
+	 * point should, in effect, come from an imaginary version of origpage
+	 * that has the nposting tuple instead of the original posting list tuple.
+	 *
+	 * Note: _bt_findsplitloc() should have compensated for coinciding posting
+	 * list splits in just the same way, at least in theory.  It doesn't
+	 * bother with that, though.  In practice it won't affect its choice of
+	 * split point.
+	 */
+	origpagepostingoff = InvalidOffsetNumber;
+	if (postingoff != 0)
+	{
+		Assert(isleaf);
+		Assert(ItemPointerCompare(&orignewitem->t_tid,
+								  &newitem->t_tid) < 0);
+		Assert(BTreeTupleIsPosting(nposting));
+		origpagepostingoff = OffsetNumberPrev(newitemoff);
+	}
+
 	/*
 	 * The "high key" for the new left page will be the first key that's going
 	 * to go into the new right page, or a truncated version if this is a leaf
@@ -1340,6 +1579,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, firstright);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		if (firstright == origpagepostingoff)
+			item = nposting;
 	}
 
 	/*
@@ -1373,6 +1614,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
 			itemid = PageGetItemId(origpage, lastleftoff);
 			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+			if (lastleftoff == origpagepostingoff)
+				lastleft = nposting;
 		}
 
 		Assert(lastleft != item);
@@ -1388,6 +1631,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	 */
 	leftoff = P_HIKEY;
 
+	Assert(BTreeTupleIsPivot(lefthikey) || !itup_key->heapkeyspace);
 	Assert(BTreeTupleGetNAtts(lefthikey, rel) > 0);
 	Assert(BTreeTupleGetNAtts(lefthikey, rel) <= indnkeyatts);
 	if (PageAddItem(leftpage, (Item) lefthikey, itemsz, leftoff,
@@ -1452,6 +1696,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		Assert(BTreeTupleIsPivot(item) || !itup_key->heapkeyspace);
 		Assert(BTreeTupleGetNAtts(item, rel) > 0);
 		Assert(BTreeTupleGetNAtts(item, rel) <= indnkeyatts);
 		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
@@ -1480,8 +1725,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 
+		/* replace original item with nposting due to posting split? */
+		if (i == origpagepostingoff)
+		{
+			Assert(BTreeTupleIsPosting(item));
+			Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+			item = nposting;
+		}
+
 		/* does new item belong before this one? */
-		if (i == newitemoff)
+		else if (i == newitemoff)
 		{
 			if (newitemonleft)
 			{
@@ -1650,8 +1903,12 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		XLogRecPtr	recptr;
 
 		xlrec.level = ropaque->btpo.level;
+		/* See comments below on newitem, orignewitem, and posting lists */
 		xlrec.firstright = firstright;
 		xlrec.newitemoff = newitemoff;
+		xlrec.postingoff = 0;
+		if (postingoff != 0 && origpagepostingoff < firstright)
+			xlrec.postingoff = postingoff;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1670,11 +1927,35 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		 * because it's included with all the other items on the right page.)
 		 * Show the new item as belonging to the left page buffer, so that it
 		 * is not stored if XLogInsert decides it needs a full-page image of
-		 * the left page.  We store the offset anyway, though, to support
-		 * archive compression of these records.
+		 * the left page.  We always store newitemoff in the record, though.
+		 *
+		 * The details are sometimes slightly different for page splits that
+		 * coincide with a posting list split.  If both the replacement
+		 * posting list and newitem go on the right page, then we don't need
+		 * to log anything extra, just like the simple !newitemonleft
+		 * no-posting-split case (postingoff is set to zero in the WAL record,
+		 * so recovery doesn't need to process a posting list split at all).
+		 * Otherwise, we set postingoff and log orignewitem instead of
+		 * newitem, despite having actually inserted newitem.  REDO routine
+		 * must reconstruct nposting and newitem using _bt_swap_posting().
+		 *
+		 * Note: It's possible that our page split point is the point that
+		 * makes the posting list lastleft and newitem firstright.  This is
+		 * the only case where we log orignewitem/newitem despite newitem
+		 * going on the right page.  If XLogInsert decides that it can omit
+		 * orignewitem due to logging a full-page image of the left page,
+		 * everything still works out, since recovery only needs to log
+		 * orignewitem for items on the left page (just like the regular
+		 * newitem-logged case).
 		 */
-		if (newitemonleft)
+		if (newitemonleft && xlrec.postingoff == 0)
 			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+		else if (xlrec.postingoff != 0)
+		{
+			Assert(newitemonleft || firstright == newitemoff);
+			Assert(MAXALIGN(newitemsz) == IndexTupleSize(orignewitem));
+			XLogRegisterBufData(0, (char *) orignewitem, MAXALIGN(newitemsz));
+		}
 
 		/* Log the left page's new high key */
 		itemid = PageGetItemId(origpage, P_HIKEY);
@@ -1834,7 +2115,7 @@ _bt_insert_parent(Relation rel,
 
 		/* Recursively insert into the parent */
 		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
-					   new_item, stack->bts_offset + 1,
+					   new_item, stack->bts_offset + 1, 0,
 					   is_only);
 
 		/* be tidy */
@@ -2190,6 +2471,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 		md.fastlevel = metad->btm_level;
 		md.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 		md.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+		md.safededup = metad->btm_safededup;
 
 		XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -2303,6 +2585,6 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * Note: if we didn't find any LP_DEAD items, then the page's
 	 * BTP_HAS_GARBAGE hint bit is falsely set.  We do not bother expending a
 	 * separate write to clear it, however.  We will clear it when we split
-	 * the page.
+	 * the page, or when deduplication runs.
 	 */
 }
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index f05cbe7467..23ab30fa9b 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -37,6 +38,8 @@ static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
 static bool _bt_mark_page_halfdead(Relation rel, Buffer buf, BTStack stack);
 static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
 									 bool *rightsib_empty);
+static TransactionId _bt_xid_horizon(Relation rel, Relation heapRel, Page page,
+									 OffsetNumber *deletable, int ndeletable);
 static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
 								   BTStack stack, Buffer *topparent, OffsetNumber *topoff,
 								   BlockNumber *target, BlockNumber *rightsib);
@@ -47,7 +50,8 @@ static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
  *	_bt_initmetapage() -- Fill a page buffer with a correct metapage image
  */
 void
-_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
+_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+				 bool safededup)
 {
 	BTMetaPageData *metad;
 	BTPageOpaque metaopaque;
@@ -63,6 +67,7 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
 	metad->btm_fastlevel = level;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+	metad->btm_safededup = safededup;
 
 	metaopaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	metaopaque->btpo_flags = BTP_META;
@@ -102,6 +107,9 @@ _bt_upgrademetapage(Page page)
 	metad->btm_version = BTREE_NOVAC_VERSION;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+	/* Only a REINDEX can set this field */
+	Assert(!metad->btm_safededup);
+	metad->btm_safededup = false;
 
 	/* Adjust pd_lower (see _bt_initmetapage() for details) */
 	((PageHeader) page)->pd_lower =
@@ -213,6 +221,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 		md.fastlevel = metad->btm_fastlevel;
 		md.oldest_btpo_xact = oldestBtpoXact;
 		md.last_cleanup_num_heap_tuples = numHeapTuples;
+		md.safededup = metad->btm_safededup;
 
 		XLogRegisterBufData(0, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -274,6 +283,8 @@ _bt_getroot(Relation rel, int access)
 		Assert(metad->btm_magic == BTREE_MAGIC);
 		Assert(metad->btm_version >= BTREE_MIN_VERSION);
 		Assert(metad->btm_version <= BTREE_VERSION);
+		Assert(!metad->btm_safededup ||
+			   metad->btm_version > BTREE_NOVAC_VERSION);
 		Assert(metad->btm_root != P_NONE);
 
 		rootblkno = metad->btm_fastroot;
@@ -394,6 +405,7 @@ _bt_getroot(Relation rel, int access)
 			md.fastlevel = 0;
 			md.oldest_btpo_xact = InvalidTransactionId;
 			md.last_cleanup_num_heap_tuples = -1.0;
+			md.safededup = metad->btm_safededup;
 
 			XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -618,22 +630,33 @@ _bt_getrootheight(Relation rel)
 	Assert(metad->btm_magic == BTREE_MAGIC);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
 	Assert(metad->btm_version <= BTREE_VERSION);
+	Assert(!metad->btm_safededup || metad->btm_version > BTREE_NOVAC_VERSION);
 	Assert(metad->btm_fastroot != P_NONE);
 
 	return metad->btm_fastlevel;
 }
 
 /*
- *	_bt_heapkeyspace() -- is heap TID being treated as a key?
+ *	_bt_metaversion() -- Get version/status info from metapage.
+ *
+ *		Sets caller's *heapkeyspace and *safededup arguments using data from
+ *		the B-Tree metapage (could be locally-cached version).  This
+ *		information needs to be stashed in insertion scankey, so we provide a
+ *		single function that fetches both at once.
  *
  *		This is used to determine the rules that must be used to descend a
  *		btree.  Version 4 indexes treat heap TID as a tiebreaker attribute.
  *		pg_upgrade'd version 3 indexes need extra steps to preserve reasonable
  *		performance when inserting a new BTScanInsert-wise duplicate tuple
  *		among many leaf pages already full of such duplicates.
+ *
+ *		Also sets field that indicates to caller whether or not it is safe to
+ *		apply deduplication within index.  Note that we rely on the assumption
+ *		that btm_safededup will be zero'ed on heapkeyspace indexes that were
+ *		pg_upgrade'd from Postgres 12.
  */
-bool
-_bt_heapkeyspace(Relation rel)
+void
+_bt_metaversion(Relation rel, bool *heapkeyspace, bool *safededup)
 {
 	BTMetaPageData *metad;
 
@@ -651,10 +674,11 @@ _bt_heapkeyspace(Relation rel)
 		 */
 		if (metad->btm_root == P_NONE)
 		{
-			uint32		btm_version = metad->btm_version;
+			*heapkeyspace = metad->btm_version > BTREE_NOVAC_VERSION;
+			*safededup = metad->btm_safededup;
 
 			_bt_relbuf(rel, metabuf);
-			return btm_version > BTREE_NOVAC_VERSION;
+			return;
 		}
 
 		/*
@@ -678,9 +702,11 @@ _bt_heapkeyspace(Relation rel)
 	Assert(metad->btm_magic == BTREE_MAGIC);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
 	Assert(metad->btm_version <= BTREE_VERSION);
+	Assert(!metad->btm_safededup || metad->btm_version > BTREE_NOVAC_VERSION);
 	Assert(metad->btm_fastroot != P_NONE);
 
-	return metad->btm_version > BTREE_NOVAC_VERSION;
+	*heapkeyspace = metad->btm_version > BTREE_NOVAC_VERSION;
+	*safededup = metad->btm_safededup;
 }
 
 /*
@@ -964,28 +990,90 @@ _bt_page_recyclable(Page page)
  * Delete item(s) from a btree leaf page during VACUUM.
  *
  * This routine assumes that the caller has a super-exclusive write lock on
- * the buffer.  Also, the given deletable array *must* be sorted in ascending
- * order.
+ * the buffer.  Also, the given deletable and updatable arrays *must* be
+ * sorted in ascending order.
+ *
+ * Routine deals with "deleting logical tuples" when some (but not all) of the
+ * heap TIDs in an existing posting list item are to be removed by VACUUM.
+ * This works by updating/overwriting an existing item with caller's new
+ * version of the item (a version that lacks the TIDs that are to be deleted).
  *
  * We record VACUUMs and b-tree deletes differently in WAL.  Deletes must
  * generate their own latestRemovedXid by accessing the heap directly, whereas
- * VACUUMs rely on the initial heap scan taking care of it indirectly.
+ * VACUUMs rely on the initial heap scan taking care of it indirectly.  Also,
+ * only VACUUM can perform granular deletes of individual TIDs in posting list
+ * tuples.
  */
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
-					OffsetNumber *deletable, int ndeletable)
+					OffsetNumber *deletable, int ndeletable,
+					OffsetNumber *updatable, IndexTuple *updated,
+					int nupdatable)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	IndexTuple	itup;
+	Size		itemsz;
+	char	   *updatedbuf = NULL;
+	Size		updatedbuflen;
 
 	/* Shouldn't be called unless there's something to do */
-	Assert(ndeletable > 0);
+	Assert(ndeletable > 0 || nupdatable > 0);
+
+	/* XLOG stuff -- allocate and fill buffer before critical section */
+	if (nupdatable > 0 && RelationNeedsWAL(rel))
+	{
+		Size		offset;
+
+		updatedbuflen = 0;
+		for (int i = 0; i < nupdatable; i++)
+		{
+			itup = updated[i];
+			itemsz = MAXALIGN(IndexTupleSize(itup));
+			updatedbuflen += itemsz;
+		}
+
+		updatedbuf = palloc(updatedbuflen);
+		offset = 0;
+		for (int i = 0; i < nupdatable; i++)
+		{
+			itup = updated[i];
+			itemsz = MAXALIGN(IndexTupleSize(itup));
+			memcpy(updatedbuf + offset, itup, itemsz);
+			offset += itemsz;
+		}
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
-	/* Fix the page */
-	PageIndexMultiDelete(page, deletable, ndeletable);
+	/*
+	 * Handle posting tuple updates.
+	 *
+	 * Deliberately do this before handling simple deletes.  If we did it the
+	 * other way around (i.e. WAL record order -- simple deletes before
+	 * updates) then we'd have to make compensating changes to the 'updatable'
+	 * array of offset numbers.
+	 *
+	 * PageIndexTupleOverwrite() won't unset each item's LP_DEAD bit when it
+	 * happens to already be set.  Although we unset the BTP_HAS_GARBAGE page
+	 * level flag, unsetting individual LP_DEAD bits should still be avoided.
+	 */
+	for (int i = 0; i < nupdatable; i++)
+	{
+		OffsetNumber offnum = updatable[i];
+
+		itup = updated[i];
+		itemsz = MAXALIGN(IndexTupleSize(itup));
+
+		if (!PageIndexTupleOverwrite(page, offnum, (Item) itup, itemsz))
+			elog(PANIC, "could not update partially dead item in block %u of index \"%s\"",
+				 BufferGetBlockNumber(buf), RelationGetRelationName(rel));
+	}
+
+	/* Now handle simple deletes of entire physical tuples */
+	if (ndeletable > 0)
+		PageIndexMultiDelete(page, deletable, ndeletable);
 
 	/*
 	 * We can clear the vacuum cycle ID since this page has certainly been
@@ -1006,7 +1094,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	 * limited, since we never falsely unset an LP_DEAD bit.  Workloads that
 	 * are particularly dependent on LP_DEAD bits being set quickly will
 	 * usually manage to set the BTP_HAS_GARBAGE flag before the page fills up
-	 * again anyway.
+	 * again anyway.  Furthermore, attempting a deduplication pass will remove
+	 * all LP_DEAD items, regardless of whether the BTP_HAS_GARBAGE hint bit
+	 * is set or not.
 	 */
 	opaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 
@@ -1019,18 +1109,22 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.ndeleted = ndeletable;
+		xlrec_vacuum.nupdated = nupdatable;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
 		XLogRegisterData((char *) &xlrec_vacuum, SizeOfBtreeVacuum);
 
-		/*
-		 * The deletable array is not in the buffer, but pretend that it is.
-		 * When XLogInsert stores the whole buffer, the array need not be
-		 * stored too.
-		 */
-		XLogRegisterBufData(0, (char *) deletable,
-							ndeletable * sizeof(OffsetNumber));
+		if (ndeletable > 0)
+			XLogRegisterBufData(0, (char *) deletable,
+								ndeletable * sizeof(OffsetNumber));
+
+		if (nupdatable > 0)
+		{
+			XLogRegisterBufData(0, (char *) updatable,
+								nupdatable * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, updatedbuf, updatedbuflen);
+		}
 
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
@@ -1038,6 +1132,10 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	}
 
 	END_CRIT_SECTION();
+
+	/* can't leak memory here */
+	if (updatedbuf != NULL)
+		pfree(updatedbuf);
 }
 
 /*
@@ -1050,6 +1148,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
  * This is nearly the same as _bt_delitems_vacuum as far as what it does to
  * the page, but it needs to generate its own latestRemovedXid by accessing
  * the heap.  This is used by the REDO routine to generate recovery conflicts.
+ * Also, it doesn't handle posting list tuples unless the entire physical
+ * tuple can be deleted as a whole (since there is only one LP_DEAD bit per
+ * line pointer).
  */
 void
 _bt_delitems_delete(Relation rel, Buffer buf,
@@ -1065,8 +1166,7 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 
 	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
 		latestRemovedXid =
-			index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
-												 deletable, ndeletable);
+			_bt_xid_horizon(rel, heapRel, page, deletable, ndeletable);
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
@@ -1113,6 +1213,84 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 	END_CRIT_SECTION();
 }
 
+/*
+ * Get the latestRemovedXid from the table entries pointed to by the non-pivot
+ * tuples being deleted.
+ *
+ * This is a specialized version of index_compute_xid_horizon_for_tuples().
+ * It's needed because btree tuples don't always store table TID using the
+ * standard index tuple header field.
+ */
+static TransactionId
+_bt_xid_horizon(Relation rel, Relation heapRel, Page page,
+				OffsetNumber *deletable, int ndeletable)
+{
+	TransactionId latestRemovedXid = InvalidTransactionId;
+	int			spacenhtids;
+	int			nhtids;
+	ItemPointer htids;
+
+	/* Array will grow iff there are posting list tuples to consider */
+	spacenhtids = ndeletable;
+	nhtids = 0;
+	htids = (ItemPointer) palloc(sizeof(ItemPointerData) * spacenhtids);
+	for (int i = 0; i < ndeletable; i++)
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+
+		itemid = PageGetItemId(page, deletable[i]);
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(ItemIdIsDead(itemid));
+		Assert(!BTreeTupleIsPivot(itup));
+
+		if (!BTreeTupleIsPosting(itup))
+		{
+			if (nhtids + 1 > spacenhtids)
+			{
+				spacenhtids *= 2;
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * spacenhtids);
+			}
+
+			Assert(ItemPointerIsValid(&itup->t_tid));
+			ItemPointerCopy(&itup->t_tid, &htids[nhtids]);
+			nhtids++;
+		}
+		else
+		{
+			int			nposting = BTreeTupleGetNPosting(itup);
+
+			if (nhtids + nposting > spacenhtids)
+			{
+				spacenhtids = Max(spacenhtids * 2, nhtids + nposting);
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * spacenhtids);
+			}
+
+			for (int j = 0; j < nposting; j++)
+			{
+				ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+				Assert(ItemPointerIsValid(htid));
+				ItemPointerCopy(htid, &htids[nhtids]);
+				nhtids++;
+			}
+		}
+	}
+
+	Assert(nhtids >= ndeletable);
+
+	latestRemovedXid =
+		table_compute_xid_horizon_for_tuples(heapRel, htids, nhtids);
+
+	/* be tidy */
+	pfree(htids);
+
+	return latestRemovedXid;
+}
+
 /*
  * Returns true, if the given block has the half-dead flag set.
  */
@@ -2058,6 +2236,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 			xlmeta.fastlevel = metad->btm_fastlevel;
 			xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 			xlmeta.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+			xlmeta.safededup = metad->btm_safededup;
 
 			XLogRegisterBufData(4, (char *) &xlmeta, sizeof(xl_btree_metadata));
 			xlinfo = XLOG_BTREE_UNLINK_PAGE_META;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 8376a5e6b7..eabb839f43 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -95,6 +95,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static ItemPointer btreevacuumposting(BTVacState *vstate, IndexTuple posting,
+									  int *nremaining);
 
 
 /*
@@ -158,7 +160,7 @@ btbuildempty(Relation index)
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, P_NONE, 0);
+	_bt_initmetapage(metapage, P_NONE, 0, _bt_opclasses_support_dedup(index));
 
 	/*
 	 * Write the page and log it.  It might seem that an immediate sync would
@@ -261,8 +263,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 				 */
 				if (so->killedItems == NULL)
 					so->killedItems = (int *)
-						palloc(MaxIndexTuplesPerPage * sizeof(int));
-				if (so->numKilled < MaxIndexTuplesPerPage)
+						palloc(MaxBTreeIndexTuplesPerPage * sizeof(int));
+				if (so->numKilled < MaxBTreeIndexTuplesPerPage)
 					so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 			}
 
@@ -1151,11 +1153,16 @@ restart:
 	}
 	else if (P_ISLEAF(opaque))
 	{
-		OffsetNumber deletable[MaxOffsetNumber];
+		OffsetNumber deletable[MaxIndexTuplesPerPage];
 		int			ndeletable;
+		IndexTuple	updated[MaxIndexTuplesPerPage];
+		OffsetNumber updatable[MaxIndexTuplesPerPage];
+		int			nupdatable;
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
+		int			nhtidsdead,
+					nhtidslive;
 
 		/*
 		 * Trade in the initial read lock for a super-exclusive write lock on
@@ -1187,8 +1194,11 @@ restart:
 		 * point using callback.
 		 */
 		ndeletable = 0;
+		nupdatable = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
+		nhtidsdead = 0;
+		nhtidslive = 0;
 		if (callback)
 		{
 			for (offnum = minoff;
@@ -1196,11 +1206,9 @@ restart:
 				 offnum = OffsetNumberNext(offnum))
 			{
 				IndexTuple	itup;
-				ItemPointer htup;
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
 				/*
 				 * Hot Standby assumes that it's okay that XLOG_BTREE_VACUUM
@@ -1223,22 +1231,90 @@ restart:
 				 * simple, and allows us to always avoid generating our own
 				 * conflicts.
 				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				Assert(!BTreeTupleIsPivot(itup));
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Regular tuple, standard table TID representation */
+					if (callback(&itup->t_tid, callback_state))
+					{
+						deletable[ndeletable++] = offnum;
+						nhtidsdead++;
+					}
+					else
+						nhtidslive++;
+				}
+				else
+				{
+					ItemPointer newhtids;
+					int			nremaining;
+
+					/*
+					 * Posting list tuple, a physical tuple that represents
+					 * two or more logical tuples, any of which could be an
+					 * index row version that must be removed
+					 */
+					newhtids = btreevacuumposting(vstate, itup, &nremaining);
+					if (newhtids == NULL)
+					{
+						/*
+						 * All table TIDs/logical tuples from the posting
+						 * tuple remain, so no delete or update required
+						 */
+						Assert(nremaining == BTreeTupleGetNPosting(itup));
+					}
+					else if (nremaining > 0)
+					{
+						IndexTuple	updatedtuple;
+
+						/*
+						 * Form new tuple that contains only remaining TIDs.
+						 * Remember this new tuple and the offset of the tuple
+						 * to be updated for the page's _bt_delitems_vacuum()
+						 * call.
+						 */
+						Assert(nremaining < BTreeTupleGetNPosting(itup));
+						updatedtuple = _bt_form_posting(itup, newhtids,
+														nremaining);
+						updated[nupdatable] = updatedtuple;
+						updatable[nupdatable++] = offnum;
+						nhtidsdead += BTreeTupleGetNPosting(itup) - nremaining;
+						pfree(newhtids);
+					}
+					else
+					{
+						/*
+						 * All table TIDs/logical tuples from the posting list
+						 * must be deleted.  We'll delete the physical index
+						 * tuple completely (no update).
+						 */
+						Assert(nremaining == 0);
+						deletable[ndeletable++] = offnum;
+						nhtidsdead += BTreeTupleGetNPosting(itup);
+						pfree(newhtids);
+					}
+
+					nhtidslive += nremaining;
+				}
 			}
 		}
 
 		/*
-		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
-		 * call per page, so as to minimize WAL traffic.
+		 * Apply any needed deletes or updates.  We issue just one
+		 * _bt_delitems_vacuum() call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nupdatable > 0)
 		{
-			_bt_delitems_vacuum(rel, buf, deletable, ndeletable);
+			Assert(nhtidsdead >= Max(ndeletable, 1));
+			_bt_delitems_vacuum(rel, buf, deletable, ndeletable, updatable,
+								updated, nupdatable);
 
-			stats->tuples_removed += ndeletable;
+			stats->tuples_removed += nhtidsdead;
 			/* must recompute maxoff */
 			maxoff = PageGetMaxOffsetNumber(page);
+
+			/* can't leak memory here */
+			for (int i = 0; i < nupdatable; i++)
+				pfree(updated[i]);
 		}
 		else
 		{
@@ -1251,6 +1327,7 @@ restart:
 			 * We treat this like a hint-bit update because there's no need to
 			 * WAL-log it.
 			 */
+			Assert(nhtidsdead == 0);
 			if (vstate->cycleid != 0 &&
 				opaque->btpo_cycleid == vstate->cycleid)
 			{
@@ -1260,15 +1337,18 @@ restart:
 		}
 
 		/*
-		 * If it's now empty, try to delete; else count the live tuples. We
-		 * don't delete when recursing, though, to avoid putting entries into
+		 * If it's now empty, try to delete; else count the live tuples (live
+		 * table TIDs in posting lists are counted as live tuples).  We don't
+		 * delete when recursing, though, to avoid putting entries into
 		 * freePages out-of-order (doesn't seem worth any extra code to handle
 		 * the case).
 		 */
 		if (minoff > maxoff)
 			delete_now = (blkno == orig_blkno);
 		else
-			stats->num_index_tuples += maxoff - minoff + 1;
+			stats->num_index_tuples += nhtidslive;
+
+		Assert(!delete_now || nhtidslive == 0);
 	}
 
 	if (delete_now)
@@ -1300,9 +1380,10 @@ restart:
 	/*
 	 * This is really tail recursion, but if the compiler is too stupid to
 	 * optimize it as such, we'd eat an uncomfortably large amount of stack
-	 * space per recursion level (due to the deletable[] array). A failure is
-	 * improbable since the number of levels isn't likely to be large ... but
-	 * just in case, let's hand-optimize into a loop.
+	 * space per recursion level (due to the arrays used to track details of
+	 * deletable/updatable items).  A failure is improbable since the number
+	 * of levels isn't likely to be large ...  but just in case, let's
+	 * hand-optimize into a loop.
 	 */
 	if (recurse_to != P_NONE)
 	{
@@ -1311,6 +1392,67 @@ restart:
 	}
 }
 
+/*
+ * btreevacuumposting --- determine TIDs still needed in posting list
+ *
+ * Returns new palloc'd array of item pointers needed to build
+ * replacement posting list tuple without the TIDs that VACUUM needs to
+ * delete.  Returned value is NULL in the common case no changes are
+ * needed in caller's posting list tuple (we avoid allocating memory
+ * here as an optimization).
+ *
+ * The number of TIDs that should remain in the posting list tuple is
+ * set for caller in *nremaining.  This indicates the number of elements
+ * in the returned array (assuming that return value isn't just NULL).
+ */
+static ItemPointer
+btreevacuumposting(BTVacState *vstate, IndexTuple posting, int *nremaining)
+{
+	int			live = 0;
+	int			nitem = BTreeTupleGetNPosting(posting);
+	ItemPointer tmpitems = NULL,
+				items = BTreeTupleGetPosting(posting);
+
+	for (int i = 0; i < nitem; i++)
+	{
+		if (!vstate->callback(items + i, vstate->callback_state))
+		{
+			/*
+			 * Live table TID.
+			 *
+			 * Only save live TID when we already know that we're going to
+			 * have to kill at least one TID, and have already allocated
+			 * memory.
+			 */
+			if (tmpitems)
+				tmpitems[live] = items[i];
+			live++;
+		}
+		else if (tmpitems == NULL)
+		{
+			/*
+			 * First dead table TID encountered.
+			 *
+			 * It's now clear that we need to delete one or more dead table
+			 * TIDs, so start maintaining an array of live TIDs for caller to
+			 * reconstruct smaller replacement posting list tuple
+			 */
+			tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+			/* Copy live TIDs skipped in previous iterations, if any */
+			if (live > 0)
+				memcpy(tmpitems, items, sizeof(ItemPointerData) * live);
+		}
+		else
+		{
+			/* Second or subsequent dead table TID */
+		}
+	}
+
+	*nremaining = live;
+	return tmpitems;
+}
+
 /*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c573814f01..362e9d9efa 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int	_bt_binsrch_posting(BTScanInsert key, Page page,
+								OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static int	_bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+								  OffsetNumber offnum, ItemPointer heapTid,
+								  IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+									   OffsetNumber offnum,
+									   ItemPointer heapTid, int tupleOffset);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -142,6 +150,7 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
 		offnum = _bt_binsrch(rel, key, *bufP);
 		itemid = PageGetItemId(page, offnum);
 		itup = (IndexTuple) PageGetItem(page, itemid);
+		Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
 		blkno = BTreeTupleGetDownLink(itup);
 		par_blkno = BufferGetBlockNumber(*bufP);
 
@@ -434,7 +443,10 @@ _bt_binsrch(Relation rel,
  * low) makes bounds invalid.
  *
  * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by a posting
+ * list split).
  */
 OffsetNumber
 _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +465,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 	Assert(P_ISLEAF(opaque));
 	Assert(!key->nextkey);
+	Assert(insertstate->postingoff == 0);
 
 	if (!insertstate->bounds_valid)
 	{
@@ -509,6 +522,16 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 			if (result != 0)
 				stricthigh = high;
 		}
+
+		/*
+		 * If tuple at offset located by binary search is a posting list whose
+		 * TID range overlaps with caller's scantid, perform posting list
+		 * binary search to set postingoff for caller.  Caller must split the
+		 * posting list when postingoff is set.  This should happen
+		 * infrequently.
+		 */
+		if (unlikely(result == 0 && key->scantid != NULL))
+			insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
 	}
 
 	/*
@@ -528,6 +551,73 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	return low;
 }
 
+/*----------
+ *	_bt_binsrch_posting() -- posting list binary search.
+ *
+ * Helper routine for _bt_binsrch_insert().
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+	IndexTuple	itup;
+	ItemId		itemid;
+	int			low,
+				high,
+				mid,
+				res;
+
+	/*
+	 * If this isn't a posting tuple, then the index must be corrupt (if it is
+	 * an ordinary non-pivot tuple then there must be an existing tuple with a
+	 * heap TID that equals inserter's new heap TID/scantid).  Defensively
+	 * check that tuple is a posting list tuple whose posting list range
+	 * includes caller's scantid.
+	 *
+	 * (This is also needed because contrib/amcheck's rootdescend option needs
+	 * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+	 */
+	itemid = PageGetItemId(page, offnum);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+	if (!BTreeTupleIsPosting(itup))
+		return 0;
+
+	Assert(key->heapkeyspace && key->safededup);
+
+	/*
+	 * In the event that posting list tuple has LP_DEAD bit set, indicate this
+	 * to _bt_binsrch_insert() caller by returning -1, a sentinel value.  A
+	 * second call to _bt_binsrch_insert() can take place when its caller has
+	 * removed the dead item.
+	 */
+	if (ItemIdIsDead(itemid))
+		return -1;
+
+	/* "high" is past end of posting list for loop invariant */
+	low = 0;
+	high = BTreeTupleGetNPosting(itup);
+	Assert(high >= 2);
+
+	while (high > low)
+	{
+		mid = low + ((high - low) / 2);
+		res = ItemPointerCompare(key->scantid,
+								 BTreeTupleGetPostingN(itup, mid));
+
+		if (res > 0)
+			low = mid + 1;
+		else if (res < 0)
+			high = mid;
+		else
+			return mid;
+	}
+
+	/* Exact match not found */
+	return low;
+}
+
 /*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
@@ -537,9 +627,14 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
  *			<0 if scankey < tuple at offnum;
  *			 0 if scankey == tuple at offnum;
  *			>0 if scankey > tuple at offnum.
- *		NULLs in the keys are treated as sortable values.  Therefore
- *		"equality" does not necessarily mean that the item should be
- *		returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values.  Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key.  Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid.  There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
  *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
@@ -563,6 +658,7 @@ _bt_compare(Relation rel,
 	ScanKey		scankey;
 	int			ncmpkey;
 	int			ntupatts;
+	int32		result;
 
 	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +693,6 @@ _bt_compare(Relation rel,
 	{
 		Datum		datum;
 		bool		isNull;
-		int32		result;
 
 		datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
 
@@ -713,8 +808,25 @@ _bt_compare(Relation rel,
 	if (heapTid == NULL)
 		return 1;
 
+	/*
+	 * scankey must be treated as equal to a posting list tuple if its scantid
+	 * value falls within the range of the posting list.  In all other cases
+	 * there can only be a single heap TID value, which is compared directly
+	 * as a simple scalar value.
+	 */
 	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	return ItemPointerCompare(key->scantid, heapTid);
+	result = ItemPointerCompare(key->scantid, heapTid);
+	if (result <= 0 || !BTreeTupleIsPosting(itup))
+		return result;
+	else
+	{
+		result = ItemPointerCompare(key->scantid,
+									BTreeTupleGetMaxHeapTID(itup));
+		if (result > 0)
+			return 1;
+	}
+
+	return 0;
 }
 
 /*
@@ -1229,7 +1341,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	}
 
 	/* Initialize remaining insertion scan key fields */
-	inskey.heapkeyspace = _bt_heapkeyspace(rel);
+	_bt_metaversion(rel, &inskey.heapkeyspace, &inskey.safededup);
 	inskey.anynullkeys = false; /* unused */
 	inskey.nextkey = nextkey;
 	inskey.pivotsearch = false;
@@ -1484,9 +1596,35 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
-				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				/* tuple passes all scan key conditions */
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Remember it */
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+				else
+				{
+					int			tupleOffset;
+
+					/*
+					 * Set up state to return posting list, and remember first
+					 * "logical" tuple
+					 */
+					tupleOffset =
+						_bt_setuppostingitems(so, itemIndex, offnum,
+											  BTreeTupleGetPostingN(itup, 0),
+											  itup);
+					itemIndex++;
+					/* Remember additional logical tuples */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											tupleOffset);
+						itemIndex++;
+					}
+				}
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1519,7 +1657,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxBTreeIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1527,7 +1665,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxBTreeIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1568,9 +1706,41 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 										 &continuescan);
 			if (passes_quals && tuple_alive)
 			{
-				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				/* tuple passes all scan key conditions */
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Remember it */
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+				else
+				{
+					int			tupleOffset;
+
+					/*
+					 * Set up state to return posting list, and remember first
+					 * "logical" tuple.
+					 *
+					 * Note that we deliberately save/return items from
+					 * posting lists in ascending heap TID order for backwards
+					 * scans.  This allows _bt_killitems() to make a
+					 * consistent assumption about the order of items
+					 * associated with the same posting list tuple.
+					 */
+					itemIndex--;
+					tupleOffset =
+						_bt_setuppostingitems(so, itemIndex, offnum,
+											  BTreeTupleGetPostingN(itup, 0),
+											  itup);
+					/* Remember additional logical tuples */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						itemIndex--;
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											tupleOffset);
+					}
+				}
 			}
 			if (!continuescan)
 			{
@@ -1584,8 +1754,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxBTreeIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxBTreeIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1768,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPivot(itup) && !BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1610,6 +1782,71 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 	}
 }
 
+/*
+ * Setup state to save posting items from a single posting list tuple.  Saves
+ * the logical tuple that will be returned to scan first in passing.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for logical tuple
+ * that is returned to scan first.  Second or subsequent heap TID for posting
+ * list should be saved by calling _bt_savepostingitem().
+ *
+ * Returns an offset into tuple storage space that main tuple is stored at if
+ * needed.
+ */
+static int
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					  ItemPointer heapTid, IndexTuple itup)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	Assert(BTreeTupleIsPosting(itup));
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+	if (so->currTuples)
+	{
+		/* Save base IndexTuple (truncate posting list) */
+		IndexTuple	base;
+		Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+		itupsz = MAXALIGN(itupsz);
+		currItem->tupleOffset = so->currPos.nextTupleOffset;
+		base = (IndexTuple) (so->currTuples + so->currPos.nextTupleOffset);
+		memcpy(base, itup, itupsz);
+		/* Defensively reduce work area index tuple header size */
+		base->t_info &= ~INDEX_SIZE_MASK;
+		base->t_info |= itupsz;
+		so->currPos.nextTupleOffset += itupsz;
+
+		return currItem->tupleOffset;
+	}
+
+	return 0;
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer heapTid, int tupleOffset)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+
+	/*
+	 * Have index-only scans return the same base IndexTuple for every logical
+	 * tuple that originates from the same posting list
+	 */
+	if (so->currTuples)
+		currItem->tupleOffset = tupleOffset;
+}
+
 /*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f163491d60..129fe8668a 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -243,6 +243,7 @@ typedef struct BTPageState
 	BlockNumber btps_blkno;		/* block # to write this page at */
 	IndexTuple	btps_lowkey;	/* page's strict lower bound pivot tuple */
 	OffsetNumber btps_lastoff;	/* last item offset loaded */
+	Size		btps_lastextra; /* last item's extra posting list space */
 	uint32		btps_level;		/* tree level (0 = leaf) */
 	Size		btps_full;		/* "full" if less than this much free space */
 	struct BTPageState *btps_next;	/* link to parent level, if any */
@@ -277,7 +278,10 @@ static void _bt_slideleft(Page page);
 static void _bt_sortaddtup(Page page, Size itemsize,
 						   IndexTuple itup, OffsetNumber itup_off);
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
-						 IndexTuple itup);
+						 IndexTuple itup, Size truncextra);
+static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
+										  BTPageState *state,
+										  BTDedupState dstate);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
@@ -711,6 +715,7 @@ _bt_pagestate(BTWriteState *wstate, uint32 level)
 	state->btps_lowkey = NULL;
 	/* initialize lastoff so first item goes into P_FIRSTKEY */
 	state->btps_lastoff = P_HIKEY;
+	state->btps_lastextra = 0;
 	state->btps_level = level;
 	/* set "full" threshold based on level.  See notes at head of file. */
 	if (level > 0)
@@ -789,7 +794,8 @@ _bt_sortaddtup(Page page,
 }
 
 /*----------
- * Add an item to a disk page from the sort output.
+ * Add an item to a disk page from the sort output (or add a posting list
+ * item formed from the sort output).
  *
  * We must be careful to observe the page layout conventions of nbtsearch.c:
  * - rightmost pages start data items at P_HIKEY instead of at P_FIRSTKEY.
@@ -821,14 +827,27 @@ _bt_sortaddtup(Page page,
  * the truncated high key at offset 1.
  *
  * 'last' pointer indicates the last offset added to the page.
+ *
+ * 'truncextra' is the size of the posting list in itup, if any.  This
+ * information is stashed for the next call here, when we may benefit
+ * from considering the impact of truncating away the posting list on
+ * the page before deciding to finish the page off.  Posting lists are
+ * often relatively large, so it is worth going to the trouble of
+ * accounting for the saving from truncating away the posting list of
+ * the tuple that becomes the high key (that may be the only way to
+ * get close to target free space on the page).  Note that this is
+ * only used for the soft fillfactor-wise limit, not the critical hard
+ * limit.
  *----------
  */
 static void
-_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
+_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
+			 Size truncextra)
 {
 	Page		npage;
 	BlockNumber nblkno;
 	OffsetNumber last_off;
+	Size		last_truncextra;
 	Size		pgspc;
 	Size		itupsz;
 	bool		isleaf;
@@ -842,6 +861,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	npage = state->btps_page;
 	nblkno = state->btps_blkno;
 	last_off = state->btps_lastoff;
+	last_truncextra = state->btps_lastextra;
+	state->btps_lastextra = truncextra;
 
 	pgspc = PageGetFreeSpace(npage);
 	itupsz = IndexTupleSize(itup);
@@ -883,10 +904,10 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	 * page.  Disregard fillfactor and insert on "full" current page if we
 	 * don't have the minimum number of items yet.  (Note that we deliberately
 	 * assume that suffix truncation neither enlarges nor shrinks new high key
-	 * when applying soft limit.)
+	 * when applying soft limit, except when last tuple had a posting list.)
 	 */
 	if (pgspc < itupsz + (isleaf ? MAXALIGN(sizeof(ItemPointerData)) : 0) ||
-		(pgspc < state->btps_full && last_off > P_FIRSTKEY))
+		(pgspc + last_truncextra < state->btps_full && last_off > P_FIRSTKEY))
 	{
 		/*
 		 * Finish off the page and write it out.
@@ -944,11 +965,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * We don't try to bias our choice of split point to make it more
 			 * likely that _bt_truncate() can truncate away more attributes,
 			 * whereas the split point used within _bt_split() is chosen much
-			 * more delicately.  Suffix truncation is mostly useful because it
-			 * improves space utilization for workloads with random
-			 * insertions.  It doesn't seem worthwhile to add logic for
-			 * choosing a split point here for a benefit that is bound to be
-			 * much smaller.
+			 * more delicately.  On the other hand, non-unique index builds
+			 * usually deduplicate, which often results in every "physical"
+			 * tuple on the page having distinct key values.  When that
+			 * happens, _bt_truncate() will never need to include a heap TID
+			 * in the new high key.
 			 *
 			 * Overwrite the old item with new truncated high key directly.
 			 * oitup is already located at the physical beginning of tuple
@@ -983,7 +1004,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(BTreeTupleGetNAtts(state->btps_lowkey, wstate->index) == 0 ||
 			   !P_LEFTMOST((BTPageOpaque) PageGetSpecialPointer(opage)));
 		BTreeTupleSetDownLink(state->btps_lowkey, oblkno);
-		_bt_buildadd(wstate, state->btps_next, state->btps_lowkey);
+		_bt_buildadd(wstate, state->btps_next, state->btps_lowkey, 0);
 		pfree(state->btps_lowkey);
 
 		/*
@@ -1045,6 +1066,47 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	state->btps_lastoff = last_off;
 }
 
+/*
+ * Finalize pending posting list tuple, and add it to the index.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * This is almost like _bt_dedup_finish_pending(), but it adds a new tuple
+ * using _bt_buildadd().
+ */
+static void
+_bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
+							  BTDedupState dstate)
+{
+	IndexTuple	final;
+	Size		truncextra;
+
+	Assert(dstate->nitems > 0);
+	truncextra = 0;
+	if (dstate->nitems == 1)
+		final = dstate->base;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = _bt_form_posting(dstate->base,
+										dstate->htids,
+										dstate->nhtids);
+		final = postingtuple;
+		/* Determine size of posting list */
+		truncextra = IndexTupleSize(final) -
+			BTreeTupleGetPostingOffset(final);
+	}
+
+	_bt_buildadd(wstate, state, final, truncextra);
+
+	if (dstate->nitems > 1)
+		pfree(final);
+	dstate->nhtids = 0;
+	dstate->nitems = 0;
+	dstate->phystupsize = 0;
+}
+
 /*
  * Finish writing out the completed btree.
  */
@@ -1090,7 +1152,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 			Assert(BTreeTupleGetNAtts(s->btps_lowkey, wstate->index) == 0 ||
 				   !P_LEFTMOST(opaque));
 			BTreeTupleSetDownLink(s->btps_lowkey, blkno);
-			_bt_buildadd(wstate, s->btps_next, s->btps_lowkey);
+			_bt_buildadd(wstate, s->btps_next, s->btps_lowkey, 0);
 			pfree(s->btps_lowkey);
 			s->btps_lowkey = NULL;
 		}
@@ -1111,7 +1173,8 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 	 * by filling in a valid magic number in the metapage.
 	 */
 	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, rootblkno, rootlevel);
+	_bt_initmetapage(metapage, rootblkno, rootlevel,
+					 wstate->inskey->safededup);
 	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
 }
 
@@ -1132,6 +1195,9 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		deduplicate;
+
+	deduplicate = wstate->inskey->safededup && BTGetUseDedup(wstate->index);
 
 	if (merge)
 	{
@@ -1228,12 +1294,12 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 
 			if (load1)
 			{
-				_bt_buildadd(wstate, state, itup);
+				_bt_buildadd(wstate, state, itup, 0);
 				itup = tuplesort_getindextuple(btspool->sortstate, true);
 			}
 			else
 			{
-				_bt_buildadd(wstate, state, itup2);
+				_bt_buildadd(wstate, state, itup2, 0);
 				itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
 			}
 
@@ -1243,9 +1309,111 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		}
 		pfree(sortKeys);
 	}
+	else if (deduplicate)
+	{
+		/* merge is unnecessary, deduplicate into posting lists */
+		BTDedupState dstate;
+		IndexTuple	newbase;
+
+		dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
+		dstate->maxitemsize = 0;	/* set later */
+		dstate->checkingunique = false; /* unused */
+		dstate->skippedbase = InvalidOffsetNumber;
+		/* Metadata about base tuple of current pending posting list */
+		dstate->base = NULL;
+		dstate->baseoff = InvalidOffsetNumber;	/* unused */
+		dstate->basetupsize = 0;
+		/* Metadata about current pending posting list TIDs */
+		dstate->htids = NULL;
+		dstate->nhtids = 0;
+		dstate->nitems = 0;
+		dstate->phystupsize = 0;	/* unused */
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate,
+											   true)) != NULL)
+		{
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+			{
+				state = _bt_pagestate(wstate, 0);
+
+				/*
+				 * Limit size of posting list tuples to the size of the free
+				 * space we want to leave behind on the page, plus space for
+				 * final item's line pointer (but make sure that posting list
+				 * tuple size won't exceed the generic 1/3 of a page limit).
+				 *
+				 * This is more conservative than the approach taken in the
+				 * retail insert path, but it allows us to get most of the
+				 * space savings deduplication provides without noticeably
+				 * impacting how much free space is left behind on each leaf
+				 * page.
+				 */
+				dstate->maxitemsize =
+					Min(Min(BTMaxItemSize(state->btps_page), INDEX_SIZE_MASK),
+						MAXALIGN_DOWN(state->btps_full) - sizeof(ItemIdData));
+				/* Minimum posting tuple size used here is arbitrary: */
+				dstate->maxitemsize = Max(dstate->maxitemsize, 100);
+				dstate->htids = palloc(dstate->maxitemsize);
+
+				/*
+				 * No previous/base tuple, since itup is the  first item
+				 * returned by the tuplesort -- use itup as base tuple of
+				 * first pending posting list for entire index build
+				 */
+				newbase = CopyIndexTuple(itup);
+				_bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+			}
+			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
+										 itup) > keysz &&
+					 _bt_dedup_save_htid(dstate, itup))
+			{
+				/*
+				 * Tuple is equal to base tuple of pending posting list, and
+				 * merging itup into pending posting list won't exceed the
+				 * maxitemsize limit.  Heap TID(s) for itup have been saved in
+				 * state.  The next iteration will also end up here if it's
+				 * possible to merge the next tuple into the same pending
+				 * posting list.
+				 */
+			}
+			else
+			{
+				/*
+				 * Tuple is not equal to pending posting list tuple, or
+				 * maxitemsize limit was reached
+				 */
+				_bt_sort_dedup_finish_pending(wstate, state, dstate);
+				/* Base tuple is always a copy */
+				pfree(dstate->base);
+
+				/* itup starts new pending posting list */
+				newbase = CopyIndexTuple(itup);
+				_bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+			}
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+
+		/*
+		 * Handle the last item (there must be a last item when the tuplesort
+		 * returned one or more tuples)
+		 */
+		if (state)
+		{
+			_bt_sort_dedup_finish_pending(wstate, state, dstate);
+			/* Base tuple is always a copy */
+			pfree(dstate->base);
+			pfree(dstate->htids);
+		}
+
+		pfree(dstate);
+	}
 	else
 	{
-		/* merge is unnecessary */
+		/* merging and deduplication are both unnecessary */
 		while ((itup = tuplesort_getindextuple(btspool->sortstate,
 											   true)) != NULL)
 		{
@@ -1253,7 +1421,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			if (state == NULL)
 				state = _bt_pagestate(wstate, 0);
 
-			_bt_buildadd(wstate, state, itup);
+			_bt_buildadd(wstate, state, itup, 0);
 
 			/* Report progress */
 			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 76c2d945c8..8ba055be9e 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
 	state.minfirstrightsz = SIZE_MAX;
 	state.newitemoff = newitemoff;
 
+	/* newitem cannot be a posting list item */
+	Assert(!BTreeTupleIsPosting(newitem));
+
 	/*
 	 * maxsplits should never exceed maxoff because there will be at most as
 	 * many candidate split points as there are points _between_ tuples, once
@@ -459,6 +462,7 @@ _bt_recsplitloc(FindSplitData *state,
 	int16		leftfree,
 				rightfree;
 	Size		firstrightitemsz;
+	Size		postingsz = 0;
 	bool		newitemisfirstonright;
 
 	/* Is the new item going to be the first item on the right page? */
@@ -468,8 +472,30 @@ _bt_recsplitloc(FindSplitData *state,
 	if (newitemisfirstonright)
 		firstrightitemsz = state->newitemsz;
 	else
+	{
 		firstrightitemsz = firstoldonrightsz;
 
+		/*
+		 * Calculate suffix truncation space saving when firstright is a
+		 * posting list tuple, though only when the firstright is over 64
+		 * bytes including line pointer overhead (arbitrary).  This avoids
+		 * accessing the tuple in cases where its posting list must be very
+		 * small (if firstright has one at all).
+		 */
+		if (state->is_leaf && firstrightitemsz > 64)
+		{
+			ItemId		itemid;
+			IndexTuple	newhighkey;
+
+			itemid = PageGetItemId(state->page, firstoldonright);
+			newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+			if (BTreeTupleIsPosting(newhighkey))
+				postingsz = IndexTupleSize(newhighkey) -
+					BTreeTupleGetPostingOffset(newhighkey);
+		}
+	}
+
 	/* Account for all the old tuples */
 	leftfree = state->leftspace - olddataitemstoleft;
 	rightfree = state->rightspace -
@@ -491,11 +517,17 @@ _bt_recsplitloc(FindSplitData *state,
 	 * If we are on the leaf level, assume that suffix truncation cannot avoid
 	 * adding a heap TID to the left half's new high key when splitting at the
 	 * leaf level.  In practice the new high key will often be smaller and
-	 * will rarely be larger, but conservatively assume the worst case.
+	 * will rarely be larger, but conservatively assume the worst case.  We do
+	 * go to the trouble of subtracting away posting list overhead, though
+	 * only when it looks like it will make an appreciable difference.
+	 * (Posting lists are the only case where truncation will typically make
+	 * the final high key far smaller than firstright, so being a bit more
+	 * precise there noticeably improves the balance of free space.)
 	 */
 	if (state->is_leaf)
 		leftfree -= (int16) (firstrightitemsz +
-							 MAXALIGN(sizeof(ItemPointerData)));
+							 MAXALIGN(sizeof(ItemPointerData)) -
+							 postingsz);
 	else
 		leftfree -= (int16) firstrightitemsz;
 
@@ -691,7 +723,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
 	tup = (IndexTuple) PageGetItem(state->page, itemid);
 	/* Do cheaper test first */
-	if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+	if (BTreeTupleIsPosting(tup) ||
+		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
 	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 5ab4e712f1..27299c3f75 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -20,6 +20,7 @@
 #include "access/nbtree.h"
 #include "access/reloptions.h"
 #include "access/relscan.h"
+#include "catalog/catalog.h"
 #include "commands/progress.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -107,7 +108,13 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	 */
 	key = palloc(offsetof(BTScanInsertData, scankeys) +
 				 sizeof(ScanKeyData) * indnkeyatts);
-	key->heapkeyspace = itup == NULL || _bt_heapkeyspace(rel);
+	if (itup)
+		_bt_metaversion(rel, &key->heapkeyspace, &key->safededup);
+	else
+	{
+		key->heapkeyspace = true;
+		key->safededup = _bt_opclasses_support_dedup(rel);
+	}
 	key->anynullkeys = false;	/* initial assumption */
 	key->nextkey = false;
 	key->pivotsearch = false;
@@ -1373,6 +1380,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			continue;
 		}
 
@@ -1534,6 +1542,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			cmpresult = 0;
 			if (subkey->sk_flags & SK_ROW_END)
 				break;
@@ -1773,10 +1782,65 @@ _bt_killitems(IndexScanDesc scan)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
+			bool		killtuple = false;
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			if (BTreeTupleIsPosting(ituple))
 			{
-				/* found the item */
+				int			pi = i + 1;
+				int			nposting = BTreeTupleGetNPosting(ituple);
+				int			j;
+
+				/*
+				 * Note that we rely on the assumption that heap TIDs in the
+				 * scanpos items array are always in ascending heap TID order
+				 * within a posting list
+				 */
+				for (j = 0; j < nposting; j++)
+				{
+					ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+					if (!ItemPointerEquals(item, &kitem->heapTid))
+						break;	/* out of posting list loop */
+
+					/* kitem must have matching offnum when heap TIDs match */
+					Assert(kitem->indexOffset == offnum);
+
+					/*
+					 * Read-ahead to later kitems here.
+					 *
+					 * We rely on the assumption that not advancing kitem here
+					 * will prevent us from considering the posting list tuple
+					 * fully dead by not matching its next heap TID in next
+					 * loop iteration.
+					 *
+					 * If, on the other hand, this is the final heap TID in
+					 * the posting list tuple, then tuple gets killed
+					 * regardless (i.e. we handle the case where the last
+					 * kitem is also the last heap TID in the last index tuple
+					 * correctly -- posting tuple still gets killed).
+					 */
+					if (pi < numKilled)
+						kitem = &so->currPos.items[so->killedItems[pi++]];
+				}
+
+				/*
+				 * Don't bother advancing the outermost loop's int iterator to
+				 * avoid processing killed items that relate to the same
+				 * offnum/posting list tuple.  This micro-optimization hardly
+				 * seems worth it.  (Further iterations of the outermost loop
+				 * will fail to match on this same posting list's first heap
+				 * TID instead, so we'll advance to the next offnum/index
+				 * tuple pretty quickly.)
+				 */
+				if (j == nposting)
+					killtuple = true;
+			}
+			else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+				killtuple = true;
+
+			if (killtuple)
+			{
+				/* found the item/all posting list items */
 				ItemIdMarkDead(iid);
 				killedsomething = true;
 				break;			/* out of inner search loop */
@@ -2017,7 +2081,9 @@ btoptions(Datum reloptions, bool validate)
 	static const relopt_parse_elt tab[] = {
 		{"fillfactor", RELOPT_TYPE_INT, offsetof(BTOptions, fillfactor)},
 		{"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
-		offsetof(BTOptions, vacuum_cleanup_index_scale_factor)}
+		offsetof(BTOptions, vacuum_cleanup_index_scale_factor)},
+		{"deduplication", RELOPT_TYPE_BOOL,
+		offsetof(BTOptions, deduplication)}
 
 	};
 
@@ -2118,11 +2184,10 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	Size		newsize;
 
 	/*
-	 * We should only ever truncate leaf index tuples.  It's never okay to
-	 * truncate a second time.
+	 * We should only ever truncate non-pivot tuples from leaf pages.  It's
+	 * never okay to truncate when splitting an internal page.
 	 */
-	Assert(BTreeTupleGetNAtts(lastleft, rel) == natts);
-	Assert(BTreeTupleGetNAtts(firstright, rel) == natts);
+	Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
 
 	/* Determine how many attributes must be kept in truncated tuple */
 	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
@@ -2138,6 +2203,19 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(pivot))
+		{
+			/*
+			 * index_truncate_tuple() just returns a straight copy of
+			 * firstright when it has no key attributes to truncate.  We need
+			 * to truncate away the posting list ourselves.
+			 */
+			Assert(keepnatts == nkeyatts);
+			Assert(natts == nkeyatts);
+			pivot->t_info &= ~INDEX_SIZE_MASK;
+			pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+		}
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2154,6 +2232,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * attribute to the new pivot tuple.
 		 */
 		Assert(natts != nkeyatts);
+		Assert(!BTreeTupleIsPosting(lastleft) &&
+			   !BTreeTupleIsPosting(firstright));
 		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
 		tidpivot = palloc0(newsize);
 		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2171,6 +2251,18 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
 		pivot = palloc0(newsize);
 		memcpy(pivot, firstright, IndexTupleSize(firstright));
+
+		if (BTreeTupleIsPosting(pivot))
+		{
+			/*
+			 * New pivot tuple was copied from firstright, which happens to be
+			 * a posting list tuple.  We will have to include a lastleft heap
+			 * TID in the final pivot, but we can remove the posting list now.
+			 * (Pivot tuples should never contain a posting list.)
+			 */
+			newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+				MAXALIGN(sizeof(ItemPointerData));
+		}
 	}
 
 	/*
@@ -2198,7 +2290,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2209,9 +2301,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
+							  BTreeTupleGetHeapTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #else
 
 	/*
@@ -2224,7 +2319,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2233,7 +2328,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2314,13 +2410,16 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
  * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal after detoasting.
+ * unless they're bitwise equal after detoasting.  When an index is considered
+ * deduplication-safe by _bt_opclasses_support_dedup, routine is guaranteed to
+ * give the same result as _bt_keep_natts would.
  *
- * These issues must be acceptable to callers, typically because they're only
- * concerned about making suffix truncation as effective as possible without
- * leaving excessive amounts of free space on either side of page split.
- * Callers can rely on the fact that attributes considered equal here are
- * definitely also equal according to _bt_keep_natts.
+ * Suffix truncation callers can rely on the fact that attributes considered
+ * equal here are definitely also equal according to _bt_keep_natts, even when
+ * the index uses an opclass or collation that is not deduplication-safe.
+ * This weaker guarantee is good enough for these callers, since false
+ * negatives generally only have the effect of making leaf page splits use a
+ * more balanced split point.
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2398,22 +2497,36 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
+	/* !heapkeyspace indexes do not support deduplication */
+	if (!heapkeyspace && BTreeTupleIsPosting(itup))
+		return false;
+
+	/* Posting list tuples should never have "pivot heap TID" bit set */
+	if (BTreeTupleIsPosting(itup) &&
+		(ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+		 BT_PIVOT_HEAP_TID_ATTR) != 0)
+		return false;
+
+	/* INCLUDE indexes do not support deduplication */
+	if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+		return false;
+
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
 			/*
-			 * Non-pivot tuples currently never use alternative heap TID
-			 * representation -- even those within heapkeyspace indexes
+			 * Non-pivot tuple should never be explicitly marked as a pivot
+			 * tuple
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated.  (Note that tupnatts must have been
-			 * inferred, rather than coming from an explicit on-disk
-			 * representation.)
+			 * inferred, even with a posting list tuple, because only pivot
+			 * tuples store tupnatts directly.)
 			 */
 			return tupnatts == natts;
 		}
@@ -2457,12 +2570,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * non-zero, or when there is no explicit representation and the
 			 * tuple is evidently not a pre-pg_upgrade tuple.
 			 *
-			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
-			 * that to decide if the tuple is a pre-v11 tuple.
+			 * Prior to v11, downlinks always had P_HIKEY as their offset.
+			 * Accept that as an alternative indication of a valid
+			 * !heapkeyspace negative infinity tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
-				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+				ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
 		}
 		else
 		{
@@ -2488,7 +2601,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
+		return false;
+
+	/* Pivot tuple should not use posting list representation (redundant) */
+	if (BTreeTupleIsPosting(itup))
 		return false;
 
 	/*
@@ -2558,11 +2675,54 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					BTMaxItemSizeNoHeapTid(page),
 					RelationGetRelationName(rel)),
 			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
-					   ItemPointerGetBlockNumber(&newtup->t_tid),
-					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+					   ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
 					   RelationGetRelationName(heap)),
 			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
 					 "Consider a function index of an MD5 hash of the value, "
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Is it safe to perform deduplication for an index, given the opclasses and
+ * collations used?
+ *
+ * Returned value is stored in index metapage during index builds.  Function
+ * does not account for incompatibilities caused by index being on an earlier
+ * nbtree version.
+ */
+bool
+_bt_opclasses_support_dedup(Relation index)
+{
+	/* INCLUDE indexes don't support deduplication */
+	if (IndexRelationGetNumberOfAttributes(index) !=
+		IndexRelationGetNumberOfKeyAttributes(index))
+		return false;
+
+	/*
+	 * There is no reason why deduplication cannot be used with system catalog
+	 * indexes.  However, we deem it generally unsafe because it's not clear
+	 * how it could be disabled.  (ALTER INDEX is not supported with system
+	 * catalog indexes, so users have no way to set the "deduplicate" storage
+	 * parameter.)
+	 */
+	if (IsCatalogRelation(index))
+		return false;
+
+	for (int i = 0; i < IndexRelationGetNumberOfKeyAttributes(index); i++)
+	{
+		Oid			opfamily = index->rd_opfamily[i];
+		Oid			collation = index->rd_indcollation[i];
+
+		/* TODO add adequate check of opclasses and collations */
+		elog(DEBUG4, "index %s column i %d opfamilyOid %u collationOid %u",
+			 RelationGetRelationName(index), i, opfamily, collation);
+
+		/* NUMERIC btree opfamily OID is 1988 */
+		if (opfamily == 1988)
+			return false;
+	}
+
+	return true;
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 2e5202c2d6..9a4b522950 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -22,6 +22,9 @@
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/procarray.h"
+#include "utils/memutils.h"
+
+static MemoryContext opCtx;		/* working memory for operations */
 
 /*
  * _bt_restore_page -- re-enter all the index tuples on a page
@@ -111,6 +114,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
 	Assert(md->btm_version >= BTREE_NOVAC_VERSION);
 	md->btm_oldest_btpo_xact = xlrec->oldest_btpo_xact;
 	md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
+	md->btm_safededup = xlrec->safededup;
 
 	pageop = (BTPageOpaque) PageGetSpecialPointer(metapg);
 	pageop->btpo_flags = BTP_META;
@@ -156,7 +160,8 @@ _bt_clear_incomplete_split(XLogReaderState *record, uint8 block_id)
 }
 
 static void
-btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
+btree_xlog_insert(bool isleaf, bool ismeta, bool posting,
+				  XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	xl_btree_insert *xlrec = (xl_btree_insert *) XLogRecGetData(record);
@@ -181,9 +186,52 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 
 		page = BufferGetPage(buffer);
 
-		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
-						false, false) == InvalidOffsetNumber)
-			elog(PANIC, "btree_xlog_insert: failed to add item");
+		if (!posting)
+		{
+			/* Simple retail insertion */
+			if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+		}
+		else
+		{
+			ItemId		itemid;
+			IndexTuple	oposting,
+						newitem,
+						nposting;
+			uint16		postingoff;
+
+			/*
+			 * A posting list split occurred during leaf page insertion.  WAL
+			 * record data will start with an offset number representing the
+			 * point in an existing posting list that a split occurs at.
+			 *
+			 * Use _bt_swap_posting() to repeat posting list split steps from
+			 * primary.  Note that newitem from WAL record is 'orignewitem',
+			 * not the final version of newitem that is actually inserted on
+			 * page.
+			 */
+			postingoff = *((uint16 *) datapos);
+			datapos += sizeof(uint16);
+			datalen -= sizeof(uint16);
+
+			itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+			oposting = (IndexTuple) PageGetItem(page, itemid);
+
+			/* Use mutable, aligned newitem copy in _bt_swap_posting() */
+			Assert(isleaf && postingoff > 0);
+			newitem = CopyIndexTuple((IndexTuple) datapos);
+			nposting = _bt_swap_posting(newitem, oposting, postingoff);
+
+			/* Replace existing posting list with post-split version */
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+			/* Insert "final" new item (not orignewitem from WAL stream) */
+			Assert(IndexTupleSize(newitem) == datalen);
+			if (PageAddItem(page, (Item) newitem, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add posting split new item");
+		}
 
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
@@ -265,20 +313,38 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 		BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
 		OffsetNumber off;
 		IndexTuple	newitem = NULL,
-					left_hikey = NULL;
+					left_hikey = NULL,
+					nposting = NULL;
 		Size		newitemsz = 0,
 					left_hikeysz = 0;
 		Page		newlpage;
-		OffsetNumber leftoff;
+		OffsetNumber leftoff,
+					replacepostingoff = InvalidOffsetNumber;
 
 		datapos = XLogRecGetBlockData(record, 0, &datalen);
 
-		if (onleft)
+		if (onleft || xlrec->postingoff != 0)
 		{
 			newitem = (IndexTuple) datapos;
 			newitemsz = MAXALIGN(IndexTupleSize(newitem));
 			datapos += newitemsz;
 			datalen -= newitemsz;
+
+			if (xlrec->postingoff != 0)
+			{
+				ItemId		itemid;
+				IndexTuple	oposting;
+
+				/* Posting list must be at offset number before new item's */
+				replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+				/* Use mutable, aligned newitem copy in _bt_swap_posting() */
+				newitem = CopyIndexTuple(newitem);
+				itemid = PageGetItemId(lpage, replacepostingoff);
+				oposting = (IndexTuple) PageGetItem(lpage, itemid);
+				nposting = _bt_swap_posting(newitem, oposting,
+											xlrec->postingoff);
+			}
 		}
 
 		/*
@@ -308,8 +374,20 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 			Size		itemsz;
 			IndexTuple	item;
 
+			/* Add replacement posting list when required */
+			if (off == replacepostingoff)
+			{
+				Assert(onleft || xlrec->firstright == xlrec->newitemoff);
+				if (PageAddItem(newlpage, (Item) nposting,
+								MAXALIGN(IndexTupleSize(nposting)), leftoff,
+								false, false) == InvalidOffsetNumber)
+					elog(ERROR, "failed to add new posting list item to left page after split");
+				leftoff = OffsetNumberNext(leftoff);
+				continue;		/* don't insert oposting */
+			}
+
 			/* add the new item if it was inserted on left page */
-			if (onleft && off == xlrec->newitemoff)
+			else if (onleft && off == xlrec->newitemoff)
 			{
 				if (PageAddItem(newlpage, (Item) newitem, newitemsz, leftoff,
 								false, false) == InvalidOffsetNumber)
@@ -383,6 +461,56 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 	}
 }
 
+static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+	Buffer		buf;
+
+	if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+	{
+		Page		page = (Page) BufferGetPage(buf);
+		OffsetNumber offnum;
+		BTDedupState state;
+
+		state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+		state->maxitemsize = Min(BTMaxItemSize(page), INDEX_SIZE_MASK);
+		state->checkingunique = false;	/* unused */
+		state->skippedbase = InvalidOffsetNumber;
+		state->base = NULL;
+		state->baseoff = InvalidOffsetNumber;
+		state->basetupsize = 0;
+		state->htids = NULL;
+		state->nhtids = 0;
+		state->nitems = 0;
+		state->phystupsize = 0;
+		state->htids = palloc(state->maxitemsize);
+
+		for (offnum = xlrec->baseoff;
+			 offnum < xlrec->baseoff + xlrec->nitems;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemid = PageGetItemId(page, offnum);
+			IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+			if (offnum == xlrec->baseoff)
+				_bt_dedup_start_pending(state, itup, offnum);
+			else if (!_bt_dedup_save_htid(state, itup))
+				elog(ERROR, "could not add heap tid to pending posting list");
+		}
+
+		Assert(state->nitems == xlrec->nitems);
+		_bt_dedup_finish_pending(buf, state, false);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buf);
+	}
+
+	if (BufferIsValid(buf))
+		UnlockReleaseBuffer(buf);
+}
+
 static void
 btree_xlog_vacuum(XLogReaderState *record)
 {
@@ -405,7 +533,31 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		page = (Page) BufferGetPage(buffer);
 
-		PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
+		if (xlrec->nupdated > 0)
+		{
+			OffsetNumber *updatedoffsets;
+			IndexTuple	updated;
+			Size		itemsz;
+
+			updatedoffsets = (OffsetNumber *)
+				(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+			updated = (IndexTuple) ((char *) updatedoffsets +
+									xlrec->nupdated * sizeof(OffsetNumber));
+
+			for (int i = 0; i < xlrec->nupdated; i++)
+			{
+				itemsz = MAXALIGN(IndexTupleSize(updated));
+
+				if (!PageIndexTupleOverwrite(page, updatedoffsets[i],
+											 (Item) updated, itemsz))
+					elog(PANIC, "could not update partially dead item");
+
+				updated = (IndexTuple) ((char *) updated + itemsz);
+			}
+		}
+
+		if (xlrec->ndeleted > 0)
+			PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 
 		/*
 		 * Mark the page as not containing any LP_DEAD items --- see comments
@@ -724,17 +876,22 @@ void
 btree_redo(XLogReaderState *record)
 {
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	MemoryContext oldCtx;
 
+	oldCtx = MemoryContextSwitchTo(opCtx);
 	switch (info)
 	{
 		case XLOG_BTREE_INSERT_LEAF:
-			btree_xlog_insert(true, false, record);
+			btree_xlog_insert(true, false, false, record);
 			break;
 		case XLOG_BTREE_INSERT_UPPER:
-			btree_xlog_insert(false, false, record);
+			btree_xlog_insert(false, false, false, record);
 			break;
 		case XLOG_BTREE_INSERT_META:
-			btree_xlog_insert(false, true, record);
+			btree_xlog_insert(false, true, false, record);
+			break;
+		case XLOG_BTREE_INSERT_POST:
+			btree_xlog_insert(true, false, true, record);
 			break;
 		case XLOG_BTREE_SPLIT_L:
 			btree_xlog_split(true, record);
@@ -742,6 +899,9 @@ btree_redo(XLogReaderState *record)
 		case XLOG_BTREE_SPLIT_R:
 			btree_xlog_split(false, record);
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			btree_xlog_dedup(record);
+			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
 			break;
@@ -767,6 +927,23 @@ btree_redo(XLogReaderState *record)
 		default:
 			elog(PANIC, "btree_redo: unknown op code %u", info);
 	}
+	MemoryContextSwitchTo(oldCtx);
+	MemoryContextReset(opCtx);
+}
+
+void
+btree_xlog_startup(void)
+{
+	opCtx = AllocSetContextCreate(CurrentMemoryContext,
+								  "Btree recovery temporary context",
+								  ALLOCSET_DEFAULT_SIZES);
+}
+
+void
+btree_xlog_cleanup(void)
+{
+	MemoryContextDelete(opCtx);
+	opCtx = NULL;
 }
 
 /*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 7d63a7124e..68fad1c91f 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -27,6 +27,7 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 		case XLOG_BTREE_INSERT_LEAF:
 		case XLOG_BTREE_INSERT_UPPER:
 		case XLOG_BTREE_INSERT_META:
+		case XLOG_BTREE_INSERT_POST:
 			{
 				xl_btree_insert *xlrec = (xl_btree_insert *) rec;
 
@@ -38,15 +39,25 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
-				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d",
-								 xlrec->level, xlrec->firstright, xlrec->newitemoff);
+				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, postingoff %d",
+								 xlrec->level, xlrec->firstright,
+								 xlrec->newitemoff, xlrec->postingoff);
+				break;
+			}
+		case XLOG_BTREE_DEDUP_PAGE:
+			{
+				xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+				appendStringInfo(buf, "baseoff %u; nitems %u",
+								 xlrec->baseoff, xlrec->nitems);
 				break;
 			}
 		case XLOG_BTREE_VACUUM:
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "ndeleted %u", xlrec->ndeleted);
+				appendStringInfo(buf, "ndeleted %u; nupdated %u",
+								 xlrec->ndeleted, xlrec->nupdated);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
@@ -130,6 +141,12 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			id = "DEDUPLICATE";
+			break;
+		case XLOG_BTREE_INSERT_POST:
+			id = "INSERT_POST";
+			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index f47176753d..32ff03b3e4 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1055,8 +1055,10 @@ PageIndexTupleDeleteNoCompact(Page page, OffsetNumber offnum)
  * This is better than deleting and reinserting the tuple, because it
  * avoids any data shifting when the tuple size doesn't change; and
  * even when it does, we avoid moving the line pointers around.
- * Conceivably this could also be of use to an index AM that cares about
- * the physical order of tuples as well as their ItemId order.
+ * This could be used by an index AM that doesn't want to unset the
+ * LP_DEAD bit when it happens to be set.  It could conceivably also be
+ * used by an index AM that cares about the physical order of tuples as
+ * well as their logical/ItemId order.
  *
  * If there's insufficient space for the new tuple, return false.  Other
  * errors represent data-corruption problems, so we just elog.
@@ -1142,7 +1144,8 @@ PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
 	}
 
 	/* Update the item's tuple length (other fields shouldn't change) */
-	ItemIdSetNormal(tupid, offset + size_diff, newsize);
+	tupid->lp_off = offset + size_diff;
+	tupid->lp_len = newsize;
 
 	/* Copy new tuple data onto page */
 	memcpy(PageGetItem(page, tupid), newtup, newsize);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index a4e5d0886a..f44f2ce93f 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -28,6 +28,7 @@
 
 #include "access/commit_ts.h"
 #include "access/gin.h"
+#include "access/nbtree.h"
 #include "access/rmgr.h"
 #include "access/tableam.h"
 #include "access/transam.h"
@@ -1091,6 +1092,15 @@ static struct config_bool ConfigureNamesBool[] =
 		false,
 		check_bonjour, NULL, NULL
 	},
+	{
+		{"btree_deduplication", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Enables B-tree index deduplication optimization."),
+			NULL
+		},
+		&btree_deduplication,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"track_commit_timestamp", PGC_POSTMASTER, REPLICATION,
 			gettext_noop("Collects transaction commit time."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 087190ce63..739676b9d0 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -651,6 +651,7 @@
 #vacuum_cleanup_index_scale_factor = 0.1	# fraction of total number of tuples
 						# before index cleanup, 0 always performs
 						# index cleanup
+#btree_deduplication = on
 #bytea_output = 'hex'			# hex, escape
 #xmlbinary = 'base64'
 #xmloption = 'content'
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 2fd88866c9..c374c64e2a 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1685,14 +1685,14 @@ psql_completion(const char *text, int start, int end)
 	/* ALTER INDEX <foo> SET|RESET ( */
 	else if (Matches("ALTER", "INDEX", MatchAny, "RESET", "("))
 		COMPLETE_WITH("fillfactor",
-					  "vacuum_cleanup_index_scale_factor",	/* BTREE */
+					  "vacuum_cleanup_index_scale_factor", "deduplication",	/* BTREE */
 					  "fastupdate", "gin_pending_list_limit",	/* GIN */
 					  "buffering",	/* GiST */
 					  "pages_per_range", "autosummarize"	/* BRIN */
 			);
 	else if (Matches("ALTER", "INDEX", MatchAny, "SET", "("))
 		COMPLETE_WITH("fillfactor =",
-					  "vacuum_cleanup_index_scale_factor =",	/* BTREE */
+					  "vacuum_cleanup_index_scale_factor =", "deduplication =",	/* BTREE */
 					  "fastupdate =", "gin_pending_list_limit =",	/* GIN */
 					  "buffering =",	/* GiST */
 					  "pages_per_range =", "autosummarize ="	/* BRIN */
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 6a058ccdac..b533a99300 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -145,6 +145,7 @@ static void bt_tuple_present_callback(Relation index, ItemPointer tid,
 									  bool tupleIsAlive, void *checkstate);
 static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 									 IndexTuple itup);
+static inline IndexTuple bt_posting_logical_tuple(IndexTuple itup, int n);
 static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 											   OffsetNumber offset);
@@ -167,6 +168,7 @@ static ItemId PageGetItemIdCareful(BtreeCheckState *state, BlockNumber block,
 								   Page page, OffsetNumber offset);
 static inline ItemPointer BTreeTupleGetHeapTIDCareful(BtreeCheckState *state,
 													  IndexTuple itup, bool nonpivot);
+static inline ItemPointer BTreeTupleGetPointsToTID(IndexTuple itup);
 
 /*
  * bt_index_check(index regclass, heapallindexed boolean)
@@ -278,7 +280,8 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
 
 	if (btree_index_mainfork_expected(indrel))
 	{
-		bool	heapkeyspace;
+		bool		heapkeyspace,
+					safededup;
 
 		RelationOpenSmgr(indrel);
 		if (!smgrexists(indrel->rd_smgr, MAIN_FORKNUM))
@@ -288,7 +291,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
 							RelationGetRelationName(indrel))));
 
 		/* Check index, possibly against table it is an index on */
-		heapkeyspace = _bt_heapkeyspace(indrel);
+		_bt_metaversion(indrel, &heapkeyspace, &safededup);
 		bt_check_every_level(indrel, heaprel, heapkeyspace, parentcheck,
 							 heapallindexed, rootdescend);
 	}
@@ -419,12 +422,13 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		/*
 		 * Size Bloom filter based on estimated number of tuples in index,
 		 * while conservatively assuming that each block must contain at least
-		 * MaxIndexTuplesPerPage / 5 non-pivot tuples.  (Non-leaf pages cannot
-		 * contain non-pivot tuples.  That's okay because they generally make
-		 * up no more than about 1% of all pages in the index.)
+		 * MaxBTreeIndexTuplesPerPage / 3 "logical" tuples.  heapallindexed
+		 * verification fingerprints posting list heap TIDs as plain non-pivot
+		 * tuples, complete with index keys.  This allows its heap scan to
+		 * behave as if posting lists do not exist.
 		 */
 		total_pages = RelationGetNumberOfBlocks(rel);
-		total_elems = Max(total_pages * (MaxIndexTuplesPerPage / 5),
+		total_elems = Max(total_pages * (MaxBTreeIndexTuplesPerPage / 3),
 						  (int64) state->rel->rd_rel->reltuples);
 		/* Random seed relies on backend srandom() call to avoid repetition */
 		seed = random();
@@ -924,6 +928,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -954,13 +959,15 @@ bt_target_page_check(BtreeCheckState *state)
 		if (!_bt_check_natts(state->rel, state->heapkeyspace, state->target,
 							 offset))
 		{
+			ItemPointer tid;
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetPointsToTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -994,18 +1001,21 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
+			ItemPointer tid = BTreeTupleGetPointsToTID(itup);
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
-			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+			htid = psprintf("(%u,%u)", ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1017,6 +1027,40 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is actually a posting list, make sure posting list TIDs
+		 * are in order.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+				current = BTreeTupleGetPostingN(itup, i);
+
+				if (ItemPointerCompare(current, &last) <= 0)
+				{
+					char	   *itid = psprintf("(%u,%u)", state->targetblock, offset);
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s posting list offset=%d page lsn=%X/%X.",
+												itid, i,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1049,13 +1093,14 @@ bt_target_page_check(BtreeCheckState *state)
 		if (tupsize > (lowersizelimit ? BTMaxItemSize(state->target) :
 					   BTMaxItemSizeNoHeapTid(state->target)))
 		{
+			ItemPointer tid = BTreeTupleGetPointsToTID(itup);
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1074,12 +1119,32 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				/* Fingerprint all elements as distinct "logical" tuples */
+				for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					IndexTuple	logtuple;
+
+					logtuple = bt_posting_logical_tuple(itup, i);
+					norm = bt_normalize_tuple(state, logtuple);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != logtuple)
+						pfree(norm);
+					pfree(logtuple);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
@@ -1087,7 +1152,8 @@ bt_target_page_check(BtreeCheckState *state)
 		 *
 		 * If there is a high key (if this is not the rightmost page on its
 		 * entire level), check that high key actually is upper bound on all
-		 * page items.
+		 * page items.  If this is a posting list tuple, we'll need to set
+		 * scantid to be highest TID in posting list.
 		 *
 		 * We prefer to check all items against high key rather than checking
 		 * just the last and trusting that the operator class obeys the
@@ -1127,17 +1193,22 @@ bt_target_page_check(BtreeCheckState *state)
 		 * tuple. (See also: "Notes About Data Representation" in the nbtree
 		 * README.)
 		 */
+		scantid = skey->scantid;
+		if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+			skey->scantid = BTreeTupleGetMaxHeapTID(itup);
+
 		if (!P_RIGHTMOST(topaque) &&
 			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
 			  invariant_l_offset(state, skey, P_HIKEY)))
 		{
+			ItemPointer tid = BTreeTupleGetPointsToTID(itup);
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1150,6 +1221,7 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
+		skey->scantid = scantid;
 
 		/*
 		 * * Item order check *
@@ -1160,15 +1232,17 @@ bt_target_page_check(BtreeCheckState *state)
 		if (OffsetNumberNext(offset) <= max &&
 			!invariant_l_offset(state, skey, OffsetNumberNext(offset)))
 		{
+			ItemPointer tid;
 			char	   *itid,
 					   *htid,
 					   *nitid,
 					   *nhtid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetPointsToTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 			nitid = psprintf("(%u,%u)", state->targetblock,
 							 OffsetNumberNext(offset));
 
@@ -1177,9 +1251,10 @@ bt_target_page_check(BtreeCheckState *state)
 										  state->target,
 										  OffsetNumberNext(offset));
 			itup = (IndexTuple) PageGetItem(state->target, itemid);
+			tid = BTreeTupleGetPointsToTID(itup);
 			nhtid = psprintf("(%u,%u)",
-							 ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							 ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							 ItemPointerGetBlockNumberNoCheck(tid),
+							 ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1953,10 +2028,10 @@ bt_tuple_present_callback(Relation index, ItemPointer tid, Datum *values,
  * verification.  In particular, it won't try to normalize opclass-equal
  * datums with potentially distinct representations (e.g., btree/numeric_ops
  * index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though.  For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation.  Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
  */
 static IndexTuple
 bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -1969,6 +2044,9 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	IndexTuple	reformed;
 	int			i;
 
+	/* Caller should only pass "logical" non-pivot tuples here */
+	Assert(!BTreeTupleIsPosting(itup) && !BTreeTupleIsPivot(itup));
+
 	/* Easy case: It's immediately clear that tuple has no varlena datums */
 	if (!IndexTupleHasVarwidths(itup))
 		return itup;
@@ -2031,6 +2109,30 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	return reformed;
 }
 
+/*
+ * Produce palloc()'d "logical" tuple for nth posting list entry.
+ *
+ * In general, deduplication is not supposed to change the logical contents of
+ * an index.  Multiple logical index tuples are merged together into one
+ * physical posting list index tuple when convenient.
+ *
+ * heapallindexed verification must normalize-away this variation in
+ * representation by converting posting list tuples into two or more "logical"
+ * tuples.  Each logical tuple must be fingerprinted separately -- there must
+ * be one logical tuple for each corresponding Bloom filter probe during the
+ * heap scan.
+ *
+ * Note: Caller needs to call bt_normalize_tuple() with returned tuple.
+ */
+static inline IndexTuple
+bt_posting_logical_tuple(IndexTuple itup, int n)
+{
+	Assert(BTreeTupleIsPosting(itup));
+
+	/* Returns non-posting-list tuple */
+	return _bt_form_posting(itup, BTreeTupleGetPostingN(itup, n), 1);
+}
+
 /*
  * Search for itup in index, starting from fast root page.  itup must be a
  * non-pivot tuple.  This is only supported with heapkeyspace indexes, since
@@ -2087,6 +2189,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		insertstate.itup = itup;
 		insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 		insertstate.itup_key = key;
+		insertstate.postingoff = 0;
 		insertstate.bounds_valid = false;
 		insertstate.buf = lbuf;
 
@@ -2094,7 +2197,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		offnum = _bt_binsrch_insert(state->rel, &insertstate);
 		/* Compare first >= matching item on leaf page, if any */
 		page = BufferGetPage(lbuf);
+		/* Should match on first heap TID when tuple has a posting list */
 		if (offnum <= PageGetMaxOffsetNumber(page) &&
+			insertstate.postingoff <= 0 &&
 			_bt_compare(state->rel, key, page, offnum) == 0)
 			exists = true;
 		_bt_relbuf(state->rel, lbuf);
@@ -2548,26 +2653,69 @@ PageGetItemIdCareful(BtreeCheckState *state, BlockNumber block, Page page,
 }
 
 /*
- * BTreeTupleGetHeapTID() wrapper that lets caller enforce that a heap TID must
- * be present in cases where that is mandatory.
- *
- * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
- * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
- * It may become more useful in the future, when non-pivot tuples support their
- * own alternative INDEX_ALT_TID_MASK representation.
+ * BTreeTupleGetHeapTID() wrapper that enforces that a heap TID is present in
+ * cases where that is mandatory (i.e. for non-pivot tuples)
  */
 static inline ItemPointer
 BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
 							bool nonpivot)
 {
-	ItemPointer result = BTreeTupleGetHeapTID(itup);
-	BlockNumber targetblock = state->targetblock;
+	ItemPointer htid;
 
-	if (result == NULL && nonpivot)
+	/*
+	 * Caller determines whether this is supposed to be a pivot or non-pivot
+	 * tuple using page type and item offset number.  Verify that tuple
+	 * metadata agrees with this.
+	 */
+	Assert(state->heapkeyspace);
+	if (BTreeTupleIsPivot(itup) && nonpivot)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg_internal("block %u or its right sibling block or child block in index \"%s\" has unexpected pivot tuple",
+								 state->targetblock,
+								 RelationGetRelationName(state->rel))));
+
+	if (!BTreeTupleIsPivot(itup) && !nonpivot)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg_internal("block %u or its right sibling block or child block in index \"%s\" has unexpected non-pivot tuple",
+								 state->targetblock,
+								 RelationGetRelationName(state->rel))));
+
+	htid = BTreeTupleGetHeapTID(itup);
+	if (!ItemPointerIsValid(htid) && nonpivot)
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
-						targetblock, RelationGetRelationName(state->rel))));
+						state->targetblock,
+						RelationGetRelationName(state->rel))));
 
-	return result;
+	return htid;
+}
+
+/*
+ * Return the "pointed to" TID for itup, which is used to generate a
+ * descriptive error message.  itup must be a "data item" tuple (it wouldn't
+ * make much sense to call here with a high key tuple, since there won't be a
+ * valid downlink/block number to display).
+ *
+ * Returns either a heap TID (which will be the first heap TID in posting list
+ * if itup is posting list tuple), or a TID that contains downlink block
+ * number, plus some encoded metadata (e.g., the number of attributes present
+ * in itup).
+ */
+static inline ItemPointer
+BTreeTupleGetPointsToTID(IndexTuple itup)
+{
+	/*
+	 * Rely on the assumption that !heapkeyspace internal page data items will
+	 * correctly return TID with downlink here -- BTreeTupleGetHeapTID() won't
+	 * recognize it as a pivot tuple, but everything still works out because
+	 * the t_tid field is still returned
+	 */
+	if (!BTreeTupleIsPivot(itup))
+		return BTreeTupleGetHeapTID(itup);
+
+	/* Pivot tuple returns TID with downlink block (heapkeyspace variant) */
+	return &itup->t_tid;
 }
diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
index 5881ea5dd6..059477be1e 100644
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@@ -433,11 +433,122 @@ returns bool
 <sect1 id="btree-implementation">
  <title>Implementation</title>
 
+ <para>
+  Internally, a B-tree index consists of a tree structure with leaf
+  pages.  Each leaf page contains tuples that point to table entries
+  using a heap item pointer. Each tuple's key is considered unique
+  internally, since the item pointer is treated as part of the key.
+ </para>
+ <para>
+  An introduction to the btree index implementation can be found in
+  <filename>src/backend/access/nbtree/README</filename>.
+ </para>
+
+ <sect2 id="btree-deduplication">
+  <title>Deduplication</title>
   <para>
-   An introduction to the btree index implementation can be found in
-   <filename>src/backend/access/nbtree/README</filename>.
+   B-Tree supports <firstterm>deduplication</firstterm>.  Existing
+   leaf page tuples with fully equal keys (equal prior to the heap
+   item pointer) are merged together into a single <quote>posting
+   list</quote> tuple.  The keys appear only once in this
+   representation.  A simple array of heap item pointers follows.
+   Posting lists are formed <quote>lazily</quote>, when a new item is
+   inserted that cannot fit on an existing leaf page.  The immediate
+   goal of the deduplication process is to at least free enough space
+   to fit the new item; otherwise a leaf page split occurs, which
+   allocates a new leaf page.  The <firstterm>key space</firstterm>
+   covered by the original leaf page is shared among the original page,
+   and its new right sibling page.
+  </para>
+  <para>
+   Deduplication can greatly increase index space efficiency with data
+   sets where each distinct key appears at least a few times on
+   average.  It can also reduce the cost of subsequent index scans,
+   especially when many leaf pages must be accessed.  For example, an
+   index on a simple <type>integer</type> column that uses
+   deduplication will have a storage size that is only about 65% of an
+   equivalent unoptimized index when each distinct
+   <type>integer</type> value appears three times.  If each distinct
+   <type>integer</type> value appears six times, the storage overhead
+   can be as low as 50% of baseline.  With hundreds of duplicates per
+   distinct value (or with larger <quote>base</quote> key values) a
+   storage size of about <emphasis>one third</emphasis> of the
+   unoptimized case is expected.  There is often a direct benefit for
+   queries, as well as an indirect benefit due to reduced I/O during
+   routine vacuuming.
+  </para>
+  <para>
+   Cases that don't benefit due to having no duplicate values will
+   incur a small performance penalty with mixed read-write workloads.
+   There is no performance penalty with read-only workloads, since
+   reading from posting lists is at least as efficient as reading the
+   standard index tuple representation.
+  </para>
+ </sect2>
+
+ <sect2 id="btree-deduplication-configure">
+  <title>Configuring Deduplication</title>
+
+  <para>
+   The <xref linkend="guc-btree-deduplication"/> configuration
+   parameter controls deduplication.  By default, deduplication is
+   enabled.  The <literal>deduplication</literal> storage parameter
+   can be used to override the configuration paramater for individual
+   indexes.  See <xref linkend="sql-createindex-storage-parameters"/>
+   from the <command>CREATE INDEX</command> documentation for details.
+  </para>
+ </sect2>
+
+ <sect2 id="btree-deduplication-restrictions">
+  <title>Restrictions</title>
+
+  <para>
+   Deduplication can only be used with indexes that use B-Tree
+   operator classes that were declared <literal>BITWISE</literal>.  In
+   practice almost all datatypes support deduplication, though
+   <type>numeric</type> is a notable exception (the <quote>display
+   scale</quote> feature makes it impossible to enable deduplication
+   without losing useful information about equal <type>numeric</type>
+   datums).  Deduplication is not supported with nondeterministic
+   collations, nor is it supported with <literal>INCLUDE</literal>
+   indexes.
+  </para>
+  <para>
+   Note that a multicolumn index is only considered to have duplicates
+   when there are index entries that repeat entire
+   <emphasis>combinations</emphasis> of values (the values stored in
+   each and every column must be equal).
   </para>
 
+ </sect2>
+
+ <sect2 id="btree-deduplication-unique">
+  <title>Internal use of Deduplication in unique indexes</title>
+
+  <para>
+   Page splits that occur due to inserting multiple physical versions
+   (rather than inserting new logical rows) tend to degrade the
+   structure of indexes, especially in the case of unique indexes.
+   Unique indexes use deduplication <emphasis>internally</emphasis>
+   and <emphasis>selectively</emphasis> to delay (and ideally to
+   prevent) these <quote>unnecessary</quote> page splits.
+   <command>VACUUM</command> or autovacuum will eventually remove dead
+   versions of tuples from every index in any case, but usually cannot
+   reverse page splits (in general, the page must be completely empty
+   before <command>VACUUM</command> can <quote>delete</quote> it).
+  </para>
+  <para>
+   The <xref linkend="guc-btree-deduplication"/> configuration
+   parameter does not affect whether or not deduplication is used
+   within unique indexes.  The internal use of deduplication for
+   unique indexes is subject to all of the same restrictions as
+   deduplication in general.  The <literal>deduplication</literal>
+   storage parameter can be set to <literal>OFF</literal> to disable
+   deduplication in unique indexes, but this is intended only as a
+   debugging option for developers.
+  </para>
+
+ </sect2>
 </sect1>
 
 </chapter>
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 55669b5cad..9f371d3e3a 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -928,10 +928,11 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
      nondeterministic collations give a more <quote>correct</quote> behavior,
      especially when considering the full power of Unicode and its many
      special cases, they also have some drawbacks.  Foremost, their use leads
-     to a performance penalty.  Also, certain operations are not possible with
-     nondeterministic collations, such as pattern matching operations.
-     Therefore, they should be used only in cases where they are specifically
-     wanted.
+     to a performance penalty.  Note, in particular, that B-tree cannot use
+     deduplication with indexes that use a nondeterministic collation.  Also,
+     certain operations are not possible with nondeterministic collations,
+     such as pattern matching operations.  Therefore, they should be used
+     only in cases where they are specifically wanted.
     </para>
    </sect3>
   </sect2>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5d1c90282f..05f442d57a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8021,6 +8021,31 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-btree-deduplication" xreflabel="btree_deduplication">
+      <term><varname>btree_deduplication</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>btree_deduplication</varname></primary>
+       <secondary>configuration parameter</secondary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Controls whether deduplication should be used within B-Tree
+        indexes.  Deduplication is an optimization that reduces the
+        storage size of indexes by storing equal index keys only once.
+        See <xref linkend="btree-deduplication"/> for more
+        information.
+       </para>
+
+       <para>
+        This setting can be overridden for individual B-Tree indexes
+        by changing index storage parameters.  See <xref
+        linkend="sql-createindex-storage-parameters"/> from the
+        <command>CREATE INDEX</command> documentation for details.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-bytea-output" xreflabel="bytea_output">
       <term><varname>bytea_output</varname> (<type>enum</type>)
       <indexterm>
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 629a31ef79..e6cdba4c29 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -166,6 +166,8 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
         maximum size allowed for the index type, data insertion will fail.
         In any case, non-key columns duplicate data from the index's table
         and bloat the size of the index, thus potentially slowing searches.
+        Moreover, B-tree deduplication is never used with indexes that
+        have a non-key column.
        </para>
 
        <para>
@@ -388,10 +390,39 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    </variablelist>
 
    <para>
-    B-tree indexes additionally accept this parameter:
+    B-tree indexes also accept these parameters:
    </para>
 
    <variablelist>
+   <varlistentry id="index-reloption-deduplication" xreflabel="deduplication">
+    <term><literal>deduplication</literal>
+     <indexterm>
+      <primary><varname>deduplication</varname></primary>
+      <secondary>storage parameter</secondary>
+     </indexterm>
+    </term>
+    <listitem>
+    <para>
+      Per-index value for <xref linkend="guc-btree-deduplication"/>.
+      Controls usage of the B-tree deduplication technique described
+      in <xref linkend="btree-deduplication"/>.  Set to
+      <literal>ON</literal> or <literal>OFF</literal> to override GUC.
+      (Alternative spellings of <literal>ON</literal> and
+      <literal>OFF</literal> are allowed as described in <xref
+      linkend="config-setting"/>.)
+    </para>
+
+    <note>
+     <para>
+      Turning <literal>deduplication</literal> off via <command>ALTER
+      INDEX</command> prevents future insertions from triggering
+      deduplication, but does not in itself make existing posting list
+      tuples use the standard tuple representation.
+     </para>
+    </note>
+    </listitem>
+   </varlistentry>
+
    <varlistentry id="index-reloption-vacuum-cleanup-index-scale-factor" xreflabel="vacuum_cleanup_index_scale_factor">
     <term><literal>vacuum_cleanup_index_scale_factor</literal>
      <indexterm>
@@ -446,9 +477,7 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
      This setting controls usage of the fast update technique described in
      <xref linkend="gin-fast-update"/>.  It is a Boolean parameter:
      <literal>ON</literal> enables fast update, <literal>OFF</literal> disables it.
-     (Alternative spellings of <literal>ON</literal> and <literal>OFF</literal> are
-     allowed as described in <xref linkend="config-setting"/>.)  The
-     default is <literal>ON</literal>.
+     The default is <literal>ON</literal>.
     </para>
 
     <note>
diff --git a/src/test/regress/expected/btree_index.out b/src/test/regress/expected/btree_index.out
index f567117a46..e32c8fa826 100644
--- a/src/test/regress/expected/btree_index.out
+++ b/src/test/regress/expected/btree_index.out
@@ -266,6 +266,22 @@ select * from btree_bpchar where f1::bpchar like 'foo%';
  fool
 (2 rows)
 
+--
+-- Perform unique checking, with and without the use of deduplication
+--
+CREATE TABLE dedup_unique_test_table (a int) WITH (autovacuum_enabled=false);
+CREATE UNIQUE INDEX dedup_unique ON dedup_unique_test_table (a) WITH (deduplication=on);
+CREATE UNIQUE INDEX plain_unique ON dedup_unique_test_table (a) WITH (deduplication=off);
+-- Generate enough garbage tuples in index to ensure that even the unique index
+-- with deduplication enabled has to check multiple leaf pages during unique
+-- checking (at least with a BLCKSZ of 8192 or less)
+DO $$
+BEGIN
+    FOR r IN 1..1350 LOOP
+        DELETE FROM dedup_unique_test_table;
+        INSERT INTO dedup_unique_test_table SELECT 1;
+    END LOOP;
+END$$;
 --
 -- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
diff --git a/src/test/regress/sql/btree_index.sql b/src/test/regress/sql/btree_index.sql
index 558dcae0ec..627ba80bc1 100644
--- a/src/test/regress/sql/btree_index.sql
+++ b/src/test/regress/sql/btree_index.sql
@@ -103,6 +103,23 @@ explain (costs off)
 select * from btree_bpchar where f1::bpchar like 'foo%';
 select * from btree_bpchar where f1::bpchar like 'foo%';
 
+--
+-- Perform unique checking, with and without the use of deduplication
+--
+CREATE TABLE dedup_unique_test_table (a int) WITH (autovacuum_enabled=false);
+CREATE UNIQUE INDEX dedup_unique ON dedup_unique_test_table (a) WITH (deduplication=on);
+CREATE UNIQUE INDEX plain_unique ON dedup_unique_test_table (a) WITH (deduplication=off);
+-- Generate enough garbage tuples in index to ensure that even the unique index
+-- with deduplication enabled has to check multiple leaf pages during unique
+-- checking (at least with a BLCKSZ of 8192 or less)
+DO $$
+BEGIN
+    FOR r IN 1..1350 LOOP
+        DELETE FROM dedup_unique_test_table;
+        INSERT INTO dedup_unique_test_table SELECT 1;
+    END LOOP;
+END$$;
+
 --
 -- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
-- 
2.17.1

#127

Peter Geoghegan

pg@bowt.ie

about 6 years ago

In reply to: Heikki Linnakangas (#126)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Wed, Jan 8, 2020 at 5:56 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 04/01/2020 03:47, Peter Geoghegan wrote:

Attached is v28, which fixes bitrot from my recent commits to refactor
VACUUM-related code in nbtpage.c.

I started to read through this gigantic patch.

Oh come on, it's not that big. :-)

I got about 1/3 way
through. I wrote minor comments directly in the attached patch file,
search for "HEIKKI:". I wrote them as I read the patch from beginning to
end, so it's possible that some of my questions are answered later in
the patch. I didn't have the stamina to read through the whole patch
yet, I'll continue later.

Thanks for the review! Anything that you've written that I do not
respond to directly can be assumed to have been accepted by me.

I'll start with responses to the points that you raise in your patch
that need a response

Patch comments
==============

* Furthermore, deduplication can be turned on or off as needed, or
applied HEIKKI: When would it be needed?

I believe that hardly anybody will want to turn off deduplication in
practice. My point here is that we're flexible -- we're not
maintaining posting lists like GIN. We're just deduplicating as and
when needed. We can change our preference about that any time. Turning
off deduplication won't magically undo past deduplications, of course,
but everything mostly works in the same way when deduplication is on
or off. We're just talking about an alternative physical
representation of the same logical contents.

* HEIKKI: How do LP_DEAD work on posting list tuples?

Same as before, except that it applies to all TIDs in the tuple
together (will mention this in commit message, though). Note that the
fact that we delay deduplication also means that we delay merging the
LP_DEAD bits. And we always prefer to remove LP_DEAD items. Finally,
we refuse to do a posting list split when its LP_DEAD bit is set, so
it's now possible to delete LP_DEAD bit set tuples a little early,
before a page split has to be avoided -- see the new code and comments
at the end of _bt_findinsertloc().

See also: my later response to your e-mail remarks on LP_DEAD bits,
unique indexes, and space accounting.

* HEIKKI: When is it [deduplication] not safe?

With opclasses like btree/numeric_ops, where display scale messes
things up. See this thread for more information on the infrastructure
that we need for that:

/messages/by-id/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com

* HEIKKI: Why is it safe to read on version 3 indexes? Because unused
space is set to zeros?

Yes. Same applies to version 4 indexes that come from Postgres 12 --
users must REINDEX to call _bt_opclasses_support_dedup() and set
metapage field, but we can rely on the new field being all zeroes
before that happens. (It would be possible to teach pg_upgrade to set
the field for compatible indexes from Postgres 12, but I don't want to
bother with that. We probably cannot safely call
_bt_opclasses_support_dedup() with a buffer lock held, so that seems
like the only way.)

* HEIKKI: Do we need it as a separate flag, isn't it always safe with
version 4 indexes, and never with version 3?

No, it isn't *always* safe with version 4 indexes, for reasons that
have nothing to do with the on-disk representation (like the display
scale issue, nondeterministic collations, etc). It really is a
distinct condition. (Deduplication is never safe with version 3
indexes, obviously.)

It occurs to me now that we probably don't even want to make the
metapage field about deduplication (though that's what it says right
now). Rather, it should be about supporting a general category of
optimizations that include deduplication, and might also include
prefix compression in the future. Note that whether or not we should
actually apply these optimizations is always a separate question.

* + * Non-pivot tuples complement pivot tuples, which only have key
columns. HEIKKI: What does it mean that they complement pivot
tuples?

It means that all tuples are either pivot tuples, or are non-pivot tuples.

* + * safely (index storage parameter separately indicates if
deduplication is HEIKKI: Is there really an "index storage
parameter" for that? What is that, something in the WITH clause?

Yes, there is actually an index storage parameter named
"deduplication" (something in the WITH clause). This is deliberately
not named "btree_deduplication", the current name of the GUC. This
exists to make the optimization controllable at the index level.
(Though I should probably mention the GUC first in this code comment,
or not even mention the less significant storage parameter.)

* HEIKKI: How much memory does this [BTScanPosData.items array of
width MaxBTreeIndexTuplesPerPage] need now? Should we consider
pallocing this separately?

But BTScanPosData isn't allocated on the stack anyway.

* HEIKKI: Would it be more clear to have a separate struct for the
posting list split case? (i.e. don't reuse xl_btree_insert)

I doubt it, but I'm open to it. We don't do it that way in a number of
existing cases.

* HEIKKI: Do we only generate one posting list in one WAL record? I
would assume it's better to deduplicate everything on the page, since
we're modifying it anyway.

You might be right about that. Let me get back to you on that.

HEIKKI: Does this [xl_btree_vacuum WAL record] store a whole copy of
the remaining posting list on an updated tuple? Wouldn't it be simpler
and more space-efficient to store just the deleted TIDs?

It would certainly be more space efficient in cases where we delete
some but not all TIDs -- hard to know how much that matters. Don't
think that it would be simpler, though.

I have an open mind about this. I can try it the other way if you like.

* HEIKKI: Do we ever do that? Do we ever set the LP_DEAD bit on a
posting list tuple?

As I said, we are able to set LP_DEAD bits on posting list tuples, if
and only if all the TIDs are dead (i.e. if all-but-one TID is dead, it
cannot be set). This limitation does not seem to matter in practice,
in part because LP_DEAD bits can be set before we deduplicate --
that's another benefit of delaying deduplication until the point where
we'd usually have to split the page.

See also: my later response to your e-mail remarks on LP_DEAD bits,
unique indexes, and space accounting.

* HEIKKI: Well, it's optimized for that today, but if it [a posting
list] was compressed, a btree would be useful in more situations...

I agree, but I think that we should do compression by inventing a new
type of leaf page that only stores TIDs, and use that when we do a
single value mode split in nbtsplitloc.c. So we don't even use tuples
at that point (except the high key), and we compress the entire page.
That way, we don't have to worry about posting list splits and stuff
like that, which seems like the best of both worlds. Maybe we can use
a true bitmap on these special leaf pages.

... Now to answer the feedback from your actual e-mail ...

E-mail
======

One major design question here is about the LP_DEAD tuples. There's
quite a lot of logic and heuristics and explanations related to unique
indexes.

The unique index stuff hasn't really been discussed on the thread
until now. Those parts are all my work.

To make them behave differently from non-unique indexes, to
keep the LP_DEAD optimization effective. What if we had a separate
LP_DEAD flag for every item in a posting list, instead? I think we
wouldn't need to treat unique indexes differently from non-unique
indexes, then.

I don't think that that's quite true -- it's not so much about LP_DEAD
bits as it is about our *goals* with unique indexes. We have no reason
to deduplicate other than to delay an immediate page split, so it
isn't really about space efficiency. Having individual LP_DEAD bits
for each TID wouldn't change the picture for _bt_dedup_one_page() -- I
would still want a checkingunique flag there. But individual LP_DEAD
bits would make a lot of other things much more complicated. Unique
indexes are kind of special, in general.

The thing that I prioritized keeping simple in the patch is page space
accounting, particularly the nbtsplitloc.c logic, which doesn't need
any changes to continue to work (it's also important for page space
accounting to be simple within _bt_dedup_one_page()). I did teach
nbtsplitloc.c to take posting lists from the firstright tuple into
account, but only because they're often unusually large, making it a
worthwhile optimization. Exactly the same thing could take place with
non-key INCLUDE columns, but typically the extra payload is not very
large, so I haven't bothered with that before now.

If you had a "supplemental header" to store per-TID LP_DEAD bits, that
would make things complicated for page space accounting. Even if it
was only one byte, you'd have to worry about it taking up an entire
extra MAXALIGN() quantum within _bt_dedup_one_page(). And then there
is the complexity within _bt_killitems(), needed to make the
kill_prior_tuple stuff work. I might actually try to do it that way if
I thought that it would perform better, or be simpler than what I came
up with. I doubt that, though.

In summary: while it would be possible to have per-TID LP_DEAD bits,
but I don't think it would be even remotely worth it. I can go into my
doubts about the performance benefits if you want.

Note also: I tend to think of the LP_DEAD bit setting within
_bt_check_unique() as almost a separate optimization to the
kill_prior_tuple stuff, even though they both involve LP_DEAD bits.
The former is much more important than the latter. The
kill_prior_tuple thing was severely regressed in Postgres 9.5 without
anyone really noticing [1]/messages/by-id/CAH2-Wz=SfAKVMv1x9Jh19EJ8am8TZn9f-yECipS9HrrRqSswnA@mail.gmail.com -- Peter Geoghegan.

Another important decision here is the on-disk format of these tuples.
The format of IndexTuples on a b-tree page has become really
complicated. The v12 changes to store TIDs in order did a lot of that,
but this makes it even more complicated.

It adds two new functions: BTreeTupleIsPivot(), and
BTreeTupleIsPosting(). This means that there are three basic kinds of
B-Tree tuple layout. We can detect which kind any given tuple is in a
low context way. The three possible cases are:

* Pivot tuples.

* Posting list tuples (non-pivot tuples that have at least two head TIDs).

* Regular/small non-pivot tuples -- this representation has never
changed in all the time I've worked on Postgres.

You'll notice that there are lots of new assertions, including in
places that don't have anything to do with the new code --
BTreeTupleIsPivot() and BTreeTupleIsPosting() assertions.

I think that there is only really one wart here that tends to come up
outside the nbtree.h code itself again and again: the fact that
!heapkeyspace indexes may give false negatives when
BTreeTupleIsPivot() is used. So any BTreeTupleIsPivot() assertion has
to include some nearby heapkeyspace field to cover that case (or else
make sure that the index is a v4+/heapkeyspace index in some other
way). Note, however, that we can safely assert !BTreeTupleIsPivot() --
that won't produce spurious assertion failures with heapkeyspace
indexes. Note also that the new BTreeTupleIsPosting() function works
reliably on all B-Tree versions.

The only future requirements that I can anticipate for the tuple format in are:

1. The need to support wider TIDs. (I am strongly of the opinion that
this shouldn't work all that differently to what we have now.)

2. The need for a page-level prefix compression feature. This can work
by requiring decompression code to assume that the common prefix for
the page just isn't present.

This seems doable within the confines of the current/proposed B-Tree
tuple format. Though we still need to have a serious discussion about
the future of TIDs in light of stuff like ZedStore. I think that fully
logical table identifiers are worth supporting, but they had better
behave pretty much like a TID within index access method code -- they
better show temporal and spatial locality in about the same way TIDs
do. They should be compared as generic integers, and accept reasonable
limits on TID width. It should be possible to do cheap binary searches
on posting lists in about the same way.

I know there are strong
backwards-compatibility reasons for the current format, but
nevertheless, if we were to design this from scratch, what would the
B-tree page and tuple format be like?

That's a good question, but my answer depends on the scope of the question.

If you define "from scratch" to mean "5 years ago", then I believe
that it would be exactly the same as what we have now. I specifically
anticipated the need to have posting list TIDs (existing v12 era
comments in nbtree.h and amcheck things about posting lists). And what
I came up with is almost the same as the GIN format, except that we
have explicit pivot tuples (to make suffix truncation work), and use
the 13th IndexTupleData header bit (INDEX_ALT_TID_MASK) in a way that
makes it possible to store non-pivot tuples in a space-efficient way
when they are all unique. A plain GIN tuple used an extra MAXALIGN()
quantum to store an entry tree tuple that only has one TID.

If, on the other hand, you're talking about a totally green field
situation, then I would probably not use IndexTuple at all. I think
that a representation that stores offsets right in the tuple header
(so no separate varlena headers) has more advantages than
disadvantages. It would make it easier to do both suffix truncation
and prefix compression. It also makes it cheap to skip to the end of
the tuple. In general, it would be nice if the IndexTupleData TID was
less special, but that assumption is baked into a lot of code -- most
of which is technically not in nbtree.

We expect very specific things about the alignment of TIDs -- they are
assumed to be 3 SHORTALIGN()'d uint16 fields. Within nbtree, we assume
SHORTALIGN()'d access to the t_info field by IndexTupleSize() will be
okay within btree_xlog_split(). I bet that there are a number of
subtle assumptions about our use of IndexTupleData + ItemPointerData
that we have no idea about right now. So changing it won't be that
easy.

As for page level stuff, I believe that we mostly do things the right
way already. I would prefer it if the line pointer array was at the
end of the page so that tuples could go at the start of the page, and
be appending front to back (maybe the special area would still be at
the end). That's a very general issue, though -- Andres says that that
would help branch prediction, though I'm not sure of the details
offhand.

Questions
=========

Finally, I have some specific questions for you about the patch:

1. How do you feel about the design of posting list splits, and my
explanation of that design in the nbtree README?

2. How do you feel about the idea of stopping VACUUM from clearing the
BTP_HAS_GARBAGE page level flag?

I suspect that it's much better to have it falsely set than to have it
falsely unset. The extra cost is that we do a useless extra call to
_bt_vacuum_one_page(), but that's very cheap in the context of having
to deal with a page that's full, that we might have to split (or
deduplicate) anyway. But the extra benefit could perhaps be quite
large. This question doesn't really have that much to do with
deduplication.

[1]: /messages/by-id/CAH2-Wz=SfAKVMv1x9Jh19EJ8am8TZn9f-yECipS9HrrRqSswnA@mail.gmail.com -- Peter Geoghegan
--
Peter Geoghegan

#128

Peter Geoghegan

pg@bowt.ie

about 6 years ago

In reply to: Peter Geoghegan (#127)

3 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Wed, Jan 8, 2020 at 2:56 PM Peter Geoghegan <pg@bowt.ie> wrote:

Thanks for the review! Anything that you've written that I do not
respond to directly can be assumed to have been accepted by me.

Here is a version with most of the individual changes you asked for --
this is v29. I just pushed a couple of small tweaks to nbtree.h, that
you suggested I go ahead with immediately. v29 also refactors some of
the "single value strategy" stuff in nbtdedup.c. This is code that
anticipates the needs of nbtsplitloc.c's single value strategy --
deduplication is designed to work together with page
splits/nbtsplitloc.c.

Still, v29 doesn't resolve the following points you've raised, where I
haven't reached a final opinion on what to do myself. These items are
as follows (I'm quoting your modified patch file sent on January 8th
here):

* HEIKKI: Do we only generate one posting list in one WAL record? I
would assume it's better to deduplicate everything on the page, since
we're modifying it anyway.

* HEIKKI: Does xl_btree_vacuum WAL record store a whole copy of the
remaining posting list on an updated tuple? Wouldn't it be simpler and
more space-efficient to store just the deleted TIDs?

* HEIKKI: Would it be more clear to have a separate struct for the
posting list split case? (i.e. don't reuse xl_btree_insert)

v29 of the patch also doesn't change anything about how LP_DEAD bits
work, apart from going into the LP_DEAD stuff in the commit message.
This doesn't seem to be in the same category as the other three open
items, since it seems like we disagree here -- that must be worked out
through further discussion and/or benchmarking.

--
Peter Geoghegan

Attachments:

v29-0001-Add-deduplication-to-nbtree.patchapplication/octet-stream; name=v29-0001-Add-deduplication-to-nbtree.patchDownload

From 8133c6b7226624abcf8509acd47cf229114fd345 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 25 Sep 2019 10:08:53 -0700
Subject: [PATCH v29 1/3] Add deduplication to nbtree.

Deduplication reduces the storage overhead of duplicates in indexes that
use the standard nbtree index access method.  The deduplication process
is applied lazily, after the point where opportunistic deletion of
LP_DEAD-marked index tuples occurs.  Deduplication is only applied at
the point where a leaf page split would otherwise be required.  New
"posting list tuples" are formed by merging together existing duplicate
tuples.  The physical representation of the items on an nbtree leaf page
is made more space efficient by deduplication, but the logical contents
of the page are not changed.

Testing has shown that nbtree deduplication can generally make indexes
with about 10 or 15 tuples for each distinct key value about 2.5X - 4X
smaller, even with single column integer indexes (e.g., an index on a
referencing column that accompanies a foreign key).  The final size of
single column nbtree indexes comes close to the final size of a similar
contrib/btree_gin index, at least in cases where GIN's posting list
compression isn't very effective.

The lazy approach taken by nbtree has significant advantages over a
GIN style eager approach.  Most individual inserts of index tuples have
exactly the same overhead as before.  The extra overhead of
deduplication is amortized across insertions, just like the overhead of
page splits.  The key space of indexes works in the same way as it has
since commit dd299df8 (the commit which made heap TID a tiebreaker
column), since only the physical representation of tuples is changed.
Furthermore, deduplication can easily be turned on or off.  The split
point choice logic doesn't need to be changed, since posting list tuples
are just tuples with payload, much like tuples with non-key columns in
INCLUDE indexes.  (nbtsplitloc.c is still optimized to make intelligent
choices in the presence of posting list tuples, though only because
suffix truncation will routinely make new high keys far far smaller than
the non-pivot tuple they're derived from).

In general, nbtree unique indexes sometimes need to store multiple equal
(non-NULL) tuples for the same logical row (one per physical row
version).  Unique indexes can use deduplication specifically to merge
together multiple physical versions (index tuples), though the overall
strategy used there is somewhat different.  The high-level goal with
unique indexes is to prevent "unnecessary" page splits -- splits caused
only by a short term burst of index tuple versions.  This is often a
concern with frequently updated tables where UPDATEs always modify at
least one indexed column (making it impossible for the table am to use
an optimization like heapam's heap-only tuples optimization).
Deduplication in unique indexes effectively "buys time" for existing
nbtree garbage collection mechanisms to run and prevent these page
splits (the LP_DEAD bit setting performed during the uniqueness check is
the most important mechanism for controlling bloat with affected
workloads).

Since posting list tuples have only one line pointer (just like any
other tuple), they have only one LP_DEAD bit.  The LP_DEAD bit can still
be set by both unique checking and the kill_prior_tuple optimization,
but only when all heap TIDs are dead-to-all.  This "loss of granularity"
for LP_DEAD bits is considered an acceptable downside of the
deduplication design.  We always prefer deleting LP_DEAD items to a
deduplication pass, and a deduplication pass can only take place at the
point where we'd previously have had to split the page, so any workload
that pays a cost here must also get a significant benefit.

Bump XLOG_PAGE_MAGIC because xl_btree_vacuum changed.

No bump in BTREE_VERSION, since deduplication only affects the physical
representation of tuples.  However, users must still REINDEX a
pg_upgrade'd index to before its leaf page splits will apply
deduplication.  An index build is the only way to set the new nbtree
metapage flag indicating that deduplication is generally safe.

Author: Anastasia Lubennikova, Peter Geoghegan
Reviewed-By: Peter Geoghegan, Heikki Linnakangas
Discussion:
    https://postgr.es/m/55E4051B.7020209@postgrespro.ru
    https://postgr.es/m/4ab6e2db-bcee-f4cf-0916-3a06e6ccbb55@postgrespro.ru
---
 src/include/access/nbtree.h                   | 410 ++++++++--
 src/include/access/nbtxlog.h                  |  96 ++-
 src/include/access/rmgrlist.h                 |   2 +-
 src/backend/access/common/reloptions.c        |   9 +
 src/backend/access/index/genam.c              |   4 +
 src/backend/access/nbtree/Makefile            |   1 +
 src/backend/access/nbtree/README              | 151 +++-
 src/backend/access/nbtree/nbtdedup.c          | 774 ++++++++++++++++++
 src/backend/access/nbtree/nbtinsert.c         | 342 +++++++-
 src/backend/access/nbtree/nbtpage.c           | 224 ++++-
 src/backend/access/nbtree/nbtree.c            | 180 +++-
 src/backend/access/nbtree/nbtsearch.c         | 271 +++++-
 src/backend/access/nbtree/nbtsort.c           | 190 ++++-
 src/backend/access/nbtree/nbtsplitloc.c       |  39 +-
 src/backend/access/nbtree/nbtutils.c          | 228 +++++-
 src/backend/access/nbtree/nbtxlog.c           | 202 ++++-
 src/backend/access/rmgrdesc/nbtdesc.c         |  23 +-
 src/backend/storage/page/bufpage.c            |   9 +-
 src/backend/utils/misc/guc.c                  |  10 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/bin/psql/tab-complete.c                   |   4 +-
 contrib/amcheck/verify_nbtree.c               | 230 +++++-
 doc/src/sgml/btree.sgml                       | 116 ++-
 doc/src/sgml/charset.sgml                     |   9 +-
 doc/src/sgml/config.sgml                      |  25 +
 doc/src/sgml/ref/create_index.sgml            |  38 +-
 src/test/regress/expected/btree_index.out     |  16 +
 src/test/regress/sql/btree_index.sql          |  17 +
 28 files changed, 3318 insertions(+), 303 deletions(-)
 create mode 100644 src/backend/access/nbtree/nbtdedup.c

diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 20ace69dab..3d7477442c 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -24,6 +24,9 @@
 #include "storage/bufmgr.h"
 #include "storage/shm_toc.h"
 
+/* GUC parameter */
+extern bool deduplicate_btree_items;
+
 /* There's room for a 16-bit vacuum cycle ID in BTPageOpaqueData */
 typedef uint16 BTCycleId;
 
@@ -108,6 +111,7 @@ typedef struct BTMetaPageData
 										 * pages */
 	float8		btm_last_cleanup_num_heap_tuples;	/* number of heap tuples
 													 * during last cleanup */
+	bool		btm_safededup;	/* deduplication known to be safe? */
 } BTMetaPageData;
 
 #define BTPageGetMeta(p) \
@@ -115,14 +119,17 @@ typedef struct BTMetaPageData
 
 /*
  * The current Btree version is 4.  That's what you'll get when you create
- * a new index.
+ * a new index.  The btm_safededup field can only be set if this happened
+ * on Postgres 13, but it's safe to read with version 3 indexes.
  *
  * Btree version 3 was used in PostgreSQL v11.  It is mostly the same as
  * version 4, but heap TIDs were not part of the keyspace.  Index tuples
  * with duplicate keys could be stored in any order.  We continue to
  * support reading and writing Btree versions 2 and 3, so that they don't
  * need to be immediately re-indexed at pg_upgrade.  In order to get the
- * new heapkeyspace semantics, however, a REINDEX is needed.
+ * new heapkeyspace semantics, however, a REINDEX is needed.  Even version
+ * 4 indexes created on Postgres 12 will need a REINDEX in order to use
+ * deduplication (pg_upgrade won't set btm_safededup in metapage for us).
  *
  * Btree version 2 is mostly the same as version 3.  There are two new
  * fields in the metapage that were introduced in version 3.  A version 2
@@ -156,6 +163,23 @@ typedef struct BTMetaPageData
 				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
 				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
 
+/*
+ * MaxTIDsPerBTreePage is an upper bound on the number of heap TIDs tuples
+ * that may be stored on a btree leaf page.  It is used to size the
+ * per-page temporary buffers used by index scans.)
+ *
+ * Note: we don't bother considering per-physical-tuple overheads here to
+ * keep things simple (value is based on how many elements a single array
+ * of heap TIDs must have to fill the space between the page header and
+ * special area).  The value is slightly higher (i.e. more conservative)
+ * than necessary as a result, which is considered acceptable.  There will
+ * only be three (very large) physical posting list tuples in leaf pages
+ * that have the largest possible number of heap TIDs.
+ */
+#define MaxTIDsPerBTreePage \
+	(int) ((BLCKSZ - SizeOfPageHeaderData - sizeof(BTPageOpaqueData)) / \
+		   sizeof(ItemPointerData))
+
 /*
  * The leaf-page fillfactor defaults to 90% but is user-adjustable.
  * For pages above the leaf level, we use a fixed 70% fillfactor.
@@ -230,16 +254,15 @@ typedef struct BTMetaPageData
  * tuples (non-pivot tuples).  _bt_check_natts() enforces the rules
  * described here.
  *
- * Non-pivot tuple format:
+ * Non-pivot tuple format (plain/non-posting variant):
  *
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
- * All other types of index tuples ("pivot" tuples) only have key columns,
- * since pivot tuples only exist to represent how the key space is
+ * Non-pivot tuples complement pivot tuples, which only have key columns.
+ * The sole purpose of pivot tuples is to represent how the key space is
  * separated.  In general, any B-Tree index that has more than one level
  * (i.e. any index that does not just consist of a metapage and a single
  * leaf root page) must have some number of pivot tuples, since pivot
@@ -264,7 +287,8 @@ typedef struct BTMetaPageData
  * INDEX_ALT_TID_MASK bit is set, which doesn't count the trailing heap
  * TID column sometimes stored in pivot tuples -- that's represented by
  * the presence of BT_PIVOT_HEAP_TID_ATTR.  The INDEX_ALT_TID_MASK bit in
- * t_info is always set on BTREE_VERSION 4 pivot tuples.
+ * t_info is always set on BTREE_VERSION 4 pivot tuples, since
+ * BTreeTupleIsPivot() must work reliably on heapkeyspace versions.
  *
  * In version 3 indexes, the INDEX_ALT_TID_MASK flag might not be set in
  * pivot tuples.  In that case, the number of key columns is implicitly
@@ -279,90 +303,256 @@ typedef struct BTMetaPageData
  * The 12 least significant offset bits from t_tid are used to represent
  * the number of columns in INDEX_ALT_TID_MASK tuples, leaving 4 status
  * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
- * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
- * number of columns/attributes <= INDEX_MAX_KEYS.
+ * future use.  BT_OFFSET_MASK should be large enough to store any number
+ * of columns/attributes <= INDEX_MAX_KEYS.
+ *
+ * Sometimes non-pivot tuples also use a representation that repurposes
+ * t_tid to store metadata rather than a TID.  Postgres 13 introduced a new
+ * non-pivot tuple format to support deduplication: posting list tuples.
+ * Deduplication merges together multiple equal non-pivot tuples into a
+ * logically equivalent, space efficient representation.  A posting list is
+ * an array of ItemPointerData elements.  Non-pivot tuples are merged
+ * together to form posting list tuples lazily, at the point where we'd
+ * otherwise have to split a leaf page.
+ *
+ * Posting tuple format (alternative non-pivot tuple representation):
+ *
+ *  t_tid | t_info | key values | posting list (TID array)
+ *
+ * Posting list tuples are recognized as such by having the
+ * INDEX_ALT_TID_MASK status bit set in t_info and the BT_IS_POSTING status
+ * bit set in t_tid.  These flags redefine the content of the posting
+ * tuple's t_tid to store an offset to the posting list, as well as the
+ * total number of posting list array elements.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items present in the tuple, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.  Like any non-pivot tuple, the number of columns stored is
+ * always implicitly the total number in the index (in practice there can
+ * never be non-key columns stored, since deduplication is not supported
+ * with INCLUDE indexes).  BT_OFFSET_MASK should be large enough to store
+ * any number of posting list TIDs that might be present in a tuple (since
+ * tuple size is subject to the INDEX_SIZE_MASK limit).
  *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
-#define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_OFFSET_MASK				0x0FFF
 #define BT_PIVOT_HEAP_TID_ATTR		0x1000
-
-/* Get/set downlink block number in pivot tuple */
-#define BTreeTupleGetDownLink(itup) \
-	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
-#define BTreeTupleSetDownLink(itup, blkno) \
-	ItemPointerSetBlockNumber(&((itup)->t_tid), (blkno))
+#define BT_IS_POSTING				0x2000
 
 /*
- * Get/set leaf page highkey's link. During the second phase of deletion, the
- * target leaf page's high key may point to an ancestor page (at all other
- * times, the leaf level high key's link is not used).  See the nbtree README
- * for full details.
+ * Note: BTreeTupleIsPivot() can have false negatives (but not false
+ * positives) when used with !heapkeyspace indexes
  */
-#define BTreeTupleGetTopParent(itup) \
-	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
-#define BTreeTupleSetTopParent(itup, blkno)	\
-	do { \
-		ItemPointerSetBlockNumber(&((itup)->t_tid), (blkno)); \
-		BTreeTupleSetNAtts((itup), 0); \
-	} while(0)
+static inline bool
+BTreeTupleIsPivot(IndexTuple itup)
+{
+	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+		return false;
+	/* absence of BT_IS_POSTING in offset number indicates pivot tuple */
+	if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) & BT_IS_POSTING) != 0)
+		return false;
+
+	return true;
+}
+
+static inline bool
+BTreeTupleIsPosting(IndexTuple itup)
+{
+	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+		return false;
+	/* presence of BT_IS_POSTING in offset number indicates posting tuple */
+	if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) & BT_IS_POSTING) == 0)
+		return false;
+
+	return true;
+}
+
+static inline void
+BTreeTupleSetPosting(IndexTuple itup, int nhtids, int postingoffset)
+{
+	Assert(nhtids > 1 && (nhtids & BT_OFFSET_MASK) == nhtids);
+	Assert(postingoffset == MAXALIGN(postingoffset));
+	Assert(postingoffset < INDEX_SIZE_MASK);
+
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	ItemPointerSetOffsetNumber(&itup->t_tid, (nhtids | BT_IS_POSTING));
+	ItemPointerSetBlockNumber(&itup->t_tid, postingoffset);
+}
+
+static inline uint16
+BTreeTupleGetNPosting(IndexTuple posting)
+{
+	OffsetNumber existing;
+
+	Assert(BTreeTupleIsPosting(posting));
+
+	existing = ItemPointerGetOffsetNumberNoCheck(&posting->t_tid);
+	return (existing & BT_OFFSET_MASK);
+}
+
+static inline uint32
+BTreeTupleGetPostingOffset(IndexTuple posting)
+{
+	Assert(BTreeTupleIsPosting(posting));
+
+	return ItemPointerGetBlockNumberNoCheck(&posting->t_tid);
+}
+
+static inline ItemPointer
+BTreeTupleGetPosting(IndexTuple posting)
+{
+	return (ItemPointer) ((char *) posting +
+						  BTreeTupleGetPostingOffset(posting));
+}
+
+static inline ItemPointer
+BTreeTupleGetPostingN(IndexTuple posting, int n)
+{
+	return BTreeTupleGetPosting(posting) + n;
+}
 
 /*
- * Get/set number of attributes within B-tree index tuple.
+ * Get/set downlink block number in pivot tuple.
+ *
+ * Note: Cannot assert that tuple is a pivot tuple.  If we did so then
+ * !heapkeyspace indexes would exhibit false positive assertion failures.
+ */
+static inline BlockNumber
+BTreeTupleGetDownLink(IndexTuple pivot)
+{
+	return ItemPointerGetBlockNumberNoCheck(&pivot->t_tid);
+}
+
+static inline void
+BTreeTupleSetDownLink(IndexTuple pivot, BlockNumber blkno)
+{
+	ItemPointerSetBlockNumber(&pivot->t_tid, blkno);
+}
+
+/*
+ * Get number of attributes within tuple.
  *
  * Note that this does not include an implicit tiebreaker heap TID
  * attribute, if any.  Note also that the number of key attributes must be
  * explicitly represented in all heapkeyspace pivot tuples.
+ *
+ * Note: This is defined as a macro rather than an inline function to
+ * avoid including rel.h.
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		(BTreeTupleIsPivot(itup)) ? \
 		( \
-			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
+			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_OFFSET_MASK \
 		) \
 		: \
 		IndexRelationGetNumberOfAttributes(rel) \
 	)
-#define BTreeTupleSetNAtts(itup, n) \
-	do { \
-		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
-	} while(0)
 
 /*
- * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Set number of attributes in tuple, making it into a pivot tuple
  */
-#define BTreeTupleGetHeapTID(itup) \
-	( \
-	  (itup)->t_info & INDEX_ALT_TID_MASK && \
-	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_PIVOT_HEAP_TID_ATTR) != 0 ? \
-	  ( \
-		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
-					   sizeof(ItemPointerData)) \
-	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
-	)
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int natts)
+{
+	Assert(natts <= INDEX_MAX_KEYS);
+
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	/* BT_IS_POSTING bit may be unset -- tuple always becomes a pivot tuple */
+	ItemPointerSetOffsetNumber(&itup->t_tid, natts);
+	Assert(BTreeTupleIsPivot(itup));
+}
+
 /*
- * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * Set the bit indicating heap TID attribute present in pivot tuple
  */
-#define BTreeTupleSetAltHeapTID(itup) \
-	do { \
-		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
-								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_PIVOT_HEAP_TID_ATTR); \
-	} while(0)
+static inline void
+BTreeTupleSetAltHeapTID(IndexTuple pivot)
+{
+	OffsetNumber existing;
+
+	Assert(BTreeTupleIsPivot(pivot));
+
+	existing = ItemPointerGetOffsetNumberNoCheck(&pivot->t_tid);
+	ItemPointerSetOffsetNumber(&pivot->t_tid,
+							   existing | BT_PIVOT_HEAP_TID_ATTR);
+}
+
+/*
+ * Get/set leaf page's "top parent" link from its high key.  Used during page
+ * deletion.
+ *
+ * Note: Cannot assert that tuple is a pivot tuple.  If we did so then
+ * !heapkeyspace indexes would exhibit false positive assertion failures.
+ */
+static inline BlockNumber
+BTreeTupleGetTopParent(IndexTuple leafhikey)
+{
+	return ItemPointerGetBlockNumberNoCheck(&leafhikey->t_tid);
+}
+
+static inline void
+BTreeTupleSetTopParent(IndexTuple leafhikey, BlockNumber blkno)
+{
+	ItemPointerSetBlockNumber(&leafhikey->t_tid, blkno);
+	BTreeTupleSetNAtts(leafhikey, 0);
+}
+
+/*
+ * Get tiebreaker heap TID attribute, if any.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
+ */
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+	if (BTreeTupleIsPivot(itup))
+	{
+		/* Pivot tuple heap TID representation? */
+		if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+			 BT_PIVOT_HEAP_TID_ATTR) != 0)
+			return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+								  sizeof(ItemPointerData));
+
+		/* Heap TID attribute was truncated */
+		return NULL;
+	}
+	else if (BTreeTupleIsPosting(itup))
+		return BTreeTupleGetPosting(itup);
+
+	return &itup->t_tid;
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list tuple.
+ *
+ * Works with non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxHeapTID(IndexTuple itup)
+{
+	Assert(!BTreeTupleIsPivot(itup));
+
+	if (BTreeTupleIsPosting(itup))
+	{
+		uint16		nposting = BTreeTupleGetNPosting(itup);
+
+		return BTreeTupleGetPostingN(itup, nposting - 1);
+	}
+
+	return &itup->t_tid;
+}
 
 /*
  *	Operator strategy numbers for B-tree have been moved to access/stratnum.h,
@@ -434,6 +624,9 @@ typedef BTStackData *BTStack;
  * indexes whose version is >= version 4.  It's convenient to keep this close
  * by, rather than accessing the metapage repeatedly.
  *
+ * safededup is set to indicate that index may use deduplication safely.
+ * This is also a property of the index relation rather than an indexscan.
+ *
  * anynullkeys indicates if any of the keys had NULL value when scankey was
  * built from index tuple (note that already-truncated tuple key attributes
  * set NULL as a placeholder key value, which also affects value of
@@ -469,6 +662,7 @@ typedef BTStackData *BTStack;
 typedef struct BTScanInsertData
 {
 	bool		heapkeyspace;
+	bool		safededup;
 	bool		anynullkeys;
 	bool		nextkey;
 	bool		pivotsearch;
@@ -507,10 +701,60 @@ typedef struct BTInsertStateData
 	bool		bounds_valid;
 	OffsetNumber low;
 	OffsetNumber stricthigh;
+
+	/*
+	 * if _bt_binsrch_insert found the location inside existing posting list,
+	 * save the position inside the list.  -1 sentinel value indicates overlap
+	 * with an existing posting list tuple that has its LP_DEAD bit set.
+	 */
+	int			postingoff;
 } BTInsertStateData;
 
 typedef BTInsertStateData *BTInsertState;
 
+/*
+ * BTDedupStateData is a working area used during deduplication.
+ *
+ * The status info fields track the state of a whole-page deduplication pass.
+ * State about the current pending posting list is also tracked.
+ *
+ * A pending posting list is comprised of a contiguous group of equal physical
+ * items from the page, starting from page offset number 'baseoff'.  This is
+ * the offset number of the "base" tuple for new posting list.  'nitems' is
+ * the current total number of existing items from the page that will be
+ * merged to make a new posting list tuple, including the base tuple item.
+ * (Existing physical items may themselves be posting list tuples, or regular
+ * non-pivot tuples.)
+ *
+ * Note that when deduplication merges together existing physical tuples, the
+ * page is modified eagerly.  This makes tracking the details of more than a
+ * single pending posting list at a time unnecessary.  The total size of the
+ * existing tuples to be freed when pending posting list is processed gets
+ * tracked by 'phystupsize'.  This information allows deduplication to
+ * calculate the space saving for each new posting list tuple, and for the
+ * entire pass over the page as a whole.
+ */
+typedef struct BTDedupStateData
+{
+	/* Deduplication status info for entire pass over page */
+	Size		maxitemsize;	/* Limit on size of final tuple */
+	bool		checkingunique; /* Use unique index strategy? */
+	OffsetNumber skippedbase;	/* First offset skipped by checkingunique */
+
+	/* Metadata about base tuple of current pending posting list */
+	IndexTuple	base;			/* Use to form new posting list */
+	OffsetNumber baseoff;		/* page offset of base */
+	Size		basetupsize;	/* base size without original posting list */
+
+	/* Other metadata about pending posting list */
+	ItemPointer htids;			/* Heap TIDs in pending posting list */
+	int			nhtids;			/* Number of heap TIDs in nhtids array */
+	int			nitems;			/* Number of existing physical tuples */
+	Size		phystupsize;	/* Includes line pointer overhead */
+} BTDedupStateData;
+
+typedef BTDedupStateData *BTDedupState;
+
 /*
  * BTScanOpaqueData is the btree-private state needed for an indexscan.
  * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
@@ -534,7 +778,9 @@ typedef BTInsertStateData *BTInsertState;
  * If we are doing an index-only scan, we save the entire IndexTuple for each
  * matched item, otherwise only its heap TID and offset.  The IndexTuples go
  * into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array.  Posting list tuples store a "base" tuple once,
+ * allowing the same key to be returned for each TID in the posting list
+ * tuple.
  */
 
 typedef struct BTScanPosItem	/* what we remember about each match */
@@ -578,7 +824,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxTIDsPerBTreePage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -686,6 +932,7 @@ typedef struct BTOptions
 	int			fillfactor;		/* page fill factor in percent (0..100) */
 	/* fraction of newly inserted tuples prior to trigger index cleanup */
 	float8		vacuum_cleanup_index_scale_factor;
+	bool		deduplicate_items;	/* Use deduplication where safe? */
 } BTOptions;
 
 #define BTGetFillFactor(relation) \
@@ -694,8 +941,16 @@ typedef struct BTOptions
 	 (relation)->rd_options ? \
 	 ((BTOptions *) (relation)->rd_options)->fillfactor : \
 	 BTREE_DEFAULT_FILLFACTOR)
+#define BTGetUseDedup(relation) \
+	(AssertMacro(relation->rd_rel->relkind == RELKIND_INDEX && \
+				 relation->rd_rel->relam == BTREE_AM_OID), \
+	((relation)->rd_options ? \
+	 ((BTOptions *) (relation)->rd_options)->deduplicate_items : \
+	 BTGetUseDedupGUC(relation)))
 #define BTGetTargetPageFreeSpace(relation) \
 	(BLCKSZ * (100 - BTGetFillFactor(relation)) / 100)
+#define BTGetUseDedupGUC(relation) \
+	(relation->rd_index->indisunique || deduplicate_btree_items)
 
 /*
  * Constant definition for progress reporting.  Phase numbers must match
@@ -742,6 +997,22 @@ extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
 extern void _bt_parallel_done(IndexScanDesc scan);
 extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
 
+/*
+ * prototypes for functions in nbtdedup.c
+ */
+extern void _bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
+							   IndexTuple newitem, Size newitemsz,
+							   bool checkingunique);
+extern void _bt_dedup_start_pending(BTDedupState state, IndexTuple base,
+									OffsetNumber base_off);
+extern bool _bt_dedup_save_htid(BTDedupState state, IndexTuple itup);
+extern Size _bt_dedup_finish_pending(Buffer buf, BTDedupState state,
+									 bool logged);
+extern IndexTuple _bt_form_posting(IndexTuple base, ItemPointer htids,
+								   int nhtids);
+extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
+								   int postingoff);
+
 /*
  * prototypes for functions in nbtinsert.c
  */
@@ -760,14 +1031,16 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
 /*
  * prototypes for functions in nbtpage.c
  */
-extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level);
+extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+							 bool safededup);
 extern void _bt_update_meta_cleanup_info(Relation rel,
 										 TransactionId oldestBtpoXact, float8 numHeapTuples);
 extern void _bt_upgrademetapage(Page page);
 extern Buffer _bt_getroot(Relation rel, int access);
 extern Buffer _bt_gettrueroot(Relation rel);
 extern int	_bt_getrootheight(Relation rel);
-extern bool _bt_heapkeyspace(Relation rel);
+extern void _bt_metaversion(Relation rel, bool *heapkeyspace,
+							bool *safededup);
 extern void _bt_checkpage(Relation rel, Buffer buf);
 extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
 extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -776,7 +1049,9 @@ extern void _bt_relbuf(Relation rel, Buffer buf);
 extern void _bt_pageinit(Page page, Size size);
 extern bool _bt_page_recyclable(Page page);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
-								OffsetNumber *deletable, int ndeletable);
+								OffsetNumber *deletable, int ndeletable,
+								OffsetNumber *updatable, IndexTuple *updated,
+								int nupdatable);
 extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *deletable, int ndeletable,
 								Relation heapRel);
@@ -829,6 +1104,7 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern bool _bt_opclasses_support_dedup(Relation index);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 776a9bd723..51f76c055a 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP_PAGE	0x50	/* deduplicate tuples on leaf page */
+#define XLOG_BTREE_INSERT_POST	0x60	/* add index tuple with posting split */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -53,21 +54,34 @@ typedef struct xl_btree_metadata
 	uint32		fastlevel;
 	TransactionId oldest_btpo_xact;
 	float8		last_cleanup_num_heap_tuples;
+	bool		safededup;
 } xl_btree_metadata;
 
 /*
  * This is what we need to know about simple (without split) insert.
  *
- * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
- * Note that INSERT_META implies it's not a leaf page.
+ * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META, and
+ * INSERT_POST.  Note that INSERT_META and INSERT_UPPER implies it's not a
+ * leaf page, while INSERT_POST and INSERT_LEAF imply that it must be a leaf
+ * page.
  *
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page
  * Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
  * Backup Blk 2: xl_btree_metadata, if INSERT_META
+ *
+ * Note: The new tuple is actually the "original" new item in the posting
+ * list split insert case (i.e. the INSERT_POST case).  A split offset for
+ * the posting list is logged before the original new item.  Recovery needs
+ * both, since it must do an in-place update of the existing posting list
+ * that was split as an extra step.  Also, recovery generates a "final"
+ * newitem.  See _bt_swap_posting() for details on posting list splits.
  */
 typedef struct xl_btree_insert
 {
 	OffsetNumber offnum;
+
+	/* POSTING SPLIT OFFSET FOLLOWS (INSERT_POST case) */
+	/* NEW TUPLE ALWAYS FOLLOWS AT THE END */
 } xl_btree_insert;
 
 #define SizeOfBtreeInsert	(offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
@@ -92,8 +106,37 @@ typedef struct xl_btree_insert
  * Backup Blk 0: original page / new left page
  *
  * The left page's data portion contains the new item, if it's the _L variant.
- * An IndexTuple representing the high key of the left page must follow with
- * either variant.
+ * _R variant split records generally do not have a newitem (_R variant leaf
+ * page split records that must deal with a posting list split will include an
+ * explicit newitem, though it is never used on the right page -- it is
+ * actually an orignewitem needed to update existing posting list).  The new
+ * high key of the left/original page appears last of all (and must always be
+ * present).
+ *
+ * Page split records that need the REDO routine to deal with a posting list
+ * split directly will have an explicit newitem, which is actually an
+ * orignewitem (the newitem as it was before the posting list split, not
+ * after).  A posting list split always has a newitem that comes immediately
+ * after the posting list being split (which would have overlapped with
+ * orignewitem prior to split).  Usually REDO must deal with posting list
+ * splits with an _L variant page split record, and usually both the new
+ * posting list and the final newitem go on the left page (the existing
+ * posting list will be inserted instead of the old, and the final newitem
+ * will be inserted next to that).  However, _R variant split records will
+ * include an orignewitem when the split point for the page happens to have a
+ * lastleft tuple that is also the posting list being split (leaving newitem
+ * as the page split's firstright tuple).  The existence of this corner case
+ * does not change the basic fact about newitem/orignewitem for the REDO
+ * routine: it is always state used for the left page alone.  (This is why the
+ * record's postingoff field isn't a reliable indicator of whether or not a
+ * posting list split occurred during the page split; a non-zero value merely
+ * indicates that the REDO routine must reconstruct a new posting list tuple
+ * that is needed for the left page.)
+ *
+ * This posting list split handling is equivalent to the xl_btree_insert REDO
+ * routine's INSERT_POST handling.  While the details are more complicated
+ * here, the concept and goals are exactly the same.  See _bt_swap_posting()
+ * for details on posting list splits.
  *
  * Backup Blk 1: new right page
  *
@@ -111,15 +154,32 @@ typedef struct xl_btree_split
 {
 	uint32		level;			/* tree level of page being split */
 	OffsetNumber firstright;	/* first item moved to right page */
-	OffsetNumber newitemoff;	/* new item's offset (useful for _L variant) */
+	OffsetNumber newitemoff;	/* new item's offset */
+	uint16		postingoff;		/* offset inside orig posting tuple */
 } xl_btree_split;
 
-#define SizeOfBtreeSplit	(offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit	(offsetof(xl_btree_split, postingoff) + sizeof(uint16))
+
+/*
+ * When page is deduplicated, consecutive groups of tuples with equal keys are
+ * merged together into posting list tuples.
+ *
+ * The WAL record represents the interval that describes the posting tuple
+ * that should be added to the page.
+ */
+typedef struct xl_btree_dedup
+{
+	OffsetNumber baseoff;
+	uint16		nitems;
+} xl_btree_dedup;
+
+#define SizeOfBtreeDedup 	(offsetof(xl_btree_dedup, nitems) + sizeof(uint16))
 
 /*
  * This is what we need to know about delete of individual leaf index tuples.
  * The WAL record can represent deletion of any number of index tuples on a
- * single index page when *not* executed by VACUUM.
+ * single index page when *not* executed by VACUUM.  Deletion of a subset of
+ * the TIDs within a posting list tuple is not supported.
  *
  * Backup Blk 0: index page
  */
@@ -152,19 +212,23 @@ typedef struct xl_btree_reuse_page
 /*
  * This is what we need to know about vacuum of individual leaf index tuples.
  * The WAL record can represent deletion of any number of index tuples on a
- * single index page when executed by VACUUM.
- *
- * Note that the WAL record in any vacuum of an index must have at least one
- * item to delete.
+ * single index page when executed by VACUUM.  It can also support "updates"
+ * of index tuples, which are how deletes of a subset of TIDs contained in an
+ * existing posting list tuple are implemented. (Updates are only used when
+ * there will be some remaining TIDs once VACUUM finishes; otherwise the
+ * physical posting list tuple can just be deleted).
  */
 typedef struct xl_btree_vacuum
 {
-	uint32		ndeleted;
+	uint16		ndeleted;
+	uint16		nupdated;
 
 	/* DELETED TARGET OFFSET NUMBERS FOLLOW */
+	/* UPDATED TARGET OFFSET NUMBERS FOLLOW */
+	/* UPDATED TUPLES FOR OVERWRITES FOLLOW */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, nupdated) + sizeof(uint16))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.
@@ -245,6 +309,8 @@ typedef struct xl_btree_newroot
 extern void btree_redo(XLogReaderState *record);
 extern void btree_desc(StringInfo buf, XLogReaderState *record);
 extern const char *btree_identify(uint8 info);
+extern void btree_xlog_startup(void);
+extern void btree_xlog_cleanup(void);
 extern void btree_mask(char *pagedata, BlockNumber blkno);
 
 #endif							/* NBTXLOG_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index c88dccfb8d..6c15df7e70 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -36,7 +36,7 @@ PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL,
 PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
 PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
 PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask)
 PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask)
 PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
 PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 79430d2b7b..f2b03a6cfc 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -158,6 +158,15 @@ static relopt_bool boolRelOpts[] =
 		},
 		true
 	},
+	{
+		{
+			"deduplicate_items",
+			"Enables deduplication on btree index leaf pages",
+			RELOPT_KIND_BTREE,
+			ShareUpdateExclusiveLock
+		},
+		true
+	},
 	/* list terminator */
 	{{NULL}}
 };
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index c16eb05416..dfba5ae39a 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -276,6 +276,10 @@ BuildIndexValueDescription(Relation indexRelation,
 /*
  * Get the latestRemovedXid from the table entries pointed at by the index
  * tuples being deleted.
+ *
+ * Note: index access methods that don't consistently use the standard
+ * IndexTuple + heap TID item pointer representation will need to provide
+ * their own version of this function.
  */
 TransactionId
 index_compute_xid_horizon_for_tuples(Relation irel,
diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index bf245f5dab..d69808e78c 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = \
 	nbtcompare.o \
+	nbtdedup.o \
 	nbtinsert.o \
 	nbtpage.o \
 	nbtree.o \
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index c60a4d0d9e..ff7c54f5a8 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
 like a hint bit for a heap tuple), but physically removing tuples requires
 exclusive lock.  In the current code we try to remove LP_DEAD tuples when
 we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already).  Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every table TID within the posting list is known dead.)
 
 This leaves the index in a state where it has no entry for a dead tuple
 that still exists in the heap.  This is not a problem for the current
@@ -726,6 +729,152 @@ if it must.  When a page that's already full of duplicates must be split,
 the fallback strategy assumes that duplicates are mostly inserted in
 ascending heap TID order.  The page is split in a way that leaves the left
 half of the page mostly full, and the right half of the page mostly empty.
+The overall effect is that leaf page splits gracefully adapt to inserts of
+large groups of duplicates, maximizing space utilization.  Note also that
+"trapping" large groups of duplicates on the same leaf page like this makes
+deduplication more efficient.  Deduplication can be performed infrequently,
+without merging together existing posting list tuples too often.
+
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid (or at least delay) page splits.  Note that the
+goals for deduplication in unique indexes are rather different; see later
+section for details.  Deduplication alters the physical representation of
+tuples without changing the logical contents of the index, and without
+adding overhead to read queries.  Non-pivot tuples are merged together
+into a single physical tuple with a posting list (a simple array of heap
+TIDs with the standard item pointer format).  Deduplication is always
+applied lazily, at the point where it would otherwise be necessary to
+perform a page split.  It occurs only when LP_DEAD items have been
+removed, as our last line of defense against splitting a leaf page.  We
+can set the LP_DEAD bit with posting list tuples, though only when all
+TIDs are known dead.
+
+Our lazy approach to deduplication allows the page space accounting used
+during page splits to have absolutely minimal special case logic for
+posting lists.  Posting lists can be thought of as extra payload that
+suffix truncation will reliably truncate away as needed during page
+splits, just like non-key columns from an INCLUDE index tuple.
+Incoming/new tuples can generally be treated as non-overlapping plain
+items (though see section on posting list splits for information about how
+overlapping new/incoming items are really handled).
+
+The representation of posting lists is almost identical to the posting
+lists used by GIN, so it would be straightforward to apply GIN's varbyte
+encoding compression scheme to individual posting lists.  Posting list
+compression would break the assumptions made by posting list splits about
+page space accounting (see later section), so it's not clear how
+compression could be integrated with nbtree.  Besides, posting list
+compression does not offer a compelling trade-off for nbtree, since in
+general nbtree is optimized for consistent performance with many
+concurrent readers and writers.
+
+A major goal of our lazy approach to deduplication is to limit the
+performance impact of deduplication with random updates.  Even concurrent
+append-only inserts of the same key value will tend to have inserts of
+individual index tuples in an order that doesn't quite match heap TID
+order.  Delaying deduplication minimizes page level fragmentation.
+
+Deduplication in unique indexes
+-------------------------------
+
+Very often, the range of values that can be placed on a given leaf page in
+a unique index is fixed and permanent.  For example, a primary key on an
+identity column will usually only have page splits caused by the insertion
+of new logical rows within the rightmost leaf page.  If there is a split
+of a non-rightmost leaf page, then the split must have been triggered by
+inserts associated with an UPDATE of an existing logical row.  Splitting a
+leaf page purely to store multiple versions should be considered
+pathological, since it permanently degrades the index structure in order
+to absorb a temporary burst of duplicates.  Deduplication in unique
+indexes helps to prevent these pathological page splits.
+
+Like all index access methods, nbtree does not have direct knowledge of
+versioning or of MVCC; it deals only with physical tuples.  However, unique
+indexes implicitly give nbtree basic information about tuple versioning,
+since by definition zero or one tuples of any given key value can be
+visible to any possible MVCC snapshot (excluding index entries with NULL
+values).  When optimizations such as heapam's Heap-only tuples (HOT) happen
+to be ineffective, nbtree's on-the-fly deletion of tuples in unique indexes
+can be very important with UPDATE-heavy workloads.  Unique checking's
+LP_DEAD bit setting reliably attempts to kill old, equal index tuple
+versions.  This prevents (or at least delays) page splits that are
+necessary only because a leaf page must contain multiple physical tuples
+for the same logical row.  Deduplication in unique indexes must cooperate
+with this mechanism.  Deleting items on the page is always preferable to
+deduplication.
+
+The strategy used during a deduplication pass has significant differences
+to the strategy used for indexes that can have multiple logical rows with
+the same key value.  We're not really trying to store duplicates in a
+space efficient manner, since in the long run there won't be any
+duplicates anyway.  Rather, we're buying time for garbage collection
+mechanisms to run before a page split is needed.
+
+Unique index leaf pages only get a deduplication pass when an insertion
+(that might have to split the page) observed an existing duplicate on the
+page in passing.  This is based on the assumption that deduplication will
+only work out when _all_ new insertions are duplicates from UPDATEs.  This
+may mean that we miss an opportunity to delay a page split, but that's
+okay because our ultimate goal is to delay leaf page splits _indefinitely_
+(i.e. to prevent them altogether).  There is little point in trying to
+delay a split that is probably inevitable anyway.  This allows us to avoid
+the overhead of attempting to deduplicate with unique indexes that always
+have few or no duplicates.
+
+Posting list splits
+-------------------
+
+When the incoming tuple happens to overlap with an existing posting list,
+a posting list split is performed.  Like a page split, a posting list
+split resolves the situation where a new/incoming item "won't fit", while
+inserting the incoming item in passing (i.e. as part of the same atomic
+action).  It's possible (though not particularly likely) that an insert of
+a new item on to an almost-full page will overlap with a posting list,
+resulting in both a posting list split and a page split.  Even then, the
+atomic action that splits the posting list also inserts the new item
+(since page splits always insert the new item in passing).  Including the
+posting list split in the same atomic action as the insert avoids problems
+caused by concurrent inserts into the same posting list --  the exact
+details of how we change the posting list depend upon the new item, and
+vice-versa.  A single atomic action also minimizes the volume of extra
+WAL required for a posting list split, since we don't have to explicitly
+WAL-log the original posting list tuple.
+
+Despite piggy-backing on the same atomic action that inserts a new tuple,
+posting list splits can be thought of as a separate, extra action to the
+insert itself (or to the page split itself).  Posting list splits
+conceptually "rewrite" an insert that overlaps with an existing posting
+list into an insert that adds its final new item just to the right of the
+posting list instead.  The size of the posting list won't change, and so
+page space accounting code does not need to care about posting list splits
+at all.  This is an important upside of our design; the page split point
+choice logic is very subtle even without it needing to deal with posting
+list splits.
+
+Only a few isolated extra steps are required to preserve the illusion that
+the new item never overlapped with an existing posting list in the first
+place: the heap TID of the incoming tuple is swapped with the rightmost/max
+heap TID from the existing/originally overlapping posting list.  Also, the
+posting-split-with-page-split case must generate a new high key based on
+an imaginary version of the original page that has both the final new item
+and the after-list-split posting tuple (page splits usually just operate
+against an imaginary version that contains the new item/item that won't
+fit).
+
+This approach avoids inventing an "eager" atomic posting split operation
+that splits the posting list without simultaneously finishing the insert
+of the incoming item.  This alternative design might seem cleaner, but it
+creates subtle problems for page space accounting.  In general, there
+might not be enough free space on the page to split a posting list such
+that the incoming/new item no longer overlaps with either posting list
+half --- the operation could fail before the actual retail insert of the
+new item even begins.  We'd end up having to handle posting list splits
+that need a page split anyway.  Besides, supporting variable "split points"
+while splitting posting lists won't actually improve overall space
+utilization.
 
 Notes About Data Representation
 -------------------------------
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
new file mode 100644
index 0000000000..2790ccccbb
--- /dev/null
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -0,0 +1,774 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtdedup.c
+ *	  Deduplicate items in Postgres btrees.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/nbtree/nbtdedup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/nbtxlog.h"
+#include "miscadmin.h"
+#include "utils/rel.h"
+
+static bool _bt_use_singlevalue(Relation rel, Page page, BTDedupState state,
+								OffsetNumber minoff, IndexTuple newitem);
+static void _bt_singlevalue_adjust(Page page, BTDedupState state,
+								   Size newitemsz);
+
+/*
+ * Try to deduplicate items to free at least enough space to avoid a page
+ * split.  This function should be called during insertion, only after LP_DEAD
+ * items were removed by _bt_vacuum_one_page() to prevent a page split.
+ * (We'll have to kill LP_DEAD items here when the page's BTP_HAS_GARBAGE hint
+ * was not set, but that should be rare.)
+ *
+ * The strategy for !checkingunique callers is to perform as much
+ * deduplication as possible to free as much space as possible now, since
+ * making it harder to set LP_DEAD bits is considered an acceptable price for
+ * not having to deduplicate the same page many times.  It is unlikely that
+ * the items on the page will have their LP_DEAD bit set in the future, since
+ * that hasn't happened before now (besides, entire posting lists can still
+ * have their LP_DEAD bit set).
+ *
+ * The strategy for checkingunique callers is completely different.
+ * Deduplication works in tandem with garbage collection, especially the
+ * LP_DEAD bit setting that takes place in _bt_check_unique().  We give up as
+ * soon as it becomes clear that enough space has been made available to
+ * insert newitem without needing to split the page.  Also, we merge together
+ * larger groups of duplicate tuples first (merging together two index tuples
+ * usually saves very little space), and avoid merging together existing
+ * posting list tuples.  The goal is to generate posting lists with TIDs that
+ * are "close together in time", in order to maximize the chances of an
+ * LP_DEAD bit being set opportunistically.  See nbtree/README for more
+ * information on deduplication within unique indexes.
+ *
+ * nbtinsert.c caller should call _bt_vacuum_one_page() before calling here.
+ * Note that this routine will delete all items on the page that have their
+ * LP_DEAD bit set, even when page's BTP_HAS_GARBAGE bit is not set (a rare
+ * edge case).  Caller can rely on that to avoid inserting a new tuple that
+ * happens to overlap with an existing posting list tuple with its LP_DEAD bit
+ * set. (Calling here with a newitemsz of 0 will reliably delete the existing
+ * item, making it possible to avoid unsetting the LP_DEAD bit just to insert
+ * the new item.  In general, posting list splits should never have to deal
+ * with a posting list tuple with its LP_DEAD bit set.)
+ *
+ * Note: If newitem contains NULL values in key attributes, caller will be
+ * !checkingunique even when rel is a unique index.  The page in question will
+ * usually have many existing items with NULLs.
+ */
+void
+_bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
+				   IndexTuple newitem, Size newitemsz, bool checkingunique)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buf);
+	BTPageOpaque opaque;
+	OffsetNumber deletable[MaxIndexTuplesPerPage];
+	BTDedupState state;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+	bool		minimal = checkingunique;
+	int			ndeletable = 0;
+	Size		pagesaving = 0;
+	int			pagenitems = 0;
+	bool		singlevalue = false;
+
+	/*
+	 * Caller should call _bt_vacuum_one_page() before calling here when it
+	 * looked like there were LP_DEAD items on the page.  However, we can't
+	 * assume that there are no LP_DEAD items (for one thing, VACUUM will
+	 * clear the BTP_HAS_GARBAGE hint without reliably removing items that are
+	 * marked LP_DEAD).  We must be careful to clear all LP_DEAD items because
+	 * posting list splits cannot go ahead if an existing posting list item
+	 * has its LP_DEAD bit set. (Also, we don't want to unnecessarily unset
+	 * LP_DEAD bits when deduplicating items on the page below, though that
+	 * should be harmless.)
+	 *
+	 * The opposite problem is also possible: _bt_vacuum_one_page() won't
+	 * clear the BTP_HAS_GARBAGE bit when it is falsely set (i.e. when there
+	 * are no LP_DEAD bits).  This probably doesn't matter in practice, since
+	 * it's only a hint, and VACUUM will clear it at some point anyway.  Even
+	 * still, we clear the BTP_HAS_GARBAGE hint reliably here. (Seems like a
+	 * good idea for deduplication to only begin when we unambiguously have no
+	 * LP_DEAD items.)
+	 */
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	minoff = P_FIRSTDATAKEY(opaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+
+		if (ItemIdIsDead(itemid))
+			deletable[ndeletable++] = offnum;
+	}
+
+	if (ndeletable > 0)
+	{
+		_bt_delitems_delete(rel, buf, deletable, ndeletable, heapRel);
+
+		/*
+		 * Return when a split will be avoided.  This is equivalent to
+		 * avoiding a split using the usual _bt_vacuum_one_page() path.
+		 */
+		if (PageGetFreeSpace(page) >= newitemsz)
+			return;
+
+		/*
+		 * Reconsider number of items on page, in case _bt_delitems_delete()
+		 * managed to delete an item or two
+		 */
+		minoff = P_FIRSTDATAKEY(opaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+	}
+	else if (P_HAS_GARBAGE(opaque))
+	{
+		opaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+		MarkBufferDirtyHint(buf, true);
+	}
+
+	/*
+	 * Return early in case where caller just wants us to kill an existing
+	 * LP_DEAD posting list tuple
+	 */
+	Assert(!P_HAS_GARBAGE(opaque));
+	if (newitemsz == 0)
+		return;
+
+	/*
+	 * By here, it's clear that deduplication will definitely be attempted.
+	 * Initialize deduplication state.
+	 */
+	state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+	state->maxitemsize = Min(BTMaxItemSize(page), INDEX_SIZE_MASK);
+	state->checkingunique = checkingunique;
+	state->skippedbase = InvalidOffsetNumber;
+	/* Metadata about base tuple of current pending posting list */
+	state->base = NULL;
+	state->baseoff = InvalidOffsetNumber;
+	state->basetupsize = 0;
+	/* Metadata about current pending posting list TIDs */
+	state->htids = NULL;
+	state->nhtids = 0;
+	state->nitems = 0;
+	/* Size of all physical tuples to be replaced by pending posting list */
+	state->phystupsize = 0;
+
+	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+	newitemsz += sizeof(ItemIdData);
+	/* Conservatively size array */
+	state->htids = palloc(state->maxitemsize);
+
+	/* Determine if "single value" strategy should be used */
+	if (!checkingunique)
+		singlevalue = _bt_use_singlevalue(rel, page, state, minoff, newitem);
+
+	offnum = minoff;
+retry:
+
+	/*
+	 * Deduplicate items, starting from offnum.
+	 *
+	 * NOTE: We deliberately reassess the max offset number on each iteration.
+	 * The number of items on the page goes down as existing items are
+	 * deduplicated.
+	 */
+	while (offnum <= PageGetMaxOffsetNumber(page))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(!ItemIdIsDead(itemid));
+
+		if (state->nitems == 0)
+		{
+			/*
+			 * No previous/base tuple for the data item -- use the data item
+			 * as base tuple of pending posting list
+			 */
+			_bt_dedup_start_pending(state, itup, offnum);
+		}
+		else if (_bt_keep_natts_fast(rel, state->base, itup) > natts &&
+				 _bt_dedup_save_htid(state, itup))
+		{
+			/*
+			 * Tuple is equal to base tuple of pending posting list.  Heap
+			 * TID(s) for itup have been saved in state.
+			 */
+		}
+		else
+		{
+			/*
+			 * Tuple is not equal to pending posting list tuple, or
+			 * _bt_dedup_save_htid() opted to not merge current item into
+			 * pending posting list for some other reason (e.g., adding more
+			 * TIDs would have caused posting list to exceed BTMaxItemSize()
+			 * limit).
+			 *
+			 * If state contains pending posting list with more than one item,
+			 * form new posting tuple, and actually update the page.  Else
+			 * reset the state and move on without modifying the page.
+			 */
+			pagesaving += _bt_dedup_finish_pending(buf, state,
+												   RelationNeedsWAL(rel));
+			pagenitems++;
+
+			/*
+			 * Consider final space utilization when a future page split of
+			 * the leaf page is likely to occur using nbtsplitloc.c's "single
+			 * value" strategy
+			 */
+			if (singlevalue)
+			{
+				/*
+				 * Lower maxitemsize for third and final item that might be
+				 * deduplicated by current deduplication pass.  When third
+				 * item formed/observed, end pass.
+				 *
+				 * NOTE: it's possible that this will be reached even when
+				 * current deduplication pass has yet to modify the page.  It
+				 * doesn't matter how the tuples originated, though.  (In fact
+				 * this must have happened by the time the page gets split,
+				 * since we'll do a final no-op deduplication pass right
+				 * before the page finally splits.  Fortunately that will be
+				 * fairly cheap, since the page will only have three physical
+				 * tuples to consider before we end up breaking out of the
+				 * loop here one last time.)
+				 */
+				Assert(!minimal && pagenitems <= 3);
+				if (pagenitems == 2)
+					_bt_singlevalue_adjust(page, state, newitemsz);
+				else if (pagenitems == 3)
+					break;
+			}
+
+			/*
+			 * Stop deduplicating for a checkingunique (minimal) caller once
+			 * we've freed enough space to avoid an immediate page split
+			 */
+			else if (minimal && pagesaving >= newitemsz)
+				break;
+
+			/*
+			 * Next iteration starts immediately after base tuple offset (this
+			 * will be the next offset on the page when we didn't modify the
+			 * page)
+			 */
+			offnum = state->baseoff;
+		}
+
+		offnum = OffsetNumberNext(offnum);
+	}
+
+	/* Handle the last item when pending posting list is not empty */
+	if (state->nitems != 0)
+	{
+		pagesaving += _bt_dedup_finish_pending(buf, state,
+											   RelationNeedsWAL(rel));
+		pagenitems++;
+	}
+
+	if (pagesaving < newitemsz && state->skippedbase != InvalidOffsetNumber)
+	{
+		/*
+		 * Didn't free enough space for new item in first checkingunique pass.
+		 * Try making a second pass over the page, this time starting from the
+		 * first candidate posting list base offset that was skipped over in
+		 * the first pass (only do a second pass when this actually happened).
+		 *
+		 * The second pass over the page may deduplicate items that were
+		 * initially passed over due to concerns about limiting the
+		 * effectiveness of LP_DEAD bit setting within _bt_check_unique().
+		 * Note that the second pass will still stop deduplicating as soon as
+		 * enough space has been freed to avoid an immediate page split.
+		 */
+		offnum = state->skippedbase;
+		pagenitems = 0;
+
+		Assert(state->checkingunique);
+		state->checkingunique = false;
+		state->skippedbase = InvalidOffsetNumber;
+
+		goto retry;
+	}
+
+	/* Local space accounting should agree with page accounting */
+	Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
+
+	/* cannot leak memory here */
+	pfree(state->htids);
+	pfree(state);
+}
+
+/*
+ * Create a new pending posting list tuple based on caller's base tuple.
+ *
+ * Every tuple processed by deduplication either becomes the base tuple for a
+ * posting list, or gets its heap TID(s) accepted into a pending posting list.
+ * A tuple that starts out as the base tuple for a posting list will only
+ * actually be rewritten within _bt_dedup_finish_pending() when it turns out
+ * that there are duplicates that can be merged into the base tuple.
+ */
+void
+_bt_dedup_start_pending(BTDedupState state, IndexTuple base,
+						OffsetNumber baseoff)
+{
+	Assert(state->nhtids == 0);
+	Assert(state->nitems == 0);
+	Assert(!BTreeTupleIsPivot(base));
+
+	/*
+	 * Copy heap TID(s) from new base tuple for new candidate posting list
+	 * into working state's array
+	 */
+	if (!BTreeTupleIsPosting(base))
+	{
+		memcpy(state->htids, base, sizeof(ItemPointerData));
+		state->nhtids = 1;
+		state->basetupsize = IndexTupleSize(base);
+	}
+	else
+	{
+		int			nposting;
+
+		nposting = BTreeTupleGetNPosting(base);
+		memcpy(state->htids, BTreeTupleGetPosting(base),
+			   sizeof(ItemPointerData) * nposting);
+		state->nhtids = nposting;
+		/* basetupsize should not include existing posting list */
+		state->basetupsize = BTreeTupleGetPostingOffset(base);
+	}
+
+	/*
+	 * Save new base tuple itself -- it'll be needed if we actually create a
+	 * new posting list from new pending posting list.
+	 *
+	 * Must maintain size of all existing physical tuples (including line
+	 * pointer overhead) so that we can calculate space savings on page.
+	 */
+	state->nitems = 1;
+	state->base = base;
+	state->baseoff = baseoff;
+	state->phystupsize = MAXALIGN(IndexTupleSize(base)) + sizeof(ItemIdData);
+}
+
+/*
+ * Save itup heap TID(s) into pending posting list where possible.
+ *
+ * Returns bool indicating if the pending posting list managed by state now
+ * includes itup's heap TID(s).
+ */
+bool
+_bt_dedup_save_htid(BTDedupState state, IndexTuple itup)
+{
+	int			nhtids;
+	ItemPointer htids;
+	Size		mergedtupsz;
+
+	Assert(!BTreeTupleIsPivot(itup));
+
+	if (!BTreeTupleIsPosting(itup))
+	{
+		nhtids = 1;
+		htids = &itup->t_tid;
+	}
+	else
+	{
+		nhtids = BTreeTupleGetNPosting(itup);
+		htids = BTreeTupleGetPosting(itup);
+	}
+
+	/*
+	 * Don't append (have caller finish pending posting list as-is) if
+	 * appending heap TID(s) from itup would put us over maxitemsize limit.
+	 *
+	 * This calculation needs to match the code used within _bt_form_posting()
+	 * for new posting list tuples.
+	 */
+	mergedtupsz = MAXALIGN(state->basetupsize +
+						   (state->nhtids + nhtids) * sizeof(ItemPointerData));
+
+	if (mergedtupsz > state->maxitemsize)
+		return false;
+
+	/* Don't merge existing posting lists in first checkingunique pass */
+	if (state->checkingunique &&
+		(BTreeTupleIsPosting(state->base) || nhtids > 1))
+	{
+		/* May begin here if second pass over page is required */
+		if (state->skippedbase == InvalidOffsetNumber)
+			state->skippedbase = state->baseoff;
+		return false;
+	}
+
+	/*
+	 * Save heap TIDs to pending posting list tuple -- itup can be merged into
+	 * pending posting list
+	 */
+	state->nitems++;
+	memcpy(state->htids + state->nhtids, htids,
+		   sizeof(ItemPointerData) * nhtids);
+	state->nhtids += nhtids;
+	state->phystupsize += MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
+
+	return true;
+}
+
+/*
+ * Finalize pending posting list tuple, and add it to the page.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * Returns space saving from deduplicating to make a new posting list tuple.
+ * Note that this includes line pointer overhead.  This is zero in the case
+ * where no deduplication was possible.
+ */
+Size
+_bt_dedup_finish_pending(Buffer buf, BTDedupState state, bool logged)
+{
+	Size		spacesaving = 0;
+	Page		page = BufferGetPage(buf);
+	int			minimum = 2;
+
+	Assert(state->nitems > 0);
+	Assert(state->nitems <= state->nhtids);
+
+	/*
+	 * Only create a posting list when at least 3 heap TIDs will appear in the
+	 * checkingunique case (checkingunique strategy won't merge existing
+	 * posting list tuples, so we know that the number of items here must also
+	 * be the total number of heap TIDs).  Creating a new posting lists with
+	 * only two heap TIDs won't even save enough space to fit another
+	 * duplicate with the same key as the posting list.  This is a bad
+	 * trade-off if there is a chance that the LP_DEAD bit can be set for
+	 * either existing tuple by putting off deduplication.
+	 *
+	 * (Note that a second pass over the page can deduplicate the item if that
+	 * is truly the only way to avoid a page split for checkingunique caller.)
+	 */
+	Assert(!state->checkingunique || state->nitems == 1 ||
+		   state->nhtids == state->nitems);
+	if (state->checkingunique)
+	{
+		minimum = 3;
+		/* May begin here if second pass over page is required */
+		if (state->nitems == 2 && state->skippedbase == InvalidOffsetNumber)
+			state->skippedbase = state->baseoff;
+	}
+
+	if (state->nitems >= minimum)
+	{
+		IndexTuple	final;
+		Size		finalsz;
+		OffsetNumber offnum;
+		OffsetNumber deletable[MaxOffsetNumber];
+		int			ndeletable = 0;
+
+		/* find all tuples that will be replaced with this new posting tuple */
+		for (offnum = state->baseoff;
+			 offnum < state->baseoff + state->nitems;
+			 offnum = OffsetNumberNext(offnum))
+			deletable[ndeletable++] = offnum;
+
+		/* Form a tuple with a posting list */
+		final = _bt_form_posting(state->base, state->htids, state->nhtids);
+		finalsz = IndexTupleSize(final);
+		spacesaving = state->phystupsize - (finalsz + sizeof(ItemIdData));
+		/* Must save some space, and must not exceed tuple limits */
+		Assert(spacesaving > 0 && spacesaving < BLCKSZ);
+		Assert(finalsz <= state->maxitemsize);
+		Assert(finalsz == MAXALIGN(IndexTupleSize(final)));
+
+		START_CRIT_SECTION();
+
+		/* Delete original items */
+		PageIndexMultiDelete(page, deletable, ndeletable);
+		/* Insert posting tuple, replacing original items */
+		if (PageAddItem(page, (Item) final, finalsz, state->baseoff, false,
+						false) == InvalidOffsetNumber)
+			elog(ERROR, "deduplication failed to add tuple to page");
+
+		MarkBufferDirty(buf);
+
+		/* Log deduplicated items */
+		if (logged)
+		{
+			XLogRecPtr	recptr;
+			xl_btree_dedup xlrec_dedup;
+
+			xlrec_dedup.baseoff = state->baseoff;
+			xlrec_dedup.nitems = state->nitems;
+
+			XLogBeginInsert();
+			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+			XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+			recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP_PAGE);
+
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
+
+		pfree(final);
+	}
+
+	/* Reset state for next pending posting list */
+	state->nhtids = 0;
+	state->nitems = 0;
+	state->phystupsize = 0;
+
+	return spacesaving;
+}
+
+/*
+ * Determine if page non-pivot tuples (data items) are all duplicates of the
+ * same value.
+ *
+ * In the event of a "single value" page, nbtsplitloc.c's single value
+ * strategy should be left with a clean split point as further duplicates are
+ * inserted and successive rightmost page splits occur among pages that store
+ * the same duplicate value.
+ */
+static bool
+_bt_use_singlevalue(Relation rel, Page page, BTDedupState state,
+					OffsetNumber minoff, IndexTuple newitem)
+{
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+	ItemId		itemid;
+	IndexTuple	itup;
+
+	itemid = PageGetItemId(page, minoff);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+
+	if (_bt_keep_natts_fast(rel, newitem, itup) > natts)
+	{
+		itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		if (_bt_keep_natts_fast(rel, newitem, itup) > natts)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Lower maxitemsize when using "single value" strategy, to avoid a third and
+ * final 1/3 of a page sized tuple.
+ *
+ * Called at the point when two large posting list tuples have already been
+ * created/observed.  The third and final posting list tuple should be
+ * somewhat smaller, so that the eventual page split has a useful split point.
+ * Subsequent split should leave the original/left page with a little free
+ * space.  It should be BTREE_SINGLEVAL_FILLFACTOR% full after split, with no
+ * non-pivot tuples left over.
+ *
+ * When there are 3 posting lists on the page, caller should end its
+ * deduplication pass altogether.  Remaining tuples on the page can be
+ * deduplicated later, after the original page splits (i.e. when the new right
+ * sibling page starts to get full itself).
+ */
+static void
+_bt_singlevalue_adjust(Page page, BTDedupState state, Size newitemsz)
+{
+	Size		leftfree;
+
+	/* This calculation needs to match nbtsplitloc.c */
+	leftfree = PageGetPageSize(page) - SizeOfPageHeaderData -
+		MAXALIGN(sizeof(BTPageOpaqueData));
+	/* Subtract predicted size of new high key */
+	leftfree -= newitemsz + MAXALIGN(sizeof(ItemPointerData));
+
+	/*
+	 * Reduce maxitemsize by an amount equal to target free space on left half
+	 * of page
+	 */
+	state->maxitemsize -= leftfree *
+		((100 - BTREE_SINGLEVAL_FILLFACTOR) / 100.0);
+}
+
+/*
+ * Verify posting list invariants for "posting", which must be a posting list
+ * tuple.  Used within assertions.
+ */
+#ifdef USE_ASSERT_CHECKING
+static bool
+_bt_posting_valid(IndexTuple posting)
+{
+	ItemPointerData last;
+	ItemPointer htid;
+
+	ItemPointerCopy(BTreeTupleGetHeapTID(posting), &last);
+
+	if (!BTreeTupleIsPosting(posting))
+		return false;
+	if (BTreeTupleGetNPosting(posting) < 2)
+		return false;
+	if (!ItemPointerIsValid(&last))
+		return false;
+
+	for (int i = 1; i < BTreeTupleGetNPosting(posting); i++)
+	{
+		htid = BTreeTupleGetPostingN(posting, i);
+
+		if (!ItemPointerIsValid(htid))
+			return false;
+		if (ItemPointerCompare(htid, &last) <= 0)
+			return false;
+		ItemPointerCopy(htid, &last);
+	}
+
+	return true;
+}
+#endif
+
+/*
+ * Build a posting list tuple based on caller's "base" index tuple and list of
+ * heap TIDs.  When nhtids == 1, builds a standard non-pivot tuple without a
+ * posting list. (Posting list tuples can never have a single heap TID, partly
+ * because that ensures that deduplication always reduces final MAXALIGN()'d
+ * size of entire tuple.)
+ *
+ * Convention is that posting list starts at a MAXALIGN()'d offset (rather
+ * than a SHORTALIGN()'d offset), in line with the approach taken when
+ * appending a heap TID to new pivot tuple/high key during suffix truncation.
+ * This sometimes wastes a little space that was only needed as alignment
+ * padding in the original tuple.  Following this convention simplifies the
+ * space accounting used when deduplicating a page (the same convention
+ * simplifies the accounting for choosing a point to split a page at).
+ *
+ * Note: Caller's "htids" array must be unique and already in ascending TID
+ * order.  Any existing heap TIDs from "base" won't automatically appear in
+ * returned posting list tuple (they must be included in htids array.)
+ */
+IndexTuple
+_bt_form_posting(IndexTuple base, ItemPointer htids, int nhtids)
+{
+	uint32		keysize,
+				newsize;
+	IndexTuple	itup;
+
+	if (BTreeTupleIsPosting(base))
+		keysize = BTreeTupleGetPostingOffset(base);
+	else
+		keysize = IndexTupleSize(base);
+
+	Assert(!BTreeTupleIsPivot(base));
+	Assert(nhtids > 0 && nhtids <= PG_UINT16_MAX);
+	Assert(keysize == MAXALIGN(keysize));
+
+	/*
+	 * Determine final size of new tuple.
+	 *
+	 * The calculation used when new tuple has a posting list needs to match
+	 * the code used within _bt_dedup_save_htid().
+	 */
+	if (nhtids > 1)
+		newsize = MAXALIGN(keysize +
+						   nhtids * sizeof(ItemPointerData));
+	else
+		newsize = keysize;
+
+	Assert(newsize <= INDEX_SIZE_MASK);
+	Assert(newsize == MAXALIGN(newsize));
+
+	/* Allocate memory using palloc0() (matches index_form_tuple()) */
+	itup = palloc0(newsize);
+	memcpy(itup, base, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+	if (nhtids > 1)
+	{
+		/* Form posting list tuple */
+		BTreeTupleSetPosting(itup, nhtids, keysize);
+		memcpy(BTreeTupleGetPosting(itup), htids,
+			   sizeof(ItemPointerData) * nhtids);
+		Assert(_bt_posting_valid(itup));
+	}
+	else
+	{
+		/* Form standard non-pivot tuple */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		ItemPointerCopy(htids, &itup->t_tid);
+		Assert(ItemPointerIsValid(&itup->t_tid));
+	}
+
+	return itup;
+}
+
+/*
+ * Prepare for a posting list split by swapping heap TID in newitem with heap
+ * TID from original posting list (the 'oposting' heap TID located at offset
+ * 'postingoff').  Modifies newitem, so caller should pass their own private
+ * copy that can safely be modified.
+ *
+ * Returns new posting list tuple, which is palloc()'d in caller's context.
+ * This is guaranteed to be the same size as 'oposting'.  Modified newitem is
+ * what caller actually inserts. (This generally happens inside the same
+ * critical section that performs an in-place update of old posting list using
+ * new posting list returned here).
+ *
+ * While the keys from newitem and oposting must be opclass equal, and must
+ * generate identical output when run through the underlying type's output
+ * function, it doesn't follow that their representations match exactly.
+ * Caller must avoid assuming that there can't be representational differences
+ * that make datums from oposting bigger or smaller than the corresponding
+ * datums from newitem.  For example, differences in TOAST input state might
+ * break a faulty assumption about tuple size (the executor is entitled to
+ * apply TOAST compression based on its own criteria).  It also seems possible
+ * that further representational variation will be introduced in the future,
+ * in order to support nbtree features like page-level prefix compression.
+ *
+ * See nbtree/README for details on the design of posting list splits.
+ */
+IndexTuple
+_bt_swap_posting(IndexTuple newitem, IndexTuple oposting, int postingoff)
+{
+	int			nhtids;
+	char	   *replacepos;
+	char	   *replaceposright;
+	Size		nmovebytes;
+	IndexTuple	nposting;
+
+	nhtids = BTreeTupleGetNPosting(oposting);
+	Assert(_bt_posting_valid(oposting));
+	Assert(postingoff > 0 && postingoff < nhtids);
+
+	/*
+	 * Move item pointers in posting list to make a gap for the new item's
+	 * heap TID.  We shift TIDs one place to the right, losing original
+	 * rightmost TID. (nmovebytes must not include TIDs to the left of
+	 * postingoff, nor the existing rightmost/max TID that gets overwritten.)
+	 */
+	nposting = CopyIndexTuple(oposting);
+	replacepos = (char *) BTreeTupleGetPostingN(nposting, postingoff);
+	replaceposright = (char *) BTreeTupleGetPostingN(nposting, postingoff + 1);
+	nmovebytes = (nhtids - postingoff - 1) * sizeof(ItemPointerData);
+	memmove(replaceposright, replacepos, nmovebytes);
+
+	/* Fill the gap at postingoff with TID of new item (original new TID) */
+	Assert(!BTreeTupleIsPivot(newitem) && !BTreeTupleIsPosting(newitem));
+	ItemPointerCopy(&newitem->t_tid, (ItemPointer) replacepos);
+
+	/* Now copy oposting's rightmost/max TID into new item (final new TID) */
+	ItemPointerCopy(BTreeTupleGetMaxHeapTID(oposting), &newitem->t_tid);
+
+	Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(nposting),
+							  BTreeTupleGetHeapTID(newitem)) < 0);
+	Assert(_bt_posting_valid(nposting));
+
+	return nposting;
+}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 7ddba3ff9f..99a4042d6c 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -28,6 +28,8 @@
 /* Minimum tree height for application of fastpath optimization */
 #define BTREE_FASTPATH_MIN_LEVEL	2
 
+/* GUC parameter */
+bool		deduplicate_btree_items = true;
 
 static Buffer _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
 
@@ -47,10 +49,12 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   BTStack stack,
 						   IndexTuple itup,
 						   OffsetNumber newitemoff,
+						   int postingoff,
 						   bool split_only_page);
 static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 						Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
-						IndexTuple newitem);
+						IndexTuple newitem, IndexTuple orignewitem,
+						IndexTuple nposting, uint16 postingoff);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 							  BTStack stack, bool is_root, bool is_only);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
@@ -125,6 +129,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	insertstate.itup_key = itup_key;
 	insertstate.bounds_valid = false;
 	insertstate.buf = InvalidBuffer;
+	insertstate.postingoff = 0;
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -295,7 +300,7 @@ top:
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
 		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
-					   itup, newitemoff, false);
+					   itup, newitemoff, insertstate.postingoff, false);
 	}
 	else
 	{
@@ -348,6 +353,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 	BTPageOpaque opaque;
 	Buffer		nbuf = InvalidBuffer;
 	bool		found = false;
+	bool		inposting = false;
+	bool		prevalldead = true;
+	int			curposti = 0;
 
 	/* Assume unique until we find a duplicate */
 	*is_unique = true;
@@ -369,6 +377,11 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 
 	/*
 	 * Scan over all equal tuples, looking for live conflicts.
+	 *
+	 * Note that each iteration of the loop processes one heap TID, not one
+	 * index tuple.  The page offset number won't be advanced for iterations
+	 * which process heap TIDs from posting list tuples until the last such
+	 * heap TID for the posting list (curposti will be advanced instead).
 	 */
 	Assert(!insertstate->bounds_valid || insertstate->low == offset);
 	Assert(!itup_key->anynullkeys);
@@ -430,7 +443,28 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 
 				/* okay, we gotta fetch the heap tuple ... */
 				curitup = (IndexTuple) PageGetItem(page, curitemid);
-				htid = curitup->t_tid;
+
+				/*
+				 * decide if this is the first heap TID in tuple we'll
+				 * process, or if we should continue to process current
+				 * posting list
+				 */
+				Assert(!BTreeTupleIsPivot(curitup));
+				if (!BTreeTupleIsPosting(curitup))
+				{
+					htid = curitup->t_tid;
+					inposting = false;
+				}
+				else if (!inposting)
+				{
+					/* First heap TID in posting list */
+					inposting = true;
+					prevalldead = true;
+					curposti = 0;
+				}
+
+				if (inposting)
+					htid = *BTreeTupleGetPostingN(curitup, curposti);
 
 				/*
 				 * If we are doing a recheck, we expect to find the tuple we
@@ -506,8 +540,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					 * not part of this chain because it had a different index
 					 * entry.
 					 */
-					htid = itup->t_tid;
-					if (table_index_fetch_tuple_check(heapRel, &htid,
+					if (table_index_fetch_tuple_check(heapRel, &itup->t_tid,
 													  SnapshotSelf, NULL))
 					{
 						/* Normal case --- it's still live */
@@ -565,12 +598,14 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 													RelationGetRelationName(rel))));
 					}
 				}
-				else if (all_dead)
+				else if (all_dead && (!inposting ||
+									  (prevalldead &&
+									   curposti == BTreeTupleGetNPosting(curitup) - 1)))
 				{
 					/*
-					 * The conflicting tuple (or whole HOT chain) is dead to
-					 * everyone, so we may as well mark the index entry
-					 * killed.
+					 * The conflicting tuple (or all HOT chains pointed to by
+					 * all posting list TIDs) is dead to everyone, so mark the
+					 * index entry killed.
 					 */
 					ItemIdMarkDead(curitemid);
 					opaque->btpo_flags |= BTP_HAS_GARBAGE;
@@ -584,14 +619,29 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					else
 						MarkBufferDirtyHint(insertstate->buf, true);
 				}
+
+				/*
+				 * Remember if posting list tuple has even a single HOT chain
+				 * whose members are not all dead
+				 */
+				if (!all_dead && inposting)
+					prevalldead = false;
 			}
 		}
 
-		/*
-		 * Advance to next tuple to continue checking.
-		 */
-		if (offset < maxoff)
+		if (inposting && curposti < BTreeTupleGetNPosting(curitup) - 1)
+		{
+			/* Advance to next TID in same posting list */
+			curposti++;
+			continue;
+		}
+		else if (offset < maxoff)
+		{
+			/* Advance to next tuple */
+			curposti = 0;
+			inposting = false;
 			offset = OffsetNumberNext(offset);
+		}
 		else
 		{
 			int			highkeycmp;
@@ -616,6 +666,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					elog(ERROR, "fell off the end of index \"%s\"",
 						 RelationGetRelationName(rel));
 			}
+			curposti = 0;
+			inposting = false;
 			maxoff = PageGetMaxOffsetNumber(page);
 			offset = P_FIRSTDATAKEY(opaque);
 			/* Don't invalidate binary search bounds */
@@ -684,6 +736,7 @@ _bt_findinsertloc(Relation rel,
 	BTScanInsert itup_key = insertstate->itup_key;
 	Page		page = BufferGetPage(insertstate->buf);
 	BTPageOpaque lpageop;
+	OffsetNumber newitemoff;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -699,6 +752,8 @@ _bt_findinsertloc(Relation rel,
 
 	if (itup_key->heapkeyspace)
 	{
+		bool		dedupunique = false;
+
 		/*
 		 * If we're inserting into a unique index, we may have to walk right
 		 * through leaf pages to find the one leaf page that we must insert on
@@ -712,9 +767,25 @@ _bt_findinsertloc(Relation rel,
 		 * tuple belongs on.  The heap TID attribute for new tuple (scantid)
 		 * could force us to insert on a sibling page, though that should be
 		 * very rare in practice.
+		 *
+		 * checkingunique inserters that encounter a duplicate will apply
+		 * deduplication when it looks like there will be a page split, but
+		 * there is no LP_DEAD garbage on the leaf page to vacuum away (or
+		 * there wasn't enough space freed by LP_DEAD cleanup).  This
+		 * complements the opportunistic LP_DEAD vacuuming mechanism.  The
+		 * high level goal is to avoid page splits caused by new, unchanged
+		 * versions of existing logical rows altogether.  See nbtree/README
+		 * for full details.
 		 */
 		if (checkingunique)
 		{
+			if (insertstate->low < insertstate->stricthigh)
+			{
+				/* Encountered a duplicate in _bt_check_unique() */
+				Assert(insertstate->bounds_valid);
+				dedupunique = true;
+			}
+
 			for (;;)
 			{
 				/*
@@ -741,18 +812,37 @@ _bt_findinsertloc(Relation rel,
 				/* Update local state after stepping right */
 				page = BufferGetPage(insertstate->buf);
 				lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+				/* Assume duplicates (helpful when initial page is empty) */
+				dedupunique = true;
 			}
 		}
 
 		/*
 		 * If the target page is full, see if we can obtain enough space by
-		 * erasing LP_DEAD items
+		 * erasing LP_DEAD items.  If that doesn't work out, try to obtain
+		 * enough free space to avoid a page split by deduplicating existing
+		 * items (if deduplication is safe).
 		 */
-		if (PageGetFreeSpace(page) < insertstate->itemsz &&
-			P_HAS_GARBAGE(lpageop))
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
 		{
-			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
-			insertstate->bounds_valid = false;
+			if (P_HAS_GARBAGE(lpageop))
+			{
+				_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+				insertstate->bounds_valid = false;
+
+				/* Might as well assume duplicates if checkingunique */
+				dedupunique = true;
+			}
+
+			if (itup_key->safededup && BTGetUseDedup(rel) &&
+				PageGetFreeSpace(page) < insertstate->itemsz &&
+				(!checkingunique || dedupunique))
+			{
+				_bt_dedup_one_page(rel, insertstate->buf, heapRel,
+								   insertstate->itup, insertstate->itemsz,
+								   checkingunique);
+				insertstate->bounds_valid = false;
+			}
 		}
 	}
 	else
@@ -834,7 +924,36 @@ _bt_findinsertloc(Relation rel,
 	Assert(P_RIGHTMOST(lpageop) ||
 		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
 
-	return _bt_binsrch_insert(rel, insertstate);
+	newitemoff = _bt_binsrch_insert(rel, insertstate);
+
+	if (insertstate->postingoff == -1)
+	{
+		/*
+		 * There is an overlapping posting list tuple with its LP_DEAD bit
+		 * set.  _bt_insertonpg() cannot handle this, so delete all LP_DEAD
+		 * items early.  This is the only case where LP_DEAD deletes happen
+		 * even though a page split wouldn't take place if we went straight to
+		 * the _bt_insertonpg() call.
+		 *
+		 * Call _bt_dedup_one_page() instead of _bt_vacuum_one_page() to force
+		 * deletes (this avoids relying on the BTP_HAS_GARBAGE hint flag,
+		 * which might be falsely unset).  Call can't actually dedup items,
+		 * since we pass a newitemsz of 0.
+		 */
+		_bt_dedup_one_page(rel, insertstate->buf, heapRel, insertstate->itup,
+						   0, true);
+
+		/*
+		 * Do new binary search, having killed LP_DEAD items.  New insert
+		 * location cannot overlap with any posting list now.
+		 */
+		insertstate->bounds_valid = false;
+		insertstate->postingoff = 0;
+		newitemoff = _bt_binsrch_insert(rel, insertstate);
+		Assert(insertstate->postingoff == 0);
+	}
+
+	return newitemoff;
 }
 
 /*
@@ -900,10 +1019,12 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		This recursive procedure does the following things:
  *
+ *			+  if postingoff != 0, splits existing posting list tuple
+ *			   (since it overlaps with new 'itup' tuple).
  *			+  if necessary, splits the target page, using 'itup_key' for
  *			   suffix truncation on leaf pages (caller passes NULL for
  *			   non-leaf pages).
- *			+  inserts the tuple.
+ *			+  inserts the new tuple (might be split from posting list).
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
  *			   right using information stored in the parent stack).
@@ -931,11 +1052,15 @@ _bt_insertonpg(Relation rel,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
+			   int postingoff,
 			   bool split_only_page)
 {
 	Page		page;
 	BTPageOpaque lpageop;
 	Size		itemsz;
+	IndexTuple	oposting;
+	IndexTuple	origitup = NULL;
+	IndexTuple	nposting = NULL;
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -949,6 +1074,7 @@ _bt_insertonpg(Relation rel,
 	Assert(P_ISLEAF(lpageop) ||
 		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
+	Assert(!BTreeTupleIsPosting(itup));
 
 	/* The caller should've finished any incomplete splits already. */
 	if (P_INCOMPLETE_SPLIT(lpageop))
@@ -959,6 +1085,34 @@ _bt_insertonpg(Relation rel,
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
+	/*
+	 * Do we need to split an existing posting list item?
+	 */
+	if (postingoff != 0)
+	{
+		ItemId		itemid = PageGetItemId(page, newitemoff);
+
+		/*
+		 * The new tuple is a duplicate with a heap TID that falls inside the
+		 * range of an existing posting list tuple on a leaf page.  Prepare to
+		 * split an existing posting list.  Overwriting the posting list with
+		 * its post-split version is treated as an extra step in either the
+		 * insert or page split critical section.
+		 */
+		Assert(P_ISLEAF(lpageop) && !ItemIdIsDead(itemid));
+		Assert(itup_key->heapkeyspace && itup_key->safededup);
+		oposting = (IndexTuple) PageGetItem(page, itemid);
+
+		/* use a mutable copy of itup as our itup from here on */
+		origitup = itup;
+		itup = CopyIndexTuple(origitup);
+		nposting = _bt_swap_posting(itup, oposting, postingoff);
+		/* itup now contains rightmost/max TID from oposting */
+
+		/* Alter offset so that newitem goes after posting list */
+		newitemoff = OffsetNumberNext(newitemoff);
+	}
+
 	/*
 	 * Do we need to split the page to fit the item on it?
 	 *
@@ -991,7 +1145,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+						 origitup, nposting, postingoff);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1066,6 +1221,9 @@ _bt_insertonpg(Relation rel,
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
+		if (postingoff != 0)
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
 		if (!_bt_pgaddtup(page, itemsz, itup, newitemoff))
 			elog(PANIC, "failed to add new item to block %u in index \"%s\"",
 				 itup_blkno, RelationGetRelationName(rel));
@@ -1115,8 +1273,19 @@ _bt_insertonpg(Relation rel,
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
 
-			if (P_ISLEAF(lpageop))
+			if (P_ISLEAF(lpageop) && postingoff == 0)
+			{
+				/* Simple leaf insert */
 				xlinfo = XLOG_BTREE_INSERT_LEAF;
+			}
+			else if (postingoff != 0)
+			{
+				/*
+				 * Leaf insert with posting list split.  Must include
+				 * postingoff field before newitem/orignewitem.
+				 */
+				xlinfo = XLOG_BTREE_INSERT_POST;
+			}
 			else
 			{
 				/*
@@ -1139,6 +1308,7 @@ _bt_insertonpg(Relation rel,
 				xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 				xlmeta.last_cleanup_num_heap_tuples =
 					metad->btm_last_cleanup_num_heap_tuples;
+				xlmeta.safededup = metad->btm_safededup;
 
 				XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 				XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));
@@ -1147,7 +1317,27 @@ _bt_insertonpg(Relation rel,
 			}
 
 			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
-			XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+			if (postingoff == 0)
+			{
+				/* Simple, common case -- log itup from caller */
+				XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+			}
+			else
+			{
+				/*
+				 * Insert with posting list split (XLOG_BTREE_INSERT_POST
+				 * record) case.
+				 *
+				 * Log postingoff.  Also log origitup, not itup.  REDO routine
+				 * must reconstruct final itup (as well as nposting) using
+				 * _bt_swap_posting().
+				 */
+				uint16		upostingoff = postingoff;
+
+				XLogRegisterBufData(0, (char *) &upostingoff, sizeof(uint16));
+				XLogRegisterBufData(0, (char *) origitup,
+									IndexTupleSize(origitup));
+			}
 
 			recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
@@ -1189,6 +1379,14 @@ _bt_insertonpg(Relation rel,
 			_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
 			RelationSetTargetBlock(rel, cachedBlock);
 	}
+
+	/* be tidy */
+	if (postingoff != 0)
+	{
+		/* itup is actually a modified copy of caller's original */
+		pfree(nposting);
+		pfree(itup);
+	}
 }
 
 /*
@@ -1204,12 +1402,24 @@ _bt_insertonpg(Relation rel,
  *		This function will clear the INCOMPLETE_SPLIT flag on it, and
  *		release the buffer.
  *
+ *		orignewitem, nposting, and postingoff are needed when an insert of
+ *		orignewitem results in both a posting list split and a page split.
+ *		These extra posting list split details are used here in the same
+ *		way as they are used in the more common case where a posting list
+ *		split does not coincide with a page split.  We need to deal with
+ *		posting list splits directly in order to ensure that everything
+ *		that follows from the insert of orignewitem is handled as a single
+ *		atomic operation (though caller's insert of a new pivot/downlink
+ *		into parent page will still be a separate operation).  See
+ *		nbtree/README for details on the design of posting list splits.
+ *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
  */
 static Buffer
 _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+		  IndexTuple orignewitem, IndexTuple nposting, uint16 postingoff)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1229,6 +1439,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	OffsetNumber leftoff,
 				rightoff;
 	OffsetNumber firstright;
+	OffsetNumber origpagepostingoff;
 	OffsetNumber maxoff;
 	OffsetNumber i;
 	bool		newitemonleft,
@@ -1298,6 +1509,34 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	PageSetLSN(leftpage, PageGetLSN(origpage));
 	isleaf = P_ISLEAF(oopaque);
 
+	/*
+	 * Determine page offset number of existing overlapped-with-orignewitem
+	 * posting list when it is necessary to perform a posting list split in
+	 * passing.  Note that newitem was already changed by caller (newitem no
+	 * longer has the orignewitem TID).
+	 *
+	 * This page offset number (origpagepostingoff) will be used to pretend
+	 * that the posting split has already taken place, even though the
+	 * required modifications to origpage won't occur until we reach the
+	 * critical section.  The lastleft and firstright tuples of our page split
+	 * point should, in effect, come from an imaginary version of origpage
+	 * that has the nposting tuple instead of the original posting list tuple.
+	 *
+	 * Note: _bt_findsplitloc() should have compensated for coinciding posting
+	 * list splits in just the same way, at least in theory.  It doesn't
+	 * bother with that, though.  In practice it won't affect its choice of
+	 * split point.
+	 */
+	origpagepostingoff = InvalidOffsetNumber;
+	if (postingoff != 0)
+	{
+		Assert(isleaf);
+		Assert(ItemPointerCompare(&orignewitem->t_tid,
+								  &newitem->t_tid) < 0);
+		Assert(BTreeTupleIsPosting(nposting));
+		origpagepostingoff = OffsetNumberPrev(newitemoff);
+	}
+
 	/*
 	 * The "high key" for the new left page will be the first key that's going
 	 * to go into the new right page, or a truncated version if this is a leaf
@@ -1335,6 +1574,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, firstright);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		if (firstright == origpagepostingoff)
+			item = nposting;
 	}
 
 	/*
@@ -1368,6 +1609,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
 			itemid = PageGetItemId(origpage, lastleftoff);
 			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+			if (lastleftoff == origpagepostingoff)
+				lastleft = nposting;
 		}
 
 		Assert(lastleft != item);
@@ -1383,6 +1626,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	 */
 	leftoff = P_HIKEY;
 
+	Assert(BTreeTupleIsPivot(lefthikey) || !itup_key->heapkeyspace);
 	Assert(BTreeTupleGetNAtts(lefthikey, rel) > 0);
 	Assert(BTreeTupleGetNAtts(lefthikey, rel) <= indnkeyatts);
 	if (PageAddItem(leftpage, (Item) lefthikey, itemsz, leftoff,
@@ -1447,6 +1691,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		Assert(BTreeTupleIsPivot(item) || !itup_key->heapkeyspace);
 		Assert(BTreeTupleGetNAtts(item, rel) > 0);
 		Assert(BTreeTupleGetNAtts(item, rel) <= indnkeyatts);
 		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
@@ -1475,8 +1720,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 
+		/* replace original item with nposting due to posting split? */
+		if (i == origpagepostingoff)
+		{
+			Assert(BTreeTupleIsPosting(item));
+			Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+			item = nposting;
+		}
+
 		/* does new item belong before this one? */
-		if (i == newitemoff)
+		else if (i == newitemoff)
 		{
 			if (newitemonleft)
 			{
@@ -1645,8 +1898,12 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		XLogRecPtr	recptr;
 
 		xlrec.level = ropaque->btpo.level;
+		/* See comments below on newitem, orignewitem, and posting lists */
 		xlrec.firstright = firstright;
 		xlrec.newitemoff = newitemoff;
+		xlrec.postingoff = 0;
+		if (postingoff != 0 && origpagepostingoff < firstright)
+			xlrec.postingoff = postingoff;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1665,11 +1922,35 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		 * because it's included with all the other items on the right page.)
 		 * Show the new item as belonging to the left page buffer, so that it
 		 * is not stored if XLogInsert decides it needs a full-page image of
-		 * the left page.  We store the offset anyway, though, to support
-		 * archive compression of these records.
+		 * the left page.  We always store newitemoff in the record, though.
+		 *
+		 * The details are sometimes slightly different for page splits that
+		 * coincide with a posting list split.  If both the replacement
+		 * posting list and newitem go on the right page, then we don't need
+		 * to log anything extra, just like the simple !newitemonleft
+		 * no-posting-split case (postingoff is set to zero in the WAL record,
+		 * so recovery doesn't need to process a posting list split at all).
+		 * Otherwise, we set postingoff and log orignewitem instead of
+		 * newitem, despite having actually inserted newitem.  REDO routine
+		 * must reconstruct nposting and newitem using _bt_swap_posting().
+		 *
+		 * Note: It's possible that our page split point is the point that
+		 * makes the posting list lastleft and newitem firstright.  This is
+		 * the only case where we log orignewitem/newitem despite newitem
+		 * going on the right page.  If XLogInsert decides that it can omit
+		 * orignewitem due to logging a full-page image of the left page,
+		 * everything still works out, since recovery only needs to log
+		 * orignewitem for items on the left page (just like the regular
+		 * newitem-logged case).
 		 */
-		if (newitemonleft)
+		if (newitemonleft && xlrec.postingoff == 0)
 			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+		else if (xlrec.postingoff != 0)
+		{
+			Assert(newitemonleft || firstright == newitemoff);
+			Assert(MAXALIGN(newitemsz) == IndexTupleSize(orignewitem));
+			XLogRegisterBufData(0, (char *) orignewitem, MAXALIGN(newitemsz));
+		}
 
 		/* Log the left page's new high key */
 		itemid = PageGetItemId(origpage, P_HIKEY);
@@ -1829,7 +2110,7 @@ _bt_insert_parent(Relation rel,
 
 		/* Recursively insert into the parent */
 		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
-					   new_item, stack->bts_offset + 1,
+					   new_item, stack->bts_offset + 1, 0,
 					   is_only);
 
 		/* be tidy */
@@ -2185,6 +2466,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 		md.fastlevel = metad->btm_level;
 		md.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 		md.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+		md.safededup = metad->btm_safededup;
 
 		XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -2298,6 +2580,6 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * Note: if we didn't find any LP_DEAD items, then the page's
 	 * BTP_HAS_GARBAGE hint bit is falsely set.  We do not bother expending a
 	 * separate write to clear it, however.  We will clear it when we split
-	 * the page.
+	 * the page, or when deduplication runs.
 	 */
 }
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index f05cbe7467..53d77ac439 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -37,6 +38,8 @@ static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
 static bool _bt_mark_page_halfdead(Relation rel, Buffer buf, BTStack stack);
 static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
 									 bool *rightsib_empty);
+static TransactionId _bt_xid_horizon(Relation rel, Relation heapRel, Page page,
+									 OffsetNumber *deletable, int ndeletable);
 static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
 								   BTStack stack, Buffer *topparent, OffsetNumber *topoff,
 								   BlockNumber *target, BlockNumber *rightsib);
@@ -47,7 +50,8 @@ static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
  *	_bt_initmetapage() -- Fill a page buffer with a correct metapage image
  */
 void
-_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
+_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+				 bool safededup)
 {
 	BTMetaPageData *metad;
 	BTPageOpaque metaopaque;
@@ -63,6 +67,7 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
 	metad->btm_fastlevel = level;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+	metad->btm_safededup = safededup;
 
 	metaopaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	metaopaque->btpo_flags = BTP_META;
@@ -102,6 +107,9 @@ _bt_upgrademetapage(Page page)
 	metad->btm_version = BTREE_NOVAC_VERSION;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+	/* Only a REINDEX can set this field */
+	Assert(!metad->btm_safededup);
+	metad->btm_safededup = false;
 
 	/* Adjust pd_lower (see _bt_initmetapage() for details) */
 	((PageHeader) page)->pd_lower =
@@ -213,6 +221,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 		md.fastlevel = metad->btm_fastlevel;
 		md.oldest_btpo_xact = oldestBtpoXact;
 		md.last_cleanup_num_heap_tuples = numHeapTuples;
+		md.safededup = metad->btm_safededup;
 
 		XLogRegisterBufData(0, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -274,6 +283,8 @@ _bt_getroot(Relation rel, int access)
 		Assert(metad->btm_magic == BTREE_MAGIC);
 		Assert(metad->btm_version >= BTREE_MIN_VERSION);
 		Assert(metad->btm_version <= BTREE_VERSION);
+		Assert(!metad->btm_safededup ||
+			   metad->btm_version > BTREE_NOVAC_VERSION);
 		Assert(metad->btm_root != P_NONE);
 
 		rootblkno = metad->btm_fastroot;
@@ -394,6 +405,7 @@ _bt_getroot(Relation rel, int access)
 			md.fastlevel = 0;
 			md.oldest_btpo_xact = InvalidTransactionId;
 			md.last_cleanup_num_heap_tuples = -1.0;
+			md.safededup = metad->btm_safededup;
 
 			XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -618,22 +630,33 @@ _bt_getrootheight(Relation rel)
 	Assert(metad->btm_magic == BTREE_MAGIC);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
 	Assert(metad->btm_version <= BTREE_VERSION);
+	Assert(!metad->btm_safededup || metad->btm_version > BTREE_NOVAC_VERSION);
 	Assert(metad->btm_fastroot != P_NONE);
 
 	return metad->btm_fastlevel;
 }
 
 /*
- *	_bt_heapkeyspace() -- is heap TID being treated as a key?
+ *	_bt_metaversion() -- Get version/status info from metapage.
+ *
+ *		Sets caller's *heapkeyspace and *safededup arguments using data from
+ *		the B-Tree metapage (could be locally-cached version).  This
+ *		information needs to be stashed in insertion scankey, so we provide a
+ *		single function that fetches both at once.
  *
  *		This is used to determine the rules that must be used to descend a
  *		btree.  Version 4 indexes treat heap TID as a tiebreaker attribute.
  *		pg_upgrade'd version 3 indexes need extra steps to preserve reasonable
  *		performance when inserting a new BTScanInsert-wise duplicate tuple
  *		among many leaf pages already full of such duplicates.
+ *
+ *		Also sets field that indicates to caller whether or not it is safe to
+ *		apply deduplication within index.  Note that we rely on the assumption
+ *		that btm_safededup will be zero'ed on heapkeyspace indexes that were
+ *		pg_upgrade'd from Postgres 12.
  */
-bool
-_bt_heapkeyspace(Relation rel)
+void
+_bt_metaversion(Relation rel, bool *heapkeyspace, bool *safededup)
 {
 	BTMetaPageData *metad;
 
@@ -651,10 +674,11 @@ _bt_heapkeyspace(Relation rel)
 		 */
 		if (metad->btm_root == P_NONE)
 		{
-			uint32		btm_version = metad->btm_version;
+			*heapkeyspace = metad->btm_version > BTREE_NOVAC_VERSION;
+			*safededup = metad->btm_safededup;
 
 			_bt_relbuf(rel, metabuf);
-			return btm_version > BTREE_NOVAC_VERSION;
+			return;
 		}
 
 		/*
@@ -678,9 +702,11 @@ _bt_heapkeyspace(Relation rel)
 	Assert(metad->btm_magic == BTREE_MAGIC);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
 	Assert(metad->btm_version <= BTREE_VERSION);
+	Assert(!metad->btm_safededup || metad->btm_version > BTREE_NOVAC_VERSION);
 	Assert(metad->btm_fastroot != P_NONE);
 
-	return metad->btm_version > BTREE_NOVAC_VERSION;
+	*heapkeyspace = metad->btm_version > BTREE_NOVAC_VERSION;
+	*safededup = metad->btm_safededup;
 }
 
 /*
@@ -964,28 +990,88 @@ _bt_page_recyclable(Page page)
  * Delete item(s) from a btree leaf page during VACUUM.
  *
  * This routine assumes that the caller has a super-exclusive write lock on
- * the buffer.  Also, the given deletable array *must* be sorted in ascending
- * order.
+ * the buffer.  Also, the given deletable and updatable arrays *must* be
+ * sorted in ascending order.
+ *
+ * Routine deals with deleting TIDs when some (but not all) of the heap TIDs
+ * in an existing posting list item are to be removed by VACUUM.  This works
+ * by updating/overwriting an existing item with caller's new version of the
+ * item (a version that lacks the TIDs that are to be deleted).
  *
  * We record VACUUMs and b-tree deletes differently in WAL.  Deletes must
  * generate their own latestRemovedXid by accessing the heap directly, whereas
- * VACUUMs rely on the initial heap scan taking care of it indirectly.
+ * VACUUMs rely on the initial heap scan taking care of it indirectly.  Also,
+ * only VACUUM can perform granular deletes of individual TIDs in posting list
+ * tuples.
  */
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
-					OffsetNumber *deletable, int ndeletable)
+					OffsetNumber *deletable, int ndeletable,
+					OffsetNumber *updatable, IndexTuple *updated,
+					int nupdatable)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	IndexTuple	itup;
+	Size		itemsz;
+	char	   *updatedbuf = NULL;
+	Size		updatedbuflen = 0;
 
 	/* Shouldn't be called unless there's something to do */
-	Assert(ndeletable > 0);
+	Assert(ndeletable > 0 || nupdatable > 0);
+
+	/* XLOG stuff -- allocate and fill buffer before critical section */
+	if (nupdatable > 0 && RelationNeedsWAL(rel))
+	{
+		Size		offset = 0;
+
+		for (int i = 0; i < nupdatable; i++)
+		{
+			itup = updated[i];
+			itemsz = MAXALIGN(IndexTupleSize(itup));
+			updatedbuflen += itemsz;
+		}
+
+		updatedbuf = palloc(updatedbuflen);
+		for (int i = 0; i < nupdatable; i++)
+		{
+			itup = updated[i];
+			itemsz = MAXALIGN(IndexTupleSize(itup));
+			memcpy(updatedbuf + offset, itup, itemsz);
+			offset += itemsz;
+		}
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
-	/* Fix the page */
-	PageIndexMultiDelete(page, deletable, ndeletable);
+	/*
+	 * Handle posting tuple updates.
+	 *
+	 * Deliberately do this before handling simple deletes.  If we did it the
+	 * other way around (i.e. WAL record order -- simple deletes before
+	 * updates) then we'd have to make compensating changes to the 'updatable'
+	 * array of offset numbers.
+	 *
+	 * PageIndexTupleOverwrite() won't unset each item's LP_DEAD bit when it
+	 * happens to already be set.  Although we unset the BTP_HAS_GARBAGE page
+	 * level flag, unsetting individual LP_DEAD bits should still be avoided.
+	 */
+	for (int i = 0; i < nupdatable; i++)
+	{
+		OffsetNumber offnum = updatable[i];
+
+		itup = updated[i];
+		itemsz = MAXALIGN(IndexTupleSize(itup));
+
+		if (!PageIndexTupleOverwrite(page, offnum, (Item) itup, itemsz))
+			elog(PANIC, "could not update partially dead item in block %u of index \"%s\"",
+				 BufferGetBlockNumber(buf), RelationGetRelationName(rel));
+	}
+
+	/* Now handle simple deletes of entire physical tuples */
+	if (ndeletable > 0)
+		PageIndexMultiDelete(page, deletable, ndeletable);
 
 	/*
 	 * We can clear the vacuum cycle ID since this page has certainly been
@@ -1006,7 +1092,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	 * limited, since we never falsely unset an LP_DEAD bit.  Workloads that
 	 * are particularly dependent on LP_DEAD bits being set quickly will
 	 * usually manage to set the BTP_HAS_GARBAGE flag before the page fills up
-	 * again anyway.
+	 * again anyway.  Furthermore, attempting a deduplication pass will remove
+	 * all LP_DEAD items, regardless of whether the BTP_HAS_GARBAGE hint bit
+	 * is set or not.
 	 */
 	opaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 
@@ -1019,18 +1107,22 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.ndeleted = ndeletable;
+		xlrec_vacuum.nupdated = nupdatable;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
 		XLogRegisterData((char *) &xlrec_vacuum, SizeOfBtreeVacuum);
 
-		/*
-		 * The deletable array is not in the buffer, but pretend that it is.
-		 * When XLogInsert stores the whole buffer, the array need not be
-		 * stored too.
-		 */
-		XLogRegisterBufData(0, (char *) deletable,
-							ndeletable * sizeof(OffsetNumber));
+		if (ndeletable > 0)
+			XLogRegisterBufData(0, (char *) deletable,
+								ndeletable * sizeof(OffsetNumber));
+
+		if (nupdatable > 0)
+		{
+			XLogRegisterBufData(0, (char *) updatable,
+								nupdatable * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, updatedbuf, updatedbuflen);
+		}
 
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
@@ -1038,6 +1130,10 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	}
 
 	END_CRIT_SECTION();
+
+	/* can't leak memory here */
+	if (updatedbuf != NULL)
+		pfree(updatedbuf);
 }
 
 /*
@@ -1050,6 +1146,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
  * This is nearly the same as _bt_delitems_vacuum as far as what it does to
  * the page, but it needs to generate its own latestRemovedXid by accessing
  * the heap.  This is used by the REDO routine to generate recovery conflicts.
+ * Also, it doesn't handle posting list tuples unless the entire physical
+ * tuple can be deleted as a whole (since there is only one LP_DEAD bit per
+ * line pointer).
  */
 void
 _bt_delitems_delete(Relation rel, Buffer buf,
@@ -1065,8 +1164,7 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 
 	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
 		latestRemovedXid =
-			index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
-												 deletable, ndeletable);
+			_bt_xid_horizon(rel, heapRel, page, deletable, ndeletable);
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
@@ -1113,6 +1211,83 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 	END_CRIT_SECTION();
 }
 
+/*
+ * Get the latestRemovedXid from the table entries pointed to by the non-pivot
+ * tuples being deleted.
+ *
+ * This is a specialized version of index_compute_xid_horizon_for_tuples().
+ * It's needed because btree tuples don't always store table TID using the
+ * standard index tuple header field.
+ */
+static TransactionId
+_bt_xid_horizon(Relation rel, Relation heapRel, Page page,
+				OffsetNumber *deletable, int ndeletable)
+{
+	TransactionId latestRemovedXid = InvalidTransactionId;
+	int			spacenhtids;
+	int			nhtids;
+	ItemPointer htids;
+
+	/* Array will grow iff there are posting list tuples to consider */
+	spacenhtids = ndeletable;
+	nhtids = 0;
+	htids = (ItemPointer) palloc(sizeof(ItemPointerData) * spacenhtids);
+	for (int i = 0; i < ndeletable; i++)
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+
+		itemid = PageGetItemId(page, deletable[i]);
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(ItemIdIsDead(itemid));
+		Assert(!BTreeTupleIsPivot(itup));
+
+		if (!BTreeTupleIsPosting(itup))
+		{
+			if (nhtids + 1 > spacenhtids)
+			{
+				spacenhtids *= 2;
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * spacenhtids);
+			}
+
+			Assert(ItemPointerIsValid(&itup->t_tid));
+			ItemPointerCopy(&itup->t_tid, &htids[nhtids]);
+			nhtids++;
+		}
+		else
+		{
+			int			nposting = BTreeTupleGetNPosting(itup);
+
+			if (nhtids + nposting > spacenhtids)
+			{
+				spacenhtids = Max(spacenhtids * 2, nhtids + nposting);
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * spacenhtids);
+			}
+
+			for (int j = 0; j < nposting; j++)
+			{
+				ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+				Assert(ItemPointerIsValid(htid));
+				ItemPointerCopy(htid, &htids[nhtids]);
+				nhtids++;
+			}
+		}
+	}
+
+	Assert(nhtids >= ndeletable);
+
+	latestRemovedXid =
+		table_compute_xid_horizon_for_tuples(heapRel, htids, nhtids);
+
+	pfree(htids);
+
+	return latestRemovedXid;
+}
+
 /*
  * Returns true, if the given block has the half-dead flag set.
  */
@@ -2058,6 +2233,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 			xlmeta.fastlevel = metad->btm_fastlevel;
 			xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 			xlmeta.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+			xlmeta.safededup = metad->btm_safededup;
 
 			XLogRegisterBufData(4, (char *) &xlmeta, sizeof(xl_btree_metadata));
 			xlinfo = XLOG_BTREE_UNLINK_PAGE_META;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 8376a5e6b7..fd994d2b1f 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -95,6 +95,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static ItemPointer btreevacuumposting(BTVacState *vstate, IndexTuple posting,
+									  int *nremaining);
 
 
 /*
@@ -158,7 +160,7 @@ btbuildempty(Relation index)
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, P_NONE, 0);
+	_bt_initmetapage(metapage, P_NONE, 0, _bt_opclasses_support_dedup(index));
 
 	/*
 	 * Write the page and log it.  It might seem that an immediate sync would
@@ -261,8 +263,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 				 */
 				if (so->killedItems == NULL)
 					so->killedItems = (int *)
-						palloc(MaxIndexTuplesPerPage * sizeof(int));
-				if (so->numKilled < MaxIndexTuplesPerPage)
+						palloc(MaxTIDsPerBTreePage * sizeof(int));
+				if (so->numKilled < MaxTIDsPerBTreePage)
 					so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 			}
 
@@ -1151,11 +1153,16 @@ restart:
 	}
 	else if (P_ISLEAF(opaque))
 	{
-		OffsetNumber deletable[MaxOffsetNumber];
+		OffsetNumber deletable[MaxIndexTuplesPerPage];
 		int			ndeletable;
+		IndexTuple	updated[MaxIndexTuplesPerPage];
+		OffsetNumber updatable[MaxIndexTuplesPerPage];
+		int			nupdatable;
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
+		int			nhtidsdead,
+					nhtidslive;
 
 		/*
 		 * Trade in the initial read lock for a super-exclusive write lock on
@@ -1187,8 +1194,11 @@ restart:
 		 * point using callback.
 		 */
 		ndeletable = 0;
+		nupdatable = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
+		nhtidsdead = 0;
+		nhtidslive = 0;
 		if (callback)
 		{
 			for (offnum = minoff;
@@ -1196,11 +1206,9 @@ restart:
 				 offnum = OffsetNumberNext(offnum))
 			{
 				IndexTuple	itup;
-				ItemPointer htup;
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
 				/*
 				 * Hot Standby assumes that it's okay that XLOG_BTREE_VACUUM
@@ -1223,22 +1231,86 @@ restart:
 				 * simple, and allows us to always avoid generating our own
 				 * conflicts.
 				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				Assert(!BTreeTupleIsPivot(itup));
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Regular tuple, standard table TID representation */
+					if (callback(&itup->t_tid, callback_state))
+					{
+						deletable[ndeletable++] = offnum;
+						nhtidsdead++;
+					}
+					else
+						nhtidslive++;
+				}
+				else
+				{
+					ItemPointer newhtids;
+					int			nremaining;
+
+					/* Posting list tuple */
+					newhtids = btreevacuumposting(vstate, itup, &nremaining);
+					if (newhtids == NULL)
+					{
+						/*
+						 * All table TIDs from the posting tuple remain, so no
+						 * delete or update required
+						 */
+						Assert(nremaining == BTreeTupleGetNPosting(itup));
+					}
+					else if (nremaining > 0)
+					{
+						IndexTuple	updatedtuple;
+
+						/*
+						 * Form new tuple that contains only remaining TIDs.
+						 * Remember this new tuple and the offset of the tuple
+						 * to be updated for the page's _bt_delitems_vacuum()
+						 * call.
+						 */
+						Assert(nremaining < BTreeTupleGetNPosting(itup));
+						updatedtuple = _bt_form_posting(itup, newhtids,
+														nremaining);
+						updated[nupdatable] = updatedtuple;
+						updatable[nupdatable++] = offnum;
+						nhtidsdead += BTreeTupleGetNPosting(itup) - nremaining;
+						pfree(newhtids);
+					}
+					else
+					{
+						/*
+						 * All table TIDs from the posting list must be
+						 * deleted.  We'll delete the physical index tuple
+						 * completely (no update).
+						 */
+						Assert(nremaining == 0);
+						deletable[ndeletable++] = offnum;
+						nhtidsdead += BTreeTupleGetNPosting(itup);
+						pfree(newhtids);
+					}
+
+					nhtidslive += nremaining;
+				}
 			}
 		}
 
 		/*
-		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
-		 * call per page, so as to minimize WAL traffic.
+		 * Apply any needed deletes or updates.  We issue just one
+		 * _bt_delitems_vacuum() call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nupdatable > 0)
 		{
-			_bt_delitems_vacuum(rel, buf, deletable, ndeletable);
+			Assert(nhtidsdead >= Max(ndeletable, 1));
+			_bt_delitems_vacuum(rel, buf, deletable, ndeletable, updatable,
+								updated, nupdatable);
 
-			stats->tuples_removed += ndeletable;
+			stats->tuples_removed += nhtidsdead;
 			/* must recompute maxoff */
 			maxoff = PageGetMaxOffsetNumber(page);
+
+			/* can't leak memory here */
+			for (int i = 0; i < nupdatable; i++)
+				pfree(updated[i]);
 		}
 		else
 		{
@@ -1251,6 +1323,7 @@ restart:
 			 * We treat this like a hint-bit update because there's no need to
 			 * WAL-log it.
 			 */
+			Assert(nhtidsdead == 0);
 			if (vstate->cycleid != 0 &&
 				opaque->btpo_cycleid == vstate->cycleid)
 			{
@@ -1260,15 +1333,18 @@ restart:
 		}
 
 		/*
-		 * If it's now empty, try to delete; else count the live tuples. We
-		 * don't delete when recursing, though, to avoid putting entries into
-		 * freePages out-of-order (doesn't seem worth any extra code to handle
-		 * the case).
+		 * If it's now empty, try to delete; else count the live tuples (live
+		 * table TIDs in posting lists are counted as separate live tuples).
+		 * We don't delete when recursing, though, to avoid putting entries
+		 * into freePages out-of-order (doesn't seem worth any extra code to
+		 * handle the case).
 		 */
 		if (minoff > maxoff)
 			delete_now = (blkno == orig_blkno);
 		else
-			stats->num_index_tuples += maxoff - minoff + 1;
+			stats->num_index_tuples += nhtidslive;
+
+		Assert(!delete_now || nhtidslive == 0);
 	}
 
 	if (delete_now)
@@ -1300,9 +1376,10 @@ restart:
 	/*
 	 * This is really tail recursion, but if the compiler is too stupid to
 	 * optimize it as such, we'd eat an uncomfortably large amount of stack
-	 * space per recursion level (due to the deletable[] array). A failure is
-	 * improbable since the number of levels isn't likely to be large ... but
-	 * just in case, let's hand-optimize into a loop.
+	 * space per recursion level (due to the arrays used to track details of
+	 * deletable/updatable items).  A failure is improbable since the number
+	 * of levels isn't likely to be large ...  but just in case, let's
+	 * hand-optimize into a loop.
 	 */
 	if (recurse_to != P_NONE)
 	{
@@ -1311,6 +1388,67 @@ restart:
 	}
 }
 
+/*
+ * btreevacuumposting --- determine TIDs still needed in posting list
+ *
+ * Returns new palloc'd array of item pointers needed to build
+ * replacement posting list tuple without the TIDs that VACUUM needs to
+ * delete.  Returned value is NULL in the common case no changes are
+ * needed in caller's posting list tuple (we avoid allocating memory
+ * here as an optimization).
+ *
+ * The number of TIDs that should remain in the posting list tuple is
+ * set for caller in *nremaining.  This indicates the number of elements
+ * in the returned array (assuming that return value isn't just NULL).
+ */
+static ItemPointer
+btreevacuumposting(BTVacState *vstate, IndexTuple posting, int *nremaining)
+{
+	int			live = 0;
+	int			nitem = BTreeTupleGetNPosting(posting);
+	ItemPointer tmpitems = NULL,
+				items = BTreeTupleGetPosting(posting);
+
+	for (int i = 0; i < nitem; i++)
+	{
+		if (!vstate->callback(items + i, vstate->callback_state))
+		{
+			/*
+			 * Live table TID.
+			 *
+			 * Only save live TID when we already know that we're going to
+			 * have to kill at least one TID, and have already allocated
+			 * memory.
+			 */
+			if (tmpitems)
+				tmpitems[live] = items[i];
+			live++;
+		}
+		else if (tmpitems == NULL)
+		{
+			/*
+			 * First dead table TID encountered.
+			 *
+			 * It's now clear that we need to delete one or more dead table
+			 * TIDs, so start maintaining an array of live TIDs for caller to
+			 * reconstruct smaller replacement posting list tuple
+			 */
+			tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+			/* Copy live TIDs skipped in previous iterations, if any */
+			if (live > 0)
+				memcpy(tmpitems, items, sizeof(ItemPointerData) * live);
+		}
+		else
+		{
+			/* Second or subsequent dead table TID */
+		}
+	}
+
+	*nremaining = live;
+	return tmpitems;
+}
+
 /*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c573814f01..c8c8ee057d 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int	_bt_binsrch_posting(BTScanInsert key, Page page,
+								OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static int	_bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+								  OffsetNumber offnum, ItemPointer heapTid,
+								  IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+									   OffsetNumber offnum,
+									   ItemPointer heapTid, int tupleOffset);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -142,6 +150,7 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
 		offnum = _bt_binsrch(rel, key, *bufP);
 		itemid = PageGetItemId(page, offnum);
 		itup = (IndexTuple) PageGetItem(page, itemid);
+		Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
 		blkno = BTreeTupleGetDownLink(itup);
 		par_blkno = BufferGetBlockNumber(*bufP);
 
@@ -434,7 +443,10 @@ _bt_binsrch(Relation rel,
  * low) makes bounds invalid.
  *
  * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by a posting
+ * list split).
  */
 OffsetNumber
 _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +465,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 	Assert(P_ISLEAF(opaque));
 	Assert(!key->nextkey);
+	Assert(insertstate->postingoff == 0);
 
 	if (!insertstate->bounds_valid)
 	{
@@ -509,6 +522,16 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 			if (result != 0)
 				stricthigh = high;
 		}
+
+		/*
+		 * If tuple at offset located by binary search is a posting list whose
+		 * TID range overlaps with caller's scantid, perform posting list
+		 * binary search to set postingoff for caller.  Caller must split the
+		 * posting list when postingoff is set.  This should happen
+		 * infrequently.
+		 */
+		if (unlikely(result == 0 && key->scantid != NULL))
+			insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
 	}
 
 	/*
@@ -528,6 +551,73 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	return low;
 }
 
+/*----------
+ *	_bt_binsrch_posting() -- posting list binary search.
+ *
+ * Helper routine for _bt_binsrch_insert().
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+	IndexTuple	itup;
+	ItemId		itemid;
+	int			low,
+				high,
+				mid,
+				res;
+
+	/*
+	 * If this isn't a posting tuple, then the index must be corrupt (if it is
+	 * an ordinary non-pivot tuple then there must be an existing tuple with a
+	 * heap TID that equals inserter's new heap TID/scantid).  Defensively
+	 * check that tuple is a posting list tuple whose posting list range
+	 * includes caller's scantid.
+	 *
+	 * (This is also needed because contrib/amcheck's rootdescend option needs
+	 * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+	 */
+	itemid = PageGetItemId(page, offnum);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+	if (!BTreeTupleIsPosting(itup))
+		return 0;
+
+	Assert(key->heapkeyspace && key->safededup);
+
+	/*
+	 * In the event that posting list tuple has LP_DEAD bit set, indicate this
+	 * to _bt_binsrch_insert() caller by returning -1, a sentinel value.  A
+	 * second call to _bt_binsrch_insert() can take place when its caller has
+	 * removed the dead item.
+	 */
+	if (ItemIdIsDead(itemid))
+		return -1;
+
+	/* "high" is past end of posting list for loop invariant */
+	low = 0;
+	high = BTreeTupleGetNPosting(itup);
+	Assert(high >= 2);
+
+	while (high > low)
+	{
+		mid = low + ((high - low) / 2);
+		res = ItemPointerCompare(key->scantid,
+								 BTreeTupleGetPostingN(itup, mid));
+
+		if (res > 0)
+			low = mid + 1;
+		else if (res < 0)
+			high = mid;
+		else
+			return mid;
+	}
+
+	/* Exact match not found */
+	return low;
+}
+
 /*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
@@ -537,9 +627,14 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
  *			<0 if scankey < tuple at offnum;
  *			 0 if scankey == tuple at offnum;
  *			>0 if scankey > tuple at offnum.
- *		NULLs in the keys are treated as sortable values.  Therefore
- *		"equality" does not necessarily mean that the item should be
- *		returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values.  Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key.  Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid.  There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
  *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
@@ -563,6 +658,7 @@ _bt_compare(Relation rel,
 	ScanKey		scankey;
 	int			ncmpkey;
 	int			ntupatts;
+	int32		result;
 
 	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +693,6 @@ _bt_compare(Relation rel,
 	{
 		Datum		datum;
 		bool		isNull;
-		int32		result;
 
 		datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
 
@@ -713,8 +808,25 @@ _bt_compare(Relation rel,
 	if (heapTid == NULL)
 		return 1;
 
+	/*
+	 * Scankey must be treated as equal to a posting list tuple if its scantid
+	 * value falls within the range of the posting list.  In all other cases
+	 * there can only be a single heap TID value, which is compared directly
+	 * as a simple scalar value.
+	 */
 	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	return ItemPointerCompare(key->scantid, heapTid);
+	result = ItemPointerCompare(key->scantid, heapTid);
+	if (result <= 0 || !BTreeTupleIsPosting(itup))
+		return result;
+	else
+	{
+		result = ItemPointerCompare(key->scantid,
+									BTreeTupleGetMaxHeapTID(itup));
+		if (result > 0)
+			return 1;
+	}
+
+	return 0;
 }
 
 /*
@@ -1229,7 +1341,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	}
 
 	/* Initialize remaining insertion scan key fields */
-	inskey.heapkeyspace = _bt_heapkeyspace(rel);
+	_bt_metaversion(rel, &inskey.heapkeyspace, &inskey.safededup);
 	inskey.anynullkeys = false; /* unused */
 	inskey.nextkey = nextkey;
 	inskey.pivotsearch = false;
@@ -1484,9 +1596,35 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
-				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				/* tuple passes all scan key conditions */
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Remember it */
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+				else
+				{
+					int			tupleOffset;
+
+					/*
+					 * Set up state to return posting list, and remember first
+					 * TID
+					 */
+					tupleOffset =
+						_bt_setuppostingitems(so, itemIndex, offnum,
+											  BTreeTupleGetPostingN(itup, 0),
+											  itup);
+					itemIndex++;
+					/* Remember additional TIDs */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											tupleOffset);
+						itemIndex++;
+					}
+				}
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1519,7 +1657,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxTIDsPerBTreePage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1527,7 +1665,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxTIDsPerBTreePage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1568,9 +1706,41 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 										 &continuescan);
 			if (passes_quals && tuple_alive)
 			{
-				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				/* tuple passes all scan key conditions */
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Remember it */
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+				else
+				{
+					int			tupleOffset;
+
+					/*
+					 * Set up state to return posting list, and remember first
+					 * TID.
+					 *
+					 * Note that we deliberately save/return items from
+					 * posting lists in ascending heap TID order for backwards
+					 * scans.  This allows _bt_killitems() to make a
+					 * consistent assumption about the order of items
+					 * associated with the same posting list tuple.
+					 */
+					itemIndex--;
+					tupleOffset =
+						_bt_setuppostingitems(so, itemIndex, offnum,
+											  BTreeTupleGetPostingN(itup, 0),
+											  itup);
+					/* Remember additional TIDs */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						itemIndex--;
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											tupleOffset);
+					}
+				}
 			}
 			if (!continuescan)
 			{
@@ -1584,8 +1754,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
+		so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1768,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPivot(itup) && !BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1610,6 +1782,71 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 	}
 }
 
+/*
+ * Setup state to save TIDs/items from a single posting list tuple.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for TID that is
+ * returned to scan first.  Second or subsequent TIDs for posting list should
+ * be saved by calling _bt_savepostingitem().
+ *
+ * Returns an offset into tuple storage space that main tuple is stored at if
+ * needed.
+ */
+static int
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					  ItemPointer heapTid, IndexTuple itup)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	Assert(BTreeTupleIsPosting(itup));
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+	if (so->currTuples)
+	{
+		/* Save base IndexTuple (truncate posting list) */
+		IndexTuple	base;
+		Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+		itupsz = MAXALIGN(itupsz);
+		currItem->tupleOffset = so->currPos.nextTupleOffset;
+		base = (IndexTuple) (so->currTuples + so->currPos.nextTupleOffset);
+		memcpy(base, itup, itupsz);
+		/* Defensively reduce work area index tuple header size */
+		base->t_info &= ~INDEX_SIZE_MASK;
+		base->t_info |= itupsz;
+		so->currPos.nextTupleOffset += itupsz;
+
+		return currItem->tupleOffset;
+	}
+
+	return 0;
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for current posting
+ * tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.  Caller passes its return value as tupleOffset.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer heapTid, int tupleOffset)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+
+	/*
+	 * Have index-only scans return the same base IndexTuple for every TID
+	 * that originates from the same posting list
+	 */
+	if (so->currTuples)
+		currItem->tupleOffset = tupleOffset;
+}
+
 /*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f163491d60..94efe37232 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -243,6 +243,7 @@ typedef struct BTPageState
 	BlockNumber btps_blkno;		/* block # to write this page at */
 	IndexTuple	btps_lowkey;	/* page's strict lower bound pivot tuple */
 	OffsetNumber btps_lastoff;	/* last item offset loaded */
+	Size		btps_lastextra; /* last item's extra posting list space */
 	uint32		btps_level;		/* tree level (0 = leaf) */
 	Size		btps_full;		/* "full" if less than this much free space */
 	struct BTPageState *btps_next;	/* link to parent level, if any */
@@ -277,7 +278,10 @@ static void _bt_slideleft(Page page);
 static void _bt_sortaddtup(Page page, Size itemsize,
 						   IndexTuple itup, OffsetNumber itup_off);
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
-						 IndexTuple itup);
+						 IndexTuple itup, Size truncextra);
+static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
+										  BTPageState *state,
+										  BTDedupState dstate);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
@@ -711,6 +715,7 @@ _bt_pagestate(BTWriteState *wstate, uint32 level)
 	state->btps_lowkey = NULL;
 	/* initialize lastoff so first item goes into P_FIRSTKEY */
 	state->btps_lastoff = P_HIKEY;
+	state->btps_lastextra = 0;
 	state->btps_level = level;
 	/* set "full" threshold based on level.  See notes at head of file. */
 	if (level > 0)
@@ -789,7 +794,8 @@ _bt_sortaddtup(Page page,
 }
 
 /*----------
- * Add an item to a disk page from the sort output.
+ * Add an item to a disk page from the sort output (or add a posting list
+ * item formed from the sort output).
  *
  * We must be careful to observe the page layout conventions of nbtsearch.c:
  * - rightmost pages start data items at P_HIKEY instead of at P_FIRSTKEY.
@@ -821,14 +827,27 @@ _bt_sortaddtup(Page page,
  * the truncated high key at offset 1.
  *
  * 'last' pointer indicates the last offset added to the page.
+ *
+ * 'truncextra' is the size of the posting list in itup, if any.  This
+ * information is stashed for the next call here, when we may benefit
+ * from considering the impact of truncating away the posting list on
+ * the page before deciding to finish the page off.  Posting lists are
+ * often relatively large, so it is worth going to the trouble of
+ * accounting for the saving from truncating away the posting list of
+ * the tuple that becomes the high key (that may be the only way to
+ * get close to target free space on the page).  Note that this is
+ * only used for the soft fillfactor-wise limit, not the critical hard
+ * limit.
  *----------
  */
 static void
-_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
+_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
+			 Size truncextra)
 {
 	Page		npage;
 	BlockNumber nblkno;
 	OffsetNumber last_off;
+	Size		last_truncextra;
 	Size		pgspc;
 	Size		itupsz;
 	bool		isleaf;
@@ -842,6 +861,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	npage = state->btps_page;
 	nblkno = state->btps_blkno;
 	last_off = state->btps_lastoff;
+	last_truncextra = state->btps_lastextra;
+	state->btps_lastextra = truncextra;
 
 	pgspc = PageGetFreeSpace(npage);
 	itupsz = IndexTupleSize(itup);
@@ -883,10 +904,10 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	 * page.  Disregard fillfactor and insert on "full" current page if we
 	 * don't have the minimum number of items yet.  (Note that we deliberately
 	 * assume that suffix truncation neither enlarges nor shrinks new high key
-	 * when applying soft limit.)
+	 * when applying soft limit, except when last tuple had a posting list.)
 	 */
 	if (pgspc < itupsz + (isleaf ? MAXALIGN(sizeof(ItemPointerData)) : 0) ||
-		(pgspc < state->btps_full && last_off > P_FIRSTKEY))
+		(pgspc + last_truncextra < state->btps_full && last_off > P_FIRSTKEY))
 	{
 		/*
 		 * Finish off the page and write it out.
@@ -944,11 +965,14 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * We don't try to bias our choice of split point to make it more
 			 * likely that _bt_truncate() can truncate away more attributes,
 			 * whereas the split point used within _bt_split() is chosen much
-			 * more delicately.  Suffix truncation is mostly useful because it
-			 * improves space utilization for workloads with random
-			 * insertions.  It doesn't seem worthwhile to add logic for
-			 * choosing a split point here for a benefit that is bound to be
-			 * much smaller.
+			 * more delicately.  Even still, the lastleft and firstright
+			 * tuples passed to _bt_truncate() here are at least not fully
+			 * equal to each other when deduplication is used, unless there is
+			 * a large group of duplicates (also, unique index builds usually
+			 * have few or no spool2 duplicates).  When the split point is
+			 * between two unequal "physical" tuples, _bt_truncate() will
+			 * avoid including a heap TID in the new high key, which is the
+			 * most important benefit of suffix truncation.
 			 *
 			 * Overwrite the old item with new truncated high key directly.
 			 * oitup is already located at the physical beginning of tuple
@@ -983,7 +1007,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(BTreeTupleGetNAtts(state->btps_lowkey, wstate->index) == 0 ||
 			   !P_LEFTMOST((BTPageOpaque) PageGetSpecialPointer(opage)));
 		BTreeTupleSetDownLink(state->btps_lowkey, oblkno);
-		_bt_buildadd(wstate, state->btps_next, state->btps_lowkey);
+		_bt_buildadd(wstate, state->btps_next, state->btps_lowkey, 0);
 		pfree(state->btps_lowkey);
 
 		/*
@@ -1045,6 +1069,43 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	state->btps_lastoff = last_off;
 }
 
+/*
+ * Finalize pending posting list tuple, and add it to the index.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * This is almost like _bt_dedup_finish_pending(), but it adds a new tuple
+ * using _bt_buildadd().
+ */
+static void
+_bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
+							  BTDedupState dstate)
+{
+	Assert(dstate->nitems > 0);
+
+	if (dstate->nitems == 1)
+		_bt_buildadd(wstate, state, dstate->base, 0);
+	else
+	{
+		IndexTuple	postingtuple;
+		Size		truncextra;
+
+		/* form a tuple with a posting list */
+		postingtuple = _bt_form_posting(dstate->base,
+										dstate->htids,
+										dstate->nhtids);
+		/* Calculate posting list overhead */
+		truncextra = IndexTupleSize(postingtuple) -
+			BTreeTupleGetPostingOffset(postingtuple);
+
+		_bt_buildadd(wstate, state, postingtuple, truncextra);
+		pfree(postingtuple);
+	}
+
+	dstate->nhtids = 0;
+	dstate->nitems = 0;
+	dstate->phystupsize = 0;
+}
+
 /*
  * Finish writing out the completed btree.
  */
@@ -1090,7 +1151,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 			Assert(BTreeTupleGetNAtts(s->btps_lowkey, wstate->index) == 0 ||
 				   !P_LEFTMOST(opaque));
 			BTreeTupleSetDownLink(s->btps_lowkey, blkno);
-			_bt_buildadd(wstate, s->btps_next, s->btps_lowkey);
+			_bt_buildadd(wstate, s->btps_next, s->btps_lowkey, 0);
 			pfree(s->btps_lowkey);
 			s->btps_lowkey = NULL;
 		}
@@ -1111,7 +1172,8 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 	 * by filling in a valid magic number in the metapage.
 	 */
 	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, rootblkno, rootlevel);
+	_bt_initmetapage(metapage, rootblkno, rootlevel,
+					 wstate->inskey->safededup);
 	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
 }
 
@@ -1132,6 +1194,9 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		deduplicate;
+
+	deduplicate = wstate->inskey->safededup && BTGetUseDedup(wstate->index);
 
 	if (merge)
 	{
@@ -1228,12 +1293,12 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 
 			if (load1)
 			{
-				_bt_buildadd(wstate, state, itup);
+				_bt_buildadd(wstate, state, itup, 0);
 				itup = tuplesort_getindextuple(btspool->sortstate, true);
 			}
 			else
 			{
-				_bt_buildadd(wstate, state, itup2);
+				_bt_buildadd(wstate, state, itup2, 0);
 				itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
 			}
 
@@ -1243,9 +1308,100 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		}
 		pfree(sortKeys);
 	}
+	else if (deduplicate)
+	{
+		/* merge is unnecessary, deduplicate into posting lists */
+		BTDedupState dstate;
+
+		dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
+		dstate->maxitemsize = 0;	/* set later */
+		dstate->checkingunique = false; /* unused */
+		dstate->skippedbase = InvalidOffsetNumber;
+		/* Metadata about base tuple of current pending posting list */
+		dstate->base = NULL;
+		dstate->baseoff = InvalidOffsetNumber;	/* unused */
+		dstate->basetupsize = 0;
+		/* Metadata about current pending posting list TIDs */
+		dstate->htids = NULL;
+		dstate->nhtids = 0;
+		dstate->nitems = 0;
+		dstate->phystupsize = 0;	/* unused */
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate,
+											   true)) != NULL)
+		{
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+			{
+				state = _bt_pagestate(wstate, 0);
+
+				/*
+				 * Limit size of posting list tuples to the size of the free
+				 * space we want to leave behind on the page, plus space for
+				 * final item's line pointer (but make sure that posting list
+				 * tuple size won't exceed the generic 1/3 of a page limit).
+				 *
+				 * This is more conservative than the approach taken in the
+				 * retail insert path.  It allows us to get most of the space
+				 * savings deduplication provides while still filling leaf
+				 * pages close to fillfactor% full on average.
+				 */
+				dstate->maxitemsize =
+					Min(Min(BTMaxItemSize(state->btps_page), INDEX_SIZE_MASK),
+						MAXALIGN_DOWN(state->btps_full) - sizeof(ItemIdData));
+				/* Minimum posting tuple size is 100 bytes (arbitrary) */
+				dstate->maxitemsize = Max(dstate->maxitemsize, 100);
+				dstate->htids = palloc(dstate->maxitemsize);
+
+				/* start new pending posting list with itup copy */
+				_bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
+										InvalidOffsetNumber);
+			}
+			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
+										 itup) > keysz &&
+					 _bt_dedup_save_htid(dstate, itup))
+			{
+				/*
+				 * Tuple is equal to base tuple of pending posting list.  Heap
+				 * TID from itup has been saved in state.
+				 */
+			}
+			else
+			{
+				/*
+				 * Tuple is not equal to pending posting list tuple, or
+				 * _bt_dedup_save_htid() opted to not merge current item into
+				 * pending posting list.
+				 */
+				_bt_sort_dedup_finish_pending(wstate, state, dstate);
+				pfree(dstate->base);
+
+				/* start new pending posting list with itup copy */
+				_bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
+										InvalidOffsetNumber);
+			}
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+
+		if (state)
+		{
+			/*
+			 * Handle the last item (there must be a last item when the
+			 * tuplesort returned one or more tuples)
+			 */
+			_bt_sort_dedup_finish_pending(wstate, state, dstate);
+			pfree(dstate->base);
+			pfree(dstate->htids);
+		}
+
+		pfree(dstate);
+	}
 	else
 	{
-		/* merge is unnecessary */
+		/* merging and deduplication are both unnecessary */
 		while ((itup = tuplesort_getindextuple(btspool->sortstate,
 											   true)) != NULL)
 		{
@@ -1253,7 +1409,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			if (state == NULL)
 				state = _bt_pagestate(wstate, 0);
 
-			_bt_buildadd(wstate, state, itup);
+			_bt_buildadd(wstate, state, itup, 0);
 
 			/* Report progress */
 			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 76c2d945c8..8ba055be9e 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
 	state.minfirstrightsz = SIZE_MAX;
 	state.newitemoff = newitemoff;
 
+	/* newitem cannot be a posting list item */
+	Assert(!BTreeTupleIsPosting(newitem));
+
 	/*
 	 * maxsplits should never exceed maxoff because there will be at most as
 	 * many candidate split points as there are points _between_ tuples, once
@@ -459,6 +462,7 @@ _bt_recsplitloc(FindSplitData *state,
 	int16		leftfree,
 				rightfree;
 	Size		firstrightitemsz;
+	Size		postingsz = 0;
 	bool		newitemisfirstonright;
 
 	/* Is the new item going to be the first item on the right page? */
@@ -468,8 +472,30 @@ _bt_recsplitloc(FindSplitData *state,
 	if (newitemisfirstonright)
 		firstrightitemsz = state->newitemsz;
 	else
+	{
 		firstrightitemsz = firstoldonrightsz;
 
+		/*
+		 * Calculate suffix truncation space saving when firstright is a
+		 * posting list tuple, though only when the firstright is over 64
+		 * bytes including line pointer overhead (arbitrary).  This avoids
+		 * accessing the tuple in cases where its posting list must be very
+		 * small (if firstright has one at all).
+		 */
+		if (state->is_leaf && firstrightitemsz > 64)
+		{
+			ItemId		itemid;
+			IndexTuple	newhighkey;
+
+			itemid = PageGetItemId(state->page, firstoldonright);
+			newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+			if (BTreeTupleIsPosting(newhighkey))
+				postingsz = IndexTupleSize(newhighkey) -
+					BTreeTupleGetPostingOffset(newhighkey);
+		}
+	}
+
 	/* Account for all the old tuples */
 	leftfree = state->leftspace - olddataitemstoleft;
 	rightfree = state->rightspace -
@@ -491,11 +517,17 @@ _bt_recsplitloc(FindSplitData *state,
 	 * If we are on the leaf level, assume that suffix truncation cannot avoid
 	 * adding a heap TID to the left half's new high key when splitting at the
 	 * leaf level.  In practice the new high key will often be smaller and
-	 * will rarely be larger, but conservatively assume the worst case.
+	 * will rarely be larger, but conservatively assume the worst case.  We do
+	 * go to the trouble of subtracting away posting list overhead, though
+	 * only when it looks like it will make an appreciable difference.
+	 * (Posting lists are the only case where truncation will typically make
+	 * the final high key far smaller than firstright, so being a bit more
+	 * precise there noticeably improves the balance of free space.)
 	 */
 	if (state->is_leaf)
 		leftfree -= (int16) (firstrightitemsz +
-							 MAXALIGN(sizeof(ItemPointerData)));
+							 MAXALIGN(sizeof(ItemPointerData)) -
+							 postingsz);
 	else
 		leftfree -= (int16) firstrightitemsz;
 
@@ -691,7 +723,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
 	tup = (IndexTuple) PageGetItem(state->page, itemid);
 	/* Do cheaper test first */
-	if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+	if (BTreeTupleIsPosting(tup) ||
+		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
 	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 5ab4e712f1..782e3453b7 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -20,6 +20,7 @@
 #include "access/nbtree.h"
 #include "access/reloptions.h"
 #include "access/relscan.h"
+#include "catalog/catalog.h"
 #include "commands/progress.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -107,7 +108,13 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	 */
 	key = palloc(offsetof(BTScanInsertData, scankeys) +
 				 sizeof(ScanKeyData) * indnkeyatts);
-	key->heapkeyspace = itup == NULL || _bt_heapkeyspace(rel);
+	if (itup)
+		_bt_metaversion(rel, &key->heapkeyspace, &key->safededup);
+	else
+	{
+		key->heapkeyspace = true;
+		key->safededup = _bt_opclasses_support_dedup(rel);
+	}
 	key->anynullkeys = false;	/* initial assumption */
 	key->nextkey = false;
 	key->pivotsearch = false;
@@ -1373,6 +1380,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			continue;
 		}
 
@@ -1534,6 +1542,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			cmpresult = 0;
 			if (subkey->sk_flags & SK_ROW_END)
 				break;
@@ -1773,10 +1782,65 @@ _bt_killitems(IndexScanDesc scan)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
+			bool		killtuple = false;
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			if (BTreeTupleIsPosting(ituple))
 			{
-				/* found the item */
+				int			pi = i + 1;
+				int			nposting = BTreeTupleGetNPosting(ituple);
+				int			j;
+
+				/*
+				 * Note that we rely on the assumption that heap TIDs in the
+				 * scanpos items array are always in ascending heap TID order
+				 * within a posting list
+				 */
+				for (j = 0; j < nposting; j++)
+				{
+					ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+					if (!ItemPointerEquals(item, &kitem->heapTid))
+						break;	/* out of posting list loop */
+
+					/* kitem must have matching offnum when heap TIDs match */
+					Assert(kitem->indexOffset == offnum);
+
+					/*
+					 * Read-ahead to later kitems here.
+					 *
+					 * We rely on the assumption that not advancing kitem here
+					 * will prevent us from considering the posting list tuple
+					 * fully dead by not matching its next heap TID in next
+					 * loop iteration.
+					 *
+					 * If, on the other hand, this is the final heap TID in
+					 * the posting list tuple, then tuple gets killed
+					 * regardless (i.e. we handle the case where the last
+					 * kitem is also the last heap TID in the last index tuple
+					 * correctly -- posting tuple still gets killed).
+					 */
+					if (pi < numKilled)
+						kitem = &so->currPos.items[so->killedItems[pi++]];
+				}
+
+				/*
+				 * Don't bother advancing the outermost loop's int iterator to
+				 * avoid processing killed items that relate to the same
+				 * offnum/posting list tuple.  This micro-optimization hardly
+				 * seems worth it.  (Further iterations of the outermost loop
+				 * will fail to match on this same posting list's first heap
+				 * TID instead, so we'll advance to the next offnum/index
+				 * tuple pretty quickly.)
+				 */
+				if (j == nposting)
+					killtuple = true;
+			}
+			else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+				killtuple = true;
+
+			if (killtuple)
+			{
+				/* found the item/all posting list items */
 				ItemIdMarkDead(iid);
 				killedsomething = true;
 				break;			/* out of inner search loop */
@@ -2017,7 +2081,9 @@ btoptions(Datum reloptions, bool validate)
 	static const relopt_parse_elt tab[] = {
 		{"fillfactor", RELOPT_TYPE_INT, offsetof(BTOptions, fillfactor)},
 		{"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
-		offsetof(BTOptions, vacuum_cleanup_index_scale_factor)}
+		offsetof(BTOptions, vacuum_cleanup_index_scale_factor)},
+		{"deduplicate_items", RELOPT_TYPE_BOOL,
+		offsetof(BTOptions, deduplicate_items)}
 
 	};
 
@@ -2118,11 +2184,10 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	Size		newsize;
 
 	/*
-	 * We should only ever truncate leaf index tuples.  It's never okay to
-	 * truncate a second time.
+	 * We should only ever truncate non-pivot tuples from leaf pages.  It's
+	 * never okay to truncate when splitting an internal page.
 	 */
-	Assert(BTreeTupleGetNAtts(lastleft, rel) == natts);
-	Assert(BTreeTupleGetNAtts(firstright, rel) == natts);
+	Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
 
 	/* Determine how many attributes must be kept in truncated tuple */
 	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
@@ -2138,6 +2203,19 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(pivot))
+		{
+			/*
+			 * index_truncate_tuple() just returns a straight copy of
+			 * firstright when it has no key attributes to truncate.  We need
+			 * to truncate away the posting list ourselves.
+			 */
+			Assert(keepnatts == nkeyatts);
+			Assert(natts == nkeyatts);
+			pivot->t_info &= ~INDEX_SIZE_MASK;
+			pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+		}
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2154,6 +2232,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * attribute to the new pivot tuple.
 		 */
 		Assert(natts != nkeyatts);
+		Assert(!BTreeTupleIsPosting(lastleft) &&
+			   !BTreeTupleIsPosting(firstright));
 		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
 		tidpivot = palloc0(newsize);
 		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2171,6 +2251,19 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
 		pivot = palloc0(newsize);
 		memcpy(pivot, firstright, IndexTupleSize(firstright));
+
+		if (BTreeTupleIsPosting(firstright))
+		{
+			/*
+			 * New pivot tuple was copied from firstright, which happens to be
+			 * a posting list tuple.  We will have to include the max lastleft
+			 * heap TID in the final pivot tuple, but we can remove the
+			 * posting list now. (Pivot tuples should never contain a posting
+			 * list.)
+			 */
+			newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+				MAXALIGN(sizeof(ItemPointerData));
+		}
 	}
 
 	/*
@@ -2198,7 +2291,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2209,9 +2302,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
+							  BTreeTupleGetHeapTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #else
 
 	/*
@@ -2224,7 +2320,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2233,7 +2329,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2314,13 +2411,16 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
  * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal after detoasting.
+ * unless they're bitwise equal after detoasting.  When an index is considered
+ * deduplication-safe by _bt_opclasses_support_dedup, routine is guaranteed to
+ * give the same result as _bt_keep_natts would.
  *
- * These issues must be acceptable to callers, typically because they're only
- * concerned about making suffix truncation as effective as possible without
- * leaving excessive amounts of free space on either side of page split.
- * Callers can rely on the fact that attributes considered equal here are
- * definitely also equal according to _bt_keep_natts.
+ * Suffix truncation callers can rely on the fact that attributes considered
+ * equal here are definitely also equal according to _bt_keep_natts, even when
+ * the index uses an opclass or collation that is not deduplication-safe.
+ * This weaker guarantee is good enough for these callers, since false
+ * negatives generally only have the effect of making leaf page splits use a
+ * more balanced split point.
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2392,28 +2492,42 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * Mask allocated for number of keys in index tuple must be able to fit
 	 * maximum possible number of index attributes
 	 */
-	StaticAssertStmt(BT_N_KEYS_OFFSET_MASK >= INDEX_MAX_KEYS,
-					 "BT_N_KEYS_OFFSET_MASK can't fit INDEX_MAX_KEYS");
+	StaticAssertStmt(BT_OFFSET_MASK >= INDEX_MAX_KEYS,
+					 "BT_OFFSET_MASK can't fit INDEX_MAX_KEYS");
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
+	/* !heapkeyspace indexes do not support deduplication */
+	if (!heapkeyspace && BTreeTupleIsPosting(itup))
+		return false;
+
+	/* Posting list tuples should never have "pivot heap TID" bit set */
+	if (BTreeTupleIsPosting(itup) &&
+		(ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+		 BT_PIVOT_HEAP_TID_ATTR) != 0)
+		return false;
+
+	/* INCLUDE indexes do not support deduplication */
+	if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+		return false;
+
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
 			/*
-			 * Non-pivot tuples currently never use alternative heap TID
-			 * representation -- even those within heapkeyspace indexes
+			 * Non-pivot tuple should never be explicitly marked as a pivot
+			 * tuple
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated.  (Note that tupnatts must have been
-			 * inferred, rather than coming from an explicit on-disk
-			 * representation.)
+			 * inferred, even with a posting list tuple, because only pivot
+			 * tuples store tupnatts directly.)
 			 */
 			return tupnatts == natts;
 		}
@@ -2457,12 +2571,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * non-zero, or when there is no explicit representation and the
 			 * tuple is evidently not a pre-pg_upgrade tuple.
 			 *
-			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
-			 * that to decide if the tuple is a pre-v11 tuple.
+			 * Prior to v11, downlinks always had P_HIKEY as their offset.
+			 * Accept that as an alternative indication of a valid
+			 * !heapkeyspace negative infinity tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
-				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+				ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
 		}
 		else
 		{
@@ -2488,7 +2602,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
+		return false;
+
+	/* Pivot tuple should not use posting list representation (redundant) */
+	if (BTreeTupleIsPosting(itup))
 		return false;
 
 	/*
@@ -2558,11 +2676,53 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					BTMaxItemSizeNoHeapTid(page),
 					RelationGetRelationName(rel)),
 			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
-					   ItemPointerGetBlockNumber(&newtup->t_tid),
-					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+					   ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
 					   RelationGetRelationName(heap)),
 			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
 					 "Consider a function index of an MD5 hash of the value, "
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Is it safe to perform deduplication for an index, given the opclasses and
+ * collations used?
+ *
+ * Returned value is stored in index metapage during index builds.  Function
+ * does not account for incompatibilities caused by index being on an earlier
+ * nbtree version.
+ */
+bool
+_bt_opclasses_support_dedup(Relation index)
+{
+	/* INCLUDE indexes don't support deduplication */
+	if (IndexRelationGetNumberOfAttributes(index) !=
+		IndexRelationGetNumberOfKeyAttributes(index))
+		return false;
+
+	/*
+	 * There is no reason why deduplication cannot be used with system catalog
+	 * indexes.  However, we deem it generally unsafe because it's not clear
+	 * how it could be disabled.  (ALTER INDEX is not supported with system
+	 * catalog indexes, so users have no way to set the storage parameter.)
+	 */
+	if (IsCatalogRelation(index))
+		return false;
+
+	for (int i = 0; i < IndexRelationGetNumberOfKeyAttributes(index); i++)
+	{
+		Oid			opfamily = index->rd_opfamily[i];
+		Oid			collation = index->rd_indcollation[i];
+
+		/* TODO add adequate check of opclasses and collations */
+		elog(DEBUG4, "index %s column i %d opfamilyOid %u collationOid %u",
+			 RelationGetRelationName(index), i, opfamily, collation);
+
+		/* NUMERIC btree opfamily OID is 1988 */
+		if (opfamily == 1988)
+			return false;
+	}
+
+	return true;
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 2e5202c2d6..92c366373e 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -22,6 +22,9 @@
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/procarray.h"
+#include "utils/memutils.h"
+
+static MemoryContext opCtx;		/* working memory for operations */
 
 /*
  * _bt_restore_page -- re-enter all the index tuples on a page
@@ -111,6 +114,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
 	Assert(md->btm_version >= BTREE_NOVAC_VERSION);
 	md->btm_oldest_btpo_xact = xlrec->oldest_btpo_xact;
 	md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
+	md->btm_safededup = xlrec->safededup;
 
 	pageop = (BTPageOpaque) PageGetSpecialPointer(metapg);
 	pageop->btpo_flags = BTP_META;
@@ -156,7 +160,8 @@ _bt_clear_incomplete_split(XLogReaderState *record, uint8 block_id)
 }
 
 static void
-btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
+btree_xlog_insert(bool isleaf, bool ismeta, bool posting,
+				  XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	xl_btree_insert *xlrec = (xl_btree_insert *) XLogRecGetData(record);
@@ -181,9 +186,52 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 
 		page = BufferGetPage(buffer);
 
-		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
-						false, false) == InvalidOffsetNumber)
-			elog(PANIC, "btree_xlog_insert: failed to add item");
+		if (!posting)
+		{
+			/* Simple retail insertion */
+			if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+		}
+		else
+		{
+			ItemId		itemid;
+			IndexTuple	oposting,
+						newitem,
+						nposting;
+			uint16		postingoff;
+
+			/*
+			 * A posting list split occurred during leaf page insertion.  WAL
+			 * record data will start with an offset number representing the
+			 * point in an existing posting list that a split occurs at.
+			 *
+			 * Use _bt_swap_posting() to repeat posting list split steps from
+			 * primary.  Note that newitem from WAL record is 'orignewitem',
+			 * not the final version of newitem that is actually inserted on
+			 * page.
+			 */
+			postingoff = *((uint16 *) datapos);
+			datapos += sizeof(uint16);
+			datalen -= sizeof(uint16);
+
+			itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+			oposting = (IndexTuple) PageGetItem(page, itemid);
+
+			/* Use mutable, aligned newitem copy in _bt_swap_posting() */
+			Assert(isleaf && postingoff > 0);
+			newitem = CopyIndexTuple((IndexTuple) datapos);
+			nposting = _bt_swap_posting(newitem, oposting, postingoff);
+
+			/* Replace existing posting list with post-split version */
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+			/* Insert "final" new item (not orignewitem from WAL stream) */
+			Assert(IndexTupleSize(newitem) == datalen);
+			if (PageAddItem(page, (Item) newitem, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add posting split new item");
+		}
 
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
@@ -265,20 +313,38 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 		BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
 		OffsetNumber off;
 		IndexTuple	newitem = NULL,
-					left_hikey = NULL;
+					left_hikey = NULL,
+					nposting = NULL;
 		Size		newitemsz = 0,
 					left_hikeysz = 0;
 		Page		newlpage;
-		OffsetNumber leftoff;
+		OffsetNumber leftoff,
+					replacepostingoff = InvalidOffsetNumber;
 
 		datapos = XLogRecGetBlockData(record, 0, &datalen);
 
-		if (onleft)
+		if (onleft || xlrec->postingoff != 0)
 		{
 			newitem = (IndexTuple) datapos;
 			newitemsz = MAXALIGN(IndexTupleSize(newitem));
 			datapos += newitemsz;
 			datalen -= newitemsz;
+
+			if (xlrec->postingoff != 0)
+			{
+				ItemId		itemid;
+				IndexTuple	oposting;
+
+				/* Posting list must be at offset number before new item's */
+				replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+				/* Use mutable, aligned newitem copy in _bt_swap_posting() */
+				newitem = CopyIndexTuple(newitem);
+				itemid = PageGetItemId(lpage, replacepostingoff);
+				oposting = (IndexTuple) PageGetItem(lpage, itemid);
+				nposting = _bt_swap_posting(newitem, oposting,
+											xlrec->postingoff);
+			}
 		}
 
 		/*
@@ -308,8 +374,20 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 			Size		itemsz;
 			IndexTuple	item;
 
+			/* Add replacement posting list when required */
+			if (off == replacepostingoff)
+			{
+				Assert(onleft || xlrec->firstright == xlrec->newitemoff);
+				if (PageAddItem(newlpage, (Item) nposting,
+								MAXALIGN(IndexTupleSize(nposting)), leftoff,
+								false, false) == InvalidOffsetNumber)
+					elog(ERROR, "failed to add new posting list item to left page after split");
+				leftoff = OffsetNumberNext(leftoff);
+				continue;		/* don't insert oposting */
+			}
+
 			/* add the new item if it was inserted on left page */
-			if (onleft && off == xlrec->newitemoff)
+			else if (onleft && off == xlrec->newitemoff)
 			{
 				if (PageAddItem(newlpage, (Item) newitem, newitemsz, leftoff,
 								false, false) == InvalidOffsetNumber)
@@ -383,6 +461,56 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 	}
 }
 
+static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+	Buffer		buf;
+
+	if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+	{
+		Page		page = (Page) BufferGetPage(buf);
+		OffsetNumber offnum;
+		BTDedupState state;
+
+		state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+		state->maxitemsize = Min(BTMaxItemSize(page), INDEX_SIZE_MASK);
+		state->checkingunique = false;	/* unused */
+		state->skippedbase = InvalidOffsetNumber;
+		state->base = NULL;
+		state->baseoff = InvalidOffsetNumber;
+		state->basetupsize = 0;
+		state->htids = NULL;
+		state->nhtids = 0;
+		state->nitems = 0;
+		state->phystupsize = 0;
+		state->htids = palloc(state->maxitemsize);
+
+		for (offnum = xlrec->baseoff;
+			 offnum < xlrec->baseoff + xlrec->nitems;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemid = PageGetItemId(page, offnum);
+			IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+			if (offnum == xlrec->baseoff)
+				_bt_dedup_start_pending(state, itup, offnum);
+			else if (!_bt_dedup_save_htid(state, itup))
+				elog(ERROR, "could not add heap tid to pending posting list");
+		}
+
+		Assert(state->nitems == xlrec->nitems);
+		_bt_dedup_finish_pending(buf, state, false);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buf);
+	}
+
+	if (BufferIsValid(buf))
+		UnlockReleaseBuffer(buf);
+}
+
 static void
 btree_xlog_vacuum(XLogReaderState *record)
 {
@@ -405,7 +533,32 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		page = (Page) BufferGetPage(buffer);
 
-		PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
+		if (xlrec->nupdated > 0)
+		{
+			OffsetNumber *updatedoffsets;
+			IndexTuple	updated;
+			Size		itemsz;
+
+			updatedoffsets = (OffsetNumber *)
+				(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+			updated = (IndexTuple) ((char *) updatedoffsets +
+									xlrec->nupdated * sizeof(OffsetNumber));
+
+			for (int i = 0; i < xlrec->nupdated; i++)
+			{
+				Assert(BTreeTupleIsPosting(updated));
+				itemsz = MAXALIGN(IndexTupleSize(updated));
+
+				if (!PageIndexTupleOverwrite(page, updatedoffsets[i],
+											 (Item) updated, itemsz))
+					elog(PANIC, "could not update partially dead item");
+
+				updated = (IndexTuple) ((char *) updated + itemsz);
+			}
+		}
+
+		if (xlrec->ndeleted > 0)
+			PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 
 		/*
 		 * Mark the page as not containing any LP_DEAD items --- see comments
@@ -724,17 +877,22 @@ void
 btree_redo(XLogReaderState *record)
 {
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	MemoryContext oldCtx;
 
+	oldCtx = MemoryContextSwitchTo(opCtx);
 	switch (info)
 	{
 		case XLOG_BTREE_INSERT_LEAF:
-			btree_xlog_insert(true, false, record);
+			btree_xlog_insert(true, false, false, record);
 			break;
 		case XLOG_BTREE_INSERT_UPPER:
-			btree_xlog_insert(false, false, record);
+			btree_xlog_insert(false, false, false, record);
 			break;
 		case XLOG_BTREE_INSERT_META:
-			btree_xlog_insert(false, true, record);
+			btree_xlog_insert(false, true, false, record);
+			break;
+		case XLOG_BTREE_INSERT_POST:
+			btree_xlog_insert(true, false, true, record);
 			break;
 		case XLOG_BTREE_SPLIT_L:
 			btree_xlog_split(true, record);
@@ -742,6 +900,9 @@ btree_redo(XLogReaderState *record)
 		case XLOG_BTREE_SPLIT_R:
 			btree_xlog_split(false, record);
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			btree_xlog_dedup(record);
+			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
 			break;
@@ -767,6 +928,23 @@ btree_redo(XLogReaderState *record)
 		default:
 			elog(PANIC, "btree_redo: unknown op code %u", info);
 	}
+	MemoryContextSwitchTo(oldCtx);
+	MemoryContextReset(opCtx);
+}
+
+void
+btree_xlog_startup(void)
+{
+	opCtx = AllocSetContextCreate(CurrentMemoryContext,
+								  "Btree recovery temporary context",
+								  ALLOCSET_DEFAULT_SIZES);
+}
+
+void
+btree_xlog_cleanup(void)
+{
+	MemoryContextDelete(opCtx);
+	opCtx = NULL;
 }
 
 /*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 7d63a7124e..68fad1c91f 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -27,6 +27,7 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 		case XLOG_BTREE_INSERT_LEAF:
 		case XLOG_BTREE_INSERT_UPPER:
 		case XLOG_BTREE_INSERT_META:
+		case XLOG_BTREE_INSERT_POST:
 			{
 				xl_btree_insert *xlrec = (xl_btree_insert *) rec;
 
@@ -38,15 +39,25 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
-				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d",
-								 xlrec->level, xlrec->firstright, xlrec->newitemoff);
+				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, postingoff %d",
+								 xlrec->level, xlrec->firstright,
+								 xlrec->newitemoff, xlrec->postingoff);
+				break;
+			}
+		case XLOG_BTREE_DEDUP_PAGE:
+			{
+				xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+				appendStringInfo(buf, "baseoff %u; nitems %u",
+								 xlrec->baseoff, xlrec->nitems);
 				break;
 			}
 		case XLOG_BTREE_VACUUM:
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "ndeleted %u", xlrec->ndeleted);
+				appendStringInfo(buf, "ndeleted %u; nupdated %u",
+								 xlrec->ndeleted, xlrec->nupdated);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
@@ -130,6 +141,12 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			id = "DEDUPLICATE";
+			break;
+		case XLOG_BTREE_INSERT_POST:
+			id = "INSERT_POST";
+			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index f47176753d..32ff03b3e4 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1055,8 +1055,10 @@ PageIndexTupleDeleteNoCompact(Page page, OffsetNumber offnum)
  * This is better than deleting and reinserting the tuple, because it
  * avoids any data shifting when the tuple size doesn't change; and
  * even when it does, we avoid moving the line pointers around.
- * Conceivably this could also be of use to an index AM that cares about
- * the physical order of tuples as well as their ItemId order.
+ * This could be used by an index AM that doesn't want to unset the
+ * LP_DEAD bit when it happens to be set.  It could conceivably also be
+ * used by an index AM that cares about the physical order of tuples as
+ * well as their logical/ItemId order.
  *
  * If there's insufficient space for the new tuple, return false.  Other
  * errors represent data-corruption problems, so we just elog.
@@ -1142,7 +1144,8 @@ PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
 	}
 
 	/* Update the item's tuple length (other fields shouldn't change) */
-	ItemIdSetNormal(tupid, offset + size_diff, newsize);
+	tupid->lp_off = offset + size_diff;
+	tupid->lp_len = newsize;
 
 	/* Copy new tuple data onto page */
 	memcpy(PageGetItem(page, tupid), newtup, newsize);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 62285792ec..4a2f43d69c 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -28,6 +28,7 @@
 
 #include "access/commit_ts.h"
 #include "access/gin.h"
+#include "access/nbtree.h"
 #include "access/rmgr.h"
 #include "access/tableam.h"
 #include "access/transam.h"
@@ -1091,6 +1092,15 @@ static struct config_bool ConfigureNamesBool[] =
 		false,
 		check_bonjour, NULL, NULL
 	},
+	{
+		{"deduplicate_btree_items", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Enables B-tree index deduplication optimization."),
+			NULL
+		},
+		&deduplicate_btree_items,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"track_commit_timestamp", PGC_POSTMASTER, REPLICATION,
 			gettext_noop("Collects transaction commit time."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 087190ce63..54cdc6322c 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -651,6 +651,7 @@
 #vacuum_cleanup_index_scale_factor = 0.1	# fraction of total number of tuples
 						# before index cleanup, 0 always performs
 						# index cleanup
+#deduplicate_btree_items = on
 #bytea_output = 'hex'			# hex, escape
 #xmlbinary = 'base64'
 #xmloption = 'content'
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 2fd88866c9..545bb6189b 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1685,14 +1685,14 @@ psql_completion(const char *text, int start, int end)
 	/* ALTER INDEX <foo> SET|RESET ( */
 	else if (Matches("ALTER", "INDEX", MatchAny, "RESET", "("))
 		COMPLETE_WITH("fillfactor",
-					  "vacuum_cleanup_index_scale_factor",	/* BTREE */
+					  "vacuum_cleanup_index_scale_factor", "deduplicate_items",	/* BTREE */
 					  "fastupdate", "gin_pending_list_limit",	/* GIN */
 					  "buffering",	/* GiST */
 					  "pages_per_range", "autosummarize"	/* BRIN */
 			);
 	else if (Matches("ALTER", "INDEX", MatchAny, "SET", "("))
 		COMPLETE_WITH("fillfactor =",
-					  "vacuum_cleanup_index_scale_factor =",	/* BTREE */
+					  "vacuum_cleanup_index_scale_factor =", "deduplicate_items =",	/* BTREE */
 					  "fastupdate =", "gin_pending_list_limit =",	/* GIN */
 					  "buffering =",	/* GiST */
 					  "pages_per_range =", "autosummarize ="	/* BRIN */
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 6a058ccdac..dede5b51f1 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -145,6 +145,7 @@ static void bt_tuple_present_callback(Relation index, ItemPointer tid,
 									  bool tupleIsAlive, void *checkstate);
 static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 									 IndexTuple itup);
+static inline IndexTuple bt_posting_plain_tuple(IndexTuple itup, int n);
 static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 											   OffsetNumber offset);
@@ -167,6 +168,7 @@ static ItemId PageGetItemIdCareful(BtreeCheckState *state, BlockNumber block,
 								   Page page, OffsetNumber offset);
 static inline ItemPointer BTreeTupleGetHeapTIDCareful(BtreeCheckState *state,
 													  IndexTuple itup, bool nonpivot);
+static inline ItemPointer BTreeTupleGetPointsToTID(IndexTuple itup);
 
 /*
  * bt_index_check(index regclass, heapallindexed boolean)
@@ -278,7 +280,8 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
 
 	if (btree_index_mainfork_expected(indrel))
 	{
-		bool	heapkeyspace;
+		bool		heapkeyspace,
+					safededup;
 
 		RelationOpenSmgr(indrel);
 		if (!smgrexists(indrel->rd_smgr, MAIN_FORKNUM))
@@ -288,7 +291,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
 							RelationGetRelationName(indrel))));
 
 		/* Check index, possibly against table it is an index on */
-		heapkeyspace = _bt_heapkeyspace(indrel);
+		_bt_metaversion(indrel, &heapkeyspace, &safededup);
 		bt_check_every_level(indrel, heaprel, heapkeyspace, parentcheck,
 							 heapallindexed, rootdescend);
 	}
@@ -419,12 +422,12 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		/*
 		 * Size Bloom filter based on estimated number of tuples in index,
 		 * while conservatively assuming that each block must contain at least
-		 * MaxIndexTuplesPerPage / 5 non-pivot tuples.  (Non-leaf pages cannot
-		 * contain non-pivot tuples.  That's okay because they generally make
-		 * up no more than about 1% of all pages in the index.)
+		 * MaxTIDsPerBTreePage / 3 "plain" tuples -- see
+		 * bt_posting_plain_tuple() for definition, and details of how posting
+		 * list tuples are handled.
 		 */
 		total_pages = RelationGetNumberOfBlocks(rel);
-		total_elems = Max(total_pages * (MaxIndexTuplesPerPage / 5),
+		total_elems = Max(total_pages * (MaxTIDsPerBTreePage / 3),
 						  (int64) state->rel->rd_rel->reltuples);
 		/* Random seed relies on backend srandom() call to avoid repetition */
 		seed = random();
@@ -924,6 +927,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -954,13 +958,15 @@ bt_target_page_check(BtreeCheckState *state)
 		if (!_bt_check_natts(state->rel, state->heapkeyspace, state->target,
 							 offset))
 		{
+			ItemPointer tid;
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetPointsToTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -994,18 +1000,20 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * TID, since the posting list itself is validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
+			ItemPointer tid = BTreeTupleGetPointsToTID(itup);
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
-			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+			htid = psprintf("(%u,%u)", ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1017,6 +1025,40 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is a posting list tuple, make sure posting list TIDs are
+		 * in order
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+				current = BTreeTupleGetPostingN(itup, i);
+
+				if (ItemPointerCompare(current, &last) <= 0)
+				{
+					char	   *itid = psprintf("(%u,%u)", state->targetblock, offset);
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s posting list offset=%d page lsn=%X/%X.",
+												itid, i,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1049,13 +1091,14 @@ bt_target_page_check(BtreeCheckState *state)
 		if (tupsize > (lowersizelimit ? BTMaxItemSize(state->target) :
 					   BTMaxItemSizeNoHeapTid(state->target)))
 		{
+			ItemPointer tid = BTreeTupleGetPointsToTID(itup);
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1074,12 +1117,32 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				/* Fingerprint all elements as distinct "plain" tuples */
+				for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					IndexTuple	logtuple;
+
+					logtuple = bt_posting_plain_tuple(itup, i);
+					norm = bt_normalize_tuple(state, logtuple);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != logtuple)
+						pfree(norm);
+					pfree(logtuple);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
@@ -1087,7 +1150,8 @@ bt_target_page_check(BtreeCheckState *state)
 		 *
 		 * If there is a high key (if this is not the rightmost page on its
 		 * entire level), check that high key actually is upper bound on all
-		 * page items.
+		 * page items.  If this is a posting list tuple, we'll need to set
+		 * scantid to be highest TID in posting list.
 		 *
 		 * We prefer to check all items against high key rather than checking
 		 * just the last and trusting that the operator class obeys the
@@ -1127,17 +1191,22 @@ bt_target_page_check(BtreeCheckState *state)
 		 * tuple. (See also: "Notes About Data Representation" in the nbtree
 		 * README.)
 		 */
+		scantid = skey->scantid;
+		if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+			skey->scantid = BTreeTupleGetMaxHeapTID(itup);
+
 		if (!P_RIGHTMOST(topaque) &&
 			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
 			  invariant_l_offset(state, skey, P_HIKEY)))
 		{
+			ItemPointer tid = BTreeTupleGetPointsToTID(itup);
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1150,6 +1219,7 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
+		skey->scantid = scantid;
 
 		/*
 		 * * Item order check *
@@ -1160,15 +1230,17 @@ bt_target_page_check(BtreeCheckState *state)
 		if (OffsetNumberNext(offset) <= max &&
 			!invariant_l_offset(state, skey, OffsetNumberNext(offset)))
 		{
+			ItemPointer tid;
 			char	   *itid,
 					   *htid,
 					   *nitid,
 					   *nhtid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetPointsToTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 			nitid = psprintf("(%u,%u)", state->targetblock,
 							 OffsetNumberNext(offset));
 
@@ -1177,9 +1249,10 @@ bt_target_page_check(BtreeCheckState *state)
 										  state->target,
 										  OffsetNumberNext(offset));
 			itup = (IndexTuple) PageGetItem(state->target, itemid);
+			tid = BTreeTupleGetPointsToTID(itup);
 			nhtid = psprintf("(%u,%u)",
-							 ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							 ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							 ItemPointerGetBlockNumberNoCheck(tid),
+							 ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1953,10 +2026,9 @@ bt_tuple_present_callback(Relation index, ItemPointer tid, Datum *values,
  * verification.  In particular, it won't try to normalize opclass-equal
  * datums with potentially distinct representations (e.g., btree/numeric_ops
  * index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though.  For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation.
  */
 static IndexTuple
 bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -1969,6 +2041,9 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	IndexTuple	reformed;
 	int			i;
 
+	/* Caller should only pass "logical" non-pivot tuples here */
+	Assert(!BTreeTupleIsPosting(itup) && !BTreeTupleIsPivot(itup));
+
 	/* Easy case: It's immediately clear that tuple has no varlena datums */
 	if (!IndexTupleHasVarwidths(itup))
 		return itup;
@@ -2031,6 +2106,29 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	return reformed;
 }
 
+/*
+ * Produce palloc()'d "plain" tuple for nth posting list entry/TID.
+ *
+ * In general, deduplication is not supposed to change the logical contents of
+ * an index.  Multiple index tuples are merged together into one equivalent
+ * posting list index tuple when convenient.
+ *
+ * heapallindexed verification must normalize-away this variation in
+ * representation by converting posting list tuples into two or more "plain"
+ * tuples.  Each tuple must be fingerprinted separately -- there must be one
+ * tuple for each corresponding Bloom filter probe during the heap scan.
+ *
+ * Note: Caller still needs to call bt_normalize_tuple() with returned tuple.
+ */
+static inline IndexTuple
+bt_posting_plain_tuple(IndexTuple itup, int n)
+{
+	Assert(BTreeTupleIsPosting(itup));
+
+	/* Returns non-posting-list tuple */
+	return _bt_form_posting(itup, BTreeTupleGetPostingN(itup, n), 1);
+}
+
 /*
  * Search for itup in index, starting from fast root page.  itup must be a
  * non-pivot tuple.  This is only supported with heapkeyspace indexes, since
@@ -2087,6 +2185,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		insertstate.itup = itup;
 		insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 		insertstate.itup_key = key;
+		insertstate.postingoff = 0;
 		insertstate.bounds_valid = false;
 		insertstate.buf = lbuf;
 
@@ -2094,7 +2193,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		offnum = _bt_binsrch_insert(state->rel, &insertstate);
 		/* Compare first >= matching item on leaf page, if any */
 		page = BufferGetPage(lbuf);
+		/* Should match on first heap TID when tuple has a posting list */
 		if (offnum <= PageGetMaxOffsetNumber(page) &&
+			insertstate.postingoff <= 0 &&
 			_bt_compare(state->rel, key, page, offnum) == 0)
 			exists = true;
 		_bt_relbuf(state->rel, lbuf);
@@ -2548,26 +2649,69 @@ PageGetItemIdCareful(BtreeCheckState *state, BlockNumber block, Page page,
 }
 
 /*
- * BTreeTupleGetHeapTID() wrapper that lets caller enforce that a heap TID must
- * be present in cases where that is mandatory.
- *
- * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
- * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
- * It may become more useful in the future, when non-pivot tuples support their
- * own alternative INDEX_ALT_TID_MASK representation.
+ * BTreeTupleGetHeapTID() wrapper that enforces that a heap TID is present in
+ * cases where that is mandatory (i.e. for non-pivot tuples)
  */
 static inline ItemPointer
 BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
 							bool nonpivot)
 {
-	ItemPointer result = BTreeTupleGetHeapTID(itup);
-	BlockNumber targetblock = state->targetblock;
+	ItemPointer htid;
 
-	if (result == NULL && nonpivot)
+	/*
+	 * Caller determines whether this is supposed to be a pivot or non-pivot
+	 * tuple using page type and item offset number.  Verify that tuple
+	 * metadata agrees with this.
+	 */
+	Assert(state->heapkeyspace);
+	if (BTreeTupleIsPivot(itup) && nonpivot)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg_internal("block %u or its right sibling block or child block in index \"%s\" has unexpected pivot tuple",
+								 state->targetblock,
+								 RelationGetRelationName(state->rel))));
+
+	if (!BTreeTupleIsPivot(itup) && !nonpivot)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg_internal("block %u or its right sibling block or child block in index \"%s\" has unexpected non-pivot tuple",
+								 state->targetblock,
+								 RelationGetRelationName(state->rel))));
+
+	htid = BTreeTupleGetHeapTID(itup);
+	if (!ItemPointerIsValid(htid) && nonpivot)
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
-						targetblock, RelationGetRelationName(state->rel))));
+						state->targetblock,
+						RelationGetRelationName(state->rel))));
 
-	return result;
+	return htid;
+}
+
+/*
+ * Return the "pointed to" TID for itup, which is used to generate a
+ * descriptive error message.  itup must be a "data item" tuple (it wouldn't
+ * make much sense to call here with a high key tuple, since there won't be a
+ * valid downlink/block number to display).
+ *
+ * Returns either a heap TID (which will be the first heap TID in posting list
+ * if itup is posting list tuple), or a TID that contains downlink block
+ * number, plus some encoded metadata (e.g., the number of attributes present
+ * in itup).
+ */
+static inline ItemPointer
+BTreeTupleGetPointsToTID(IndexTuple itup)
+{
+	/*
+	 * Rely on the assumption that !heapkeyspace internal page data items will
+	 * correctly return TID with downlink here -- BTreeTupleGetHeapTID() won't
+	 * recognize it as a pivot tuple, but everything still works out because
+	 * the t_tid field is still returned
+	 */
+	if (!BTreeTupleIsPivot(itup))
+		return BTreeTupleGetHeapTID(itup);
+
+	/* Pivot tuple returns TID with downlink block (heapkeyspace variant) */
+	return &itup->t_tid;
 }
diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
index 5881ea5dd6..39e9014128 100644
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@@ -433,11 +433,123 @@ returns bool
 <sect1 id="btree-implementation">
  <title>Implementation</title>
 
+ <para>
+  Internally, a B-tree index consists of a tree structure with leaf
+  pages.  Each leaf page contains tuples that point to table entries
+  using a heap item pointer. Each tuple's key is considered unique
+  internally, since the item pointer is treated as part of the key.
+ </para>
+ <para>
+  An introduction to the btree index implementation can be found in
+  <filename>src/backend/access/nbtree/README</filename>.
+ </para>
+
+ <sect2 id="btree-deduplication">
+  <title>Deduplication</title>
   <para>
-   An introduction to the btree index implementation can be found in
-   <filename>src/backend/access/nbtree/README</filename>.
+   B-Tree supports <firstterm>deduplication</firstterm>.  Existing
+   leaf page tuples with fully equal keys (equal prior to the heap
+   item pointer) are merged together into a single <quote>posting
+   list</quote> tuple.  The keys appear only once in this
+   representation.  A simple array of heap item pointers follows.
+   Posting lists are formed <quote>lazily</quote>, when a new item is
+   inserted that cannot fit on an existing leaf page.  The immediate
+   goal of the deduplication process is to at least free enough space
+   to fit the new item; otherwise a leaf page split occurs, which
+   allocates a new leaf page.  The <firstterm>key space</firstterm>
+   covered by the original leaf page is shared among the original page,
+   and its new right sibling page.
+  </para>
+  <para>
+   Deduplication can greatly increase index space efficiency with data
+   sets where each distinct key appears at least a few times on
+   average.  It can also reduce the cost of subsequent index scans,
+   especially when many leaf pages must be accessed.  For example, an
+   index on a simple <type>integer</type> column that uses
+   deduplication will have a storage size that is only about 65% of an
+   equivalent unoptimized index when each distinct
+   <type>integer</type> value appears three times.  If each distinct
+   <type>integer</type> value appears six times, the storage overhead
+   can be as low as 50% of baseline.  With hundreds of duplicates per
+   distinct value (or with larger <quote>base</quote> key values) a
+   storage size of about <emphasis>one third</emphasis> of the
+   unoptimized case is expected.  There is often a direct benefit for
+   queries, as well as an indirect benefit due to reduced I/O during
+   routine vacuuming.
+  </para>
+  <para>
+   Cases that don't benefit due to having no duplicate values will
+   incur a small performance penalty with mixed read-write workloads.
+   There is no performance penalty with read-only workloads, since
+   reading from posting lists is at least as efficient as reading the
+   standard index tuple representation.
+  </para>
+ </sect2>
+
+ <sect2 id="btree-deduplication-configure">
+  <title>Configuring Deduplication</title>
+
+  <para>
+   The <xref linkend="guc-btree-deduplicate-items"/> configuration
+   parameter controls deduplication.  By default, deduplication is
+   enabled.  The <literal>deduplicate_items</literal> storage
+   parameter can be used to override the configuration paramater for
+   individual indexes.  See <xref
+   linkend="sql-createindex-storage-parameters"/> from the
+   <command>CREATE INDEX</command> documentation for details.
+  </para>
+ </sect2>
+
+ <sect2 id="btree-deduplication-restrictions">
+  <title>Restrictions</title>
+
+  <para>
+   Deduplication can only be used with indexes that use B-Tree
+   operator classes that were declared <literal>BITWISE</literal>.  In
+   practice almost all datatypes support deduplication, though
+   <type>numeric</type> is a notable exception (the <quote>display
+   scale</quote> feature makes it impossible to enable deduplication
+   without losing useful information about equal <type>numeric</type>
+   datums).  Deduplication is not supported with nondeterministic
+   collations, nor is it supported with <literal>INCLUDE</literal>
+   indexes.
+  </para>
+  <para>
+   Note that a multicolumn index is only considered to have duplicates
+   when there are index entries that repeat entire
+   <emphasis>combinations</emphasis> of values (the values stored in
+   each and every column must be equal).
   </para>
 
+ </sect2>
+
+ <sect2 id="btree-deduplication-unique">
+  <title>Internal use of Deduplication in unique indexes</title>
+
+  <para>
+   Page splits that occur due to inserting multiple physical versions
+   (rather than inserting new logical rows) tend to degrade the
+   structure of indexes, especially in the case of unique indexes.
+   Unique indexes use deduplication <emphasis>internally</emphasis>
+   and <emphasis>selectively</emphasis> to delay (and ideally to
+   prevent) these <quote>unnecessary</quote> page splits.
+   <command>VACUUM</command> or autovacuum will eventually remove dead
+   versions of tuples from every index in any case, but usually cannot
+   reverse page splits (in general, the page must be completely empty
+   before <command>VACUUM</command> can <quote>delete</quote> it).
+  </para>
+  <para>
+   The <xref linkend="guc-btree-deduplicate-items"/> configuration
+   parameter does not affect whether or not deduplication is used
+   within unique indexes.  The internal use of deduplication for
+   unique indexes is subject to all of the same restrictions as
+   deduplication in general.  The <literal>deduplicate_items</literal>
+   storage parameter can be set to <literal>OFF</literal> to disable
+   deduplication in unique indexes, but this is intended only as a
+   debugging option for developers.
+  </para>
+
+ </sect2>
 </sect1>
 
 </chapter>
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 55669b5cad..9f371d3e3a 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -928,10 +928,11 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
      nondeterministic collations give a more <quote>correct</quote> behavior,
      especially when considering the full power of Unicode and its many
      special cases, they also have some drawbacks.  Foremost, their use leads
-     to a performance penalty.  Also, certain operations are not possible with
-     nondeterministic collations, such as pattern matching operations.
-     Therefore, they should be used only in cases where they are specifically
-     wanted.
+     to a performance penalty.  Note, in particular, that B-tree cannot use
+     deduplication with indexes that use a nondeterministic collation.  Also,
+     certain operations are not possible with nondeterministic collations,
+     such as pattern matching operations.  Therefore, they should be used
+     only in cases where they are specifically wanted.
     </para>
    </sect3>
   </sect2>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5d1c90282f..1603f8387b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8021,6 +8021,31 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-btree-deduplicate-items" xreflabel="deduplicate_btree_items">
+      <term><varname>deduplicate_btree_items</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>deduplicate_btree_items</varname></primary>
+       <secondary>configuration parameter</secondary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Controls whether deduplication should be used within B-Tree
+        indexes.  Deduplication is an optimization that reduces the
+        storage size of indexes by storing equal index keys only once.
+        See <xref linkend="btree-deduplication"/> for more
+        information.
+       </para>
+
+       <para>
+        This setting can be overridden for individual B-Tree indexes
+        by changing index storage parameters.  See <xref
+        linkend="sql-createindex-storage-parameters"/> from the
+        <command>CREATE INDEX</command> documentation for details.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-bytea-output" xreflabel="bytea_output">
       <term><varname>bytea_output</varname> (<type>enum</type>)
       <indexterm>
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 629a31ef79..6659d15bf4 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -166,6 +166,8 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
         maximum size allowed for the index type, data insertion will fail.
         In any case, non-key columns duplicate data from the index's table
         and bloat the size of the index, thus potentially slowing searches.
+        Moreover, B-tree deduplication is never used with indexes that
+        have a non-key column.
        </para>
 
        <para>
@@ -388,10 +390,40 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    </variablelist>
 
    <para>
-    B-tree indexes additionally accept this parameter:
+    B-tree indexes also accept these parameters:
    </para>
 
    <variablelist>
+   <varlistentry id="index-reloption-deduplication" xreflabel="deduplicate_items">
+    <term><literal>deduplicate_items</literal>
+     <indexterm>
+      <primary><varname>deduplicate_items</varname></primary>
+      <secondary>storage parameter</secondary>
+     </indexterm>
+    </term>
+    <listitem>
+    <para>
+      Per-index value for <xref
+      linkend="guc-btree-deduplicate-items"/>.  Controls usage of the
+      B-tree deduplication technique described in <xref
+      linkend="btree-deduplication"/>.  Set to <literal>ON</literal>
+      or <literal>OFF</literal> to override GUC.  (Alternative
+      spellings of <literal>ON</literal> and <literal>OFF</literal>
+      are allowed as described in <xref linkend="config-setting"/>.)
+      The default is <literal>ON</literal>.
+    </para>
+
+    <note>
+     <para>
+      Turning <literal>deduplicate_items</literal> off via
+      <command>ALTER INDEX</command> prevents future insertions from
+      triggering deduplication, but does not in itself make existing
+      posting list tuples use the standard tuple representation.
+     </para>
+    </note>
+    </listitem>
+   </varlistentry>
+
    <varlistentry id="index-reloption-vacuum-cleanup-index-scale-factor" xreflabel="vacuum_cleanup_index_scale_factor">
     <term><literal>vacuum_cleanup_index_scale_factor</literal>
      <indexterm>
@@ -446,9 +478,7 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
      This setting controls usage of the fast update technique described in
      <xref linkend="gin-fast-update"/>.  It is a Boolean parameter:
      <literal>ON</literal> enables fast update, <literal>OFF</literal> disables it.
-     (Alternative spellings of <literal>ON</literal> and <literal>OFF</literal> are
-     allowed as described in <xref linkend="config-setting"/>.)  The
-     default is <literal>ON</literal>.
+     The default is <literal>ON</literal>.
     </para>
 
     <note>
diff --git a/src/test/regress/expected/btree_index.out b/src/test/regress/expected/btree_index.out
index f567117a46..3d353cefdf 100644
--- a/src/test/regress/expected/btree_index.out
+++ b/src/test/regress/expected/btree_index.out
@@ -266,6 +266,22 @@ select * from btree_bpchar where f1::bpchar like 'foo%';
  fool
 (2 rows)
 
+--
+-- Perform unique checking, with and without the use of deduplication
+--
+CREATE TABLE dedup_unique_test_table (a int) WITH (autovacuum_enabled=false);
+CREATE UNIQUE INDEX dedup_unique ON dedup_unique_test_table (a) WITH (deduplicate_items=on);
+CREATE UNIQUE INDEX plain_unique ON dedup_unique_test_table (a) WITH (deduplicate_items=off);
+-- Generate enough garbage tuples in index to ensure that even the unique index
+-- with deduplication enabled has to check multiple leaf pages during unique
+-- checking (at least with a BLCKSZ of 8192 or less)
+DO $$
+BEGIN
+    FOR r IN 1..1350 LOOP
+        DELETE FROM dedup_unique_test_table;
+        INSERT INTO dedup_unique_test_table SELECT 1;
+    END LOOP;
+END$$;
 --
 -- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
diff --git a/src/test/regress/sql/btree_index.sql b/src/test/regress/sql/btree_index.sql
index 558dcae0ec..b0b81b2b9a 100644
--- a/src/test/regress/sql/btree_index.sql
+++ b/src/test/regress/sql/btree_index.sql
@@ -103,6 +103,23 @@ explain (costs off)
 select * from btree_bpchar where f1::bpchar like 'foo%';
 select * from btree_bpchar where f1::bpchar like 'foo%';
 
+--
+-- Perform unique checking, with and without the use of deduplication
+--
+CREATE TABLE dedup_unique_test_table (a int) WITH (autovacuum_enabled=false);
+CREATE UNIQUE INDEX dedup_unique ON dedup_unique_test_table (a) WITH (deduplicate_items=on);
+CREATE UNIQUE INDEX plain_unique ON dedup_unique_test_table (a) WITH (deduplicate_items=off);
+-- Generate enough garbage tuples in index to ensure that even the unique index
+-- with deduplication enabled has to check multiple leaf pages during unique
+-- checking (at least with a BLCKSZ of 8192 or less)
+DO $$
+BEGIN
+    FOR r IN 1..1350 LOOP
+        DELETE FROM dedup_unique_test_table;
+        INSERT INTO dedup_unique_test_table SELECT 1;
+    END LOOP;
+END$$;
+
 --
 -- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
-- 
2.17.1

v29-0002-Teach-pageinspect-about-nbtree-posting-lists.patchapplication/octet-stream; name=v29-0002-Teach-pageinspect-about-nbtree-posting-lists.patchDownload

From 67f6b01dfdcc394ca025b357c90fec436df6d59a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v29 2/3] Teach pageinspect about nbtree posting lists.

Add a column for posting list TIDs to bt_page_items().  Also add a
column that displays a single heap TID value for each tuple, regardless
of whether or not "ctid" is used for heap TID.  In the case of posting
list tuples, the value is the lowest heap TID in the posting list.
Arguably I should have done this when commit dd299df8 went in, since
that added a pivot tuple representation that could have a heap TID but
didn't use ctid for that purpose.

Also add a boolean column that displays the LP_DEAD bit value for each
non-pivot tuple.

No version bump for the pageinspect extension, since there hasn't been a
stable release since the last version bump (see commit 58b4cb30).
---
 contrib/pageinspect/btreefuncs.c              | 118 +++++++++++++++---
 contrib/pageinspect/expected/btree.out        |   7 ++
 contrib/pageinspect/pageinspect--1.7--1.8.sql |  53 ++++++++
 doc/src/sgml/pageinspect.sgml                 |  83 ++++++------
 4 files changed, 206 insertions(+), 55 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 78cdc69ec7..1b2ea14122 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -31,9 +31,11 @@
 #include "access/relation.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_am.h"
+#include "catalog/pg_type.h"
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "pageinspect.h"
+#include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
 #include "utils/varlena.h"
@@ -45,6 +47,8 @@ PG_FUNCTION_INFO_V1(bt_page_stats);
 
 #define IS_INDEX(r) ((r)->rd_rel->relkind == RELKIND_INDEX)
 #define IS_BTREE(r) ((r)->rd_rel->relam == BTREE_AM_OID)
+#define DatumGetItemPointer(X)	 ((ItemPointer) DatumGetPointer(X))
+#define ItemPointerGetDatum(X)	 PointerGetDatum(X)
 
 /* note: BlockNumber is unsigned, hence can't be negative */
 #define CHECK_RELATION_BLOCK_RANGE(rel, blkno) { \
@@ -243,6 +247,9 @@ struct user_args
 {
 	Page		page;
 	OffsetNumber offset;
+	bool		leafpage;
+	bool		rightmost;
+	TupleDesc	tupd;
 };
 
 /*-------------------------------------------------------
@@ -252,17 +259,25 @@ struct user_args
  * ------------------------------------------------------
  */
 static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, struct user_args *uargs)
 {
-	char	   *values[6];
+	Page		page = uargs->page;
+	OffsetNumber offset = uargs->offset;
+	bool		leafpage = uargs->leafpage;
+	bool		rightmost = uargs->rightmost;
+	bool		pivotoffset;
+	Datum		values[9];
+	bool		nulls[9];
 	HeapTuple	tuple;
 	ItemId		id;
 	IndexTuple	itup;
 	int			j;
 	int			off;
 	int			dlen;
-	char	   *dump;
+	char	   *dump,
+			   *datacstring;
 	char	   *ptr;
+	ItemPointer htid;
 
 	id = PageGetItemId(page, offset);
 
@@ -272,18 +287,27 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	itup = (IndexTuple) PageGetItem(page, id);
 
 	j = 0;
-	values[j++] = psprintf("%d", offset);
-	values[j++] = psprintf("(%u,%u)",
-						   ItemPointerGetBlockNumberNoCheck(&itup->t_tid),
-						   ItemPointerGetOffsetNumberNoCheck(&itup->t_tid));
-	values[j++] = psprintf("%d", (int) IndexTupleSize(itup));
-	values[j++] = psprintf("%c", IndexTupleHasNulls(itup) ? 't' : 'f');
-	values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
+	memset(nulls, 0, sizeof(nulls));
+	values[j++] = DatumGetInt16(offset);
+	values[j++] = ItemPointerGetDatum(&itup->t_tid);
+	values[j++] = Int32GetDatum((int) IndexTupleSize(itup));
+	values[j++] = BoolGetDatum(IndexTupleHasNulls(itup));
+	values[j++] = BoolGetDatum(IndexTupleHasVarwidths(itup));
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
 	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+
+	/*
+	 * Make sure that "data" column does not include posting list or pivot
+	 * tuple representation of heap TID
+	 */
+	if (BTreeTupleIsPosting(itup))
+		dlen -= IndexTupleSize(itup) - BTreeTupleGetPostingOffset(itup);
+	else if (BTreeTupleIsPivot(itup) && BTreeTupleGetHeapTID(itup) != NULL)
+		dlen -= MAXALIGN(sizeof(ItemPointerData));
+
 	dump = palloc0(dlen * 3 + 1);
-	values[j] = dump;
+	datacstring = dump;
 	for (off = 0; off < dlen; off++)
 	{
 		if (off > 0)
@@ -291,8 +315,57 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 		sprintf(dump, "%02x", *(ptr + off) & 0xff);
 		dump += 2;
 	}
+	values[j++] = CStringGetTextDatum(datacstring);
+	pfree(datacstring);
 
-	tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
+	/*
+	 * Avoid indicating that pivot tuple from !heapkeyspace index (which won't
+	 * have v4+ status bit set) is dead or has a heap TID -- that can only
+	 * happen with non-pivot tuples.  (Most backend code can use the
+	 * heapkeyspace field from the metapage to figure out which representation
+	 * to expect, but we have to be a bit creative here.)
+	 */
+	pivotoffset = (!leafpage || (!rightmost && offset == P_HIKEY));
+
+	/* LP_DEAD status bit */
+	if (!pivotoffset)
+		values[j++] = BoolGetDatum(ItemIdIsDead(id));
+	else
+		nulls[j++] = true;
+
+	htid = BTreeTupleGetHeapTID(itup);
+	if (pivotoffset && !BTreeTupleIsPivot(itup))
+		htid = NULL;
+
+	if (htid)
+		values[j++] = ItemPointerGetDatum(htid);
+	else
+		nulls[j++] = true;
+
+	if (BTreeTupleIsPosting(itup))
+	{
+		/* build an array of item pointers */
+		ItemPointer tids;
+		Datum	   *tids_datum;
+		int			nposting;
+
+		tids = BTreeTupleGetPosting(itup);
+		nposting = BTreeTupleGetNPosting(itup);
+		tids_datum = (Datum *) palloc(nposting * sizeof(Datum));
+		for (int i = 0; i < nposting; i++)
+			tids_datum[i] = ItemPointerGetDatum(&tids[i]);
+		values[j++] = PointerGetDatum(construct_array(tids_datum,
+													  nposting,
+													  TIDOID,
+													  sizeof(ItemPointerData),
+													  false, 's'));
+		pfree(tids_datum);
+	}
+	else
+		nulls[j++] = true;
+
+	/* Build and return the result tuple */
+	tuple = heap_form_tuple(uargs->tupd, values, nulls);
 
 	return HeapTupleGetDatum(tuple);
 }
@@ -378,12 +451,13 @@ bt_page_items(PG_FUNCTION_ARGS)
 			elog(NOTICE, "page is deleted");
 
 		fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+		uargs->leafpage = P_ISLEAF(opaque);
+		uargs->rightmost = P_RIGHTMOST(opaque);
 
 		/* Build a tuple descriptor for our result type */
 		if (get_call_result_type(fcinfo, NULL, &tupleDesc) != TYPEFUNC_COMPOSITE)
 			elog(ERROR, "return type must be a row type");
-
-		fctx->attinmeta = TupleDescGetAttInMetadata(tupleDesc);
+		uargs->tupd = tupleDesc;
 
 		fctx->user_fctx = uargs;
 
@@ -395,7 +469,7 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
@@ -463,12 +537,13 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 			elog(NOTICE, "page is deleted");
 
 		fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+		uargs->leafpage = P_ISLEAF(opaque);
+		uargs->rightmost = P_RIGHTMOST(opaque);
 
 		/* Build a tuple descriptor for our result type */
 		if (get_call_result_type(fcinfo, NULL, &tupleDesc) != TYPEFUNC_COMPOSITE)
 			elog(ERROR, "return type must be a row type");
-
-		fctx->attinmeta = TupleDescGetAttInMetadata(tupleDesc);
+		uargs->tupd = tupleDesc;
 
 		fctx->user_fctx = uargs;
 
@@ -480,7 +555,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
@@ -510,7 +585,7 @@ bt_metap(PG_FUNCTION_ARGS)
 	BTMetaPageData *metad;
 	TupleDesc	tupleDesc;
 	int			j;
-	char	   *values[8];
+	char	   *values[9];
 	Buffer		buffer;
 	Page		page;
 	HeapTuple	tuple;
@@ -557,17 +632,20 @@ bt_metap(PG_FUNCTION_ARGS)
 
 	/*
 	 * Get values of extended metadata if available, use default values
-	 * otherwise.
+	 * otherwise.  Note that we rely on the assumption that btm_safededup is
+	 * initialized to zero on databases that were initdb'd before Postgres 13.
 	 */
 	if (metad->btm_version >= BTREE_NOVAC_VERSION)
 	{
 		values[j++] = psprintf("%u", metad->btm_oldest_btpo_xact);
 		values[j++] = psprintf("%f", metad->btm_last_cleanup_num_heap_tuples);
+		values[j++] = metad->btm_safededup ? "t" : "f";
 	}
 	else
 	{
 		values[j++] = "0";
 		values[j++] = "-1";
+		values[j++] = "f";
 	}
 
 	tuple = BuildTupleFromCStrings(TupleDescGetAttInMetadata(tupleDesc),
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..92d5c59654 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -12,6 +12,7 @@ fastroot                | 1
 fastlevel               | 0
 oldest_xact             | 0
 last_cleanup_num_tuples | -1
+safededup               | t
 
 SELECT * FROM bt_page_stats('test1_a_idx', 0);
 ERROR:  block 0 is a meta page
@@ -41,6 +42,9 @@ itemlen    | 16
 nulls      | f
 vars       | f
 data       | 01 00 00 00 00 00 00 01
+dead       | f
+htid       | (0,1)
+tids       | 
 
 SELECT * FROM bt_page_items('test1_a_idx', 2);
 ERROR:  block number out of range
@@ -54,6 +58,9 @@ itemlen    | 16
 nulls      | f
 vars       | f
 data       | 01 00 00 00 00 00 00 01
+dead       | f
+htid       | (0,1)
+tids       | 
 
 SELECT * FROM bt_page_items(get_raw_page('test1_a_idx', 2));
 ERROR:  block number 2 is out of range for relation "test1_a_idx"
diff --git a/contrib/pageinspect/pageinspect--1.7--1.8.sql b/contrib/pageinspect/pageinspect--1.7--1.8.sql
index 2a7c4b3516..93ea37cde3 100644
--- a/contrib/pageinspect/pageinspect--1.7--1.8.sql
+++ b/contrib/pageinspect/pageinspect--1.7--1.8.sql
@@ -14,3 +14,56 @@ CREATE FUNCTION heap_tuple_infomask_flags(
 RETURNS record
 AS 'MODULE_PATHNAME', 'heap_tuple_infomask_flags'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_metap()
+--
+DROP FUNCTION bt_metap(text);
+CREATE FUNCTION bt_metap(IN relname text,
+    OUT magic int4,
+    OUT version int4,
+    OUT root int4,
+    OUT level int4,
+    OUT fastroot int4,
+    OUT fastlevel int4,
+    OUT oldest_xact int4,
+    OUT last_cleanup_num_tuples real,
+    OUT safededup boolean)
+AS 'MODULE_PATHNAME', 'bt_metap'
+LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items(text, int4)
+--
+DROP FUNCTION bt_page_items(text, int4);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT dead boolean,
+    OUT htid tid,
+    OUT tids tid[])
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items(bytea)
+--
+DROP FUNCTION bt_page_items(bytea);
+CREATE FUNCTION bt_page_items(IN page bytea,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT dead boolean,
+    OUT htid tid,
+    OUT tids tid[])
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items_bytea'
+LANGUAGE C STRICT PARALLEL SAFE;
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 7e2e1487d7..b527daf6ca 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -300,13 +300,14 @@ test=# SELECT t_ctid, raw_flags, combined_flags
 test=# SELECT * FROM bt_metap('pg_cast_oid_index');
 -[ RECORD 1 ]-----------+-------
 magic                   | 340322
-version                 | 3
+version                 | 4
 root                    | 1
 level                   | 0
 fastroot                | 1
 fastlevel               | 0
 oldest_xact             | 582
 last_cleanup_num_tuples | 1000
+safededup               | f
 </screen>
      </para>
     </listitem>
@@ -329,11 +330,11 @@ test=# SELECT * FROM bt_page_stats('pg_cast_oid_index', 1);
 -[ RECORD 1 ]-+-----
 blkno         | 1
 type          | l
-live_items    | 256
+live_items    | 224
 dead_items    | 0
-avg_item_size | 12
+avg_item_size | 16
 page_size     | 8192
-free_size     | 4056
+free_size     | 3668
 btpo_prev     | 0
 btpo_next     | 0
 btpo          | 0
@@ -356,33 +357,45 @@ btpo_flags    | 3
       <function>bt_page_items</function> returns detailed information about
       all of the items on a B-tree index page.  For example:
 <screen>
-test=# SELECT * FROM bt_page_items('pg_cast_oid_index', 1);
- itemoffset |  ctid   | itemlen | nulls | vars |    data
-------------+---------+---------+-------+------+-------------
-          1 | (0,1)   |      12 | f     | f    | 23 27 00 00
-          2 | (0,2)   |      12 | f     | f    | 24 27 00 00
-          3 | (0,3)   |      12 | f     | f    | 25 27 00 00
-          4 | (0,4)   |      12 | f     | f    | 26 27 00 00
-          5 | (0,5)   |      12 | f     | f    | 27 27 00 00
-          6 | (0,6)   |      12 | f     | f    | 28 27 00 00
-          7 | (0,7)   |      12 | f     | f    | 29 27 00 00
-          8 | (0,8)   |      12 | f     | f    | 2a 27 00 00
+regression=# SELECT * FROM bt_page_items('tenk2_unique1', 5);
+ itemoffset |   ctid   | itemlen | nulls | vars |          data           | dead |   htid   | tids
+------------+----------+---------+-------+------+-------------------------+------+----------+------
+          1 | (40,1)   |      16 | f     | f    | b8 05 00 00 00 00 00 00 |      |          |
+          2 | (58,11)  |      16 | f     | f    | 4a 04 00 00 00 00 00 00 | f    | (58,11)  |
+          3 | (266,4)  |      16 | f     | f    | 4b 04 00 00 00 00 00 00 | f    | (266,4)  |
+          4 | (279,25) |      16 | f     | f    | 4c 04 00 00 00 00 00 00 | f    | (279,25) |
+          5 | (333,11) |      16 | f     | f    | 4d 04 00 00 00 00 00 00 | f    | (333,11) |
+          6 | (87,24)  |      16 | f     | f    | 4e 04 00 00 00 00 00 00 | f    | (87,24)  |
+          7 | (38,22)  |      16 | f     | f    | 4f 04 00 00 00 00 00 00 | f    | (38,22)  |
+          8 | (272,17) |      16 | f     | f    | 50 04 00 00 00 00 00 00 | f    | (272,17) |
 </screen>
-      In a B-tree leaf page, <structfield>ctid</structfield> points to a heap tuple.
-      In an internal page, the block number part of <structfield>ctid</structfield>
-      points to another page in the index itself, while the offset part
-      (the second number) is ignored and is usually 1.
+      In a B-tree leaf page, <structfield>ctid</structfield> usually
+      points to a heap tuple, and <structfield>dead</structfield> may
+      indicate that the item has its <literal>LP_DEAD</literal> bit
+      set.  In an internal page, the block number part of
+      <structfield>ctid</structfield> points to another page in the
+      index itself, while the offset part (the second number) encodes
+      metadata about the tuple.  Posting list tuples on leaf pages
+      also use <structfield>ctid</structfield> for metadata.
+      <structfield>htid</structfield> always shows a single heap TID
+      for the tuple, regardless of how it is represented (internal
+      page tuples may need to store a heap TID when there are many
+      duplicate tuples on descendent leaf pages).
+      <structfield>tids</structfield> is a list of TIDs that is stored
+      within posting list tuples (tuples created by deduplication).
      </para>
      <para>
       Note that the first item on any non-rightmost page (any page with
       a non-zero value in the <structfield>btpo_next</structfield> field) is the
       page's <quote>high key</quote>, meaning its <structfield>data</structfield>
       serves as an upper bound on all items appearing on the page, while
-      its <structfield>ctid</structfield> field is meaningless.  Also, on non-leaf
-      pages, the first real data item (the first item that is not a high
-      key) is a <quote>minus infinity</quote> item, with no actual value
-      in its <structfield>data</structfield> field.  Such an item does have a valid
-      downlink in its <structfield>ctid</structfield> field, however.
+      its <structfield>ctid</structfield> field does not point to
+      another block.  Also, on non-leaf pages, the first real data item
+      (the first item that is not a high key) is a <quote>minus
+      infinity</quote> item, with no actual value in its
+      <structfield>data</structfield> field.  Such an item does have a
+      valid downlink in its <structfield>ctid</structfield> field,
+      however.
      </para>
     </listitem>
    </varlistentry>
@@ -402,17 +415,17 @@ test=# SELECT * FROM bt_page_items('pg_cast_oid_index', 1);
       with <function>get_raw_page</function> should be passed as argument.  So
       the last example could also be rewritten like this:
 <screen>
-test=# SELECT * FROM bt_page_items(get_raw_page('pg_cast_oid_index', 1));
- itemoffset |  ctid   | itemlen | nulls | vars |    data
-------------+---------+---------+-------+------+-------------
-          1 | (0,1)   |      12 | f     | f    | 23 27 00 00
-          2 | (0,2)   |      12 | f     | f    | 24 27 00 00
-          3 | (0,3)   |      12 | f     | f    | 25 27 00 00
-          4 | (0,4)   |      12 | f     | f    | 26 27 00 00
-          5 | (0,5)   |      12 | f     | f    | 27 27 00 00
-          6 | (0,6)   |      12 | f     | f    | 28 27 00 00
-          7 | (0,7)   |      12 | f     | f    | 29 27 00 00
-          8 | (0,8)   |      12 | f     | f    | 2a 27 00 00
+regression=# SELECT * FROM bt_page_items(get_raw_page('tenk2_unique1', 5));
+ itemoffset |   ctid   | itemlen | nulls | vars |          data           | dead |   htid   | tids
+------------+----------+---------+-------+------+-------------------------+------+----------+------
+          1 | (40,1)   |      16 | f     | f    | b8 05 00 00 00 00 00 00 |      |          |
+          2 | (58,11)  |      16 | f     | f    | 4a 04 00 00 00 00 00 00 | f    | (58,11)  |
+          3 | (266,4)  |      16 | f     | f    | 4b 04 00 00 00 00 00 00 | f    | (266,4)  |
+          4 | (279,25) |      16 | f     | f    | 4c 04 00 00 00 00 00 00 | f    | (279,25) |
+          5 | (333,11) |      16 | f     | f    | 4d 04 00 00 00 00 00 00 | f    | (333,11) |
+          6 | (87,24)  |      16 | f     | f    | 4e 04 00 00 00 00 00 00 | f    | (87,24)  |
+          7 | (38,22)  |      16 | f     | f    | 4f 04 00 00 00 00 00 00 | f    | (38,22)  |
+          8 | (272,17) |      16 | f     | f    | 50 04 00 00 00 00 00 00 | f    | (272,17) |
 </screen>
       All the other details are the same as explained in the previous item.
      </para>
-- 
2.17.1

v29-0003-DEBUG-Show-index-values-in-pageinspect.patchapplication/octet-stream; name=v29-0003-DEBUG-Show-index-values-in-pageinspect.patchDownload

From 700a3de4b2804fe13a6bbe3e6340059c1744850f Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Nov 2019 19:35:30 -0800
Subject: [PATCH v29 3/3] DEBUG: Show index values in pageinspect

This is not intended for commit.  It is included as a convenience for
reviewers.
---
 contrib/pageinspect/btreefuncs.c       | 65 ++++++++++++++++++--------
 contrib/pageinspect/expected/btree.out |  2 +-
 2 files changed, 47 insertions(+), 20 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 1b2ea14122..fc1252455d 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -27,6 +27,7 @@
 
 #include "postgres.h"
 
+#include "access/genam.h"
 #include "access/nbtree.h"
 #include "access/relation.h"
 #include "catalog/namespace.h"
@@ -245,6 +246,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
  */
 struct user_args
 {
+	Relation	rel;
 	Page		page;
 	OffsetNumber offset;
 	bool		leafpage;
@@ -261,6 +263,7 @@ struct user_args
 static Datum
 bt_page_print_tuples(FuncCallContext *fctx, struct user_args *uargs)
 {
+	Relation	rel = uargs->rel;
 	Page		page = uargs->page;
 	OffsetNumber offset = uargs->offset;
 	bool		leafpage = uargs->leafpage;
@@ -295,26 +298,48 @@ bt_page_print_tuples(FuncCallContext *fctx, struct user_args *uargs)
 	values[j++] = BoolGetDatum(IndexTupleHasVarwidths(itup));
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
-	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
-
-	/*
-	 * Make sure that "data" column does not include posting list or pivot
-	 * tuple representation of heap TID
-	 */
-	if (BTreeTupleIsPosting(itup))
-		dlen -= IndexTupleSize(itup) - BTreeTupleGetPostingOffset(itup);
-	else if (BTreeTupleIsPivot(itup) && BTreeTupleGetHeapTID(itup) != NULL)
-		dlen -= MAXALIGN(sizeof(ItemPointerData));
-
-	dump = palloc0(dlen * 3 + 1);
-	datacstring = dump;
-	for (off = 0; off < dlen; off++)
+	if (rel)
 	{
-		if (off > 0)
-			*dump++ = ' ';
-		sprintf(dump, "%02x", *(ptr + off) & 0xff);
-		dump += 2;
+		TupleDesc	itupdesc = RelationGetDescr(rel);
+		Datum		datvalues[INDEX_MAX_KEYS];
+		bool		isnull[INDEX_MAX_KEYS];
+		int			natts;
+		int			indnkeyatts = rel->rd_index->indnkeyatts;
+
+		natts = BTreeTupleGetNAtts(itup, rel);
+
+		itupdesc->natts = Min(indnkeyatts, natts);
+		memset(&isnull, 0xFF, sizeof(isnull));
+		index_deform_tuple(itup, itupdesc, datvalues, isnull);
+		rel->rd_index->indnkeyatts = natts;
+		datacstring = BuildIndexValueDescription(rel, datvalues, isnull);
+		itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+		rel->rd_index->indnkeyatts = indnkeyatts;
 	}
+	else
+	{
+		dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+
+		/*
+		 * Make sure that "data" column does not include posting list or pivot
+		 * tuple representation of heap TID
+		 */
+		if (BTreeTupleIsPosting(itup))
+			dlen -= IndexTupleSize(itup) - BTreeTupleGetPostingOffset(itup);
+		else if (BTreeTupleIsPivot(itup) && BTreeTupleGetHeapTID(itup) != NULL)
+			dlen -= MAXALIGN(sizeof(ItemPointerData));
+
+		dump = palloc0(dlen * 3 + 1);
+		datacstring = dump;
+		for (off = 0; off < dlen; off++)
+		{
+			if (off > 0)
+				*dump++ = ' ';
+			sprintf(dump, "%02x", *(ptr + off) & 0xff);
+			dump += 2;
+		}
+	}
+
 	values[j++] = CStringGetTextDatum(datacstring);
 	pfree(datacstring);
 
@@ -437,11 +462,11 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = rel;
 		uargs->page = palloc(BLCKSZ);
 		memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
 
 		UnlockReleaseBuffer(buffer);
-		relation_close(rel, AccessShareLock);
 
 		uargs->offset = FirstOffsetNumber;
 
@@ -475,6 +500,7 @@ bt_page_items(PG_FUNCTION_ARGS)
 	}
 	else
 	{
+		relation_close(uargs->rel, AccessShareLock);
 		pfree(uargs->page);
 		pfree(uargs);
 		SRF_RETURN_DONE(fctx);
@@ -522,6 +548,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = NULL;
 		uargs->page = VARDATA(raw_page);
 
 		uargs->offset = FirstOffsetNumber;
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 92d5c59654..fc6794ef65 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -41,7 +41,7 @@ ctid       | (0,1)
 itemlen    | 16
 nulls      | f
 vars       | f
-data       | 01 00 00 00 00 00 00 01
+data       | (a)=(72057594037927937)
 dead       | f
 htid       | (0,1)
 tids       | 
-- 
2.17.1

#129

Peter Geoghegan

pg@bowt.ie

almost 6 years ago

In reply to: Peter Geoghegan (#128)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Fri, Jan 10, 2020 at 1:36 PM Peter Geoghegan <pg@bowt.ie> wrote:

* HEIKKI: Do we only generate one posting list in one WAL record? I
would assume it's better to deduplicate everything on the page, since
we're modifying it anyway.

Still thinking about this one.

* HEIKKI: Does xl_btree_vacuum WAL record store a whole copy of the
remaining posting list on an updated tuple? Wouldn't it be simpler and
more space-efficient to store just the deleted TIDs?

This probably makes sense. The btreevacuumposting() code that
generates "updated" index tuples (tuples that VACUUM uses to replace
existing ones when some but not all of the TIDs need to be removed)
was derived from GIN's ginVacuumItemPointers(). That approach works
well enough, but we can do better now. It shouldn't be that hard.

My preferred approach is a little different to your suggested approach
of storing the deleted TIDs directly. I would like to make it work by
storing an array of uint16 offsets into a posting list, one array per
"updated" tuple (with one offset per deleted TID within each array).
These arrays (which must include an array size indicator at the start)
can appear in the xl_btree_vacuum record, at the same place the patch
currently puts a raw IndexTuple. They'd be equivalent to a raw
IndexTuple -- the REDO routine would reconstruct the same raw posting
list tuple on its own. This approach seems simpler, and is clearly
very space efficient.

This approach is similar to the approach used by REDO routines to
handle posting list splits. Posting list splits must call
_bt_swap_posting() on the primary, while the corresponding REDO
routines also call _bt_swap_posting(). For space efficient "updates",
we'd have to invent a sibling utility function -- we could call it
_bt_delete_posting(), and put it next to _bt_swap_posting() within
nbtdedup.c.

How do you feel about that approach? (And how do you feel about the
existing "REDO routines call _bt_swap_posting()" business that it's
based on?)

* HEIKKI: Would it be more clear to have a separate struct for the
posting list split case? (i.e. don't reuse xl_btree_insert)

I've concluded that this one probably isn't worthwhile. We'd have to
carry a totally separate record on the stack within _bt_insertonpg().
If you feel strongly about it, I will reconsider.

--
Peter Geoghegan

#130

Peter Geoghegan

pg@bowt.ie

almost 6 years ago

In reply to: Peter Geoghegan (#128)

5 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Fri, Jan 10, 2020 at 1:36 PM Peter Geoghegan <pg@bowt.ie> wrote:

Still, v29 doesn't resolve the following points you've raised, where I
haven't reached a final opinion on what to do myself. These items are
as follows (I'm quoting your modified patch file sent on January 8th
here):

Still no progress on these items, but I am now posting v30. A new
version seems warranted, because I now want to revive a patch from a
couple of years back as part of the deduplication project -- it would
be good to get feedback on that sooner rather than later. This is a
patch that you [Heikki] are already familiar with -- the patch to
speed up compactify_tuples() [1]https://commitfest.postgresql.org/14/1138/ -- Peter Geoghegan. Sokolov Yura is CC'd here, since he
is the original author.

The deduplication patch is much faster with this in place. For
example, with v30:

pg@regression:5432 [25216]=# create unlogged table foo(bar int4);
CREATE TABLE
pg@regression:5432 [25216]=# create index unlogged_foo_idx on foo(bar);
CREATE INDEX
pg@regression:5432 [25216]=# insert into foo select g from
generate_series(1, 1000000) g, generate_series(1,10) i;
INSERT 0 10000000
Time: 17842.455 ms (00:17.842)

If I revert the "Bucket sort for compactify_tuples" commit locally,
then the same insert statement takes 31.614 seconds! In other words,
the insert statement is made ~77% faster by that commit alone. The
improvement is stable and reproducible.

Clearly there is a big compactify_tuples() bottleneck that comes from
PageIndexMultiDelete(). The hot spot is quite visible with "perf top
-e branch-misses".

The compactify_tuples() patch stalled because it wasn't clear if it
was worth the trouble at the time. It was originally written to
address a much smaller PageRepairFragmentation() bottleneck in heap
pruning. ISTM that deduplication alone is a good enough reason to
commit this patch. I haven't really changed anything about the
2017/2018 patch -- I need to do more review of that. We probably don't
need to qsort() inlining stuff (the bucket sort thing is the real
win), but I included it in v30 all the same.

Other changes in v30:

* We now avoid extra _bt_compare() calls within _bt_check_unique() --
no need to call _bt_compare() once per TID (once per equal tuple is
quite enough).

This is a noticeable performance win, even though the change was
originally intended to make the logic in _bt_check_unique() clearer.

* Reduced the limit on the size of a posting list tuple to 1/6 of a
page -- down from 1/3.

This seems like a good idea on the grounds that it keeps our options
open if we split a page full of duplicates due to UPDATEs rather than
INSERTs (i.e. we split a page full of duplicates that isn't also the
rightmost page among pages that store only those duplicates). A lower
limit is more conservative, and yet doesn't cost us that much space.

* Refined nbtsort.c/CREATE INDEX to work sensibly with non-standard
fillfactor settings.

This last item is a minor bugfix, really.

[1]: https://commitfest.postgresql.org/14/1138/ -- Peter Geoghegan
--
Peter Geoghegan

Attachments:

v30-0005-DEBUG-Show-index-values-in-pageinspect.patchapplication/octet-stream; name=v30-0005-DEBUG-Show-index-values-in-pageinspect.patchDownload

From 52c62ce2dbe65a7498cd8128cc162a4d58312a95 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Nov 2019 19:35:30 -0800
Subject: [PATCH v30 5/5] DEBUG: Show index values in pageinspect

This is not intended for commit.  It is included as a convenience for
reviewers.
---
 contrib/pageinspect/btreefuncs.c       | 65 ++++++++++++++++++--------
 contrib/pageinspect/expected/btree.out |  2 +-
 2 files changed, 47 insertions(+), 20 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 1b2ea14122..fc1252455d 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -27,6 +27,7 @@
 
 #include "postgres.h"
 
+#include "access/genam.h"
 #include "access/nbtree.h"
 #include "access/relation.h"
 #include "catalog/namespace.h"
@@ -245,6 +246,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
  */
 struct user_args
 {
+	Relation	rel;
 	Page		page;
 	OffsetNumber offset;
 	bool		leafpage;
@@ -261,6 +263,7 @@ struct user_args
 static Datum
 bt_page_print_tuples(FuncCallContext *fctx, struct user_args *uargs)
 {
+	Relation	rel = uargs->rel;
 	Page		page = uargs->page;
 	OffsetNumber offset = uargs->offset;
 	bool		leafpage = uargs->leafpage;
@@ -295,26 +298,48 @@ bt_page_print_tuples(FuncCallContext *fctx, struct user_args *uargs)
 	values[j++] = BoolGetDatum(IndexTupleHasVarwidths(itup));
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
-	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
-
-	/*
-	 * Make sure that "data" column does not include posting list or pivot
-	 * tuple representation of heap TID
-	 */
-	if (BTreeTupleIsPosting(itup))
-		dlen -= IndexTupleSize(itup) - BTreeTupleGetPostingOffset(itup);
-	else if (BTreeTupleIsPivot(itup) && BTreeTupleGetHeapTID(itup) != NULL)
-		dlen -= MAXALIGN(sizeof(ItemPointerData));
-
-	dump = palloc0(dlen * 3 + 1);
-	datacstring = dump;
-	for (off = 0; off < dlen; off++)
+	if (rel)
 	{
-		if (off > 0)
-			*dump++ = ' ';
-		sprintf(dump, "%02x", *(ptr + off) & 0xff);
-		dump += 2;
+		TupleDesc	itupdesc = RelationGetDescr(rel);
+		Datum		datvalues[INDEX_MAX_KEYS];
+		bool		isnull[INDEX_MAX_KEYS];
+		int			natts;
+		int			indnkeyatts = rel->rd_index->indnkeyatts;
+
+		natts = BTreeTupleGetNAtts(itup, rel);
+
+		itupdesc->natts = Min(indnkeyatts, natts);
+		memset(&isnull, 0xFF, sizeof(isnull));
+		index_deform_tuple(itup, itupdesc, datvalues, isnull);
+		rel->rd_index->indnkeyatts = natts;
+		datacstring = BuildIndexValueDescription(rel, datvalues, isnull);
+		itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+		rel->rd_index->indnkeyatts = indnkeyatts;
 	}
+	else
+	{
+		dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+
+		/*
+		 * Make sure that "data" column does not include posting list or pivot
+		 * tuple representation of heap TID
+		 */
+		if (BTreeTupleIsPosting(itup))
+			dlen -= IndexTupleSize(itup) - BTreeTupleGetPostingOffset(itup);
+		else if (BTreeTupleIsPivot(itup) && BTreeTupleGetHeapTID(itup) != NULL)
+			dlen -= MAXALIGN(sizeof(ItemPointerData));
+
+		dump = palloc0(dlen * 3 + 1);
+		datacstring = dump;
+		for (off = 0; off < dlen; off++)
+		{
+			if (off > 0)
+				*dump++ = ' ';
+			sprintf(dump, "%02x", *(ptr + off) & 0xff);
+			dump += 2;
+		}
+	}
+
 	values[j++] = CStringGetTextDatum(datacstring);
 	pfree(datacstring);
 
@@ -437,11 +462,11 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = rel;
 		uargs->page = palloc(BLCKSZ);
 		memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
 
 		UnlockReleaseBuffer(buffer);
-		relation_close(rel, AccessShareLock);
 
 		uargs->offset = FirstOffsetNumber;
 
@@ -475,6 +500,7 @@ bt_page_items(PG_FUNCTION_ARGS)
 	}
 	else
 	{
+		relation_close(uargs->rel, AccessShareLock);
 		pfree(uargs->page);
 		pfree(uargs);
 		SRF_RETURN_DONE(fctx);
@@ -522,6 +548,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = NULL;
 		uargs->page = VARDATA(raw_page);
 
 		uargs->offset = FirstOffsetNumber;
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 92d5c59654..fc6794ef65 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -41,7 +41,7 @@ ctid       | (0,1)
 itemlen    | 16
 nulls      | f
 vars       | f
-data       | 01 00 00 00 00 00 00 01
+data       | (a)=(72057594037927937)
 dead       | f
 htid       | (0,1)
 tids       | 
-- 
2.17.1

v30-0004-Teach-pageinspect-about-nbtree-posting-lists.patchapplication/octet-stream; name=v30-0004-Teach-pageinspect-about-nbtree-posting-lists.patchDownload

From b369cf5e2a91ef01b2e7b6325350a6c113bf5e71 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v30 4/5] Teach pageinspect about nbtree posting lists.

Add a column for posting list TIDs to bt_page_items().  Also add a
column that displays a single heap TID value for each tuple, regardless
of whether or not "ctid" is used for heap TID.  In the case of posting
list tuples, the value is the lowest heap TID in the posting list.
Arguably I should have done this when commit dd299df8 went in, since
that added a pivot tuple representation that could have a heap TID but
didn't use ctid for that purpose.

Also add a boolean column that displays the LP_DEAD bit value for each
non-pivot tuple.

No version bump for the pageinspect extension, since there hasn't been a
stable release since the last version bump (see commit 58b4cb30).
---
 contrib/pageinspect/btreefuncs.c              | 118 +++++++++++++++---
 contrib/pageinspect/expected/btree.out        |   7 ++
 contrib/pageinspect/pageinspect--1.7--1.8.sql |  53 ++++++++
 doc/src/sgml/pageinspect.sgml                 |  83 ++++++------
 4 files changed, 206 insertions(+), 55 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 78cdc69ec7..1b2ea14122 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -31,9 +31,11 @@
 #include "access/relation.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_am.h"
+#include "catalog/pg_type.h"
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "pageinspect.h"
+#include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
 #include "utils/varlena.h"
@@ -45,6 +47,8 @@ PG_FUNCTION_INFO_V1(bt_page_stats);
 
 #define IS_INDEX(r) ((r)->rd_rel->relkind == RELKIND_INDEX)
 #define IS_BTREE(r) ((r)->rd_rel->relam == BTREE_AM_OID)
+#define DatumGetItemPointer(X)	 ((ItemPointer) DatumGetPointer(X))
+#define ItemPointerGetDatum(X)	 PointerGetDatum(X)
 
 /* note: BlockNumber is unsigned, hence can't be negative */
 #define CHECK_RELATION_BLOCK_RANGE(rel, blkno) { \
@@ -243,6 +247,9 @@ struct user_args
 {
 	Page		page;
 	OffsetNumber offset;
+	bool		leafpage;
+	bool		rightmost;
+	TupleDesc	tupd;
 };
 
 /*-------------------------------------------------------
@@ -252,17 +259,25 @@ struct user_args
  * ------------------------------------------------------
  */
 static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, struct user_args *uargs)
 {
-	char	   *values[6];
+	Page		page = uargs->page;
+	OffsetNumber offset = uargs->offset;
+	bool		leafpage = uargs->leafpage;
+	bool		rightmost = uargs->rightmost;
+	bool		pivotoffset;
+	Datum		values[9];
+	bool		nulls[9];
 	HeapTuple	tuple;
 	ItemId		id;
 	IndexTuple	itup;
 	int			j;
 	int			off;
 	int			dlen;
-	char	   *dump;
+	char	   *dump,
+			   *datacstring;
 	char	   *ptr;
+	ItemPointer htid;
 
 	id = PageGetItemId(page, offset);
 
@@ -272,18 +287,27 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	itup = (IndexTuple) PageGetItem(page, id);
 
 	j = 0;
-	values[j++] = psprintf("%d", offset);
-	values[j++] = psprintf("(%u,%u)",
-						   ItemPointerGetBlockNumberNoCheck(&itup->t_tid),
-						   ItemPointerGetOffsetNumberNoCheck(&itup->t_tid));
-	values[j++] = psprintf("%d", (int) IndexTupleSize(itup));
-	values[j++] = psprintf("%c", IndexTupleHasNulls(itup) ? 't' : 'f');
-	values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
+	memset(nulls, 0, sizeof(nulls));
+	values[j++] = DatumGetInt16(offset);
+	values[j++] = ItemPointerGetDatum(&itup->t_tid);
+	values[j++] = Int32GetDatum((int) IndexTupleSize(itup));
+	values[j++] = BoolGetDatum(IndexTupleHasNulls(itup));
+	values[j++] = BoolGetDatum(IndexTupleHasVarwidths(itup));
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
 	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+
+	/*
+	 * Make sure that "data" column does not include posting list or pivot
+	 * tuple representation of heap TID
+	 */
+	if (BTreeTupleIsPosting(itup))
+		dlen -= IndexTupleSize(itup) - BTreeTupleGetPostingOffset(itup);
+	else if (BTreeTupleIsPivot(itup) && BTreeTupleGetHeapTID(itup) != NULL)
+		dlen -= MAXALIGN(sizeof(ItemPointerData));
+
 	dump = palloc0(dlen * 3 + 1);
-	values[j] = dump;
+	datacstring = dump;
 	for (off = 0; off < dlen; off++)
 	{
 		if (off > 0)
@@ -291,8 +315,57 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 		sprintf(dump, "%02x", *(ptr + off) & 0xff);
 		dump += 2;
 	}
+	values[j++] = CStringGetTextDatum(datacstring);
+	pfree(datacstring);
 
-	tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
+	/*
+	 * Avoid indicating that pivot tuple from !heapkeyspace index (which won't
+	 * have v4+ status bit set) is dead or has a heap TID -- that can only
+	 * happen with non-pivot tuples.  (Most backend code can use the
+	 * heapkeyspace field from the metapage to figure out which representation
+	 * to expect, but we have to be a bit creative here.)
+	 */
+	pivotoffset = (!leafpage || (!rightmost && offset == P_HIKEY));
+
+	/* LP_DEAD status bit */
+	if (!pivotoffset)
+		values[j++] = BoolGetDatum(ItemIdIsDead(id));
+	else
+		nulls[j++] = true;
+
+	htid = BTreeTupleGetHeapTID(itup);
+	if (pivotoffset && !BTreeTupleIsPivot(itup))
+		htid = NULL;
+
+	if (htid)
+		values[j++] = ItemPointerGetDatum(htid);
+	else
+		nulls[j++] = true;
+
+	if (BTreeTupleIsPosting(itup))
+	{
+		/* build an array of item pointers */
+		ItemPointer tids;
+		Datum	   *tids_datum;
+		int			nposting;
+
+		tids = BTreeTupleGetPosting(itup);
+		nposting = BTreeTupleGetNPosting(itup);
+		tids_datum = (Datum *) palloc(nposting * sizeof(Datum));
+		for (int i = 0; i < nposting; i++)
+			tids_datum[i] = ItemPointerGetDatum(&tids[i]);
+		values[j++] = PointerGetDatum(construct_array(tids_datum,
+													  nposting,
+													  TIDOID,
+													  sizeof(ItemPointerData),
+													  false, 's'));
+		pfree(tids_datum);
+	}
+	else
+		nulls[j++] = true;
+
+	/* Build and return the result tuple */
+	tuple = heap_form_tuple(uargs->tupd, values, nulls);
 
 	return HeapTupleGetDatum(tuple);
 }
@@ -378,12 +451,13 @@ bt_page_items(PG_FUNCTION_ARGS)
 			elog(NOTICE, "page is deleted");
 
 		fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+		uargs->leafpage = P_ISLEAF(opaque);
+		uargs->rightmost = P_RIGHTMOST(opaque);
 
 		/* Build a tuple descriptor for our result type */
 		if (get_call_result_type(fcinfo, NULL, &tupleDesc) != TYPEFUNC_COMPOSITE)
 			elog(ERROR, "return type must be a row type");
-
-		fctx->attinmeta = TupleDescGetAttInMetadata(tupleDesc);
+		uargs->tupd = tupleDesc;
 
 		fctx->user_fctx = uargs;
 
@@ -395,7 +469,7 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
@@ -463,12 +537,13 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 			elog(NOTICE, "page is deleted");
 
 		fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+		uargs->leafpage = P_ISLEAF(opaque);
+		uargs->rightmost = P_RIGHTMOST(opaque);
 
 		/* Build a tuple descriptor for our result type */
 		if (get_call_result_type(fcinfo, NULL, &tupleDesc) != TYPEFUNC_COMPOSITE)
 			elog(ERROR, "return type must be a row type");
-
-		fctx->attinmeta = TupleDescGetAttInMetadata(tupleDesc);
+		uargs->tupd = tupleDesc;
 
 		fctx->user_fctx = uargs;
 
@@ -480,7 +555,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
@@ -510,7 +585,7 @@ bt_metap(PG_FUNCTION_ARGS)
 	BTMetaPageData *metad;
 	TupleDesc	tupleDesc;
 	int			j;
-	char	   *values[8];
+	char	   *values[9];
 	Buffer		buffer;
 	Page		page;
 	HeapTuple	tuple;
@@ -557,17 +632,20 @@ bt_metap(PG_FUNCTION_ARGS)
 
 	/*
 	 * Get values of extended metadata if available, use default values
-	 * otherwise.
+	 * otherwise.  Note that we rely on the assumption that btm_safededup is
+	 * initialized to zero on databases that were initdb'd before Postgres 13.
 	 */
 	if (metad->btm_version >= BTREE_NOVAC_VERSION)
 	{
 		values[j++] = psprintf("%u", metad->btm_oldest_btpo_xact);
 		values[j++] = psprintf("%f", metad->btm_last_cleanup_num_heap_tuples);
+		values[j++] = metad->btm_safededup ? "t" : "f";
 	}
 	else
 	{
 		values[j++] = "0";
 		values[j++] = "-1";
+		values[j++] = "f";
 	}
 
 	tuple = BuildTupleFromCStrings(TupleDescGetAttInMetadata(tupleDesc),
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..92d5c59654 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -12,6 +12,7 @@ fastroot                | 1
 fastlevel               | 0
 oldest_xact             | 0
 last_cleanup_num_tuples | -1
+safededup               | t
 
 SELECT * FROM bt_page_stats('test1_a_idx', 0);
 ERROR:  block 0 is a meta page
@@ -41,6 +42,9 @@ itemlen    | 16
 nulls      | f
 vars       | f
 data       | 01 00 00 00 00 00 00 01
+dead       | f
+htid       | (0,1)
+tids       | 
 
 SELECT * FROM bt_page_items('test1_a_idx', 2);
 ERROR:  block number out of range
@@ -54,6 +58,9 @@ itemlen    | 16
 nulls      | f
 vars       | f
 data       | 01 00 00 00 00 00 00 01
+dead       | f
+htid       | (0,1)
+tids       | 
 
 SELECT * FROM bt_page_items(get_raw_page('test1_a_idx', 2));
 ERROR:  block number 2 is out of range for relation "test1_a_idx"
diff --git a/contrib/pageinspect/pageinspect--1.7--1.8.sql b/contrib/pageinspect/pageinspect--1.7--1.8.sql
index 2a7c4b3516..93ea37cde3 100644
--- a/contrib/pageinspect/pageinspect--1.7--1.8.sql
+++ b/contrib/pageinspect/pageinspect--1.7--1.8.sql
@@ -14,3 +14,56 @@ CREATE FUNCTION heap_tuple_infomask_flags(
 RETURNS record
 AS 'MODULE_PATHNAME', 'heap_tuple_infomask_flags'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_metap()
+--
+DROP FUNCTION bt_metap(text);
+CREATE FUNCTION bt_metap(IN relname text,
+    OUT magic int4,
+    OUT version int4,
+    OUT root int4,
+    OUT level int4,
+    OUT fastroot int4,
+    OUT fastlevel int4,
+    OUT oldest_xact int4,
+    OUT last_cleanup_num_tuples real,
+    OUT safededup boolean)
+AS 'MODULE_PATHNAME', 'bt_metap'
+LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items(text, int4)
+--
+DROP FUNCTION bt_page_items(text, int4);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT dead boolean,
+    OUT htid tid,
+    OUT tids tid[])
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items(bytea)
+--
+DROP FUNCTION bt_page_items(bytea);
+CREATE FUNCTION bt_page_items(IN page bytea,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT dead boolean,
+    OUT htid tid,
+    OUT tids tid[])
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items_bytea'
+LANGUAGE C STRICT PARALLEL SAFE;
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 7e2e1487d7..b527daf6ca 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -300,13 +300,14 @@ test=# SELECT t_ctid, raw_flags, combined_flags
 test=# SELECT * FROM bt_metap('pg_cast_oid_index');
 -[ RECORD 1 ]-----------+-------
 magic                   | 340322
-version                 | 3
+version                 | 4
 root                    | 1
 level                   | 0
 fastroot                | 1
 fastlevel               | 0
 oldest_xact             | 582
 last_cleanup_num_tuples | 1000
+safededup               | f
 </screen>
      </para>
     </listitem>
@@ -329,11 +330,11 @@ test=# SELECT * FROM bt_page_stats('pg_cast_oid_index', 1);
 -[ RECORD 1 ]-+-----
 blkno         | 1
 type          | l
-live_items    | 256
+live_items    | 224
 dead_items    | 0
-avg_item_size | 12
+avg_item_size | 16
 page_size     | 8192
-free_size     | 4056
+free_size     | 3668
 btpo_prev     | 0
 btpo_next     | 0
 btpo          | 0
@@ -356,33 +357,45 @@ btpo_flags    | 3
       <function>bt_page_items</function> returns detailed information about
       all of the items on a B-tree index page.  For example:
 <screen>
-test=# SELECT * FROM bt_page_items('pg_cast_oid_index', 1);
- itemoffset |  ctid   | itemlen | nulls | vars |    data
-------------+---------+---------+-------+------+-------------
-          1 | (0,1)   |      12 | f     | f    | 23 27 00 00
-          2 | (0,2)   |      12 | f     | f    | 24 27 00 00
-          3 | (0,3)   |      12 | f     | f    | 25 27 00 00
-          4 | (0,4)   |      12 | f     | f    | 26 27 00 00
-          5 | (0,5)   |      12 | f     | f    | 27 27 00 00
-          6 | (0,6)   |      12 | f     | f    | 28 27 00 00
-          7 | (0,7)   |      12 | f     | f    | 29 27 00 00
-          8 | (0,8)   |      12 | f     | f    | 2a 27 00 00
+regression=# SELECT * FROM bt_page_items('tenk2_unique1', 5);
+ itemoffset |   ctid   | itemlen | nulls | vars |          data           | dead |   htid   | tids
+------------+----------+---------+-------+------+-------------------------+------+----------+------
+          1 | (40,1)   |      16 | f     | f    | b8 05 00 00 00 00 00 00 |      |          |
+          2 | (58,11)  |      16 | f     | f    | 4a 04 00 00 00 00 00 00 | f    | (58,11)  |
+          3 | (266,4)  |      16 | f     | f    | 4b 04 00 00 00 00 00 00 | f    | (266,4)  |
+          4 | (279,25) |      16 | f     | f    | 4c 04 00 00 00 00 00 00 | f    | (279,25) |
+          5 | (333,11) |      16 | f     | f    | 4d 04 00 00 00 00 00 00 | f    | (333,11) |
+          6 | (87,24)  |      16 | f     | f    | 4e 04 00 00 00 00 00 00 | f    | (87,24)  |
+          7 | (38,22)  |      16 | f     | f    | 4f 04 00 00 00 00 00 00 | f    | (38,22)  |
+          8 | (272,17) |      16 | f     | f    | 50 04 00 00 00 00 00 00 | f    | (272,17) |
 </screen>
-      In a B-tree leaf page, <structfield>ctid</structfield> points to a heap tuple.
-      In an internal page, the block number part of <structfield>ctid</structfield>
-      points to another page in the index itself, while the offset part
-      (the second number) is ignored and is usually 1.
+      In a B-tree leaf page, <structfield>ctid</structfield> usually
+      points to a heap tuple, and <structfield>dead</structfield> may
+      indicate that the item has its <literal>LP_DEAD</literal> bit
+      set.  In an internal page, the block number part of
+      <structfield>ctid</structfield> points to another page in the
+      index itself, while the offset part (the second number) encodes
+      metadata about the tuple.  Posting list tuples on leaf pages
+      also use <structfield>ctid</structfield> for metadata.
+      <structfield>htid</structfield> always shows a single heap TID
+      for the tuple, regardless of how it is represented (internal
+      page tuples may need to store a heap TID when there are many
+      duplicate tuples on descendent leaf pages).
+      <structfield>tids</structfield> is a list of TIDs that is stored
+      within posting list tuples (tuples created by deduplication).
      </para>
      <para>
       Note that the first item on any non-rightmost page (any page with
       a non-zero value in the <structfield>btpo_next</structfield> field) is the
       page's <quote>high key</quote>, meaning its <structfield>data</structfield>
       serves as an upper bound on all items appearing on the page, while
-      its <structfield>ctid</structfield> field is meaningless.  Also, on non-leaf
-      pages, the first real data item (the first item that is not a high
-      key) is a <quote>minus infinity</quote> item, with no actual value
-      in its <structfield>data</structfield> field.  Such an item does have a valid
-      downlink in its <structfield>ctid</structfield> field, however.
+      its <structfield>ctid</structfield> field does not point to
+      another block.  Also, on non-leaf pages, the first real data item
+      (the first item that is not a high key) is a <quote>minus
+      infinity</quote> item, with no actual value in its
+      <structfield>data</structfield> field.  Such an item does have a
+      valid downlink in its <structfield>ctid</structfield> field,
+      however.
      </para>
     </listitem>
    </varlistentry>
@@ -402,17 +415,17 @@ test=# SELECT * FROM bt_page_items('pg_cast_oid_index', 1);
       with <function>get_raw_page</function> should be passed as argument.  So
       the last example could also be rewritten like this:
 <screen>
-test=# SELECT * FROM bt_page_items(get_raw_page('pg_cast_oid_index', 1));
- itemoffset |  ctid   | itemlen | nulls | vars |    data
-------------+---------+---------+-------+------+-------------
-          1 | (0,1)   |      12 | f     | f    | 23 27 00 00
-          2 | (0,2)   |      12 | f     | f    | 24 27 00 00
-          3 | (0,3)   |      12 | f     | f    | 25 27 00 00
-          4 | (0,4)   |      12 | f     | f    | 26 27 00 00
-          5 | (0,5)   |      12 | f     | f    | 27 27 00 00
-          6 | (0,6)   |      12 | f     | f    | 28 27 00 00
-          7 | (0,7)   |      12 | f     | f    | 29 27 00 00
-          8 | (0,8)   |      12 | f     | f    | 2a 27 00 00
+regression=# SELECT * FROM bt_page_items(get_raw_page('tenk2_unique1', 5));
+ itemoffset |   ctid   | itemlen | nulls | vars |          data           | dead |   htid   | tids
+------------+----------+---------+-------+------+-------------------------+------+----------+------
+          1 | (40,1)   |      16 | f     | f    | b8 05 00 00 00 00 00 00 |      |          |
+          2 | (58,11)  |      16 | f     | f    | 4a 04 00 00 00 00 00 00 | f    | (58,11)  |
+          3 | (266,4)  |      16 | f     | f    | 4b 04 00 00 00 00 00 00 | f    | (266,4)  |
+          4 | (279,25) |      16 | f     | f    | 4c 04 00 00 00 00 00 00 | f    | (279,25) |
+          5 | (333,11) |      16 | f     | f    | 4d 04 00 00 00 00 00 00 | f    | (333,11) |
+          6 | (87,24)  |      16 | f     | f    | 4e 04 00 00 00 00 00 00 | f    | (87,24)  |
+          7 | (38,22)  |      16 | f     | f    | 4f 04 00 00 00 00 00 00 | f    | (38,22)  |
+          8 | (272,17) |      16 | f     | f    | 50 04 00 00 00 00 00 00 | f    | (272,17) |
 </screen>
       All the other details are the same as explained in the previous item.
      </para>
-- 
2.17.1

v30-0003-Bucket-sort-for-compactify_tuples.patchapplication/octet-stream; name=v30-0003-Bucket-sort-for-compactify_tuples.patchDownload

From 64c03b2040eefa8164447534bf1c3dd5d5cbb320 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 14 Jan 2020 15:27:54 -0800
Subject: [PATCH v30 3/5] Bucket sort for compactify_tuples.

Original commit message from Sokolov Yura:

This patch implements bucket sort for compactify_tuples:
- one pass of stable prefix sort on high 8 bits of offset
- and insertion sort for buckets larger than 1 element

This approach allows to save 3% of cpu in degenerate case
(highly intensive HOT random updates on unlogged table with
 synchronized_commit=off), and speeds up WAL replaying (as were
found by Heikki Linnakangas).

Same approach were implemented by Heikki Linnakangas some time ago with
several distinctions.

This patch was retrieved from:
https://postgr.es/m/CAL-rCA2n7UfVu1Ui0f%2B7cVN4vAKVM0%2B-cZKb_ka6-mGQBAF92w%40mail.gmail.com

CF Entry for patch: https://commitfest.postgresql.org/14/1138/

Related thread: https://postgr.es/m/546B89DE.7030906%40vmware.com
---
 src/backend/storage/page/bufpage.c | 80 +++++++++++++++++++++++++++++-
 1 file changed, 78 insertions(+), 2 deletions(-)

diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 32ff03b3e4..0081dd921b 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -436,6 +436,79 @@ itemoffcompare(const void *itemidp1, const void *itemidp2)
 		((itemIdSort) itemidp1)->itemoff;
 }
 
+#define QS_SUFFIX itemIds
+#define QS_TYPE itemIdSortData
+#define QS_SCOPE static
+#define QS_CMP itemoffcompare
+#define QS_DEFINE
+#define QS_SKIP_MED3
+#include "lib/qsort_template.h"
+
+/*
+ * Sort an array of itemIdSort's on itemoff, descending.
+ *
+ * It uses bucket sort:
+ * - single pass of stable prefix sort on high 8 bits
+ * - and insertion sort on buckets larger than 1 element
+ */
+static void
+bucketsort_itemIds(itemIdSort itemidbase, int nitems)
+{
+	itemIdSortData copy[Max(MaxIndexTuplesPerPage, MaxHeapTuplesPerPage)];
+#define NSPLIT 256
+#define PREFDIV (BLCKSZ / NSPLIT)
+	/* two extra elements to emulate offset on previous step */
+	uint16		count[NSPLIT + 2] = {0};
+	int			i,
+				max,
+				total,
+				pos,
+				highbits;
+
+	Assert(nitems <= MaxIndexTuplesPerPage);
+	for (i = 0; i < nitems; i++)
+	{
+		highbits = itemidbase[i].itemoff / PREFDIV;
+		count[highbits]++;
+	}
+	/* sort in decreasing order */
+	max = total = count[NSPLIT - 1];
+	for (i = NSPLIT - 2; i >= 0; i--)
+	{
+		max |= count[i];
+		total += count[i];
+		count[i] = total;
+	}
+
+	/*
+	 * count[k+1] is start of bucket k, count[k] is end of bucket k, and
+	 * count[k] - count[k+1] is length of bucket k.
+	 */
+	Assert(count[0] == nitems);
+	for (i = 0; i < nitems; i++)
+	{
+		highbits = itemidbase[i].itemoff / PREFDIV;
+		pos = count[highbits + 1];
+		count[highbits + 1]++;
+		copy[pos] = itemidbase[i];
+	}
+	Assert(count[1] == nitems);
+
+	if (max > 1)
+	{
+		/*
+		 * count[k+2] is start of bucket k, count[k+1] is end of bucket k, and
+		 * count[k+1]-count[k+2] is length of bucket k.
+		 */
+		for (i = NSPLIT; i > 0; i--)
+		{
+			insertion_sort_itemIds(copy + count[i + 1], count[i] - count[i + 1]);
+		}
+	}
+
+	memcpy(itemidbase, copy, sizeof(itemIdSortData) * nitems);
+}
+
 /*
  * After removing or marking some line pointers unused, move the tuples to
  * remove the gaps caused by the removed items.
@@ -448,8 +521,11 @@ compactify_tuples(itemIdSort itemidbase, int nitems, Page page)
 	int			i;
 
 	/* sort itemIdSortData array into decreasing itemoff order */
-	qsort((char *) itemidbase, nitems, sizeof(itemIdSortData),
-		  itemoffcompare);
+	if (nitems > 48)
+		bucketsort_itemIds(itemidbase, nitems);
+	else
+		qsort_itemIds(itemidbase, nitems);
+
 
 	upper = phdr->pd_special;
 	for (i = 0; i < nitems; i++)
-- 
2.17.1

v30-0001-Add-deduplication-to-nbtree.patchapplication/octet-stream; name=v30-0001-Add-deduplication-to-nbtree.patchDownload

From 5d9713c7820dba084e036aa8c00cf9273a57ff38 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 25 Sep 2019 10:08:53 -0700
Subject: [PATCH v30 1/5] Add deduplication to nbtree.

Deduplication reduces the storage overhead of duplicates in indexes that
use the standard nbtree index access method.  The deduplication process
is applied lazily, after the point where opportunistic deletion of
LP_DEAD-marked index tuples occurs.  Deduplication is only applied at
the point where a leaf page split would otherwise be required.  New
"posting list tuples" are formed by merging together existing duplicate
tuples.  The physical representation of the items on an nbtree leaf page
is made more space efficient by deduplication, but the logical contents
of the page are not changed.

Testing has shown that nbtree deduplication can generally make indexes
with about 10 or 15 tuples for each distinct key value about 2.5X - 4X
smaller, even with single column integer indexes (e.g., an index on a
referencing column that accompanies a foreign key).  The final size of
single column nbtree indexes comes close to the final size of a similar
contrib/btree_gin index, at least in cases where GIN's posting list
compression isn't very effective.

The lazy approach taken by nbtree has significant advantages over a
GIN style eager approach.  Most individual inserts of index tuples have
exactly the same overhead as before.  The extra overhead of
deduplication is amortized across insertions, just like the overhead of
page splits.  The key space of indexes works in the same way as it has
since commit dd299df8 (the commit which made heap TID a tiebreaker
column), since only the physical representation of tuples is changed.
Furthermore, deduplication can easily be turned on or off.  The split
point choice logic doesn't need to be changed, since posting list tuples
are just tuples with payload, much like tuples with non-key columns in
INCLUDE indexes.  (nbtsplitloc.c is still optimized to make intelligent
choices in the presence of posting list tuples, though only because
suffix truncation will routinely make new high keys far far smaller than
the non-pivot tuple they're derived from).

In general, nbtree unique indexes sometimes need to store multiple equal
(non-NULL) tuples for the same logical row (one per physical row
version).  Unique indexes can use deduplication specifically to merge
together multiple physical versions (index tuples), though the overall
strategy used there is somewhat different.  The high-level goal with
unique indexes is to prevent "unnecessary" page splits -- splits caused
only by a short term burst of index tuple versions.  This is often a
concern with frequently updated tables where UPDATEs always modify at
least one indexed column (making it impossible for the table am to use
an optimization like heapam's heap-only tuples optimization).
Deduplication in unique indexes effectively "buys time" for existing
nbtree garbage collection mechanisms to run and prevent these page
splits (the LP_DEAD bit setting performed during the uniqueness check is
the most important mechanism for controlling bloat with affected
workloads).

Deduplication in non-unique indexes is controlled by a new GUC,
deduplicate_btree_items.  A new storage parameter (deduplicate_items) is
also added, which controls deduplication at the index relation
granularity.  It can be used to disable deduplication in unique indexes
for debugging purposes. (The general criteria for applying deduplication
in unique indexes ensures that only cases with some duplicates will
actually get a deduplication pass -- that's why unique indexes are not
affected by the deduplicate_btree_items GUC.)

Since posting list tuples have only one line pointer (just like any
other tuple), they have only one LP_DEAD bit.  The LP_DEAD bit can still
be set by both unique checking and the kill_prior_tuple optimization,
but only when all heap TIDs are dead-to-all.  This "loss of granularity"
for LP_DEAD bits is considered an acceptable downside of the
deduplication design.  We always prefer deleting LP_DEAD items to a
deduplication pass, and a deduplication pass can only take place at the
point where we'd previously have had to split the page, so any workload
that pays a cost here must also get a significant benefit.

Bump XLOG_PAGE_MAGIC because xl_btree_vacuum changed.

No bump in BTREE_VERSION, since deduplication only affects the physical
representation of tuples.  However, users must still REINDEX a
pg_upgrade'd index to before its leaf page splits will apply
deduplication.  An index build is the only way to set the new nbtree
metapage flag indicating that deduplication is generally safe.

Author: Anastasia Lubennikova, Peter Geoghegan
Reviewed-By: Peter Geoghegan, Heikki Linnakangas
Discussion:
    https://postgr.es/m/55E4051B.7020209@postgrespro.ru
    https://postgr.es/m/4ab6e2db-bcee-f4cf-0916-3a06e6ccbb55@postgrespro.ru
---
 src/include/access/nbtree.h                   | 407 +++++++--
 src/include/access/nbtxlog.h                  |  96 ++-
 src/include/access/rmgrlist.h                 |   2 +-
 src/backend/access/common/reloptions.c        |   9 +
 src/backend/access/index/genam.c              |   4 +
 src/backend/access/nbtree/Makefile            |   1 +
 src/backend/access/nbtree/README              | 151 +++-
 src/backend/access/nbtree/nbtdedup.c          | 809 ++++++++++++++++++
 src/backend/access/nbtree/nbtinsert.c         | 397 +++++++--
 src/backend/access/nbtree/nbtpage.c           | 223 ++++-
 src/backend/access/nbtree/nbtree.c            | 180 +++-
 src/backend/access/nbtree/nbtsearch.c         | 271 +++++-
 src/backend/access/nbtree/nbtsort.c           | 190 +++-
 src/backend/access/nbtree/nbtsplitloc.c       |  39 +-
 src/backend/access/nbtree/nbtutils.c          | 226 ++++-
 src/backend/access/nbtree/nbtxlog.c           | 202 ++++-
 src/backend/access/rmgrdesc/nbtdesc.c         |  23 +-
 src/backend/storage/page/bufpage.c            |   9 +-
 src/backend/utils/misc/guc.c                  |  10 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/bin/psql/tab-complete.c                   |   4 +-
 contrib/amcheck/verify_nbtree.c               | 231 ++++-
 doc/src/sgml/btree.sgml                       | 120 ++-
 doc/src/sgml/charset.sgml                     |   9 +-
 doc/src/sgml/config.sgml                      |  25 +
 doc/src/sgml/ref/create_index.sgml            |  38 +-
 src/test/regress/expected/btree_index.out     |  16 +
 src/test/regress/sql/btree_index.sql          |  17 +
 28 files changed, 3390 insertions(+), 320 deletions(-)
 create mode 100644 src/backend/access/nbtree/nbtdedup.c

diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 20ace69dab..036a9d9e97 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -24,6 +24,9 @@
 #include "storage/bufmgr.h"
 #include "storage/shm_toc.h"
 
+/* GUC parameter */
+extern bool deduplicate_btree_items;
+
 /* There's room for a 16-bit vacuum cycle ID in BTPageOpaqueData */
 typedef uint16 BTCycleId;
 
@@ -108,6 +111,7 @@ typedef struct BTMetaPageData
 										 * pages */
 	float8		btm_last_cleanup_num_heap_tuples;	/* number of heap tuples
 													 * during last cleanup */
+	bool		btm_safededup;	/* deduplication known to be safe? */
 } BTMetaPageData;
 
 #define BTPageGetMeta(p) \
@@ -124,6 +128,13 @@ typedef struct BTMetaPageData
  * need to be immediately re-indexed at pg_upgrade.  In order to get the
  * new heapkeyspace semantics, however, a REINDEX is needed.
  *
+ * Deduplication is safe to use when the btm_safededup field is set to
+ * true.  It's safe to read the btm_safededup field on version 3, but only
+ * version 4 indexes make use of deduplication.  Even version 4 indexes
+ * created on PostgreSQL v12 will need a REINDEX to make use of
+ * deduplication, though, since there is no other way to set btm_safededup
+ * to true (pg_upgrade hasn't been taught to set the metapage field).
+ *
  * Btree version 2 is mostly the same as version 3.  There are two new
  * fields in the metapage that were introduced in version 3.  A version 2
  * metapage will be automatically upgraded to version 3 on the first
@@ -156,6 +167,21 @@ typedef struct BTMetaPageData
 				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
 				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
 
+/*
+ * MaxTIDsPerBTreePage is an upper bound on the number of heap TIDs tuples
+ * that may be stored on a btree leaf page.  It is used to size the
+ * per-page temporary buffers used by index scans.)
+ *
+ * Note: we don't bother considering per-tuple overheads here to keep
+ * things simple (value is based on how many elements a single array of
+ * heap TIDs must have to fill the space between the page header and
+ * special area).  The value is slightly higher (i.e. more conservative)
+ * than necessary as a result, which is considered acceptable.
+ */
+#define MaxTIDsPerBTreePage \
+	(int) ((BLCKSZ - SizeOfPageHeaderData - sizeof(BTPageOpaqueData)) / \
+		   sizeof(ItemPointerData))
+
 /*
  * The leaf-page fillfactor defaults to 90% but is user-adjustable.
  * For pages above the leaf level, we use a fixed 70% fillfactor.
@@ -230,16 +256,15 @@ typedef struct BTMetaPageData
  * tuples (non-pivot tuples).  _bt_check_natts() enforces the rules
  * described here.
  *
- * Non-pivot tuple format:
+ * Non-pivot tuple format (plain/non-posting variant):
  *
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
- * All other types of index tuples ("pivot" tuples) only have key columns,
- * since pivot tuples only exist to represent how the key space is
+ * Non-pivot tuples complement pivot tuples, which only have key columns.
+ * The sole purpose of pivot tuples is to represent how the key space is
  * separated.  In general, any B-Tree index that has more than one level
  * (i.e. any index that does not just consist of a metapage and a single
  * leaf root page) must have some number of pivot tuples, since pivot
@@ -264,7 +289,8 @@ typedef struct BTMetaPageData
  * INDEX_ALT_TID_MASK bit is set, which doesn't count the trailing heap
  * TID column sometimes stored in pivot tuples -- that's represented by
  * the presence of BT_PIVOT_HEAP_TID_ATTR.  The INDEX_ALT_TID_MASK bit in
- * t_info is always set on BTREE_VERSION 4 pivot tuples.
+ * t_info is always set on BTREE_VERSION 4 pivot tuples, since
+ * BTreeTupleIsPivot() must work reliably on heapkeyspace versions.
  *
  * In version 3 indexes, the INDEX_ALT_TID_MASK flag might not be set in
  * pivot tuples.  In that case, the number of key columns is implicitly
@@ -279,90 +305,256 @@ typedef struct BTMetaPageData
  * The 12 least significant offset bits from t_tid are used to represent
  * the number of columns in INDEX_ALT_TID_MASK tuples, leaving 4 status
  * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
- * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
- * number of columns/attributes <= INDEX_MAX_KEYS.
+ * future use.  BT_OFFSET_MASK should be large enough to store any number
+ * of columns/attributes <= INDEX_MAX_KEYS.
+ *
+ * Sometimes non-pivot tuples also use a representation that repurposes
+ * t_tid to store metadata rather than a TID.  PostgreSQL v13 introduced a
+ * new non-pivot tuple format to support deduplication: posting list
+ * tuples.  Deduplication merges together multiple equal non-pivot tuples
+ * into a logically equivalent, space efficient representation.  A posting
+ * list is an array of ItemPointerData elements.  Non-pivot tuples are
+ * merged together to form posting list tuples lazily, at the point where
+ * we'd otherwise have to split a leaf page.
+ *
+ * Posting tuple format (alternative non-pivot tuple representation):
+ *
+ *  t_tid | t_info | key values | posting list (TID array)
+ *
+ * Posting list tuples are recognized as such by having the
+ * INDEX_ALT_TID_MASK status bit set in t_info and the BT_IS_POSTING status
+ * bit set in t_tid.  These flags redefine the content of the posting
+ * tuple's t_tid to store an offset to the posting list, as well as the
+ * total number of posting list array elements.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items present in the tuple, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.  Like any non-pivot tuple, the number of columns stored is
+ * always implicitly the total number in the index (in practice there can
+ * never be non-key columns stored, since deduplication is not supported
+ * with INCLUDE indexes).  BT_OFFSET_MASK should be large enough to store
+ * any number of posting list TIDs that might be present in a tuple (since
+ * tuple size is subject to the INDEX_SIZE_MASK limit).
  *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
-#define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_OFFSET_MASK				0x0FFF
 #define BT_PIVOT_HEAP_TID_ATTR		0x1000
-
-/* Get/set downlink block number in pivot tuple */
-#define BTreeTupleGetDownLink(itup) \
-	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
-#define BTreeTupleSetDownLink(itup, blkno) \
-	ItemPointerSetBlockNumber(&((itup)->t_tid), (blkno))
+#define BT_IS_POSTING				0x2000
 
 /*
- * Get/set leaf page highkey's link. During the second phase of deletion, the
- * target leaf page's high key may point to an ancestor page (at all other
- * times, the leaf level high key's link is not used).  See the nbtree README
- * for full details.
+ * Note: BTreeTupleIsPivot() can have false negatives (but not false
+ * positives) when used with !heapkeyspace indexes
  */
-#define BTreeTupleGetTopParent(itup) \
-	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
-#define BTreeTupleSetTopParent(itup, blkno)	\
-	do { \
-		ItemPointerSetBlockNumber(&((itup)->t_tid), (blkno)); \
-		BTreeTupleSetNAtts((itup), 0); \
-	} while(0)
+static inline bool
+BTreeTupleIsPivot(IndexTuple itup)
+{
+	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+		return false;
+	/* absence of BT_IS_POSTING in offset number indicates pivot tuple */
+	if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) & BT_IS_POSTING) != 0)
+		return false;
+
+	return true;
+}
+
+static inline bool
+BTreeTupleIsPosting(IndexTuple itup)
+{
+	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+		return false;
+	/* presence of BT_IS_POSTING in offset number indicates posting tuple */
+	if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) & BT_IS_POSTING) == 0)
+		return false;
+
+	return true;
+}
+
+static inline void
+BTreeTupleSetPosting(IndexTuple itup, int nhtids, int postingoffset)
+{
+	Assert(nhtids > 1 && (nhtids & BT_OFFSET_MASK) == nhtids);
+	Assert(postingoffset == MAXALIGN(postingoffset));
+	Assert(postingoffset < INDEX_SIZE_MASK);
+
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	ItemPointerSetOffsetNumber(&itup->t_tid, (nhtids | BT_IS_POSTING));
+	ItemPointerSetBlockNumber(&itup->t_tid, postingoffset);
+}
+
+static inline uint16
+BTreeTupleGetNPosting(IndexTuple posting)
+{
+	OffsetNumber existing;
+
+	Assert(BTreeTupleIsPosting(posting));
+
+	existing = ItemPointerGetOffsetNumberNoCheck(&posting->t_tid);
+	return (existing & BT_OFFSET_MASK);
+}
+
+static inline uint32
+BTreeTupleGetPostingOffset(IndexTuple posting)
+{
+	Assert(BTreeTupleIsPosting(posting));
+
+	return ItemPointerGetBlockNumberNoCheck(&posting->t_tid);
+}
+
+static inline ItemPointer
+BTreeTupleGetPosting(IndexTuple posting)
+{
+	return (ItemPointer) ((char *) posting +
+						  BTreeTupleGetPostingOffset(posting));
+}
+
+static inline ItemPointer
+BTreeTupleGetPostingN(IndexTuple posting, int n)
+{
+	return BTreeTupleGetPosting(posting) + n;
+}
 
 /*
- * Get/set number of attributes within B-tree index tuple.
+ * Get/set downlink block number in pivot tuple.
+ *
+ * Note: Cannot assert that tuple is a pivot tuple.  If we did so then
+ * !heapkeyspace indexes would exhibit false positive assertion failures.
+ */
+static inline BlockNumber
+BTreeTupleGetDownLink(IndexTuple pivot)
+{
+	return ItemPointerGetBlockNumberNoCheck(&pivot->t_tid);
+}
+
+static inline void
+BTreeTupleSetDownLink(IndexTuple pivot, BlockNumber blkno)
+{
+	ItemPointerSetBlockNumber(&pivot->t_tid, blkno);
+}
+
+/*
+ * Get number of attributes within tuple.
  *
  * Note that this does not include an implicit tiebreaker heap TID
  * attribute, if any.  Note also that the number of key attributes must be
  * explicitly represented in all heapkeyspace pivot tuples.
+ *
+ * Note: This is defined as a macro rather than an inline function to
+ * avoid including rel.h.
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		(BTreeTupleIsPivot(itup)) ? \
 		( \
-			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
+			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_OFFSET_MASK \
 		) \
 		: \
 		IndexRelationGetNumberOfAttributes(rel) \
 	)
-#define BTreeTupleSetNAtts(itup, n) \
-	do { \
-		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
-	} while(0)
 
 /*
- * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Set number of attributes in tuple, making it into a pivot tuple
  */
-#define BTreeTupleGetHeapTID(itup) \
-	( \
-	  (itup)->t_info & INDEX_ALT_TID_MASK && \
-	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_PIVOT_HEAP_TID_ATTR) != 0 ? \
-	  ( \
-		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
-					   sizeof(ItemPointerData)) \
-	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
-	)
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int natts)
+{
+	Assert(natts <= INDEX_MAX_KEYS);
+
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	/* BT_IS_POSTING bit may be unset -- tuple always becomes a pivot tuple */
+	ItemPointerSetOffsetNumber(&itup->t_tid, natts);
+	Assert(BTreeTupleIsPivot(itup));
+}
+
 /*
- * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * Set the bit indicating heap TID attribute present in pivot tuple
  */
-#define BTreeTupleSetAltHeapTID(itup) \
-	do { \
-		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
-								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_PIVOT_HEAP_TID_ATTR); \
-	} while(0)
+static inline void
+BTreeTupleSetAltHeapTID(IndexTuple pivot)
+{
+	OffsetNumber existing;
+
+	Assert(BTreeTupleIsPivot(pivot));
+
+	existing = ItemPointerGetOffsetNumberNoCheck(&pivot->t_tid);
+	ItemPointerSetOffsetNumber(&pivot->t_tid,
+							   existing | BT_PIVOT_HEAP_TID_ATTR);
+}
+
+/*
+ * Get/set leaf page's "top parent" link from its high key.  Used during page
+ * deletion.
+ *
+ * Note: Cannot assert that tuple is a pivot tuple.  If we did so then
+ * !heapkeyspace indexes would exhibit false positive assertion failures.
+ */
+static inline BlockNumber
+BTreeTupleGetTopParent(IndexTuple leafhikey)
+{
+	return ItemPointerGetBlockNumberNoCheck(&leafhikey->t_tid);
+}
+
+static inline void
+BTreeTupleSetTopParent(IndexTuple leafhikey, BlockNumber blkno)
+{
+	ItemPointerSetBlockNumber(&leafhikey->t_tid, blkno);
+	BTreeTupleSetNAtts(leafhikey, 0);
+}
+
+/*
+ * Get tiebreaker heap TID attribute, if any.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
+ */
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+	if (BTreeTupleIsPivot(itup))
+	{
+		/* Pivot tuple heap TID representation? */
+		if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+			 BT_PIVOT_HEAP_TID_ATTR) != 0)
+			return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+								  sizeof(ItemPointerData));
+
+		/* Heap TID attribute was truncated */
+		return NULL;
+	}
+	else if (BTreeTupleIsPosting(itup))
+		return BTreeTupleGetPosting(itup);
+
+	return &itup->t_tid;
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list tuple.
+ *
+ * Works with non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxHeapTID(IndexTuple itup)
+{
+	Assert(!BTreeTupleIsPivot(itup));
+
+	if (BTreeTupleIsPosting(itup))
+	{
+		uint16		nposting = BTreeTupleGetNPosting(itup);
+
+		return BTreeTupleGetPostingN(itup, nposting - 1);
+	}
+
+	return &itup->t_tid;
+}
 
 /*
  *	Operator strategy numbers for B-tree have been moved to access/stratnum.h,
@@ -434,6 +626,9 @@ typedef BTStackData *BTStack;
  * indexes whose version is >= version 4.  It's convenient to keep this close
  * by, rather than accessing the metapage repeatedly.
  *
+ * safededup is set to indicate that index may use deduplication safely.
+ * This is also a property of the index relation rather than an indexscan.
+ *
  * anynullkeys indicates if any of the keys had NULL value when scankey was
  * built from index tuple (note that already-truncated tuple key attributes
  * set NULL as a placeholder key value, which also affects value of
@@ -469,6 +664,7 @@ typedef BTStackData *BTStack;
 typedef struct BTScanInsertData
 {
 	bool		heapkeyspace;
+	bool		safededup;
 	bool		anynullkeys;
 	bool		nextkey;
 	bool		pivotsearch;
@@ -507,10 +703,59 @@ typedef struct BTInsertStateData
 	bool		bounds_valid;
 	OffsetNumber low;
 	OffsetNumber stricthigh;
+
+	/*
+	 * if _bt_binsrch_insert found the location inside existing posting list,
+	 * save the position inside the list.  -1 sentinel value indicates overlap
+	 * with an existing posting list tuple that has its LP_DEAD bit set.
+	 */
+	int			postingoff;
 } BTInsertStateData;
 
 typedef BTInsertStateData *BTInsertState;
 
+/*
+ * BTDedupStateData is a working area used during deduplication.
+ *
+ * The status info fields track the state of a whole-page deduplication pass.
+ * State about the current pending posting list is also tracked.
+ *
+ * A pending posting list is comprised of a contiguous group of equal items
+ * from the page, starting from page offset number 'baseoff'.  This is the
+ * offset number of the "base" tuple for new posting list.  'nitems' is the
+ * current total number of existing items from the page that will be merged to
+ * make a new posting list tuple, including the base tuple item.  (Existing
+ * items may themselves be posting list tuples, or regular non-pivot tuples.)
+ *
+ * Note that when deduplication merges together existing tuples, the page is
+ * modified eagerly.  This makes tracking the details of more than a single
+ * pending posting list at a time unnecessary.  The total size of the existing
+ * tuples to be freed when pending posting list is processed gets tracked by
+ * 'phystupsize'.  This information allows deduplication to calculate the
+ * space saving for each new posting list tuple, and for the entire pass over
+ * the page as a whole.
+ */
+typedef struct BTDedupStateData
+{
+	/* Deduplication status info for entire pass over page */
+	Size		maxpostingsize; /* Limit on size of final tuple */
+	bool		checkingunique; /* Use unique index strategy? */
+	OffsetNumber skippedbase;	/* First offset skipped by checkingunique */
+
+	/* Metadata about base tuple of current pending posting list */
+	IndexTuple	base;			/* Use to form new posting list */
+	OffsetNumber baseoff;		/* page offset of base */
+	Size		basetupsize;	/* base size without original posting list */
+
+	/* Other metadata about pending posting list */
+	ItemPointer htids;			/* Heap TIDs in pending posting list */
+	int			nhtids;			/* Number of heap TIDs in nhtids array */
+	int			nitems;			/* Number of existing tuples/line pointers */
+	Size		phystupsize;	/* Includes line pointer overhead */
+} BTDedupStateData;
+
+typedef BTDedupStateData *BTDedupState;
+
 /*
  * BTScanOpaqueData is the btree-private state needed for an indexscan.
  * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
@@ -534,7 +779,9 @@ typedef BTInsertStateData *BTInsertState;
  * If we are doing an index-only scan, we save the entire IndexTuple for each
  * matched item, otherwise only its heap TID and offset.  The IndexTuples go
  * into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array.  Posting list tuples store a "base" tuple once,
+ * allowing the same key to be returned for each TID in the posting list
+ * tuple.
  */
 
 typedef struct BTScanPosItem	/* what we remember about each match */
@@ -578,7 +825,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxTIDsPerBTreePage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -686,6 +933,7 @@ typedef struct BTOptions
 	int			fillfactor;		/* page fill factor in percent (0..100) */
 	/* fraction of newly inserted tuples prior to trigger index cleanup */
 	float8		vacuum_cleanup_index_scale_factor;
+	bool		deduplicate_items;	/* Use deduplication where safe? */
 } BTOptions;
 
 #define BTGetFillFactor(relation) \
@@ -694,8 +942,16 @@ typedef struct BTOptions
 	 (relation)->rd_options ? \
 	 ((BTOptions *) (relation)->rd_options)->fillfactor : \
 	 BTREE_DEFAULT_FILLFACTOR)
+#define BTGetUseDedup(relation) \
+	(AssertMacro(relation->rd_rel->relkind == RELKIND_INDEX && \
+				 relation->rd_rel->relam == BTREE_AM_OID), \
+	((relation)->rd_options ? \
+	 ((BTOptions *) (relation)->rd_options)->deduplicate_items : \
+	 BTGetUseDedupGUC(relation)))
 #define BTGetTargetPageFreeSpace(relation) \
 	(BLCKSZ * (100 - BTGetFillFactor(relation)) / 100)
+#define BTGetUseDedupGUC(relation) \
+	(relation->rd_index->indisunique || deduplicate_btree_items)
 
 /*
  * Constant definition for progress reporting.  Phase numbers must match
@@ -742,6 +998,22 @@ extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
 extern void _bt_parallel_done(IndexScanDesc scan);
 extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
 
+/*
+ * prototypes for functions in nbtdedup.c
+ */
+extern void _bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
+							   IndexTuple newitem, Size newitemsz,
+							   bool checkingunique);
+extern void _bt_dedup_start_pending(BTDedupState state, IndexTuple base,
+									OffsetNumber base_off);
+extern bool _bt_dedup_save_htid(BTDedupState state, IndexTuple itup);
+extern Size _bt_dedup_finish_pending(Buffer buf, BTDedupState state,
+									 bool logged);
+extern IndexTuple _bt_form_posting(IndexTuple base, ItemPointer htids,
+								   int nhtids);
+extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
+								   int postingoff);
+
 /*
  * prototypes for functions in nbtinsert.c
  */
@@ -760,14 +1032,16 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
 /*
  * prototypes for functions in nbtpage.c
  */
-extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level);
+extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+							 bool safededup);
 extern void _bt_update_meta_cleanup_info(Relation rel,
 										 TransactionId oldestBtpoXact, float8 numHeapTuples);
 extern void _bt_upgrademetapage(Page page);
 extern Buffer _bt_getroot(Relation rel, int access);
 extern Buffer _bt_gettrueroot(Relation rel);
 extern int	_bt_getrootheight(Relation rel);
-extern bool _bt_heapkeyspace(Relation rel);
+extern void _bt_metaversion(Relation rel, bool *heapkeyspace,
+							bool *safededup);
 extern void _bt_checkpage(Relation rel, Buffer buf);
 extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
 extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -776,7 +1050,9 @@ extern void _bt_relbuf(Relation rel, Buffer buf);
 extern void _bt_pageinit(Page page, Size size);
 extern bool _bt_page_recyclable(Page page);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
-								OffsetNumber *deletable, int ndeletable);
+								OffsetNumber *deletable, int ndeletable,
+								OffsetNumber *updatable, IndexTuple *updated,
+								int nupdatable);
 extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *deletable, int ndeletable,
 								Relation heapRel);
@@ -829,6 +1105,7 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern bool _bt_opclasses_support_dedup(Relation index);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 776a9bd723..bad8da4b30 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP_PAGE	0x50	/* deduplicate tuples on leaf page */
+#define XLOG_BTREE_INSERT_POST	0x60	/* add index tuple with posting split */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -53,21 +54,34 @@ typedef struct xl_btree_metadata
 	uint32		fastlevel;
 	TransactionId oldest_btpo_xact;
 	float8		last_cleanup_num_heap_tuples;
+	bool		safededup;
 } xl_btree_metadata;
 
 /*
  * This is what we need to know about simple (without split) insert.
  *
- * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
- * Note that INSERT_META implies it's not a leaf page.
+ * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META, and
+ * INSERT_POST.  Note that INSERT_META and INSERT_UPPER implies it's not a
+ * leaf page, while INSERT_POST and INSERT_LEAF imply that it must be a leaf
+ * page.
  *
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page
  * Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
  * Backup Blk 2: xl_btree_metadata, if INSERT_META
+ *
+ * Note: The new tuple is actually the "original" new item in the posting
+ * list split insert case (i.e. the INSERT_POST case).  A split offset for
+ * the posting list is logged before the original new item.  Recovery needs
+ * both, since it must do an in-place update of the existing posting list
+ * that was split as an extra step.  Also, recovery generates a "final"
+ * newitem.  See _bt_swap_posting() for details on posting list splits.
  */
 typedef struct xl_btree_insert
 {
 	OffsetNumber offnum;
+
+	/* POSTING SPLIT OFFSET FOLLOWS (INSERT_POST case) */
+	/* NEW TUPLE ALWAYS FOLLOWS AT THE END */
 } xl_btree_insert;
 
 #define SizeOfBtreeInsert	(offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
@@ -92,8 +106,37 @@ typedef struct xl_btree_insert
  * Backup Blk 0: original page / new left page
  *
  * The left page's data portion contains the new item, if it's the _L variant.
- * An IndexTuple representing the high key of the left page must follow with
- * either variant.
+ * _R variant split records generally do not have a newitem (_R variant leaf
+ * page split records that must deal with a posting list split will include an
+ * explicit newitem, though it is never used on the right page -- it is
+ * actually an orignewitem needed to update existing posting list).  The new
+ * high key of the left/original page appears last of all (and must always be
+ * present).
+ *
+ * Page split records that need the REDO routine to deal with a posting list
+ * split directly will have an explicit newitem, which is actually an
+ * orignewitem (the newitem as it was before the posting list split, not
+ * after).  A posting list split always has a newitem that comes immediately
+ * after the posting list being split (which would have overlapped with
+ * orignewitem prior to split).  Usually REDO must deal with posting list
+ * splits with an _L variant page split record, and usually both the new
+ * posting list and the final newitem go on the left page (the existing
+ * posting list will be inserted instead of the old, and the final newitem
+ * will be inserted next to that).  However, _R variant split records will
+ * include an orignewitem when the split point for the page happens to have a
+ * lastleft tuple that is also the posting list being split (leaving newitem
+ * as the page split's firstright tuple).  The existence of this corner case
+ * does not change the basic fact about newitem/orignewitem for the REDO
+ * routine: it is always state used for the left page alone.  (This is why the
+ * record's postingoff field isn't a reliable indicator of whether or not a
+ * posting list split occurred during the page split; a non-zero value merely
+ * indicates that the REDO routine must reconstruct a new posting list tuple
+ * that is needed for the left page.)
+ *
+ * This posting list split handling is equivalent to the xl_btree_insert REDO
+ * routine's INSERT_POST handling.  While the details are more complicated
+ * here, the concept and goals are exactly the same.  See _bt_swap_posting()
+ * for details on posting list splits.
  *
  * Backup Blk 1: new right page
  *
@@ -111,15 +154,32 @@ typedef struct xl_btree_split
 {
 	uint32		level;			/* tree level of page being split */
 	OffsetNumber firstright;	/* first item moved to right page */
-	OffsetNumber newitemoff;	/* new item's offset (useful for _L variant) */
+	OffsetNumber newitemoff;	/* new item's offset */
+	uint16		postingoff;		/* offset inside orig posting tuple */
 } xl_btree_split;
 
-#define SizeOfBtreeSplit	(offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit	(offsetof(xl_btree_split, postingoff) + sizeof(uint16))
+
+/*
+ * When page is deduplicated, consecutive groups of tuples with equal keys are
+ * merged together into posting list tuples.
+ *
+ * The WAL record represents the interval that describes the posting tuple
+ * that should be added to the page.
+ */
+typedef struct xl_btree_dedup
+{
+	OffsetNumber baseoff;
+	uint16		nitems;
+} xl_btree_dedup;
+
+#define SizeOfBtreeDedup 	(offsetof(xl_btree_dedup, nitems) + sizeof(uint16))
 
 /*
  * This is what we need to know about delete of individual leaf index tuples.
  * The WAL record can represent deletion of any number of index tuples on a
- * single index page when *not* executed by VACUUM.
+ * single index page when *not* executed by VACUUM.  Deletion of a subset of
+ * the TIDs within a posting list tuple is not supported.
  *
  * Backup Blk 0: index page
  */
@@ -152,19 +212,23 @@ typedef struct xl_btree_reuse_page
 /*
  * This is what we need to know about vacuum of individual leaf index tuples.
  * The WAL record can represent deletion of any number of index tuples on a
- * single index page when executed by VACUUM.
- *
- * Note that the WAL record in any vacuum of an index must have at least one
- * item to delete.
+ * single index page when executed by VACUUM.  It can also support "updates"
+ * of index tuples, which are how deletes of a subset of TIDs contained in an
+ * existing posting list tuple are implemented. (Updates are only used when
+ * there will be some remaining TIDs once VACUUM finishes; otherwise the
+ * posting list tuple can just be deleted).
  */
 typedef struct xl_btree_vacuum
 {
-	uint32		ndeleted;
+	uint16		ndeleted;
+	uint16		nupdated;
 
 	/* DELETED TARGET OFFSET NUMBERS FOLLOW */
+	/* UPDATED TARGET OFFSET NUMBERS FOLLOW */
+	/* UPDATED TUPLES FOR OVERWRITES FOLLOW */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, nupdated) + sizeof(uint16))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.
@@ -245,6 +309,8 @@ typedef struct xl_btree_newroot
 extern void btree_redo(XLogReaderState *record);
 extern void btree_desc(StringInfo buf, XLogReaderState *record);
 extern const char *btree_identify(uint8 info);
+extern void btree_xlog_startup(void);
+extern void btree_xlog_cleanup(void);
 extern void btree_mask(char *pagedata, BlockNumber blkno);
 
 #endif							/* NBTXLOG_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index c88dccfb8d..6c15df7e70 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -36,7 +36,7 @@ PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL,
 PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
 PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
 PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask)
 PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask)
 PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
 PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 79430d2b7b..f2b03a6cfc 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -158,6 +158,15 @@ static relopt_bool boolRelOpts[] =
 		},
 		true
 	},
+	{
+		{
+			"deduplicate_items",
+			"Enables deduplication on btree index leaf pages",
+			RELOPT_KIND_BTREE,
+			ShareUpdateExclusiveLock
+		},
+		true
+	},
 	/* list terminator */
 	{{NULL}}
 };
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index c16eb05416..dfba5ae39a 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -276,6 +276,10 @@ BuildIndexValueDescription(Relation indexRelation,
 /*
  * Get the latestRemovedXid from the table entries pointed at by the index
  * tuples being deleted.
+ *
+ * Note: index access methods that don't consistently use the standard
+ * IndexTuple + heap TID item pointer representation will need to provide
+ * their own version of this function.
  */
 TransactionId
 index_compute_xid_horizon_for_tuples(Relation irel,
diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index bf245f5dab..d69808e78c 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = \
 	nbtcompare.o \
+	nbtdedup.o \
 	nbtinsert.o \
 	nbtpage.o \
 	nbtree.o \
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index c60a4d0d9e..f2673328b4 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
 like a hint bit for a heap tuple), but physically removing tuples requires
 exclusive lock.  In the current code we try to remove LP_DEAD tuples when
 we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already).  Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every table TID within the posting list is known dead.)
 
 This leaves the index in a state where it has no entry for a dead tuple
 that still exists in the heap.  This is not a problem for the current
@@ -726,6 +729,152 @@ if it must.  When a page that's already full of duplicates must be split,
 the fallback strategy assumes that duplicates are mostly inserted in
 ascending heap TID order.  The page is split in a way that leaves the left
 half of the page mostly full, and the right half of the page mostly empty.
+The overall effect is that leaf page splits gracefully adapt to inserts of
+large groups of duplicates, maximizing space utilization.  Note also that
+"trapping" large groups of duplicates on the same leaf page like this makes
+deduplication more efficient.  Deduplication can be performed infrequently,
+without merging together existing posting list tuples too often.
+
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid (or at least delay) page splits.  Note that the
+goals for deduplication in unique indexes are rather different; see later
+section for details.  Deduplication alters the physical representation of
+tuples without changing the logical contents of the index, and without
+adding overhead to read queries.  Non-pivot tuples are merged together
+into a single physical tuple with a posting list (a simple array of heap
+TIDs with the standard item pointer format).  Deduplication is always
+applied lazily, at the point where it would otherwise be necessary to
+perform a page split.  It occurs only when LP_DEAD items have been
+removed, as our last line of defense against splitting a leaf page.  We
+can set the LP_DEAD bit with posting list tuples, though only when all
+TIDs are known dead.
+
+Our lazy approach to deduplication allows the page space accounting used
+during page splits to have absolutely minimal special case logic for
+posting lists.  Posting lists can be thought of as extra payload that
+suffix truncation will reliably truncate away as needed during page
+splits, just like non-key columns from an INCLUDE index tuple.
+Incoming/new tuples can generally be treated as non-overlapping plain
+items (though see section on posting list splits for information about how
+overlapping new/incoming items are really handled).
+
+The representation of posting lists is almost identical to the posting
+lists used by GIN, so it would be straightforward to apply GIN's varbyte
+encoding compression scheme to individual posting lists.  Posting list
+compression would break the assumptions made by posting list splits about
+page space accounting (see later section), so it's not clear how
+compression could be integrated with nbtree.  Besides, posting list
+compression does not offer a compelling trade-off for nbtree, since in
+general nbtree is optimized for consistent performance with many
+concurrent readers and writers.
+
+A major goal of our lazy approach to deduplication is to limit the
+performance impact of deduplication with random updates.  Even concurrent
+append-only inserts of the same key value will tend to have inserts of
+individual index tuples in an order that doesn't quite match heap TID
+order.  Delaying deduplication minimizes page level fragmentation.
+
+Deduplication in unique indexes
+-------------------------------
+
+Very often, the range of values that can be placed on a given leaf page in
+a unique index is fixed and permanent.  For example, a primary key on an
+identity column will usually only have page splits caused by the insertion
+of new logical rows within the rightmost leaf page.  If there is a split
+of a non-rightmost leaf page, then the split must have been triggered by
+inserts associated with an UPDATE of an existing logical row.  Splitting a
+leaf page purely to store multiple versions should be considered
+pathological, since it permanently degrades the index structure in order
+to absorb a temporary burst of duplicates.  Deduplication in unique
+indexes helps to prevent these pathological page splits.
+
+Like all index access methods, nbtree does not have direct knowledge of
+versioning or of MVCC; it deals only with physical tuples.  However, unique
+indexes implicitly give nbtree basic information about tuple versioning,
+since by definition zero or one tuples of any given key value can be
+visible to any possible MVCC snapshot (excluding index entries with NULL
+values).  When optimizations such as heapam's Heap-only tuples (HOT) happen
+to be ineffective, nbtree's on-the-fly deletion of tuples in unique indexes
+can be very important with UPDATE-heavy workloads.  Unique checking's
+LP_DEAD bit setting reliably attempts to kill old, equal index tuple
+versions.  This prevents (or at least delays) page splits that are
+necessary only because a leaf page must contain multiple physical tuples
+for the same logical row.  Deduplication in unique indexes must cooperate
+with this mechanism.  Deleting items on the page is always preferable to
+deduplication.
+
+The strategy used during a deduplication pass has significant differences
+to the strategy used for indexes that can have multiple logical rows with
+the same key value.  We're not really trying to store duplicates in a
+space efficient manner, since in the long run there won't be any
+duplicates anyway.  Rather, we're buying time for garbage collection
+mechanisms to run before a page split is needed.
+
+Unique index leaf pages only get a deduplication pass when an insertion
+(that might have to split the page) observed an existing duplicate on the
+page in passing.  This is based on the assumption that deduplication will
+only work out when _all_ new insertions are duplicates from UPDATEs.  This
+may mean that we miss an opportunity to delay a page split, but that's
+okay because our ultimate goal is to delay leaf page splits _indefinitely_
+(i.e. to prevent them altogether).  There is little point in trying to
+delay a split that is probably inevitable anyway.  This allows us to avoid
+the overhead of attempting to deduplicate with unique indexes that always
+have few or no duplicates.
+
+Posting list splits
+-------------------
+
+When the incoming tuple happens to overlap with an existing posting list,
+a posting list split is performed.  Like a page split, a posting list
+split resolves a situation where a new/incoming item "won't fit", while
+inserting the incoming item in passing (i.e. as part of the same atomic
+action).  It's possible (though not particularly likely) that an insert of
+a new item on to an almost-full page will overlap with a posting list,
+resulting in both a posting list split and a page split.  Even then, the
+atomic action that splits the posting list also inserts the new item
+(since page splits always insert the new item in passing).  Including the
+posting list split in the same atomic action as the insert avoids problems
+caused by concurrent inserts into the same posting list --  the exact
+details of how we change the posting list depend upon the new item, and
+vice-versa.  A single atomic action also minimizes the volume of extra
+WAL required for a posting list split, since we don't have to explicitly
+WAL-log the original posting list tuple.
+
+Despite piggy-backing on the same atomic action that inserts a new tuple,
+posting list splits can be thought of as a separate, extra action to the
+insert itself (or to the page split itself).  Posting list splits
+conceptually "rewrite" an insert that overlaps with an existing posting
+list into an insert that adds its final new item just to the right of the
+posting list instead.  The size of the posting list won't change, and so
+page space accounting code does not need to care about posting list splits
+at all.  This is an important upside of our design; the page split point
+choice logic is very subtle even without it needing to deal with posting
+list splits.
+
+Only a few isolated extra steps are required to preserve the illusion that
+the new item never overlapped with an existing posting list in the first
+place: the heap TID of the incoming tuple is swapped with the rightmost/max
+heap TID from the existing/originally overlapping posting list.  Also, the
+posting-split-with-page-split case must generate a new high key based on
+an imaginary version of the original page that has both the final new item
+and the after-list-split posting tuple (page splits usually just operate
+against an imaginary version that contains the new item/item that won't
+fit).
+
+This approach avoids inventing an "eager" atomic posting split operation
+that splits the posting list without simultaneously finishing the insert
+of the incoming item.  This alternative design might seem cleaner, but it
+creates subtle problems for page space accounting.  In general, there
+might not be enough free space on the page to split a posting list such
+that the incoming/new item no longer overlaps with either posting list
+half --- the operation could fail before the actual retail insert of the
+new item even begins.  We'd end up having to handle posting list splits
+that need a page split anyway.  Besides, supporting variable "split points"
+while splitting posting lists won't actually improve overall space
+utilization.
 
 Notes About Data Representation
 -------------------------------
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
new file mode 100644
index 0000000000..c326a8c666
--- /dev/null
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -0,0 +1,809 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtdedup.c
+ *	  Deduplicate items in Postgres btrees.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/nbtree/nbtdedup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/nbtxlog.h"
+#include "miscadmin.h"
+#include "utils/rel.h"
+
+static bool _bt_do_singleval(Relation rel, Page page, BTDedupState state,
+							 OffsetNumber minoff, IndexTuple newitem);
+static void _bt_singleval_fillfactor(Page page, BTDedupState state,
+									 Size newitemsz);
+#ifdef USE_ASSERT_CHECKING
+static bool _bt_posting_valid(IndexTuple posting);
+#endif
+
+/*
+ * Try to deduplicate items to free at least enough space to avoid a page
+ * split.  This function should be called during insertion, only after LP_DEAD
+ * items were removed by _bt_vacuum_one_page() to prevent a page split.
+ * (We'll have to kill LP_DEAD items here when the page's BTP_HAS_GARBAGE hint
+ * was not set, but that should be rare.)
+ *
+ * The general approach taken with !checkingunique callers is to perform as
+ * much deduplication as possible to free as much space as possible now.  Note
+ * that "single value" strategy is sometimes used.  This maximizes space
+ * utilization over time given a workload where many leaf pages are needed to
+ * store duplicates that all have the same duplicate value (e.g., a single
+ * column index where a significant fraction of all tuples are NULLs).
+ *
+ * The strategy for checkingunique callers is completely different.
+ * Deduplication works in tandem with garbage collection, especially the
+ * LP_DEAD bit setting that takes place in _bt_check_unique().  We give up as
+ * soon as it becomes clear that enough space has been made available to
+ * insert newitem without needing to split the page.  Also, we merge together
+ * larger groups of duplicate tuples first (merging together two index tuples
+ * usually saves very little space), and avoid merging together existing
+ * posting list tuples.  The goal is to generate posting lists with TIDs that
+ * are "close together in time", in order to maximize the chances of an
+ * LP_DEAD bit being set opportunistically.  See nbtree/README for more
+ * information on deduplication within unique indexes.
+ *
+ * Note that unique indexes will use the !checkingunique strategy when an
+ * insert of a tuple with NULLs causes a deduplication pass.  This case is
+ * still not affected by the deduplicate_btree_items GUC, since unique indexes
+ * never use it, and it doesn't seem worth creating a special case for.
+ *
+ * nbtinsert.c caller should call _bt_vacuum_one_page() before calling here
+ * when BTP_HAS_GARBAGE flag is set.  Note that this routine will delete all
+ * items on the page that have their LP_DEAD bit set, even when page's flag
+ * bit is not set (though that should be rare).  Caller can rely on that to
+ * avoid inserting a new tuple that happens to overlap with an existing
+ * posting list tuple with its LP_DEAD bit set. (Calling here with a newitemsz
+ * of 0 will reliably delete the existing item, making it possible to avoid
+ * unsetting the LP_DEAD bit just to insert the new item.  In general, posting
+ * list splits should never have to deal with a posting list tuple with its
+ * LP_DEAD bit set.)
+ *
+ * Note: If newitem contains NULL values in key attributes, caller will be
+ * !checkingunique even when rel is a unique index.  The page in question will
+ * usually have many existing items with NULLs.
+ */
+void
+_bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
+				   IndexTuple newitem, Size newitemsz, bool checkingunique)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buf);
+	BTPageOpaque opaque;
+	OffsetNumber deletable[MaxIndexTuplesPerPage];
+	BTDedupState state;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+	bool		minimal = checkingunique;
+	int			ndeletable = 0;
+	Size		pagesaving = 0;
+	int			pagenitems = 0;
+	bool		singlevalstrat = false;
+
+	/*
+	 * Caller should call _bt_vacuum_one_page() before calling here when it
+	 * looked like there were LP_DEAD items on the page.  However, we can't
+	 * assume that there are no LP_DEAD items (for one thing, VACUUM will
+	 * clear the BTP_HAS_GARBAGE hint without reliably removing items that are
+	 * marked LP_DEAD).  We must be careful to clear all LP_DEAD items because
+	 * posting list splits cannot go ahead if an existing posting list item
+	 * has its LP_DEAD bit set. (Also, we don't want to unnecessarily unset
+	 * LP_DEAD bits when deduplicating items on the page below, though that
+	 * should be harmless.)
+	 *
+	 * The opposite problem is also possible: _bt_vacuum_one_page() won't
+	 * clear the BTP_HAS_GARBAGE bit when it is falsely set (i.e. when there
+	 * are no LP_DEAD bits).  This probably doesn't matter in practice, since
+	 * it's only a hint, and VACUUM will clear it at some point anyway.  Even
+	 * still, we clear the BTP_HAS_GARBAGE hint reliably here. (Seems like a
+	 * good idea for deduplication to only begin when we unambiguously have no
+	 * LP_DEAD items.)
+	 */
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	minoff = P_FIRSTDATAKEY(opaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+
+		if (ItemIdIsDead(itemid))
+			deletable[ndeletable++] = offnum;
+	}
+
+	if (ndeletable > 0)
+	{
+		_bt_delitems_delete(rel, buf, deletable, ndeletable, heapRel);
+
+		/*
+		 * Return when a split will be avoided.  This is equivalent to
+		 * avoiding a split using the usual _bt_vacuum_one_page() path.
+		 */
+		if (PageGetFreeSpace(page) >= newitemsz)
+			return;
+
+		/*
+		 * Reconsider number of items on page, in case _bt_delitems_delete()
+		 * managed to delete an item or two
+		 */
+		minoff = P_FIRSTDATAKEY(opaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+	}
+	else if (P_HAS_GARBAGE(opaque))
+	{
+		opaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+		MarkBufferDirtyHint(buf, true);
+	}
+
+	/*
+	 * Return early in case where caller just wants us to kill an existing
+	 * LP_DEAD posting list tuple
+	 */
+	Assert(!P_HAS_GARBAGE(opaque));
+	if (newitemsz == 0)
+		return;
+
+	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+	newitemsz += sizeof(ItemIdData);
+
+	/*
+	 * By here, it's clear that deduplication will definitely be attempted.
+	 * Initialize deduplication state.
+	 *
+	 * It would be possible for maxpostingsize (limit on posting list tuple
+	 * size) to be set to one third of the page.  However, it seems like a
+	 * good idea to limit the size of posting lists to one sixth of a page.
+	 * That ought to leave us with a good split point when pages full of
+	 * duplicates can be split several times.
+	 */
+	state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+	state->maxpostingsize = Min(BTMaxItemSize(page) / 2, INDEX_SIZE_MASK);
+	state->checkingunique = checkingunique;
+	state->skippedbase = InvalidOffsetNumber;
+	/* Metadata about base tuple of current pending posting list */
+	state->base = NULL;
+	state->baseoff = InvalidOffsetNumber;
+	state->basetupsize = 0;
+	/* Metadata about current pending posting list TIDs */
+	state->htids = palloc(state->maxpostingsize);
+	state->nhtids = 0;
+	state->nitems = 0;
+	/* Size of all physical tuples to be replaced by pending posting list */
+	state->phystupsize = 0;
+
+	/* Determine if "single value" strategy should be used */
+	if (!checkingunique)
+		singlevalstrat = _bt_do_singleval(rel, page, state, minoff, newitem);
+
+	offnum = minoff;
+retry:
+
+	/*
+	 * Deduplicate items, starting from offnum.
+	 *
+	 * Note: We deliberately reassess the max offset number on each iteration.
+	 * The number of items on the page goes down as existing items are
+	 * deduplicated.
+	 */
+	while (offnum <= PageGetMaxOffsetNumber(page))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(!ItemIdIsDead(itemid));
+
+		if (state->nitems == 0)
+		{
+			/*
+			 * No previous/base tuple for the data item -- use the data item
+			 * as base tuple of pending posting list
+			 */
+			_bt_dedup_start_pending(state, itup, offnum);
+		}
+		else if (_bt_keep_natts_fast(rel, state->base, itup) > natts &&
+				 _bt_dedup_save_htid(state, itup))
+		{
+			/*
+			 * Tuple is equal to base tuple of pending posting list.  Heap
+			 * TID(s) for itup have been saved in state.
+			 */
+		}
+		else
+		{
+			/*
+			 * Tuple is not equal to pending posting list tuple, or
+			 * _bt_dedup_save_htid() opted to not merge current item into
+			 * pending posting list for some other reason (e.g., adding more
+			 * TIDs would have caused posting list to exceed current
+			 * maxpostingsize).
+			 *
+			 * If state contains pending posting list with more than one item,
+			 * form new posting tuple, and actually update the page.  Else
+			 * reset the state and move on without modifying the page.
+			 */
+			pagesaving += _bt_dedup_finish_pending(buf, state,
+												   RelationNeedsWAL(rel));
+			pagenitems++;
+
+			if (singlevalstrat)
+			{
+				/*
+				 * Single value strategy's extra steps.
+				 *
+				 * Lower maxpostingsize for sixth and final item that might be
+				 * deduplicated by current deduplication pass.  When sixth
+				 * item formed/observed, end pass.
+				 *
+				 * Note: It's possible that this will be reached even when
+				 * current deduplication pass has yet to modify the page.  It
+				 * doesn't matter whether or not the current call generated
+				 * the maxpostingsize-capped duplicate tuples at the start of
+				 * the page.
+				 */
+				Assert(!minimal && pagenitems <= 6);
+				if (pagenitems == 5)
+					_bt_singleval_fillfactor(page, state, newitemsz);
+				else if (pagenitems == 6)
+					break;
+			}
+
+			/*
+			 * Stop deduplicating for a checkingunique (minimal) caller once
+			 * we've freed enough space to avoid an immediate page split
+			 */
+			else if (minimal && pagesaving >= newitemsz)
+				break;
+
+			/*
+			 * Next iteration starts immediately after base tuple offset (this
+			 * will be the next offset on the page when we didn't modify the
+			 * page)
+			 */
+			offnum = state->baseoff;
+		}
+
+		offnum = OffsetNumberNext(offnum);
+	}
+
+	/* Handle the last item when pending posting list is not empty */
+	if (state->nitems != 0)
+	{
+		pagesaving += _bt_dedup_finish_pending(buf, state,
+											   RelationNeedsWAL(rel));
+		pagenitems++;
+	}
+
+	if (pagesaving < newitemsz && state->skippedbase != InvalidOffsetNumber)
+	{
+		/*
+		 * Didn't free enough space for new item in first checkingunique pass.
+		 * Try making a second pass over the page, this time starting from the
+		 * first candidate posting list base offset that was skipped over in
+		 * the first pass (only do a second pass when this actually happened).
+		 *
+		 * The second pass over the page may deduplicate items that were
+		 * initially passed over due to concerns about limiting the
+		 * effectiveness of LP_DEAD bit setting within _bt_check_unique().
+		 * Note that the second pass will still stop deduplicating as soon as
+		 * enough space has been freed to avoid an immediate page split.
+		 */
+		offnum = state->skippedbase;
+		pagenitems = 0;
+
+		Assert(state->checkingunique);
+		state->checkingunique = false;
+		state->skippedbase = InvalidOffsetNumber;
+
+		goto retry;
+	}
+
+	/* Local space accounting should agree with page accounting */
+	Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
+
+	/* cannot leak memory here */
+	pfree(state->htids);
+	pfree(state);
+}
+
+/*
+ * Create a new pending posting list tuple based on caller's base tuple.
+ *
+ * Every tuple processed by deduplication either becomes the base tuple for a
+ * posting list, or gets its heap TID(s) accepted into a pending posting list.
+ * A tuple that starts out as the base tuple for a posting list will only
+ * actually be rewritten within _bt_dedup_finish_pending() when it turns out
+ * that there are duplicates that can be merged into the base tuple.
+ */
+void
+_bt_dedup_start_pending(BTDedupState state, IndexTuple base,
+						OffsetNumber baseoff)
+{
+	Assert(state->nhtids == 0);
+	Assert(state->nitems == 0);
+	Assert(!BTreeTupleIsPivot(base));
+
+	/*
+	 * Copy heap TID(s) from new base tuple for new candidate posting list
+	 * into working state's array
+	 */
+	if (!BTreeTupleIsPosting(base))
+	{
+		memcpy(state->htids, base, sizeof(ItemPointerData));
+		state->nhtids = 1;
+		state->basetupsize = IndexTupleSize(base);
+	}
+	else
+	{
+		int			nposting;
+
+		nposting = BTreeTupleGetNPosting(base);
+		memcpy(state->htids, BTreeTupleGetPosting(base),
+			   sizeof(ItemPointerData) * nposting);
+		state->nhtids = nposting;
+		/* basetupsize should not include existing posting list */
+		state->basetupsize = BTreeTupleGetPostingOffset(base);
+	}
+
+	/*
+	 * Save new base tuple itself -- it'll be needed if we actually create a
+	 * new posting list from new pending posting list.
+	 *
+	 * Must maintain physical size of all existing tuples (including line
+	 * pointer overhead) so that we can calculate space savings on page.
+	 */
+	state->nitems = 1;
+	state->base = base;
+	state->baseoff = baseoff;
+	state->phystupsize = MAXALIGN(IndexTupleSize(base)) + sizeof(ItemIdData);
+}
+
+/*
+ * Save itup heap TID(s) into pending posting list where possible.
+ *
+ * Returns bool indicating if the pending posting list managed by state now
+ * includes itup's heap TID(s).
+ */
+bool
+_bt_dedup_save_htid(BTDedupState state, IndexTuple itup)
+{
+	int			nhtids;
+	ItemPointer htids;
+	Size		mergedtupsz;
+
+	Assert(!BTreeTupleIsPivot(itup));
+
+	if (!BTreeTupleIsPosting(itup))
+	{
+		nhtids = 1;
+		htids = &itup->t_tid;
+	}
+	else
+	{
+		nhtids = BTreeTupleGetNPosting(itup);
+		htids = BTreeTupleGetPosting(itup);
+	}
+
+	/*
+	 * Don't append (have caller finish pending posting list as-is) if
+	 * appending heap TID(s) from itup would put us over maxpostingsize limit.
+	 *
+	 * This calculation needs to match the code used within _bt_form_posting()
+	 * for new posting list tuples.
+	 */
+	mergedtupsz = MAXALIGN(state->basetupsize +
+						   (state->nhtids + nhtids) * sizeof(ItemPointerData));
+
+	if (mergedtupsz > state->maxpostingsize)
+		return false;
+
+	/* Don't merge existing posting lists in first checkingunique pass */
+	if (state->checkingunique &&
+		(BTreeTupleIsPosting(state->base) || nhtids > 1))
+	{
+		/* May begin here if second pass over page is required */
+		if (state->skippedbase == InvalidOffsetNumber)
+			state->skippedbase = state->baseoff;
+		return false;
+	}
+
+	/*
+	 * Save heap TIDs to pending posting list tuple -- itup can be merged into
+	 * pending posting list
+	 */
+	state->nitems++;
+	memcpy(state->htids + state->nhtids, htids,
+		   sizeof(ItemPointerData) * nhtids);
+	state->nhtids += nhtids;
+	state->phystupsize += MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
+
+	return true;
+}
+
+/*
+ * Finalize pending posting list tuple, and add it to the page.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * Returns space saving from deduplicating to make a new posting list tuple.
+ * Note that this includes line pointer overhead.  This is zero in the case
+ * where no deduplication was possible.
+ */
+Size
+_bt_dedup_finish_pending(Buffer buf, BTDedupState state, bool logged)
+{
+	Size		spacesaving = 0;
+	Page		page = BufferGetPage(buf);
+	int			minimum = 2;
+
+	Assert(state->nitems > 0);
+	Assert(state->nitems <= state->nhtids);
+
+	/*
+	 * Only create a posting list when at least 3 heap TIDs will appear in the
+	 * checkingunique case (checkingunique strategy won't merge existing
+	 * posting list tuples, so we know that the number of items here must also
+	 * be the total number of heap TIDs).  Creating a new posting lists with
+	 * only two heap TIDs won't even save enough space to fit another
+	 * duplicate with the same key as the posting list.  This is a bad
+	 * trade-off if there is a chance that the LP_DEAD bit can be set for
+	 * either existing tuple by putting off deduplication.
+	 *
+	 * (Note that a second pass over the page can deduplicate the item if that
+	 * is truly the only way to avoid a page split for checkingunique caller.)
+	 */
+	Assert(!state->checkingunique || state->nitems == 1 ||
+		   state->nhtids == state->nitems);
+	if (state->checkingunique)
+	{
+		minimum = 3;
+		/* May begin here if second pass over page is required */
+		if (state->nitems == 2 && state->skippedbase == InvalidOffsetNumber)
+			state->skippedbase = state->baseoff;
+	}
+
+	if (state->nitems >= minimum)
+	{
+		OffsetNumber deletable[MaxIndexTuplesPerPage];
+		int			ndeletable = 0;
+		OffsetNumber offnum;
+		IndexTuple	final;
+		Size		finalsz;
+
+		/* find all tuples that will be replaced with this new posting tuple */
+		for (offnum = state->baseoff;
+			 offnum < state->baseoff + state->nitems;
+			 offnum = OffsetNumberNext(offnum))
+			deletable[ndeletable++] = offnum;
+
+		/* Form a tuple with a posting list */
+		final = _bt_form_posting(state->base, state->htids, state->nhtids);
+		finalsz = IndexTupleSize(final);
+		spacesaving = state->phystupsize - (finalsz + sizeof(ItemIdData));
+		/* Must save some space, and must not exceed tuple limits */
+		Assert(spacesaving > 0 && spacesaving < BLCKSZ);
+		Assert(finalsz <= state->maxpostingsize);
+		Assert(finalsz == MAXALIGN(IndexTupleSize(final)));
+
+		START_CRIT_SECTION();
+
+		/* Delete original items */
+		PageIndexMultiDelete(page, deletable, ndeletable);
+		/* Insert posting tuple, replacing original items */
+		if (PageAddItem(page, (Item) final, finalsz, state->baseoff, false,
+						false) == InvalidOffsetNumber)
+			elog(ERROR, "deduplication failed to add tuple to page");
+
+		MarkBufferDirty(buf);
+
+		/* Log deduplicated items */
+		if (logged)
+		{
+			XLogRecPtr	recptr;
+			xl_btree_dedup xlrec_dedup;
+
+			xlrec_dedup.baseoff = state->baseoff;
+			xlrec_dedup.nitems = state->nitems;
+
+			XLogBeginInsert();
+			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+			XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+			recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP_PAGE);
+
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
+
+		pfree(final);
+	}
+
+	/* Reset state for next pending posting list */
+	state->nhtids = 0;
+	state->nitems = 0;
+	state->phystupsize = 0;
+
+	return spacesaving;
+}
+
+/*
+ * Determine if page non-pivot tuples (data items) are all duplicates of the
+ * same value -- if they are, deduplication's "single value" strategy should
+ * be applied.  The general goal of this strategy is to ensure that
+ * nbtsplitloc.c (which uses its own single value strategy) will find a useful
+ * split point as further duplicates are inserted, and successive rightmost
+ * page splits occur among pages that store the same duplicate value.  When
+ * the page finally splits, it should end up BTREE_SINGLEVAL_FILLFACTOR% full,
+ * just like it would if deduplication were disabled.
+ *
+ * We expect that affected workloads will require _several_ single value
+ * strategy deduplication passes (over a page that only stores duplicates)
+ * before the page is finally split.  The first deduplication pass should only
+ * find regular non-pivot tuples.  Later deduplication passes will find
+ * existing maxpostingsize-capped posting list tuples, which must be skipped
+ * over.  The penultimate pass is generally the first pass that actually
+ * reaches _bt_singleval_fillfactor(), and so will deliberately leave behind a
+ * few untouched non-pivot tuples.  The final deduplication pass won't free
+ * any space -- it will skip over everything without merging anything (it
+ * retraces the steps of the penultimate pass).
+ *
+ * Fortunately, having several passes isn't too expensive.  Each pass (after
+ * the first pass) won't spend many cycles on the large posting list tuples
+ * left by previous passes.  Each pass will find a large contiguous group of
+ * smaller duplicate tuples to merge together at the end of the page.
+ *
+ * Note: We deliberately don't bother checking if the high key is a distinct
+ * value (prior to the TID tiebreaker column) before proceeding, unlike
+ * nbtsplitloc.c.  Its single value strategy only gets applied on the
+ * rightmost page of duplicates of the same value (other leaf pages full of
+ * duplicates will get a simple 50:50 page split instead of splitting towards
+ * the end of the page).  There is little point in making the same distinction
+ * here.
+ */
+static bool
+_bt_do_singleval(Relation rel, Page page, BTDedupState state,
+				 OffsetNumber minoff, IndexTuple newitem)
+{
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+	ItemId		itemid;
+	IndexTuple	itup;
+
+	itemid = PageGetItemId(page, minoff);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+
+	if (_bt_keep_natts_fast(rel, newitem, itup) > natts)
+	{
+		itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		if (_bt_keep_natts_fast(rel, newitem, itup) > natts)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Lower maxpostingsize when using "single value" strategy, to avoid a sixth
+ * and final maxpostingsize-capped tuple.  The sixth and final posting list
+ * tuple will end up somewhat smaller than the first five.  (Note: The first
+ * five tuples could actually just be very large duplicate tuples that
+ * couldn't be merged together at all.  Deduplication will simply not modify
+ * the page when that happens.)
+ *
+ * When there are six posting lists on the page (after current deduplication
+ * pass goes on to create/observe a sixth very large tuple), caller should end
+ * its deduplication pass.  It isn't useful to try to deduplicate items that
+ * are supposed to end up on the new right sibling page following the
+ * anticipated page split.  A future deduplication pass of future right
+ * sibling page might take care of it.  (This is why the first single value
+ * strategy deduplication pass for a given leaf page will generally find only
+ * plain non-pivot tuples -- see _bt_do_singleval() comments.)
+ */
+static void
+_bt_singleval_fillfactor(Page page, BTDedupState state, Size newitemsz)
+{
+	Size		leftfree;
+	int			reduction;
+
+	/* This calculation needs to match nbtsplitloc.c */
+	leftfree = PageGetPageSize(page) - SizeOfPageHeaderData -
+		MAXALIGN(sizeof(BTPageOpaqueData));
+	/* Subtract size of new high key (includes pivot heap TID space) */
+	leftfree -= newitemsz + MAXALIGN(sizeof(ItemPointerData));
+
+	/*
+	 * Reduce maxpostingsize by an amount equal to target free space on left
+	 * half of page
+	 */
+	reduction = leftfree * ((100 - BTREE_SINGLEVAL_FILLFACTOR) / 100.0);
+	if (state->maxpostingsize > reduction)
+		state->maxpostingsize -= reduction;
+	else
+		state->maxpostingsize = 0;
+}
+
+/*
+ * Build a posting list tuple based on caller's "base" index tuple and list of
+ * heap TIDs.  When nhtids == 1, builds a standard non-pivot tuple without a
+ * posting list. (Posting list tuples can never have a single heap TID, partly
+ * because that ensures that deduplication always reduces final MAXALIGN()'d
+ * size of entire tuple.)
+ *
+ * Convention is that posting list starts at a MAXALIGN()'d offset (rather
+ * than a SHORTALIGN()'d offset), in line with the approach taken when
+ * appending a heap TID to new pivot tuple/high key during suffix truncation.
+ * This sometimes wastes a little space that was only needed as alignment
+ * padding in the original tuple.  Following this convention simplifies the
+ * space accounting used when deduplicating a page (the same convention
+ * simplifies the accounting for choosing a point to split a page at).
+ *
+ * Note: Caller's "htids" array must be unique and already in ascending TID
+ * order.  Any existing heap TIDs from "base" won't automatically appear in
+ * returned posting list tuple (they must be included in htids array.)
+ */
+IndexTuple
+_bt_form_posting(IndexTuple base, ItemPointer htids, int nhtids)
+{
+	uint32		keysize,
+				newsize;
+	IndexTuple	itup;
+
+	if (BTreeTupleIsPosting(base))
+		keysize = BTreeTupleGetPostingOffset(base);
+	else
+		keysize = IndexTupleSize(base);
+
+	Assert(!BTreeTupleIsPivot(base));
+	Assert(nhtids > 0 && nhtids <= PG_UINT16_MAX);
+	Assert(keysize == MAXALIGN(keysize));
+
+	/*
+	 * Determine final size of new tuple.
+	 *
+	 * The calculation used when new tuple has a posting list needs to match
+	 * the code used within _bt_dedup_save_htid().
+	 */
+	if (nhtids > 1)
+		newsize = MAXALIGN(keysize +
+						   nhtids * sizeof(ItemPointerData));
+	else
+		newsize = keysize;
+
+	Assert(newsize <= INDEX_SIZE_MASK);
+	Assert(newsize == MAXALIGN(newsize));
+
+	/* Allocate memory using palloc0() (matches index_form_tuple()) */
+	itup = palloc0(newsize);
+	memcpy(itup, base, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+	if (nhtids > 1)
+	{
+		/* Form posting list tuple */
+		BTreeTupleSetPosting(itup, nhtids, keysize);
+		memcpy(BTreeTupleGetPosting(itup), htids,
+			   sizeof(ItemPointerData) * nhtids);
+		Assert(_bt_posting_valid(itup));
+	}
+	else
+	{
+		/* Form standard non-pivot tuple */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		ItemPointerCopy(htids, &itup->t_tid);
+		Assert(ItemPointerIsValid(&itup->t_tid));
+	}
+
+	return itup;
+}
+
+/*
+ * Prepare for a posting list split by swapping heap TID in newitem with heap
+ * TID from original posting list (the 'oposting' heap TID located at offset
+ * 'postingoff').  Modifies newitem, so caller should pass their own private
+ * copy that can safely be modified.
+ *
+ * Returns new posting list tuple, which is palloc()'d in caller's context.
+ * This is guaranteed to be the same size as 'oposting'.  Modified newitem is
+ * what caller actually inserts. (This generally happens inside the same
+ * critical section that performs an in-place update of old posting list using
+ * new posting list returned here).
+ *
+ * While the keys from newitem and oposting must be opclass equal, and must
+ * generate identical output when run through the underlying type's output
+ * function, it doesn't follow that their representations match exactly.
+ * Caller must avoid assuming that there can't be representational differences
+ * that make datums from oposting bigger or smaller than the corresponding
+ * datums from newitem.  For example, differences in TOAST input state might
+ * break a faulty assumption about tuple size (the executor is entitled to
+ * apply TOAST compression based on its own criteria).  It also seems possible
+ * that further representational variation will be introduced in the future,
+ * in order to support nbtree features like page-level prefix compression.
+ *
+ * See nbtree/README for details on the design of posting list splits.
+ */
+IndexTuple
+_bt_swap_posting(IndexTuple newitem, IndexTuple oposting, int postingoff)
+{
+	int			nhtids;
+	char	   *replacepos;
+	char	   *replaceposright;
+	Size		nmovebytes;
+	IndexTuple	nposting;
+
+	nhtids = BTreeTupleGetNPosting(oposting);
+	Assert(_bt_posting_valid(oposting));
+	Assert(postingoff > 0 && postingoff < nhtids);
+
+	/*
+	 * Move item pointers in posting list to make a gap for the new item's
+	 * heap TID.  We shift TIDs one place to the right, losing original
+	 * rightmost TID. (nmovebytes must not include TIDs to the left of
+	 * postingoff, nor the existing rightmost/max TID that gets overwritten.)
+	 */
+	nposting = CopyIndexTuple(oposting);
+	replacepos = (char *) BTreeTupleGetPostingN(nposting, postingoff);
+	replaceposright = (char *) BTreeTupleGetPostingN(nposting, postingoff + 1);
+	nmovebytes = (nhtids - postingoff - 1) * sizeof(ItemPointerData);
+	memmove(replaceposright, replacepos, nmovebytes);
+
+	/* Fill the gap at postingoff with TID of new item (original new TID) */
+	Assert(!BTreeTupleIsPivot(newitem) && !BTreeTupleIsPosting(newitem));
+	ItemPointerCopy(&newitem->t_tid, (ItemPointer) replacepos);
+
+	/* Now copy oposting's rightmost/max TID into new item (final new TID) */
+	ItemPointerCopy(BTreeTupleGetMaxHeapTID(oposting), &newitem->t_tid);
+
+	Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(nposting),
+							  BTreeTupleGetHeapTID(newitem)) < 0);
+	Assert(_bt_posting_valid(nposting));
+
+	return nposting;
+}
+
+/*
+ * Verify posting list invariants for "posting", which must be a posting list
+ * tuple.  Used within assertions.
+ */
+#ifdef USE_ASSERT_CHECKING
+static bool
+_bt_posting_valid(IndexTuple posting)
+{
+	ItemPointerData last;
+	ItemPointer htid;
+
+	if (!BTreeTupleIsPosting(posting) || BTreeTupleGetNPosting(posting) < 2)
+		return false;
+
+	/* Remember first heap TID for loop */
+	ItemPointerCopy(BTreeTupleGetHeapTID(posting), &last);
+	if (!ItemPointerIsValid(&last))
+		return false;
+
+	/* Iterate, starting from second TID */
+	for (int i = 1; i < BTreeTupleGetNPosting(posting); i++)
+	{
+		htid = BTreeTupleGetPostingN(posting, i);
+
+		if (!ItemPointerIsValid(htid))
+			return false;
+		if (ItemPointerCompare(htid, &last) <= 0)
+			return false;
+		ItemPointerCopy(htid, &last);
+	}
+
+	return true;
+}
+#endif
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 7ddba3ff9f..1753af8f8b 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -28,6 +28,8 @@
 /* Minimum tree height for application of fastpath optimization */
 #define BTREE_FASTPATH_MIN_LEVEL	2
 
+/* GUC parameter */
+bool		deduplicate_btree_items = true;
 
 static Buffer _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
 
@@ -47,10 +49,12 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   BTStack stack,
 						   IndexTuple itup,
 						   OffsetNumber newitemoff,
+						   int postingoff,
 						   bool split_only_page);
 static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 						Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
-						IndexTuple newitem);
+						IndexTuple newitem, IndexTuple orignewitem,
+						IndexTuple nposting, uint16 postingoff);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 							  BTStack stack, bool is_root, bool is_only);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
@@ -125,6 +129,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	insertstate.itup_key = itup_key;
 	insertstate.bounds_valid = false;
 	insertstate.buf = InvalidBuffer;
+	insertstate.postingoff = 0;
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -295,7 +300,7 @@ top:
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
 		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
-					   itup, newitemoff, false);
+					   itup, newitemoff, insertstate.postingoff, false);
 	}
 	else
 	{
@@ -340,6 +345,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 				 uint32 *speculativeToken)
 {
 	IndexTuple	itup = insertstate->itup;
+	IndexTuple	curitup;
+	ItemId		curitemid;
 	BTScanInsert itup_key = insertstate->itup_key;
 	SnapshotData SnapshotDirty;
 	OffsetNumber offset;
@@ -348,6 +355,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 	BTPageOpaque opaque;
 	Buffer		nbuf = InvalidBuffer;
 	bool		found = false;
+	bool		inposting = false;
+	bool		prevalldead = true;
+	int			curposti = 0;
 
 	/* Assume unique until we find a duplicate */
 	*is_unique = true;
@@ -375,13 +385,21 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 	Assert(itup_key->scantid == NULL);
 	for (;;)
 	{
-		ItemId		curitemid;
-		IndexTuple	curitup;
-		BlockNumber nblkno;
-
 		/*
-		 * make sure the offset points to an actual item before trying to
-		 * examine it...
+		 * Each iteration of the loop processes one heap TID, not one index
+		 * tuple.  Current offset number for page isn't usually advanced on
+		 * iterations that process heap TIDs from posting list tuples.
+		 *
+		 * "inposting" state is set when _inside_ a posting list --- not when
+		 * we're at the start (or end) of a posting list.  We advance curposti
+		 * at the end of the iteration when inside a posting list tuple.  In
+		 * general, every loop iteration either advances the page offset or
+		 * advances curposti --- an iteration that handles the rightmost/max
+		 * heap TID in a posting list finally advances the page offset (and
+		 * unsets "inposting").
+		 *
+		 * Make sure the offset points to an actual index tuple before trying
+		 * to examine it...
 		 */
 		if (offset <= maxoff)
 		{
@@ -406,31 +424,60 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 				break;
 			}
 
-			curitemid = PageGetItemId(page, offset);
-
 			/*
-			 * We can skip items that are marked killed.
+			 * We can skip items that are already marked killed.
 			 *
 			 * In the presence of heavy update activity an index may contain
 			 * many killed items with the same key; running _bt_compare() on
 			 * each killed item gets expensive.  Just advance over killed
 			 * items as quickly as we can.  We only apply _bt_compare() when
-			 * we get to a non-killed item.  Even those comparisons could be
-			 * avoided (in the common case where there is only one page to
-			 * visit) by reusing bounds, but just skipping dead items is fast
-			 * enough.
+			 * we get to a non-killed item.  We could reuse the bounds to
+			 * avoid _bt_compare() calls for known equal tuples, but it
+			 * doesn't seem worth it.  Workloads with heavy update activity
+			 * tend to have many deduplication passes, so we'll often avoid
+			 * most of those comparisons, too (we call _bt_compare() when the
+			 * posting list tuple is initially encountered, though not when
+			 * processing later TIDs from the same tuple).
 			 */
-			if (!ItemIdIsDead(curitemid))
+			if (!inposting)
+				curitemid = PageGetItemId(page, offset);
+			if (inposting || !ItemIdIsDead(curitemid))
 			{
 				ItemPointerData htid;
 				bool		all_dead;
 
-				if (_bt_compare(rel, itup_key, page, offset) != 0)
-					break;		/* we're past all the equal tuples */
+				if (!inposting)
+				{
+					/* Plain tuple, or first TID in posting list tuple */
+					if (_bt_compare(rel, itup_key, page, offset) != 0)
+						break;	/* we're past all the equal tuples */
 
-				/* okay, we gotta fetch the heap tuple ... */
-				curitup = (IndexTuple) PageGetItem(page, curitemid);
-				htid = curitup->t_tid;
+					/* Advanced curitup */
+					curitup = (IndexTuple) PageGetItem(page, curitemid);
+					Assert(!BTreeTupleIsPivot(curitup));
+				}
+
+				/* okay, we gotta fetch the heap tuple using htid ... */
+				if (!BTreeTupleIsPosting(curitup))
+				{
+					/* ... htid is from simple non-pivot tuple */
+					Assert(!inposting);
+					htid = curitup->t_tid;
+				}
+				else if (!inposting)
+				{
+					/* ... htid is first TID in new posting list */
+					inposting = true;
+					prevalldead = true;
+					curposti = 0;
+					htid = *BTreeTupleGetPostingN(curitup, 0);
+				}
+				else
+				{
+					/* ... htid is second or subsequent TID in posting list */
+					Assert(curposti > 0);
+					htid = *BTreeTupleGetPostingN(curitup, curposti);
+				}
 
 				/*
 				 * If we are doing a recheck, we expect to find the tuple we
@@ -506,8 +553,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					 * not part of this chain because it had a different index
 					 * entry.
 					 */
-					htid = itup->t_tid;
-					if (table_index_fetch_tuple_check(heapRel, &htid,
+					if (table_index_fetch_tuple_check(heapRel, &itup->t_tid,
 													  SnapshotSelf, NULL))
 					{
 						/* Normal case --- it's still live */
@@ -565,12 +611,14 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 													RelationGetRelationName(rel))));
 					}
 				}
-				else if (all_dead)
+				else if (all_dead && (!inposting ||
+									  (prevalldead &&
+									   curposti == BTreeTupleGetNPosting(curitup) - 1)))
 				{
 					/*
-					 * The conflicting tuple (or whole HOT chain) is dead to
-					 * everyone, so we may as well mark the index entry
-					 * killed.
+					 * The conflicting tuple (or all HOT chains pointed to by
+					 * all posting list TIDs) is dead to everyone, so mark the
+					 * index entry killed.
 					 */
 					ItemIdMarkDead(curitemid);
 					opaque->btpo_flags |= BTP_HAS_GARBAGE;
@@ -584,14 +632,29 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					else
 						MarkBufferDirtyHint(insertstate->buf, true);
 				}
+
+				/*
+				 * Remember if posting list tuple has even a single HOT chain
+				 * whose members are not all dead
+				 */
+				if (!all_dead && inposting)
+					prevalldead = false;
 			}
 		}
 
-		/*
-		 * Advance to next tuple to continue checking.
-		 */
-		if (offset < maxoff)
+		if (inposting && curposti < BTreeTupleGetNPosting(curitup) - 1)
+		{
+			/* Advance to next TID in same posting list */
+			curposti++;
+			continue;
+		}
+		else if (offset < maxoff)
+		{
+			/* Advance to next tuple */
+			curposti = 0;
+			inposting = false;
 			offset = OffsetNumberNext(offset);
+		}
 		else
 		{
 			int			highkeycmp;
@@ -606,7 +669,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 			/* Advance to next non-dead page --- there must be one */
 			for (;;)
 			{
-				nblkno = opaque->btpo_next;
+				BlockNumber nblkno = opaque->btpo_next;
+
 				nbuf = _bt_relandgetbuf(rel, nbuf, nblkno, BT_READ);
 				page = BufferGetPage(nbuf);
 				opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -616,6 +680,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					elog(ERROR, "fell off the end of index \"%s\"",
 						 RelationGetRelationName(rel));
 			}
+			/* Will also advance to next tuple */
+			curposti = 0;
+			inposting = false;
 			maxoff = PageGetMaxOffsetNumber(page);
 			offset = P_FIRSTDATAKEY(opaque);
 			/* Don't invalidate binary search bounds */
@@ -684,6 +751,7 @@ _bt_findinsertloc(Relation rel,
 	BTScanInsert itup_key = insertstate->itup_key;
 	Page		page = BufferGetPage(insertstate->buf);
 	BTPageOpaque lpageop;
+	OffsetNumber newitemoff;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -699,6 +767,8 @@ _bt_findinsertloc(Relation rel,
 
 	if (itup_key->heapkeyspace)
 	{
+		bool		dedupunique = false;
+
 		/*
 		 * If we're inserting into a unique index, we may have to walk right
 		 * through leaf pages to find the one leaf page that we must insert on
@@ -712,9 +782,25 @@ _bt_findinsertloc(Relation rel,
 		 * tuple belongs on.  The heap TID attribute for new tuple (scantid)
 		 * could force us to insert on a sibling page, though that should be
 		 * very rare in practice.
+		 *
+		 * checkingunique inserters that encounter a duplicate will apply
+		 * deduplication when it looks like there will be a page split, but
+		 * there is no LP_DEAD garbage on the leaf page to vacuum away (or
+		 * there wasn't enough space freed by LP_DEAD cleanup).  This
+		 * complements the opportunistic LP_DEAD vacuuming mechanism.  The
+		 * high level goal is to avoid page splits caused by new, unchanged
+		 * versions of existing logical rows altogether.  See nbtree/README
+		 * for full details.
 		 */
 		if (checkingunique)
 		{
+			if (insertstate->low < insertstate->stricthigh)
+			{
+				/* Encountered a duplicate in _bt_check_unique() */
+				Assert(insertstate->bounds_valid);
+				dedupunique = true;
+			}
+
 			for (;;)
 			{
 				/*
@@ -741,18 +827,37 @@ _bt_findinsertloc(Relation rel,
 				/* Update local state after stepping right */
 				page = BufferGetPage(insertstate->buf);
 				lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+				/* Assume duplicates (helpful when initial page is empty) */
+				dedupunique = true;
 			}
 		}
 
 		/*
 		 * If the target page is full, see if we can obtain enough space by
-		 * erasing LP_DEAD items
+		 * erasing LP_DEAD items.  If that doesn't work out, try to obtain
+		 * enough free space to avoid a page split by deduplicating existing
+		 * items (if deduplication is safe).
 		 */
-		if (PageGetFreeSpace(page) < insertstate->itemsz &&
-			P_HAS_GARBAGE(lpageop))
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
 		{
-			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
-			insertstate->bounds_valid = false;
+			if (P_HAS_GARBAGE(lpageop))
+			{
+				_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+				insertstate->bounds_valid = false;
+
+				/* Might as well assume duplicates if checkingunique */
+				dedupunique = true;
+			}
+
+			if (itup_key->safededup && BTGetUseDedup(rel) &&
+				PageGetFreeSpace(page) < insertstate->itemsz &&
+				(!checkingunique || dedupunique))
+			{
+				_bt_dedup_one_page(rel, insertstate->buf, heapRel,
+								   insertstate->itup, insertstate->itemsz,
+								   checkingunique);
+				insertstate->bounds_valid = false;
+			}
 		}
 	}
 	else
@@ -834,7 +939,36 @@ _bt_findinsertloc(Relation rel,
 	Assert(P_RIGHTMOST(lpageop) ||
 		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
 
-	return _bt_binsrch_insert(rel, insertstate);
+	newitemoff = _bt_binsrch_insert(rel, insertstate);
+
+	if (insertstate->postingoff == -1)
+	{
+		/*
+		 * There is an overlapping posting list tuple with its LP_DEAD bit
+		 * set.  _bt_insertonpg() cannot handle this, so delete all LP_DEAD
+		 * items early.  This is the only case where LP_DEAD deletes happen
+		 * even though a page split wouldn't take place if we went straight to
+		 * the _bt_insertonpg() call.
+		 *
+		 * Call _bt_dedup_one_page() instead of _bt_vacuum_one_page() to force
+		 * deletes (this avoids relying on the BTP_HAS_GARBAGE hint flag,
+		 * which might be falsely unset).  Call can't actually dedup items,
+		 * since we pass a newitemsz of 0.
+		 */
+		_bt_dedup_one_page(rel, insertstate->buf, heapRel, insertstate->itup,
+						   0, true);
+
+		/*
+		 * Do new binary search.  New insert location cannot overlap with any
+		 * posting list now.
+		 */
+		insertstate->bounds_valid = false;
+		insertstate->postingoff = 0;
+		newitemoff = _bt_binsrch_insert(rel, insertstate);
+		Assert(insertstate->postingoff == 0);
+	}
+
+	return newitemoff;
 }
 
 /*
@@ -900,10 +1034,12 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		This recursive procedure does the following things:
  *
+ *			+  if postingoff != 0, splits existing posting list tuple
+ *			   (since it overlaps with new 'itup' tuple).
  *			+  if necessary, splits the target page, using 'itup_key' for
  *			   suffix truncation on leaf pages (caller passes NULL for
  *			   non-leaf pages).
- *			+  inserts the tuple.
+ *			+  inserts the new tuple (might be split from posting list).
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
  *			   right using information stored in the parent stack).
@@ -931,11 +1067,15 @@ _bt_insertonpg(Relation rel,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
+			   int postingoff,
 			   bool split_only_page)
 {
 	Page		page;
 	BTPageOpaque lpageop;
 	Size		itemsz;
+	IndexTuple	oposting;
+	IndexTuple	origitup = NULL;
+	IndexTuple	nposting = NULL;
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -949,6 +1089,7 @@ _bt_insertonpg(Relation rel,
 	Assert(P_ISLEAF(lpageop) ||
 		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
+	Assert(!BTreeTupleIsPosting(itup));
 
 	/* The caller should've finished any incomplete splits already. */
 	if (P_INCOMPLETE_SPLIT(lpageop))
@@ -959,6 +1100,34 @@ _bt_insertonpg(Relation rel,
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
+	/*
+	 * Do we need to split an existing posting list item?
+	 */
+	if (postingoff != 0)
+	{
+		ItemId		itemid = PageGetItemId(page, newitemoff);
+
+		/*
+		 * The new tuple is a duplicate with a heap TID that falls inside the
+		 * range of an existing posting list tuple on a leaf page.  Prepare to
+		 * split an existing posting list.  Overwriting the posting list with
+		 * its post-split version is treated as an extra step in either the
+		 * insert or page split critical section.
+		 */
+		Assert(P_ISLEAF(lpageop) && !ItemIdIsDead(itemid));
+		Assert(itup_key->heapkeyspace && itup_key->safededup);
+		oposting = (IndexTuple) PageGetItem(page, itemid);
+
+		/* use a mutable copy of itup as our itup from here on */
+		origitup = itup;
+		itup = CopyIndexTuple(origitup);
+		nposting = _bt_swap_posting(itup, oposting, postingoff);
+		/* itup now contains rightmost/max TID from oposting */
+
+		/* Alter offset so that newitem goes after posting list */
+		newitemoff = OffsetNumberNext(newitemoff);
+	}
+
 	/*
 	 * Do we need to split the page to fit the item on it?
 	 *
@@ -991,7 +1160,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+						 origitup, nposting, postingoff);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1066,6 +1236,9 @@ _bt_insertonpg(Relation rel,
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
+		if (postingoff != 0)
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
 		if (!_bt_pgaddtup(page, itemsz, itup, newitemoff))
 			elog(PANIC, "failed to add new item to block %u in index \"%s\"",
 				 itup_blkno, RelationGetRelationName(rel));
@@ -1115,8 +1288,19 @@ _bt_insertonpg(Relation rel,
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
 
-			if (P_ISLEAF(lpageop))
+			if (P_ISLEAF(lpageop) && postingoff == 0)
+			{
+				/* Simple leaf insert */
 				xlinfo = XLOG_BTREE_INSERT_LEAF;
+			}
+			else if (postingoff != 0)
+			{
+				/*
+				 * Leaf insert with posting list split.  Must include
+				 * postingoff field before newitem/orignewitem.
+				 */
+				xlinfo = XLOG_BTREE_INSERT_POST;
+			}
 			else
 			{
 				/*
@@ -1139,6 +1323,7 @@ _bt_insertonpg(Relation rel,
 				xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 				xlmeta.last_cleanup_num_heap_tuples =
 					metad->btm_last_cleanup_num_heap_tuples;
+				xlmeta.safededup = metad->btm_safededup;
 
 				XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 				XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));
@@ -1147,7 +1332,27 @@ _bt_insertonpg(Relation rel,
 			}
 
 			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
-			XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+			if (postingoff == 0)
+			{
+				/* Simple, common case -- log itup from caller */
+				XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+			}
+			else
+			{
+				/*
+				 * Insert with posting list split (XLOG_BTREE_INSERT_POST
+				 * record) case.
+				 *
+				 * Log postingoff.  Also log origitup, not itup.  REDO routine
+				 * must reconstruct final itup (as well as nposting) using
+				 * _bt_swap_posting().
+				 */
+				uint16		upostingoff = postingoff;
+
+				XLogRegisterBufData(0, (char *) &upostingoff, sizeof(uint16));
+				XLogRegisterBufData(0, (char *) origitup,
+									IndexTupleSize(origitup));
+			}
 
 			recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
@@ -1189,6 +1394,14 @@ _bt_insertonpg(Relation rel,
 			_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
 			RelationSetTargetBlock(rel, cachedBlock);
 	}
+
+	/* be tidy */
+	if (postingoff != 0)
+	{
+		/* itup is actually a modified copy of caller's original */
+		pfree(nposting);
+		pfree(itup);
+	}
 }
 
 /*
@@ -1204,12 +1417,24 @@ _bt_insertonpg(Relation rel,
  *		This function will clear the INCOMPLETE_SPLIT flag on it, and
  *		release the buffer.
  *
+ *		orignewitem, nposting, and postingoff are needed when an insert of
+ *		orignewitem results in both a posting list split and a page split.
+ *		These extra posting list split details are used here in the same
+ *		way as they are used in the more common case where a posting list
+ *		split does not coincide with a page split.  We need to deal with
+ *		posting list splits directly in order to ensure that everything
+ *		that follows from the insert of orignewitem is handled as a single
+ *		atomic operation (though caller's insert of a new pivot/downlink
+ *		into parent page will still be a separate operation).  See
+ *		nbtree/README for details on the design of posting list splits.
+ *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
  */
 static Buffer
 _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+		  IndexTuple orignewitem, IndexTuple nposting, uint16 postingoff)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1229,6 +1454,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	OffsetNumber leftoff,
 				rightoff;
 	OffsetNumber firstright;
+	OffsetNumber origpagepostingoff;
 	OffsetNumber maxoff;
 	OffsetNumber i;
 	bool		newitemonleft,
@@ -1298,6 +1524,34 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	PageSetLSN(leftpage, PageGetLSN(origpage));
 	isleaf = P_ISLEAF(oopaque);
 
+	/*
+	 * Determine page offset number of existing overlapped-with-orignewitem
+	 * posting list when it is necessary to perform a posting list split in
+	 * passing.  Note that newitem was already changed by caller (newitem no
+	 * longer has the orignewitem TID).
+	 *
+	 * This page offset number (origpagepostingoff) will be used to pretend
+	 * that the posting split has already taken place, even though the
+	 * required modifications to origpage won't occur until we reach the
+	 * critical section.  The lastleft and firstright tuples of our page split
+	 * point should, in effect, come from an imaginary version of origpage
+	 * that has the nposting tuple instead of the original posting list tuple.
+	 *
+	 * Note: _bt_findsplitloc() should have compensated for coinciding posting
+	 * list splits in just the same way, at least in theory.  It doesn't
+	 * bother with that, though.  In practice it won't affect its choice of
+	 * split point.
+	 */
+	origpagepostingoff = InvalidOffsetNumber;
+	if (postingoff != 0)
+	{
+		Assert(isleaf);
+		Assert(ItemPointerCompare(&orignewitem->t_tid,
+								  &newitem->t_tid) < 0);
+		Assert(BTreeTupleIsPosting(nposting));
+		origpagepostingoff = OffsetNumberPrev(newitemoff);
+	}
+
 	/*
 	 * The "high key" for the new left page will be the first key that's going
 	 * to go into the new right page, or a truncated version if this is a leaf
@@ -1335,6 +1589,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, firstright);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		if (firstright == origpagepostingoff)
+			item = nposting;
 	}
 
 	/*
@@ -1368,6 +1624,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
 			itemid = PageGetItemId(origpage, lastleftoff);
 			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+			if (lastleftoff == origpagepostingoff)
+				lastleft = nposting;
 		}
 
 		Assert(lastleft != item);
@@ -1383,6 +1641,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	 */
 	leftoff = P_HIKEY;
 
+	Assert(BTreeTupleIsPivot(lefthikey) || !itup_key->heapkeyspace);
 	Assert(BTreeTupleGetNAtts(lefthikey, rel) > 0);
 	Assert(BTreeTupleGetNAtts(lefthikey, rel) <= indnkeyatts);
 	if (PageAddItem(leftpage, (Item) lefthikey, itemsz, leftoff,
@@ -1447,6 +1706,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		Assert(BTreeTupleIsPivot(item) || !itup_key->heapkeyspace);
 		Assert(BTreeTupleGetNAtts(item, rel) > 0);
 		Assert(BTreeTupleGetNAtts(item, rel) <= indnkeyatts);
 		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
@@ -1475,8 +1735,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 
+		/* replace original item with nposting due to posting split? */
+		if (i == origpagepostingoff)
+		{
+			Assert(BTreeTupleIsPosting(item));
+			Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+			item = nposting;
+		}
+
 		/* does new item belong before this one? */
-		if (i == newitemoff)
+		else if (i == newitemoff)
 		{
 			if (newitemonleft)
 			{
@@ -1645,8 +1913,12 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		XLogRecPtr	recptr;
 
 		xlrec.level = ropaque->btpo.level;
+		/* See comments below on newitem, orignewitem, and posting lists */
 		xlrec.firstright = firstright;
 		xlrec.newitemoff = newitemoff;
+		xlrec.postingoff = 0;
+		if (postingoff != 0 && origpagepostingoff < firstright)
+			xlrec.postingoff = postingoff;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1665,11 +1937,35 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		 * because it's included with all the other items on the right page.)
 		 * Show the new item as belonging to the left page buffer, so that it
 		 * is not stored if XLogInsert decides it needs a full-page image of
-		 * the left page.  We store the offset anyway, though, to support
-		 * archive compression of these records.
+		 * the left page.  We always store newitemoff in the record, though.
+		 *
+		 * The details are sometimes slightly different for page splits that
+		 * coincide with a posting list split.  If both the replacement
+		 * posting list and newitem go on the right page, then we don't need
+		 * to log anything extra, just like the simple !newitemonleft
+		 * no-posting-split case (postingoff is set to zero in the WAL record,
+		 * so recovery doesn't need to process a posting list split at all).
+		 * Otherwise, we set postingoff and log orignewitem instead of
+		 * newitem, despite having actually inserted newitem.  REDO routine
+		 * must reconstruct nposting and newitem using _bt_swap_posting().
+		 *
+		 * Note: It's possible that our page split point is the point that
+		 * makes the posting list lastleft and newitem firstright.  This is
+		 * the only case where we log orignewitem/newitem despite newitem
+		 * going on the right page.  If XLogInsert decides that it can omit
+		 * orignewitem due to logging a full-page image of the left page,
+		 * everything still works out, since recovery only needs to log
+		 * orignewitem for items on the left page (just like the regular
+		 * newitem-logged case).
 		 */
-		if (newitemonleft)
+		if (newitemonleft && xlrec.postingoff == 0)
 			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+		else if (xlrec.postingoff != 0)
+		{
+			Assert(newitemonleft || firstright == newitemoff);
+			Assert(MAXALIGN(newitemsz) == IndexTupleSize(orignewitem));
+			XLogRegisterBufData(0, (char *) orignewitem, MAXALIGN(newitemsz));
+		}
 
 		/* Log the left page's new high key */
 		itemid = PageGetItemId(origpage, P_HIKEY);
@@ -1829,7 +2125,7 @@ _bt_insert_parent(Relation rel,
 
 		/* Recursively insert into the parent */
 		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
-					   new_item, stack->bts_offset + 1,
+					   new_item, stack->bts_offset + 1, 0,
 					   is_only);
 
 		/* be tidy */
@@ -2185,6 +2481,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 		md.fastlevel = metad->btm_level;
 		md.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 		md.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+		md.safededup = metad->btm_safededup;
 
 		XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -2265,7 +2562,7 @@ _bt_pgaddtup(Page page,
 static void
 _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 {
-	OffsetNumber deletable[MaxOffsetNumber];
+	OffsetNumber deletable[MaxIndexTuplesPerPage];
 	int			ndeletable = 0;
 	OffsetNumber offnum,
 				minoff,
@@ -2298,6 +2595,6 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * Note: if we didn't find any LP_DEAD items, then the page's
 	 * BTP_HAS_GARBAGE hint bit is falsely set.  We do not bother expending a
 	 * separate write to clear it, however.  We will clear it when we split
-	 * the page.
+	 * the page, or when deduplication runs.
 	 */
 }
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index f05cbe7467..72b3921119 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -37,6 +38,8 @@ static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
 static bool _bt_mark_page_halfdead(Relation rel, Buffer buf, BTStack stack);
 static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
 									 bool *rightsib_empty);
+static TransactionId _bt_xid_horizon(Relation rel, Relation heapRel, Page page,
+									 OffsetNumber *deletable, int ndeletable);
 static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
 								   BTStack stack, Buffer *topparent, OffsetNumber *topoff,
 								   BlockNumber *target, BlockNumber *rightsib);
@@ -47,7 +50,8 @@ static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
  *	_bt_initmetapage() -- Fill a page buffer with a correct metapage image
  */
 void
-_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
+_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+				 bool safededup)
 {
 	BTMetaPageData *metad;
 	BTPageOpaque metaopaque;
@@ -63,6 +67,7 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
 	metad->btm_fastlevel = level;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+	metad->btm_safededup = safededup;
 
 	metaopaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	metaopaque->btpo_flags = BTP_META;
@@ -102,6 +107,9 @@ _bt_upgrademetapage(Page page)
 	metad->btm_version = BTREE_NOVAC_VERSION;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+	/* Only a REINDEX can set this field */
+	Assert(!metad->btm_safededup);
+	metad->btm_safededup = false;
 
 	/* Adjust pd_lower (see _bt_initmetapage() for details) */
 	((PageHeader) page)->pd_lower =
@@ -213,6 +221,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 		md.fastlevel = metad->btm_fastlevel;
 		md.oldest_btpo_xact = oldestBtpoXact;
 		md.last_cleanup_num_heap_tuples = numHeapTuples;
+		md.safededup = metad->btm_safededup;
 
 		XLogRegisterBufData(0, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -274,6 +283,8 @@ _bt_getroot(Relation rel, int access)
 		Assert(metad->btm_magic == BTREE_MAGIC);
 		Assert(metad->btm_version >= BTREE_MIN_VERSION);
 		Assert(metad->btm_version <= BTREE_VERSION);
+		Assert(!metad->btm_safededup ||
+			   metad->btm_version > BTREE_NOVAC_VERSION);
 		Assert(metad->btm_root != P_NONE);
 
 		rootblkno = metad->btm_fastroot;
@@ -394,6 +405,7 @@ _bt_getroot(Relation rel, int access)
 			md.fastlevel = 0;
 			md.oldest_btpo_xact = InvalidTransactionId;
 			md.last_cleanup_num_heap_tuples = -1.0;
+			md.safededup = metad->btm_safededup;
 
 			XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -618,22 +630,33 @@ _bt_getrootheight(Relation rel)
 	Assert(metad->btm_magic == BTREE_MAGIC);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
 	Assert(metad->btm_version <= BTREE_VERSION);
+	Assert(!metad->btm_safededup || metad->btm_version > BTREE_NOVAC_VERSION);
 	Assert(metad->btm_fastroot != P_NONE);
 
 	return metad->btm_fastlevel;
 }
 
 /*
- *	_bt_heapkeyspace() -- is heap TID being treated as a key?
+ *	_bt_metaversion() -- Get version/status info from metapage.
+ *
+ *		Sets caller's *heapkeyspace and *safededup arguments using data from
+ *		the B-Tree metapage (could be locally-cached version).  This
+ *		information needs to be stashed in insertion scankey, so we provide a
+ *		single function that fetches both at once.
  *
  *		This is used to determine the rules that must be used to descend a
  *		btree.  Version 4 indexes treat heap TID as a tiebreaker attribute.
  *		pg_upgrade'd version 3 indexes need extra steps to preserve reasonable
  *		performance when inserting a new BTScanInsert-wise duplicate tuple
  *		among many leaf pages already full of such duplicates.
+ *
+ *		Also sets field that indicates to caller whether or not it is safe to
+ *		apply deduplication within index.  Note that we rely on the assumption
+ *		that btm_safededup will be zero'ed on heapkeyspace indexes that were
+ *		pg_upgrade'd from Postgres 12.
  */
-bool
-_bt_heapkeyspace(Relation rel)
+void
+_bt_metaversion(Relation rel, bool *heapkeyspace, bool *safededup)
 {
 	BTMetaPageData *metad;
 
@@ -651,10 +674,11 @@ _bt_heapkeyspace(Relation rel)
 		 */
 		if (metad->btm_root == P_NONE)
 		{
-			uint32		btm_version = metad->btm_version;
+			*heapkeyspace = metad->btm_version > BTREE_NOVAC_VERSION;
+			*safededup = metad->btm_safededup;
 
 			_bt_relbuf(rel, metabuf);
-			return btm_version > BTREE_NOVAC_VERSION;
+			return;
 		}
 
 		/*
@@ -678,9 +702,11 @@ _bt_heapkeyspace(Relation rel)
 	Assert(metad->btm_magic == BTREE_MAGIC);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
 	Assert(metad->btm_version <= BTREE_VERSION);
+	Assert(!metad->btm_safededup || metad->btm_version > BTREE_NOVAC_VERSION);
 	Assert(metad->btm_fastroot != P_NONE);
 
-	return metad->btm_version > BTREE_NOVAC_VERSION;
+	*heapkeyspace = metad->btm_version > BTREE_NOVAC_VERSION;
+	*safededup = metad->btm_safededup;
 }
 
 /*
@@ -964,28 +990,88 @@ _bt_page_recyclable(Page page)
  * Delete item(s) from a btree leaf page during VACUUM.
  *
  * This routine assumes that the caller has a super-exclusive write lock on
- * the buffer.  Also, the given deletable array *must* be sorted in ascending
- * order.
+ * the buffer.  Also, the given deletable and updatable arrays *must* be
+ * sorted in ascending order.
+ *
+ * Routine deals with deleting TIDs when some (but not all) of the heap TIDs
+ * in an existing posting list item are to be removed by VACUUM.  This works
+ * by updating/overwriting an existing item with caller's new version of the
+ * item (a version that lacks the TIDs that are to be deleted).
  *
  * We record VACUUMs and b-tree deletes differently in WAL.  Deletes must
  * generate their own latestRemovedXid by accessing the heap directly, whereas
- * VACUUMs rely on the initial heap scan taking care of it indirectly.
+ * VACUUMs rely on the initial heap scan taking care of it indirectly.  Also,
+ * only VACUUM can perform granular deletes of individual TIDs in posting list
+ * tuples.
  */
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
-					OffsetNumber *deletable, int ndeletable)
+					OffsetNumber *deletable, int ndeletable,
+					OffsetNumber *updatable, IndexTuple *updated,
+					int nupdatable)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	IndexTuple	itup;
+	Size		itemsz;
+	char	   *updatedbuf = NULL;
+	Size		updatedbuflen = 0;
 
 	/* Shouldn't be called unless there's something to do */
-	Assert(ndeletable > 0);
+	Assert(ndeletable > 0 || nupdatable > 0);
+
+	/* XLOG stuff -- allocate and fill buffer before critical section */
+	if (nupdatable > 0 && RelationNeedsWAL(rel))
+	{
+		Size		offset = 0;
+
+		for (int i = 0; i < nupdatable; i++)
+		{
+			itup = updated[i];
+			itemsz = MAXALIGN(IndexTupleSize(itup));
+			updatedbuflen += itemsz;
+		}
+
+		updatedbuf = palloc(updatedbuflen);
+		for (int i = 0; i < nupdatable; i++)
+		{
+			itup = updated[i];
+			itemsz = MAXALIGN(IndexTupleSize(itup));
+			memcpy(updatedbuf + offset, itup, itemsz);
+			offset += itemsz;
+		}
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
-	/* Fix the page */
-	PageIndexMultiDelete(page, deletable, ndeletable);
+	/*
+	 * Handle posting tuple updates.
+	 *
+	 * Deliberately do this before handling simple deletes.  If we did it the
+	 * other way around (i.e. WAL record order -- simple deletes before
+	 * updates) then we'd have to make compensating changes to the 'updatable'
+	 * array of offset numbers.
+	 *
+	 * PageIndexTupleOverwrite() won't unset each item's LP_DEAD bit when it
+	 * happens to already be set.  Although we unset the BTP_HAS_GARBAGE page
+	 * level flag, unsetting individual LP_DEAD bits should still be avoided.
+	 */
+	for (int i = 0; i < nupdatable; i++)
+	{
+		OffsetNumber offnum = updatable[i];
+
+		itup = updated[i];
+		itemsz = MAXALIGN(IndexTupleSize(itup));
+
+		if (!PageIndexTupleOverwrite(page, offnum, (Item) itup, itemsz))
+			elog(PANIC, "could not update partially dead item in block %u of index \"%s\"",
+				 BufferGetBlockNumber(buf), RelationGetRelationName(rel));
+	}
+
+	/* Now handle simple deletes of entire tuples */
+	if (ndeletable > 0)
+		PageIndexMultiDelete(page, deletable, ndeletable);
 
 	/*
 	 * We can clear the vacuum cycle ID since this page has certainly been
@@ -1006,7 +1092,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	 * limited, since we never falsely unset an LP_DEAD bit.  Workloads that
 	 * are particularly dependent on LP_DEAD bits being set quickly will
 	 * usually manage to set the BTP_HAS_GARBAGE flag before the page fills up
-	 * again anyway.
+	 * again anyway.  Furthermore, attempting a deduplication pass will remove
+	 * all LP_DEAD items, regardless of whether the BTP_HAS_GARBAGE hint bit
+	 * is set or not.
 	 */
 	opaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 
@@ -1019,18 +1107,22 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.ndeleted = ndeletable;
+		xlrec_vacuum.nupdated = nupdatable;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
 		XLogRegisterData((char *) &xlrec_vacuum, SizeOfBtreeVacuum);
 
-		/*
-		 * The deletable array is not in the buffer, but pretend that it is.
-		 * When XLogInsert stores the whole buffer, the array need not be
-		 * stored too.
-		 */
-		XLogRegisterBufData(0, (char *) deletable,
-							ndeletable * sizeof(OffsetNumber));
+		if (ndeletable > 0)
+			XLogRegisterBufData(0, (char *) deletable,
+								ndeletable * sizeof(OffsetNumber));
+
+		if (nupdatable > 0)
+		{
+			XLogRegisterBufData(0, (char *) updatable,
+								nupdatable * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, updatedbuf, updatedbuflen);
+		}
 
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
@@ -1038,6 +1130,10 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	}
 
 	END_CRIT_SECTION();
+
+	/* can't leak memory here */
+	if (updatedbuf != NULL)
+		pfree(updatedbuf);
 }
 
 /*
@@ -1050,6 +1146,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
  * This is nearly the same as _bt_delitems_vacuum as far as what it does to
  * the page, but it needs to generate its own latestRemovedXid by accessing
  * the heap.  This is used by the REDO routine to generate recovery conflicts.
+ * Also, it doesn't handle posting list tuples unless the entire tuple can be
+ * deleted as a whole (since there is only one LP_DEAD bit per line pointer).
  */
 void
 _bt_delitems_delete(Relation rel, Buffer buf,
@@ -1065,8 +1163,7 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 
 	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
 		latestRemovedXid =
-			index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
-												 deletable, ndeletable);
+			_bt_xid_horizon(rel, heapRel, page, deletable, ndeletable);
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
@@ -1113,6 +1210,83 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 	END_CRIT_SECTION();
 }
 
+/*
+ * Get the latestRemovedXid from the table entries pointed to by the non-pivot
+ * tuples being deleted.
+ *
+ * This is a specialized version of index_compute_xid_horizon_for_tuples().
+ * It's needed because btree tuples don't always store table TID using the
+ * standard index tuple header field.
+ */
+static TransactionId
+_bt_xid_horizon(Relation rel, Relation heapRel, Page page,
+				OffsetNumber *deletable, int ndeletable)
+{
+	TransactionId latestRemovedXid = InvalidTransactionId;
+	int			spacenhtids;
+	int			nhtids;
+	ItemPointer htids;
+
+	/* Array will grow iff there are posting list tuples to consider */
+	spacenhtids = ndeletable;
+	nhtids = 0;
+	htids = (ItemPointer) palloc(sizeof(ItemPointerData) * spacenhtids);
+	for (int i = 0; i < ndeletable; i++)
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+
+		itemid = PageGetItemId(page, deletable[i]);
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(ItemIdIsDead(itemid));
+		Assert(!BTreeTupleIsPivot(itup));
+
+		if (!BTreeTupleIsPosting(itup))
+		{
+			if (nhtids + 1 > spacenhtids)
+			{
+				spacenhtids *= 2;
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * spacenhtids);
+			}
+
+			Assert(ItemPointerIsValid(&itup->t_tid));
+			ItemPointerCopy(&itup->t_tid, &htids[nhtids]);
+			nhtids++;
+		}
+		else
+		{
+			int			nposting = BTreeTupleGetNPosting(itup);
+
+			if (nhtids + nposting > spacenhtids)
+			{
+				spacenhtids = Max(spacenhtids * 2, nhtids + nposting);
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * spacenhtids);
+			}
+
+			for (int j = 0; j < nposting; j++)
+			{
+				ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+				Assert(ItemPointerIsValid(htid));
+				ItemPointerCopy(htid, &htids[nhtids]);
+				nhtids++;
+			}
+		}
+	}
+
+	Assert(nhtids >= ndeletable);
+
+	latestRemovedXid =
+		table_compute_xid_horizon_for_tuples(heapRel, htids, nhtids);
+
+	pfree(htids);
+
+	return latestRemovedXid;
+}
+
 /*
  * Returns true, if the given block has the half-dead flag set.
  */
@@ -2058,6 +2232,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 			xlmeta.fastlevel = metad->btm_fastlevel;
 			xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 			xlmeta.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+			xlmeta.safededup = metad->btm_safededup;
 
 			XLogRegisterBufData(4, (char *) &xlmeta, sizeof(xl_btree_metadata));
 			xlinfo = XLOG_BTREE_UNLINK_PAGE_META;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 8376a5e6b7..912819b9db 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -95,6 +95,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static ItemPointer btreevacuumposting(BTVacState *vstate, IndexTuple posting,
+									  int *nremaining);
 
 
 /*
@@ -158,7 +160,7 @@ btbuildempty(Relation index)
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, P_NONE, 0);
+	_bt_initmetapage(metapage, P_NONE, 0, _bt_opclasses_support_dedup(index));
 
 	/*
 	 * Write the page and log it.  It might seem that an immediate sync would
@@ -261,8 +263,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 				 */
 				if (so->killedItems == NULL)
 					so->killedItems = (int *)
-						palloc(MaxIndexTuplesPerPage * sizeof(int));
-				if (so->numKilled < MaxIndexTuplesPerPage)
+						palloc(MaxTIDsPerBTreePage * sizeof(int));
+				if (so->numKilled < MaxTIDsPerBTreePage)
 					so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 			}
 
@@ -1151,11 +1153,16 @@ restart:
 	}
 	else if (P_ISLEAF(opaque))
 	{
-		OffsetNumber deletable[MaxOffsetNumber];
+		OffsetNumber deletable[MaxIndexTuplesPerPage];
 		int			ndeletable;
+		IndexTuple	updated[MaxIndexTuplesPerPage];
+		OffsetNumber updatable[MaxIndexTuplesPerPage];
+		int			nupdatable;
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
+		int			nhtidsdead,
+					nhtidslive;
 
 		/*
 		 * Trade in the initial read lock for a super-exclusive write lock on
@@ -1187,8 +1194,11 @@ restart:
 		 * point using callback.
 		 */
 		ndeletable = 0;
+		nupdatable = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
+		nhtidsdead = 0;
+		nhtidslive = 0;
 		if (callback)
 		{
 			for (offnum = minoff;
@@ -1196,11 +1206,9 @@ restart:
 				 offnum = OffsetNumberNext(offnum))
 			{
 				IndexTuple	itup;
-				ItemPointer htup;
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
 				/*
 				 * Hot Standby assumes that it's okay that XLOG_BTREE_VACUUM
@@ -1223,22 +1231,86 @@ restart:
 				 * simple, and allows us to always avoid generating our own
 				 * conflicts.
 				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				Assert(!BTreeTupleIsPivot(itup));
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Regular tuple, standard table TID representation */
+					if (callback(&itup->t_tid, callback_state))
+					{
+						deletable[ndeletable++] = offnum;
+						nhtidsdead++;
+					}
+					else
+						nhtidslive++;
+				}
+				else
+				{
+					ItemPointer newhtids;
+					int			nremaining;
+
+					/* Posting list tuple */
+					newhtids = btreevacuumposting(vstate, itup, &nremaining);
+					if (newhtids == NULL)
+					{
+						/*
+						 * All table TIDs from the posting tuple remain, so no
+						 * delete or update required
+						 */
+						Assert(nremaining == BTreeTupleGetNPosting(itup));
+					}
+					else if (nremaining > 0)
+					{
+						IndexTuple	updatedtuple;
+
+						/*
+						 * Form new tuple that contains only remaining TIDs.
+						 * Remember this new tuple and the offset of the tuple
+						 * to be updated for the page's _bt_delitems_vacuum()
+						 * call.
+						 */
+						Assert(nremaining < BTreeTupleGetNPosting(itup));
+						updatedtuple = _bt_form_posting(itup, newhtids,
+														nremaining);
+						updated[nupdatable] = updatedtuple;
+						updatable[nupdatable++] = offnum;
+						nhtidsdead += BTreeTupleGetNPosting(itup) - nremaining;
+						pfree(newhtids);
+					}
+					else
+					{
+						/*
+						 * All table TIDs from the posting list must be
+						 * deleted.  We'll delete the index tuple completely
+						 * (no update).
+						 */
+						Assert(nremaining == 0);
+						deletable[ndeletable++] = offnum;
+						nhtidsdead += BTreeTupleGetNPosting(itup);
+						pfree(newhtids);
+					}
+
+					nhtidslive += nremaining;
+				}
 			}
 		}
 
 		/*
-		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
-		 * call per page, so as to minimize WAL traffic.
+		 * Apply any needed deletes or updates.  We issue just one
+		 * _bt_delitems_vacuum() call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nupdatable > 0)
 		{
-			_bt_delitems_vacuum(rel, buf, deletable, ndeletable);
+			Assert(nhtidsdead >= Max(ndeletable, 1));
+			_bt_delitems_vacuum(rel, buf, deletable, ndeletable, updatable,
+								updated, nupdatable);
 
-			stats->tuples_removed += ndeletable;
+			stats->tuples_removed += nhtidsdead;
 			/* must recompute maxoff */
 			maxoff = PageGetMaxOffsetNumber(page);
+
+			/* can't leak memory here */
+			for (int i = 0; i < nupdatable; i++)
+				pfree(updated[i]);
 		}
 		else
 		{
@@ -1251,6 +1323,7 @@ restart:
 			 * We treat this like a hint-bit update because there's no need to
 			 * WAL-log it.
 			 */
+			Assert(nhtidsdead == 0);
 			if (vstate->cycleid != 0 &&
 				opaque->btpo_cycleid == vstate->cycleid)
 			{
@@ -1260,15 +1333,18 @@ restart:
 		}
 
 		/*
-		 * If it's now empty, try to delete; else count the live tuples. We
-		 * don't delete when recursing, though, to avoid putting entries into
-		 * freePages out-of-order (doesn't seem worth any extra code to handle
-		 * the case).
+		 * If it's now empty, try to delete; else count the live tuples (live
+		 * table TIDs in posting lists are counted as separate live tuples).
+		 * We don't delete when recursing, though, to avoid putting entries
+		 * into freePages out-of-order (doesn't seem worth any extra code to
+		 * handle the case).
 		 */
 		if (minoff > maxoff)
 			delete_now = (blkno == orig_blkno);
 		else
-			stats->num_index_tuples += maxoff - minoff + 1;
+			stats->num_index_tuples += nhtidslive;
+
+		Assert(!delete_now || nhtidslive == 0);
 	}
 
 	if (delete_now)
@@ -1300,9 +1376,10 @@ restart:
 	/*
 	 * This is really tail recursion, but if the compiler is too stupid to
 	 * optimize it as such, we'd eat an uncomfortably large amount of stack
-	 * space per recursion level (due to the deletable[] array). A failure is
-	 * improbable since the number of levels isn't likely to be large ... but
-	 * just in case, let's hand-optimize into a loop.
+	 * space per recursion level (due to the arrays used to track details of
+	 * deletable/updatable items).  A failure is improbable since the number
+	 * of levels isn't likely to be large ...  but just in case, let's
+	 * hand-optimize into a loop.
 	 */
 	if (recurse_to != P_NONE)
 	{
@@ -1311,6 +1388,67 @@ restart:
 	}
 }
 
+/*
+ * btreevacuumposting --- determine TIDs still needed in posting list
+ *
+ * Returns new palloc'd array of item pointers needed to build
+ * replacement posting list tuple without the TIDs that VACUUM needs to
+ * delete.  Returned value is NULL in the common case no changes are
+ * needed in caller's posting list tuple (we avoid allocating memory
+ * here as an optimization).
+ *
+ * The number of TIDs that should remain in the posting list tuple is
+ * set for caller in *nremaining.  This indicates the number of elements
+ * in the returned array (assuming that return value isn't just NULL).
+ */
+static ItemPointer
+btreevacuumposting(BTVacState *vstate, IndexTuple posting, int *nremaining)
+{
+	int			live = 0;
+	int			nitem = BTreeTupleGetNPosting(posting);
+	ItemPointer tmpitems = NULL,
+				items = BTreeTupleGetPosting(posting);
+
+	for (int i = 0; i < nitem; i++)
+	{
+		if (!vstate->callback(items + i, vstate->callback_state))
+		{
+			/*
+			 * Live table TID.
+			 *
+			 * Only save live TID when we already know that we're going to
+			 * have to kill at least one TID, and have already allocated
+			 * memory.
+			 */
+			if (tmpitems)
+				tmpitems[live] = items[i];
+			live++;
+		}
+		else if (tmpitems == NULL)
+		{
+			/*
+			 * First dead table TID encountered.
+			 *
+			 * It's now clear that we need to delete one or more dead table
+			 * TIDs, so start maintaining an array of live TIDs for caller to
+			 * reconstruct smaller replacement posting list tuple
+			 */
+			tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+			/* Copy live TIDs skipped in previous iterations, if any */
+			if (live > 0)
+				memcpy(tmpitems, items, sizeof(ItemPointerData) * live);
+		}
+		else
+		{
+			/* Second or subsequent dead table TID */
+		}
+	}
+
+	*nremaining = live;
+	return tmpitems;
+}
+
 /*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c573814f01..c8c8ee057d 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int	_bt_binsrch_posting(BTScanInsert key, Page page,
+								OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static int	_bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+								  OffsetNumber offnum, ItemPointer heapTid,
+								  IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+									   OffsetNumber offnum,
+									   ItemPointer heapTid, int tupleOffset);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -142,6 +150,7 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
 		offnum = _bt_binsrch(rel, key, *bufP);
 		itemid = PageGetItemId(page, offnum);
 		itup = (IndexTuple) PageGetItem(page, itemid);
+		Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
 		blkno = BTreeTupleGetDownLink(itup);
 		par_blkno = BufferGetBlockNumber(*bufP);
 
@@ -434,7 +443,10 @@ _bt_binsrch(Relation rel,
  * low) makes bounds invalid.
  *
  * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by a posting
+ * list split).
  */
 OffsetNumber
 _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +465,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 	Assert(P_ISLEAF(opaque));
 	Assert(!key->nextkey);
+	Assert(insertstate->postingoff == 0);
 
 	if (!insertstate->bounds_valid)
 	{
@@ -509,6 +522,16 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 			if (result != 0)
 				stricthigh = high;
 		}
+
+		/*
+		 * If tuple at offset located by binary search is a posting list whose
+		 * TID range overlaps with caller's scantid, perform posting list
+		 * binary search to set postingoff for caller.  Caller must split the
+		 * posting list when postingoff is set.  This should happen
+		 * infrequently.
+		 */
+		if (unlikely(result == 0 && key->scantid != NULL))
+			insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
 	}
 
 	/*
@@ -528,6 +551,73 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	return low;
 }
 
+/*----------
+ *	_bt_binsrch_posting() -- posting list binary search.
+ *
+ * Helper routine for _bt_binsrch_insert().
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+	IndexTuple	itup;
+	ItemId		itemid;
+	int			low,
+				high,
+				mid,
+				res;
+
+	/*
+	 * If this isn't a posting tuple, then the index must be corrupt (if it is
+	 * an ordinary non-pivot tuple then there must be an existing tuple with a
+	 * heap TID that equals inserter's new heap TID/scantid).  Defensively
+	 * check that tuple is a posting list tuple whose posting list range
+	 * includes caller's scantid.
+	 *
+	 * (This is also needed because contrib/amcheck's rootdescend option needs
+	 * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+	 */
+	itemid = PageGetItemId(page, offnum);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+	if (!BTreeTupleIsPosting(itup))
+		return 0;
+
+	Assert(key->heapkeyspace && key->safededup);
+
+	/*
+	 * In the event that posting list tuple has LP_DEAD bit set, indicate this
+	 * to _bt_binsrch_insert() caller by returning -1, a sentinel value.  A
+	 * second call to _bt_binsrch_insert() can take place when its caller has
+	 * removed the dead item.
+	 */
+	if (ItemIdIsDead(itemid))
+		return -1;
+
+	/* "high" is past end of posting list for loop invariant */
+	low = 0;
+	high = BTreeTupleGetNPosting(itup);
+	Assert(high >= 2);
+
+	while (high > low)
+	{
+		mid = low + ((high - low) / 2);
+		res = ItemPointerCompare(key->scantid,
+								 BTreeTupleGetPostingN(itup, mid));
+
+		if (res > 0)
+			low = mid + 1;
+		else if (res < 0)
+			high = mid;
+		else
+			return mid;
+	}
+
+	/* Exact match not found */
+	return low;
+}
+
 /*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
@@ -537,9 +627,14 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
  *			<0 if scankey < tuple at offnum;
  *			 0 if scankey == tuple at offnum;
  *			>0 if scankey > tuple at offnum.
- *		NULLs in the keys are treated as sortable values.  Therefore
- *		"equality" does not necessarily mean that the item should be
- *		returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values.  Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key.  Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid.  There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
  *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
@@ -563,6 +658,7 @@ _bt_compare(Relation rel,
 	ScanKey		scankey;
 	int			ncmpkey;
 	int			ntupatts;
+	int32		result;
 
 	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +693,6 @@ _bt_compare(Relation rel,
 	{
 		Datum		datum;
 		bool		isNull;
-		int32		result;
 
 		datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
 
@@ -713,8 +808,25 @@ _bt_compare(Relation rel,
 	if (heapTid == NULL)
 		return 1;
 
+	/*
+	 * Scankey must be treated as equal to a posting list tuple if its scantid
+	 * value falls within the range of the posting list.  In all other cases
+	 * there can only be a single heap TID value, which is compared directly
+	 * as a simple scalar value.
+	 */
 	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	return ItemPointerCompare(key->scantid, heapTid);
+	result = ItemPointerCompare(key->scantid, heapTid);
+	if (result <= 0 || !BTreeTupleIsPosting(itup))
+		return result;
+	else
+	{
+		result = ItemPointerCompare(key->scantid,
+									BTreeTupleGetMaxHeapTID(itup));
+		if (result > 0)
+			return 1;
+	}
+
+	return 0;
 }
 
 /*
@@ -1229,7 +1341,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	}
 
 	/* Initialize remaining insertion scan key fields */
-	inskey.heapkeyspace = _bt_heapkeyspace(rel);
+	_bt_metaversion(rel, &inskey.heapkeyspace, &inskey.safededup);
 	inskey.anynullkeys = false; /* unused */
 	inskey.nextkey = nextkey;
 	inskey.pivotsearch = false;
@@ -1484,9 +1596,35 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
-				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				/* tuple passes all scan key conditions */
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Remember it */
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+				else
+				{
+					int			tupleOffset;
+
+					/*
+					 * Set up state to return posting list, and remember first
+					 * TID
+					 */
+					tupleOffset =
+						_bt_setuppostingitems(so, itemIndex, offnum,
+											  BTreeTupleGetPostingN(itup, 0),
+											  itup);
+					itemIndex++;
+					/* Remember additional TIDs */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											tupleOffset);
+						itemIndex++;
+					}
+				}
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1519,7 +1657,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxTIDsPerBTreePage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1527,7 +1665,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxTIDsPerBTreePage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1568,9 +1706,41 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 										 &continuescan);
 			if (passes_quals && tuple_alive)
 			{
-				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				/* tuple passes all scan key conditions */
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Remember it */
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+				else
+				{
+					int			tupleOffset;
+
+					/*
+					 * Set up state to return posting list, and remember first
+					 * TID.
+					 *
+					 * Note that we deliberately save/return items from
+					 * posting lists in ascending heap TID order for backwards
+					 * scans.  This allows _bt_killitems() to make a
+					 * consistent assumption about the order of items
+					 * associated with the same posting list tuple.
+					 */
+					itemIndex--;
+					tupleOffset =
+						_bt_setuppostingitems(so, itemIndex, offnum,
+											  BTreeTupleGetPostingN(itup, 0),
+											  itup);
+					/* Remember additional TIDs */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						itemIndex--;
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											tupleOffset);
+					}
+				}
 			}
 			if (!continuescan)
 			{
@@ -1584,8 +1754,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
+		so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1768,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPivot(itup) && !BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1610,6 +1782,71 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 	}
 }
 
+/*
+ * Setup state to save TIDs/items from a single posting list tuple.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for TID that is
+ * returned to scan first.  Second or subsequent TIDs for posting list should
+ * be saved by calling _bt_savepostingitem().
+ *
+ * Returns an offset into tuple storage space that main tuple is stored at if
+ * needed.
+ */
+static int
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					  ItemPointer heapTid, IndexTuple itup)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	Assert(BTreeTupleIsPosting(itup));
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+	if (so->currTuples)
+	{
+		/* Save base IndexTuple (truncate posting list) */
+		IndexTuple	base;
+		Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+		itupsz = MAXALIGN(itupsz);
+		currItem->tupleOffset = so->currPos.nextTupleOffset;
+		base = (IndexTuple) (so->currTuples + so->currPos.nextTupleOffset);
+		memcpy(base, itup, itupsz);
+		/* Defensively reduce work area index tuple header size */
+		base->t_info &= ~INDEX_SIZE_MASK;
+		base->t_info |= itupsz;
+		so->currPos.nextTupleOffset += itupsz;
+
+		return currItem->tupleOffset;
+	}
+
+	return 0;
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for current posting
+ * tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.  Caller passes its return value as tupleOffset.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer heapTid, int tupleOffset)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+
+	/*
+	 * Have index-only scans return the same base IndexTuple for every TID
+	 * that originates from the same posting list
+	 */
+	if (so->currTuples)
+		currItem->tupleOffset = tupleOffset;
+}
+
 /*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f163491d60..dddfe14d59 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -243,6 +243,7 @@ typedef struct BTPageState
 	BlockNumber btps_blkno;		/* block # to write this page at */
 	IndexTuple	btps_lowkey;	/* page's strict lower bound pivot tuple */
 	OffsetNumber btps_lastoff;	/* last item offset loaded */
+	Size		btps_lastextra; /* last item's extra posting list space */
 	uint32		btps_level;		/* tree level (0 = leaf) */
 	Size		btps_full;		/* "full" if less than this much free space */
 	struct BTPageState *btps_next;	/* link to parent level, if any */
@@ -277,7 +278,10 @@ static void _bt_slideleft(Page page);
 static void _bt_sortaddtup(Page page, Size itemsize,
 						   IndexTuple itup, OffsetNumber itup_off);
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
-						 IndexTuple itup);
+						 IndexTuple itup, Size truncextra);
+static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
+										  BTPageState *state,
+										  BTDedupState dstate);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
@@ -711,6 +715,7 @@ _bt_pagestate(BTWriteState *wstate, uint32 level)
 	state->btps_lowkey = NULL;
 	/* initialize lastoff so first item goes into P_FIRSTKEY */
 	state->btps_lastoff = P_HIKEY;
+	state->btps_lastextra = 0;
 	state->btps_level = level;
 	/* set "full" threshold based on level.  See notes at head of file. */
 	if (level > 0)
@@ -789,7 +794,8 @@ _bt_sortaddtup(Page page,
 }
 
 /*----------
- * Add an item to a disk page from the sort output.
+ * Add an item to a disk page from the sort output (or add a posting list
+ * item formed from the sort output).
  *
  * We must be careful to observe the page layout conventions of nbtsearch.c:
  * - rightmost pages start data items at P_HIKEY instead of at P_FIRSTKEY.
@@ -821,14 +827,27 @@ _bt_sortaddtup(Page page,
  * the truncated high key at offset 1.
  *
  * 'last' pointer indicates the last offset added to the page.
+ *
+ * 'truncextra' is the size of the posting list in itup, if any.  This
+ * information is stashed for the next call here, when we may benefit
+ * from considering the impact of truncating away the posting list on
+ * the page before deciding to finish the page off.  Posting lists are
+ * often relatively large, so it is worth going to the trouble of
+ * accounting for the saving from truncating away the posting list of
+ * the tuple that becomes the high key (that may be the only way to
+ * get close to target free space on the page).  Note that this is
+ * only used for the soft fillfactor-wise limit, not the critical hard
+ * limit.
  *----------
  */
 static void
-_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
+_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
+			 Size truncextra)
 {
 	Page		npage;
 	BlockNumber nblkno;
 	OffsetNumber last_off;
+	Size		last_truncextra;
 	Size		pgspc;
 	Size		itupsz;
 	bool		isleaf;
@@ -842,6 +861,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	npage = state->btps_page;
 	nblkno = state->btps_blkno;
 	last_off = state->btps_lastoff;
+	last_truncextra = state->btps_lastextra;
+	state->btps_lastextra = truncextra;
 
 	pgspc = PageGetFreeSpace(npage);
 	itupsz = IndexTupleSize(itup);
@@ -883,10 +904,10 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	 * page.  Disregard fillfactor and insert on "full" current page if we
 	 * don't have the minimum number of items yet.  (Note that we deliberately
 	 * assume that suffix truncation neither enlarges nor shrinks new high key
-	 * when applying soft limit.)
+	 * when applying soft limit, except when last tuple had a posting list.)
 	 */
 	if (pgspc < itupsz + (isleaf ? MAXALIGN(sizeof(ItemPointerData)) : 0) ||
-		(pgspc < state->btps_full && last_off > P_FIRSTKEY))
+		(pgspc + last_truncextra < state->btps_full && last_off > P_FIRSTKEY))
 	{
 		/*
 		 * Finish off the page and write it out.
@@ -944,11 +965,14 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * We don't try to bias our choice of split point to make it more
 			 * likely that _bt_truncate() can truncate away more attributes,
 			 * whereas the split point used within _bt_split() is chosen much
-			 * more delicately.  Suffix truncation is mostly useful because it
-			 * improves space utilization for workloads with random
-			 * insertions.  It doesn't seem worthwhile to add logic for
-			 * choosing a split point here for a benefit that is bound to be
-			 * much smaller.
+			 * more delicately.  Even still, the lastleft and firstright
+			 * tuples passed to _bt_truncate() here are at least not fully
+			 * equal to each other when deduplication is used, unless there is
+			 * a large group of duplicates (also, unique index builds usually
+			 * have few or no spool2 duplicates).  When the split point is
+			 * between two unequal tuples, _bt_truncate() will avoid including
+			 * a heap TID in the new high key, which is the most important
+			 * benefit of suffix truncation.
 			 *
 			 * Overwrite the old item with new truncated high key directly.
 			 * oitup is already located at the physical beginning of tuple
@@ -983,7 +1007,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(BTreeTupleGetNAtts(state->btps_lowkey, wstate->index) == 0 ||
 			   !P_LEFTMOST((BTPageOpaque) PageGetSpecialPointer(opage)));
 		BTreeTupleSetDownLink(state->btps_lowkey, oblkno);
-		_bt_buildadd(wstate, state->btps_next, state->btps_lowkey);
+		_bt_buildadd(wstate, state->btps_next, state->btps_lowkey, 0);
 		pfree(state->btps_lowkey);
 
 		/*
@@ -1045,6 +1069,43 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	state->btps_lastoff = last_off;
 }
 
+/*
+ * Finalize pending posting list tuple, and add it to the index.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * This is almost like _bt_dedup_finish_pending(), but it adds a new tuple
+ * using _bt_buildadd().
+ */
+static void
+_bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
+							  BTDedupState dstate)
+{
+	Assert(dstate->nitems > 0);
+
+	if (dstate->nitems == 1)
+		_bt_buildadd(wstate, state, dstate->base, 0);
+	else
+	{
+		IndexTuple	postingtuple;
+		Size		truncextra;
+
+		/* form a tuple with a posting list */
+		postingtuple = _bt_form_posting(dstate->base,
+										dstate->htids,
+										dstate->nhtids);
+		/* Calculate posting list overhead */
+		truncextra = IndexTupleSize(postingtuple) -
+			BTreeTupleGetPostingOffset(postingtuple);
+
+		_bt_buildadd(wstate, state, postingtuple, truncextra);
+		pfree(postingtuple);
+	}
+
+	dstate->nhtids = 0;
+	dstate->nitems = 0;
+	dstate->phystupsize = 0;
+}
+
 /*
  * Finish writing out the completed btree.
  */
@@ -1090,7 +1151,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 			Assert(BTreeTupleGetNAtts(s->btps_lowkey, wstate->index) == 0 ||
 				   !P_LEFTMOST(opaque));
 			BTreeTupleSetDownLink(s->btps_lowkey, blkno);
-			_bt_buildadd(wstate, s->btps_next, s->btps_lowkey);
+			_bt_buildadd(wstate, s->btps_next, s->btps_lowkey, 0);
 			pfree(s->btps_lowkey);
 			s->btps_lowkey = NULL;
 		}
@@ -1111,7 +1172,8 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 	 * by filling in a valid magic number in the metapage.
 	 */
 	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, rootblkno, rootlevel);
+	_bt_initmetapage(metapage, rootblkno, rootlevel,
+					 wstate->inskey->safededup);
 	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
 }
 
@@ -1132,6 +1194,9 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		deduplicate;
+
+	deduplicate = wstate->inskey->safededup && BTGetUseDedup(wstate->index);
 
 	if (merge)
 	{
@@ -1228,12 +1293,12 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 
 			if (load1)
 			{
-				_bt_buildadd(wstate, state, itup);
+				_bt_buildadd(wstate, state, itup, 0);
 				itup = tuplesort_getindextuple(btspool->sortstate, true);
 			}
 			else
 			{
-				_bt_buildadd(wstate, state, itup2);
+				_bt_buildadd(wstate, state, itup2, 0);
 				itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
 			}
 
@@ -1243,9 +1308,100 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		}
 		pfree(sortKeys);
 	}
+	else if (deduplicate)
+	{
+		/* merge is unnecessary, deduplicate into posting lists */
+		BTDedupState dstate;
+
+		dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
+		dstate->maxpostingsize = 0; /* set later */
+		dstate->checkingunique = false; /* unused */
+		dstate->skippedbase = InvalidOffsetNumber;
+		/* Metadata about base tuple of current pending posting list */
+		dstate->base = NULL;
+		dstate->baseoff = InvalidOffsetNumber;	/* unused */
+		dstate->basetupsize = 0;
+		/* Metadata about current pending posting list TIDs */
+		dstate->htids = NULL;
+		dstate->nhtids = 0;
+		dstate->nitems = 0;
+		dstate->phystupsize = 0;	/* unused */
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate,
+											   true)) != NULL)
+		{
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+			{
+				state = _bt_pagestate(wstate, 0);
+
+				/*
+				 * Limit size of posting list tuples to 1/10 space we want to
+				 * leave behind on the page, plus space for final item's line
+				 * pointer.  This is equal to the space that we'd like to
+				 * leave behind on each leaf page when fillfactor is 90,
+				 * allowing us to get close to fillfactor% space utilization
+				 * when there happen to be a great many duplicates.  (This
+				 * makes higher leaf fillfactor settings ineffective when
+				 * building indexes that have many duplicates, but packing
+				 * leaf pages full with few very large tuples doesn't seem
+				 * like a useful goal.)
+				 */
+				dstate->maxpostingsize = MAXALIGN_DOWN((BLCKSZ * 10 / 100)) -
+					sizeof(ItemIdData);
+				Assert(dstate->maxpostingsize <= BTMaxItemSize(state->btps_page) &&
+					   dstate->maxpostingsize <= INDEX_SIZE_MASK);
+				dstate->htids = palloc(dstate->maxpostingsize);
+
+				/* start new pending posting list with itup copy */
+				_bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
+										InvalidOffsetNumber);
+			}
+			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
+										 itup) > keysz &&
+					 _bt_dedup_save_htid(dstate, itup))
+			{
+				/*
+				 * Tuple is equal to base tuple of pending posting list.  Heap
+				 * TID from itup has been saved in state.
+				 */
+			}
+			else
+			{
+				/*
+				 * Tuple is not equal to pending posting list tuple, or
+				 * _bt_dedup_save_htid() opted to not merge current item into
+				 * pending posting list.
+				 */
+				_bt_sort_dedup_finish_pending(wstate, state, dstate);
+				pfree(dstate->base);
+
+				/* start new pending posting list with itup copy */
+				_bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
+										InvalidOffsetNumber);
+			}
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+
+		if (state)
+		{
+			/*
+			 * Handle the last item (there must be a last item when the
+			 * tuplesort returned one or more tuples)
+			 */
+			_bt_sort_dedup_finish_pending(wstate, state, dstate);
+			pfree(dstate->base);
+			pfree(dstate->htids);
+		}
+
+		pfree(dstate);
+	}
 	else
 	{
-		/* merge is unnecessary */
+		/* merging and deduplication are both unnecessary */
 		while ((itup = tuplesort_getindextuple(btspool->sortstate,
 											   true)) != NULL)
 		{
@@ -1253,7 +1409,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			if (state == NULL)
 				state = _bt_pagestate(wstate, 0);
 
-			_bt_buildadd(wstate, state, itup);
+			_bt_buildadd(wstate, state, itup, 0);
 
 			/* Report progress */
 			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 76c2d945c8..8ba055be9e 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
 	state.minfirstrightsz = SIZE_MAX;
 	state.newitemoff = newitemoff;
 
+	/* newitem cannot be a posting list item */
+	Assert(!BTreeTupleIsPosting(newitem));
+
 	/*
 	 * maxsplits should never exceed maxoff because there will be at most as
 	 * many candidate split points as there are points _between_ tuples, once
@@ -459,6 +462,7 @@ _bt_recsplitloc(FindSplitData *state,
 	int16		leftfree,
 				rightfree;
 	Size		firstrightitemsz;
+	Size		postingsz = 0;
 	bool		newitemisfirstonright;
 
 	/* Is the new item going to be the first item on the right page? */
@@ -468,8 +472,30 @@ _bt_recsplitloc(FindSplitData *state,
 	if (newitemisfirstonright)
 		firstrightitemsz = state->newitemsz;
 	else
+	{
 		firstrightitemsz = firstoldonrightsz;
 
+		/*
+		 * Calculate suffix truncation space saving when firstright is a
+		 * posting list tuple, though only when the firstright is over 64
+		 * bytes including line pointer overhead (arbitrary).  This avoids
+		 * accessing the tuple in cases where its posting list must be very
+		 * small (if firstright has one at all).
+		 */
+		if (state->is_leaf && firstrightitemsz > 64)
+		{
+			ItemId		itemid;
+			IndexTuple	newhighkey;
+
+			itemid = PageGetItemId(state->page, firstoldonright);
+			newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+			if (BTreeTupleIsPosting(newhighkey))
+				postingsz = IndexTupleSize(newhighkey) -
+					BTreeTupleGetPostingOffset(newhighkey);
+		}
+	}
+
 	/* Account for all the old tuples */
 	leftfree = state->leftspace - olddataitemstoleft;
 	rightfree = state->rightspace -
@@ -491,11 +517,17 @@ _bt_recsplitloc(FindSplitData *state,
 	 * If we are on the leaf level, assume that suffix truncation cannot avoid
 	 * adding a heap TID to the left half's new high key when splitting at the
 	 * leaf level.  In practice the new high key will often be smaller and
-	 * will rarely be larger, but conservatively assume the worst case.
+	 * will rarely be larger, but conservatively assume the worst case.  We do
+	 * go to the trouble of subtracting away posting list overhead, though
+	 * only when it looks like it will make an appreciable difference.
+	 * (Posting lists are the only case where truncation will typically make
+	 * the final high key far smaller than firstright, so being a bit more
+	 * precise there noticeably improves the balance of free space.)
 	 */
 	if (state->is_leaf)
 		leftfree -= (int16) (firstrightitemsz +
-							 MAXALIGN(sizeof(ItemPointerData)));
+							 MAXALIGN(sizeof(ItemPointerData)) -
+							 postingsz);
 	else
 		leftfree -= (int16) firstrightitemsz;
 
@@ -691,7 +723,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
 	tup = (IndexTuple) PageGetItem(state->page, itemid);
 	/* Do cheaper test first */
-	if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+	if (BTreeTupleIsPosting(tup) ||
+		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
 	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 5ab4e712f1..5ed09640ad 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -20,6 +20,7 @@
 #include "access/nbtree.h"
 #include "access/reloptions.h"
 #include "access/relscan.h"
+#include "catalog/catalog.h"
 #include "commands/progress.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -107,7 +108,13 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	 */
 	key = palloc(offsetof(BTScanInsertData, scankeys) +
 				 sizeof(ScanKeyData) * indnkeyatts);
-	key->heapkeyspace = itup == NULL || _bt_heapkeyspace(rel);
+	if (itup)
+		_bt_metaversion(rel, &key->heapkeyspace, &key->safededup);
+	else
+	{
+		key->heapkeyspace = true;
+		key->safededup = _bt_opclasses_support_dedup(rel);
+	}
 	key->anynullkeys = false;	/* initial assumption */
 	key->nextkey = false;
 	key->pivotsearch = false;
@@ -1373,6 +1380,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			continue;
 		}
 
@@ -1534,6 +1542,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			cmpresult = 0;
 			if (subkey->sk_flags & SK_ROW_END)
 				break;
@@ -1773,10 +1782,65 @@ _bt_killitems(IndexScanDesc scan)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
+			bool		killtuple = false;
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			if (BTreeTupleIsPosting(ituple))
 			{
-				/* found the item */
+				int			pi = i + 1;
+				int			nposting = BTreeTupleGetNPosting(ituple);
+				int			j;
+
+				/*
+				 * Note that we rely on the assumption that heap TIDs in the
+				 * scanpos items array are always in ascending heap TID order
+				 * within a posting list
+				 */
+				for (j = 0; j < nposting; j++)
+				{
+					ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+					if (!ItemPointerEquals(item, &kitem->heapTid))
+						break;	/* out of posting list loop */
+
+					/* kitem must have matching offnum when heap TIDs match */
+					Assert(kitem->indexOffset == offnum);
+
+					/*
+					 * Read-ahead to later kitems here.
+					 *
+					 * We rely on the assumption that not advancing kitem here
+					 * will prevent us from considering the posting list tuple
+					 * fully dead by not matching its next heap TID in next
+					 * loop iteration.
+					 *
+					 * If, on the other hand, this is the final heap TID in
+					 * the posting list tuple, then tuple gets killed
+					 * regardless (i.e. we handle the case where the last
+					 * kitem is also the last heap TID in the last index tuple
+					 * correctly -- posting tuple still gets killed).
+					 */
+					if (pi < numKilled)
+						kitem = &so->currPos.items[so->killedItems[pi++]];
+				}
+
+				/*
+				 * Don't bother advancing the outermost loop's int iterator to
+				 * avoid processing killed items that relate to the same
+				 * offnum/posting list tuple.  This micro-optimization hardly
+				 * seems worth it.  (Further iterations of the outermost loop
+				 * will fail to match on this same posting list's first heap
+				 * TID instead, so we'll advance to the next offnum/index
+				 * tuple pretty quickly.)
+				 */
+				if (j == nposting)
+					killtuple = true;
+			}
+			else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+				killtuple = true;
+
+			if (killtuple)
+			{
+				/* found the item/all posting list items */
 				ItemIdMarkDead(iid);
 				killedsomething = true;
 				break;			/* out of inner search loop */
@@ -2017,7 +2081,9 @@ btoptions(Datum reloptions, bool validate)
 	static const relopt_parse_elt tab[] = {
 		{"fillfactor", RELOPT_TYPE_INT, offsetof(BTOptions, fillfactor)},
 		{"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
-		offsetof(BTOptions, vacuum_cleanup_index_scale_factor)}
+		offsetof(BTOptions, vacuum_cleanup_index_scale_factor)},
+		{"deduplicate_items", RELOPT_TYPE_BOOL,
+		offsetof(BTOptions, deduplicate_items)}
 
 	};
 
@@ -2118,11 +2184,10 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	Size		newsize;
 
 	/*
-	 * We should only ever truncate leaf index tuples.  It's never okay to
-	 * truncate a second time.
+	 * We should only ever truncate non-pivot tuples from leaf pages.  It's
+	 * never okay to truncate when splitting an internal page.
 	 */
-	Assert(BTreeTupleGetNAtts(lastleft, rel) == natts);
-	Assert(BTreeTupleGetNAtts(firstright, rel) == natts);
+	Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
 
 	/* Determine how many attributes must be kept in truncated tuple */
 	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
@@ -2138,6 +2203,19 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(pivot))
+		{
+			/*
+			 * index_truncate_tuple() just returns a straight copy of
+			 * firstright when it has no key attributes to truncate.  We need
+			 * to truncate away the posting list ourselves.
+			 */
+			Assert(keepnatts == nkeyatts);
+			Assert(natts == nkeyatts);
+			pivot->t_info &= ~INDEX_SIZE_MASK;
+			pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+		}
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2154,6 +2232,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * attribute to the new pivot tuple.
 		 */
 		Assert(natts != nkeyatts);
+		Assert(!BTreeTupleIsPosting(lastleft) &&
+			   !BTreeTupleIsPosting(firstright));
 		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
 		tidpivot = palloc0(newsize);
 		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2171,6 +2251,19 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
 		pivot = palloc0(newsize);
 		memcpy(pivot, firstright, IndexTupleSize(firstright));
+
+		if (BTreeTupleIsPosting(firstright))
+		{
+			/*
+			 * New pivot tuple was copied from firstright, which happens to be
+			 * a posting list tuple.  We will have to include the max lastleft
+			 * heap TID in the final pivot tuple, but we can remove the
+			 * posting list now. (Pivot tuples should never contain a posting
+			 * list.)
+			 */
+			newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+				MAXALIGN(sizeof(ItemPointerData));
+		}
 	}
 
 	/*
@@ -2198,7 +2291,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2209,9 +2302,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
+							  BTreeTupleGetHeapTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #else
 
 	/*
@@ -2224,7 +2320,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2233,7 +2329,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2314,13 +2411,16 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
  * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal after detoasting.
+ * unless they're bitwise equal after detoasting.  When an index is considered
+ * deduplication-safe by _bt_opclasses_support_dedup, routine is guaranteed to
+ * give the same result as _bt_keep_natts would.
  *
- * These issues must be acceptable to callers, typically because they're only
- * concerned about making suffix truncation as effective as possible without
- * leaving excessive amounts of free space on either side of page split.
  * Callers can rely on the fact that attributes considered equal here are
- * definitely also equal according to _bt_keep_natts.
+ * definitely also equal according to _bt_keep_natts, even when the index uses
+ * an opclass or collation that is not deduplication-safe.  This weaker
+ * guarantee is good enough for nbtsplitloc.c caller, since false negatives
+ * generally only have the effect of making leaf page splits use a more
+ * balanced split point.
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2392,28 +2492,42 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * Mask allocated for number of keys in index tuple must be able to fit
 	 * maximum possible number of index attributes
 	 */
-	StaticAssertStmt(BT_N_KEYS_OFFSET_MASK >= INDEX_MAX_KEYS,
-					 "BT_N_KEYS_OFFSET_MASK can't fit INDEX_MAX_KEYS");
+	StaticAssertStmt(BT_OFFSET_MASK >= INDEX_MAX_KEYS,
+					 "BT_OFFSET_MASK can't fit INDEX_MAX_KEYS");
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
+	/* !heapkeyspace indexes do not support deduplication */
+	if (!heapkeyspace && BTreeTupleIsPosting(itup))
+		return false;
+
+	/* Posting list tuples should never have "pivot heap TID" bit set */
+	if (BTreeTupleIsPosting(itup) &&
+		(ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+		 BT_PIVOT_HEAP_TID_ATTR) != 0)
+		return false;
+
+	/* INCLUDE indexes do not support deduplication */
+	if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+		return false;
+
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
 			/*
-			 * Non-pivot tuples currently never use alternative heap TID
-			 * representation -- even those within heapkeyspace indexes
+			 * Non-pivot tuple should never be explicitly marked as a pivot
+			 * tuple
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated.  (Note that tupnatts must have been
-			 * inferred, rather than coming from an explicit on-disk
-			 * representation.)
+			 * inferred, even with a posting list tuple, because only pivot
+			 * tuples store tupnatts directly.)
 			 */
 			return tupnatts == natts;
 		}
@@ -2457,12 +2571,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * non-zero, or when there is no explicit representation and the
 			 * tuple is evidently not a pre-pg_upgrade tuple.
 			 *
-			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
-			 * that to decide if the tuple is a pre-v11 tuple.
+			 * Prior to v11, downlinks always had P_HIKEY as their offset.
+			 * Accept that as an alternative indication of a valid
+			 * !heapkeyspace negative infinity tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
-				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+				ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
 		}
 		else
 		{
@@ -2488,7 +2602,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
+		return false;
+
+	/* Pivot tuple should not use posting list representation (redundant) */
+	if (BTreeTupleIsPosting(itup))
 		return false;
 
 	/*
@@ -2558,11 +2676,53 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					BTMaxItemSizeNoHeapTid(page),
 					RelationGetRelationName(rel)),
 			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
-					   ItemPointerGetBlockNumber(&newtup->t_tid),
-					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+					   ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
 					   RelationGetRelationName(heap)),
 			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
 					 "Consider a function index of an MD5 hash of the value, "
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Is it safe to perform deduplication for an index, given the opclasses and
+ * collations used?
+ *
+ * Returned value is stored in index metapage during index builds.  Function
+ * does not account for incompatibilities caused by index being on an earlier
+ * nbtree version.
+ */
+bool
+_bt_opclasses_support_dedup(Relation index)
+{
+	/* INCLUDE indexes don't support deduplication */
+	if (IndexRelationGetNumberOfAttributes(index) !=
+		IndexRelationGetNumberOfKeyAttributes(index))
+		return false;
+
+	/*
+	 * There is no reason why deduplication cannot be used with system catalog
+	 * indexes.  However, we deem it generally unsafe because it's not clear
+	 * how it could be disabled.  (ALTER INDEX is not supported with system
+	 * catalog indexes, so users have no way to set the storage parameter.)
+	 */
+	if (IsCatalogRelation(index))
+		return false;
+
+	for (int i = 0; i < IndexRelationGetNumberOfKeyAttributes(index); i++)
+	{
+		Oid			opfamily = index->rd_opfamily[i];
+		Oid			collation = index->rd_indcollation[i];
+
+		/* TODO add adequate check of opclasses and collations */
+		elog(DEBUG4, "index %s column i %d opfamilyOid %u collationOid %u",
+			 RelationGetRelationName(index), i, opfamily, collation);
+
+		/* NUMERIC btree opfamily OID is 1988 */
+		if (opfamily == 1988)
+			return false;
+	}
+
+	return true;
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 2e5202c2d6..1d1cd7b667 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -22,6 +22,9 @@
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/procarray.h"
+#include "utils/memutils.h"
+
+static MemoryContext opCtx;		/* working memory for operations */
 
 /*
  * _bt_restore_page -- re-enter all the index tuples on a page
@@ -111,6 +114,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
 	Assert(md->btm_version >= BTREE_NOVAC_VERSION);
 	md->btm_oldest_btpo_xact = xlrec->oldest_btpo_xact;
 	md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
+	md->btm_safededup = xlrec->safededup;
 
 	pageop = (BTPageOpaque) PageGetSpecialPointer(metapg);
 	pageop->btpo_flags = BTP_META;
@@ -156,7 +160,8 @@ _bt_clear_incomplete_split(XLogReaderState *record, uint8 block_id)
 }
 
 static void
-btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
+btree_xlog_insert(bool isleaf, bool ismeta, bool posting,
+				  XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	xl_btree_insert *xlrec = (xl_btree_insert *) XLogRecGetData(record);
@@ -181,9 +186,52 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 
 		page = BufferGetPage(buffer);
 
-		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
-						false, false) == InvalidOffsetNumber)
-			elog(PANIC, "btree_xlog_insert: failed to add item");
+		if (!posting)
+		{
+			/* Simple retail insertion */
+			if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+		}
+		else
+		{
+			ItemId		itemid;
+			IndexTuple	oposting,
+						newitem,
+						nposting;
+			uint16		postingoff;
+
+			/*
+			 * A posting list split occurred during leaf page insertion.  WAL
+			 * record data will start with an offset number representing the
+			 * point in an existing posting list that a split occurs at.
+			 *
+			 * Use _bt_swap_posting() to repeat posting list split steps from
+			 * primary.  Note that newitem from WAL record is 'orignewitem',
+			 * not the final version of newitem that is actually inserted on
+			 * page.
+			 */
+			postingoff = *((uint16 *) datapos);
+			datapos += sizeof(uint16);
+			datalen -= sizeof(uint16);
+
+			itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+			oposting = (IndexTuple) PageGetItem(page, itemid);
+
+			/* Use mutable, aligned newitem copy in _bt_swap_posting() */
+			Assert(isleaf && postingoff > 0);
+			newitem = CopyIndexTuple((IndexTuple) datapos);
+			nposting = _bt_swap_posting(newitem, oposting, postingoff);
+
+			/* Replace existing posting list with post-split version */
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+			/* Insert "final" new item (not orignewitem from WAL stream) */
+			Assert(IndexTupleSize(newitem) == datalen);
+			if (PageAddItem(page, (Item) newitem, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add posting split new item");
+		}
 
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
@@ -265,20 +313,38 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 		BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
 		OffsetNumber off;
 		IndexTuple	newitem = NULL,
-					left_hikey = NULL;
+					left_hikey = NULL,
+					nposting = NULL;
 		Size		newitemsz = 0,
 					left_hikeysz = 0;
 		Page		newlpage;
-		OffsetNumber leftoff;
+		OffsetNumber leftoff,
+					replacepostingoff = InvalidOffsetNumber;
 
 		datapos = XLogRecGetBlockData(record, 0, &datalen);
 
-		if (onleft)
+		if (onleft || xlrec->postingoff != 0)
 		{
 			newitem = (IndexTuple) datapos;
 			newitemsz = MAXALIGN(IndexTupleSize(newitem));
 			datapos += newitemsz;
 			datalen -= newitemsz;
+
+			if (xlrec->postingoff != 0)
+			{
+				ItemId		itemid;
+				IndexTuple	oposting;
+
+				/* Posting list must be at offset number before new item's */
+				replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+				/* Use mutable, aligned newitem copy in _bt_swap_posting() */
+				newitem = CopyIndexTuple(newitem);
+				itemid = PageGetItemId(lpage, replacepostingoff);
+				oposting = (IndexTuple) PageGetItem(lpage, itemid);
+				nposting = _bt_swap_posting(newitem, oposting,
+											xlrec->postingoff);
+			}
 		}
 
 		/*
@@ -308,8 +374,20 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 			Size		itemsz;
 			IndexTuple	item;
 
+			/* Add replacement posting list when required */
+			if (off == replacepostingoff)
+			{
+				Assert(onleft || xlrec->firstright == xlrec->newitemoff);
+				if (PageAddItem(newlpage, (Item) nposting,
+								MAXALIGN(IndexTupleSize(nposting)), leftoff,
+								false, false) == InvalidOffsetNumber)
+					elog(ERROR, "failed to add new posting list item to left page after split");
+				leftoff = OffsetNumberNext(leftoff);
+				continue;		/* don't insert oposting */
+			}
+
 			/* add the new item if it was inserted on left page */
-			if (onleft && off == xlrec->newitemoff)
+			else if (onleft && off == xlrec->newitemoff)
 			{
 				if (PageAddItem(newlpage, (Item) newitem, newitemsz, leftoff,
 								false, false) == InvalidOffsetNumber)
@@ -383,6 +461,56 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 	}
 }
 
+static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+	Buffer		buf;
+
+	if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+	{
+		Page		page = (Page) BufferGetPage(buf);
+		OffsetNumber offnum;
+		BTDedupState state;
+
+		state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+		/* Conservatively use larger maxpostingsize than primary */
+		state->maxpostingsize = BTMaxItemSize(page);
+		state->checkingunique = false;	/* unused */
+		state->skippedbase = InvalidOffsetNumber;
+		state->base = NULL;
+		state->baseoff = InvalidOffsetNumber;
+		state->basetupsize = 0;
+		state->htids = palloc(state->maxpostingsize);
+		state->nhtids = 0;
+		state->nitems = 0;
+		state->phystupsize = 0;
+
+		for (offnum = xlrec->baseoff;
+			 offnum < xlrec->baseoff + xlrec->nitems;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemid = PageGetItemId(page, offnum);
+			IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+			if (offnum == xlrec->baseoff)
+				_bt_dedup_start_pending(state, itup, offnum);
+			else if (!_bt_dedup_save_htid(state, itup))
+				elog(ERROR, "could not add heap tid to pending posting list");
+		}
+
+		Assert(state->nitems == xlrec->nitems);
+		_bt_dedup_finish_pending(buf, state, false);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buf);
+	}
+
+	if (BufferIsValid(buf))
+		UnlockReleaseBuffer(buf);
+}
+
 static void
 btree_xlog_vacuum(XLogReaderState *record)
 {
@@ -405,7 +533,32 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		page = (Page) BufferGetPage(buffer);
 
-		PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
+		if (xlrec->nupdated > 0)
+		{
+			OffsetNumber *updatedoffsets;
+			IndexTuple	updated;
+			Size		itemsz;
+
+			updatedoffsets = (OffsetNumber *)
+				(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+			updated = (IndexTuple) ((char *) updatedoffsets +
+									xlrec->nupdated * sizeof(OffsetNumber));
+
+			for (int i = 0; i < xlrec->nupdated; i++)
+			{
+				Assert(BTreeTupleIsPosting(updated));
+				itemsz = MAXALIGN(IndexTupleSize(updated));
+
+				if (!PageIndexTupleOverwrite(page, updatedoffsets[i],
+											 (Item) updated, itemsz))
+					elog(PANIC, "could not update partially dead item");
+
+				updated = (IndexTuple) ((char *) updated + itemsz);
+			}
+		}
+
+		if (xlrec->ndeleted > 0)
+			PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 
 		/*
 		 * Mark the page as not containing any LP_DEAD items --- see comments
@@ -724,17 +877,22 @@ void
 btree_redo(XLogReaderState *record)
 {
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	MemoryContext oldCtx;
 
+	oldCtx = MemoryContextSwitchTo(opCtx);
 	switch (info)
 	{
 		case XLOG_BTREE_INSERT_LEAF:
-			btree_xlog_insert(true, false, record);
+			btree_xlog_insert(true, false, false, record);
 			break;
 		case XLOG_BTREE_INSERT_UPPER:
-			btree_xlog_insert(false, false, record);
+			btree_xlog_insert(false, false, false, record);
 			break;
 		case XLOG_BTREE_INSERT_META:
-			btree_xlog_insert(false, true, record);
+			btree_xlog_insert(false, true, false, record);
+			break;
+		case XLOG_BTREE_INSERT_POST:
+			btree_xlog_insert(true, false, true, record);
 			break;
 		case XLOG_BTREE_SPLIT_L:
 			btree_xlog_split(true, record);
@@ -742,6 +900,9 @@ btree_redo(XLogReaderState *record)
 		case XLOG_BTREE_SPLIT_R:
 			btree_xlog_split(false, record);
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			btree_xlog_dedup(record);
+			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
 			break;
@@ -767,6 +928,23 @@ btree_redo(XLogReaderState *record)
 		default:
 			elog(PANIC, "btree_redo: unknown op code %u", info);
 	}
+	MemoryContextSwitchTo(oldCtx);
+	MemoryContextReset(opCtx);
+}
+
+void
+btree_xlog_startup(void)
+{
+	opCtx = AllocSetContextCreate(CurrentMemoryContext,
+								  "Btree recovery temporary context",
+								  ALLOCSET_DEFAULT_SIZES);
+}
+
+void
+btree_xlog_cleanup(void)
+{
+	MemoryContextDelete(opCtx);
+	opCtx = NULL;
 }
 
 /*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 7d63a7124e..68fad1c91f 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -27,6 +27,7 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 		case XLOG_BTREE_INSERT_LEAF:
 		case XLOG_BTREE_INSERT_UPPER:
 		case XLOG_BTREE_INSERT_META:
+		case XLOG_BTREE_INSERT_POST:
 			{
 				xl_btree_insert *xlrec = (xl_btree_insert *) rec;
 
@@ -38,15 +39,25 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
-				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d",
-								 xlrec->level, xlrec->firstright, xlrec->newitemoff);
+				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, postingoff %d",
+								 xlrec->level, xlrec->firstright,
+								 xlrec->newitemoff, xlrec->postingoff);
+				break;
+			}
+		case XLOG_BTREE_DEDUP_PAGE:
+			{
+				xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+				appendStringInfo(buf, "baseoff %u; nitems %u",
+								 xlrec->baseoff, xlrec->nitems);
 				break;
 			}
 		case XLOG_BTREE_VACUUM:
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "ndeleted %u", xlrec->ndeleted);
+				appendStringInfo(buf, "ndeleted %u; nupdated %u",
+								 xlrec->ndeleted, xlrec->nupdated);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
@@ -130,6 +141,12 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			id = "DEDUPLICATE";
+			break;
+		case XLOG_BTREE_INSERT_POST:
+			id = "INSERT_POST";
+			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index f47176753d..32ff03b3e4 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1055,8 +1055,10 @@ PageIndexTupleDeleteNoCompact(Page page, OffsetNumber offnum)
  * This is better than deleting and reinserting the tuple, because it
  * avoids any data shifting when the tuple size doesn't change; and
  * even when it does, we avoid moving the line pointers around.
- * Conceivably this could also be of use to an index AM that cares about
- * the physical order of tuples as well as their ItemId order.
+ * This could be used by an index AM that doesn't want to unset the
+ * LP_DEAD bit when it happens to be set.  It could conceivably also be
+ * used by an index AM that cares about the physical order of tuples as
+ * well as their logical/ItemId order.
  *
  * If there's insufficient space for the new tuple, return false.  Other
  * errors represent data-corruption problems, so we just elog.
@@ -1142,7 +1144,8 @@ PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
 	}
 
 	/* Update the item's tuple length (other fields shouldn't change) */
-	ItemIdSetNormal(tupid, offset + size_diff, newsize);
+	tupid->lp_off = offset + size_diff;
+	tupid->lp_len = newsize;
 
 	/* Copy new tuple data onto page */
 	memcpy(PageGetItem(page, tupid), newtup, newsize);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index e5f8a1301f..73da2716f2 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -28,6 +28,7 @@
 
 #include "access/commit_ts.h"
 #include "access/gin.h"
+#include "access/nbtree.h"
 #include "access/rmgr.h"
 #include "access/tableam.h"
 #include "access/transam.h"
@@ -1091,6 +1092,15 @@ static struct config_bool ConfigureNamesBool[] =
 		false,
 		check_bonjour, NULL, NULL
 	},
+	{
+		{"deduplicate_btree_items", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Enables B-tree index deduplication optimization."),
+			NULL
+		},
+		&deduplicate_btree_items,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"track_commit_timestamp", PGC_POSTMASTER, REPLICATION,
 			gettext_noop("Collects transaction commit time."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e1048c0047..b3a98345fa 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -652,6 +652,7 @@
 #vacuum_cleanup_index_scale_factor = 0.1	# fraction of total number of tuples
 						# before index cleanup, 0 always performs
 						# index cleanup
+#deduplicate_btree_items = on
 #bytea_output = 'hex'			# hex, escape
 #xmlbinary = 'base64'
 #xmloption = 'content'
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index b52396c17a..987e6400e3 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1685,14 +1685,14 @@ psql_completion(const char *text, int start, int end)
 	/* ALTER INDEX <foo> SET|RESET ( */
 	else if (Matches("ALTER", "INDEX", MatchAny, "RESET", "("))
 		COMPLETE_WITH("fillfactor",
-					  "vacuum_cleanup_index_scale_factor",	/* BTREE */
+					  "vacuum_cleanup_index_scale_factor", "deduplicate_items",	/* BTREE */
 					  "fastupdate", "gin_pending_list_limit",	/* GIN */
 					  "buffering",	/* GiST */
 					  "pages_per_range", "autosummarize"	/* BRIN */
 			);
 	else if (Matches("ALTER", "INDEX", MatchAny, "SET", "("))
 		COMPLETE_WITH("fillfactor =",
-					  "vacuum_cleanup_index_scale_factor =",	/* BTREE */
+					  "vacuum_cleanup_index_scale_factor =", "deduplicate_items =",	/* BTREE */
 					  "fastupdate =", "gin_pending_list_limit =",	/* GIN */
 					  "buffering =",	/* GiST */
 					  "pages_per_range =", "autosummarize ="	/* BRIN */
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 6a058ccdac..359b5c18dc 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -145,6 +145,7 @@ static void bt_tuple_present_callback(Relation index, ItemPointer tid,
 									  bool tupleIsAlive, void *checkstate);
 static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 									 IndexTuple itup);
+static inline IndexTuple bt_posting_plain_tuple(IndexTuple itup, int n);
 static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 											   OffsetNumber offset);
@@ -167,6 +168,7 @@ static ItemId PageGetItemIdCareful(BtreeCheckState *state, BlockNumber block,
 								   Page page, OffsetNumber offset);
 static inline ItemPointer BTreeTupleGetHeapTIDCareful(BtreeCheckState *state,
 													  IndexTuple itup, bool nonpivot);
+static inline ItemPointer BTreeTupleGetPointsToTID(IndexTuple itup);
 
 /*
  * bt_index_check(index regclass, heapallindexed boolean)
@@ -278,7 +280,8 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
 
 	if (btree_index_mainfork_expected(indrel))
 	{
-		bool	heapkeyspace;
+		bool		heapkeyspace,
+					safededup;
 
 		RelationOpenSmgr(indrel);
 		if (!smgrexists(indrel->rd_smgr, MAIN_FORKNUM))
@@ -288,7 +291,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
 							RelationGetRelationName(indrel))));
 
 		/* Check index, possibly against table it is an index on */
-		heapkeyspace = _bt_heapkeyspace(indrel);
+		_bt_metaversion(indrel, &heapkeyspace, &safededup);
 		bt_check_every_level(indrel, heaprel, heapkeyspace, parentcheck,
 							 heapallindexed, rootdescend);
 	}
@@ -419,12 +422,12 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		/*
 		 * Size Bloom filter based on estimated number of tuples in index,
 		 * while conservatively assuming that each block must contain at least
-		 * MaxIndexTuplesPerPage / 5 non-pivot tuples.  (Non-leaf pages cannot
-		 * contain non-pivot tuples.  That's okay because they generally make
-		 * up no more than about 1% of all pages in the index.)
+		 * MaxTIDsPerBTreePage / 3 "plain" tuples -- see
+		 * bt_posting_plain_tuple() for definition, and details of how posting
+		 * list tuples are handled.
 		 */
 		total_pages = RelationGetNumberOfBlocks(rel);
-		total_elems = Max(total_pages * (MaxIndexTuplesPerPage / 5),
+		total_elems = Max(total_pages * (MaxTIDsPerBTreePage / 3),
 						  (int64) state->rel->rd_rel->reltuples);
 		/* Random seed relies on backend srandom() call to avoid repetition */
 		seed = random();
@@ -924,6 +927,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -954,13 +958,15 @@ bt_target_page_check(BtreeCheckState *state)
 		if (!_bt_check_natts(state->rel, state->heapkeyspace, state->target,
 							 offset))
 		{
+			ItemPointer tid;
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetPointsToTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -994,18 +1000,20 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * TID, since the posting list itself is validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
+			ItemPointer tid = BTreeTupleGetPointsToTID(itup);
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
-			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+			htid = psprintf("(%u,%u)", ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1017,6 +1025,40 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is a posting list tuple, make sure posting list TIDs are
+		 * in order
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+				current = BTreeTupleGetPostingN(itup, i);
+
+				if (ItemPointerCompare(current, &last) <= 0)
+				{
+					char	   *itid = psprintf("(%u,%u)", state->targetblock, offset);
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s posting list offset=%d page lsn=%X/%X.",
+												itid, i,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1049,13 +1091,14 @@ bt_target_page_check(BtreeCheckState *state)
 		if (tupsize > (lowersizelimit ? BTMaxItemSize(state->target) :
 					   BTMaxItemSizeNoHeapTid(state->target)))
 		{
+			ItemPointer tid = BTreeTupleGetPointsToTID(itup);
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1074,12 +1117,32 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				/* Fingerprint all elements as distinct "plain" tuples */
+				for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					IndexTuple	logtuple;
+
+					logtuple = bt_posting_plain_tuple(itup, i);
+					norm = bt_normalize_tuple(state, logtuple);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != logtuple)
+						pfree(norm);
+					pfree(logtuple);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
@@ -1087,7 +1150,8 @@ bt_target_page_check(BtreeCheckState *state)
 		 *
 		 * If there is a high key (if this is not the rightmost page on its
 		 * entire level), check that high key actually is upper bound on all
-		 * page items.
+		 * page items.  If this is a posting list tuple, we'll need to set
+		 * scantid to be highest TID in posting list.
 		 *
 		 * We prefer to check all items against high key rather than checking
 		 * just the last and trusting that the operator class obeys the
@@ -1127,17 +1191,22 @@ bt_target_page_check(BtreeCheckState *state)
 		 * tuple. (See also: "Notes About Data Representation" in the nbtree
 		 * README.)
 		 */
+		scantid = skey->scantid;
+		if (state->heapkeyspace && BTreeTupleIsPosting(itup))
+			skey->scantid = BTreeTupleGetMaxHeapTID(itup);
+
 		if (!P_RIGHTMOST(topaque) &&
 			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
 			  invariant_l_offset(state, skey, P_HIKEY)))
 		{
+			ItemPointer tid = BTreeTupleGetPointsToTID(itup);
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1150,6 +1219,8 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
+		/* Reset, in case scantid was set to (itup) posting tuple's max TID */
+		skey->scantid = scantid;
 
 		/*
 		 * * Item order check *
@@ -1160,15 +1231,17 @@ bt_target_page_check(BtreeCheckState *state)
 		if (OffsetNumberNext(offset) <= max &&
 			!invariant_l_offset(state, skey, OffsetNumberNext(offset)))
 		{
+			ItemPointer tid;
 			char	   *itid,
 					   *htid,
 					   *nitid,
 					   *nhtid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetPointsToTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 			nitid = psprintf("(%u,%u)", state->targetblock,
 							 OffsetNumberNext(offset));
 
@@ -1177,9 +1250,10 @@ bt_target_page_check(BtreeCheckState *state)
 										  state->target,
 										  OffsetNumberNext(offset));
 			itup = (IndexTuple) PageGetItem(state->target, itemid);
+			tid = BTreeTupleGetPointsToTID(itup);
 			nhtid = psprintf("(%u,%u)",
-							 ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							 ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							 ItemPointerGetBlockNumberNoCheck(tid),
+							 ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1953,10 +2027,9 @@ bt_tuple_present_callback(Relation index, ItemPointer tid, Datum *values,
  * verification.  In particular, it won't try to normalize opclass-equal
  * datums with potentially distinct representations (e.g., btree/numeric_ops
  * index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though.  For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation.
  */
 static IndexTuple
 bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -1969,6 +2042,9 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	IndexTuple	reformed;
 	int			i;
 
+	/* Caller should only pass "logical" non-pivot tuples here */
+	Assert(!BTreeTupleIsPosting(itup) && !BTreeTupleIsPivot(itup));
+
 	/* Easy case: It's immediately clear that tuple has no varlena datums */
 	if (!IndexTupleHasVarwidths(itup))
 		return itup;
@@ -2031,6 +2107,29 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	return reformed;
 }
 
+/*
+ * Produce palloc()'d "plain" tuple for nth posting list entry/TID.
+ *
+ * In general, deduplication is not supposed to change the logical contents of
+ * an index.  Multiple index tuples are merged together into one equivalent
+ * posting list index tuple when convenient.
+ *
+ * heapallindexed verification must normalize-away this variation in
+ * representation by converting posting list tuples into two or more "plain"
+ * tuples.  Each tuple must be fingerprinted separately -- there must be one
+ * tuple for each corresponding Bloom filter probe during the heap scan.
+ *
+ * Note: Caller still needs to call bt_normalize_tuple() with returned tuple.
+ */
+static inline IndexTuple
+bt_posting_plain_tuple(IndexTuple itup, int n)
+{
+	Assert(BTreeTupleIsPosting(itup));
+
+	/* Returns non-posting-list tuple */
+	return _bt_form_posting(itup, BTreeTupleGetPostingN(itup, n), 1);
+}
+
 /*
  * Search for itup in index, starting from fast root page.  itup must be a
  * non-pivot tuple.  This is only supported with heapkeyspace indexes, since
@@ -2087,6 +2186,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		insertstate.itup = itup;
 		insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 		insertstate.itup_key = key;
+		insertstate.postingoff = 0;
 		insertstate.bounds_valid = false;
 		insertstate.buf = lbuf;
 
@@ -2094,7 +2194,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		offnum = _bt_binsrch_insert(state->rel, &insertstate);
 		/* Compare first >= matching item on leaf page, if any */
 		page = BufferGetPage(lbuf);
+		/* Should match on first heap TID when tuple has a posting list */
 		if (offnum <= PageGetMaxOffsetNumber(page) &&
+			insertstate.postingoff <= 0 &&
 			_bt_compare(state->rel, key, page, offnum) == 0)
 			exists = true;
 		_bt_relbuf(state->rel, lbuf);
@@ -2548,26 +2650,69 @@ PageGetItemIdCareful(BtreeCheckState *state, BlockNumber block, Page page,
 }
 
 /*
- * BTreeTupleGetHeapTID() wrapper that lets caller enforce that a heap TID must
- * be present in cases where that is mandatory.
- *
- * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
- * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
- * It may become more useful in the future, when non-pivot tuples support their
- * own alternative INDEX_ALT_TID_MASK representation.
+ * BTreeTupleGetHeapTID() wrapper that enforces that a heap TID is present in
+ * cases where that is mandatory (i.e. for non-pivot tuples)
  */
 static inline ItemPointer
 BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
 							bool nonpivot)
 {
-	ItemPointer result = BTreeTupleGetHeapTID(itup);
-	BlockNumber targetblock = state->targetblock;
+	ItemPointer htid;
 
-	if (result == NULL && nonpivot)
+	/*
+	 * Caller determines whether this is supposed to be a pivot or non-pivot
+	 * tuple using page type and item offset number.  Verify that tuple
+	 * metadata agrees with this.
+	 */
+	Assert(state->heapkeyspace);
+	if (BTreeTupleIsPivot(itup) && nonpivot)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg_internal("block %u or its right sibling block or child block in index \"%s\" has unexpected pivot tuple",
+								 state->targetblock,
+								 RelationGetRelationName(state->rel))));
+
+	if (!BTreeTupleIsPivot(itup) && !nonpivot)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg_internal("block %u or its right sibling block or child block in index \"%s\" has unexpected non-pivot tuple",
+								 state->targetblock,
+								 RelationGetRelationName(state->rel))));
+
+	htid = BTreeTupleGetHeapTID(itup);
+	if (!ItemPointerIsValid(htid) && nonpivot)
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
-						targetblock, RelationGetRelationName(state->rel))));
+						state->targetblock,
+						RelationGetRelationName(state->rel))));
 
-	return result;
+	return htid;
+}
+
+/*
+ * Return the "pointed to" TID for itup, which is used to generate a
+ * descriptive error message.  itup must be a "data item" tuple (it wouldn't
+ * make much sense to call here with a high key tuple, since there won't be a
+ * valid downlink/block number to display).
+ *
+ * Returns either a heap TID (which will be the first heap TID in posting list
+ * if itup is posting list tuple), or a TID that contains downlink block
+ * number, plus some encoded metadata (e.g., the number of attributes present
+ * in itup).
+ */
+static inline ItemPointer
+BTreeTupleGetPointsToTID(IndexTuple itup)
+{
+	/*
+	 * Rely on the assumption that !heapkeyspace internal page data items will
+	 * correctly return TID with downlink here -- BTreeTupleGetHeapTID() won't
+	 * recognize it as a pivot tuple, but everything still works out because
+	 * the t_tid field is still returned
+	 */
+	if (!BTreeTupleIsPivot(itup))
+		return BTreeTupleGetHeapTID(itup);
+
+	/* Pivot tuple returns TID with downlink block (heapkeyspace variant) */
+	return &itup->t_tid;
 }
diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
index 5881ea5dd6..da7007135d 100644
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@@ -433,11 +433,127 @@ returns bool
 <sect1 id="btree-implementation">
  <title>Implementation</title>
 
+ <para>
+  Internally, a B-tree index consists of a tree structure with leaf
+  pages.  Each leaf page contains tuples that point to table entries
+  using a heap item pointer.  Each tuple's key is considered unique
+  internally, since the item pointer is treated as part of the key.
+ </para>
+ <para>
+  An introduction to the btree index implementation can be found in
+  <filename>src/backend/access/nbtree/README</filename>.
+ </para>
+
+ <sect2 id="btree-deduplication">
+  <title>Deduplication</title>
   <para>
-   An introduction to the btree index implementation can be found in
-   <filename>src/backend/access/nbtree/README</filename>.
+   B-Tree supports <firstterm>deduplication</firstterm>.  Existing
+   leaf page tuples with fully equal keys (equal prior to the heap
+   item pointer) are merged together into a single <quote>posting
+   list</quote> tuple.  The keys appear only once in this
+   representation.  A simple array of heap item pointers follows.
+   Posting lists are formed <quote>lazily</quote>, when a new item is
+   inserted that cannot fit on an existing leaf page.  The immediate
+   goal of the deduplication process is to at least free enough space
+   to fit the new item; otherwise a leaf page split occurs, which
+   allocates a new leaf page.  The <firstterm>key space</firstterm>
+   covered by the original leaf page is shared among the original page,
+   and its new right sibling page.
+  </para>
+  <para>
+   A duplicate is a row where <emphasis>all</emphasis> indexed key
+   columns are equal to the corresponding column values from some
+   other row.
+  </para>
+  <para>
+   Deduplication can greatly increase index space efficiency with data
+   sets where each distinct key appears at least a few times on
+   average.  It can also reduce the cost of subsequent index scans,
+   especially when many leaf pages must be accessed.  For example, an
+   index on a simple <type>integer</type> column that uses
+   deduplication will have a storage size that is only about 65% of an
+   equivalent unoptimized index when each distinct
+   <type>integer</type> value appears three times.  If each distinct
+   <type>integer</type> value appears six times, the storage overhead
+   can be as low as 50% of baseline.  With hundreds of duplicates per
+   distinct value (or with larger <quote>base</quote> key values), a
+   storage size of about one third of the unoptimized case is
+   expected.  There is usually a direct benefit for queries, as well
+   as an indirect benefit due to reduced I/O during routine vacuuming.
+  </para>
+  <para>
+   Cases that don't benefit due to having no duplicate values will
+   incur a small performance penalty with mixed read-write workloads.
+   There is no performance penalty with read-only workloads, since
+   reading from posting lists is at least as efficient as reading the
+   standard index tuple representation.
+  </para>
+ </sect2>
+
+ <sect2 id="btree-deduplication-configure">
+  <title>Configuring Deduplication</title>
+
+  <para>
+   The <xref linkend="guc-btree-deduplicate-items"/> configuration
+   parameter controls deduplication.  By default, deduplication is
+   enabled.  The <literal>deduplicate_items</literal> storage
+   parameter can be used to override the configuration paramater for
+   individual indexes.  See <xref
+   linkend="sql-createindex-storage-parameters"/> from the
+   <command>CREATE INDEX</command> documentation for details.
+  </para>
+ </sect2>
+
+ <sect2 id="btree-deduplication-restrictions">
+  <title>Restrictions</title>
+
+  <para>
+   Deduplication can only be used with a B-Tree index when
+   <emphasis>all</emphasis> indexed columns use a deduplication-safe
+   operator class that explicitly indicates that deduplication is safe
+   at <command>CREATE INDEX</command> time.  In practice almost all
+   datatypes support deduplication.  <type>numeric</type> is a notable
+   exception (<quote>display scale</quote> makes it impossible to
+   enable deduplication without losing useful information about equal
+   <type>numeric</type> datums).  Some operator classes support
+   deduplication conditionally.  For example, deduplication of indexes
+   on a <type>text</type> column (with the default
+   <literal>btree/text_ops</literal> operator class) is not supported
+   when the column uses a  nondeterministic collation.
+  </para>
+  <para>
+   <literal>INCLUDE</literal> indexes do not support deduplication.
   </para>
 
+ </sect2>
+
+ <sect2 id="btree-deduplication-unique">
+  <title>Internal use of Deduplication in unique indexes</title>
+
+  <para>
+   Page splits that occur due to inserting multiple physical versions
+   (rather than inserting new logical rows) tend to degrade the
+   structure of indexes, especially in the case of unique indexes.
+   Unique indexes use deduplication <emphasis>internally</emphasis>
+   and <emphasis>selectively</emphasis> to delay (and ideally to
+   prevent) these <quote>unnecessary</quote> page splits.
+   <command>VACUUM</command> or autovacuum will eventually remove dead
+   versions of tuples from every index in any case, but usually cannot
+   reverse page splits (in general, the page must be completely empty
+   before <command>VACUUM</command> can <quote>delete</quote> it).
+  </para>
+  <para>
+   The <xref linkend="guc-btree-deduplicate-items"/> configuration
+   parameter does not affect whether or not deduplication is used
+   within unique indexes.  The internal use of deduplication for
+   unique indexes is subject to all of the same restrictions as
+   deduplication in general.  The <literal>deduplicate_items</literal>
+   storage parameter can be set to <literal>OFF</literal> to disable
+   deduplication in unique indexes, but this is intended only as a
+   debugging option for developers.
+  </para>
+
+ </sect2>
 </sect1>
 
 </chapter>
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 55669b5cad..9f371d3e3a 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -928,10 +928,11 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
      nondeterministic collations give a more <quote>correct</quote> behavior,
      especially when considering the full power of Unicode and its many
      special cases, they also have some drawbacks.  Foremost, their use leads
-     to a performance penalty.  Also, certain operations are not possible with
-     nondeterministic collations, such as pattern matching operations.
-     Therefore, they should be used only in cases where they are specifically
-     wanted.
+     to a performance penalty.  Note, in particular, that B-tree cannot use
+     deduplication with indexes that use a nondeterministic collation.  Also,
+     certain operations are not possible with nondeterministic collations,
+     such as pattern matching operations.  Therefore, they should be used
+     only in cases where they are specifically wanted.
     </para>
    </sect3>
   </sect2>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5d45b6f7cb..55a80d9f4e 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8041,6 +8041,31 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-btree-deduplicate-items" xreflabel="deduplicate_btree_items">
+      <term><varname>deduplicate_btree_items</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>deduplicate_btree_items</varname></primary>
+       <secondary>configuration parameter</secondary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Controls whether deduplication should be used within B-Tree
+        indexes.  Deduplication is an optimization that reduces the
+        storage size of indexes by storing equal index keys only once.
+        See <xref linkend="btree-deduplication"/> for more
+        information.
+       </para>
+
+       <para>
+        This setting can be overridden for individual B-Tree indexes
+        by changing index storage parameters.  See <xref
+        linkend="sql-createindex-storage-parameters"/> from the
+        <command>CREATE INDEX</command> documentation for details.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-bytea-output" xreflabel="bytea_output">
       <term><varname>bytea_output</varname> (<type>enum</type>)
       <indexterm>
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 629a31ef79..6659d15bf4 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -166,6 +166,8 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
         maximum size allowed for the index type, data insertion will fail.
         In any case, non-key columns duplicate data from the index's table
         and bloat the size of the index, thus potentially slowing searches.
+        Moreover, B-tree deduplication is never used with indexes that
+        have a non-key column.
        </para>
 
        <para>
@@ -388,10 +390,40 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    </variablelist>
 
    <para>
-    B-tree indexes additionally accept this parameter:
+    B-tree indexes also accept these parameters:
    </para>
 
    <variablelist>
+   <varlistentry id="index-reloption-deduplication" xreflabel="deduplicate_items">
+    <term><literal>deduplicate_items</literal>
+     <indexterm>
+      <primary><varname>deduplicate_items</varname></primary>
+      <secondary>storage parameter</secondary>
+     </indexterm>
+    </term>
+    <listitem>
+    <para>
+      Per-index value for <xref
+      linkend="guc-btree-deduplicate-items"/>.  Controls usage of the
+      B-tree deduplication technique described in <xref
+      linkend="btree-deduplication"/>.  Set to <literal>ON</literal>
+      or <literal>OFF</literal> to override GUC.  (Alternative
+      spellings of <literal>ON</literal> and <literal>OFF</literal>
+      are allowed as described in <xref linkend="config-setting"/>.)
+      The default is <literal>ON</literal>.
+    </para>
+
+    <note>
+     <para>
+      Turning <literal>deduplicate_items</literal> off via
+      <command>ALTER INDEX</command> prevents future insertions from
+      triggering deduplication, but does not in itself make existing
+      posting list tuples use the standard tuple representation.
+     </para>
+    </note>
+    </listitem>
+   </varlistentry>
+
    <varlistentry id="index-reloption-vacuum-cleanup-index-scale-factor" xreflabel="vacuum_cleanup_index_scale_factor">
     <term><literal>vacuum_cleanup_index_scale_factor</literal>
      <indexterm>
@@ -446,9 +478,7 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
      This setting controls usage of the fast update technique described in
      <xref linkend="gin-fast-update"/>.  It is a Boolean parameter:
      <literal>ON</literal> enables fast update, <literal>OFF</literal> disables it.
-     (Alternative spellings of <literal>ON</literal> and <literal>OFF</literal> are
-     allowed as described in <xref linkend="config-setting"/>.)  The
-     default is <literal>ON</literal>.
+     The default is <literal>ON</literal>.
     </para>
 
     <note>
diff --git a/src/test/regress/expected/btree_index.out b/src/test/regress/expected/btree_index.out
index f567117a46..3d353cefdf 100644
--- a/src/test/regress/expected/btree_index.out
+++ b/src/test/regress/expected/btree_index.out
@@ -266,6 +266,22 @@ select * from btree_bpchar where f1::bpchar like 'foo%';
  fool
 (2 rows)
 
+--
+-- Perform unique checking, with and without the use of deduplication
+--
+CREATE TABLE dedup_unique_test_table (a int) WITH (autovacuum_enabled=false);
+CREATE UNIQUE INDEX dedup_unique ON dedup_unique_test_table (a) WITH (deduplicate_items=on);
+CREATE UNIQUE INDEX plain_unique ON dedup_unique_test_table (a) WITH (deduplicate_items=off);
+-- Generate enough garbage tuples in index to ensure that even the unique index
+-- with deduplication enabled has to check multiple leaf pages during unique
+-- checking (at least with a BLCKSZ of 8192 or less)
+DO $$
+BEGIN
+    FOR r IN 1..1350 LOOP
+        DELETE FROM dedup_unique_test_table;
+        INSERT INTO dedup_unique_test_table SELECT 1;
+    END LOOP;
+END$$;
 --
 -- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
diff --git a/src/test/regress/sql/btree_index.sql b/src/test/regress/sql/btree_index.sql
index 558dcae0ec..b0b81b2b9a 100644
--- a/src/test/regress/sql/btree_index.sql
+++ b/src/test/regress/sql/btree_index.sql
@@ -103,6 +103,23 @@ explain (costs off)
 select * from btree_bpchar where f1::bpchar like 'foo%';
 select * from btree_bpchar where f1::bpchar like 'foo%';
 
+--
+-- Perform unique checking, with and without the use of deduplication
+--
+CREATE TABLE dedup_unique_test_table (a int) WITH (autovacuum_enabled=false);
+CREATE UNIQUE INDEX dedup_unique ON dedup_unique_test_table (a) WITH (deduplicate_items=on);
+CREATE UNIQUE INDEX plain_unique ON dedup_unique_test_table (a) WITH (deduplicate_items=off);
+-- Generate enough garbage tuples in index to ensure that even the unique index
+-- with deduplication enabled has to check multiple leaf pages during unique
+-- checking (at least with a BLCKSZ of 8192 or less)
+DO $$
+BEGIN
+    FOR r IN 1..1350 LOOP
+        DELETE FROM dedup_unique_test_table;
+        INSERT INTO dedup_unique_test_table SELECT 1;
+    END LOOP;
+END$$;
+
 --
 -- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
-- 
2.17.1

v30-0002-Header-for-customized-qsort.patchapplication/octet-stream; name=v30-0002-Header-for-customized-qsort.patchDownload

From 45664e5282fa217ac773743ac75090d8b52bbf3a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 14 Jan 2020 14:01:55 -0800
Subject: [PATCH v30 2/5] Header for customized qsort.

This is a dependency of the "Bucket sort for compactify_tuples" commit
that follows.  (We probably won't need to inline qsort, but leave it for
now.)

This patch was retrieved from:
https://postgr.es/m/CAL-rCA2n7UfVu1Ui0f%2B7cVN4vAKVM0%2B-cZKb_ka6-mGQBAF92w%40mail.gmail.com
---
 src/include/lib/qsort_template.h          | 325 ++++++++++++++++++++++
 src/port/qsort.c                          |   2 +-
 src/port/qsort_arg.c                      |   2 +-
 src/backend/utils/sort/Makefile           |   7 -
 src/backend/utils/sort/gen_qsort_tuple.pl | 270 ------------------
 src/backend/utils/sort/tuplesort.c        |  27 +-
 src/tools/msvc/Solution.pm                |  10 +
 src/tools/msvc/clean.bat                  |   1 -
 8 files changed, 363 insertions(+), 281 deletions(-)
 create mode 100644 src/include/lib/qsort_template.h
 delete mode 100644 src/backend/utils/sort/gen_qsort_tuple.pl

diff --git a/src/include/lib/qsort_template.h b/src/include/lib/qsort_template.h
new file mode 100644
index 0000000000..3d9ffbd81b
--- /dev/null
+++ b/src/include/lib/qsort_template.h
@@ -0,0 +1,325 @@
+/*	$NetBSD: qsort.c,v 1.13 2003/08/07 16:43:42 agc Exp $	*/
+
+/*-
+ * Copyright (c) 1992, 1993
+ *	The Regents of the University of California.  All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *	  notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *	  notice, this list of conditions and the following disclaimer in the
+ *	  documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of the University nor the names of its contributors
+ *	  may be used to endorse or promote products derived from this software
+ *	  without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+/*
+ * Qsort routine based on J. L. Bentley and M. D. McIlroy,
+ * "Engineering a sort function",
+ * Software--Practice and Experience 23 (1993) 1249-1265.
+ *
+ * We have modified their original by adding a check for already-sorted input,
+ * which seems to be a win per discussions on pgsql-hackers around 2006-03-21.
+ *
+ * Also, we recurse on the smaller partition and iterate on the larger one,
+ * which ensures we cannot recurse more than log(N) levels (since the
+ * partition recursed to is surely no more than half of the input).  Bentley
+ * and McIlroy explicitly rejected doing this on the grounds that it's "not
+ * worth the effort", but we have seen crashes in the field due to stack
+ * overrun, so that judgment seems wrong.
+ */
+
+
+/*
+ * Template parameters are:
+ * QS_SUFFIX - name suffix.
+ * QS_TYPE - array element type.
+ * QS_CMP  - Function used to compare elements, should be defined
+ *           before inclusion, or passed using QS_EXTRAPARAMS/QS_EXTRAARGS.
+ *           Default is `cmp_##QS_SUFFIX`
+ * QS_EXTRAPARAMS - extra parameters consumed by qsort and insertion sort.
+ * QS_EXTRAARGS - extra arguments passed to qsort and insertion sort.
+ * QS_CMPARGS - extra arguments passed to cmp function.
+ *
+ * if QS_EXTRAPARAMS, QS_EXTRAARGS and QS_CMPARGS are all undefined, then
+ * QS_CMP assummed to be predefined and accepting only elements to compare
+ * (ie no extra parameters).
+ *
+ * QS_DECLARE - if defined function prototypes and type declarations are
+ *		        generated
+ * QS_DEFINE - if defined function definitions are generated
+ * QS_SCOPE - in which scope (e.g. extern, static) do function declarations reside
+ *
+ * QS_CHECK_FOR_INTERRUPTS - if defined, then CHECK_FOR_INTERRUPTS is called
+ *			periodically.
+ * QS_CHECK_PRESORTED - if defined, check for presorted array is included.
+ */
+
+#ifndef QS_SUFFIX
+#error "QS_SUFFIX should be defined"
+#endif
+#ifndef QS_TYPE
+#error "QS_TYPE should be defined"
+#endif
+
+/* helpers */
+#define QS_MAKE_NAME_(a, b) CppConcat(a, b)
+#define QS_MAKE_NAME(a) QS_MAKE_NAME_(a, QS_SUFFIX)
+
+#define QS_SWAPMANY QS_MAKE_NAME(qsswapmany_)
+#define QS_SWAPONE QS_MAKE_NAME(qsswapone_)
+#define QS_MED3 QS_MAKE_NAME(qsmed3_)
+#define QS_INSERTION_SORT QS_MAKE_NAME(insertion_sort_)
+#define QS_QSORT QS_MAKE_NAME(qsort_)
+#define QS_QSORT_IMPL QS_MAKE_NAME(qsort_impl_)
+#ifndef QS_CMP
+#define QS_CMP QS_MAKE_NAME(cmp_)
+#endif
+
+#if !defined(QS_EXTRAPARAMS) && !defined(QS_EXTRAARGS) && !defined(QS_CMPARGS)
+#define QS_EXTRAPARAMS
+#define QS_EXTRAARGS
+#define QS_CMPARGS
+#else
+#ifndef QS_EXTRAPARAMS
+#error "QS_EXTRAPARAMS should be defined"
+#endif
+#ifndef QS_EXTRAARGS
+#error "QS_EXTRAARGS should be defined"
+#endif
+#ifndef QS_CMPARGS
+#error "QS_CMPARGS should be defined"
+#endif
+#endif
+
+/* generate forward declarations necessary to use the hash table */
+#ifdef QS_DECLARE
+QS_SCOPE void QS_INSERTION_SORT(QS_TYPE * a, QS_TYPE * b QS_EXTRAPARAMS);
+QS_SCOPE void QS_QSORT(QS_TYPE * a, QS_TYPE * b QS_EXTRAPARAMS);
+#endif
+
+#ifdef QS_DEFINE
+static inline void
+QS_SWAPONE(QS_TYPE * a, QS_TYPE * b)
+{
+	QS_TYPE		t = *a;
+
+	*a = *b;
+	*b = t;
+}
+
+static void
+QS_SWAPMANY(QS_TYPE * a, QS_TYPE * b, size_t n)
+{
+	for (; n > 0; n--, a++, b++)
+		QS_SWAPONE(a, b);
+}
+
+#ifndef QS_SKIP_MED3
+static QS_TYPE *
+QS_MED3(QS_TYPE * a, QS_TYPE * b, QS_TYPE * c QS_EXTRAPARAMS)
+{
+	return QS_CMP(a, b QS_CMPARGS) < 0 ?
+		(QS_CMP(b, c QS_CMPARGS) < 0 ? b :
+		 (QS_CMP(a, c QS_CMPARGS) < 0 ? c : a))
+		: (QS_CMP(b, c QS_CMPARGS) > 0 ? b :
+		   (QS_CMP(a, c QS_CMPARGS) < 0 ? a : c));
+}
+#endif
+
+QS_SCOPE void
+QS_INSERTION_SORT(QS_TYPE * a, size_t n QS_EXTRAPARAMS)
+{
+	QS_TYPE    *pm,
+			   *pl;
+
+	for (pm = a + 1; pm < a + n; pm++)
+		for (pl = pm; pl > a && QS_CMP(pl - 1, pl QS_CMPARGS) > 0; pl--)
+			QS_SWAPONE(pl, pl - 1);
+}
+
+#ifdef QS_CHECK_FOR_INTERRUPTS
+#define DO_CHECK_FOR_INTERRUPTS() CHECK_FOR_INTERRUPTS()
+#else
+#define DO_CHECK_FOR_INTERRUPTS()
+#endif
+
+QS_SCOPE void
+QS_QSORT(QS_TYPE * a, size_t n QS_EXTRAPARAMS)
+{
+	QS_TYPE    *pa,
+			   *pb,
+			   *pc,
+			   *pd,
+			   *pm,
+			   *pn;
+	size_t		d1,
+				d2;
+	int			r;
+
+loop:
+	DO_CHECK_FOR_INTERRUPTS();
+	if (n < 7)
+	{
+		QS_INSERTION_SORT(a, n QS_EXTRAARGS);
+		return;
+	}
+
+#ifdef QS_CHECK_PRESORTED
+	{
+		int			presorted = 1;
+
+		for (pm = a + 1; pm < a + n; pm++)
+		{
+			DO_CHECK_FOR_INTERRUPTS();
+			if (QS_CMP(pm - 1, pm QS_CMPARGS) > 0)
+			{
+				presorted = 0;
+				break;
+			}
+		}
+		if (presorted)
+			return;
+	}
+#endif
+	pm = a + (n / 2);
+#ifndef QS_SKIP_MED3
+	if (n > 7)
+	{
+		QS_TYPE    *pl = a;
+
+		pn = a + (n - 1);
+#ifndef QS_SKIP_MED_OF_MED
+		if (n > 40)
+		{
+			size_t		d = (n / 8);
+
+			pl = QS_MED3(pl, pl + d, pl + 2 * d QS_EXTRAARGS);
+			pm = QS_MED3(pm - d, pm, pm + d QS_EXTRAARGS);
+			pn = QS_MED3(pn - 2 * d, pn - d, pn QS_EXTRAARGS);
+		}
+#endif
+		pm = QS_MED3(pl, pm, pn QS_EXTRAARGS);
+	}
+#endif
+	QS_SWAPONE(a, pm);
+	pa = pb = a + 1;
+	pc = pd = a + (n - 1);
+	for (;;)
+	{
+		while (pb <= pc && (r = QS_CMP(pb, a QS_CMPARGS)) <= 0)
+		{
+			if (r == 0)
+			{
+				QS_SWAPONE(pa, pb);
+				pa++;
+			}
+			pb++;
+			DO_CHECK_FOR_INTERRUPTS();
+		}
+		while (pb <= pc && (r = QS_CMP(pc, a QS_CMPARGS)) >= 0)
+		{
+			if (r == 0)
+			{
+				QS_SWAPONE(pc, pd);
+				pd--;
+			}
+			pc--;
+			DO_CHECK_FOR_INTERRUPTS();
+		}
+		if (pb > pc)
+			break;
+		QS_SWAPONE(pb, pc);
+		pb++;
+		pc--;
+	}
+	pn = a + n;
+	d1 = Min(pa - a, pb - pa);
+	if (d1 > 0)
+		QS_SWAPMANY(a, pb - d1, d1);
+	d1 = Min(pd - pc, pn - pd - 1);
+	if (d1 > 0)
+		QS_SWAPMANY(pb, pn - d1, d1);
+	d1 = pb - pa;
+	d2 = pd - pc;
+	if (d1 <= d2)
+	{
+		/* Recurse on left partition, then iterate on right partition */
+		if (d1 > 1)
+			QS_QSORT(a, d1 QS_EXTRAARGS);
+		if (d2 > 1)
+		{
+			/* Iterate rather than recurse to save stack space */
+			/* QS_QSORT(pn - d2, d2 EXTRAARGS); */
+			a = pn - d2;
+			n = d2;
+			goto loop;
+		}
+	}
+	else
+	{
+		/* Recurse on right partition, then iterate on left partition */
+		if (d2 > 1)
+			QS_QSORT(pn - d2, d2 QS_EXTRAARGS);
+		if (d1 > 1)
+		{
+			/* Iterate rather than recurse to save stack space */
+			/* QS_QSORT(a, d1 EXTRAARGS); */
+			n = d1;
+			goto loop;
+		}
+	}
+}
+#endif
+
+#undef QS_SUFFIX
+#undef QS_TYPE
+#ifdef QS_DECLARE
+#undef QS_DECLARE
+#endif
+#ifdef QS_DEFINE
+#undef QS_DEFINE
+#endif
+#ifdef QS_SCOPE
+#undef QS_SCOPE
+#endif
+#ifdef QS_CHECK_FOR_INTERRUPTS
+#undef QS_CHECK_FOR_INTERRUPTS
+#endif
+#ifdef QS_CHECK_PRESORTED
+#undef QS_CHECK_PRESORTED
+#endif
+#ifdef QS_SKIP_MED3
+#undef QS_SKIP_MED3
+#endif
+#ifdef QS_SKIP_MED_OF_MED
+#undef QS_SKIP_MED_OF_MED
+#endif
+#undef DO_CHECK_FOR_INTERRUPTS
+#undef QS_MAKE_NAME
+#undef QS_SWAPMANY
+#undef QS_SWAPONE
+#undef QS_MED3
+#undef QS_INSERTION_SORT
+#undef QS_QSORT
+#undef QS_QSORT_IMPL
+#undef QS_CMP
+#undef QS_EXTRAPARAMS
+#undef QS_EXTRAARGS
+#undef QS_CMPARGS
diff --git a/src/port/qsort.c b/src/port/qsort.c
index 409f69a128..b38b33a63a 100644
--- a/src/port/qsort.c
+++ b/src/port/qsort.c
@@ -8,7 +8,7 @@
  *	  in favor of a simple check for presorted input.
  *	  Take care to recurse on the smaller partition, to bound stack usage.
  *
- *	CAUTION: if you change this file, see also qsort_arg.c, gen_qsort_tuple.pl
+ *	CAUTION: if you change this file, see also qsort_arg.c, qsort_template.h
  *
  *	src/port/qsort.c
  */
diff --git a/src/port/qsort_arg.c b/src/port/qsort_arg.c
index a2194b0fb1..9bc7e6edca 100644
--- a/src/port/qsort_arg.c
+++ b/src/port/qsort_arg.c
@@ -8,7 +8,7 @@
  *	  in favor of a simple check for presorted input.
  *	  Take care to recurse on the smaller partition, to bound stack usage.
  *
- *	CAUTION: if you change this file, see also qsort.c, gen_qsort_tuple.pl
+ *	CAUTION: if you change this file, see also qsort_arg.c, qsort_template.h
  *
  *	src/port/qsort_arg.c
  */
diff --git a/src/backend/utils/sort/Makefile b/src/backend/utils/sort/Makefile
index 7ac3659261..fbd00046f4 100644
--- a/src/backend/utils/sort/Makefile
+++ b/src/backend/utils/sort/Makefile
@@ -21,12 +21,5 @@ OBJS = \
 	tuplesort.o \
 	tuplestore.o
 
-tuplesort.o: qsort_tuple.c
-
-qsort_tuple.c: gen_qsort_tuple.pl
-	$(PERL) $(srcdir)/gen_qsort_tuple.pl $< > $@
-
 include $(top_srcdir)/src/backend/common.mk
 
-maintainer-clean:
-	rm -f qsort_tuple.c
diff --git a/src/backend/utils/sort/gen_qsort_tuple.pl b/src/backend/utils/sort/gen_qsort_tuple.pl
deleted file mode 100644
index b6b2ffa7d0..0000000000
--- a/src/backend/utils/sort/gen_qsort_tuple.pl
+++ /dev/null
@@ -1,270 +0,0 @@
-#!/usr/bin/perl -w
-
-#
-# gen_qsort_tuple.pl
-#
-# This script generates specialized versions of the quicksort algorithm for
-# tuple sorting.  The quicksort code is derived from the NetBSD code.  The
-# code generated by this script runs significantly faster than vanilla qsort
-# when used to sort tuples.  This speedup comes from a number of places.
-# The major effects are (1) inlining simple tuple comparators is much faster
-# than jumping through a function pointer and (2) swap and vecswap operations
-# specialized to the particular data type of interest (in this case, SortTuple)
-# are faster than the generic routines.
-#
-#	Modifications from vanilla NetBSD source:
-#	  Add do ... while() macro fix
-#	  Remove __inline, _DIAGASSERTs, __P
-#	  Remove ill-considered "swap_cnt" switch to insertion sort,
-#	  in favor of a simple check for presorted input.
-#	  Take care to recurse on the smaller partition, to bound stack usage.
-#
-#     Instead of sorting arbitrary objects, we're always sorting SortTuples.
-#     Add CHECK_FOR_INTERRUPTS().
-#
-# CAUTION: if you change this file, see also qsort.c and qsort_arg.c
-#
-
-use strict;
-
-my $SUFFIX;
-my $EXTRAARGS;
-my $EXTRAPARAMS;
-my $CMPPARAMS;
-
-emit_qsort_boilerplate();
-
-$SUFFIX      = 'tuple';
-$EXTRAARGS   = ', SortTupleComparator cmp_tuple, Tuplesortstate *state';
-$EXTRAPARAMS = ', cmp_tuple, state';
-$CMPPARAMS   = ', state';
-emit_qsort_implementation();
-
-$SUFFIX      = 'ssup';
-$EXTRAARGS   = ', SortSupport ssup';
-$EXTRAPARAMS = ', ssup';
-$CMPPARAMS   = ', ssup';
-print <<'EOM';
-
-#define cmp_ssup(a, b, ssup) \
-	ApplySortComparator((a)->datum1, (a)->isnull1, \
-						(b)->datum1, (b)->isnull1, ssup)
-
-EOM
-emit_qsort_implementation();
-
-sub emit_qsort_boilerplate
-{
-	print <<'EOM';
-/*
- * autogenerated by src/backend/utils/sort/gen_qsort_tuple.pl, do not edit!
- *
- * This file is included by tuplesort.c, rather than compiled separately.
- */
-
-/*	$NetBSD: qsort.c,v 1.13 2003/08/07 16:43:42 agc Exp $	*/
-
-/*-
- * Copyright (c) 1992, 1993
- *	The Regents of the University of California.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions
- * are met:
- * 1. Redistributions of source code must retain the above copyright
- *	  notice, this list of conditions and the following disclaimer.
- * 2. Redistributions in binary form must reproduce the above copyright
- *	  notice, this list of conditions and the following disclaimer in the
- *	  documentation and/or other materials provided with the distribution.
- * 3. Neither the name of the University nor the names of its contributors
- *	  may be used to endorse or promote products derived from this software
- *	  without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
- * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
- * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
- * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
- * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
- * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
- * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
- * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
- * SUCH DAMAGE.
- */
-
-/*
- * Qsort routine based on J. L. Bentley and M. D. McIlroy,
- * "Engineering a sort function",
- * Software--Practice and Experience 23 (1993) 1249-1265.
- *
- * We have modified their original by adding a check for already-sorted input,
- * which seems to be a win per discussions on pgsql-hackers around 2006-03-21.
- *
- * Also, we recurse on the smaller partition and iterate on the larger one,
- * which ensures we cannot recurse more than log(N) levels (since the
- * partition recursed to is surely no more than half of the input).  Bentley
- * and McIlroy explicitly rejected doing this on the grounds that it's "not
- * worth the effort", but we have seen crashes in the field due to stack
- * overrun, so that judgment seems wrong.
- */
-
-static void
-swapfunc(SortTuple *a, SortTuple *b, size_t n)
-{
-	do
-	{
-		SortTuple 	t = *a;
-		*a++ = *b;
-		*b++ = t;
-	} while (--n > 0);
-}
-
-#define swap(a, b)						\
-	do { 								\
-		SortTuple t = *(a);				\
-		*(a) = *(b);					\
-		*(b) = t;						\
-	} while (0);
-
-#define vecswap(a, b, n) if ((n) > 0) swapfunc(a, b, n)
-
-EOM
-
-	return;
-}
-
-sub emit_qsort_implementation
-{
-	print <<EOM;
-static SortTuple *
-med3_$SUFFIX(SortTuple *a, SortTuple *b, SortTuple *c$EXTRAARGS)
-{
-	return cmp_$SUFFIX(a, b$CMPPARAMS) < 0 ?
-		(cmp_$SUFFIX(b, c$CMPPARAMS) < 0 ? b :
-			(cmp_$SUFFIX(a, c$CMPPARAMS) < 0 ? c : a))
-		: (cmp_$SUFFIX(b, c$CMPPARAMS) > 0 ? b :
-			(cmp_$SUFFIX(a, c$CMPPARAMS) < 0 ? a : c));
-}
-
-static void
-qsort_$SUFFIX(SortTuple *a, size_t n$EXTRAARGS)
-{
-	SortTuple  *pa,
-			   *pb,
-			   *pc,
-			   *pd,
-			   *pl,
-			   *pm,
-			   *pn;
-	size_t		d1,
-				d2;
-	int			r,
-				presorted;
-
-loop:
-	CHECK_FOR_INTERRUPTS();
-	if (n < 7)
-	{
-		for (pm = a + 1; pm < a + n; pm++)
-			for (pl = pm; pl > a && cmp_$SUFFIX(pl - 1, pl$CMPPARAMS) > 0; pl--)
-				swap(pl, pl - 1);
-		return;
-	}
-	presorted = 1;
-	for (pm = a + 1; pm < a + n; pm++)
-	{
-		CHECK_FOR_INTERRUPTS();
-		if (cmp_$SUFFIX(pm - 1, pm$CMPPARAMS) > 0)
-		{
-			presorted = 0;
-			break;
-		}
-	}
-	if (presorted)
-		return;
-	pm = a + (n / 2);
-	if (n > 7)
-	{
-		pl = a;
-		pn = a + (n - 1);
-		if (n > 40)
-		{
-			size_t		d = (n / 8);
-
-			pl = med3_$SUFFIX(pl, pl + d, pl + 2 * d$EXTRAPARAMS);
-			pm = med3_$SUFFIX(pm - d, pm, pm + d$EXTRAPARAMS);
-			pn = med3_$SUFFIX(pn - 2 * d, pn - d, pn$EXTRAPARAMS);
-		}
-		pm = med3_$SUFFIX(pl, pm, pn$EXTRAPARAMS);
-	}
-	swap(a, pm);
-	pa = pb = a + 1;
-	pc = pd = a + (n - 1);
-	for (;;)
-	{
-		while (pb <= pc && (r = cmp_$SUFFIX(pb, a$CMPPARAMS)) <= 0)
-		{
-			if (r == 0)
-			{
-				swap(pa, pb);
-				pa++;
-			}
-			pb++;
-			CHECK_FOR_INTERRUPTS();
-		}
-		while (pb <= pc && (r = cmp_$SUFFIX(pc, a$CMPPARAMS)) >= 0)
-		{
-			if (r == 0)
-			{
-				swap(pc, pd);
-				pd--;
-			}
-			pc--;
-			CHECK_FOR_INTERRUPTS();
-		}
-		if (pb > pc)
-			break;
-		swap(pb, pc);
-		pb++;
-		pc--;
-	}
-	pn = a + n;
-	d1 = Min(pa - a, pb - pa);
-	vecswap(a, pb - d1, d1);
-	d1 = Min(pd - pc, pn - pd - 1);
-	vecswap(pb, pn - d1, d1);
-	d1 = pb - pa;
-	d2 = pd - pc;
-	if (d1 <= d2)
-	{
-		/* Recurse on left partition, then iterate on right partition */
-		if (d1 > 1)
-			qsort_$SUFFIX(a, d1$EXTRAPARAMS);
-		if (d2 > 1)
-		{
-			/* Iterate rather than recurse to save stack space */
-			/* qsort_$SUFFIX(pn - d2, d2$EXTRAPARAMS); */
-			a = pn - d2;
-			n = d2;
-			goto loop;
-		}
-	}
-	else
-	{
-		/* Recurse on right partition, then iterate on left partition */
-		if (d2 > 1)
-			qsort_$SUFFIX(pn - d2, d2$EXTRAPARAMS);
-		if (d1 > 1)
-		{
-			/* Iterate rather than recurse to save stack space */
-			/* qsort_$SUFFIX(a, d1$EXTRAPARAMS); */
-			n = d1;
-			goto loop;
-		}
-	}
-}
-EOM
-
-	return;
-}
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index d02e676aa3..ecb9241fad 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -655,8 +655,33 @@ static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
  * reduces to ApplySortComparator(), that is single-key MinimalTuple sorts
  * and Datum sorts.
  */
-#include "qsort_tuple.c"
+#define QS_SUFFIX tuple
+#define QS_TYPE SortTuple
+#define QS_EXTRAPARAMS , SortTupleComparator cmp_tuple, Tuplesortstate *state
+#define QS_EXTRAARGS , cmp_tuple, state
+#define QS_CMPARGS , state
+#define QS_CHECK_FOR_INTERRUPTS
+#define QS_CHECK_PRESORTED
+#define QS_SCOPE static
+#define QS_DEFINE
+#include "lib/qsort_template.h"
 
+#define QS_SUFFIX ssup
+#define QS_TYPE SortTuple
+#define QS_EXTRAPARAMS , SortSupport ssup
+#define QS_EXTRAARGS , ssup
+#define QS_CMPARGS , ssup
+#define QS_CHECK_FOR_INTERRUPTS
+#define QS_CHECK_PRESORTED
+#define QS_SCOPE static
+static inline int
+cmp_ssup(SortTuple *a, SortTuple *b, SortSupport ssup)
+{
+	return ApplySortComparator(a->datum1, a->isnull1,
+							   b->datum1, b->isnull1, ssup);
+}
+#define QS_DEFINE
+#include "lib/qsort_template.h"
 
 /*
  *		tuplesort_begin_xxx
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index 909bded592..7cfd0fe863 100644
--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -668,6 +668,16 @@ sub GenerateFiles
 		);
 	}
 
+	if (IsNewer(
+			'src/backend/utils/sort/qsort_tuple.c',
+			'src/backend/utils/sort/gen_qsort_tuple.pl'))
+	{
+		print "Generating qsort_tuple.c...\n";
+		system(
+'perl src/backend/utils/sort/gen_qsort_tuple.pl > src/backend/utils/sort/qsort_tuple.c'
+		);
+	}
+
 	if (IsNewer(
 			'src/interfaces/libpq/libpq.rc',
 			'src/interfaces/libpq/libpq.rc.in'))
diff --git a/src/tools/msvc/clean.bat b/src/tools/msvc/clean.bat
index d034ec5765..b1eb759f52 100755
--- a/src/tools/msvc/clean.bat
+++ b/src/tools/msvc/clean.bat
@@ -61,7 +61,6 @@ if %DIST%==1 if exist src\backend\storage\lmgr\lwlocknames.h del /q src\backend\
 if %DIST%==1 if exist src\pl\plpython\spiexceptions.h del /q src\pl\plpython\spiexceptions.h
 if %DIST%==1 if exist src\pl\plpgsql\src\plerrcodes.h del /q src\pl\plpgsql\src\plerrcodes.h
 if %DIST%==1 if exist src\pl\tcl\pltclerrcodes.h del /q src\pl\tcl\pltclerrcodes.h
-if %DIST%==1 if exist src\backend\utils\sort\qsort_tuple.c del /q src\backend\utils\sort\qsort_tuple.c
 if %DIST%==1 if exist src\bin\psql\sql_help.c del /q src\bin\psql\sql_help.c
 if %DIST%==1 if exist src\bin\psql\sql_help.h del /q src\bin\psql\sql_help.h
 if %DIST%==1 if exist src\common\kwlist_d.h del /q src\common\kwlist_d.h
-- 
2.17.1

#131

Peter Geoghegan

pg@bowt.ie

almost 6 years ago

In reply to: Peter Geoghegan (#130)

3 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Tue, Jan 14, 2020 at 6:08 PM Peter Geoghegan <pg@bowt.ie> wrote:

Still no progress on these items, but I am now posting v30. A new
version seems warranted, because I now want to revive a patch from a
couple of years back as part of the deduplication project -- it would
be good to get feedback on that sooner rather than later.

Actually, I decided that this wasn't necessary -- I won't be touching
compactify_tuples() at all (at least not right now). Deduplication
doesn't even need to use PageIndexMultiDelete() in the attached
revision of the patch, v31, so speeding up compactify_tuples() is no
longer relevant.

v31 simplifies everything quite a bit. This is something that I came
up with more or less as a result of following Heikki's feedback. I
found that reviving the v17 approach of using a temp page buffer in
_bt_dedup_one_page() (much like _bt_split() always has) was a good
idea. This approach was initially revived in order to make dedup WAL
logging work on a whole-page basis -- Heikki suggested we do it that
way, and so now we do. But this approach is also a lot faster in
general, and has additional benefits besides that.

When we abandoned the temp buffer approach back in September of last
year, the unique index stuff was totally immature and unsettled, and
it looked like a very incremental approach might make sense for unique
indexes. It doesn't seem like a good idea now, though. In fact, I no
longer even believe that a custom checkingunique/unique index strategy
in _bt_dedup_one_page() is useful. That is also removed in v31, which
will also make Heikki happy -- he expressed a separate concern about
the extra complexity there.

I've done a lot of optimization work since September, making these
simplification possible now. The problems that I saw that justified
the complexity seem to have gone away now. I'm pretty sure that the
recent _bt_check_unique() posting list tuple _bt_compare()
optimization is the biggest part of that. The checkingunique/unique
index strategy in _bt_dedup_one_page() always felt overfit to my
microbenchmarks, so I'm glad to be rid of it.

Note that v31 changes nothing about how we think about deduplication
in unique indexes in general, nor how it is presented to users. There
is still special criteria around how deduplication is *triggered* in
unique indexes. We continue to trigger a deduplication pass based on
seeing a duplicate within _bt_check_unique() + _bt_findinsertloc() --
otherwise we never attempt deduplication in a unique index (same as
before). Plus the GUC still doesn't affect unique indexes, unique
index deduplication still isn't really documented in the user docs (it
just gets a passing mention in B-Tree internals section), etc.

In my opinion, the patch is now pretty close to being committable. I
do have two outstanding open items for the patch, though. These items
are:

* We still need infrastructure that marks B-Tree opclasses as safe for
deduplication, to avoid things like the numeric display scale problem,
collations that are unsafe for deduplication because they're
nondeterministic, etc.

I talked to Anastasia about this over private e-mail recently. This is
going well; I'm expecting a revision later this week. It will be based
on all feedback to date over on the other thread [1]/messages/by-id/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com -- Peter Geoghegan that we have for
this part of the project.

* Make VACUUM's WAL record more space efficient when it contains one
or more "updates" to an existing posting list tuple.

Currently, when VACUUM must delete some but not all TIDs from a
posting list, we generate a new posting list tuple and dump it into
the WAL stream -- the REDO routine simply overwrites the existing item
with a version lacking the TIDs that have to go. This could be space
inefficient with certain workloads, such as workloads where only one
or two TIDs are deleted from a very large posting list tuple again and
again. Heikki suggested I do something about this. I intend to at
least research the problem, and can probably go ahead with
implementing it without any trouble.

What nbtree VACUUM does in the patch right now is roughly the same as
what GIN's VACUUM does for posting lists within posting tree pages --
see ginVacuumPostingTreeLeaf() (we're horribly inefficient about WAL
logging when VACUUM'ing a GIN entry tree leaf page, which works
differently, and isn't what I'm talking about -- see
ginVacuumEntryPage()). We might as well do better than
GIN/ginVacuumPostingTreeLeaf() here if we can.

The patch is pretty clever about minimizing the volume of WAL in all
other contexts, managing to avoid any other case of what could be
described as "WAL space amplification". Maybe we should do the same
with the xl_btree_vacuum record just to be *consistent* about it.

[1]: /messages/by-id/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com -- Peter Geoghegan
--
Peter Geoghegan

Attachments:

v31-0002-Teach-pageinspect-about-nbtree-posting-lists.patchapplication/x-patch; name=v31-0002-Teach-pageinspect-about-nbtree-posting-lists.patchDownload

From d2caa91af80b9564a88fb806d63ad10cbd530ace Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v31 2/3] Teach pageinspect about nbtree posting lists.

Add a column for posting list TIDs to bt_page_items().  Also add a
column that displays a single heap TID value for each tuple, regardless
of whether or not "ctid" is used for heap TID.  In the case of posting
list tuples, the value is the lowest heap TID in the posting list.
Arguably I should have done this when commit dd299df8 went in, since
that added a pivot tuple representation that could have a heap TID but
didn't use ctid for that purpose.

Also add a boolean column that displays the LP_DEAD bit value for each
non-pivot tuple.

No version bump for the pageinspect extension, since there hasn't been a
stable release since the last version bump (see commit 58b4cb30).
---
 contrib/pageinspect/btreefuncs.c              | 118 +++++++++++++++---
 contrib/pageinspect/expected/btree.out        |   7 ++
 contrib/pageinspect/pageinspect--1.7--1.8.sql |  53 ++++++++
 doc/src/sgml/pageinspect.sgml                 |  83 ++++++------
 4 files changed, 206 insertions(+), 55 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 78cdc69ec7..1b2ea14122 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -31,9 +31,11 @@
 #include "access/relation.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_am.h"
+#include "catalog/pg_type.h"
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "pageinspect.h"
+#include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
 #include "utils/varlena.h"
@@ -45,6 +47,8 @@ PG_FUNCTION_INFO_V1(bt_page_stats);
 
 #define IS_INDEX(r) ((r)->rd_rel->relkind == RELKIND_INDEX)
 #define IS_BTREE(r) ((r)->rd_rel->relam == BTREE_AM_OID)
+#define DatumGetItemPointer(X)	 ((ItemPointer) DatumGetPointer(X))
+#define ItemPointerGetDatum(X)	 PointerGetDatum(X)
 
 /* note: BlockNumber is unsigned, hence can't be negative */
 #define CHECK_RELATION_BLOCK_RANGE(rel, blkno) { \
@@ -243,6 +247,9 @@ struct user_args
 {
 	Page		page;
 	OffsetNumber offset;
+	bool		leafpage;
+	bool		rightmost;
+	TupleDesc	tupd;
 };
 
 /*-------------------------------------------------------
@@ -252,17 +259,25 @@ struct user_args
  * ------------------------------------------------------
  */
 static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, struct user_args *uargs)
 {
-	char	   *values[6];
+	Page		page = uargs->page;
+	OffsetNumber offset = uargs->offset;
+	bool		leafpage = uargs->leafpage;
+	bool		rightmost = uargs->rightmost;
+	bool		pivotoffset;
+	Datum		values[9];
+	bool		nulls[9];
 	HeapTuple	tuple;
 	ItemId		id;
 	IndexTuple	itup;
 	int			j;
 	int			off;
 	int			dlen;
-	char	   *dump;
+	char	   *dump,
+			   *datacstring;
 	char	   *ptr;
+	ItemPointer htid;
 
 	id = PageGetItemId(page, offset);
 
@@ -272,18 +287,27 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	itup = (IndexTuple) PageGetItem(page, id);
 
 	j = 0;
-	values[j++] = psprintf("%d", offset);
-	values[j++] = psprintf("(%u,%u)",
-						   ItemPointerGetBlockNumberNoCheck(&itup->t_tid),
-						   ItemPointerGetOffsetNumberNoCheck(&itup->t_tid));
-	values[j++] = psprintf("%d", (int) IndexTupleSize(itup));
-	values[j++] = psprintf("%c", IndexTupleHasNulls(itup) ? 't' : 'f');
-	values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
+	memset(nulls, 0, sizeof(nulls));
+	values[j++] = DatumGetInt16(offset);
+	values[j++] = ItemPointerGetDatum(&itup->t_tid);
+	values[j++] = Int32GetDatum((int) IndexTupleSize(itup));
+	values[j++] = BoolGetDatum(IndexTupleHasNulls(itup));
+	values[j++] = BoolGetDatum(IndexTupleHasVarwidths(itup));
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
 	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+
+	/*
+	 * Make sure that "data" column does not include posting list or pivot
+	 * tuple representation of heap TID
+	 */
+	if (BTreeTupleIsPosting(itup))
+		dlen -= IndexTupleSize(itup) - BTreeTupleGetPostingOffset(itup);
+	else if (BTreeTupleIsPivot(itup) && BTreeTupleGetHeapTID(itup) != NULL)
+		dlen -= MAXALIGN(sizeof(ItemPointerData));
+
 	dump = palloc0(dlen * 3 + 1);
-	values[j] = dump;
+	datacstring = dump;
 	for (off = 0; off < dlen; off++)
 	{
 		if (off > 0)
@@ -291,8 +315,57 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 		sprintf(dump, "%02x", *(ptr + off) & 0xff);
 		dump += 2;
 	}
+	values[j++] = CStringGetTextDatum(datacstring);
+	pfree(datacstring);
 
-	tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
+	/*
+	 * Avoid indicating that pivot tuple from !heapkeyspace index (which won't
+	 * have v4+ status bit set) is dead or has a heap TID -- that can only
+	 * happen with non-pivot tuples.  (Most backend code can use the
+	 * heapkeyspace field from the metapage to figure out which representation
+	 * to expect, but we have to be a bit creative here.)
+	 */
+	pivotoffset = (!leafpage || (!rightmost && offset == P_HIKEY));
+
+	/* LP_DEAD status bit */
+	if (!pivotoffset)
+		values[j++] = BoolGetDatum(ItemIdIsDead(id));
+	else
+		nulls[j++] = true;
+
+	htid = BTreeTupleGetHeapTID(itup);
+	if (pivotoffset && !BTreeTupleIsPivot(itup))
+		htid = NULL;
+
+	if (htid)
+		values[j++] = ItemPointerGetDatum(htid);
+	else
+		nulls[j++] = true;
+
+	if (BTreeTupleIsPosting(itup))
+	{
+		/* build an array of item pointers */
+		ItemPointer tids;
+		Datum	   *tids_datum;
+		int			nposting;
+
+		tids = BTreeTupleGetPosting(itup);
+		nposting = BTreeTupleGetNPosting(itup);
+		tids_datum = (Datum *) palloc(nposting * sizeof(Datum));
+		for (int i = 0; i < nposting; i++)
+			tids_datum[i] = ItemPointerGetDatum(&tids[i]);
+		values[j++] = PointerGetDatum(construct_array(tids_datum,
+													  nposting,
+													  TIDOID,
+													  sizeof(ItemPointerData),
+													  false, 's'));
+		pfree(tids_datum);
+	}
+	else
+		nulls[j++] = true;
+
+	/* Build and return the result tuple */
+	tuple = heap_form_tuple(uargs->tupd, values, nulls);
 
 	return HeapTupleGetDatum(tuple);
 }
@@ -378,12 +451,13 @@ bt_page_items(PG_FUNCTION_ARGS)
 			elog(NOTICE, "page is deleted");
 
 		fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+		uargs->leafpage = P_ISLEAF(opaque);
+		uargs->rightmost = P_RIGHTMOST(opaque);
 
 		/* Build a tuple descriptor for our result type */
 		if (get_call_result_type(fcinfo, NULL, &tupleDesc) != TYPEFUNC_COMPOSITE)
 			elog(ERROR, "return type must be a row type");
-
-		fctx->attinmeta = TupleDescGetAttInMetadata(tupleDesc);
+		uargs->tupd = tupleDesc;
 
 		fctx->user_fctx = uargs;
 
@@ -395,7 +469,7 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
@@ -463,12 +537,13 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 			elog(NOTICE, "page is deleted");
 
 		fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+		uargs->leafpage = P_ISLEAF(opaque);
+		uargs->rightmost = P_RIGHTMOST(opaque);
 
 		/* Build a tuple descriptor for our result type */
 		if (get_call_result_type(fcinfo, NULL, &tupleDesc) != TYPEFUNC_COMPOSITE)
 			elog(ERROR, "return type must be a row type");
-
-		fctx->attinmeta = TupleDescGetAttInMetadata(tupleDesc);
+		uargs->tupd = tupleDesc;
 
 		fctx->user_fctx = uargs;
 
@@ -480,7 +555,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
@@ -510,7 +585,7 @@ bt_metap(PG_FUNCTION_ARGS)
 	BTMetaPageData *metad;
 	TupleDesc	tupleDesc;
 	int			j;
-	char	   *values[8];
+	char	   *values[9];
 	Buffer		buffer;
 	Page		page;
 	HeapTuple	tuple;
@@ -557,17 +632,20 @@ bt_metap(PG_FUNCTION_ARGS)
 
 	/*
 	 * Get values of extended metadata if available, use default values
-	 * otherwise.
+	 * otherwise.  Note that we rely on the assumption that btm_safededup is
+	 * initialized to zero on databases that were initdb'd before Postgres 13.
 	 */
 	if (metad->btm_version >= BTREE_NOVAC_VERSION)
 	{
 		values[j++] = psprintf("%u", metad->btm_oldest_btpo_xact);
 		values[j++] = psprintf("%f", metad->btm_last_cleanup_num_heap_tuples);
+		values[j++] = metad->btm_safededup ? "t" : "f";
 	}
 	else
 	{
 		values[j++] = "0";
 		values[j++] = "-1";
+		values[j++] = "f";
 	}
 
 	tuple = BuildTupleFromCStrings(TupleDescGetAttInMetadata(tupleDesc),
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..92d5c59654 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -12,6 +12,7 @@ fastroot                | 1
 fastlevel               | 0
 oldest_xact             | 0
 last_cleanup_num_tuples | -1
+safededup               | t
 
 SELECT * FROM bt_page_stats('test1_a_idx', 0);
 ERROR:  block 0 is a meta page
@@ -41,6 +42,9 @@ itemlen    | 16
 nulls      | f
 vars       | f
 data       | 01 00 00 00 00 00 00 01
+dead       | f
+htid       | (0,1)
+tids       | 
 
 SELECT * FROM bt_page_items('test1_a_idx', 2);
 ERROR:  block number out of range
@@ -54,6 +58,9 @@ itemlen    | 16
 nulls      | f
 vars       | f
 data       | 01 00 00 00 00 00 00 01
+dead       | f
+htid       | (0,1)
+tids       | 
 
 SELECT * FROM bt_page_items(get_raw_page('test1_a_idx', 2));
 ERROR:  block number 2 is out of range for relation "test1_a_idx"
diff --git a/contrib/pageinspect/pageinspect--1.7--1.8.sql b/contrib/pageinspect/pageinspect--1.7--1.8.sql
index 2a7c4b3516..93ea37cde3 100644
--- a/contrib/pageinspect/pageinspect--1.7--1.8.sql
+++ b/contrib/pageinspect/pageinspect--1.7--1.8.sql
@@ -14,3 +14,56 @@ CREATE FUNCTION heap_tuple_infomask_flags(
 RETURNS record
 AS 'MODULE_PATHNAME', 'heap_tuple_infomask_flags'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_metap()
+--
+DROP FUNCTION bt_metap(text);
+CREATE FUNCTION bt_metap(IN relname text,
+    OUT magic int4,
+    OUT version int4,
+    OUT root int4,
+    OUT level int4,
+    OUT fastroot int4,
+    OUT fastlevel int4,
+    OUT oldest_xact int4,
+    OUT last_cleanup_num_tuples real,
+    OUT safededup boolean)
+AS 'MODULE_PATHNAME', 'bt_metap'
+LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items(text, int4)
+--
+DROP FUNCTION bt_page_items(text, int4);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT dead boolean,
+    OUT htid tid,
+    OUT tids tid[])
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items(bytea)
+--
+DROP FUNCTION bt_page_items(bytea);
+CREATE FUNCTION bt_page_items(IN page bytea,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT dead boolean,
+    OUT htid tid,
+    OUT tids tid[])
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items_bytea'
+LANGUAGE C STRICT PARALLEL SAFE;
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 7e2e1487d7..b527daf6ca 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -300,13 +300,14 @@ test=# SELECT t_ctid, raw_flags, combined_flags
 test=# SELECT * FROM bt_metap('pg_cast_oid_index');
 -[ RECORD 1 ]-----------+-------
 magic                   | 340322
-version                 | 3
+version                 | 4
 root                    | 1
 level                   | 0
 fastroot                | 1
 fastlevel               | 0
 oldest_xact             | 582
 last_cleanup_num_tuples | 1000
+safededup               | f
 </screen>
      </para>
     </listitem>
@@ -329,11 +330,11 @@ test=# SELECT * FROM bt_page_stats('pg_cast_oid_index', 1);
 -[ RECORD 1 ]-+-----
 blkno         | 1
 type          | l
-live_items    | 256
+live_items    | 224
 dead_items    | 0
-avg_item_size | 12
+avg_item_size | 16
 page_size     | 8192
-free_size     | 4056
+free_size     | 3668
 btpo_prev     | 0
 btpo_next     | 0
 btpo          | 0
@@ -356,33 +357,45 @@ btpo_flags    | 3
       <function>bt_page_items</function> returns detailed information about
       all of the items on a B-tree index page.  For example:
 <screen>
-test=# SELECT * FROM bt_page_items('pg_cast_oid_index', 1);
- itemoffset |  ctid   | itemlen | nulls | vars |    data
-------------+---------+---------+-------+------+-------------
-          1 | (0,1)   |      12 | f     | f    | 23 27 00 00
-          2 | (0,2)   |      12 | f     | f    | 24 27 00 00
-          3 | (0,3)   |      12 | f     | f    | 25 27 00 00
-          4 | (0,4)   |      12 | f     | f    | 26 27 00 00
-          5 | (0,5)   |      12 | f     | f    | 27 27 00 00
-          6 | (0,6)   |      12 | f     | f    | 28 27 00 00
-          7 | (0,7)   |      12 | f     | f    | 29 27 00 00
-          8 | (0,8)   |      12 | f     | f    | 2a 27 00 00
+regression=# SELECT * FROM bt_page_items('tenk2_unique1', 5);
+ itemoffset |   ctid   | itemlen | nulls | vars |          data           | dead |   htid   | tids
+------------+----------+---------+-------+------+-------------------------+------+----------+------
+          1 | (40,1)   |      16 | f     | f    | b8 05 00 00 00 00 00 00 |      |          |
+          2 | (58,11)  |      16 | f     | f    | 4a 04 00 00 00 00 00 00 | f    | (58,11)  |
+          3 | (266,4)  |      16 | f     | f    | 4b 04 00 00 00 00 00 00 | f    | (266,4)  |
+          4 | (279,25) |      16 | f     | f    | 4c 04 00 00 00 00 00 00 | f    | (279,25) |
+          5 | (333,11) |      16 | f     | f    | 4d 04 00 00 00 00 00 00 | f    | (333,11) |
+          6 | (87,24)  |      16 | f     | f    | 4e 04 00 00 00 00 00 00 | f    | (87,24)  |
+          7 | (38,22)  |      16 | f     | f    | 4f 04 00 00 00 00 00 00 | f    | (38,22)  |
+          8 | (272,17) |      16 | f     | f    | 50 04 00 00 00 00 00 00 | f    | (272,17) |
 </screen>
-      In a B-tree leaf page, <structfield>ctid</structfield> points to a heap tuple.
-      In an internal page, the block number part of <structfield>ctid</structfield>
-      points to another page in the index itself, while the offset part
-      (the second number) is ignored and is usually 1.
+      In a B-tree leaf page, <structfield>ctid</structfield> usually
+      points to a heap tuple, and <structfield>dead</structfield> may
+      indicate that the item has its <literal>LP_DEAD</literal> bit
+      set.  In an internal page, the block number part of
+      <structfield>ctid</structfield> points to another page in the
+      index itself, while the offset part (the second number) encodes
+      metadata about the tuple.  Posting list tuples on leaf pages
+      also use <structfield>ctid</structfield> for metadata.
+      <structfield>htid</structfield> always shows a single heap TID
+      for the tuple, regardless of how it is represented (internal
+      page tuples may need to store a heap TID when there are many
+      duplicate tuples on descendent leaf pages).
+      <structfield>tids</structfield> is a list of TIDs that is stored
+      within posting list tuples (tuples created by deduplication).
      </para>
      <para>
       Note that the first item on any non-rightmost page (any page with
       a non-zero value in the <structfield>btpo_next</structfield> field) is the
       page's <quote>high key</quote>, meaning its <structfield>data</structfield>
       serves as an upper bound on all items appearing on the page, while
-      its <structfield>ctid</structfield> field is meaningless.  Also, on non-leaf
-      pages, the first real data item (the first item that is not a high
-      key) is a <quote>minus infinity</quote> item, with no actual value
-      in its <structfield>data</structfield> field.  Such an item does have a valid
-      downlink in its <structfield>ctid</structfield> field, however.
+      its <structfield>ctid</structfield> field does not point to
+      another block.  Also, on non-leaf pages, the first real data item
+      (the first item that is not a high key) is a <quote>minus
+      infinity</quote> item, with no actual value in its
+      <structfield>data</structfield> field.  Such an item does have a
+      valid downlink in its <structfield>ctid</structfield> field,
+      however.
      </para>
     </listitem>
    </varlistentry>
@@ -402,17 +415,17 @@ test=# SELECT * FROM bt_page_items('pg_cast_oid_index', 1);
       with <function>get_raw_page</function> should be passed as argument.  So
       the last example could also be rewritten like this:
 <screen>
-test=# SELECT * FROM bt_page_items(get_raw_page('pg_cast_oid_index', 1));
- itemoffset |  ctid   | itemlen | nulls | vars |    data
-------------+---------+---------+-------+------+-------------
-          1 | (0,1)   |      12 | f     | f    | 23 27 00 00
-          2 | (0,2)   |      12 | f     | f    | 24 27 00 00
-          3 | (0,3)   |      12 | f     | f    | 25 27 00 00
-          4 | (0,4)   |      12 | f     | f    | 26 27 00 00
-          5 | (0,5)   |      12 | f     | f    | 27 27 00 00
-          6 | (0,6)   |      12 | f     | f    | 28 27 00 00
-          7 | (0,7)   |      12 | f     | f    | 29 27 00 00
-          8 | (0,8)   |      12 | f     | f    | 2a 27 00 00
+regression=# SELECT * FROM bt_page_items(get_raw_page('tenk2_unique1', 5));
+ itemoffset |   ctid   | itemlen | nulls | vars |          data           | dead |   htid   | tids
+------------+----------+---------+-------+------+-------------------------+------+----------+------
+          1 | (40,1)   |      16 | f     | f    | b8 05 00 00 00 00 00 00 |      |          |
+          2 | (58,11)  |      16 | f     | f    | 4a 04 00 00 00 00 00 00 | f    | (58,11)  |
+          3 | (266,4)  |      16 | f     | f    | 4b 04 00 00 00 00 00 00 | f    | (266,4)  |
+          4 | (279,25) |      16 | f     | f    | 4c 04 00 00 00 00 00 00 | f    | (279,25) |
+          5 | (333,11) |      16 | f     | f    | 4d 04 00 00 00 00 00 00 | f    | (333,11) |
+          6 | (87,24)  |      16 | f     | f    | 4e 04 00 00 00 00 00 00 | f    | (87,24)  |
+          7 | (38,22)  |      16 | f     | f    | 4f 04 00 00 00 00 00 00 | f    | (38,22)  |
+          8 | (272,17) |      16 | f     | f    | 50 04 00 00 00 00 00 00 | f    | (272,17) |
 </screen>
       All the other details are the same as explained in the previous item.
      </para>
-- 
2.17.1

v31-0003-DEBUG-Show-index-values-in-pageinspect.patchapplication/x-patch; name=v31-0003-DEBUG-Show-index-values-in-pageinspect.patchDownload

From ebe0eb9492a21dc9acda51ca77df0b720055fa8d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Nov 2019 19:35:30 -0800
Subject: [PATCH v31 3/3] DEBUG: Show index values in pageinspect

This is not intended for commit.  It is included as a convenience for
reviewers.
---
 contrib/pageinspect/btreefuncs.c       | 65 ++++++++++++++++++--------
 contrib/pageinspect/expected/btree.out |  2 +-
 2 files changed, 47 insertions(+), 20 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 1b2ea14122..fc1252455d 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -27,6 +27,7 @@
 
 #include "postgres.h"
 
+#include "access/genam.h"
 #include "access/nbtree.h"
 #include "access/relation.h"
 #include "catalog/namespace.h"
@@ -245,6 +246,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
  */
 struct user_args
 {
+	Relation	rel;
 	Page		page;
 	OffsetNumber offset;
 	bool		leafpage;
@@ -261,6 +263,7 @@ struct user_args
 static Datum
 bt_page_print_tuples(FuncCallContext *fctx, struct user_args *uargs)
 {
+	Relation	rel = uargs->rel;
 	Page		page = uargs->page;
 	OffsetNumber offset = uargs->offset;
 	bool		leafpage = uargs->leafpage;
@@ -295,26 +298,48 @@ bt_page_print_tuples(FuncCallContext *fctx, struct user_args *uargs)
 	values[j++] = BoolGetDatum(IndexTupleHasVarwidths(itup));
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
-	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
-
-	/*
-	 * Make sure that "data" column does not include posting list or pivot
-	 * tuple representation of heap TID
-	 */
-	if (BTreeTupleIsPosting(itup))
-		dlen -= IndexTupleSize(itup) - BTreeTupleGetPostingOffset(itup);
-	else if (BTreeTupleIsPivot(itup) && BTreeTupleGetHeapTID(itup) != NULL)
-		dlen -= MAXALIGN(sizeof(ItemPointerData));
-
-	dump = palloc0(dlen * 3 + 1);
-	datacstring = dump;
-	for (off = 0; off < dlen; off++)
+	if (rel)
 	{
-		if (off > 0)
-			*dump++ = ' ';
-		sprintf(dump, "%02x", *(ptr + off) & 0xff);
-		dump += 2;
+		TupleDesc	itupdesc = RelationGetDescr(rel);
+		Datum		datvalues[INDEX_MAX_KEYS];
+		bool		isnull[INDEX_MAX_KEYS];
+		int			natts;
+		int			indnkeyatts = rel->rd_index->indnkeyatts;
+
+		natts = BTreeTupleGetNAtts(itup, rel);
+
+		itupdesc->natts = Min(indnkeyatts, natts);
+		memset(&isnull, 0xFF, sizeof(isnull));
+		index_deform_tuple(itup, itupdesc, datvalues, isnull);
+		rel->rd_index->indnkeyatts = natts;
+		datacstring = BuildIndexValueDescription(rel, datvalues, isnull);
+		itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+		rel->rd_index->indnkeyatts = indnkeyatts;
 	}
+	else
+	{
+		dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+
+		/*
+		 * Make sure that "data" column does not include posting list or pivot
+		 * tuple representation of heap TID
+		 */
+		if (BTreeTupleIsPosting(itup))
+			dlen -= IndexTupleSize(itup) - BTreeTupleGetPostingOffset(itup);
+		else if (BTreeTupleIsPivot(itup) && BTreeTupleGetHeapTID(itup) != NULL)
+			dlen -= MAXALIGN(sizeof(ItemPointerData));
+
+		dump = palloc0(dlen * 3 + 1);
+		datacstring = dump;
+		for (off = 0; off < dlen; off++)
+		{
+			if (off > 0)
+				*dump++ = ' ';
+			sprintf(dump, "%02x", *(ptr + off) & 0xff);
+			dump += 2;
+		}
+	}
+
 	values[j++] = CStringGetTextDatum(datacstring);
 	pfree(datacstring);
 
@@ -437,11 +462,11 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = rel;
 		uargs->page = palloc(BLCKSZ);
 		memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
 
 		UnlockReleaseBuffer(buffer);
-		relation_close(rel, AccessShareLock);
 
 		uargs->offset = FirstOffsetNumber;
 
@@ -475,6 +500,7 @@ bt_page_items(PG_FUNCTION_ARGS)
 	}
 	else
 	{
+		relation_close(uargs->rel, AccessShareLock);
 		pfree(uargs->page);
 		pfree(uargs);
 		SRF_RETURN_DONE(fctx);
@@ -522,6 +548,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = NULL;
 		uargs->page = VARDATA(raw_page);
 
 		uargs->offset = FirstOffsetNumber;
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 92d5c59654..fc6794ef65 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -41,7 +41,7 @@ ctid       | (0,1)
 itemlen    | 16
 nulls      | f
 vars       | f
-data       | 01 00 00 00 00 00 00 01
+data       | (a)=(72057594037927937)
 dead       | f
 htid       | (0,1)
 tids       | 
-- 
2.17.1

v31-0001-Add-deduplication-to-nbtree.patchapplication/x-patch; name=v31-0001-Add-deduplication-to-nbtree.patchDownload

From 5db7d3c506ced15cac1bce505028b12072f9eb1f Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 25 Jan 2020 14:40:46 -0800
Subject: [PATCH v31 1/3] Add deduplication to nbtree.

Deduplication reduces the storage overhead of duplicates in indexes that
use the standard nbtree index access method.  The deduplication process
is applied lazily, after the point where opportunistic deletion of
LP_DEAD-marked index tuples occurs.  Deduplication is only applied at
the point where a leaf page split would otherwise be required.  New
"posting list tuples" are formed by merging together existing duplicate
tuples.  The physical representation of the items on an nbtree leaf page
is made more space efficient by deduplication, but the logical contents
of the page are not changed.

Testing has shown that nbtree deduplication can generally make indexes
with about 10 or 15 tuples for each distinct key value about 2.5X - 4X
smaller, even with single column integer indexes (e.g., an index on a
referencing column that accompanies a foreign key).  The final size of
single column nbtree indexes comes close to the final size of a similar
contrib/btree_gin index, at least in cases where GIN's posting list
compression isn't very effective.

The lazy approach taken by nbtree has significant advantages over a
GIN style eager approach.  Most individual inserts of index tuples have
exactly the same overhead as before.  The extra overhead of
deduplication is amortized across insertions, just like the overhead of
page splits.  The key space of indexes works in the same way as it has
since commit dd299df8 (the commit which made heap TID a tiebreaker
column), since only the physical representation of tuples is changed.
Furthermore, deduplication can easily be turned on or off.  The split
point choice logic doesn't need to be taught about posting lists, since
posting list tuples are just tuples with payload, much like tuples with
non-key columns in INCLUDE indexes.  (nbtsplitloc.c is taught about
posting lists, but this is only an optimization.)

In general, nbtree unique indexes sometimes need to store multiple equal
(non-NULL) tuples for the same logical row (one per physical row
version).  Unique indexes can use deduplication specifically to merge
together multiple physical versions (index tuples).  The high-level goal
with unique indexes is to prevent "unnecessary" page splits -- splits
caused only by a short term burst of index tuple versions.  This is
often a concern with frequently updated tables where UPDATEs always
modify at least one indexed column (making it impossible for the table
am to use an optimization like heapam's heap-only tuples optimization).
Deduplication in unique indexes effectively "buys time" for existing
nbtree garbage collection mechanisms to run and prevent these page
splits.

Deduplication in non-unique indexes is controlled by a new GUC,
deduplicate_btree_items.  A new storage parameter (deduplicate_items) is
also added, which controls deduplication at the index relation
granularity.  It can be used to disable deduplication in unique indexes
for debugging purposes. (The general criteria for applying deduplication
in unique indexes ensures that only cases with some duplicates will
actually get a deduplication pass -- that's why unique indexes are not
affected by the deduplicate_btree_items GUC.)

Since posting list tuples have only one line pointer (just like any
other tuple), they have only one LP_DEAD bit.  The LP_DEAD bit can still
be set by both unique checking and the kill_prior_tuple optimization,
but only when all heap TIDs are dead-to-all.  This "loss of granularity"
for LP_DEAD bits is considered an acceptable downside of the
deduplication design.  We always prefer deleting LP_DEAD items to a
deduplication pass, and a deduplication pass can only take place at the
point where we'd previously have had to split the page, so any workload
that pays a cost here must also get a significant benefit.

Bump XLOG_PAGE_MAGIC because xl_btree_vacuum changed.

No bump in BTREE_VERSION, since deduplication only affects the physical
representation of tuples.  However, users must still REINDEX a
pg_upgrade'd index to before its leaf page splits will apply
deduplication.  An index build is the only way to set the new nbtree
metapage flag indicating that deduplication is generally safe.

Author: Anastasia Lubennikova, Peter Geoghegan
Reviewed-By: Peter Geoghegan, Heikki Linnakangas
Discussion:
    https://postgr.es/m/55E4051B.7020209@postgrespro.ru
    https://postgr.es/m/4ab6e2db-bcee-f4cf-0916-3a06e6ccbb55@postgrespro.ru
---
 src/include/access/nbtree.h                   | 421 ++++++++--
 src/include/access/nbtxlog.h                  |  97 ++-
 src/include/access/rmgrlist.h                 |   2 +-
 src/backend/access/common/reloptions.c        |   9 +
 src/backend/access/index/genam.c              |   4 +
 src/backend/access/nbtree/Makefile            |   1 +
 src/backend/access/nbtree/README              | 133 ++-
 src/backend/access/nbtree/nbtdedup.c          | 759 ++++++++++++++++++
 src/backend/access/nbtree/nbtinsert.c         | 397 +++++++--
 src/backend/access/nbtree/nbtpage.c           | 223 ++++-
 src/backend/access/nbtree/nbtree.c            | 180 ++++-
 src/backend/access/nbtree/nbtsearch.c         | 271 ++++++-
 src/backend/access/nbtree/nbtsort.c           | 190 ++++-
 src/backend/access/nbtree/nbtsplitloc.c       |  39 +-
 src/backend/access/nbtree/nbtutils.c          | 226 +++++-
 src/backend/access/nbtree/nbtxlog.c           | 235 +++++-
 src/backend/access/rmgrdesc/nbtdesc.c         |  22 +-
 src/backend/storage/page/bufpage.c            |   9 +-
 src/backend/utils/misc/guc.c                  |  10 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/bin/psql/tab-complete.c                   |   4 +-
 contrib/amcheck/verify_nbtree.c               | 231 +++++-
 doc/src/sgml/btree.sgml                       | 120 ++-
 doc/src/sgml/charset.sgml                     |   9 +-
 doc/src/sgml/config.sgml                      |  25 +
 doc/src/sgml/ref/create_index.sgml            |  38 +-
 src/test/regress/expected/btree_index.out     |  16 +
 src/test/regress/sql/btree_index.sql          |  17 +
 28 files changed, 3369 insertions(+), 320 deletions(-)
 create mode 100644 src/backend/access/nbtree/nbtdedup.c

diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 20ace69dab..075e3ae623 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -24,6 +24,9 @@
 #include "storage/bufmgr.h"
 #include "storage/shm_toc.h"
 
+/* GUC parameter */
+extern bool deduplicate_btree_items;
+
 /* There's room for a 16-bit vacuum cycle ID in BTPageOpaqueData */
 typedef uint16 BTCycleId;
 
@@ -108,6 +111,7 @@ typedef struct BTMetaPageData
 										 * pages */
 	float8		btm_last_cleanup_num_heap_tuples;	/* number of heap tuples
 													 * during last cleanup */
+	bool		btm_safededup;	/* deduplication known to be safe? */
 } BTMetaPageData;
 
 #define BTPageGetMeta(p) \
@@ -124,6 +128,13 @@ typedef struct BTMetaPageData
  * need to be immediately re-indexed at pg_upgrade.  In order to get the
  * new heapkeyspace semantics, however, a REINDEX is needed.
  *
+ * Deduplication is safe to use when the btm_safededup field is set to
+ * true.  It's safe to read the btm_safededup field on version 3, but only
+ * version 4 indexes make use of deduplication.  Even version 4 indexes
+ * created on PostgreSQL v12 will need a REINDEX to make use of
+ * deduplication, though, since there is no other way to set btm_safededup
+ * to true (pg_upgrade hasn't been taught to set the metapage field).
+ *
  * Btree version 2 is mostly the same as version 3.  There are two new
  * fields in the metapage that were introduced in version 3.  A version 2
  * metapage will be automatically upgraded to version 3 on the first
@@ -156,6 +167,21 @@ typedef struct BTMetaPageData
 				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
 				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
 
+/*
+ * MaxTIDsPerBTreePage is an upper bound on the number of heap TIDs tuples
+ * that may be stored on a btree leaf page.  It is used to size the
+ * per-page temporary buffers used by index scans.)
+ *
+ * Note: we don't bother considering per-tuple overheads here to keep
+ * things simple (value is based on how many elements a single array of
+ * heap TIDs must have to fill the space between the page header and
+ * special area).  The value is slightly higher (i.e. more conservative)
+ * than necessary as a result, which is considered acceptable.
+ */
+#define MaxTIDsPerBTreePage \
+	(int) ((BLCKSZ - SizeOfPageHeaderData - sizeof(BTPageOpaqueData)) / \
+		   sizeof(ItemPointerData))
+
 /*
  * The leaf-page fillfactor defaults to 90% but is user-adjustable.
  * For pages above the leaf level, we use a fixed 70% fillfactor.
@@ -230,16 +256,15 @@ typedef struct BTMetaPageData
  * tuples (non-pivot tuples).  _bt_check_natts() enforces the rules
  * described here.
  *
- * Non-pivot tuple format:
+ * Non-pivot tuple format (plain/non-posting variant):
  *
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
- * All other types of index tuples ("pivot" tuples) only have key columns,
- * since pivot tuples only exist to represent how the key space is
+ * Non-pivot tuples complement pivot tuples, which only have key columns.
+ * The sole purpose of pivot tuples is to represent how the key space is
  * separated.  In general, any B-Tree index that has more than one level
  * (i.e. any index that does not just consist of a metapage and a single
  * leaf root page) must have some number of pivot tuples, since pivot
@@ -264,7 +289,8 @@ typedef struct BTMetaPageData
  * INDEX_ALT_TID_MASK bit is set, which doesn't count the trailing heap
  * TID column sometimes stored in pivot tuples -- that's represented by
  * the presence of BT_PIVOT_HEAP_TID_ATTR.  The INDEX_ALT_TID_MASK bit in
- * t_info is always set on BTREE_VERSION 4 pivot tuples.
+ * t_info is always set on BTREE_VERSION 4 pivot tuples, since
+ * BTreeTupleIsPivot() must work reliably on heapkeyspace versions.
  *
  * In version 3 indexes, the INDEX_ALT_TID_MASK flag might not be set in
  * pivot tuples.  In that case, the number of key columns is implicitly
@@ -279,90 +305,256 @@ typedef struct BTMetaPageData
  * The 12 least significant offset bits from t_tid are used to represent
  * the number of columns in INDEX_ALT_TID_MASK tuples, leaving 4 status
  * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
- * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
- * number of columns/attributes <= INDEX_MAX_KEYS.
+ * future use.  BT_OFFSET_MASK should be large enough to store any number
+ * of columns/attributes <= INDEX_MAX_KEYS.
+ *
+ * Sometimes non-pivot tuples also use a representation that repurposes
+ * t_tid to store metadata rather than a TID.  PostgreSQL v13 introduced a
+ * new non-pivot tuple format to support deduplication: posting list
+ * tuples.  Deduplication merges together multiple equal non-pivot tuples
+ * into a logically equivalent, space efficient representation.  A posting
+ * list is an array of ItemPointerData elements.  Non-pivot tuples are
+ * merged together to form posting list tuples lazily, at the point where
+ * we'd otherwise have to split a leaf page.
+ *
+ * Posting tuple format (alternative non-pivot tuple representation):
+ *
+ *  t_tid | t_info | key values | posting list (TID array)
+ *
+ * Posting list tuples are recognized as such by having the
+ * INDEX_ALT_TID_MASK status bit set in t_info and the BT_IS_POSTING status
+ * bit set in t_tid.  These flags redefine the content of the posting
+ * tuple's t_tid to store an offset to the posting list, as well as the
+ * total number of posting list array elements.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items present in the tuple, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.  Like any non-pivot tuple, the number of columns stored is
+ * always implicitly the total number in the index (in practice there can
+ * never be non-key columns stored, since deduplication is not supported
+ * with INCLUDE indexes).  BT_OFFSET_MASK should be large enough to store
+ * any number of posting list TIDs that might be present in a tuple (since
+ * tuple size is subject to the INDEX_SIZE_MASK limit).
  *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
-#define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_OFFSET_MASK				0x0FFF
 #define BT_PIVOT_HEAP_TID_ATTR		0x1000
-
-/* Get/set downlink block number in pivot tuple */
-#define BTreeTupleGetDownLink(itup) \
-	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
-#define BTreeTupleSetDownLink(itup, blkno) \
-	ItemPointerSetBlockNumber(&((itup)->t_tid), (blkno))
+#define BT_IS_POSTING				0x2000
 
 /*
- * Get/set leaf page highkey's link. During the second phase of deletion, the
- * target leaf page's high key may point to an ancestor page (at all other
- * times, the leaf level high key's link is not used).  See the nbtree README
- * for full details.
+ * Note: BTreeTupleIsPivot() can have false negatives (but not false
+ * positives) when used with !heapkeyspace indexes
  */
-#define BTreeTupleGetTopParent(itup) \
-	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
-#define BTreeTupleSetTopParent(itup, blkno)	\
-	do { \
-		ItemPointerSetBlockNumber(&((itup)->t_tid), (blkno)); \
-		BTreeTupleSetNAtts((itup), 0); \
-	} while(0)
+static inline bool
+BTreeTupleIsPivot(IndexTuple itup)
+{
+	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+		return false;
+	/* absence of BT_IS_POSTING in offset number indicates pivot tuple */
+	if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) & BT_IS_POSTING) != 0)
+		return false;
+
+	return true;
+}
+
+static inline bool
+BTreeTupleIsPosting(IndexTuple itup)
+{
+	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+		return false;
+	/* presence of BT_IS_POSTING in offset number indicates posting tuple */
+	if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) & BT_IS_POSTING) == 0)
+		return false;
+
+	return true;
+}
+
+static inline void
+BTreeTupleSetPosting(IndexTuple itup, int nhtids, int postingoffset)
+{
+	Assert(nhtids > 1 && (nhtids & BT_OFFSET_MASK) == nhtids);
+	Assert(postingoffset == MAXALIGN(postingoffset));
+	Assert(postingoffset < INDEX_SIZE_MASK);
+
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	ItemPointerSetOffsetNumber(&itup->t_tid, (nhtids | BT_IS_POSTING));
+	ItemPointerSetBlockNumber(&itup->t_tid, postingoffset);
+}
+
+static inline uint16
+BTreeTupleGetNPosting(IndexTuple posting)
+{
+	OffsetNumber existing;
+
+	Assert(BTreeTupleIsPosting(posting));
+
+	existing = ItemPointerGetOffsetNumberNoCheck(&posting->t_tid);
+	return (existing & BT_OFFSET_MASK);
+}
+
+static inline uint32
+BTreeTupleGetPostingOffset(IndexTuple posting)
+{
+	Assert(BTreeTupleIsPosting(posting));
+
+	return ItemPointerGetBlockNumberNoCheck(&posting->t_tid);
+}
+
+static inline ItemPointer
+BTreeTupleGetPosting(IndexTuple posting)
+{
+	return (ItemPointer) ((char *) posting +
+						  BTreeTupleGetPostingOffset(posting));
+}
+
+static inline ItemPointer
+BTreeTupleGetPostingN(IndexTuple posting, int n)
+{
+	return BTreeTupleGetPosting(posting) + n;
+}
 
 /*
- * Get/set number of attributes within B-tree index tuple.
+ * Get/set downlink block number in pivot tuple.
+ *
+ * Note: Cannot assert that tuple is a pivot tuple.  If we did so then
+ * !heapkeyspace indexes would exhibit false positive assertion failures.
+ */
+static inline BlockNumber
+BTreeTupleGetDownLink(IndexTuple pivot)
+{
+	return ItemPointerGetBlockNumberNoCheck(&pivot->t_tid);
+}
+
+static inline void
+BTreeTupleSetDownLink(IndexTuple pivot, BlockNumber blkno)
+{
+	ItemPointerSetBlockNumber(&pivot->t_tid, blkno);
+}
+
+/*
+ * Get number of attributes within tuple.
  *
  * Note that this does not include an implicit tiebreaker heap TID
  * attribute, if any.  Note also that the number of key attributes must be
  * explicitly represented in all heapkeyspace pivot tuples.
+ *
+ * Note: This is defined as a macro rather than an inline function to
+ * avoid including rel.h.
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		(BTreeTupleIsPivot(itup)) ? \
 		( \
-			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
+			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_OFFSET_MASK \
 		) \
 		: \
 		IndexRelationGetNumberOfAttributes(rel) \
 	)
-#define BTreeTupleSetNAtts(itup, n) \
-	do { \
-		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
-	} while(0)
 
 /*
- * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Set number of attributes in tuple, making it into a pivot tuple
  */
-#define BTreeTupleGetHeapTID(itup) \
-	( \
-	  (itup)->t_info & INDEX_ALT_TID_MASK && \
-	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_PIVOT_HEAP_TID_ATTR) != 0 ? \
-	  ( \
-		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
-					   sizeof(ItemPointerData)) \
-	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
-	)
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int natts)
+{
+	Assert(natts <= INDEX_MAX_KEYS);
+
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	/* BT_IS_POSTING bit may be unset -- tuple always becomes a pivot tuple */
+	ItemPointerSetOffsetNumber(&itup->t_tid, natts);
+	Assert(BTreeTupleIsPivot(itup));
+}
+
 /*
- * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * Set the bit indicating heap TID attribute present in pivot tuple
  */
-#define BTreeTupleSetAltHeapTID(itup) \
-	do { \
-		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
-								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_PIVOT_HEAP_TID_ATTR); \
-	} while(0)
+static inline void
+BTreeTupleSetAltHeapTID(IndexTuple pivot)
+{
+	OffsetNumber existing;
+
+	Assert(BTreeTupleIsPivot(pivot));
+
+	existing = ItemPointerGetOffsetNumberNoCheck(&pivot->t_tid);
+	ItemPointerSetOffsetNumber(&pivot->t_tid,
+							   existing | BT_PIVOT_HEAP_TID_ATTR);
+}
+
+/*
+ * Get/set leaf page's "top parent" link from its high key.  Used during page
+ * deletion.
+ *
+ * Note: Cannot assert that tuple is a pivot tuple.  If we did so then
+ * !heapkeyspace indexes would exhibit false positive assertion failures.
+ */
+static inline BlockNumber
+BTreeTupleGetTopParent(IndexTuple leafhikey)
+{
+	return ItemPointerGetBlockNumberNoCheck(&leafhikey->t_tid);
+}
+
+static inline void
+BTreeTupleSetTopParent(IndexTuple leafhikey, BlockNumber blkno)
+{
+	ItemPointerSetBlockNumber(&leafhikey->t_tid, blkno);
+	BTreeTupleSetNAtts(leafhikey, 0);
+}
+
+/*
+ * Get tiebreaker heap TID attribute, if any.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
+ */
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+	if (BTreeTupleIsPivot(itup))
+	{
+		/* Pivot tuple heap TID representation? */
+		if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+			 BT_PIVOT_HEAP_TID_ATTR) != 0)
+			return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+								  sizeof(ItemPointerData));
+
+		/* Heap TID attribute was truncated */
+		return NULL;
+	}
+	else if (BTreeTupleIsPosting(itup))
+		return BTreeTupleGetPosting(itup);
+
+	return &itup->t_tid;
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list tuple.
+ *
+ * Works with non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxHeapTID(IndexTuple itup)
+{
+	Assert(!BTreeTupleIsPivot(itup));
+
+	if (BTreeTupleIsPosting(itup))
+	{
+		uint16		nposting = BTreeTupleGetNPosting(itup);
+
+		return BTreeTupleGetPostingN(itup, nposting - 1);
+	}
+
+	return &itup->t_tid;
+}
 
 /*
  *	Operator strategy numbers for B-tree have been moved to access/stratnum.h,
@@ -434,6 +626,9 @@ typedef BTStackData *BTStack;
  * indexes whose version is >= version 4.  It's convenient to keep this close
  * by, rather than accessing the metapage repeatedly.
  *
+ * safededup is set to indicate that index may use deduplication safely.
+ * This is also a property of the index relation rather than an indexscan.
+ *
  * anynullkeys indicates if any of the keys had NULL value when scankey was
  * built from index tuple (note that already-truncated tuple key attributes
  * set NULL as a placeholder key value, which also affects value of
@@ -469,6 +664,7 @@ typedef BTStackData *BTStack;
 typedef struct BTScanInsertData
 {
 	bool		heapkeyspace;
+	bool		safededup;
 	bool		anynullkeys;
 	bool		nextkey;
 	bool		pivotsearch;
@@ -507,10 +703,74 @@ typedef struct BTInsertStateData
 	bool		bounds_valid;
 	OffsetNumber low;
 	OffsetNumber stricthigh;
+
+	/*
+	 * if _bt_binsrch_insert found the location inside existing posting list,
+	 * save the position inside the list.  -1 sentinel value indicates overlap
+	 * with an existing posting list tuple that has its LP_DEAD bit set.
+	 */
+	int			postingoff;
 } BTInsertStateData;
 
 typedef BTInsertStateData *BTInsertState;
 
+/*
+ * State used to representing an individual pending tuple during
+ * deduplication.
+ */
+typedef struct BTDedupInterval
+{
+	OffsetNumber baseoff;
+	uint16		nitems;
+} BTDedupInterval;
+
+/*
+ * BTDedupStateData is a working area used during deduplication.
+ *
+ * The status info fields track the state of a whole-page deduplication pass.
+ * State about the current pending posting list is also tracked.
+ *
+ * A pending posting list is comprised of a contiguous group of equal items
+ * from the page, starting from page offset number 'baseoff'.  This is the
+ * offset number of the "base" tuple for new posting list.  'nitems' is the
+ * current total number of existing items from the page that will be merged to
+ * make a new posting list tuple, including the base tuple item.  (Existing
+ * items may themselves be posting list tuples, or regular non-pivot tuples.)
+ *
+ * The total size of the existing tuples to be freed when pending posting list
+ * is processed gets tracked by 'phystupsize'.  This information allows
+ * deduplication to calculate the space saving for each new posting list
+ * tuple, and for the entire pass over the page as a whole.
+ */
+typedef struct BTDedupStateData
+{
+	/* Deduplication status info for entire pass over page */
+	bool		deduplicate;	/* Still deduplicating page? */
+	Size		maxpostingsize; /* Limit on size of final tuple */
+
+	/* Metadata about base tuple of current pending posting list */
+	IndexTuple	base;			/* Use to form new posting list */
+	OffsetNumber baseoff;		/* page offset of base */
+	Size		basetupsize;	/* base size without original posting list */
+
+	/* Other metadata about pending posting list */
+	ItemPointer htids;			/* Heap TIDs in pending posting list */
+	int			nhtids;			/* Number of heap TIDs in nhtids array */
+	int			nitems;			/* Number of existing tuples/line pointers */
+	Size		phystupsize;	/* Includes line pointer overhead */
+
+	/*
+	 * Array of tuples to go on new version of the page.  Contains one entry
+	 * for each group of consecutive items.  Note that existing tuples that
+	 * will not become posting list tuples do not appear in the array (they
+	 * are implicitly unchanged by deduplication pass).
+	 */
+	int			nintervals;		/* current size of intervals array */
+	BTDedupInterval intervals[MaxIndexTuplesPerPage];
+} BTDedupStateData;
+
+typedef BTDedupStateData *BTDedupState;
+
 /*
  * BTScanOpaqueData is the btree-private state needed for an indexscan.
  * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
@@ -534,7 +794,9 @@ typedef BTInsertStateData *BTInsertState;
  * If we are doing an index-only scan, we save the entire IndexTuple for each
  * matched item, otherwise only its heap TID and offset.  The IndexTuples go
  * into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array.  Posting list tuples store a "base" tuple once,
+ * allowing the same key to be returned for each TID in the posting list
+ * tuple.
  */
 
 typedef struct BTScanPosItem	/* what we remember about each match */
@@ -578,7 +840,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxTIDsPerBTreePage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -686,6 +948,7 @@ typedef struct BTOptions
 	int			fillfactor;		/* page fill factor in percent (0..100) */
 	/* fraction of newly inserted tuples prior to trigger index cleanup */
 	float8		vacuum_cleanup_index_scale_factor;
+	bool		deduplicate_items;	/* Use deduplication where safe? */
 } BTOptions;
 
 #define BTGetFillFactor(relation) \
@@ -694,8 +957,16 @@ typedef struct BTOptions
 	 (relation)->rd_options ? \
 	 ((BTOptions *) (relation)->rd_options)->fillfactor : \
 	 BTREE_DEFAULT_FILLFACTOR)
+#define BTGetUseDedup(relation) \
+	(AssertMacro(relation->rd_rel->relkind == RELKIND_INDEX && \
+				 relation->rd_rel->relam == BTREE_AM_OID), \
+	((relation)->rd_options ? \
+	 ((BTOptions *) (relation)->rd_options)->deduplicate_items : \
+	 BTGetUseDedupGUC(relation)))
 #define BTGetTargetPageFreeSpace(relation) \
 	(BLCKSZ * (100 - BTGetFillFactor(relation)) / 100)
+#define BTGetUseDedupGUC(relation) \
+	(relation->rd_index->indisunique || deduplicate_btree_items)
 
 /*
  * Constant definition for progress reporting.  Phase numbers must match
@@ -742,6 +1013,21 @@ extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
 extern void _bt_parallel_done(IndexScanDesc scan);
 extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
 
+/*
+ * prototypes for functions in nbtdedup.c
+ */
+extern void _bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
+							   IndexTuple newitem, Size newitemsz,
+							   bool checkingunique);
+extern void _bt_dedup_start_pending(BTDedupState state, IndexTuple base,
+									OffsetNumber baseoff);
+extern bool _bt_dedup_save_htid(BTDedupState state, IndexTuple itup);
+extern Size _bt_dedup_finish_pending(Page newpage, BTDedupState state);
+extern IndexTuple _bt_form_posting(IndexTuple base, ItemPointer htids,
+								   int nhtids);
+extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
+								   int postingoff);
+
 /*
  * prototypes for functions in nbtinsert.c
  */
@@ -760,14 +1046,16 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
 /*
  * prototypes for functions in nbtpage.c
  */
-extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level);
+extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+							 bool safededup);
 extern void _bt_update_meta_cleanup_info(Relation rel,
 										 TransactionId oldestBtpoXact, float8 numHeapTuples);
 extern void _bt_upgrademetapage(Page page);
 extern Buffer _bt_getroot(Relation rel, int access);
 extern Buffer _bt_gettrueroot(Relation rel);
 extern int	_bt_getrootheight(Relation rel);
-extern bool _bt_heapkeyspace(Relation rel);
+extern void _bt_metaversion(Relation rel, bool *heapkeyspace,
+							bool *safededup);
 extern void _bt_checkpage(Relation rel, Buffer buf);
 extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
 extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -776,7 +1064,9 @@ extern void _bt_relbuf(Relation rel, Buffer buf);
 extern void _bt_pageinit(Page page, Size size);
 extern bool _bt_page_recyclable(Page page);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
-								OffsetNumber *deletable, int ndeletable);
+								OffsetNumber *deletable, int ndeletable,
+								OffsetNumber *updatable, IndexTuple *updated,
+								int nupdatable);
 extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *deletable, int ndeletable,
 								Relation heapRel);
@@ -829,6 +1119,7 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern bool _bt_opclasses_support_dedup(Relation index);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 776a9bd723..37d975de98 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP		0x50	/* deduplicate tuples on leaf page */
+#define XLOG_BTREE_INSERT_POST	0x60	/* add index tuple with posting split */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -53,21 +54,34 @@ typedef struct xl_btree_metadata
 	uint32		fastlevel;
 	TransactionId oldest_btpo_xact;
 	float8		last_cleanup_num_heap_tuples;
+	bool		safededup;
 } xl_btree_metadata;
 
 /*
  * This is what we need to know about simple (without split) insert.
  *
- * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
- * Note that INSERT_META implies it's not a leaf page.
+ * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META, and
+ * INSERT_POST.  Note that INSERT_META and INSERT_UPPER implies it's not a
+ * leaf page, while INSERT_POST and INSERT_LEAF imply that it must be a leaf
+ * page.
  *
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page
  * Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
  * Backup Blk 2: xl_btree_metadata, if INSERT_META
+ *
+ * Note: The new tuple is actually the "original" new item in the posting
+ * list split insert case (i.e. the INSERT_POST case).  A split offset for
+ * the posting list is logged before the original new item.  Recovery needs
+ * both, since it must do an in-place update of the existing posting list
+ * that was split as an extra step.  Also, recovery generates a "final"
+ * newitem.  See _bt_swap_posting() for details on posting list splits.
  */
 typedef struct xl_btree_insert
 {
 	OffsetNumber offnum;
+
+	/* POSTING SPLIT OFFSET FOLLOWS (INSERT_POST case) */
+	/* NEW TUPLE ALWAYS FOLLOWS AT THE END */
 } xl_btree_insert;
 
 #define SizeOfBtreeInsert	(offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
@@ -92,8 +106,37 @@ typedef struct xl_btree_insert
  * Backup Blk 0: original page / new left page
  *
  * The left page's data portion contains the new item, if it's the _L variant.
- * An IndexTuple representing the high key of the left page must follow with
- * either variant.
+ * _R variant split records generally do not have a newitem (_R variant leaf
+ * page split records that must deal with a posting list split will include an
+ * explicit newitem, though it is never used on the right page -- it is
+ * actually an orignewitem needed to update existing posting list).  The new
+ * high key of the left/original page appears last of all (and must always be
+ * present).
+ *
+ * Page split records that need the REDO routine to deal with a posting list
+ * split directly will have an explicit newitem, which is actually an
+ * orignewitem (the newitem as it was before the posting list split, not
+ * after).  A posting list split always has a newitem that comes immediately
+ * after the posting list being split (which would have overlapped with
+ * orignewitem prior to split).  Usually REDO must deal with posting list
+ * splits with an _L variant page split record, and usually both the new
+ * posting list and the final newitem go on the left page (the existing
+ * posting list will be inserted instead of the old, and the final newitem
+ * will be inserted next to that).  However, _R variant split records will
+ * include an orignewitem when the split point for the page happens to have a
+ * lastleft tuple that is also the posting list being split (leaving newitem
+ * as the page split's firstright tuple).  The existence of this corner case
+ * does not change the basic fact about newitem/orignewitem for the REDO
+ * routine: it is always state used for the left page alone.  (This is why the
+ * record's postingoff field isn't a reliable indicator of whether or not a
+ * posting list split occurred during the page split; a non-zero value merely
+ * indicates that the REDO routine must reconstruct a new posting list tuple
+ * that is needed for the left page.)
+ *
+ * This posting list split handling is equivalent to the xl_btree_insert REDO
+ * routine's INSERT_POST handling.  While the details are more complicated
+ * here, the concept and goals are exactly the same.  See _bt_swap_posting()
+ * for details on posting list splits.
  *
  * Backup Blk 1: new right page
  *
@@ -111,15 +154,33 @@ typedef struct xl_btree_split
 {
 	uint32		level;			/* tree level of page being split */
 	OffsetNumber firstright;	/* first item moved to right page */
-	OffsetNumber newitemoff;	/* new item's offset (useful for _L variant) */
+	OffsetNumber newitemoff;	/* new item's offset */
+	uint16		postingoff;		/* offset inside orig posting tuple */
 } xl_btree_split;
 
-#define SizeOfBtreeSplit	(offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit	(offsetof(xl_btree_split, postingoff) + sizeof(uint16))
+
+/*
+ * When page is deduplicated, consecutive groups of tuples with equal keys are
+ * merged together into posting list tuples.
+ *
+ * The WAL record represents a deduplication pass for a leaf page.  An array
+ * of BTDedupInterval structs follows.
+ */
+typedef struct xl_btree_dedup
+{
+	uint16		nintervals;
+
+	/* DEDUPLICATION INTERVALS FOLLOW */
+} xl_btree_dedup;
+
+#define SizeOfBtreeDedup 	(offsetof(xl_btree_dedup, nintervals) + sizeof(uint16))
 
 /*
  * This is what we need to know about delete of individual leaf index tuples.
  * The WAL record can represent deletion of any number of index tuples on a
- * single index page when *not* executed by VACUUM.
+ * single index page when *not* executed by VACUUM.  Deletion of a subset of
+ * the TIDs within a posting list tuple is not supported.
  *
  * Backup Blk 0: index page
  */
@@ -152,19 +213,23 @@ typedef struct xl_btree_reuse_page
 /*
  * This is what we need to know about vacuum of individual leaf index tuples.
  * The WAL record can represent deletion of any number of index tuples on a
- * single index page when executed by VACUUM.
- *
- * Note that the WAL record in any vacuum of an index must have at least one
- * item to delete.
+ * single index page when executed by VACUUM.  It can also support "updates"
+ * of index tuples, which are how deletes of a subset of TIDs contained in an
+ * existing posting list tuple are implemented. (Updates are only used when
+ * there will be some remaining TIDs once VACUUM finishes; otherwise the
+ * posting list tuple can just be deleted).
  */
 typedef struct xl_btree_vacuum
 {
-	uint32		ndeleted;
+	uint16		ndeleted;
+	uint16		nupdated;
 
 	/* DELETED TARGET OFFSET NUMBERS FOLLOW */
+	/* UPDATED TARGET OFFSET NUMBERS FOLLOW */
+	/* UPDATED TUPLES FOR OVERWRITES FOLLOW */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, nupdated) + sizeof(uint16))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.
@@ -245,6 +310,8 @@ typedef struct xl_btree_newroot
 extern void btree_redo(XLogReaderState *record);
 extern void btree_desc(StringInfo buf, XLogReaderState *record);
 extern const char *btree_identify(uint8 info);
+extern void btree_xlog_startup(void);
+extern void btree_xlog_cleanup(void);
 extern void btree_mask(char *pagedata, BlockNumber blkno);
 
 #endif							/* NBTXLOG_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index c88dccfb8d..6c15df7e70 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -36,7 +36,7 @@ PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL,
 PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
 PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
 PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask)
 PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask)
 PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
 PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 79430d2b7b..f2b03a6cfc 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -158,6 +158,15 @@ static relopt_bool boolRelOpts[] =
 		},
 		true
 	},
+	{
+		{
+			"deduplicate_items",
+			"Enables deduplication on btree index leaf pages",
+			RELOPT_KIND_BTREE,
+			ShareUpdateExclusiveLock
+		},
+		true
+	},
 	/* list terminator */
 	{{NULL}}
 };
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index c16eb05416..dfba5ae39a 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -276,6 +276,10 @@ BuildIndexValueDescription(Relation indexRelation,
 /*
  * Get the latestRemovedXid from the table entries pointed at by the index
  * tuples being deleted.
+ *
+ * Note: index access methods that don't consistently use the standard
+ * IndexTuple + heap TID item pointer representation will need to provide
+ * their own version of this function.
  */
 TransactionId
 index_compute_xid_horizon_for_tuples(Relation irel,
diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index bf245f5dab..d69808e78c 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = \
 	nbtcompare.o \
+	nbtdedup.o \
 	nbtinsert.o \
 	nbtpage.o \
 	nbtree.o \
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index c60a4d0d9e..6499f5adb7 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
 like a hint bit for a heap tuple), but physically removing tuples requires
 exclusive lock.  In the current code we try to remove LP_DEAD tuples when
 we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already).  Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every table TID within the posting list is known dead.)
 
 This leaves the index in a state where it has no entry for a dead tuple
 that still exists in the heap.  This is not a problem for the current
@@ -726,6 +729,134 @@ if it must.  When a page that's already full of duplicates must be split,
 the fallback strategy assumes that duplicates are mostly inserted in
 ascending heap TID order.  The page is split in a way that leaves the left
 half of the page mostly full, and the right half of the page mostly empty.
+The overall effect is that leaf page splits gracefully adapt to inserts of
+large groups of duplicates, maximizing space utilization.  Note also that
+"trapping" large groups of duplicates on the same leaf page like this makes
+deduplication more efficient.  Deduplication can be performed infrequently,
+without merging together existing posting list tuples too often.
+
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid (or at least delay) page splits.  Note that the
+goals for deduplication in unique indexes are rather different; see later
+section for details.  Deduplication alters the physical representation of
+tuples without changing the logical contents of the index, and without
+adding overhead to read queries.  Non-pivot tuples are merged together
+into a single physical tuple with a posting list (a simple array of heap
+TIDs with the standard item pointer format).  Deduplication is always
+applied lazily, at the point where it would otherwise be necessary to
+perform a page split.  It occurs only when LP_DEAD items have been
+removed, as our last line of defense against splitting a leaf page.  We
+can set the LP_DEAD bit with posting list tuples, though only when all
+TIDs are known dead.
+
+Our lazy approach to deduplication allows the page space accounting used
+during page splits to have absolutely minimal special case logic for
+posting lists.  Posting lists can be thought of as extra payload that
+suffix truncation will reliably truncate away as needed during page
+splits, just like non-key columns from an INCLUDE index tuple.
+Incoming/new tuples can generally be treated as non-overlapping plain
+items (though see section on posting list splits for information about how
+overlapping new/incoming items are really handled).
+
+The representation of posting lists is almost identical to the posting
+lists used by GIN, so it would be straightforward to apply GIN's varbyte
+encoding compression scheme to individual posting lists.  Posting list
+compression would break the assumptions made by posting list splits about
+page space accounting (see later section), so it's not clear how
+compression could be integrated with nbtree.  Besides, posting list
+compression does not offer a compelling trade-off for nbtree, since in
+general nbtree is optimized for consistent performance with many
+concurrent readers and writers.
+
+A major goal of our lazy approach to deduplication is to limit the
+performance impact of deduplication with random updates.  Even concurrent
+append-only inserts of the same key value will tend to have inserts of
+individual index tuples in an order that doesn't quite match heap TID
+order.  Delaying deduplication minimizes page level fragmentation.
+
+Deduplication in unique indexes
+-------------------------------
+
+Very often, the range of values that can be placed on a given leaf page in
+a unique index is fixed and permanent.  For example, a primary key on an
+identity column will usually only have page splits caused by the insertion
+of new logical rows within the rightmost leaf page.  If there is a split
+of a non-rightmost leaf page, then the split must have been triggered by
+inserts associated with an UPDATE of an existing logical row.  Splitting a
+leaf page purely to store multiple versions should be considered
+pathological, since it permanently degrades the index structure in order
+to absorb a temporary burst of duplicates.  Deduplication in unique
+indexes helps to prevent these pathological page splits.  Storing
+duplicates in a space efficient manner is not the goal, since in the long
+run there won't be any duplicates anyway.  Rather, we're buying time for
+standard garbage collection mechanisms to run before a page split is
+needed.
+
+Unique index leaf pages only get a deduplication pass when an insertion
+(that might have to split the page) observed an existing duplicate on the
+page in passing.  This is based on the assumption that deduplication will
+only work out when _all_ new insertions are duplicates from UPDATEs.  This
+may mean that we miss an opportunity to delay a page split, but that's
+okay because our ultimate goal is to delay leaf page splits _indefinitely_
+(i.e. to prevent them altogether).  There is little point in trying to
+delay a split that is probably inevitable anyway.  This allows us to avoid
+the overhead of attempting to deduplicate with unique indexes that always
+have few or no duplicates.
+
+Posting list splits
+-------------------
+
+When the incoming tuple happens to overlap with an existing posting list,
+a posting list split is performed.  Like a page split, a posting list
+split resolves a situation where a new/incoming item "won't fit", while
+inserting the incoming item in passing (i.e. as part of the same atomic
+action).  It's possible (though not particularly likely) that an insert of
+a new item on to an almost-full page will overlap with a posting list,
+resulting in both a posting list split and a page split.  Even then, the
+atomic action that splits the posting list also inserts the new item
+(since page splits always insert the new item in passing).  Including the
+posting list split in the same atomic action as the insert avoids problems
+caused by concurrent inserts into the same posting list --  the exact
+details of how we change the posting list depend upon the new item, and
+vice-versa.  A single atomic action also minimizes the volume of extra
+WAL required for a posting list split, since we don't have to explicitly
+WAL-log the original posting list tuple.
+
+Despite piggy-backing on the same atomic action that inserts a new tuple,
+posting list splits can be thought of as a separate, extra action to the
+insert itself (or to the page split itself).  Posting list splits
+conceptually "rewrite" an insert that overlaps with an existing posting
+list into an insert that adds its final new item just to the right of the
+posting list instead.  The size of the posting list won't change, and so
+page space accounting code does not need to care about posting list splits
+at all.  This is an important upside of our design; the page split point
+choice logic is very subtle even without it needing to deal with posting
+list splits.
+
+Only a few isolated extra steps are required to preserve the illusion that
+the new item never overlapped with an existing posting list in the first
+place: the heap TID of the incoming tuple is swapped with the rightmost/max
+heap TID from the existing/originally overlapping posting list.  Also, the
+posting-split-with-page-split case must generate a new high key based on
+an imaginary version of the original page that has both the final new item
+and the after-list-split posting tuple (page splits usually just operate
+against an imaginary version that contains the new item/item that won't
+fit).
+
+This approach avoids inventing an "eager" atomic posting split operation
+that splits the posting list without simultaneously finishing the insert
+of the incoming item.  This alternative design might seem cleaner, but it
+creates subtle problems for page space accounting.  In general, there
+might not be enough free space on the page to split a posting list such
+that the incoming/new item no longer overlaps with either posting list
+half --- the operation could fail before the actual retail insert of the
+new item even begins.  We'd end up having to handle posting list splits
+that need a page split anyway.  Besides, supporting variable "split points"
+while splitting posting lists won't actually improve overall space
+utilization.
 
 Notes About Data Representation
 -------------------------------
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
new file mode 100644
index 0000000000..cb09f18e92
--- /dev/null
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -0,0 +1,759 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtdedup.c
+ *	  Deduplicate items in Postgres btrees.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/nbtree/nbtdedup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/nbtxlog.h"
+#include "miscadmin.h"
+#include "utils/rel.h"
+
+static bool _bt_do_singleval(Relation rel, Page page, BTDedupState state,
+							 OffsetNumber minoff, IndexTuple newitem);
+static void _bt_singleval_fillfactor(Page page, BTDedupState state,
+									 Size newitemsz);
+#ifdef USE_ASSERT_CHECKING
+static bool _bt_posting_valid(IndexTuple posting);
+#endif
+
+/*
+ * Try to deduplicate items to free at least enough space to avoid a page
+ * split.  This function should be called during insertion, only after LP_DEAD
+ * items were removed by _bt_vacuum_one_page() to prevent a page split.
+ * (We'll have to kill LP_DEAD items here when the page's BTP_HAS_GARBAGE hint
+ * was not set, but that should be rare.)
+ *
+ * The general approach taken here is to perform as much deduplication as
+ * possible to free as much space as possible.  Note, however, that "single
+ * value" strategy is sometimes used for !checkingunique callers, in which
+ * case deduplication will leave a few tuples untouched at the end of the
+ * page.  The general idea is to prepare the page for an anticipated page
+ * split that uses nbtsplitloc's own "single value" strategy to determine a
+ * split point.  (There is no reason to deduplicate items that will end up on
+ * the right half of the page after the anticipated page split; better to
+ * handle those if and when the anticipated right half page gets its own
+ * deduplication pass, following further inserts of duplicates.)
+ *
+ * nbtinsert.c caller should call _bt_vacuum_one_page() before calling here
+ * when BTP_HAS_GARBAGE flag is set.  Note that this routine will delete all
+ * items on the page that have their LP_DEAD bit set, even when page's flag
+ * bit is not set (though that should be rare).  Caller can rely on that to
+ * avoid inserting a new tuple that happens to overlap with an existing
+ * posting list tuple with its LP_DEAD bit set. (Calling here with a newitemsz
+ * of 0 will reliably delete the existing item, making it possible to avoid
+ * unsetting the LP_DEAD bit just to insert the new item.  In general, posting
+ * list splits should never have to deal with a posting list tuple with its
+ * LP_DEAD bit set.)
+ */
+void
+_bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
+				   IndexTuple newitem, Size newitemsz, bool checkingunique)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buf);
+	BTPageOpaque opaque;
+	Page		newpage;
+	OffsetNumber deletable[MaxIndexTuplesPerPage];
+	BTDedupState state;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+	int			ndeletable = 0;
+	int			pagenitems = 0;
+	Size		pagesaving = 0;
+	bool		singlevalstrat = false;
+
+	/*
+	 * Caller should call _bt_vacuum_one_page() before calling here when it
+	 * looked like there were LP_DEAD items on the page.  However, we can't
+	 * assume that there are no LP_DEAD items (for one thing, VACUUM will
+	 * clear the BTP_HAS_GARBAGE hint without reliably removing items that are
+	 * marked LP_DEAD).  We must be careful to clear all LP_DEAD items because
+	 * posting list splits cannot go ahead if an existing posting list item
+	 * has its LP_DEAD bit set. (Also, we don't want to unnecessarily unset
+	 * LP_DEAD bits when deduplicating items on the page below, though that
+	 * should be harmless.)
+	 *
+	 * The opposite problem is also possible: _bt_vacuum_one_page() won't
+	 * clear the BTP_HAS_GARBAGE bit when it is falsely set (i.e. when there
+	 * are no LP_DEAD bits).  This probably doesn't matter in practice, since
+	 * it's only a hint, and VACUUM will clear it at some point anyway.  Even
+	 * still, we clear the BTP_HAS_GARBAGE hint reliably here. (Seems like a
+	 * good idea for deduplication to only begin when we unambiguously have no
+	 * LP_DEAD items.)
+	 */
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	minoff = P_FIRSTDATAKEY(opaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+
+		if (ItemIdIsDead(itemid))
+			deletable[ndeletable++] = offnum;
+	}
+
+	if (ndeletable > 0)
+	{
+		_bt_delitems_delete(rel, buf, deletable, ndeletable, heapRel);
+
+		/*
+		 * Return when a split will be avoided.  This is equivalent to
+		 * avoiding a split using the usual _bt_vacuum_one_page() path.
+		 */
+		if (PageGetFreeSpace(page) >= newitemsz)
+			return;
+
+		/*
+		 * Reconsider number of items on page, in case _bt_delitems_delete()
+		 * managed to delete an item or two
+		 */
+		minoff = P_FIRSTDATAKEY(opaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+	}
+	else if (P_HAS_GARBAGE(opaque))
+	{
+		opaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+		MarkBufferDirtyHint(buf, true);
+	}
+
+	/*
+	 * Return early in case where caller just wants us to kill an existing
+	 * LP_DEAD posting list tuple
+	 */
+	Assert(!P_HAS_GARBAGE(opaque));
+	if (newitemsz == 0)
+		return;
+
+	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+	newitemsz += sizeof(ItemIdData);
+
+	/*
+	 * By here, it's clear that deduplication will definitely be attempted.
+	 * Initialize deduplication state.
+	 *
+	 * It would be possible for maxpostingsize (limit on posting list tuple
+	 * size) to be set to one third of the page.  However, it seems like a
+	 * good idea to limit the size of posting lists to one sixth of a page.
+	 * That ought to leave us with a good split point when pages full of
+	 * duplicates can be split several times.
+	 */
+	state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+	state->deduplicate = true;
+	state->maxpostingsize = Min(BTMaxItemSize(page) / 2, INDEX_SIZE_MASK);
+	/* Metadata about base tuple of current pending posting list */
+	state->base = NULL;
+	state->baseoff = InvalidOffsetNumber;
+	state->basetupsize = 0;
+	/* Metadata about current pending posting list TIDs */
+	state->htids = palloc(state->maxpostingsize);
+	state->nhtids = 0;
+	state->nitems = 0;
+	/* Size of all physical tuples to be replaced by pending posting list */
+	state->phystupsize = 0;
+	/* nintervals should be initialized to zero */
+	state->nintervals = 0;
+
+	/* Determine if "single value" strategy should be used */
+	if (!checkingunique)
+		singlevalstrat = _bt_do_singleval(rel, page, state, minoff, newitem);
+
+	/*
+	 * Deduplicate items from page, and write them to newpage.
+	 *
+	 * Copy the original page's LSN into newpage copy.  This will become the
+	 * updated version of the page.  We need this because XLogInsert will
+	 * examine the LSN and possibly dump it in a page image.
+	 */
+	newpage = PageGetTempPageCopySpecial(page);
+	PageSetLSN(newpage, PageGetLSN(page));
+
+	/* Copy high key, if any */
+	if (!P_RIGHTMOST(opaque))
+	{
+		ItemId		hitemid = PageGetItemId(page, P_HIKEY);
+		Size		hitemsz = ItemIdGetLength(hitemid);
+		IndexTuple	hitem = (IndexTuple) PageGetItem(page, hitemid);
+
+		if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to add highkey during deduplication");
+	}
+
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(!ItemIdIsDead(itemid));
+
+		if (offnum == minoff)
+		{
+			/*
+			 * No previous/base tuple for the data item -- use the data item
+			 * as base tuple of pending posting list
+			 */
+			_bt_dedup_start_pending(state, itup, offnum);
+		}
+		else if (state->deduplicate &&
+				 _bt_keep_natts_fast(rel, state->base, itup) > natts &&
+				 _bt_dedup_save_htid(state, itup))
+		{
+			/*
+			 * Tuple is equal to base tuple of pending posting list.  Heap
+			 * TID(s) for itup have been saved in state.
+			 */
+		}
+		else
+		{
+			/*
+			 * Tuple is not equal to pending posting list tuple, or
+			 * _bt_dedup_save_htid() opted to not merge current item into
+			 * pending posting list for some other reason (e.g., adding more
+			 * TIDs would have caused posting list to exceed current
+			 * maxpostingsize).
+			 *
+			 * If state contains pending posting list with more than one item,
+			 * form new posting tuple, and actually update the page.  Else
+			 * reset the state and move on without modifying the page.
+			 */
+			pagesaving += _bt_dedup_finish_pending(newpage, state);
+			pagenitems++;
+
+			if (singlevalstrat)
+			{
+				/*
+				 * Single value strategy's extra steps.
+				 *
+				 * Lower maxpostingsize for sixth and final item that might be
+				 * deduplicated by current deduplication pass.  When sixth
+				 * item formed/observed, stop deduplicating items.
+				 *
+				 * Note: It's possible that this will be reached even when
+				 * current deduplication pass has yet to merge together some
+				 * existing items.  It doesn't matter whether or not the
+				 * current call generated the maxpostingsize-capped duplicate
+				 * tuples at the start of the page.
+				 */
+				if (pagenitems == 5)
+					_bt_singleval_fillfactor(page, state, newitemsz);
+				else if (pagenitems == 6)
+				{
+					state->deduplicate = false;
+					singlevalstrat = false; /* won't be back here */
+				}
+			}
+
+			/* itup starts new pending posting list */
+			_bt_dedup_start_pending(state, itup, offnum);
+		}
+	}
+
+	/* Handle the last item */
+	pagesaving += _bt_dedup_finish_pending(newpage, state);
+	pagenitems++;
+
+	/*
+	 * If no items suitable for deduplication were found, newpage must be
+	 * exactly the same as the original page, so just return from function.
+	 */
+	if (state->nintervals == 0)
+	{
+		pfree(newpage);
+		pfree(state->htids);
+		pfree(state);
+		return;
+	}
+
+	START_CRIT_SECTION();
+
+	PageRestoreTempPage(newpage, page);
+	MarkBufferDirty(buf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+		xl_btree_dedup xlrec_dedup;
+
+		xlrec_dedup.nintervals = state->nintervals;
+
+		XLogBeginInsert();
+		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+		XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+		/*
+		 * The intervals array is not in the buffer, but pretend that it is.
+		 * When XLogInsert stores the whole buffer, the array need not be
+		 * stored too.
+		 */
+		XLogRegisterBufData(0, (char *) state->intervals,
+							state->nintervals * sizeof(BTDedupInterval));
+
+		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP);
+
+		PageSetLSN(page, recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	/* Local space accounting should agree with page accounting */
+	Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
+
+	/* cannot leak memory here */
+	pfree(state->htids);
+	pfree(state);
+}
+
+/*
+ * Create a new pending posting list tuple based on caller's base tuple.
+ *
+ * Every tuple processed by deduplication either becomes the base tuple for a
+ * posting list, or gets its heap TID(s) accepted into a pending posting list.
+ * A tuple that starts out as the base tuple for a posting list will only
+ * actually be rewritten within _bt_dedup_finish_pending() when it turns out
+ * that there are duplicates that can be merged into the base tuple.
+ */
+void
+_bt_dedup_start_pending(BTDedupState state, IndexTuple base,
+						OffsetNumber baseoff)
+{
+	Assert(state->nhtids == 0);
+	Assert(state->nitems == 0);
+	Assert(!BTreeTupleIsPivot(base));
+
+	/*
+	 * Copy heap TID(s) from new base tuple for new candidate posting list
+	 * into working state's array
+	 */
+	if (!BTreeTupleIsPosting(base))
+	{
+		memcpy(state->htids, &base->t_tid, sizeof(ItemPointerData));
+		state->nhtids = 1;
+		state->basetupsize = IndexTupleSize(base);
+	}
+	else
+	{
+		int			nposting;
+
+		nposting = BTreeTupleGetNPosting(base);
+		memcpy(state->htids, BTreeTupleGetPosting(base),
+			   sizeof(ItemPointerData) * nposting);
+		state->nhtids = nposting;
+		/* basetupsize should not include existing posting list */
+		state->basetupsize = BTreeTupleGetPostingOffset(base);
+	}
+
+	/*
+	 * Save new base tuple itself -- it'll be needed if we actually create a
+	 * new posting list from new pending posting list.
+	 *
+	 * Must maintain physical size of all existing tuples (including line
+	 * pointer overhead) so that we can calculate space savings on page.
+	 */
+	state->nitems = 1;
+	state->base = base;
+	state->baseoff = baseoff;
+	state->phystupsize = MAXALIGN(IndexTupleSize(base)) + sizeof(ItemIdData);
+	/* Also save baseoff in pending state for interval */
+	state->intervals[state->nintervals].baseoff = state->baseoff;
+}
+
+/*
+ * Save itup heap TID(s) into pending posting list where possible.
+ *
+ * Returns bool indicating if the pending posting list managed by state now
+ * includes itup's heap TID(s).
+ */
+bool
+_bt_dedup_save_htid(BTDedupState state, IndexTuple itup)
+{
+	int			nhtids;
+	ItemPointer htids;
+	Size		mergedtupsz;
+
+	Assert(!BTreeTupleIsPivot(itup));
+
+	if (!BTreeTupleIsPosting(itup))
+	{
+		nhtids = 1;
+		htids = &itup->t_tid;
+	}
+	else
+	{
+		nhtids = BTreeTupleGetNPosting(itup);
+		htids = BTreeTupleGetPosting(itup);
+	}
+
+	/*
+	 * Don't append (have caller finish pending posting list as-is) if
+	 * appending heap TID(s) from itup would put us over maxpostingsize limit.
+	 *
+	 * This calculation needs to match the code used within _bt_form_posting()
+	 * for new posting list tuples.
+	 */
+	mergedtupsz = MAXALIGN(state->basetupsize +
+						   (state->nhtids + nhtids) * sizeof(ItemPointerData));
+
+	if (mergedtupsz > state->maxpostingsize)
+		return false;
+
+	/*
+	 * Save heap TIDs to pending posting list tuple -- itup can be merged into
+	 * pending posting list
+	 */
+	state->nitems++;
+	memcpy(state->htids + state->nhtids, htids,
+		   sizeof(ItemPointerData) * nhtids);
+	state->nhtids += nhtids;
+	state->phystupsize += MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
+
+	return true;
+}
+
+/*
+ * Finalize pending posting list tuple, and add it to the page.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * Returns space saving from deduplicating to make a new posting list tuple.
+ * Note that this includes line pointer overhead.  This is zero in the case
+ * where no deduplication was possible.
+ */
+Size
+_bt_dedup_finish_pending(Page newpage, BTDedupState state)
+{
+	IndexTuple	final;
+	Size		finalsz;
+	OffsetNumber finaloff;
+	Size		spacesaving;
+
+	Assert(state->nitems > 0);
+	Assert(state->nitems <= state->nhtids);
+	Assert(state->intervals[state->nintervals].baseoff == state->baseoff);
+
+	finaloff = OffsetNumberNext(PageGetMaxOffsetNumber(newpage));
+	if (state->nitems == 1)
+	{
+		/* Use original, unchanged base tuple */
+		finalsz = IndexTupleSize(state->base);
+		if (PageAddItem(newpage, (Item) state->base, finalsz, finaloff,
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "deduplication failed to add tuple to page");
+
+		spacesaving = 0;
+	}
+	else
+	{
+		/* Form a tuple with a posting list */
+		final = _bt_form_posting(state->base, state->htids, state->nhtids);
+		finalsz = IndexTupleSize(final);
+		Assert(finalsz <= state->maxpostingsize);
+
+		/* Save final number of items for posting list */
+		state->intervals[state->nintervals].nitems = state->nitems;
+
+		Assert(finalsz == MAXALIGN(IndexTupleSize(final)));
+		if (PageAddItem(newpage, (Item) final, finalsz, finaloff, false,
+						false) == InvalidOffsetNumber)
+			elog(ERROR, "deduplication failed to add tuple to page");
+
+		pfree(final);
+		spacesaving = state->phystupsize - (finalsz + sizeof(ItemIdData));
+		/* Increment nintervals, since we wrote a new posting list tuple */
+		state->nintervals++;
+		Assert(spacesaving > 0 && spacesaving < BLCKSZ);
+	}
+
+	/* Reset state for next pending posting list */
+	state->nhtids = 0;
+	state->nitems = 0;
+	state->phystupsize = 0;
+
+	return spacesaving;
+}
+
+/*
+ * Determine if page non-pivot tuples (data items) are all duplicates of the
+ * same value -- if they are, deduplication's "single value" strategy should
+ * be applied.  The general goal of this strategy is to ensure that
+ * nbtsplitloc.c (which uses its own single value strategy) will find a useful
+ * split point as further duplicates are inserted, and successive rightmost
+ * page splits occur among pages that store the same duplicate value.  When
+ * the page finally splits, it should end up BTREE_SINGLEVAL_FILLFACTOR% full,
+ * just like it would if deduplication were disabled.
+ *
+ * We expect that affected workloads will require _several_ single value
+ * strategy deduplication passes (over a page that only stores duplicates)
+ * before the page is finally split.  The first deduplication pass should only
+ * find regular non-pivot tuples.  Later deduplication passes will find
+ * existing maxpostingsize-capped posting list tuples, which must be skipped
+ * over.  The penultimate pass is generally the first pass that actually
+ * reaches _bt_singleval_fillfactor(), and so will deliberately leave behind a
+ * few untouched non-pivot tuples.  The final deduplication pass won't free
+ * any space -- it will skip over everything without merging anything (it
+ * retraces the steps of the penultimate pass).
+ *
+ * Fortunately, having several passes isn't too expensive.  Each pass (after
+ * the first pass) won't spend many cycles on the large posting list tuples
+ * left by previous passes.  Each pass will find a large contiguous group of
+ * smaller duplicate tuples to merge together at the end of the page.
+ *
+ * Note: We deliberately don't bother checking if the high key is a distinct
+ * value (prior to the TID tiebreaker column) before proceeding, unlike
+ * nbtsplitloc.c.  Its single value strategy only gets applied on the
+ * rightmost page of duplicates of the same value (other leaf pages full of
+ * duplicates will get a simple 50:50 page split instead of splitting towards
+ * the end of the page).  There is little point in making the same distinction
+ * here.
+ */
+static bool
+_bt_do_singleval(Relation rel, Page page, BTDedupState state,
+				 OffsetNumber minoff, IndexTuple newitem)
+{
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+	ItemId		itemid;
+	IndexTuple	itup;
+
+	itemid = PageGetItemId(page, minoff);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+
+	if (_bt_keep_natts_fast(rel, newitem, itup) > natts)
+	{
+		itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		if (_bt_keep_natts_fast(rel, newitem, itup) > natts)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Lower maxpostingsize when using "single value" strategy, to avoid a sixth
+ * and final maxpostingsize-capped tuple.  The sixth and final posting list
+ * tuple will end up somewhat smaller than the first five.  (Note: The first
+ * five tuples could actually just be very large duplicate tuples that
+ * couldn't be merged together at all.  Deduplication will simply not modify
+ * the page when that happens.)
+ *
+ * When there are six posting lists on the page (after current deduplication
+ * pass goes on to create/observe a sixth very large tuple), caller should end
+ * its deduplication pass.  It isn't useful to try to deduplicate items that
+ * are supposed to end up on the new right sibling page following the
+ * anticipated page split.  A future deduplication pass of future right
+ * sibling page might take care of it.  (This is why the first single value
+ * strategy deduplication pass for a given leaf page will generally find only
+ * plain non-pivot tuples -- see _bt_do_singleval() comments.)
+ */
+static void
+_bt_singleval_fillfactor(Page page, BTDedupState state, Size newitemsz)
+{
+	Size		leftfree;
+	int			reduction;
+
+	/* This calculation needs to match nbtsplitloc.c */
+	leftfree = PageGetPageSize(page) - SizeOfPageHeaderData -
+		MAXALIGN(sizeof(BTPageOpaqueData));
+	/* Subtract size of new high key (includes pivot heap TID space) */
+	leftfree -= newitemsz + MAXALIGN(sizeof(ItemPointerData));
+
+	/*
+	 * Reduce maxpostingsize by an amount equal to target free space on left
+	 * half of page
+	 */
+	reduction = leftfree * ((100 - BTREE_SINGLEVAL_FILLFACTOR) / 100.0);
+	if (state->maxpostingsize > reduction)
+		state->maxpostingsize -= reduction;
+	else
+		state->maxpostingsize = 0;
+}
+
+/*
+ * Build a posting list tuple based on caller's "base" index tuple and list of
+ * heap TIDs.  When nhtids == 1, builds a standard non-pivot tuple without a
+ * posting list. (Posting list tuples can never have a single heap TID, partly
+ * because that ensures that deduplication always reduces final MAXALIGN()'d
+ * size of entire tuple.)
+ *
+ * Convention is that posting list starts at a MAXALIGN()'d offset (rather
+ * than a SHORTALIGN()'d offset), in line with the approach taken when
+ * appending a heap TID to new pivot tuple/high key during suffix truncation.
+ * This sometimes wastes a little space that was only needed as alignment
+ * padding in the original tuple.  Following this convention simplifies the
+ * space accounting used when deduplicating a page (the same convention
+ * simplifies the accounting for choosing a point to split a page at).
+ *
+ * Note: Caller's "htids" array must be unique and already in ascending TID
+ * order.  Any existing heap TIDs from "base" won't automatically appear in
+ * returned posting list tuple (they must be included in htids array.)
+ */
+IndexTuple
+_bt_form_posting(IndexTuple base, ItemPointer htids, int nhtids)
+{
+	uint32		keysize,
+				newsize;
+	IndexTuple	itup;
+
+	if (BTreeTupleIsPosting(base))
+		keysize = BTreeTupleGetPostingOffset(base);
+	else
+		keysize = IndexTupleSize(base);
+
+	Assert(!BTreeTupleIsPivot(base));
+	Assert(nhtids > 0 && nhtids <= PG_UINT16_MAX);
+	Assert(keysize == MAXALIGN(keysize));
+
+	/*
+	 * Determine final size of new tuple.
+	 *
+	 * The calculation used when new tuple has a posting list needs to match
+	 * the code used within _bt_dedup_save_htid().
+	 */
+	if (nhtids > 1)
+		newsize = MAXALIGN(keysize +
+						   nhtids * sizeof(ItemPointerData));
+	else
+		newsize = keysize;
+
+	Assert(newsize <= INDEX_SIZE_MASK);
+	Assert(newsize == MAXALIGN(newsize));
+
+	/* Allocate memory using palloc0() (matches index_form_tuple()) */
+	itup = palloc0(newsize);
+	memcpy(itup, base, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+	if (nhtids > 1)
+	{
+		/* Form posting list tuple */
+		BTreeTupleSetPosting(itup, nhtids, keysize);
+		memcpy(BTreeTupleGetPosting(itup), htids,
+			   sizeof(ItemPointerData) * nhtids);
+		Assert(_bt_posting_valid(itup));
+	}
+	else
+	{
+		/* Form standard non-pivot tuple */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		ItemPointerCopy(htids, &itup->t_tid);
+		Assert(ItemPointerIsValid(&itup->t_tid));
+	}
+
+	return itup;
+}
+
+/*
+ * Prepare for a posting list split by swapping heap TID in newitem with heap
+ * TID from original posting list (the 'oposting' heap TID located at offset
+ * 'postingoff').  Modifies newitem, so caller should pass their own private
+ * copy that can safely be modified.
+ *
+ * Returns new posting list tuple, which is palloc()'d in caller's context.
+ * This is guaranteed to be the same size as 'oposting'.  Modified newitem is
+ * what caller actually inserts. (This generally happens inside the same
+ * critical section that performs an in-place update of old posting list using
+ * new posting list returned here).
+ *
+ * While the keys from newitem and oposting must be opclass equal, and must
+ * generate identical output when run through the underlying type's output
+ * function, it doesn't follow that their representations match exactly.
+ * Caller must avoid assuming that there can't be representational differences
+ * that make datums from oposting bigger or smaller than the corresponding
+ * datums from newitem.  For example, differences in TOAST input state might
+ * break a faulty assumption about tuple size (the executor is entitled to
+ * apply TOAST compression based on its own criteria).  It also seems possible
+ * that further representational variation will be introduced in the future,
+ * in order to support nbtree features like page-level prefix compression.
+ *
+ * See nbtree/README for details on the design of posting list splits.
+ */
+IndexTuple
+_bt_swap_posting(IndexTuple newitem, IndexTuple oposting, int postingoff)
+{
+	int			nhtids;
+	char	   *replacepos;
+	char	   *replaceposright;
+	Size		nmovebytes;
+	IndexTuple	nposting;
+
+	nhtids = BTreeTupleGetNPosting(oposting);
+	Assert(_bt_posting_valid(oposting));
+	Assert(postingoff > 0 && postingoff < nhtids);
+
+	/*
+	 * Move item pointers in posting list to make a gap for the new item's
+	 * heap TID.  We shift TIDs one place to the right, losing original
+	 * rightmost TID. (nmovebytes must not include TIDs to the left of
+	 * postingoff, nor the existing rightmost/max TID that gets overwritten.)
+	 */
+	nposting = CopyIndexTuple(oposting);
+	replacepos = (char *) BTreeTupleGetPostingN(nposting, postingoff);
+	replaceposright = (char *) BTreeTupleGetPostingN(nposting, postingoff + 1);
+	nmovebytes = (nhtids - postingoff - 1) * sizeof(ItemPointerData);
+	memmove(replaceposright, replacepos, nmovebytes);
+
+	/* Fill the gap at postingoff with TID of new item (original new TID) */
+	Assert(!BTreeTupleIsPivot(newitem) && !BTreeTupleIsPosting(newitem));
+	ItemPointerCopy(&newitem->t_tid, (ItemPointer) replacepos);
+
+	/* Now copy oposting's rightmost/max TID into new item (final new TID) */
+	ItemPointerCopy(BTreeTupleGetMaxHeapTID(oposting), &newitem->t_tid);
+
+	Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(nposting),
+							  BTreeTupleGetHeapTID(newitem)) < 0);
+	Assert(_bt_posting_valid(nposting));
+
+	return nposting;
+}
+
+/*
+ * Verify posting list invariants for "posting", which must be a posting list
+ * tuple.  Used within assertions.
+ */
+#ifdef USE_ASSERT_CHECKING
+static bool
+_bt_posting_valid(IndexTuple posting)
+{
+	ItemPointerData last;
+	ItemPointer htid;
+
+	if (!BTreeTupleIsPosting(posting) || BTreeTupleGetNPosting(posting) < 2)
+		return false;
+
+	/* Remember first heap TID for loop */
+	ItemPointerCopy(BTreeTupleGetHeapTID(posting), &last);
+	if (!ItemPointerIsValid(&last))
+		return false;
+
+	/* Iterate, starting from second TID */
+	for (int i = 1; i < BTreeTupleGetNPosting(posting); i++)
+	{
+		htid = BTreeTupleGetPostingN(posting, i);
+
+		if (!ItemPointerIsValid(htid))
+			return false;
+		if (ItemPointerCompare(htid, &last) <= 0)
+			return false;
+		ItemPointerCopy(htid, &last);
+	}
+
+	return true;
+}
+#endif
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 4e5849ab8e..d25efc8c3a 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -28,6 +28,8 @@
 /* Minimum tree height for application of fastpath optimization */
 #define BTREE_FASTPATH_MIN_LEVEL	2
 
+/* GUC parameter */
+bool		deduplicate_btree_items = true;
 
 static Buffer _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
 
@@ -47,10 +49,12 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   BTStack stack,
 						   IndexTuple itup,
 						   OffsetNumber newitemoff,
+						   int postingoff,
 						   bool split_only_page);
 static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 						Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
-						IndexTuple newitem);
+						IndexTuple newitem, IndexTuple orignewitem,
+						IndexTuple nposting, uint16 postingoff);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 							  BTStack stack, bool is_root, bool is_only);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
@@ -125,6 +129,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	insertstate.itup_key = itup_key;
 	insertstate.bounds_valid = false;
 	insertstate.buf = InvalidBuffer;
+	insertstate.postingoff = 0;
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -295,7 +300,7 @@ top:
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
 		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
-					   itup, newitemoff, false);
+					   itup, newitemoff, insertstate.postingoff, false);
 	}
 	else
 	{
@@ -340,6 +345,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 				 uint32 *speculativeToken)
 {
 	IndexTuple	itup = insertstate->itup;
+	IndexTuple	curitup;
+	ItemId		curitemid;
 	BTScanInsert itup_key = insertstate->itup_key;
 	SnapshotData SnapshotDirty;
 	OffsetNumber offset;
@@ -348,6 +355,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 	BTPageOpaque opaque;
 	Buffer		nbuf = InvalidBuffer;
 	bool		found = false;
+	bool		inposting = false;
+	bool		prevalldead = true;
+	int			curposti = 0;
 
 	/* Assume unique until we find a duplicate */
 	*is_unique = true;
@@ -375,13 +385,21 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 	Assert(itup_key->scantid == NULL);
 	for (;;)
 	{
-		ItemId		curitemid;
-		IndexTuple	curitup;
-		BlockNumber nblkno;
-
 		/*
-		 * make sure the offset points to an actual item before trying to
-		 * examine it...
+		 * Each iteration of the loop processes one heap TID, not one index
+		 * tuple.  Current offset number for page isn't usually advanced on
+		 * iterations that process heap TIDs from posting list tuples.
+		 *
+		 * "inposting" state is set when _inside_ a posting list --- not when
+		 * we're at the start (or end) of a posting list.  We advance curposti
+		 * at the end of the iteration when inside a posting list tuple.  In
+		 * general, every loop iteration either advances the page offset or
+		 * advances curposti --- an iteration that handles the rightmost/max
+		 * heap TID in a posting list finally advances the page offset (and
+		 * unsets "inposting").
+		 *
+		 * Make sure the offset points to an actual index tuple before trying
+		 * to examine it...
 		 */
 		if (offset <= maxoff)
 		{
@@ -406,31 +424,60 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 				break;
 			}
 
-			curitemid = PageGetItemId(page, offset);
-
 			/*
-			 * We can skip items that are marked killed.
+			 * We can skip items that are already marked killed.
 			 *
 			 * In the presence of heavy update activity an index may contain
 			 * many killed items with the same key; running _bt_compare() on
 			 * each killed item gets expensive.  Just advance over killed
 			 * items as quickly as we can.  We only apply _bt_compare() when
-			 * we get to a non-killed item.  Even those comparisons could be
-			 * avoided (in the common case where there is only one page to
-			 * visit) by reusing bounds, but just skipping dead items is fast
-			 * enough.
+			 * we get to a non-killed item.  We could reuse the bounds to
+			 * avoid _bt_compare() calls for known equal tuples, but it
+			 * doesn't seem worth it.  Workloads with heavy update activity
+			 * tend to have many deduplication passes, so we'll often avoid
+			 * most of those comparisons, too (we call _bt_compare() when the
+			 * posting list tuple is initially encountered, though not when
+			 * processing later TIDs from the same tuple).
 			 */
-			if (!ItemIdIsDead(curitemid))
+			if (!inposting)
+				curitemid = PageGetItemId(page, offset);
+			if (inposting || !ItemIdIsDead(curitemid))
 			{
 				ItemPointerData htid;
 				bool		all_dead;
 
-				if (_bt_compare(rel, itup_key, page, offset) != 0)
-					break;		/* we're past all the equal tuples */
+				if (!inposting)
+				{
+					/* Plain tuple, or first TID in posting list tuple */
+					if (_bt_compare(rel, itup_key, page, offset) != 0)
+						break;	/* we're past all the equal tuples */
 
-				/* okay, we gotta fetch the heap tuple ... */
-				curitup = (IndexTuple) PageGetItem(page, curitemid);
-				htid = curitup->t_tid;
+					/* Advanced curitup */
+					curitup = (IndexTuple) PageGetItem(page, curitemid);
+					Assert(!BTreeTupleIsPivot(curitup));
+				}
+
+				/* okay, we gotta fetch the heap tuple using htid ... */
+				if (!BTreeTupleIsPosting(curitup))
+				{
+					/* ... htid is from simple non-pivot tuple */
+					Assert(!inposting);
+					htid = curitup->t_tid;
+				}
+				else if (!inposting)
+				{
+					/* ... htid is first TID in new posting list */
+					inposting = true;
+					prevalldead = true;
+					curposti = 0;
+					htid = *BTreeTupleGetPostingN(curitup, 0);
+				}
+				else
+				{
+					/* ... htid is second or subsequent TID in posting list */
+					Assert(curposti > 0);
+					htid = *BTreeTupleGetPostingN(curitup, curposti);
+				}
 
 				/*
 				 * If we are doing a recheck, we expect to find the tuple we
@@ -506,8 +553,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					 * not part of this chain because it had a different index
 					 * entry.
 					 */
-					htid = itup->t_tid;
-					if (table_index_fetch_tuple_check(heapRel, &htid,
+					if (table_index_fetch_tuple_check(heapRel, &itup->t_tid,
 													  SnapshotSelf, NULL))
 					{
 						/* Normal case --- it's still live */
@@ -565,12 +611,14 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 													RelationGetRelationName(rel))));
 					}
 				}
-				else if (all_dead)
+				else if (all_dead && (!inposting ||
+									  (prevalldead &&
+									   curposti == BTreeTupleGetNPosting(curitup) - 1)))
 				{
 					/*
-					 * The conflicting tuple (or whole HOT chain) is dead to
-					 * everyone, so we may as well mark the index entry
-					 * killed.
+					 * The conflicting tuple (or all HOT chains pointed to by
+					 * all posting list TIDs) is dead to everyone, so mark the
+					 * index entry killed.
 					 */
 					ItemIdMarkDead(curitemid);
 					opaque->btpo_flags |= BTP_HAS_GARBAGE;
@@ -584,14 +632,29 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					else
 						MarkBufferDirtyHint(insertstate->buf, true);
 				}
+
+				/*
+				 * Remember if posting list tuple has even a single HOT chain
+				 * whose members are not all dead
+				 */
+				if (!all_dead && inposting)
+					prevalldead = false;
 			}
 		}
 
-		/*
-		 * Advance to next tuple to continue checking.
-		 */
-		if (offset < maxoff)
+		if (inposting && curposti < BTreeTupleGetNPosting(curitup) - 1)
+		{
+			/* Advance to next TID in same posting list */
+			curposti++;
+			continue;
+		}
+		else if (offset < maxoff)
+		{
+			/* Advance to next tuple */
+			curposti = 0;
+			inposting = false;
 			offset = OffsetNumberNext(offset);
+		}
 		else
 		{
 			int			highkeycmp;
@@ -606,7 +669,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 			/* Advance to next non-dead page --- there must be one */
 			for (;;)
 			{
-				nblkno = opaque->btpo_next;
+				BlockNumber nblkno = opaque->btpo_next;
+
 				nbuf = _bt_relandgetbuf(rel, nbuf, nblkno, BT_READ);
 				page = BufferGetPage(nbuf);
 				opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -616,6 +680,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					elog(ERROR, "fell off the end of index \"%s\"",
 						 RelationGetRelationName(rel));
 			}
+			/* Will also advance to next tuple */
+			curposti = 0;
+			inposting = false;
 			maxoff = PageGetMaxOffsetNumber(page);
 			offset = P_FIRSTDATAKEY(opaque);
 			/* Don't invalidate binary search bounds */
@@ -684,6 +751,7 @@ _bt_findinsertloc(Relation rel,
 	BTScanInsert itup_key = insertstate->itup_key;
 	Page		page = BufferGetPage(insertstate->buf);
 	BTPageOpaque lpageop;
+	OffsetNumber newitemoff;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -699,6 +767,8 @@ _bt_findinsertloc(Relation rel,
 
 	if (itup_key->heapkeyspace)
 	{
+		bool		dedupunique = false;
+
 		/*
 		 * If we're inserting into a unique index, we may have to walk right
 		 * through leaf pages to find the one leaf page that we must insert on
@@ -712,9 +782,25 @@ _bt_findinsertloc(Relation rel,
 		 * tuple belongs on.  The heap TID attribute for new tuple (scantid)
 		 * could force us to insert on a sibling page, though that should be
 		 * very rare in practice.
+		 *
+		 * checkingunique inserters that encounter a duplicate will apply
+		 * deduplication when it looks like there will be a page split, but
+		 * there is no LP_DEAD garbage on the leaf page to vacuum away (or
+		 * there wasn't enough space freed by LP_DEAD cleanup).  This
+		 * complements the opportunistic LP_DEAD vacuuming mechanism.  The
+		 * high level goal is to avoid page splits caused by new, unchanged
+		 * versions of existing logical rows altogether.  See nbtree/README
+		 * for full details.
 		 */
 		if (checkingunique)
 		{
+			if (insertstate->low < insertstate->stricthigh)
+			{
+				/* Encountered a duplicate in _bt_check_unique() */
+				Assert(insertstate->bounds_valid);
+				dedupunique = true;
+			}
+
 			for (;;)
 			{
 				/*
@@ -741,18 +827,37 @@ _bt_findinsertloc(Relation rel,
 				/* Update local state after stepping right */
 				page = BufferGetPage(insertstate->buf);
 				lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+				/* Assume duplicates (helpful when initial page is empty) */
+				dedupunique = true;
 			}
 		}
 
 		/*
 		 * If the target page is full, see if we can obtain enough space by
-		 * erasing LP_DEAD items
+		 * erasing LP_DEAD items.  If that doesn't work out, try to obtain
+		 * enough free space to avoid a page split by deduplicating existing
+		 * items (if deduplication is safe).
 		 */
-		if (PageGetFreeSpace(page) < insertstate->itemsz &&
-			P_HAS_GARBAGE(lpageop))
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
 		{
-			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
-			insertstate->bounds_valid = false;
+			if (P_HAS_GARBAGE(lpageop))
+			{
+				_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+				insertstate->bounds_valid = false;
+
+				/* Might as well assume duplicates if checkingunique */
+				dedupunique = true;
+			}
+
+			if (itup_key->safededup && BTGetUseDedup(rel) &&
+				PageGetFreeSpace(page) < insertstate->itemsz &&
+				(!checkingunique || dedupunique))
+			{
+				_bt_dedup_one_page(rel, insertstate->buf, heapRel,
+								   insertstate->itup, insertstate->itemsz,
+								   checkingunique);
+				insertstate->bounds_valid = false;
+			}
 		}
 	}
 	else
@@ -834,7 +939,36 @@ _bt_findinsertloc(Relation rel,
 	Assert(P_RIGHTMOST(lpageop) ||
 		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
 
-	return _bt_binsrch_insert(rel, insertstate);
+	newitemoff = _bt_binsrch_insert(rel, insertstate);
+
+	if (insertstate->postingoff == -1)
+	{
+		/*
+		 * There is an overlapping posting list tuple with its LP_DEAD bit
+		 * set.  _bt_insertonpg() cannot handle this, so delete all LP_DEAD
+		 * items early.  This is the only case where LP_DEAD deletes happen
+		 * even though a page split wouldn't take place if we went straight to
+		 * the _bt_insertonpg() call.
+		 *
+		 * Call _bt_dedup_one_page() instead of _bt_vacuum_one_page() to force
+		 * deletes (this avoids relying on the BTP_HAS_GARBAGE hint flag,
+		 * which might be falsely unset).  Call can't actually dedup items,
+		 * since we pass a newitemsz of 0.
+		 */
+		_bt_dedup_one_page(rel, insertstate->buf, heapRel, insertstate->itup,
+						   0, true);
+
+		/*
+		 * Do new binary search.  New insert location cannot overlap with any
+		 * posting list now.
+		 */
+		insertstate->bounds_valid = false;
+		insertstate->postingoff = 0;
+		newitemoff = _bt_binsrch_insert(rel, insertstate);
+		Assert(insertstate->postingoff == 0);
+	}
+
+	return newitemoff;
 }
 
 /*
@@ -900,10 +1034,12 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		This recursive procedure does the following things:
  *
+ *			+  if postingoff != 0, splits existing posting list tuple
+ *			   (since it overlaps with new 'itup' tuple).
  *			+  if necessary, splits the target page, using 'itup_key' for
  *			   suffix truncation on leaf pages (caller passes NULL for
  *			   non-leaf pages).
- *			+  inserts the tuple.
+ *			+  inserts the new tuple (might be split from posting list).
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
  *			   right using information stored in the parent stack).
@@ -931,11 +1067,15 @@ _bt_insertonpg(Relation rel,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
+			   int postingoff,
 			   bool split_only_page)
 {
 	Page		page;
 	BTPageOpaque lpageop;
 	Size		itemsz;
+	IndexTuple	oposting;
+	IndexTuple	origitup = NULL;
+	IndexTuple	nposting = NULL;
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -949,6 +1089,7 @@ _bt_insertonpg(Relation rel,
 	Assert(P_ISLEAF(lpageop) ||
 		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
+	Assert(!BTreeTupleIsPosting(itup));
 
 	/* The caller should've finished any incomplete splits already. */
 	if (P_INCOMPLETE_SPLIT(lpageop))
@@ -959,6 +1100,34 @@ _bt_insertonpg(Relation rel,
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
+	/*
+	 * Do we need to split an existing posting list item?
+	 */
+	if (postingoff != 0)
+	{
+		ItemId		itemid = PageGetItemId(page, newitemoff);
+
+		/*
+		 * The new tuple is a duplicate with a heap TID that falls inside the
+		 * range of an existing posting list tuple on a leaf page.  Prepare to
+		 * split an existing posting list.  Overwriting the posting list with
+		 * its post-split version is treated as an extra step in either the
+		 * insert or page split critical section.
+		 */
+		Assert(P_ISLEAF(lpageop) && !ItemIdIsDead(itemid));
+		Assert(itup_key->heapkeyspace && itup_key->safededup);
+		oposting = (IndexTuple) PageGetItem(page, itemid);
+
+		/* use a mutable copy of itup as our itup from here on */
+		origitup = itup;
+		itup = CopyIndexTuple(origitup);
+		nposting = _bt_swap_posting(itup, oposting, postingoff);
+		/* itup now contains rightmost/max TID from oposting */
+
+		/* Alter offset so that newitem goes after posting list */
+		newitemoff = OffsetNumberNext(newitemoff);
+	}
+
 	/*
 	 * Do we need to split the page to fit the item on it?
 	 *
@@ -991,7 +1160,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+						 origitup, nposting, postingoff);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1066,6 +1236,9 @@ _bt_insertonpg(Relation rel,
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
+		if (postingoff != 0)
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
 		if (!_bt_pgaddtup(page, itemsz, itup, newitemoff))
 			elog(PANIC, "failed to add new item to block %u in index \"%s\"",
 				 itup_blkno, RelationGetRelationName(rel));
@@ -1115,8 +1288,19 @@ _bt_insertonpg(Relation rel,
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
 
-			if (P_ISLEAF(lpageop))
+			if (P_ISLEAF(lpageop) && postingoff == 0)
+			{
+				/* Simple leaf insert */
 				xlinfo = XLOG_BTREE_INSERT_LEAF;
+			}
+			else if (postingoff != 0)
+			{
+				/*
+				 * Leaf insert with posting list split.  Must include
+				 * postingoff field before newitem/orignewitem.
+				 */
+				xlinfo = XLOG_BTREE_INSERT_POST;
+			}
 			else
 			{
 				/*
@@ -1139,6 +1323,7 @@ _bt_insertonpg(Relation rel,
 				xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 				xlmeta.last_cleanup_num_heap_tuples =
 					metad->btm_last_cleanup_num_heap_tuples;
+				xlmeta.safededup = metad->btm_safededup;
 
 				XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 				XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));
@@ -1147,7 +1332,27 @@ _bt_insertonpg(Relation rel,
 			}
 
 			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
-			XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+			if (postingoff == 0)
+			{
+				/* Simple, common case -- log itup from caller */
+				XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+			}
+			else
+			{
+				/*
+				 * Insert with posting list split (XLOG_BTREE_INSERT_POST
+				 * record) case.
+				 *
+				 * Log postingoff.  Also log origitup, not itup.  REDO routine
+				 * must reconstruct final itup (as well as nposting) using
+				 * _bt_swap_posting().
+				 */
+				uint16		upostingoff = postingoff;
+
+				XLogRegisterBufData(0, (char *) &upostingoff, sizeof(uint16));
+				XLogRegisterBufData(0, (char *) origitup,
+									IndexTupleSize(origitup));
+			}
 
 			recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
@@ -1189,6 +1394,14 @@ _bt_insertonpg(Relation rel,
 			_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
 			RelationSetTargetBlock(rel, cachedBlock);
 	}
+
+	/* be tidy */
+	if (postingoff != 0)
+	{
+		/* itup is actually a modified copy of caller's original */
+		pfree(nposting);
+		pfree(itup);
+	}
 }
 
 /*
@@ -1204,12 +1417,24 @@ _bt_insertonpg(Relation rel,
  *		This function will clear the INCOMPLETE_SPLIT flag on it, and
  *		release the buffer.
  *
+ *		orignewitem, nposting, and postingoff are needed when an insert of
+ *		orignewitem results in both a posting list split and a page split.
+ *		These extra posting list split details are used here in the same
+ *		way as they are used in the more common case where a posting list
+ *		split does not coincide with a page split.  We need to deal with
+ *		posting list splits directly in order to ensure that everything
+ *		that follows from the insert of orignewitem is handled as a single
+ *		atomic operation (though caller's insert of a new pivot/downlink
+ *		into parent page will still be a separate operation).  See
+ *		nbtree/README for details on the design of posting list splits.
+ *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
  */
 static Buffer
 _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+		  IndexTuple orignewitem, IndexTuple nposting, uint16 postingoff)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1229,6 +1454,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	OffsetNumber leftoff,
 				rightoff;
 	OffsetNumber firstright;
+	OffsetNumber origpagepostingoff;
 	OffsetNumber maxoff;
 	OffsetNumber i;
 	bool		newitemonleft,
@@ -1298,6 +1524,34 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	PageSetLSN(leftpage, PageGetLSN(origpage));
 	isleaf = P_ISLEAF(oopaque);
 
+	/*
+	 * Determine page offset number of existing overlapped-with-orignewitem
+	 * posting list when it is necessary to perform a posting list split in
+	 * passing.  Note that newitem was already changed by caller (newitem no
+	 * longer has the orignewitem TID).
+	 *
+	 * This page offset number (origpagepostingoff) will be used to pretend
+	 * that the posting split has already taken place, even though the
+	 * required modifications to origpage won't occur until we reach the
+	 * critical section.  The lastleft and firstright tuples of our page split
+	 * point should, in effect, come from an imaginary version of origpage
+	 * that has the nposting tuple instead of the original posting list tuple.
+	 *
+	 * Note: _bt_findsplitloc() should have compensated for coinciding posting
+	 * list splits in just the same way, at least in theory.  It doesn't
+	 * bother with that, though.  In practice it won't affect its choice of
+	 * split point.
+	 */
+	origpagepostingoff = InvalidOffsetNumber;
+	if (postingoff != 0)
+	{
+		Assert(isleaf);
+		Assert(ItemPointerCompare(&orignewitem->t_tid,
+								  &newitem->t_tid) < 0);
+		Assert(BTreeTupleIsPosting(nposting));
+		origpagepostingoff = OffsetNumberPrev(newitemoff);
+	}
+
 	/*
 	 * The "high key" for the new left page will be the first key that's going
 	 * to go into the new right page, or a truncated version if this is a leaf
@@ -1335,6 +1589,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, firstright);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		if (firstright == origpagepostingoff)
+			item = nposting;
 	}
 
 	/*
@@ -1368,6 +1624,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
 			itemid = PageGetItemId(origpage, lastleftoff);
 			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+			if (lastleftoff == origpagepostingoff)
+				lastleft = nposting;
 		}
 
 		Assert(lastleft != item);
@@ -1383,6 +1641,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	 */
 	leftoff = P_HIKEY;
 
+	Assert(BTreeTupleIsPivot(lefthikey) || !itup_key->heapkeyspace);
 	Assert(BTreeTupleGetNAtts(lefthikey, rel) > 0);
 	Assert(BTreeTupleGetNAtts(lefthikey, rel) <= indnkeyatts);
 	if (PageAddItem(leftpage, (Item) lefthikey, itemsz, leftoff,
@@ -1447,6 +1706,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		Assert(BTreeTupleIsPivot(item) || !itup_key->heapkeyspace);
 		Assert(BTreeTupleGetNAtts(item, rel) > 0);
 		Assert(BTreeTupleGetNAtts(item, rel) <= indnkeyatts);
 		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
@@ -1475,8 +1735,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 
+		/* replace original item with nposting due to posting split? */
+		if (i == origpagepostingoff)
+		{
+			Assert(BTreeTupleIsPosting(item));
+			Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+			item = nposting;
+		}
+
 		/* does new item belong before this one? */
-		if (i == newitemoff)
+		else if (i == newitemoff)
 		{
 			if (newitemonleft)
 			{
@@ -1645,8 +1913,12 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		XLogRecPtr	recptr;
 
 		xlrec.level = ropaque->btpo.level;
+		/* See comments below on newitem, orignewitem, and posting lists */
 		xlrec.firstright = firstright;
 		xlrec.newitemoff = newitemoff;
+		xlrec.postingoff = 0;
+		if (postingoff != 0 && origpagepostingoff < firstright)
+			xlrec.postingoff = postingoff;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1665,11 +1937,35 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		 * because it's included with all the other items on the right page.)
 		 * Show the new item as belonging to the left page buffer, so that it
 		 * is not stored if XLogInsert decides it needs a full-page image of
-		 * the left page.  We store the offset anyway, though, to support
-		 * archive compression of these records.
+		 * the left page.  We always store newitemoff in the record, though.
+		 *
+		 * The details are sometimes slightly different for page splits that
+		 * coincide with a posting list split.  If both the replacement
+		 * posting list and newitem go on the right page, then we don't need
+		 * to log anything extra, just like the simple !newitemonleft
+		 * no-posting-split case (postingoff is set to zero in the WAL record,
+		 * so recovery doesn't need to process a posting list split at all).
+		 * Otherwise, we set postingoff and log orignewitem instead of
+		 * newitem, despite having actually inserted newitem.  REDO routine
+		 * must reconstruct nposting and newitem using _bt_swap_posting().
+		 *
+		 * Note: It's possible that our page split point is the point that
+		 * makes the posting list lastleft and newitem firstright.  This is
+		 * the only case where we log orignewitem/newitem despite newitem
+		 * going on the right page.  If XLogInsert decides that it can omit
+		 * orignewitem due to logging a full-page image of the left page,
+		 * everything still works out, since recovery only needs to log
+		 * orignewitem for items on the left page (just like the regular
+		 * newitem-logged case).
 		 */
-		if (newitemonleft)
+		if (newitemonleft && xlrec.postingoff == 0)
 			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+		else if (xlrec.postingoff != 0)
+		{
+			Assert(newitemonleft || firstright == newitemoff);
+			Assert(MAXALIGN(newitemsz) == IndexTupleSize(orignewitem));
+			XLogRegisterBufData(0, (char *) orignewitem, MAXALIGN(newitemsz));
+		}
 
 		/* Log the left page's new high key */
 		itemid = PageGetItemId(origpage, P_HIKEY);
@@ -1829,7 +2125,7 @@ _bt_insert_parent(Relation rel,
 
 		/* Recursively insert into the parent */
 		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
-					   new_item, stack->bts_offset + 1,
+					   new_item, stack->bts_offset + 1, 0,
 					   is_only);
 
 		/* be tidy */
@@ -2185,6 +2481,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 		md.fastlevel = metad->btm_level;
 		md.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 		md.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+		md.safededup = metad->btm_safededup;
 
 		XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -2265,7 +2562,7 @@ _bt_pgaddtup(Page page,
 static void
 _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 {
-	OffsetNumber deletable[MaxOffsetNumber];
+	OffsetNumber deletable[MaxIndexTuplesPerPage];
 	int			ndeletable = 0;
 	OffsetNumber offnum,
 				minoff,
@@ -2298,6 +2595,6 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * Note: if we didn't find any LP_DEAD items, then the page's
 	 * BTP_HAS_GARBAGE hint bit is falsely set.  We do not bother expending a
 	 * separate write to clear it, however.  We will clear it when we split
-	 * the page.
+	 * the page, or when deduplication runs.
 	 */
 }
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index f05cbe7467..72b3921119 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -37,6 +38,8 @@ static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
 static bool _bt_mark_page_halfdead(Relation rel, Buffer buf, BTStack stack);
 static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
 									 bool *rightsib_empty);
+static TransactionId _bt_xid_horizon(Relation rel, Relation heapRel, Page page,
+									 OffsetNumber *deletable, int ndeletable);
 static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
 								   BTStack stack, Buffer *topparent, OffsetNumber *topoff,
 								   BlockNumber *target, BlockNumber *rightsib);
@@ -47,7 +50,8 @@ static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
  *	_bt_initmetapage() -- Fill a page buffer with a correct metapage image
  */
 void
-_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
+_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+				 bool safededup)
 {
 	BTMetaPageData *metad;
 	BTPageOpaque metaopaque;
@@ -63,6 +67,7 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
 	metad->btm_fastlevel = level;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+	metad->btm_safededup = safededup;
 
 	metaopaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	metaopaque->btpo_flags = BTP_META;
@@ -102,6 +107,9 @@ _bt_upgrademetapage(Page page)
 	metad->btm_version = BTREE_NOVAC_VERSION;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+	/* Only a REINDEX can set this field */
+	Assert(!metad->btm_safededup);
+	metad->btm_safededup = false;
 
 	/* Adjust pd_lower (see _bt_initmetapage() for details) */
 	((PageHeader) page)->pd_lower =
@@ -213,6 +221,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 		md.fastlevel = metad->btm_fastlevel;
 		md.oldest_btpo_xact = oldestBtpoXact;
 		md.last_cleanup_num_heap_tuples = numHeapTuples;
+		md.safededup = metad->btm_safededup;
 
 		XLogRegisterBufData(0, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -274,6 +283,8 @@ _bt_getroot(Relation rel, int access)
 		Assert(metad->btm_magic == BTREE_MAGIC);
 		Assert(metad->btm_version >= BTREE_MIN_VERSION);
 		Assert(metad->btm_version <= BTREE_VERSION);
+		Assert(!metad->btm_safededup ||
+			   metad->btm_version > BTREE_NOVAC_VERSION);
 		Assert(metad->btm_root != P_NONE);
 
 		rootblkno = metad->btm_fastroot;
@@ -394,6 +405,7 @@ _bt_getroot(Relation rel, int access)
 			md.fastlevel = 0;
 			md.oldest_btpo_xact = InvalidTransactionId;
 			md.last_cleanup_num_heap_tuples = -1.0;
+			md.safededup = metad->btm_safededup;
 
 			XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -618,22 +630,33 @@ _bt_getrootheight(Relation rel)
 	Assert(metad->btm_magic == BTREE_MAGIC);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
 	Assert(metad->btm_version <= BTREE_VERSION);
+	Assert(!metad->btm_safededup || metad->btm_version > BTREE_NOVAC_VERSION);
 	Assert(metad->btm_fastroot != P_NONE);
 
 	return metad->btm_fastlevel;
 }
 
 /*
- *	_bt_heapkeyspace() -- is heap TID being treated as a key?
+ *	_bt_metaversion() -- Get version/status info from metapage.
+ *
+ *		Sets caller's *heapkeyspace and *safededup arguments using data from
+ *		the B-Tree metapage (could be locally-cached version).  This
+ *		information needs to be stashed in insertion scankey, so we provide a
+ *		single function that fetches both at once.
  *
  *		This is used to determine the rules that must be used to descend a
  *		btree.  Version 4 indexes treat heap TID as a tiebreaker attribute.
  *		pg_upgrade'd version 3 indexes need extra steps to preserve reasonable
  *		performance when inserting a new BTScanInsert-wise duplicate tuple
  *		among many leaf pages already full of such duplicates.
+ *
+ *		Also sets field that indicates to caller whether or not it is safe to
+ *		apply deduplication within index.  Note that we rely on the assumption
+ *		that btm_safededup will be zero'ed on heapkeyspace indexes that were
+ *		pg_upgrade'd from Postgres 12.
  */
-bool
-_bt_heapkeyspace(Relation rel)
+void
+_bt_metaversion(Relation rel, bool *heapkeyspace, bool *safededup)
 {
 	BTMetaPageData *metad;
 
@@ -651,10 +674,11 @@ _bt_heapkeyspace(Relation rel)
 		 */
 		if (metad->btm_root == P_NONE)
 		{
-			uint32		btm_version = metad->btm_version;
+			*heapkeyspace = metad->btm_version > BTREE_NOVAC_VERSION;
+			*safededup = metad->btm_safededup;
 
 			_bt_relbuf(rel, metabuf);
-			return btm_version > BTREE_NOVAC_VERSION;
+			return;
 		}
 
 		/*
@@ -678,9 +702,11 @@ _bt_heapkeyspace(Relation rel)
 	Assert(metad->btm_magic == BTREE_MAGIC);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
 	Assert(metad->btm_version <= BTREE_VERSION);
+	Assert(!metad->btm_safededup || metad->btm_version > BTREE_NOVAC_VERSION);
 	Assert(metad->btm_fastroot != P_NONE);
 
-	return metad->btm_version > BTREE_NOVAC_VERSION;
+	*heapkeyspace = metad->btm_version > BTREE_NOVAC_VERSION;
+	*safededup = metad->btm_safededup;
 }
 
 /*
@@ -964,28 +990,88 @@ _bt_page_recyclable(Page page)
  * Delete item(s) from a btree leaf page during VACUUM.
  *
  * This routine assumes that the caller has a super-exclusive write lock on
- * the buffer.  Also, the given deletable array *must* be sorted in ascending
- * order.
+ * the buffer.  Also, the given deletable and updatable arrays *must* be
+ * sorted in ascending order.
+ *
+ * Routine deals with deleting TIDs when some (but not all) of the heap TIDs
+ * in an existing posting list item are to be removed by VACUUM.  This works
+ * by updating/overwriting an existing item with caller's new version of the
+ * item (a version that lacks the TIDs that are to be deleted).
  *
  * We record VACUUMs and b-tree deletes differently in WAL.  Deletes must
  * generate their own latestRemovedXid by accessing the heap directly, whereas
- * VACUUMs rely on the initial heap scan taking care of it indirectly.
+ * VACUUMs rely on the initial heap scan taking care of it indirectly.  Also,
+ * only VACUUM can perform granular deletes of individual TIDs in posting list
+ * tuples.
  */
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
-					OffsetNumber *deletable, int ndeletable)
+					OffsetNumber *deletable, int ndeletable,
+					OffsetNumber *updatable, IndexTuple *updated,
+					int nupdatable)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	IndexTuple	itup;
+	Size		itemsz;
+	char	   *updatedbuf = NULL;
+	Size		updatedbuflen = 0;
 
 	/* Shouldn't be called unless there's something to do */
-	Assert(ndeletable > 0);
+	Assert(ndeletable > 0 || nupdatable > 0);
+
+	/* XLOG stuff -- allocate and fill buffer before critical section */
+	if (nupdatable > 0 && RelationNeedsWAL(rel))
+	{
+		Size		offset = 0;
+
+		for (int i = 0; i < nupdatable; i++)
+		{
+			itup = updated[i];
+			itemsz = MAXALIGN(IndexTupleSize(itup));
+			updatedbuflen += itemsz;
+		}
+
+		updatedbuf = palloc(updatedbuflen);
+		for (int i = 0; i < nupdatable; i++)
+		{
+			itup = updated[i];
+			itemsz = MAXALIGN(IndexTupleSize(itup));
+			memcpy(updatedbuf + offset, itup, itemsz);
+			offset += itemsz;
+		}
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
-	/* Fix the page */
-	PageIndexMultiDelete(page, deletable, ndeletable);
+	/*
+	 * Handle posting tuple updates.
+	 *
+	 * Deliberately do this before handling simple deletes.  If we did it the
+	 * other way around (i.e. WAL record order -- simple deletes before
+	 * updates) then we'd have to make compensating changes to the 'updatable'
+	 * array of offset numbers.
+	 *
+	 * PageIndexTupleOverwrite() won't unset each item's LP_DEAD bit when it
+	 * happens to already be set.  Although we unset the BTP_HAS_GARBAGE page
+	 * level flag, unsetting individual LP_DEAD bits should still be avoided.
+	 */
+	for (int i = 0; i < nupdatable; i++)
+	{
+		OffsetNumber offnum = updatable[i];
+
+		itup = updated[i];
+		itemsz = MAXALIGN(IndexTupleSize(itup));
+
+		if (!PageIndexTupleOverwrite(page, offnum, (Item) itup, itemsz))
+			elog(PANIC, "could not update partially dead item in block %u of index \"%s\"",
+				 BufferGetBlockNumber(buf), RelationGetRelationName(rel));
+	}
+
+	/* Now handle simple deletes of entire tuples */
+	if (ndeletable > 0)
+		PageIndexMultiDelete(page, deletable, ndeletable);
 
 	/*
 	 * We can clear the vacuum cycle ID since this page has certainly been
@@ -1006,7 +1092,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	 * limited, since we never falsely unset an LP_DEAD bit.  Workloads that
 	 * are particularly dependent on LP_DEAD bits being set quickly will
 	 * usually manage to set the BTP_HAS_GARBAGE flag before the page fills up
-	 * again anyway.
+	 * again anyway.  Furthermore, attempting a deduplication pass will remove
+	 * all LP_DEAD items, regardless of whether the BTP_HAS_GARBAGE hint bit
+	 * is set or not.
 	 */
 	opaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 
@@ -1019,18 +1107,22 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.ndeleted = ndeletable;
+		xlrec_vacuum.nupdated = nupdatable;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
 		XLogRegisterData((char *) &xlrec_vacuum, SizeOfBtreeVacuum);
 
-		/*
-		 * The deletable array is not in the buffer, but pretend that it is.
-		 * When XLogInsert stores the whole buffer, the array need not be
-		 * stored too.
-		 */
-		XLogRegisterBufData(0, (char *) deletable,
-							ndeletable * sizeof(OffsetNumber));
+		if (ndeletable > 0)
+			XLogRegisterBufData(0, (char *) deletable,
+								ndeletable * sizeof(OffsetNumber));
+
+		if (nupdatable > 0)
+		{
+			XLogRegisterBufData(0, (char *) updatable,
+								nupdatable * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, updatedbuf, updatedbuflen);
+		}
 
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
@@ -1038,6 +1130,10 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	}
 
 	END_CRIT_SECTION();
+
+	/* can't leak memory here */
+	if (updatedbuf != NULL)
+		pfree(updatedbuf);
 }
 
 /*
@@ -1050,6 +1146,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
  * This is nearly the same as _bt_delitems_vacuum as far as what it does to
  * the page, but it needs to generate its own latestRemovedXid by accessing
  * the heap.  This is used by the REDO routine to generate recovery conflicts.
+ * Also, it doesn't handle posting list tuples unless the entire tuple can be
+ * deleted as a whole (since there is only one LP_DEAD bit per line pointer).
  */
 void
 _bt_delitems_delete(Relation rel, Buffer buf,
@@ -1065,8 +1163,7 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 
 	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
 		latestRemovedXid =
-			index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
-												 deletable, ndeletable);
+			_bt_xid_horizon(rel, heapRel, page, deletable, ndeletable);
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
@@ -1113,6 +1210,83 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 	END_CRIT_SECTION();
 }
 
+/*
+ * Get the latestRemovedXid from the table entries pointed to by the non-pivot
+ * tuples being deleted.
+ *
+ * This is a specialized version of index_compute_xid_horizon_for_tuples().
+ * It's needed because btree tuples don't always store table TID using the
+ * standard index tuple header field.
+ */
+static TransactionId
+_bt_xid_horizon(Relation rel, Relation heapRel, Page page,
+				OffsetNumber *deletable, int ndeletable)
+{
+	TransactionId latestRemovedXid = InvalidTransactionId;
+	int			spacenhtids;
+	int			nhtids;
+	ItemPointer htids;
+
+	/* Array will grow iff there are posting list tuples to consider */
+	spacenhtids = ndeletable;
+	nhtids = 0;
+	htids = (ItemPointer) palloc(sizeof(ItemPointerData) * spacenhtids);
+	for (int i = 0; i < ndeletable; i++)
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+
+		itemid = PageGetItemId(page, deletable[i]);
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(ItemIdIsDead(itemid));
+		Assert(!BTreeTupleIsPivot(itup));
+
+		if (!BTreeTupleIsPosting(itup))
+		{
+			if (nhtids + 1 > spacenhtids)
+			{
+				spacenhtids *= 2;
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * spacenhtids);
+			}
+
+			Assert(ItemPointerIsValid(&itup->t_tid));
+			ItemPointerCopy(&itup->t_tid, &htids[nhtids]);
+			nhtids++;
+		}
+		else
+		{
+			int			nposting = BTreeTupleGetNPosting(itup);
+
+			if (nhtids + nposting > spacenhtids)
+			{
+				spacenhtids = Max(spacenhtids * 2, nhtids + nposting);
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * spacenhtids);
+			}
+
+			for (int j = 0; j < nposting; j++)
+			{
+				ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+				Assert(ItemPointerIsValid(htid));
+				ItemPointerCopy(htid, &htids[nhtids]);
+				nhtids++;
+			}
+		}
+	}
+
+	Assert(nhtids >= ndeletable);
+
+	latestRemovedXid =
+		table_compute_xid_horizon_for_tuples(heapRel, htids, nhtids);
+
+	pfree(htids);
+
+	return latestRemovedXid;
+}
+
 /*
  * Returns true, if the given block has the half-dead flag set.
  */
@@ -2058,6 +2232,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 			xlmeta.fastlevel = metad->btm_fastlevel;
 			xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 			xlmeta.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+			xlmeta.safededup = metad->btm_safededup;
 
 			XLogRegisterBufData(4, (char *) &xlmeta, sizeof(xl_btree_metadata));
 			xlinfo = XLOG_BTREE_UNLINK_PAGE_META;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 5254bc7ef5..4227d2e3be 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -95,6 +95,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static ItemPointer btreevacuumposting(BTVacState *vstate, IndexTuple posting,
+									  int *nremaining);
 
 
 /*
@@ -161,7 +163,7 @@ btbuildempty(Relation index)
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, P_NONE, 0);
+	_bt_initmetapage(metapage, P_NONE, 0, _bt_opclasses_support_dedup(index));
 
 	/*
 	 * Write the page and log it.  It might seem that an immediate sync would
@@ -264,8 +266,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 				 */
 				if (so->killedItems == NULL)
 					so->killedItems = (int *)
-						palloc(MaxIndexTuplesPerPage * sizeof(int));
-				if (so->numKilled < MaxIndexTuplesPerPage)
+						palloc(MaxTIDsPerBTreePage * sizeof(int));
+				if (so->numKilled < MaxTIDsPerBTreePage)
 					so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 			}
 
@@ -1154,11 +1156,16 @@ restart:
 	}
 	else if (P_ISLEAF(opaque))
 	{
-		OffsetNumber deletable[MaxOffsetNumber];
+		OffsetNumber deletable[MaxIndexTuplesPerPage];
 		int			ndeletable;
+		IndexTuple	updated[MaxIndexTuplesPerPage];
+		OffsetNumber updatable[MaxIndexTuplesPerPage];
+		int			nupdatable;
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
+		int			nhtidsdead,
+					nhtidslive;
 
 		/*
 		 * Trade in the initial read lock for a super-exclusive write lock on
@@ -1190,8 +1197,11 @@ restart:
 		 * point using callback.
 		 */
 		ndeletable = 0;
+		nupdatable = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
+		nhtidsdead = 0;
+		nhtidslive = 0;
 		if (callback)
 		{
 			for (offnum = minoff;
@@ -1199,11 +1209,9 @@ restart:
 				 offnum = OffsetNumberNext(offnum))
 			{
 				IndexTuple	itup;
-				ItemPointer htup;
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
 				/*
 				 * Hot Standby assumes that it's okay that XLOG_BTREE_VACUUM
@@ -1226,22 +1234,86 @@ restart:
 				 * simple, and allows us to always avoid generating our own
 				 * conflicts.
 				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				Assert(!BTreeTupleIsPivot(itup));
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Regular tuple, standard table TID representation */
+					if (callback(&itup->t_tid, callback_state))
+					{
+						deletable[ndeletable++] = offnum;
+						nhtidsdead++;
+					}
+					else
+						nhtidslive++;
+				}
+				else
+				{
+					ItemPointer newhtids;
+					int			nremaining;
+
+					/* Posting list tuple */
+					newhtids = btreevacuumposting(vstate, itup, &nremaining);
+					if (newhtids == NULL)
+					{
+						/*
+						 * All table TIDs from the posting tuple remain, so no
+						 * delete or update required
+						 */
+						Assert(nremaining == BTreeTupleGetNPosting(itup));
+					}
+					else if (nremaining > 0)
+					{
+						IndexTuple	updatedtuple;
+
+						/*
+						 * Form new tuple that contains only remaining TIDs.
+						 * Remember this new tuple and the offset of the tuple
+						 * to be updated for the page's _bt_delitems_vacuum()
+						 * call.
+						 */
+						Assert(nremaining < BTreeTupleGetNPosting(itup));
+						updatedtuple = _bt_form_posting(itup, newhtids,
+														nremaining);
+						updated[nupdatable] = updatedtuple;
+						updatable[nupdatable++] = offnum;
+						nhtidsdead += BTreeTupleGetNPosting(itup) - nremaining;
+						pfree(newhtids);
+					}
+					else
+					{
+						/*
+						 * All table TIDs from the posting list must be
+						 * deleted.  We'll delete the index tuple completely
+						 * (no update).
+						 */
+						Assert(nremaining == 0);
+						deletable[ndeletable++] = offnum;
+						nhtidsdead += BTreeTupleGetNPosting(itup);
+						pfree(newhtids);
+					}
+
+					nhtidslive += nremaining;
+				}
 			}
 		}
 
 		/*
-		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
-		 * call per page, so as to minimize WAL traffic.
+		 * Apply any needed deletes or updates.  We issue just one
+		 * _bt_delitems_vacuum() call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nupdatable > 0)
 		{
-			_bt_delitems_vacuum(rel, buf, deletable, ndeletable);
+			Assert(nhtidsdead >= Max(ndeletable, 1));
+			_bt_delitems_vacuum(rel, buf, deletable, ndeletable, updatable,
+								updated, nupdatable);
 
-			stats->tuples_removed += ndeletable;
+			stats->tuples_removed += nhtidsdead;
 			/* must recompute maxoff */
 			maxoff = PageGetMaxOffsetNumber(page);
+
+			/* can't leak memory here */
+			for (int i = 0; i < nupdatable; i++)
+				pfree(updated[i]);
 		}
 		else
 		{
@@ -1254,6 +1326,7 @@ restart:
 			 * We treat this like a hint-bit update because there's no need to
 			 * WAL-log it.
 			 */
+			Assert(nhtidsdead == 0);
 			if (vstate->cycleid != 0 &&
 				opaque->btpo_cycleid == vstate->cycleid)
 			{
@@ -1263,15 +1336,18 @@ restart:
 		}
 
 		/*
-		 * If it's now empty, try to delete; else count the live tuples. We
-		 * don't delete when recursing, though, to avoid putting entries into
-		 * freePages out-of-order (doesn't seem worth any extra code to handle
-		 * the case).
+		 * If it's now empty, try to delete; else count the live tuples (live
+		 * table TIDs in posting lists are counted as separate live tuples).
+		 * We don't delete when recursing, though, to avoid putting entries
+		 * into freePages out-of-order (doesn't seem worth any extra code to
+		 * handle the case).
 		 */
 		if (minoff > maxoff)
 			delete_now = (blkno == orig_blkno);
 		else
-			stats->num_index_tuples += maxoff - minoff + 1;
+			stats->num_index_tuples += nhtidslive;
+
+		Assert(!delete_now || nhtidslive == 0);
 	}
 
 	if (delete_now)
@@ -1303,9 +1379,10 @@ restart:
 	/*
 	 * This is really tail recursion, but if the compiler is too stupid to
 	 * optimize it as such, we'd eat an uncomfortably large amount of stack
-	 * space per recursion level (due to the deletable[] array). A failure is
-	 * improbable since the number of levels isn't likely to be large ... but
-	 * just in case, let's hand-optimize into a loop.
+	 * space per recursion level (due to the arrays used to track details of
+	 * deletable/updatable items).  A failure is improbable since the number
+	 * of levels isn't likely to be large ...  but just in case, let's
+	 * hand-optimize into a loop.
 	 */
 	if (recurse_to != P_NONE)
 	{
@@ -1314,6 +1391,67 @@ restart:
 	}
 }
 
+/*
+ * btreevacuumposting --- determine TIDs still needed in posting list
+ *
+ * Returns new palloc'd array of item pointers needed to build
+ * replacement posting list tuple without the TIDs that VACUUM needs to
+ * delete.  Returned value is NULL in the common case no changes are
+ * needed in caller's posting list tuple (we avoid allocating memory
+ * here as an optimization).
+ *
+ * The number of TIDs that should remain in the posting list tuple is
+ * set for caller in *nremaining.  This indicates the number of elements
+ * in the returned array (assuming that return value isn't just NULL).
+ */
+static ItemPointer
+btreevacuumposting(BTVacState *vstate, IndexTuple posting, int *nremaining)
+{
+	int			live = 0;
+	int			nitem = BTreeTupleGetNPosting(posting);
+	ItemPointer tmpitems = NULL,
+				items = BTreeTupleGetPosting(posting);
+
+	for (int i = 0; i < nitem; i++)
+	{
+		if (!vstate->callback(items + i, vstate->callback_state))
+		{
+			/*
+			 * Live table TID.
+			 *
+			 * Only save live TID when we already know that we're going to
+			 * have to kill at least one TID, and have already allocated
+			 * memory.
+			 */
+			if (tmpitems)
+				tmpitems[live] = items[i];
+			live++;
+		}
+		else if (tmpitems == NULL)
+		{
+			/*
+			 * First dead table TID encountered.
+			 *
+			 * It's now clear that we need to delete one or more dead table
+			 * TIDs, so start maintaining an array of live TIDs for caller to
+			 * reconstruct smaller replacement posting list tuple
+			 */
+			tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+			/* Copy live TIDs skipped in previous iterations, if any */
+			if (live > 0)
+				memcpy(tmpitems, items, sizeof(ItemPointerData) * live);
+		}
+		else
+		{
+			/* Second or subsequent dead table TID */
+		}
+	}
+
+	*nremaining = live;
+	return tmpitems;
+}
+
 /*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c573814f01..c60f3bd6a0 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int	_bt_binsrch_posting(BTScanInsert key, Page page,
+								OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static int	_bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+								  OffsetNumber offnum, ItemPointer heapTid,
+								  IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+									   OffsetNumber offnum,
+									   ItemPointer heapTid, int tupleOffset);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -142,6 +150,7 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
 		offnum = _bt_binsrch(rel, key, *bufP);
 		itemid = PageGetItemId(page, offnum);
 		itup = (IndexTuple) PageGetItem(page, itemid);
+		Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
 		blkno = BTreeTupleGetDownLink(itup);
 		par_blkno = BufferGetBlockNumber(*bufP);
 
@@ -434,7 +443,10 @@ _bt_binsrch(Relation rel,
  * low) makes bounds invalid.
  *
  * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by a posting
+ * list split).
  */
 OffsetNumber
 _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +465,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 	Assert(P_ISLEAF(opaque));
 	Assert(!key->nextkey);
+	Assert(insertstate->postingoff == 0);
 
 	if (!insertstate->bounds_valid)
 	{
@@ -509,6 +522,16 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 			if (result != 0)
 				stricthigh = high;
 		}
+
+		/*
+		 * If tuple at offset located by binary search is a posting list whose
+		 * TID range overlaps with caller's scantid, perform posting list
+		 * binary search to set postingoff for caller.  Caller must split the
+		 * posting list when postingoff is set.  This should happen
+		 * infrequently.
+		 */
+		if (unlikely(result == 0 && key->scantid != NULL))
+			insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
 	}
 
 	/*
@@ -528,6 +551,73 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	return low;
 }
 
+/*----------
+ *	_bt_binsrch_posting() -- posting list binary search.
+ *
+ * Helper routine for _bt_binsrch_insert().
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+	IndexTuple	itup;
+	ItemId		itemid;
+	int			low,
+				high,
+				mid,
+				res;
+
+	/*
+	 * If this isn't a posting tuple, then the index must be corrupt (if it is
+	 * an ordinary non-pivot tuple then there must be an existing tuple with a
+	 * heap TID that equals inserter's new heap TID/scantid).  Defensively
+	 * check that tuple is a posting list tuple whose posting list range
+	 * includes caller's scantid.
+	 *
+	 * (This is also needed because contrib/amcheck's rootdescend option needs
+	 * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+	 */
+	itemid = PageGetItemId(page, offnum);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+	if (!BTreeTupleIsPosting(itup))
+		return 0;
+
+	Assert(key->heapkeyspace && key->safededup);
+
+	/*
+	 * In the event that posting list tuple has LP_DEAD bit set, indicate this
+	 * to _bt_binsrch_insert() caller by returning -1, a sentinel value.  A
+	 * second call to _bt_binsrch_insert() can take place when its caller has
+	 * removed the dead item.
+	 */
+	if (ItemIdIsDead(itemid))
+		return -1;
+
+	/* "high" is past end of posting list for loop invariant */
+	low = 0;
+	high = BTreeTupleGetNPosting(itup);
+	Assert(high >= 2);
+
+	while (high > low)
+	{
+		mid = low + ((high - low) / 2);
+		res = ItemPointerCompare(key->scantid,
+								 BTreeTupleGetPostingN(itup, mid));
+
+		if (res > 0)
+			low = mid + 1;
+		else if (res < 0)
+			high = mid;
+		else
+			return mid;
+	}
+
+	/* Exact match not found */
+	return low;
+}
+
 /*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
@@ -537,9 +627,14 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
  *			<0 if scankey < tuple at offnum;
  *			 0 if scankey == tuple at offnum;
  *			>0 if scankey > tuple at offnum.
- *		NULLs in the keys are treated as sortable values.  Therefore
- *		"equality" does not necessarily mean that the item should be
- *		returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values.  Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key.  Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid.  There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
  *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
@@ -563,6 +658,7 @@ _bt_compare(Relation rel,
 	ScanKey		scankey;
 	int			ncmpkey;
 	int			ntupatts;
+	int32		result;
 
 	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +693,6 @@ _bt_compare(Relation rel,
 	{
 		Datum		datum;
 		bool		isNull;
-		int32		result;
 
 		datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
 
@@ -713,8 +808,25 @@ _bt_compare(Relation rel,
 	if (heapTid == NULL)
 		return 1;
 
+	/*
+	 * Scankey must be treated as equal to a posting list tuple if its scantid
+	 * value falls within the range of the posting list.  In all other cases
+	 * there can only be a single heap TID value, which is compared directly
+	 * with scantid.
+	 */
 	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	return ItemPointerCompare(key->scantid, heapTid);
+	result = ItemPointerCompare(key->scantid, heapTid);
+	if (result <= 0 || !BTreeTupleIsPosting(itup))
+		return result;
+	else
+	{
+		result = ItemPointerCompare(key->scantid,
+									BTreeTupleGetMaxHeapTID(itup));
+		if (result > 0)
+			return 1;
+	}
+
+	return 0;
 }
 
 /*
@@ -1229,7 +1341,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	}
 
 	/* Initialize remaining insertion scan key fields */
-	inskey.heapkeyspace = _bt_heapkeyspace(rel);
+	_bt_metaversion(rel, &inskey.heapkeyspace, &inskey.safededup);
 	inskey.anynullkeys = false; /* unused */
 	inskey.nextkey = nextkey;
 	inskey.pivotsearch = false;
@@ -1484,9 +1596,35 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
-				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				/* tuple passes all scan key conditions */
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Remember it */
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+				else
+				{
+					int			tupleOffset;
+
+					/*
+					 * Set up state to return posting list, and remember first
+					 * TID
+					 */
+					tupleOffset =
+						_bt_setuppostingitems(so, itemIndex, offnum,
+											  BTreeTupleGetPostingN(itup, 0),
+											  itup);
+					itemIndex++;
+					/* Remember additional TIDs */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											tupleOffset);
+						itemIndex++;
+					}
+				}
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1519,7 +1657,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxTIDsPerBTreePage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1527,7 +1665,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxTIDsPerBTreePage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1568,9 +1706,41 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 										 &continuescan);
 			if (passes_quals && tuple_alive)
 			{
-				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				/* tuple passes all scan key conditions */
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Remember it */
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+				else
+				{
+					int			tupleOffset;
+
+					/*
+					 * Set up state to return posting list, and remember first
+					 * TID.
+					 *
+					 * Note that we deliberately save/return items from
+					 * posting lists in ascending heap TID order for backwards
+					 * scans.  This allows _bt_killitems() to make a
+					 * consistent assumption about the order of items
+					 * associated with the same posting list tuple.
+					 */
+					itemIndex--;
+					tupleOffset =
+						_bt_setuppostingitems(so, itemIndex, offnum,
+											  BTreeTupleGetPostingN(itup, 0),
+											  itup);
+					/* Remember additional TIDs */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						itemIndex--;
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											tupleOffset);
+					}
+				}
 			}
 			if (!continuescan)
 			{
@@ -1584,8 +1754,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
+		so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1768,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPivot(itup) && !BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1610,6 +1782,71 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 	}
 }
 
+/*
+ * Setup state to save TIDs/items from a single posting list tuple.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for TID that is
+ * returned to scan first.  Second or subsequent TIDs for posting list should
+ * be saved by calling _bt_savepostingitem().
+ *
+ * Returns an offset into tuple storage space that main tuple is stored at if
+ * needed.
+ */
+static int
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					  ItemPointer heapTid, IndexTuple itup)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	Assert(BTreeTupleIsPosting(itup));
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+	if (so->currTuples)
+	{
+		/* Save base IndexTuple (truncate posting list) */
+		IndexTuple	base;
+		Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+		itupsz = MAXALIGN(itupsz);
+		currItem->tupleOffset = so->currPos.nextTupleOffset;
+		base = (IndexTuple) (so->currTuples + so->currPos.nextTupleOffset);
+		memcpy(base, itup, itupsz);
+		/* Defensively reduce work area index tuple header size */
+		base->t_info &= ~INDEX_SIZE_MASK;
+		base->t_info |= itupsz;
+		so->currPos.nextTupleOffset += itupsz;
+
+		return currItem->tupleOffset;
+	}
+
+	return 0;
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for current posting
+ * tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.  Caller passes its return value as tupleOffset.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer heapTid, int tupleOffset)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+
+	/*
+	 * Have index-only scans return the same base IndexTuple for every TID
+	 * that originates from the same posting list
+	 */
+	if (so->currTuples)
+		currItem->tupleOffset = tupleOffset;
+}
+
 /*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f163491d60..a40c4dd060 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -243,6 +243,7 @@ typedef struct BTPageState
 	BlockNumber btps_blkno;		/* block # to write this page at */
 	IndexTuple	btps_lowkey;	/* page's strict lower bound pivot tuple */
 	OffsetNumber btps_lastoff;	/* last item offset loaded */
+	Size		btps_lastextra; /* last item's extra posting list space */
 	uint32		btps_level;		/* tree level (0 = leaf) */
 	Size		btps_full;		/* "full" if less than this much free space */
 	struct BTPageState *btps_next;	/* link to parent level, if any */
@@ -277,7 +278,10 @@ static void _bt_slideleft(Page page);
 static void _bt_sortaddtup(Page page, Size itemsize,
 						   IndexTuple itup, OffsetNumber itup_off);
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
-						 IndexTuple itup);
+						 IndexTuple itup, Size truncextra);
+static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
+										  BTPageState *state,
+										  BTDedupState dstate);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
@@ -711,6 +715,7 @@ _bt_pagestate(BTWriteState *wstate, uint32 level)
 	state->btps_lowkey = NULL;
 	/* initialize lastoff so first item goes into P_FIRSTKEY */
 	state->btps_lastoff = P_HIKEY;
+	state->btps_lastextra = 0;
 	state->btps_level = level;
 	/* set "full" threshold based on level.  See notes at head of file. */
 	if (level > 0)
@@ -789,7 +794,8 @@ _bt_sortaddtup(Page page,
 }
 
 /*----------
- * Add an item to a disk page from the sort output.
+ * Add an item to a disk page from the sort output (or add a posting list
+ * item formed from the sort output).
  *
  * We must be careful to observe the page layout conventions of nbtsearch.c:
  * - rightmost pages start data items at P_HIKEY instead of at P_FIRSTKEY.
@@ -821,14 +827,27 @@ _bt_sortaddtup(Page page,
  * the truncated high key at offset 1.
  *
  * 'last' pointer indicates the last offset added to the page.
+ *
+ * 'truncextra' is the size of the posting list in itup, if any.  This
+ * information is stashed for the next call here, when we may benefit
+ * from considering the impact of truncating away the posting list on
+ * the page before deciding to finish the page off.  Posting lists are
+ * often relatively large, so it is worth going to the trouble of
+ * accounting for the saving from truncating away the posting list of
+ * the tuple that becomes the high key (that may be the only way to
+ * get close to target free space on the page).  Note that this is
+ * only used for the soft fillfactor-wise limit, not the critical hard
+ * limit.
  *----------
  */
 static void
-_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
+_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
+			 Size truncextra)
 {
 	Page		npage;
 	BlockNumber nblkno;
 	OffsetNumber last_off;
+	Size		last_truncextra;
 	Size		pgspc;
 	Size		itupsz;
 	bool		isleaf;
@@ -842,6 +861,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	npage = state->btps_page;
 	nblkno = state->btps_blkno;
 	last_off = state->btps_lastoff;
+	last_truncextra = state->btps_lastextra;
+	state->btps_lastextra = truncextra;
 
 	pgspc = PageGetFreeSpace(npage);
 	itupsz = IndexTupleSize(itup);
@@ -883,10 +904,10 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	 * page.  Disregard fillfactor and insert on "full" current page if we
 	 * don't have the minimum number of items yet.  (Note that we deliberately
 	 * assume that suffix truncation neither enlarges nor shrinks new high key
-	 * when applying soft limit.)
+	 * when applying soft limit, except when last tuple has a posting list.)
 	 */
 	if (pgspc < itupsz + (isleaf ? MAXALIGN(sizeof(ItemPointerData)) : 0) ||
-		(pgspc < state->btps_full && last_off > P_FIRSTKEY))
+		(pgspc + last_truncextra < state->btps_full && last_off > P_FIRSTKEY))
 	{
 		/*
 		 * Finish off the page and write it out.
@@ -944,11 +965,14 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * We don't try to bias our choice of split point to make it more
 			 * likely that _bt_truncate() can truncate away more attributes,
 			 * whereas the split point used within _bt_split() is chosen much
-			 * more delicately.  Suffix truncation is mostly useful because it
-			 * improves space utilization for workloads with random
-			 * insertions.  It doesn't seem worthwhile to add logic for
-			 * choosing a split point here for a benefit that is bound to be
-			 * much smaller.
+			 * more delicately.  Even still, the lastleft and firstright
+			 * tuples passed to _bt_truncate() here are at least not fully
+			 * equal to each other when deduplication is used, unless there is
+			 * a large group of duplicates (also, unique index builds usually
+			 * have few or no spool2 duplicates).  When the split point is
+			 * between two unequal tuples, _bt_truncate() will avoid including
+			 * a heap TID in the new high key, which is the most important
+			 * benefit of suffix truncation.
 			 *
 			 * Overwrite the old item with new truncated high key directly.
 			 * oitup is already located at the physical beginning of tuple
@@ -983,7 +1007,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(BTreeTupleGetNAtts(state->btps_lowkey, wstate->index) == 0 ||
 			   !P_LEFTMOST((BTPageOpaque) PageGetSpecialPointer(opage)));
 		BTreeTupleSetDownLink(state->btps_lowkey, oblkno);
-		_bt_buildadd(wstate, state->btps_next, state->btps_lowkey);
+		_bt_buildadd(wstate, state->btps_next, state->btps_lowkey, 0);
 		pfree(state->btps_lowkey);
 
 		/*
@@ -1045,6 +1069,43 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	state->btps_lastoff = last_off;
 }
 
+/*
+ * Finalize pending posting list tuple, and add it to the index.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * This is almost like _bt_dedup_finish_pending(), but it adds a new tuple
+ * using _bt_buildadd().
+ */
+static void
+_bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
+							  BTDedupState dstate)
+{
+	Assert(dstate->nitems > 0);
+
+	if (dstate->nitems == 1)
+		_bt_buildadd(wstate, state, dstate->base, 0);
+	else
+	{
+		IndexTuple	postingtuple;
+		Size		truncextra;
+
+		/* form a tuple with a posting list */
+		postingtuple = _bt_form_posting(dstate->base,
+										dstate->htids,
+										dstate->nhtids);
+		/* Calculate posting list overhead */
+		truncextra = IndexTupleSize(postingtuple) -
+			BTreeTupleGetPostingOffset(postingtuple);
+
+		_bt_buildadd(wstate, state, postingtuple, truncextra);
+		pfree(postingtuple);
+	}
+
+	dstate->nhtids = 0;
+	dstate->nitems = 0;
+	dstate->phystupsize = 0;
+}
+
 /*
  * Finish writing out the completed btree.
  */
@@ -1090,7 +1151,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 			Assert(BTreeTupleGetNAtts(s->btps_lowkey, wstate->index) == 0 ||
 				   !P_LEFTMOST(opaque));
 			BTreeTupleSetDownLink(s->btps_lowkey, blkno);
-			_bt_buildadd(wstate, s->btps_next, s->btps_lowkey);
+			_bt_buildadd(wstate, s->btps_next, s->btps_lowkey, 0);
 			pfree(s->btps_lowkey);
 			s->btps_lowkey = NULL;
 		}
@@ -1111,7 +1172,8 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 	 * by filling in a valid magic number in the metapage.
 	 */
 	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, rootblkno, rootlevel);
+	_bt_initmetapage(metapage, rootblkno, rootlevel,
+					 wstate->inskey->safededup);
 	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
 }
 
@@ -1132,6 +1194,9 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		deduplicate;
+
+	deduplicate = wstate->inskey->safededup && BTGetUseDedup(wstate->index);
 
 	if (merge)
 	{
@@ -1228,12 +1293,12 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 
 			if (load1)
 			{
-				_bt_buildadd(wstate, state, itup);
+				_bt_buildadd(wstate, state, itup, 0);
 				itup = tuplesort_getindextuple(btspool->sortstate, true);
 			}
 			else
 			{
-				_bt_buildadd(wstate, state, itup2);
+				_bt_buildadd(wstate, state, itup2, 0);
 				itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
 			}
 
@@ -1243,9 +1308,100 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		}
 		pfree(sortKeys);
 	}
+	else if (deduplicate)
+	{
+		/* merge is unnecessary, deduplicate into posting lists */
+		BTDedupState dstate;
+
+		dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
+		dstate->deduplicate = true; /* unused */
+		dstate->maxpostingsize = 0; /* set later */
+		/* Metadata about base tuple of current pending posting list */
+		dstate->base = NULL;
+		dstate->baseoff = InvalidOffsetNumber;	/* unused */
+		dstate->basetupsize = 0;
+		/* Metadata about current pending posting list TIDs */
+		dstate->htids = NULL;
+		dstate->nhtids = 0;
+		dstate->nitems = 0;
+		dstate->phystupsize = 0;	/* unused */
+		dstate->nintervals = 0; /* unused */
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate,
+											   true)) != NULL)
+		{
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+			{
+				state = _bt_pagestate(wstate, 0);
+
+				/*
+				 * Limit size of posting list tuples to 1/10 space we want to
+				 * leave behind on the page, plus space for final item's line
+				 * pointer.  This is equal to the space that we'd like to
+				 * leave behind on each leaf page when fillfactor is 90,
+				 * allowing us to get close to fillfactor% space utilization
+				 * when there happen to be a great many duplicates.  (This
+				 * makes higher leaf fillfactor settings ineffective when
+				 * building indexes that have many duplicates, but packing
+				 * leaf pages full with few very large tuples doesn't seem
+				 * like a useful goal.)
+				 */
+				dstate->maxpostingsize = MAXALIGN_DOWN((BLCKSZ * 10 / 100)) -
+					sizeof(ItemIdData);
+				Assert(dstate->maxpostingsize <= BTMaxItemSize(state->btps_page) &&
+					   dstate->maxpostingsize <= INDEX_SIZE_MASK);
+				dstate->htids = palloc(dstate->maxpostingsize);
+
+				/* start new pending posting list with itup copy */
+				_bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
+										InvalidOffsetNumber);
+			}
+			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
+										 itup) > keysz &&
+					 _bt_dedup_save_htid(dstate, itup))
+			{
+				/*
+				 * Tuple is equal to base tuple of pending posting list.  Heap
+				 * TID from itup has been saved in state.
+				 */
+			}
+			else
+			{
+				/*
+				 * Tuple is not equal to pending posting list tuple, or
+				 * _bt_dedup_save_htid() opted to not merge current item into
+				 * pending posting list.
+				 */
+				_bt_sort_dedup_finish_pending(wstate, state, dstate);
+				pfree(dstate->base);
+
+				/* start new pending posting list with itup copy */
+				_bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
+										InvalidOffsetNumber);
+			}
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+
+		if (state)
+		{
+			/*
+			 * Handle the last item (there must be a last item when the
+			 * tuplesort returned one or more tuples)
+			 */
+			_bt_sort_dedup_finish_pending(wstate, state, dstate);
+			pfree(dstate->base);
+			pfree(dstate->htids);
+		}
+
+		pfree(dstate);
+	}
 	else
 	{
-		/* merge is unnecessary */
+		/* merging and deduplication are both unnecessary */
 		while ((itup = tuplesort_getindextuple(btspool->sortstate,
 											   true)) != NULL)
 		{
@@ -1253,7 +1409,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			if (state == NULL)
 				state = _bt_pagestate(wstate, 0);
 
-			_bt_buildadd(wstate, state, itup);
+			_bt_buildadd(wstate, state, itup, 0);
 
 			/* Report progress */
 			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 76c2d945c8..8ba055be9e 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
 	state.minfirstrightsz = SIZE_MAX;
 	state.newitemoff = newitemoff;
 
+	/* newitem cannot be a posting list item */
+	Assert(!BTreeTupleIsPosting(newitem));
+
 	/*
 	 * maxsplits should never exceed maxoff because there will be at most as
 	 * many candidate split points as there are points _between_ tuples, once
@@ -459,6 +462,7 @@ _bt_recsplitloc(FindSplitData *state,
 	int16		leftfree,
 				rightfree;
 	Size		firstrightitemsz;
+	Size		postingsz = 0;
 	bool		newitemisfirstonright;
 
 	/* Is the new item going to be the first item on the right page? */
@@ -468,8 +472,30 @@ _bt_recsplitloc(FindSplitData *state,
 	if (newitemisfirstonright)
 		firstrightitemsz = state->newitemsz;
 	else
+	{
 		firstrightitemsz = firstoldonrightsz;
 
+		/*
+		 * Calculate suffix truncation space saving when firstright is a
+		 * posting list tuple, though only when the firstright is over 64
+		 * bytes including line pointer overhead (arbitrary).  This avoids
+		 * accessing the tuple in cases where its posting list must be very
+		 * small (if firstright has one at all).
+		 */
+		if (state->is_leaf && firstrightitemsz > 64)
+		{
+			ItemId		itemid;
+			IndexTuple	newhighkey;
+
+			itemid = PageGetItemId(state->page, firstoldonright);
+			newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+			if (BTreeTupleIsPosting(newhighkey))
+				postingsz = IndexTupleSize(newhighkey) -
+					BTreeTupleGetPostingOffset(newhighkey);
+		}
+	}
+
 	/* Account for all the old tuples */
 	leftfree = state->leftspace - olddataitemstoleft;
 	rightfree = state->rightspace -
@@ -491,11 +517,17 @@ _bt_recsplitloc(FindSplitData *state,
 	 * If we are on the leaf level, assume that suffix truncation cannot avoid
 	 * adding a heap TID to the left half's new high key when splitting at the
 	 * leaf level.  In practice the new high key will often be smaller and
-	 * will rarely be larger, but conservatively assume the worst case.
+	 * will rarely be larger, but conservatively assume the worst case.  We do
+	 * go to the trouble of subtracting away posting list overhead, though
+	 * only when it looks like it will make an appreciable difference.
+	 * (Posting lists are the only case where truncation will typically make
+	 * the final high key far smaller than firstright, so being a bit more
+	 * precise there noticeably improves the balance of free space.)
 	 */
 	if (state->is_leaf)
 		leftfree -= (int16) (firstrightitemsz +
-							 MAXALIGN(sizeof(ItemPointerData)));
+							 MAXALIGN(sizeof(ItemPointerData)) -
+							 postingsz);
 	else
 		leftfree -= (int16) firstrightitemsz;
 
@@ -691,7 +723,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
 	tup = (IndexTuple) PageGetItem(state->page, itemid);
 	/* Do cheaper test first */
-	if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+	if (BTreeTupleIsPosting(tup) ||
+		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
 	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 5ab4e712f1..5ed09640ad 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -20,6 +20,7 @@
 #include "access/nbtree.h"
 #include "access/reloptions.h"
 #include "access/relscan.h"
+#include "catalog/catalog.h"
 #include "commands/progress.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -107,7 +108,13 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	 */
 	key = palloc(offsetof(BTScanInsertData, scankeys) +
 				 sizeof(ScanKeyData) * indnkeyatts);
-	key->heapkeyspace = itup == NULL || _bt_heapkeyspace(rel);
+	if (itup)
+		_bt_metaversion(rel, &key->heapkeyspace, &key->safededup);
+	else
+	{
+		key->heapkeyspace = true;
+		key->safededup = _bt_opclasses_support_dedup(rel);
+	}
 	key->anynullkeys = false;	/* initial assumption */
 	key->nextkey = false;
 	key->pivotsearch = false;
@@ -1373,6 +1380,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			continue;
 		}
 
@@ -1534,6 +1542,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			cmpresult = 0;
 			if (subkey->sk_flags & SK_ROW_END)
 				break;
@@ -1773,10 +1782,65 @@ _bt_killitems(IndexScanDesc scan)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
+			bool		killtuple = false;
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			if (BTreeTupleIsPosting(ituple))
 			{
-				/* found the item */
+				int			pi = i + 1;
+				int			nposting = BTreeTupleGetNPosting(ituple);
+				int			j;
+
+				/*
+				 * Note that we rely on the assumption that heap TIDs in the
+				 * scanpos items array are always in ascending heap TID order
+				 * within a posting list
+				 */
+				for (j = 0; j < nposting; j++)
+				{
+					ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+					if (!ItemPointerEquals(item, &kitem->heapTid))
+						break;	/* out of posting list loop */
+
+					/* kitem must have matching offnum when heap TIDs match */
+					Assert(kitem->indexOffset == offnum);
+
+					/*
+					 * Read-ahead to later kitems here.
+					 *
+					 * We rely on the assumption that not advancing kitem here
+					 * will prevent us from considering the posting list tuple
+					 * fully dead by not matching its next heap TID in next
+					 * loop iteration.
+					 *
+					 * If, on the other hand, this is the final heap TID in
+					 * the posting list tuple, then tuple gets killed
+					 * regardless (i.e. we handle the case where the last
+					 * kitem is also the last heap TID in the last index tuple
+					 * correctly -- posting tuple still gets killed).
+					 */
+					if (pi < numKilled)
+						kitem = &so->currPos.items[so->killedItems[pi++]];
+				}
+
+				/*
+				 * Don't bother advancing the outermost loop's int iterator to
+				 * avoid processing killed items that relate to the same
+				 * offnum/posting list tuple.  This micro-optimization hardly
+				 * seems worth it.  (Further iterations of the outermost loop
+				 * will fail to match on this same posting list's first heap
+				 * TID instead, so we'll advance to the next offnum/index
+				 * tuple pretty quickly.)
+				 */
+				if (j == nposting)
+					killtuple = true;
+			}
+			else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+				killtuple = true;
+
+			if (killtuple)
+			{
+				/* found the item/all posting list items */
 				ItemIdMarkDead(iid);
 				killedsomething = true;
 				break;			/* out of inner search loop */
@@ -2017,7 +2081,9 @@ btoptions(Datum reloptions, bool validate)
 	static const relopt_parse_elt tab[] = {
 		{"fillfactor", RELOPT_TYPE_INT, offsetof(BTOptions, fillfactor)},
 		{"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
-		offsetof(BTOptions, vacuum_cleanup_index_scale_factor)}
+		offsetof(BTOptions, vacuum_cleanup_index_scale_factor)},
+		{"deduplicate_items", RELOPT_TYPE_BOOL,
+		offsetof(BTOptions, deduplicate_items)}
 
 	};
 
@@ -2118,11 +2184,10 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	Size		newsize;
 
 	/*
-	 * We should only ever truncate leaf index tuples.  It's never okay to
-	 * truncate a second time.
+	 * We should only ever truncate non-pivot tuples from leaf pages.  It's
+	 * never okay to truncate when splitting an internal page.
 	 */
-	Assert(BTreeTupleGetNAtts(lastleft, rel) == natts);
-	Assert(BTreeTupleGetNAtts(firstright, rel) == natts);
+	Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
 
 	/* Determine how many attributes must be kept in truncated tuple */
 	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
@@ -2138,6 +2203,19 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(pivot))
+		{
+			/*
+			 * index_truncate_tuple() just returns a straight copy of
+			 * firstright when it has no key attributes to truncate.  We need
+			 * to truncate away the posting list ourselves.
+			 */
+			Assert(keepnatts == nkeyatts);
+			Assert(natts == nkeyatts);
+			pivot->t_info &= ~INDEX_SIZE_MASK;
+			pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+		}
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2154,6 +2232,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * attribute to the new pivot tuple.
 		 */
 		Assert(natts != nkeyatts);
+		Assert(!BTreeTupleIsPosting(lastleft) &&
+			   !BTreeTupleIsPosting(firstright));
 		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
 		tidpivot = palloc0(newsize);
 		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2171,6 +2251,19 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
 		pivot = palloc0(newsize);
 		memcpy(pivot, firstright, IndexTupleSize(firstright));
+
+		if (BTreeTupleIsPosting(firstright))
+		{
+			/*
+			 * New pivot tuple was copied from firstright, which happens to be
+			 * a posting list tuple.  We will have to include the max lastleft
+			 * heap TID in the final pivot tuple, but we can remove the
+			 * posting list now. (Pivot tuples should never contain a posting
+			 * list.)
+			 */
+			newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+				MAXALIGN(sizeof(ItemPointerData));
+		}
 	}
 
 	/*
@@ -2198,7 +2291,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2209,9 +2302,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
+							  BTreeTupleGetHeapTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #else
 
 	/*
@@ -2224,7 +2320,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2233,7 +2329,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2314,13 +2411,16 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
  * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal after detoasting.
+ * unless they're bitwise equal after detoasting.  When an index is considered
+ * deduplication-safe by _bt_opclasses_support_dedup, routine is guaranteed to
+ * give the same result as _bt_keep_natts would.
  *
- * These issues must be acceptable to callers, typically because they're only
- * concerned about making suffix truncation as effective as possible without
- * leaving excessive amounts of free space on either side of page split.
  * Callers can rely on the fact that attributes considered equal here are
- * definitely also equal according to _bt_keep_natts.
+ * definitely also equal according to _bt_keep_natts, even when the index uses
+ * an opclass or collation that is not deduplication-safe.  This weaker
+ * guarantee is good enough for nbtsplitloc.c caller, since false negatives
+ * generally only have the effect of making leaf page splits use a more
+ * balanced split point.
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2392,28 +2492,42 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * Mask allocated for number of keys in index tuple must be able to fit
 	 * maximum possible number of index attributes
 	 */
-	StaticAssertStmt(BT_N_KEYS_OFFSET_MASK >= INDEX_MAX_KEYS,
-					 "BT_N_KEYS_OFFSET_MASK can't fit INDEX_MAX_KEYS");
+	StaticAssertStmt(BT_OFFSET_MASK >= INDEX_MAX_KEYS,
+					 "BT_OFFSET_MASK can't fit INDEX_MAX_KEYS");
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
+	/* !heapkeyspace indexes do not support deduplication */
+	if (!heapkeyspace && BTreeTupleIsPosting(itup))
+		return false;
+
+	/* Posting list tuples should never have "pivot heap TID" bit set */
+	if (BTreeTupleIsPosting(itup) &&
+		(ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+		 BT_PIVOT_HEAP_TID_ATTR) != 0)
+		return false;
+
+	/* INCLUDE indexes do not support deduplication */
+	if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+		return false;
+
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
 			/*
-			 * Non-pivot tuples currently never use alternative heap TID
-			 * representation -- even those within heapkeyspace indexes
+			 * Non-pivot tuple should never be explicitly marked as a pivot
+			 * tuple
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated.  (Note that tupnatts must have been
-			 * inferred, rather than coming from an explicit on-disk
-			 * representation.)
+			 * inferred, even with a posting list tuple, because only pivot
+			 * tuples store tupnatts directly.)
 			 */
 			return tupnatts == natts;
 		}
@@ -2457,12 +2571,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * non-zero, or when there is no explicit representation and the
 			 * tuple is evidently not a pre-pg_upgrade tuple.
 			 *
-			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
-			 * that to decide if the tuple is a pre-v11 tuple.
+			 * Prior to v11, downlinks always had P_HIKEY as their offset.
+			 * Accept that as an alternative indication of a valid
+			 * !heapkeyspace negative infinity tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
-				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+				ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
 		}
 		else
 		{
@@ -2488,7 +2602,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
+		return false;
+
+	/* Pivot tuple should not use posting list representation (redundant) */
+	if (BTreeTupleIsPosting(itup))
 		return false;
 
 	/*
@@ -2558,11 +2676,53 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					BTMaxItemSizeNoHeapTid(page),
 					RelationGetRelationName(rel)),
 			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
-					   ItemPointerGetBlockNumber(&newtup->t_tid),
-					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+					   ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
 					   RelationGetRelationName(heap)),
 			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
 					 "Consider a function index of an MD5 hash of the value, "
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Is it safe to perform deduplication for an index, given the opclasses and
+ * collations used?
+ *
+ * Returned value is stored in index metapage during index builds.  Function
+ * does not account for incompatibilities caused by index being on an earlier
+ * nbtree version.
+ */
+bool
+_bt_opclasses_support_dedup(Relation index)
+{
+	/* INCLUDE indexes don't support deduplication */
+	if (IndexRelationGetNumberOfAttributes(index) !=
+		IndexRelationGetNumberOfKeyAttributes(index))
+		return false;
+
+	/*
+	 * There is no reason why deduplication cannot be used with system catalog
+	 * indexes.  However, we deem it generally unsafe because it's not clear
+	 * how it could be disabled.  (ALTER INDEX is not supported with system
+	 * catalog indexes, so users have no way to set the storage parameter.)
+	 */
+	if (IsCatalogRelation(index))
+		return false;
+
+	for (int i = 0; i < IndexRelationGetNumberOfKeyAttributes(index); i++)
+	{
+		Oid			opfamily = index->rd_opfamily[i];
+		Oid			collation = index->rd_indcollation[i];
+
+		/* TODO add adequate check of opclasses and collations */
+		elog(DEBUG4, "index %s column i %d opfamilyOid %u collationOid %u",
+			 RelationGetRelationName(index), i, opfamily, collation);
+
+		/* NUMERIC btree opfamily OID is 1988 */
+		if (opfamily == 1988)
+			return false;
+	}
+
+	return true;
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 2e5202c2d6..582c8fd95e 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -22,6 +22,9 @@
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/procarray.h"
+#include "utils/memutils.h"
+
+static MemoryContext opCtx;		/* working memory for operations */
 
 /*
  * _bt_restore_page -- re-enter all the index tuples on a page
@@ -111,6 +114,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
 	Assert(md->btm_version >= BTREE_NOVAC_VERSION);
 	md->btm_oldest_btpo_xact = xlrec->oldest_btpo_xact;
 	md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
+	md->btm_safededup = xlrec->safededup;
 
 	pageop = (BTPageOpaque) PageGetSpecialPointer(metapg);
 	pageop->btpo_flags = BTP_META;
@@ -156,7 +160,8 @@ _bt_clear_incomplete_split(XLogReaderState *record, uint8 block_id)
 }
 
 static void
-btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
+btree_xlog_insert(bool isleaf, bool ismeta, bool posting,
+				  XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	xl_btree_insert *xlrec = (xl_btree_insert *) XLogRecGetData(record);
@@ -181,9 +186,52 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 
 		page = BufferGetPage(buffer);
 
-		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
-						false, false) == InvalidOffsetNumber)
-			elog(PANIC, "btree_xlog_insert: failed to add item");
+		if (!posting)
+		{
+			/* Simple retail insertion */
+			if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+		}
+		else
+		{
+			ItemId		itemid;
+			IndexTuple	oposting,
+						newitem,
+						nposting;
+			uint16		postingoff;
+
+			/*
+			 * A posting list split occurred during leaf page insertion.  WAL
+			 * record data will start with an offset number representing the
+			 * point in an existing posting list that a split occurs at.
+			 *
+			 * Use _bt_swap_posting() to repeat posting list split steps from
+			 * primary.  Note that newitem from WAL record is 'orignewitem',
+			 * not the final version of newitem that is actually inserted on
+			 * page.
+			 */
+			postingoff = *((uint16 *) datapos);
+			datapos += sizeof(uint16);
+			datalen -= sizeof(uint16);
+
+			itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+			oposting = (IndexTuple) PageGetItem(page, itemid);
+
+			/* Use mutable, aligned newitem copy in _bt_swap_posting() */
+			Assert(isleaf && postingoff > 0);
+			newitem = CopyIndexTuple((IndexTuple) datapos);
+			nposting = _bt_swap_posting(newitem, oposting, postingoff);
+
+			/* Replace existing posting list with post-split version */
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+			/* Insert "final" new item (not orignewitem from WAL stream) */
+			Assert(IndexTupleSize(newitem) == datalen);
+			if (PageAddItem(page, (Item) newitem, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add posting split new item");
+		}
 
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
@@ -265,20 +313,38 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 		BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
 		OffsetNumber off;
 		IndexTuple	newitem = NULL,
-					left_hikey = NULL;
+					left_hikey = NULL,
+					nposting = NULL;
 		Size		newitemsz = 0,
 					left_hikeysz = 0;
 		Page		newlpage;
-		OffsetNumber leftoff;
+		OffsetNumber leftoff,
+					replacepostingoff = InvalidOffsetNumber;
 
 		datapos = XLogRecGetBlockData(record, 0, &datalen);
 
-		if (onleft)
+		if (onleft || xlrec->postingoff != 0)
 		{
 			newitem = (IndexTuple) datapos;
 			newitemsz = MAXALIGN(IndexTupleSize(newitem));
 			datapos += newitemsz;
 			datalen -= newitemsz;
+
+			if (xlrec->postingoff != 0)
+			{
+				ItemId		itemid;
+				IndexTuple	oposting;
+
+				/* Posting list must be at offset number before new item's */
+				replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+				/* Use mutable, aligned newitem copy in _bt_swap_posting() */
+				newitem = CopyIndexTuple(newitem);
+				itemid = PageGetItemId(lpage, replacepostingoff);
+				oposting = (IndexTuple) PageGetItem(lpage, itemid);
+				nposting = _bt_swap_posting(newitem, oposting,
+											xlrec->postingoff);
+			}
 		}
 
 		/*
@@ -308,8 +374,20 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 			Size		itemsz;
 			IndexTuple	item;
 
+			/* Add replacement posting list when required */
+			if (off == replacepostingoff)
+			{
+				Assert(onleft || xlrec->firstright == xlrec->newitemoff);
+				if (PageAddItem(newlpage, (Item) nposting,
+								MAXALIGN(IndexTupleSize(nposting)), leftoff,
+								false, false) == InvalidOffsetNumber)
+					elog(ERROR, "failed to add new posting list item to left page after split");
+				leftoff = OffsetNumberNext(leftoff);
+				continue;		/* don't insert oposting */
+			}
+
 			/* add the new item if it was inserted on left page */
-			if (onleft && off == xlrec->newitemoff)
+			else if (onleft && off == xlrec->newitemoff)
 			{
 				if (PageAddItem(newlpage, (Item) newitem, newitemsz, leftoff,
 								false, false) == InvalidOffsetNumber)
@@ -383,6 +461,90 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 	}
 }
 
+static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+	Buffer		buf;
+
+	if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+	{
+		Page		page = (Page) BufferGetPage(buf);
+		BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		OffsetNumber offnum,
+					minoff,
+					maxoff;
+		BTDedupState state;
+		BTDedupInterval *intervals;
+		Page		newpage;
+
+		state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+		state->deduplicate = true;	/* unused */
+		/* Conservatively use larger maxpostingsize than primary */
+		state->maxpostingsize = BTMaxItemSize(page);
+		state->base = NULL;
+		state->baseoff = InvalidOffsetNumber;
+		state->basetupsize = 0;
+		state->htids = palloc(state->maxpostingsize);
+		state->nhtids = 0;
+		state->nitems = 0;
+		state->phystupsize = 0;
+		state->nintervals = 0;
+
+		minoff = P_FIRSTDATAKEY(opaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+		newpage = PageGetTempPageCopySpecial(page);
+
+		if (!P_RIGHTMOST(opaque))
+		{
+			ItemId		itemid = PageGetItemId(page, P_HIKEY);
+			Size		itemsz = ItemIdGetLength(itemid);
+			IndexTuple	item = (IndexTuple) PageGetItem(page, itemid);
+
+			if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+							false, false) == InvalidOffsetNumber)
+				elog(ERROR, "failed to add highkey during deduplication");
+		}
+
+		intervals = (BTDedupInterval *) ((char *) xlrec + SizeOfBtreeDedup);
+		for (offnum = minoff;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemid = PageGetItemId(page, offnum);
+			IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+			if (offnum == minoff)
+				_bt_dedup_start_pending(state, itup, offnum);
+			else if (state->nintervals < xlrec->nintervals &&
+					 state->baseoff == intervals[state->nintervals].baseoff &&
+					 state->nitems < intervals[state->nintervals].nitems)
+			{
+				if (!_bt_dedup_save_htid(state, itup))
+					elog(ERROR, "could not add heap tid to pending posting list");
+			}
+			else
+			{
+				_bt_dedup_finish_pending(newpage, state);
+				_bt_dedup_start_pending(state, itup, offnum);
+			}
+		}
+
+		_bt_dedup_finish_pending(newpage, state);
+		Assert(state->nintervals == xlrec->nintervals);
+		Assert(memcmp(state->intervals, intervals,
+					  state->nintervals * sizeof(BTDedupInterval)) == 0);
+
+		PageRestoreTempPage(newpage, page);
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buf);
+	}
+
+	if (BufferIsValid(buf))
+		UnlockReleaseBuffer(buf);
+}
+
 static void
 btree_xlog_vacuum(XLogReaderState *record)
 {
@@ -405,7 +567,31 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		page = (Page) BufferGetPage(buffer);
 
-		PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
+		if (xlrec->nupdated > 0)
+		{
+			OffsetNumber *updatedoffsets;
+			IndexTuple	updated;
+			Size		itemsz;
+
+			updatedoffsets = (OffsetNumber *)
+				(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+			updated = (IndexTuple) ((char *) updatedoffsets +
+									xlrec->nupdated * sizeof(OffsetNumber));
+
+			for (int i = 0; i < xlrec->nupdated; i++)
+			{
+				itemsz = MAXALIGN(IndexTupleSize(updated));
+
+				if (!PageIndexTupleOverwrite(page, updatedoffsets[i],
+											 (Item) updated, itemsz))
+					elog(PANIC, "could not update partially dead item");
+
+				updated = (IndexTuple) ((char *) updated + itemsz);
+			}
+		}
+
+		if (xlrec->ndeleted > 0)
+			PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 
 		/*
 		 * Mark the page as not containing any LP_DEAD items --- see comments
@@ -724,17 +910,22 @@ void
 btree_redo(XLogReaderState *record)
 {
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	MemoryContext oldCtx;
 
+	oldCtx = MemoryContextSwitchTo(opCtx);
 	switch (info)
 	{
 		case XLOG_BTREE_INSERT_LEAF:
-			btree_xlog_insert(true, false, record);
+			btree_xlog_insert(true, false, false, record);
 			break;
 		case XLOG_BTREE_INSERT_UPPER:
-			btree_xlog_insert(false, false, record);
+			btree_xlog_insert(false, false, false, record);
 			break;
 		case XLOG_BTREE_INSERT_META:
-			btree_xlog_insert(false, true, record);
+			btree_xlog_insert(false, true, false, record);
+			break;
+		case XLOG_BTREE_INSERT_POST:
+			btree_xlog_insert(true, false, true, record);
 			break;
 		case XLOG_BTREE_SPLIT_L:
 			btree_xlog_split(true, record);
@@ -742,6 +933,9 @@ btree_redo(XLogReaderState *record)
 		case XLOG_BTREE_SPLIT_R:
 			btree_xlog_split(false, record);
 			break;
+		case XLOG_BTREE_DEDUP:
+			btree_xlog_dedup(record);
+			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
 			break;
@@ -767,6 +961,23 @@ btree_redo(XLogReaderState *record)
 		default:
 			elog(PANIC, "btree_redo: unknown op code %u", info);
 	}
+	MemoryContextSwitchTo(oldCtx);
+	MemoryContextReset(opCtx);
+}
+
+void
+btree_xlog_startup(void)
+{
+	opCtx = AllocSetContextCreate(CurrentMemoryContext,
+								  "Btree recovery temporary context",
+								  ALLOCSET_DEFAULT_SIZES);
+}
+
+void
+btree_xlog_cleanup(void)
+{
+	MemoryContextDelete(opCtx);
+	opCtx = NULL;
 }
 
 /*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 7d63a7124e..7bbe55c5cf 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -27,6 +27,7 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 		case XLOG_BTREE_INSERT_LEAF:
 		case XLOG_BTREE_INSERT_UPPER:
 		case XLOG_BTREE_INSERT_META:
+		case XLOG_BTREE_INSERT_POST:
 			{
 				xl_btree_insert *xlrec = (xl_btree_insert *) rec;
 
@@ -38,15 +39,24 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
-				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d",
-								 xlrec->level, xlrec->firstright, xlrec->newitemoff);
+				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, postingoff %d",
+								 xlrec->level, xlrec->firstright,
+								 xlrec->newitemoff, xlrec->postingoff);
+				break;
+			}
+		case XLOG_BTREE_DEDUP:
+			{
+				xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+				appendStringInfo(buf, "nintervals %u", xlrec->nintervals);
 				break;
 			}
 		case XLOG_BTREE_VACUUM:
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "ndeleted %u", xlrec->ndeleted);
+				appendStringInfo(buf, "ndeleted %u; nupdated %u",
+								 xlrec->ndeleted, xlrec->nupdated);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
@@ -130,6 +140,12 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
+		case XLOG_BTREE_DEDUP:
+			id = "DEDUP";
+			break;
+		case XLOG_BTREE_INSERT_POST:
+			id = "INSERT_POST";
+			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index f47176753d..32ff03b3e4 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1055,8 +1055,10 @@ PageIndexTupleDeleteNoCompact(Page page, OffsetNumber offnum)
  * This is better than deleting and reinserting the tuple, because it
  * avoids any data shifting when the tuple size doesn't change; and
  * even when it does, we avoid moving the line pointers around.
- * Conceivably this could also be of use to an index AM that cares about
- * the physical order of tuples as well as their ItemId order.
+ * This could be used by an index AM that doesn't want to unset the
+ * LP_DEAD bit when it happens to be set.  It could conceivably also be
+ * used by an index AM that cares about the physical order of tuples as
+ * well as their logical/ItemId order.
  *
  * If there's insufficient space for the new tuple, return false.  Other
  * errors represent data-corruption problems, so we just elog.
@@ -1142,7 +1144,8 @@ PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
 	}
 
 	/* Update the item's tuple length (other fields shouldn't change) */
-	ItemIdSetNormal(tupid, offset + size_diff, newsize);
+	tupid->lp_off = offset + size_diff;
+	tupid->lp_len = newsize;
 
 	/* Copy new tuple data onto page */
 	memcpy(PageGetItem(page, tupid), newtup, newsize);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index cacbe904db..71ef71de3b 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -28,6 +28,7 @@
 
 #include "access/commit_ts.h"
 #include "access/gin.h"
+#include "access/nbtree.h"
 #include "access/rmgr.h"
 #include "access/tableam.h"
 #include "access/transam.h"
@@ -1096,6 +1097,15 @@ static struct config_bool ConfigureNamesBool[] =
 		false,
 		check_bonjour, NULL, NULL
 	},
+	{
+		{"deduplicate_btree_items", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Enables B-tree index deduplication optimization."),
+			NULL
+		},
+		&deduplicate_btree_items,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"track_commit_timestamp", PGC_POSTMASTER, REPLICATION,
 			gettext_noop("Collects transaction commit time."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e1048c0047..b3a98345fa 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -652,6 +652,7 @@
 #vacuum_cleanup_index_scale_factor = 0.1	# fraction of total number of tuples
 						# before index cleanup, 0 always performs
 						# index cleanup
+#deduplicate_btree_items = on
 #bytea_output = 'hex'			# hex, escape
 #xmlbinary = 'base64'
 #xmloption = 'content'
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index dc03fbde13..b6b08d0ccb 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1731,14 +1731,14 @@ psql_completion(const char *text, int start, int end)
 	/* ALTER INDEX <foo> SET|RESET ( */
 	else if (Matches("ALTER", "INDEX", MatchAny, "RESET", "("))
 		COMPLETE_WITH("fillfactor",
-					  "vacuum_cleanup_index_scale_factor",	/* BTREE */
+					  "vacuum_cleanup_index_scale_factor", "deduplicate_items",	/* BTREE */
 					  "fastupdate", "gin_pending_list_limit",	/* GIN */
 					  "buffering",	/* GiST */
 					  "pages_per_range", "autosummarize"	/* BRIN */
 			);
 	else if (Matches("ALTER", "INDEX", MatchAny, "SET", "("))
 		COMPLETE_WITH("fillfactor =",
-					  "vacuum_cleanup_index_scale_factor =",	/* BTREE */
+					  "vacuum_cleanup_index_scale_factor =", "deduplicate_items =",	/* BTREE */
 					  "fastupdate =", "gin_pending_list_limit =",	/* GIN */
 					  "buffering =",	/* GiST */
 					  "pages_per_range =", "autosummarize ="	/* BRIN */
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 6a058ccdac..359b5c18dc 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -145,6 +145,7 @@ static void bt_tuple_present_callback(Relation index, ItemPointer tid,
 									  bool tupleIsAlive, void *checkstate);
 static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 									 IndexTuple itup);
+static inline IndexTuple bt_posting_plain_tuple(IndexTuple itup, int n);
 static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 											   OffsetNumber offset);
@@ -167,6 +168,7 @@ static ItemId PageGetItemIdCareful(BtreeCheckState *state, BlockNumber block,
 								   Page page, OffsetNumber offset);
 static inline ItemPointer BTreeTupleGetHeapTIDCareful(BtreeCheckState *state,
 													  IndexTuple itup, bool nonpivot);
+static inline ItemPointer BTreeTupleGetPointsToTID(IndexTuple itup);
 
 /*
  * bt_index_check(index regclass, heapallindexed boolean)
@@ -278,7 +280,8 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
 
 	if (btree_index_mainfork_expected(indrel))
 	{
-		bool	heapkeyspace;
+		bool		heapkeyspace,
+					safededup;
 
 		RelationOpenSmgr(indrel);
 		if (!smgrexists(indrel->rd_smgr, MAIN_FORKNUM))
@@ -288,7 +291,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
 							RelationGetRelationName(indrel))));
 
 		/* Check index, possibly against table it is an index on */
-		heapkeyspace = _bt_heapkeyspace(indrel);
+		_bt_metaversion(indrel, &heapkeyspace, &safededup);
 		bt_check_every_level(indrel, heaprel, heapkeyspace, parentcheck,
 							 heapallindexed, rootdescend);
 	}
@@ -419,12 +422,12 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		/*
 		 * Size Bloom filter based on estimated number of tuples in index,
 		 * while conservatively assuming that each block must contain at least
-		 * MaxIndexTuplesPerPage / 5 non-pivot tuples.  (Non-leaf pages cannot
-		 * contain non-pivot tuples.  That's okay because they generally make
-		 * up no more than about 1% of all pages in the index.)
+		 * MaxTIDsPerBTreePage / 3 "plain" tuples -- see
+		 * bt_posting_plain_tuple() for definition, and details of how posting
+		 * list tuples are handled.
 		 */
 		total_pages = RelationGetNumberOfBlocks(rel);
-		total_elems = Max(total_pages * (MaxIndexTuplesPerPage / 5),
+		total_elems = Max(total_pages * (MaxTIDsPerBTreePage / 3),
 						  (int64) state->rel->rd_rel->reltuples);
 		/* Random seed relies on backend srandom() call to avoid repetition */
 		seed = random();
@@ -924,6 +927,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -954,13 +958,15 @@ bt_target_page_check(BtreeCheckState *state)
 		if (!_bt_check_natts(state->rel, state->heapkeyspace, state->target,
 							 offset))
 		{
+			ItemPointer tid;
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetPointsToTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -994,18 +1000,20 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * TID, since the posting list itself is validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
+			ItemPointer tid = BTreeTupleGetPointsToTID(itup);
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
-			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+			htid = psprintf("(%u,%u)", ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1017,6 +1025,40 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is a posting list tuple, make sure posting list TIDs are
+		 * in order
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+				current = BTreeTupleGetPostingN(itup, i);
+
+				if (ItemPointerCompare(current, &last) <= 0)
+				{
+					char	   *itid = psprintf("(%u,%u)", state->targetblock, offset);
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s posting list offset=%d page lsn=%X/%X.",
+												itid, i,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1049,13 +1091,14 @@ bt_target_page_check(BtreeCheckState *state)
 		if (tupsize > (lowersizelimit ? BTMaxItemSize(state->target) :
 					   BTMaxItemSizeNoHeapTid(state->target)))
 		{
+			ItemPointer tid = BTreeTupleGetPointsToTID(itup);
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1074,12 +1117,32 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				/* Fingerprint all elements as distinct "plain" tuples */
+				for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					IndexTuple	logtuple;
+
+					logtuple = bt_posting_plain_tuple(itup, i);
+					norm = bt_normalize_tuple(state, logtuple);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != logtuple)
+						pfree(norm);
+					pfree(logtuple);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
@@ -1087,7 +1150,8 @@ bt_target_page_check(BtreeCheckState *state)
 		 *
 		 * If there is a high key (if this is not the rightmost page on its
 		 * entire level), check that high key actually is upper bound on all
-		 * page items.
+		 * page items.  If this is a posting list tuple, we'll need to set
+		 * scantid to be highest TID in posting list.
 		 *
 		 * We prefer to check all items against high key rather than checking
 		 * just the last and trusting that the operator class obeys the
@@ -1127,17 +1191,22 @@ bt_target_page_check(BtreeCheckState *state)
 		 * tuple. (See also: "Notes About Data Representation" in the nbtree
 		 * README.)
 		 */
+		scantid = skey->scantid;
+		if (state->heapkeyspace && BTreeTupleIsPosting(itup))
+			skey->scantid = BTreeTupleGetMaxHeapTID(itup);
+
 		if (!P_RIGHTMOST(topaque) &&
 			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
 			  invariant_l_offset(state, skey, P_HIKEY)))
 		{
+			ItemPointer tid = BTreeTupleGetPointsToTID(itup);
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1150,6 +1219,8 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
+		/* Reset, in case scantid was set to (itup) posting tuple's max TID */
+		skey->scantid = scantid;
 
 		/*
 		 * * Item order check *
@@ -1160,15 +1231,17 @@ bt_target_page_check(BtreeCheckState *state)
 		if (OffsetNumberNext(offset) <= max &&
 			!invariant_l_offset(state, skey, OffsetNumberNext(offset)))
 		{
+			ItemPointer tid;
 			char	   *itid,
 					   *htid,
 					   *nitid,
 					   *nhtid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetPointsToTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 			nitid = psprintf("(%u,%u)", state->targetblock,
 							 OffsetNumberNext(offset));
 
@@ -1177,9 +1250,10 @@ bt_target_page_check(BtreeCheckState *state)
 										  state->target,
 										  OffsetNumberNext(offset));
 			itup = (IndexTuple) PageGetItem(state->target, itemid);
+			tid = BTreeTupleGetPointsToTID(itup);
 			nhtid = psprintf("(%u,%u)",
-							 ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							 ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							 ItemPointerGetBlockNumberNoCheck(tid),
+							 ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1953,10 +2027,9 @@ bt_tuple_present_callback(Relation index, ItemPointer tid, Datum *values,
  * verification.  In particular, it won't try to normalize opclass-equal
  * datums with potentially distinct representations (e.g., btree/numeric_ops
  * index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though.  For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation.
  */
 static IndexTuple
 bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -1969,6 +2042,9 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	IndexTuple	reformed;
 	int			i;
 
+	/* Caller should only pass "logical" non-pivot tuples here */
+	Assert(!BTreeTupleIsPosting(itup) && !BTreeTupleIsPivot(itup));
+
 	/* Easy case: It's immediately clear that tuple has no varlena datums */
 	if (!IndexTupleHasVarwidths(itup))
 		return itup;
@@ -2031,6 +2107,29 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	return reformed;
 }
 
+/*
+ * Produce palloc()'d "plain" tuple for nth posting list entry/TID.
+ *
+ * In general, deduplication is not supposed to change the logical contents of
+ * an index.  Multiple index tuples are merged together into one equivalent
+ * posting list index tuple when convenient.
+ *
+ * heapallindexed verification must normalize-away this variation in
+ * representation by converting posting list tuples into two or more "plain"
+ * tuples.  Each tuple must be fingerprinted separately -- there must be one
+ * tuple for each corresponding Bloom filter probe during the heap scan.
+ *
+ * Note: Caller still needs to call bt_normalize_tuple() with returned tuple.
+ */
+static inline IndexTuple
+bt_posting_plain_tuple(IndexTuple itup, int n)
+{
+	Assert(BTreeTupleIsPosting(itup));
+
+	/* Returns non-posting-list tuple */
+	return _bt_form_posting(itup, BTreeTupleGetPostingN(itup, n), 1);
+}
+
 /*
  * Search for itup in index, starting from fast root page.  itup must be a
  * non-pivot tuple.  This is only supported with heapkeyspace indexes, since
@@ -2087,6 +2186,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		insertstate.itup = itup;
 		insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 		insertstate.itup_key = key;
+		insertstate.postingoff = 0;
 		insertstate.bounds_valid = false;
 		insertstate.buf = lbuf;
 
@@ -2094,7 +2194,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		offnum = _bt_binsrch_insert(state->rel, &insertstate);
 		/* Compare first >= matching item on leaf page, if any */
 		page = BufferGetPage(lbuf);
+		/* Should match on first heap TID when tuple has a posting list */
 		if (offnum <= PageGetMaxOffsetNumber(page) &&
+			insertstate.postingoff <= 0 &&
 			_bt_compare(state->rel, key, page, offnum) == 0)
 			exists = true;
 		_bt_relbuf(state->rel, lbuf);
@@ -2548,26 +2650,69 @@ PageGetItemIdCareful(BtreeCheckState *state, BlockNumber block, Page page,
 }
 
 /*
- * BTreeTupleGetHeapTID() wrapper that lets caller enforce that a heap TID must
- * be present in cases where that is mandatory.
- *
- * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
- * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
- * It may become more useful in the future, when non-pivot tuples support their
- * own alternative INDEX_ALT_TID_MASK representation.
+ * BTreeTupleGetHeapTID() wrapper that enforces that a heap TID is present in
+ * cases where that is mandatory (i.e. for non-pivot tuples)
  */
 static inline ItemPointer
 BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
 							bool nonpivot)
 {
-	ItemPointer result = BTreeTupleGetHeapTID(itup);
-	BlockNumber targetblock = state->targetblock;
+	ItemPointer htid;
 
-	if (result == NULL && nonpivot)
+	/*
+	 * Caller determines whether this is supposed to be a pivot or non-pivot
+	 * tuple using page type and item offset number.  Verify that tuple
+	 * metadata agrees with this.
+	 */
+	Assert(state->heapkeyspace);
+	if (BTreeTupleIsPivot(itup) && nonpivot)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg_internal("block %u or its right sibling block or child block in index \"%s\" has unexpected pivot tuple",
+								 state->targetblock,
+								 RelationGetRelationName(state->rel))));
+
+	if (!BTreeTupleIsPivot(itup) && !nonpivot)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg_internal("block %u or its right sibling block or child block in index \"%s\" has unexpected non-pivot tuple",
+								 state->targetblock,
+								 RelationGetRelationName(state->rel))));
+
+	htid = BTreeTupleGetHeapTID(itup);
+	if (!ItemPointerIsValid(htid) && nonpivot)
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
-						targetblock, RelationGetRelationName(state->rel))));
+						state->targetblock,
+						RelationGetRelationName(state->rel))));
 
-	return result;
+	return htid;
+}
+
+/*
+ * Return the "pointed to" TID for itup, which is used to generate a
+ * descriptive error message.  itup must be a "data item" tuple (it wouldn't
+ * make much sense to call here with a high key tuple, since there won't be a
+ * valid downlink/block number to display).
+ *
+ * Returns either a heap TID (which will be the first heap TID in posting list
+ * if itup is posting list tuple), or a TID that contains downlink block
+ * number, plus some encoded metadata (e.g., the number of attributes present
+ * in itup).
+ */
+static inline ItemPointer
+BTreeTupleGetPointsToTID(IndexTuple itup)
+{
+	/*
+	 * Rely on the assumption that !heapkeyspace internal page data items will
+	 * correctly return TID with downlink here -- BTreeTupleGetHeapTID() won't
+	 * recognize it as a pivot tuple, but everything still works out because
+	 * the t_tid field is still returned
+	 */
+	if (!BTreeTupleIsPivot(itup))
+		return BTreeTupleGetHeapTID(itup);
+
+	/* Pivot tuple returns TID with downlink block (heapkeyspace variant) */
+	return &itup->t_tid;
 }
diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
index 5881ea5dd6..da7007135d 100644
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@@ -433,11 +433,127 @@ returns bool
 <sect1 id="btree-implementation">
  <title>Implementation</title>
 
+ <para>
+  Internally, a B-tree index consists of a tree structure with leaf
+  pages.  Each leaf page contains tuples that point to table entries
+  using a heap item pointer.  Each tuple's key is considered unique
+  internally, since the item pointer is treated as part of the key.
+ </para>
+ <para>
+  An introduction to the btree index implementation can be found in
+  <filename>src/backend/access/nbtree/README</filename>.
+ </para>
+
+ <sect2 id="btree-deduplication">
+  <title>Deduplication</title>
   <para>
-   An introduction to the btree index implementation can be found in
-   <filename>src/backend/access/nbtree/README</filename>.
+   B-Tree supports <firstterm>deduplication</firstterm>.  Existing
+   leaf page tuples with fully equal keys (equal prior to the heap
+   item pointer) are merged together into a single <quote>posting
+   list</quote> tuple.  The keys appear only once in this
+   representation.  A simple array of heap item pointers follows.
+   Posting lists are formed <quote>lazily</quote>, when a new item is
+   inserted that cannot fit on an existing leaf page.  The immediate
+   goal of the deduplication process is to at least free enough space
+   to fit the new item; otherwise a leaf page split occurs, which
+   allocates a new leaf page.  The <firstterm>key space</firstterm>
+   covered by the original leaf page is shared among the original page,
+   and its new right sibling page.
+  </para>
+  <para>
+   A duplicate is a row where <emphasis>all</emphasis> indexed key
+   columns are equal to the corresponding column values from some
+   other row.
+  </para>
+  <para>
+   Deduplication can greatly increase index space efficiency with data
+   sets where each distinct key appears at least a few times on
+   average.  It can also reduce the cost of subsequent index scans,
+   especially when many leaf pages must be accessed.  For example, an
+   index on a simple <type>integer</type> column that uses
+   deduplication will have a storage size that is only about 65% of an
+   equivalent unoptimized index when each distinct
+   <type>integer</type> value appears three times.  If each distinct
+   <type>integer</type> value appears six times, the storage overhead
+   can be as low as 50% of baseline.  With hundreds of duplicates per
+   distinct value (or with larger <quote>base</quote> key values), a
+   storage size of about one third of the unoptimized case is
+   expected.  There is usually a direct benefit for queries, as well
+   as an indirect benefit due to reduced I/O during routine vacuuming.
+  </para>
+  <para>
+   Cases that don't benefit due to having no duplicate values will
+   incur a small performance penalty with mixed read-write workloads.
+   There is no performance penalty with read-only workloads, since
+   reading from posting lists is at least as efficient as reading the
+   standard index tuple representation.
+  </para>
+ </sect2>
+
+ <sect2 id="btree-deduplication-configure">
+  <title>Configuring Deduplication</title>
+
+  <para>
+   The <xref linkend="guc-btree-deduplicate-items"/> configuration
+   parameter controls deduplication.  By default, deduplication is
+   enabled.  The <literal>deduplicate_items</literal> storage
+   parameter can be used to override the configuration paramater for
+   individual indexes.  See <xref
+   linkend="sql-createindex-storage-parameters"/> from the
+   <command>CREATE INDEX</command> documentation for details.
+  </para>
+ </sect2>
+
+ <sect2 id="btree-deduplication-restrictions">
+  <title>Restrictions</title>
+
+  <para>
+   Deduplication can only be used with a B-Tree index when
+   <emphasis>all</emphasis> indexed columns use a deduplication-safe
+   operator class that explicitly indicates that deduplication is safe
+   at <command>CREATE INDEX</command> time.  In practice almost all
+   datatypes support deduplication.  <type>numeric</type> is a notable
+   exception (<quote>display scale</quote> makes it impossible to
+   enable deduplication without losing useful information about equal
+   <type>numeric</type> datums).  Some operator classes support
+   deduplication conditionally.  For example, deduplication of indexes
+   on a <type>text</type> column (with the default
+   <literal>btree/text_ops</literal> operator class) is not supported
+   when the column uses a  nondeterministic collation.
+  </para>
+  <para>
+   <literal>INCLUDE</literal> indexes do not support deduplication.
   </para>
 
+ </sect2>
+
+ <sect2 id="btree-deduplication-unique">
+  <title>Internal use of Deduplication in unique indexes</title>
+
+  <para>
+   Page splits that occur due to inserting multiple physical versions
+   (rather than inserting new logical rows) tend to degrade the
+   structure of indexes, especially in the case of unique indexes.
+   Unique indexes use deduplication <emphasis>internally</emphasis>
+   and <emphasis>selectively</emphasis> to delay (and ideally to
+   prevent) these <quote>unnecessary</quote> page splits.
+   <command>VACUUM</command> or autovacuum will eventually remove dead
+   versions of tuples from every index in any case, but usually cannot
+   reverse page splits (in general, the page must be completely empty
+   before <command>VACUUM</command> can <quote>delete</quote> it).
+  </para>
+  <para>
+   The <xref linkend="guc-btree-deduplicate-items"/> configuration
+   parameter does not affect whether or not deduplication is used
+   within unique indexes.  The internal use of deduplication for
+   unique indexes is subject to all of the same restrictions as
+   deduplication in general.  The <literal>deduplicate_items</literal>
+   storage parameter can be set to <literal>OFF</literal> to disable
+   deduplication in unique indexes, but this is intended only as a
+   debugging option for developers.
+  </para>
+
+ </sect2>
 </sect1>
 
 </chapter>
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 057a6bb81a..20cdfabd7b 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -928,10 +928,11 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
      nondeterministic collations give a more <quote>correct</quote> behavior,
      especially when considering the full power of Unicode and its many
      special cases, they also have some drawbacks.  Foremost, their use leads
-     to a performance penalty.  Also, certain operations are not possible with
-     nondeterministic collations, such as pattern matching operations.
-     Therefore, they should be used only in cases where they are specifically
-     wanted.
+     to a performance penalty.  Note, in particular, that B-tree cannot use
+     deduplication with indexes that use a nondeterministic collation.  Also,
+     certain operations are not possible with nondeterministic collations,
+     such as pattern matching operations.  Therefore, they should be used
+     only in cases where they are specifically wanted.
     </para>
    </sect3>
   </sect2>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e07dc01e80..864401ec32 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8043,6 +8043,31 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-btree-deduplicate-items" xreflabel="deduplicate_btree_items">
+      <term><varname>deduplicate_btree_items</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>deduplicate_btree_items</varname></primary>
+       <secondary>configuration parameter</secondary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Controls whether deduplication should be used within B-Tree
+        indexes.  Deduplication is an optimization that reduces the
+        storage size of indexes by storing equal index keys only once.
+        See <xref linkend="btree-deduplication"/> for more
+        information.
+       </para>
+
+       <para>
+        This setting can be overridden for individual B-Tree indexes
+        by changing index storage parameters.  See <xref
+        linkend="sql-createindex-storage-parameters"/> from the
+        <command>CREATE INDEX</command> documentation for details.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-bytea-output" xreflabel="bytea_output">
       <term><varname>bytea_output</varname> (<type>enum</type>)
       <indexterm>
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index ab362a0dc5..af12566a87 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -171,6 +171,8 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
         maximum size allowed for the index type, data insertion will fail.
         In any case, non-key columns duplicate data from the index's table
         and bloat the size of the index, thus potentially slowing searches.
+        Moreover, B-tree deduplication is never used with indexes that
+        have a non-key column.
        </para>
 
        <para>
@@ -393,10 +395,40 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    </variablelist>
 
    <para>
-    B-tree indexes additionally accept this parameter:
+    B-tree indexes also accept these parameters:
    </para>
 
    <variablelist>
+   <varlistentry id="index-reloption-deduplication" xreflabel="deduplicate_items">
+    <term><literal>deduplicate_items</literal>
+     <indexterm>
+      <primary><varname>deduplicate_items</varname></primary>
+      <secondary>storage parameter</secondary>
+     </indexterm>
+    </term>
+    <listitem>
+    <para>
+      Per-index value for <xref
+      linkend="guc-btree-deduplicate-items"/>.  Controls usage of the
+      B-tree deduplication technique described in <xref
+      linkend="btree-deduplication"/>.  Set to <literal>ON</literal>
+      or <literal>OFF</literal> to override GUC.  (Alternative
+      spellings of <literal>ON</literal> and <literal>OFF</literal>
+      are allowed as described in <xref linkend="config-setting"/>.)
+      The default is <literal>ON</literal>.
+    </para>
+
+    <note>
+     <para>
+      Turning <literal>deduplicate_items</literal> off via
+      <command>ALTER INDEX</command> prevents future insertions from
+      triggering deduplication, but does not in itself make existing
+      posting list tuples use the standard tuple representation.
+     </para>
+    </note>
+    </listitem>
+   </varlistentry>
+
    <varlistentry id="index-reloption-vacuum-cleanup-index-scale-factor" xreflabel="vacuum_cleanup_index_scale_factor">
     <term><literal>vacuum_cleanup_index_scale_factor</literal>
      <indexterm>
@@ -451,9 +483,7 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
      This setting controls usage of the fast update technique described in
      <xref linkend="gin-fast-update"/>.  It is a Boolean parameter:
      <literal>ON</literal> enables fast update, <literal>OFF</literal> disables it.
-     (Alternative spellings of <literal>ON</literal> and <literal>OFF</literal> are
-     allowed as described in <xref linkend="config-setting"/>.)  The
-     default is <literal>ON</literal>.
+     The default is <literal>ON</literal>.
     </para>
 
     <note>
diff --git a/src/test/regress/expected/btree_index.out b/src/test/regress/expected/btree_index.out
index f567117a46..3d353cefdf 100644
--- a/src/test/regress/expected/btree_index.out
+++ b/src/test/regress/expected/btree_index.out
@@ -266,6 +266,22 @@ select * from btree_bpchar where f1::bpchar like 'foo%';
  fool
 (2 rows)
 
+--
+-- Perform unique checking, with and without the use of deduplication
+--
+CREATE TABLE dedup_unique_test_table (a int) WITH (autovacuum_enabled=false);
+CREATE UNIQUE INDEX dedup_unique ON dedup_unique_test_table (a) WITH (deduplicate_items=on);
+CREATE UNIQUE INDEX plain_unique ON dedup_unique_test_table (a) WITH (deduplicate_items=off);
+-- Generate enough garbage tuples in index to ensure that even the unique index
+-- with deduplication enabled has to check multiple leaf pages during unique
+-- checking (at least with a BLCKSZ of 8192 or less)
+DO $$
+BEGIN
+    FOR r IN 1..1350 LOOP
+        DELETE FROM dedup_unique_test_table;
+        INSERT INTO dedup_unique_test_table SELECT 1;
+    END LOOP;
+END$$;
 --
 -- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
diff --git a/src/test/regress/sql/btree_index.sql b/src/test/regress/sql/btree_index.sql
index 558dcae0ec..b0b81b2b9a 100644
--- a/src/test/regress/sql/btree_index.sql
+++ b/src/test/regress/sql/btree_index.sql
@@ -103,6 +103,23 @@ explain (costs off)
 select * from btree_bpchar where f1::bpchar like 'foo%';
 select * from btree_bpchar where f1::bpchar like 'foo%';
 
+--
+-- Perform unique checking, with and without the use of deduplication
+--
+CREATE TABLE dedup_unique_test_table (a int) WITH (autovacuum_enabled=false);
+CREATE UNIQUE INDEX dedup_unique ON dedup_unique_test_table (a) WITH (deduplicate_items=on);
+CREATE UNIQUE INDEX plain_unique ON dedup_unique_test_table (a) WITH (deduplicate_items=off);
+-- Generate enough garbage tuples in index to ensure that even the unique index
+-- with deduplication enabled has to check multiple leaf pages during unique
+-- checking (at least with a BLCKSZ of 8192 or less)
+DO $$
+BEGIN
+    FOR r IN 1..1350 LOOP
+        DELETE FROM dedup_unique_test_table;
+        INSERT INTO dedup_unique_test_table SELECT 1;
+    END LOOP;
+END$$;
+
 --
 -- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
-- 
2.17.1

#132

Peter Geoghegan

pg@bowt.ie

almost 6 years ago

In reply to: Peter Geoghegan (#131)

3 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Tue, Jan 28, 2020 at 5:29 PM Peter Geoghegan <pg@bowt.ie> wrote:

In my opinion, the patch is now pretty close to being committable.

Attached is v32, which is even closer to being committable.

I do have two outstanding open items for the patch, though. These items
are:

* We still need infrastructure that marks B-Tree opclasses as safe for
deduplication, to avoid things like the numeric display scale problem,
collations that are unsafe for deduplication because they're
nondeterministic, etc.

No progress on this item for v32, though. It's now my only open item
for this entire project. Getting very close.

* Make VACUUM's WAL record more space efficient when it contains one
or more "updates" to an existing posting list tuple.

* I've focussed on this item in v32 -- it has been closed out. v32
doesn't explicitly WAL-log post-update index tuples during vacuuming
of posting list tuples, making the WAL records a lot smaller in some
cases.

v32 represents the posting list TIDs that must be deleted instead. It
does this in the most WAL-space-efficient manner possible: by storing
an array of uint16 offsets for each "updated" posting list within
xl_btree_vacuum records -- each entry in each array is an offset to
remove (i.e. a TID that should not appear in the updated version of
the tuple). We use a new nbtdedup.c utility function for this,
_bt_update_posting(). The new function is similar to its neighbor
function, _bt_swap_posting(), which is the nbtdedup.c utility function
used during posting list splits. Just like _bt_swap_posting(), we call
_bt_update_posting() both during the initial action, and from the REDO
routine that replays that action.

Performing vacuuming of posting list tuples this way seems to matter
with larger databases that depend on deduplication to control bloat,
though I haven't taken the time to figure out exactly how much it
matters. I'm satisfied that this is worth having based on
microbenchmarks that measure WAL volume using pg_waldump. One
microbenchmark showed something like a 10x decrease in the size of all
xl_btree_vacuum records taken together compared to v31.

I'm pretty sure that v32 makes it all but impossible for deduplication
to write out more WAL than an equivalent case with deduplication
disabled (I'm excluding FPIs here, of course -- full_page_writes=on
cases will see significant benefits from reduced FPIs, simply by
having fewer index pages). The per-leaf-page WAL record header
accounts for a lot of the space overhead of xl_btree_vacuum records,
and we naturally reduce that overhead when deduplicating, so we can
now noticeably come out ahead when it comes to overall WAL volume. I
wouldn't say that reducing WAL volume (other than FPIs) is actually a
goal of this project, but it might end up happening anyway. Apparently
Microsoft Azure PostgreSQL uses full_page_writes=off, so not everyone
cares about the number of FPIs (everyone cares about raw record size,
though).

* Removed the GUC that controls the use of deduplication in this new
version, per discussion with Robert over on the "Enabling B-Tree
deduplication by default" thread.

Perhaps we can get by with only an index storage parameter. Let's
defer this until after the Postgres 13 beta period is over, and we get
feedback from testers.

* Turned the documentation on deduplication in the B-Tree internals
chapter into a more general discussion of the on-disk format that
covers deduplication.

Deduplication enhances this on-disk representation, and discussing it
outside that wider context always felt awkward to me. Having this kind
of discussion in the docs seems like a good idea anyway.

--
Peter Geoghegan

Attachments:

v32-0002-Teach-pageinspect-about-nbtree-posting-lists.patchapplication/x-patch; name=v32-0002-Teach-pageinspect-about-nbtree-posting-lists.patchDownload

From e8a84f3bb9c0c4633d72d86482c34fe948657908 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v32 2/3] Teach pageinspect about nbtree posting lists.

Add a column for posting list TIDs to bt_page_items().  Also add a
column that displays a single heap TID value for each tuple, regardless
of whether or not "ctid" is used for heap TID.  In the case of posting
list tuples, the value is the lowest heap TID in the posting list.
Arguably I should have done this when commit dd299df8 went in, since
that added a pivot tuple representation that could have a heap TID but
didn't use ctid for that purpose.

Also add a boolean column that displays the LP_DEAD bit value for each
non-pivot tuple.

No version bump for the pageinspect extension, since there hasn't been a
stable release since the last version bump (see commit 58b4cb30).
---
 contrib/pageinspect/btreefuncs.c              | 118 +++++++++++++++---
 contrib/pageinspect/expected/btree.out        |   7 ++
 contrib/pageinspect/pageinspect--1.7--1.8.sql |  53 ++++++++
 doc/src/sgml/pageinspect.sgml                 |  83 ++++++------
 4 files changed, 206 insertions(+), 55 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 564c818558..228e8912fc 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -31,9 +31,11 @@
 #include "access/relation.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_am.h"
+#include "catalog/pg_type.h"
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "pageinspect.h"
+#include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
 #include "utils/varlena.h"
@@ -45,6 +47,8 @@ PG_FUNCTION_INFO_V1(bt_page_stats);
 
 #define IS_INDEX(r) ((r)->rd_rel->relkind == RELKIND_INDEX)
 #define IS_BTREE(r) ((r)->rd_rel->relam == BTREE_AM_OID)
+#define DatumGetItemPointer(X)	 ((ItemPointer) DatumGetPointer(X))
+#define ItemPointerGetDatum(X)	 PointerGetDatum(X)
 
 /* note: BlockNumber is unsigned, hence can't be negative */
 #define CHECK_RELATION_BLOCK_RANGE(rel, blkno) { \
@@ -243,6 +247,9 @@ struct user_args
 {
 	Page		page;
 	OffsetNumber offset;
+	bool		leafpage;
+	bool		rightmost;
+	TupleDesc	tupd;
 };
 
 /*-------------------------------------------------------
@@ -252,17 +259,25 @@ struct user_args
  * ------------------------------------------------------
  */
 static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, struct user_args *uargs)
 {
-	char	   *values[6];
+	Page		page = uargs->page;
+	OffsetNumber offset = uargs->offset;
+	bool		leafpage = uargs->leafpage;
+	bool		rightmost = uargs->rightmost;
+	bool		pivotoffset;
+	Datum		values[9];
+	bool		nulls[9];
 	HeapTuple	tuple;
 	ItemId		id;
 	IndexTuple	itup;
 	int			j;
 	int			off;
 	int			dlen;
-	char	   *dump;
+	char	   *dump,
+			   *datacstring;
 	char	   *ptr;
+	ItemPointer htid;
 
 	id = PageGetItemId(page, offset);
 
@@ -272,18 +287,27 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	itup = (IndexTuple) PageGetItem(page, id);
 
 	j = 0;
-	values[j++] = psprintf("%d", offset);
-	values[j++] = psprintf("(%u,%u)",
-						   ItemPointerGetBlockNumberNoCheck(&itup->t_tid),
-						   ItemPointerGetOffsetNumberNoCheck(&itup->t_tid));
-	values[j++] = psprintf("%d", (int) IndexTupleSize(itup));
-	values[j++] = psprintf("%c", IndexTupleHasNulls(itup) ? 't' : 'f');
-	values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
+	memset(nulls, 0, sizeof(nulls));
+	values[j++] = DatumGetInt16(offset);
+	values[j++] = ItemPointerGetDatum(&itup->t_tid);
+	values[j++] = Int32GetDatum((int) IndexTupleSize(itup));
+	values[j++] = BoolGetDatum(IndexTupleHasNulls(itup));
+	values[j++] = BoolGetDatum(IndexTupleHasVarwidths(itup));
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
 	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+
+	/*
+	 * Make sure that "data" column does not include posting list or pivot
+	 * tuple representation of heap TID
+	 */
+	if (BTreeTupleIsPosting(itup))
+		dlen -= IndexTupleSize(itup) - BTreeTupleGetPostingOffset(itup);
+	else if (BTreeTupleIsPivot(itup) && BTreeTupleGetHeapTID(itup) != NULL)
+		dlen -= MAXALIGN(sizeof(ItemPointerData));
+
 	dump = palloc0(dlen * 3 + 1);
-	values[j] = dump;
+	datacstring = dump;
 	for (off = 0; off < dlen; off++)
 	{
 		if (off > 0)
@@ -291,8 +315,57 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 		sprintf(dump, "%02x", *(ptr + off) & 0xff);
 		dump += 2;
 	}
+	values[j++] = CStringGetTextDatum(datacstring);
+	pfree(datacstring);
 
-	tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
+	/*
+	 * Avoid indicating that pivot tuple from !heapkeyspace index (which won't
+	 * have v4+ status bit set) is dead or has a heap TID -- that can only
+	 * happen with non-pivot tuples.  (Most backend code can use the
+	 * heapkeyspace field from the metapage to figure out which representation
+	 * to expect, but we have to be a bit creative here.)
+	 */
+	pivotoffset = (!leafpage || (!rightmost && offset == P_HIKEY));
+
+	/* LP_DEAD status bit */
+	if (!pivotoffset)
+		values[j++] = BoolGetDatum(ItemIdIsDead(id));
+	else
+		nulls[j++] = true;
+
+	htid = BTreeTupleGetHeapTID(itup);
+	if (pivotoffset && !BTreeTupleIsPivot(itup))
+		htid = NULL;
+
+	if (htid)
+		values[j++] = ItemPointerGetDatum(htid);
+	else
+		nulls[j++] = true;
+
+	if (BTreeTupleIsPosting(itup))
+	{
+		/* build an array of item pointers */
+		ItemPointer tids;
+		Datum	   *tids_datum;
+		int			nposting;
+
+		tids = BTreeTupleGetPosting(itup);
+		nposting = BTreeTupleGetNPosting(itup);
+		tids_datum = (Datum *) palloc(nposting * sizeof(Datum));
+		for (int i = 0; i < nposting; i++)
+			tids_datum[i] = ItemPointerGetDatum(&tids[i]);
+		values[j++] = PointerGetDatum(construct_array(tids_datum,
+													  nposting,
+													  TIDOID,
+													  sizeof(ItemPointerData),
+													  false, 's'));
+		pfree(tids_datum);
+	}
+	else
+		nulls[j++] = true;
+
+	/* Build and return the result tuple */
+	tuple = heap_form_tuple(uargs->tupd, values, nulls);
 
 	return HeapTupleGetDatum(tuple);
 }
@@ -378,12 +451,13 @@ bt_page_items(PG_FUNCTION_ARGS)
 			elog(NOTICE, "page is deleted");
 
 		fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+		uargs->leafpage = P_ISLEAF(opaque);
+		uargs->rightmost = P_RIGHTMOST(opaque);
 
 		/* Build a tuple descriptor for our result type */
 		if (get_call_result_type(fcinfo, NULL, &tupleDesc) != TYPEFUNC_COMPOSITE)
 			elog(ERROR, "return type must be a row type");
-
-		fctx->attinmeta = TupleDescGetAttInMetadata(tupleDesc);
+		uargs->tupd = tupleDesc;
 
 		fctx->user_fctx = uargs;
 
@@ -395,7 +469,7 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
@@ -463,12 +537,13 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 			elog(NOTICE, "page is deleted");
 
 		fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+		uargs->leafpage = P_ISLEAF(opaque);
+		uargs->rightmost = P_RIGHTMOST(opaque);
 
 		/* Build a tuple descriptor for our result type */
 		if (get_call_result_type(fcinfo, NULL, &tupleDesc) != TYPEFUNC_COMPOSITE)
 			elog(ERROR, "return type must be a row type");
-
-		fctx->attinmeta = TupleDescGetAttInMetadata(tupleDesc);
+		uargs->tupd = tupleDesc;
 
 		fctx->user_fctx = uargs;
 
@@ -480,7 +555,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
@@ -510,7 +585,7 @@ bt_metap(PG_FUNCTION_ARGS)
 	BTMetaPageData *metad;
 	TupleDesc	tupleDesc;
 	int			j;
-	char	   *values[8];
+	char	   *values[9];
 	Buffer		buffer;
 	Page		page;
 	HeapTuple	tuple;
@@ -557,17 +632,20 @@ bt_metap(PG_FUNCTION_ARGS)
 
 	/*
 	 * Get values of extended metadata if available, use default values
-	 * otherwise.
+	 * otherwise.  Note that we rely on the assumption that btm_safededup is
+	 * initialized to zero on databases that were initdb'd before Postgres 13.
 	 */
 	if (metad->btm_version >= BTREE_NOVAC_VERSION)
 	{
 		values[j++] = psprintf("%u", metad->btm_oldest_btpo_xact);
 		values[j++] = psprintf("%f", metad->btm_last_cleanup_num_heap_tuples);
+		values[j++] = metad->btm_safededup ? "t" : "f";
 	}
 	else
 	{
 		values[j++] = "0";
 		values[j++] = "-1";
+		values[j++] = "f";
 	}
 
 	tuple = BuildTupleFromCStrings(TupleDescGetAttInMetadata(tupleDesc),
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..92d5c59654 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -12,6 +12,7 @@ fastroot                | 1
 fastlevel               | 0
 oldest_xact             | 0
 last_cleanup_num_tuples | -1
+safededup               | t
 
 SELECT * FROM bt_page_stats('test1_a_idx', 0);
 ERROR:  block 0 is a meta page
@@ -41,6 +42,9 @@ itemlen    | 16
 nulls      | f
 vars       | f
 data       | 01 00 00 00 00 00 00 01
+dead       | f
+htid       | (0,1)
+tids       | 
 
 SELECT * FROM bt_page_items('test1_a_idx', 2);
 ERROR:  block number out of range
@@ -54,6 +58,9 @@ itemlen    | 16
 nulls      | f
 vars       | f
 data       | 01 00 00 00 00 00 00 01
+dead       | f
+htid       | (0,1)
+tids       | 
 
 SELECT * FROM bt_page_items(get_raw_page('test1_a_idx', 2));
 ERROR:  block number 2 is out of range for relation "test1_a_idx"
diff --git a/contrib/pageinspect/pageinspect--1.7--1.8.sql b/contrib/pageinspect/pageinspect--1.7--1.8.sql
index 2a7c4b3516..93ea37cde3 100644
--- a/contrib/pageinspect/pageinspect--1.7--1.8.sql
+++ b/contrib/pageinspect/pageinspect--1.7--1.8.sql
@@ -14,3 +14,56 @@ CREATE FUNCTION heap_tuple_infomask_flags(
 RETURNS record
 AS 'MODULE_PATHNAME', 'heap_tuple_infomask_flags'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_metap()
+--
+DROP FUNCTION bt_metap(text);
+CREATE FUNCTION bt_metap(IN relname text,
+    OUT magic int4,
+    OUT version int4,
+    OUT root int4,
+    OUT level int4,
+    OUT fastroot int4,
+    OUT fastlevel int4,
+    OUT oldest_xact int4,
+    OUT last_cleanup_num_tuples real,
+    OUT safededup boolean)
+AS 'MODULE_PATHNAME', 'bt_metap'
+LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items(text, int4)
+--
+DROP FUNCTION bt_page_items(text, int4);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT dead boolean,
+    OUT htid tid,
+    OUT tids tid[])
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items(bytea)
+--
+DROP FUNCTION bt_page_items(bytea);
+CREATE FUNCTION bt_page_items(IN page bytea,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT dead boolean,
+    OUT htid tid,
+    OUT tids tid[])
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items_bytea'
+LANGUAGE C STRICT PARALLEL SAFE;
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 7e2e1487d7..b527daf6ca 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -300,13 +300,14 @@ test=# SELECT t_ctid, raw_flags, combined_flags
 test=# SELECT * FROM bt_metap('pg_cast_oid_index');
 -[ RECORD 1 ]-----------+-------
 magic                   | 340322
-version                 | 3
+version                 | 4
 root                    | 1
 level                   | 0
 fastroot                | 1
 fastlevel               | 0
 oldest_xact             | 582
 last_cleanup_num_tuples | 1000
+safededup               | f
 </screen>
      </para>
     </listitem>
@@ -329,11 +330,11 @@ test=# SELECT * FROM bt_page_stats('pg_cast_oid_index', 1);
 -[ RECORD 1 ]-+-----
 blkno         | 1
 type          | l
-live_items    | 256
+live_items    | 224
 dead_items    | 0
-avg_item_size | 12
+avg_item_size | 16
 page_size     | 8192
-free_size     | 4056
+free_size     | 3668
 btpo_prev     | 0
 btpo_next     | 0
 btpo          | 0
@@ -356,33 +357,45 @@ btpo_flags    | 3
       <function>bt_page_items</function> returns detailed information about
       all of the items on a B-tree index page.  For example:
 <screen>
-test=# SELECT * FROM bt_page_items('pg_cast_oid_index', 1);
- itemoffset |  ctid   | itemlen | nulls | vars |    data
-------------+---------+---------+-------+------+-------------
-          1 | (0,1)   |      12 | f     | f    | 23 27 00 00
-          2 | (0,2)   |      12 | f     | f    | 24 27 00 00
-          3 | (0,3)   |      12 | f     | f    | 25 27 00 00
-          4 | (0,4)   |      12 | f     | f    | 26 27 00 00
-          5 | (0,5)   |      12 | f     | f    | 27 27 00 00
-          6 | (0,6)   |      12 | f     | f    | 28 27 00 00
-          7 | (0,7)   |      12 | f     | f    | 29 27 00 00
-          8 | (0,8)   |      12 | f     | f    | 2a 27 00 00
+regression=# SELECT * FROM bt_page_items('tenk2_unique1', 5);
+ itemoffset |   ctid   | itemlen | nulls | vars |          data           | dead |   htid   | tids
+------------+----------+---------+-------+------+-------------------------+------+----------+------
+          1 | (40,1)   |      16 | f     | f    | b8 05 00 00 00 00 00 00 |      |          |
+          2 | (58,11)  |      16 | f     | f    | 4a 04 00 00 00 00 00 00 | f    | (58,11)  |
+          3 | (266,4)  |      16 | f     | f    | 4b 04 00 00 00 00 00 00 | f    | (266,4)  |
+          4 | (279,25) |      16 | f     | f    | 4c 04 00 00 00 00 00 00 | f    | (279,25) |
+          5 | (333,11) |      16 | f     | f    | 4d 04 00 00 00 00 00 00 | f    | (333,11) |
+          6 | (87,24)  |      16 | f     | f    | 4e 04 00 00 00 00 00 00 | f    | (87,24)  |
+          7 | (38,22)  |      16 | f     | f    | 4f 04 00 00 00 00 00 00 | f    | (38,22)  |
+          8 | (272,17) |      16 | f     | f    | 50 04 00 00 00 00 00 00 | f    | (272,17) |
 </screen>
-      In a B-tree leaf page, <structfield>ctid</structfield> points to a heap tuple.
-      In an internal page, the block number part of <structfield>ctid</structfield>
-      points to another page in the index itself, while the offset part
-      (the second number) is ignored and is usually 1.
+      In a B-tree leaf page, <structfield>ctid</structfield> usually
+      points to a heap tuple, and <structfield>dead</structfield> may
+      indicate that the item has its <literal>LP_DEAD</literal> bit
+      set.  In an internal page, the block number part of
+      <structfield>ctid</structfield> points to another page in the
+      index itself, while the offset part (the second number) encodes
+      metadata about the tuple.  Posting list tuples on leaf pages
+      also use <structfield>ctid</structfield> for metadata.
+      <structfield>htid</structfield> always shows a single heap TID
+      for the tuple, regardless of how it is represented (internal
+      page tuples may need to store a heap TID when there are many
+      duplicate tuples on descendent leaf pages).
+      <structfield>tids</structfield> is a list of TIDs that is stored
+      within posting list tuples (tuples created by deduplication).
      </para>
      <para>
       Note that the first item on any non-rightmost page (any page with
       a non-zero value in the <structfield>btpo_next</structfield> field) is the
       page's <quote>high key</quote>, meaning its <structfield>data</structfield>
       serves as an upper bound on all items appearing on the page, while
-      its <structfield>ctid</structfield> field is meaningless.  Also, on non-leaf
-      pages, the first real data item (the first item that is not a high
-      key) is a <quote>minus infinity</quote> item, with no actual value
-      in its <structfield>data</structfield> field.  Such an item does have a valid
-      downlink in its <structfield>ctid</structfield> field, however.
+      its <structfield>ctid</structfield> field does not point to
+      another block.  Also, on non-leaf pages, the first real data item
+      (the first item that is not a high key) is a <quote>minus
+      infinity</quote> item, with no actual value in its
+      <structfield>data</structfield> field.  Such an item does have a
+      valid downlink in its <structfield>ctid</structfield> field,
+      however.
      </para>
     </listitem>
    </varlistentry>
@@ -402,17 +415,17 @@ test=# SELECT * FROM bt_page_items('pg_cast_oid_index', 1);
       with <function>get_raw_page</function> should be passed as argument.  So
       the last example could also be rewritten like this:
 <screen>
-test=# SELECT * FROM bt_page_items(get_raw_page('pg_cast_oid_index', 1));
- itemoffset |  ctid   | itemlen | nulls | vars |    data
-------------+---------+---------+-------+------+-------------
-          1 | (0,1)   |      12 | f     | f    | 23 27 00 00
-          2 | (0,2)   |      12 | f     | f    | 24 27 00 00
-          3 | (0,3)   |      12 | f     | f    | 25 27 00 00
-          4 | (0,4)   |      12 | f     | f    | 26 27 00 00
-          5 | (0,5)   |      12 | f     | f    | 27 27 00 00
-          6 | (0,6)   |      12 | f     | f    | 28 27 00 00
-          7 | (0,7)   |      12 | f     | f    | 29 27 00 00
-          8 | (0,8)   |      12 | f     | f    | 2a 27 00 00
+regression=# SELECT * FROM bt_page_items(get_raw_page('tenk2_unique1', 5));
+ itemoffset |   ctid   | itemlen | nulls | vars |          data           | dead |   htid   | tids
+------------+----------+---------+-------+------+-------------------------+------+----------+------
+          1 | (40,1)   |      16 | f     | f    | b8 05 00 00 00 00 00 00 |      |          |
+          2 | (58,11)  |      16 | f     | f    | 4a 04 00 00 00 00 00 00 | f    | (58,11)  |
+          3 | (266,4)  |      16 | f     | f    | 4b 04 00 00 00 00 00 00 | f    | (266,4)  |
+          4 | (279,25) |      16 | f     | f    | 4c 04 00 00 00 00 00 00 | f    | (279,25) |
+          5 | (333,11) |      16 | f     | f    | 4d 04 00 00 00 00 00 00 | f    | (333,11) |
+          6 | (87,24)  |      16 | f     | f    | 4e 04 00 00 00 00 00 00 | f    | (87,24)  |
+          7 | (38,22)  |      16 | f     | f    | 4f 04 00 00 00 00 00 00 | f    | (38,22)  |
+          8 | (272,17) |      16 | f     | f    | 50 04 00 00 00 00 00 00 | f    | (272,17) |
 </screen>
       All the other details are the same as explained in the previous item.
      </para>
-- 
2.17.1

v32-0003-DEBUG-Show-index-values-in-pageinspect.patchapplication/x-patch; name=v32-0003-DEBUG-Show-index-values-in-pageinspect.patchDownload

From 9e491a03ed8180e10f0e22bd321ae453964301bc Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Nov 2019 19:35:30 -0800
Subject: [PATCH v32 3/3] DEBUG: Show index values in pageinspect

This is not intended for commit.  It is included as a convenience for
reviewers.
---
 contrib/pageinspect/btreefuncs.c       | 64 ++++++++++++++++++--------
 contrib/pageinspect/expected/btree.out |  2 +-
 2 files changed, 46 insertions(+), 20 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 228e8912fc..ef7584c70f 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -245,6 +245,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
  */
 struct user_args
 {
+	Relation	rel;
 	Page		page;
 	OffsetNumber offset;
 	bool		leafpage;
@@ -261,6 +262,7 @@ struct user_args
 static Datum
 bt_page_print_tuples(FuncCallContext *fctx, struct user_args *uargs)
 {
+	Relation	rel = uargs->rel;
 	Page		page = uargs->page;
 	OffsetNumber offset = uargs->offset;
 	bool		leafpage = uargs->leafpage;
@@ -295,26 +297,48 @@ bt_page_print_tuples(FuncCallContext *fctx, struct user_args *uargs)
 	values[j++] = BoolGetDatum(IndexTupleHasVarwidths(itup));
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
-	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
-
-	/*
-	 * Make sure that "data" column does not include posting list or pivot
-	 * tuple representation of heap TID
-	 */
-	if (BTreeTupleIsPosting(itup))
-		dlen -= IndexTupleSize(itup) - BTreeTupleGetPostingOffset(itup);
-	else if (BTreeTupleIsPivot(itup) && BTreeTupleGetHeapTID(itup) != NULL)
-		dlen -= MAXALIGN(sizeof(ItemPointerData));
-
-	dump = palloc0(dlen * 3 + 1);
-	datacstring = dump;
-	for (off = 0; off < dlen; off++)
+	if (rel)
 	{
-		if (off > 0)
-			*dump++ = ' ';
-		sprintf(dump, "%02x", *(ptr + off) & 0xff);
-		dump += 2;
+		TupleDesc	itupdesc = RelationGetDescr(rel);
+		Datum		datvalues[INDEX_MAX_KEYS];
+		bool		isnull[INDEX_MAX_KEYS];
+		int			natts;
+		int			indnkeyatts = rel->rd_index->indnkeyatts;
+
+		natts = BTreeTupleGetNAtts(itup, rel);
+
+		itupdesc->natts = Min(indnkeyatts, natts);
+		memset(&isnull, 0xFF, sizeof(isnull));
+		index_deform_tuple(itup, itupdesc, datvalues, isnull);
+		rel->rd_index->indnkeyatts = natts;
+		datacstring = BuildIndexValueDescription(rel, datvalues, isnull);
+		itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+		rel->rd_index->indnkeyatts = indnkeyatts;
 	}
+	else
+	{
+		dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+
+		/*
+		 * Make sure that "data" column does not include posting list or pivot
+		 * tuple representation of heap TID
+		 */
+		if (BTreeTupleIsPosting(itup))
+			dlen -= IndexTupleSize(itup) - BTreeTupleGetPostingOffset(itup);
+		else if (BTreeTupleIsPivot(itup) && BTreeTupleGetHeapTID(itup) != NULL)
+			dlen -= MAXALIGN(sizeof(ItemPointerData));
+
+		dump = palloc0(dlen * 3 + 1);
+		datacstring = dump;
+		for (off = 0; off < dlen; off++)
+		{
+			if (off > 0)
+				*dump++ = ' ';
+			sprintf(dump, "%02x", *(ptr + off) & 0xff);
+			dump += 2;
+		}
+	}
+
 	values[j++] = CStringGetTextDatum(datacstring);
 	pfree(datacstring);
 
@@ -437,11 +461,11 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = rel;
 		uargs->page = palloc(BLCKSZ);
 		memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
 
 		UnlockReleaseBuffer(buffer);
-		relation_close(rel, AccessShareLock);
 
 		uargs->offset = FirstOffsetNumber;
 
@@ -475,6 +499,7 @@ bt_page_items(PG_FUNCTION_ARGS)
 	}
 	else
 	{
+		relation_close(uargs->rel, AccessShareLock);
 		pfree(uargs->page);
 		pfree(uargs);
 		SRF_RETURN_DONE(fctx);
@@ -522,6 +547,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = NULL;
 		uargs->page = VARDATA(raw_page);
 
 		uargs->offset = FirstOffsetNumber;
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 92d5c59654..fc6794ef65 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -41,7 +41,7 @@ ctid       | (0,1)
 itemlen    | 16
 nulls      | f
 vars       | f
-data       | 01 00 00 00 00 00 00 01
+data       | (a)=(72057594037927937)
 dead       | f
 htid       | (0,1)
 tids       | 
-- 
2.17.1

v32-0001-Add-deduplication-to-nbtree.patchapplication/x-patch; name=v32-0001-Add-deduplication-to-nbtree.patchDownload

From e4086257a20280c49c2f1401f5fe9ecae4d318a5 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 25 Jan 2020 14:40:46 -0800
Subject: [PATCH v32 1/3] Add deduplication to nbtree.

Deduplication reduces the storage overhead of duplicates in indexes that
use the standard nbtree index access method.  The deduplication process
is applied lazily, after the point where opportunistic deletion of
LP_DEAD-marked index tuples occurs.  Deduplication is only applied at
the point where a leaf page split would otherwise be required.  New
"posting list tuples" are formed by merging together existing duplicate
tuples.  The physical representation of the items on an nbtree leaf page
is made more space efficient by deduplication, but the logical contents
of the page are not changed.

Deduplication merges together duplicates that happen to have been
created by an UPDATE that did not use an optimization like heapam's
Heap-only tuples (HOT).  Deduplication is effective at absorbing
"version bloat" without any special knowledge of row versions or of
MVCC.  Deduplication is applied within unique indexes for this reason,
though the criteria for triggering a deduplication is slightly
different.  Deduplication of a unique index is triggered only when the
incoming item is a duplicate of an existing item (and when the page
would otherwise split), which is a sure sign of "version bloat".

The lazy approach taken by nbtree has significant advantages over a
GIN style eager approach.  Most individual inserts of index tuples have
exactly the same overhead as before.  The extra overhead of
deduplication is amortized across insertions, just like the overhead of
page splits.  The key space of indexes works in the same way as it has
since commit dd299df8 (the commit which made heap TID a tiebreaker
column), since only the physical representation of tuples is changed.  A
new index storage parameter (deduplicate_items) controls the use of
deduplication.  The default setting is 'on', so all B-Tree indexes use
deduplication when only deduplication safe operator classes are used.
We should review this decision at the end of the Postgres 13 beta
period.

Testing has shown that nbtree deduplication can generally make indexes
with about 10 or 15 tuples for each distinct key value about 2.5X - 4X
smaller, even with single column integer indexes (e.g., an index on a
referencing column that accompanies a foreign key).  The final size of
single column nbtree indexes comes close to the final size of a similar
contrib/btree_gin index, at least in cases where GIN's posting list
compression isn't very effective.  This can significantly improve
transaction throughput, and significantly lessen the ongoing cost of
vacuuming indexes.

There is a regression of approximately 2% of transaction throughput with
workloads that consist of append-only inserts into a table with several
non-unique indexes, where all indexes have few or no repeated values.
This is tentatively considered to be an acceptable downside to enabling
deduplication by default.  Again, the final word on this will come at
the end of the beta period, when we get some feedback from users.

Bump XLOG_PAGE_MAGIC because xl_btree_vacuum changed.

No bump in BTREE_VERSION, since deduplication only affects the physical
representation of tuples.  However, users must still REINDEX a
pg_upgrade'd index to before its leaf page splits will apply
deduplication.  An index build is the only way to set the new nbtree
metapage flag indicating that deduplication is generally safe.

Author: Anastasia Lubennikova, Peter Geoghegan
Reviewed-By: Peter Geoghegan, Heikki Linnakangas
Discussion:
    https://postgr.es/m/55E4051B.7020209@postgrespro.ru
    https://postgr.es/m/4ab6e2db-bcee-f4cf-0916-3a06e6ccbb55@postgrespro.ru
---
 src/include/access/nbtree.h               | 435 ++++++++++--
 src/include/access/nbtxlog.h              | 117 ++-
 src/include/access/rmgrlist.h             |   2 +-
 src/backend/access/common/reloptions.c    |   9 +
 src/backend/access/index/genam.c          |   4 +
 src/backend/access/nbtree/Makefile        |   1 +
 src/backend/access/nbtree/README          | 133 +++-
 src/backend/access/nbtree/nbtdedup.c      | 823 ++++++++++++++++++++++
 src/backend/access/nbtree/nbtinsert.c     | 387 ++++++++--
 src/backend/access/nbtree/nbtpage.c       | 244 ++++++-
 src/backend/access/nbtree/nbtree.c        | 171 ++++-
 src/backend/access/nbtree/nbtsearch.c     | 271 ++++++-
 src/backend/access/nbtree/nbtsort.c       | 191 ++++-
 src/backend/access/nbtree/nbtsplitloc.c   |  39 +-
 src/backend/access/nbtree/nbtutils.c      | 226 +++++-
 src/backend/access/nbtree/nbtxlog.c       | 268 ++++++-
 src/backend/access/rmgrdesc/nbtdesc.c     |  22 +-
 src/backend/storage/page/bufpage.c        |   9 +-
 src/bin/psql/tab-complete.c               |   4 +-
 contrib/amcheck/verify_nbtree.c           | 231 ++++--
 doc/src/sgml/btree.sgml                   | 136 +++-
 doc/src/sgml/charset.sgml                 |   9 +-
 doc/src/sgml/ref/create_index.sgml        |  37 +-
 src/test/regress/expected/btree_index.out |  20 +-
 src/test/regress/sql/btree_index.sql      |  22 +-
 25 files changed, 3486 insertions(+), 325 deletions(-)
 create mode 100644 src/backend/access/nbtree/nbtdedup.c

diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 20ace69dab..672bc5d8b4 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -108,6 +108,7 @@ typedef struct BTMetaPageData
 										 * pages */
 	float8		btm_last_cleanup_num_heap_tuples;	/* number of heap tuples
 													 * during last cleanup */
+	bool		btm_safededup;	/* deduplication known to be safe? */
 } BTMetaPageData;
 
 #define BTPageGetMeta(p) \
@@ -124,6 +125,13 @@ typedef struct BTMetaPageData
  * need to be immediately re-indexed at pg_upgrade.  In order to get the
  * new heapkeyspace semantics, however, a REINDEX is needed.
  *
+ * Deduplication is safe to use when the btm_safededup field is set to
+ * true.  It's safe to read the btm_safededup field on version 3, but only
+ * version 4 indexes make use of deduplication.  Even version 4 indexes
+ * created on PostgreSQL v12 will need a REINDEX to make use of
+ * deduplication, though, since there is no other way to set btm_safededup
+ * to true (pg_upgrade hasn't been taught to set the metapage field).
+ *
  * Btree version 2 is mostly the same as version 3.  There are two new
  * fields in the metapage that were introduced in version 3.  A version 2
  * metapage will be automatically upgraded to version 3 on the first
@@ -156,6 +164,21 @@ typedef struct BTMetaPageData
 				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
 				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
 
+/*
+ * MaxTIDsPerBTreePage is an upper bound on the number of heap TIDs tuples
+ * that may be stored on a btree leaf page.  It is used to size the
+ * per-page temporary buffers used by index scans.)
+ *
+ * Note: we don't bother considering per-tuple overheads here to keep
+ * things simple (value is based on how many elements a single array of
+ * heap TIDs must have to fill the space between the page header and
+ * special area).  The value is slightly higher (i.e. more conservative)
+ * than necessary as a result, which is considered acceptable.
+ */
+#define MaxTIDsPerBTreePage \
+	(int) ((BLCKSZ - SizeOfPageHeaderData - sizeof(BTPageOpaqueData)) / \
+		   sizeof(ItemPointerData))
+
 /*
  * The leaf-page fillfactor defaults to 90% but is user-adjustable.
  * For pages above the leaf level, we use a fixed 70% fillfactor.
@@ -230,16 +253,15 @@ typedef struct BTMetaPageData
  * tuples (non-pivot tuples).  _bt_check_natts() enforces the rules
  * described here.
  *
- * Non-pivot tuple format:
+ * Non-pivot tuple format (plain/non-posting variant):
  *
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
- * All other types of index tuples ("pivot" tuples) only have key columns,
- * since pivot tuples only exist to represent how the key space is
+ * Non-pivot tuples complement pivot tuples, which only have key columns.
+ * The sole purpose of pivot tuples is to represent how the key space is
  * separated.  In general, any B-Tree index that has more than one level
  * (i.e. any index that does not just consist of a metapage and a single
  * leaf root page) must have some number of pivot tuples, since pivot
@@ -264,7 +286,8 @@ typedef struct BTMetaPageData
  * INDEX_ALT_TID_MASK bit is set, which doesn't count the trailing heap
  * TID column sometimes stored in pivot tuples -- that's represented by
  * the presence of BT_PIVOT_HEAP_TID_ATTR.  The INDEX_ALT_TID_MASK bit in
- * t_info is always set on BTREE_VERSION 4 pivot tuples.
+ * t_info is always set on BTREE_VERSION 4 pivot tuples, since
+ * BTreeTupleIsPivot() must work reliably on heapkeyspace versions.
  *
  * In version 3 indexes, the INDEX_ALT_TID_MASK flag might not be set in
  * pivot tuples.  In that case, the number of key columns is implicitly
@@ -279,90 +302,256 @@ typedef struct BTMetaPageData
  * The 12 least significant offset bits from t_tid are used to represent
  * the number of columns in INDEX_ALT_TID_MASK tuples, leaving 4 status
  * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
- * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
- * number of columns/attributes <= INDEX_MAX_KEYS.
+ * future use.  BT_OFFSET_MASK should be large enough to store any number
+ * of columns/attributes <= INDEX_MAX_KEYS.
+ *
+ * Sometimes non-pivot tuples also use a representation that repurposes
+ * t_tid to store metadata rather than a TID.  PostgreSQL v13 introduced a
+ * new non-pivot tuple format to support deduplication: posting list
+ * tuples.  Deduplication merges together multiple equal non-pivot tuples
+ * into a logically equivalent, space efficient representation.  A posting
+ * list is an array of ItemPointerData elements.  Non-pivot tuples are
+ * merged together to form posting list tuples lazily, at the point where
+ * we'd otherwise have to split a leaf page.
+ *
+ * Posting tuple format (alternative non-pivot tuple representation):
+ *
+ *  t_tid | t_info | key values | posting list (TID array)
+ *
+ * Posting list tuples are recognized as such by having the
+ * INDEX_ALT_TID_MASK status bit set in t_info and the BT_IS_POSTING status
+ * bit set in t_tid.  These flags redefine the content of the posting
+ * tuple's t_tid to store an offset to the posting list, as well as the
+ * total number of posting list array elements.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items present in the tuple, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.  Like any non-pivot tuple, the number of columns stored is
+ * always implicitly the total number in the index (in practice there can
+ * never be non-key columns stored, since deduplication is not supported
+ * with INCLUDE indexes).  BT_OFFSET_MASK should be large enough to store
+ * any number of posting list TIDs that might be present in a tuple (since
+ * tuple size is subject to the INDEX_SIZE_MASK limit).
  *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
-#define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_OFFSET_MASK				0x0FFF
 #define BT_PIVOT_HEAP_TID_ATTR		0x1000
-
-/* Get/set downlink block number in pivot tuple */
-#define BTreeTupleGetDownLink(itup) \
-	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
-#define BTreeTupleSetDownLink(itup, blkno) \
-	ItemPointerSetBlockNumber(&((itup)->t_tid), (blkno))
+#define BT_IS_POSTING				0x2000
 
 /*
- * Get/set leaf page highkey's link. During the second phase of deletion, the
- * target leaf page's high key may point to an ancestor page (at all other
- * times, the leaf level high key's link is not used).  See the nbtree README
- * for full details.
+ * Note: BTreeTupleIsPivot() can have false negatives (but not false
+ * positives) when used with !heapkeyspace indexes
  */
-#define BTreeTupleGetTopParent(itup) \
-	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
-#define BTreeTupleSetTopParent(itup, blkno)	\
-	do { \
-		ItemPointerSetBlockNumber(&((itup)->t_tid), (blkno)); \
-		BTreeTupleSetNAtts((itup), 0); \
-	} while(0)
+static inline bool
+BTreeTupleIsPivot(IndexTuple itup)
+{
+	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+		return false;
+	/* absence of BT_IS_POSTING in offset number indicates pivot tuple */
+	if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) & BT_IS_POSTING) != 0)
+		return false;
+
+	return true;
+}
+
+static inline bool
+BTreeTupleIsPosting(IndexTuple itup)
+{
+	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+		return false;
+	/* presence of BT_IS_POSTING in offset number indicates posting tuple */
+	if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) & BT_IS_POSTING) == 0)
+		return false;
+
+	return true;
+}
+
+static inline void
+BTreeTupleSetPosting(IndexTuple itup, int nhtids, int postingoffset)
+{
+	Assert(nhtids > 1 && (nhtids & BT_OFFSET_MASK) == nhtids);
+	Assert(postingoffset == MAXALIGN(postingoffset));
+	Assert(postingoffset < INDEX_SIZE_MASK);
+
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	ItemPointerSetOffsetNumber(&itup->t_tid, (nhtids | BT_IS_POSTING));
+	ItemPointerSetBlockNumber(&itup->t_tid, postingoffset);
+}
+
+static inline uint16
+BTreeTupleGetNPosting(IndexTuple posting)
+{
+	OffsetNumber existing;
+
+	Assert(BTreeTupleIsPosting(posting));
+
+	existing = ItemPointerGetOffsetNumberNoCheck(&posting->t_tid);
+	return (existing & BT_OFFSET_MASK);
+}
+
+static inline uint32
+BTreeTupleGetPostingOffset(IndexTuple posting)
+{
+	Assert(BTreeTupleIsPosting(posting));
+
+	return ItemPointerGetBlockNumberNoCheck(&posting->t_tid);
+}
+
+static inline ItemPointer
+BTreeTupleGetPosting(IndexTuple posting)
+{
+	return (ItemPointer) ((char *) posting +
+						  BTreeTupleGetPostingOffset(posting));
+}
+
+static inline ItemPointer
+BTreeTupleGetPostingN(IndexTuple posting, int n)
+{
+	return BTreeTupleGetPosting(posting) + n;
+}
 
 /*
- * Get/set number of attributes within B-tree index tuple.
+ * Get/set downlink block number in pivot tuple.
+ *
+ * Note: Cannot assert that tuple is a pivot tuple.  If we did so then
+ * !heapkeyspace indexes would exhibit false positive assertion failures.
+ */
+static inline BlockNumber
+BTreeTupleGetDownLink(IndexTuple pivot)
+{
+	return ItemPointerGetBlockNumberNoCheck(&pivot->t_tid);
+}
+
+static inline void
+BTreeTupleSetDownLink(IndexTuple pivot, BlockNumber blkno)
+{
+	ItemPointerSetBlockNumber(&pivot->t_tid, blkno);
+}
+
+/*
+ * Get number of attributes within tuple.
  *
  * Note that this does not include an implicit tiebreaker heap TID
  * attribute, if any.  Note also that the number of key attributes must be
  * explicitly represented in all heapkeyspace pivot tuples.
+ *
+ * Note: This is defined as a macro rather than an inline function to
+ * avoid including rel.h.
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		(BTreeTupleIsPivot(itup)) ? \
 		( \
-			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
+			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_OFFSET_MASK \
 		) \
 		: \
 		IndexRelationGetNumberOfAttributes(rel) \
 	)
-#define BTreeTupleSetNAtts(itup, n) \
-	do { \
-		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
-	} while(0)
 
 /*
- * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Set number of attributes in tuple, making it into a pivot tuple
  */
-#define BTreeTupleGetHeapTID(itup) \
-	( \
-	  (itup)->t_info & INDEX_ALT_TID_MASK && \
-	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_PIVOT_HEAP_TID_ATTR) != 0 ? \
-	  ( \
-		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
-					   sizeof(ItemPointerData)) \
-	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
-	)
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int natts)
+{
+	Assert(natts <= INDEX_MAX_KEYS);
+
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	/* BT_IS_POSTING bit may be unset -- tuple always becomes a pivot tuple */
+	ItemPointerSetOffsetNumber(&itup->t_tid, natts);
+	Assert(BTreeTupleIsPivot(itup));
+}
+
 /*
- * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * Set the bit indicating heap TID attribute present in pivot tuple
  */
-#define BTreeTupleSetAltHeapTID(itup) \
-	do { \
-		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
-								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_PIVOT_HEAP_TID_ATTR); \
-	} while(0)
+static inline void
+BTreeTupleSetAltHeapTID(IndexTuple pivot)
+{
+	OffsetNumber existing;
+
+	Assert(BTreeTupleIsPivot(pivot));
+
+	existing = ItemPointerGetOffsetNumberNoCheck(&pivot->t_tid);
+	ItemPointerSetOffsetNumber(&pivot->t_tid,
+							   existing | BT_PIVOT_HEAP_TID_ATTR);
+}
+
+/*
+ * Get/set leaf page's "top parent" link from its high key.  Used during page
+ * deletion.
+ *
+ * Note: Cannot assert that tuple is a pivot tuple.  If we did so then
+ * !heapkeyspace indexes would exhibit false positive assertion failures.
+ */
+static inline BlockNumber
+BTreeTupleGetTopParent(IndexTuple leafhikey)
+{
+	return ItemPointerGetBlockNumberNoCheck(&leafhikey->t_tid);
+}
+
+static inline void
+BTreeTupleSetTopParent(IndexTuple leafhikey, BlockNumber blkno)
+{
+	ItemPointerSetBlockNumber(&leafhikey->t_tid, blkno);
+	BTreeTupleSetNAtts(leafhikey, 0);
+}
+
+/*
+ * Get tiebreaker heap TID attribute, if any.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
+ */
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+	if (BTreeTupleIsPivot(itup))
+	{
+		/* Pivot tuple heap TID representation? */
+		if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+			 BT_PIVOT_HEAP_TID_ATTR) != 0)
+			return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+								  sizeof(ItemPointerData));
+
+		/* Heap TID attribute was truncated */
+		return NULL;
+	}
+	else if (BTreeTupleIsPosting(itup))
+		return BTreeTupleGetPosting(itup);
+
+	return &itup->t_tid;
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list tuple.
+ *
+ * Works with non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxHeapTID(IndexTuple itup)
+{
+	Assert(!BTreeTupleIsPivot(itup));
+
+	if (BTreeTupleIsPosting(itup))
+	{
+		uint16		nposting = BTreeTupleGetNPosting(itup);
+
+		return BTreeTupleGetPostingN(itup, nposting - 1);
+	}
+
+	return &itup->t_tid;
+}
 
 /*
  *	Operator strategy numbers for B-tree have been moved to access/stratnum.h,
@@ -434,6 +623,9 @@ typedef BTStackData *BTStack;
  * indexes whose version is >= version 4.  It's convenient to keep this close
  * by, rather than accessing the metapage repeatedly.
  *
+ * safededup is set to indicate that index may use deduplication safely.
+ * This is also a property of the index relation rather than an indexscan.
+ *
  * anynullkeys indicates if any of the keys had NULL value when scankey was
  * built from index tuple (note that already-truncated tuple key attributes
  * set NULL as a placeholder key value, which also affects value of
@@ -469,6 +661,7 @@ typedef BTStackData *BTStack;
 typedef struct BTScanInsertData
 {
 	bool		heapkeyspace;
+	bool		safededup;
 	bool		anynullkeys;
 	bool		nextkey;
 	bool		pivotsearch;
@@ -507,10 +700,94 @@ typedef struct BTInsertStateData
 	bool		bounds_valid;
 	OffsetNumber low;
 	OffsetNumber stricthigh;
+
+	/*
+	 * if _bt_binsrch_insert found the location inside existing posting list,
+	 * save the position inside the list.  -1 sentinel value indicates overlap
+	 * with an existing posting list tuple that has its LP_DEAD bit set.
+	 */
+	int			postingoff;
 } BTInsertStateData;
 
 typedef BTInsertStateData *BTInsertState;
 
+/*
+ * State used to representing an individual pending tuple during
+ * deduplication.
+ */
+typedef struct BTDedupInterval
+{
+	OffsetNumber baseoff;
+	uint16		nitems;
+} BTDedupInterval;
+
+/*
+ * BTDedupStateData is a working area used during deduplication.
+ *
+ * The status info fields track the state of a whole-page deduplication pass.
+ * State about the current pending posting list is also tracked.
+ *
+ * A pending posting list is comprised of a contiguous group of equal items
+ * from the page, starting from page offset number 'baseoff'.  This is the
+ * offset number of the "base" tuple for new posting list.  'nitems' is the
+ * current total number of existing items from the page that will be merged to
+ * make a new posting list tuple, including the base tuple item.  (Existing
+ * items may themselves be posting list tuples, or regular non-pivot tuples.)
+ *
+ * The total size of the existing tuples to be freed when pending posting list
+ * is processed gets tracked by 'phystupsize'.  This information allows
+ * deduplication to calculate the space saving for each new posting list
+ * tuple, and for the entire pass over the page as a whole.
+ */
+typedef struct BTDedupStateData
+{
+	/* Deduplication status info for entire pass over page */
+	bool		deduplicate;	/* Still deduplicating page? */
+	Size		maxpostingsize; /* Limit on size of final tuple */
+
+	/* Metadata about base tuple of current pending posting list */
+	IndexTuple	base;			/* Use to form new posting list */
+	OffsetNumber baseoff;		/* page offset of base */
+	Size		basetupsize;	/* base size without original posting list */
+
+	/* Other metadata about pending posting list */
+	ItemPointer htids;			/* Heap TIDs in pending posting list */
+	int			nhtids;			/* Number of heap TIDs in nhtids array */
+	int			nitems;			/* Number of existing tuples/line pointers */
+	Size		phystupsize;	/* Includes line pointer overhead */
+
+	/*
+	 * Array of tuples to go on new version of the page.  Contains one entry
+	 * for each group of consecutive items.  Note that existing tuples that
+	 * will not become posting list tuples do not appear in the array (they
+	 * are implicitly unchanged by deduplication pass).
+	 */
+	int			nintervals;		/* current size of intervals array */
+	BTDedupInterval intervals[MaxIndexTuplesPerPage];
+} BTDedupStateData;
+
+typedef BTDedupStateData *BTDedupState;
+
+/*
+ * BTVacuumPostingData is state that represents how to VACUUM a posting list
+ * tuple when some (though not all) of its TIDs are to be deleted.
+ *
+ * Convention is that itup field is the original posting list tuple on input,
+ * and palloc()'d final tuple used to overwrite existing tuple on output.
+ */
+typedef struct BTVacuumPostingData
+{
+	/* Tuple that will be/was updated */
+	IndexTuple	itup;
+	OffsetNumber updatedoffset;
+
+	/* State needed to describe final itup in WAL */
+	uint16		ndeletedtids;
+	uint16		deletetids[FLEXIBLE_ARRAY_MEMBER];
+} BTVacuumPostingData;
+
+typedef BTVacuumPostingData *BTVacuumPosting;
+
 /*
  * BTScanOpaqueData is the btree-private state needed for an indexscan.
  * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
@@ -534,7 +811,9 @@ typedef BTInsertStateData *BTInsertState;
  * If we are doing an index-only scan, we save the entire IndexTuple for each
  * matched item, otherwise only its heap TID and offset.  The IndexTuples go
  * into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array.  Posting list tuples store a "base" tuple once,
+ * allowing the same key to be returned for each TID in the posting list
+ * tuple.
  */
 
 typedef struct BTScanPosItem	/* what we remember about each match */
@@ -578,7 +857,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxTIDsPerBTreePage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -686,6 +965,7 @@ typedef struct BTOptions
 	int			fillfactor;		/* page fill factor in percent (0..100) */
 	/* fraction of newly inserted tuples prior to trigger index cleanup */
 	float8		vacuum_cleanup_index_scale_factor;
+	bool		deduplicate_items;	/* Use deduplication if safe? */
 } BTOptions;
 
 #define BTGetFillFactor(relation) \
@@ -696,6 +976,11 @@ typedef struct BTOptions
 	 BTREE_DEFAULT_FILLFACTOR)
 #define BTGetTargetPageFreeSpace(relation) \
 	(BLCKSZ * (100 - BTGetFillFactor(relation)) / 100)
+#define BTGetDeduplicateItems(relation) \
+	(AssertMacro(relation->rd_rel->relkind == RELKIND_INDEX && \
+				 relation->rd_rel->relam == BTREE_AM_OID), \
+	((relation)->rd_options ? \
+	 ((BTOptions *) (relation)->rd_options)->deduplicate_items : true))
 
 /*
  * Constant definition for progress reporting.  Phase numbers must match
@@ -742,6 +1027,22 @@ extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
 extern void _bt_parallel_done(IndexScanDesc scan);
 extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
 
+/*
+ * prototypes for functions in nbtdedup.c
+ */
+extern void _bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
+							   IndexTuple newitem, Size newitemsz,
+							   bool checkingunique);
+extern void _bt_dedup_start_pending(BTDedupState state, IndexTuple base,
+									OffsetNumber baseoff);
+extern bool _bt_dedup_save_htid(BTDedupState state, IndexTuple itup);
+extern Size _bt_dedup_finish_pending(Page newpage, BTDedupState state);
+extern IndexTuple _bt_form_posting(IndexTuple base, ItemPointer htids,
+								   int nhtids);
+extern void _bt_update_posting(BTVacuumPosting vacposting);
+extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
+								   int postingoff);
+
 /*
  * prototypes for functions in nbtinsert.c
  */
@@ -760,14 +1061,16 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
 /*
  * prototypes for functions in nbtpage.c
  */
-extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level);
+extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+							 bool safededup);
 extern void _bt_update_meta_cleanup_info(Relation rel,
 										 TransactionId oldestBtpoXact, float8 numHeapTuples);
 extern void _bt_upgrademetapage(Page page);
 extern Buffer _bt_getroot(Relation rel, int access);
 extern Buffer _bt_gettrueroot(Relation rel);
 extern int	_bt_getrootheight(Relation rel);
-extern bool _bt_heapkeyspace(Relation rel);
+extern void _bt_metaversion(Relation rel, bool *heapkeyspace,
+							bool *safededup);
 extern void _bt_checkpage(Relation rel, Buffer buf);
 extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
 extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -776,7 +1079,8 @@ extern void _bt_relbuf(Relation rel, Buffer buf);
 extern void _bt_pageinit(Page page, Size size);
 extern bool _bt_page_recyclable(Page page);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
-								OffsetNumber *deletable, int ndeletable);
+								OffsetNumber *deletable, int ndeletable,
+								BTVacuumPosting *updatable, int nupdatable);
 extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *deletable, int ndeletable,
 								Relation heapRel);
@@ -829,6 +1133,7 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern bool _bt_opclasses_support_dedup(Relation index);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 776a9bd723..3d113c511e 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP		0x50	/* deduplicate tuples on leaf page */
+#define XLOG_BTREE_INSERT_POST	0x60	/* add index tuple with posting split */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -53,21 +54,34 @@ typedef struct xl_btree_metadata
 	uint32		fastlevel;
 	TransactionId oldest_btpo_xact;
 	float8		last_cleanup_num_heap_tuples;
+	bool		safededup;
 } xl_btree_metadata;
 
 /*
  * This is what we need to know about simple (without split) insert.
  *
- * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
- * Note that INSERT_META implies it's not a leaf page.
+ * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META, and
+ * INSERT_POST.  Note that INSERT_META and INSERT_UPPER implies it's not a
+ * leaf page, while INSERT_POST and INSERT_LEAF imply that it must be a leaf
+ * page.
  *
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page
  * Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
  * Backup Blk 2: xl_btree_metadata, if INSERT_META
+ *
+ * Note: The new tuple is actually the "original" new item in the posting
+ * list split insert case (i.e. the INSERT_POST case).  A split offset for
+ * the posting list is logged before the original new item.  Recovery needs
+ * both, since it must do an in-place update of the existing posting list
+ * that was split as an extra step.  Also, recovery generates a "final"
+ * newitem.  See _bt_swap_posting() for details on posting list splits.
  */
 typedef struct xl_btree_insert
 {
 	OffsetNumber offnum;
+
+	/* POSTING SPLIT OFFSET FOLLOWS (INSERT_POST case) */
+	/* NEW TUPLE ALWAYS FOLLOWS AT THE END */
 } xl_btree_insert;
 
 #define SizeOfBtreeInsert	(offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
@@ -92,8 +106,37 @@ typedef struct xl_btree_insert
  * Backup Blk 0: original page / new left page
  *
  * The left page's data portion contains the new item, if it's the _L variant.
- * An IndexTuple representing the high key of the left page must follow with
- * either variant.
+ * _R variant split records generally do not have a newitem (_R variant leaf
+ * page split records that must deal with a posting list split will include an
+ * explicit newitem, though it is never used on the right page -- it is
+ * actually an orignewitem needed to update existing posting list).  The new
+ * high key of the left/original page appears last of all (and must always be
+ * present).
+ *
+ * Page split records that need the REDO routine to deal with a posting list
+ * split directly will have an explicit newitem, which is actually an
+ * orignewitem (the newitem as it was before the posting list split, not
+ * after).  A posting list split always has a newitem that comes immediately
+ * after the posting list being split (which would have overlapped with
+ * orignewitem prior to split).  Usually REDO must deal with posting list
+ * splits with an _L variant page split record, and usually both the new
+ * posting list and the final newitem go on the left page (the existing
+ * posting list will be inserted instead of the old, and the final newitem
+ * will be inserted next to that).  However, _R variant split records will
+ * include an orignewitem when the split point for the page happens to have a
+ * lastleft tuple that is also the posting list being split (leaving newitem
+ * as the page split's firstright tuple).  The existence of this corner case
+ * does not change the basic fact about newitem/orignewitem for the REDO
+ * routine: it is always state used for the left page alone.  (This is why the
+ * record's postingoff field isn't a reliable indicator of whether or not a
+ * posting list split occurred during the page split; a non-zero value merely
+ * indicates that the REDO routine must reconstruct a new posting list tuple
+ * that is needed for the left page.)
+ *
+ * This posting list split handling is equivalent to the xl_btree_insert REDO
+ * routine's INSERT_POST handling.  While the details are more complicated
+ * here, the concept and goals are exactly the same.  See _bt_swap_posting()
+ * for details on posting list splits.
  *
  * Backup Blk 1: new right page
  *
@@ -111,15 +154,33 @@ typedef struct xl_btree_split
 {
 	uint32		level;			/* tree level of page being split */
 	OffsetNumber firstright;	/* first item moved to right page */
-	OffsetNumber newitemoff;	/* new item's offset (useful for _L variant) */
+	OffsetNumber newitemoff;	/* new item's offset */
+	uint16		postingoff;		/* offset inside orig posting tuple */
 } xl_btree_split;
 
-#define SizeOfBtreeSplit	(offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit	(offsetof(xl_btree_split, postingoff) + sizeof(uint16))
+
+/*
+ * When page is deduplicated, consecutive groups of tuples with equal keys are
+ * merged together into posting list tuples.
+ *
+ * The WAL record represents a deduplication pass for a leaf page.  An array
+ * of BTDedupInterval structs follows.
+ */
+typedef struct xl_btree_dedup
+{
+	uint16		nintervals;
+
+	/* DEDUPLICATION INTERVALS FOLLOW */
+} xl_btree_dedup;
+
+#define SizeOfBtreeDedup 	(offsetof(xl_btree_dedup, nintervals) + sizeof(uint16))
 
 /*
  * This is what we need to know about delete of individual leaf index tuples.
  * The WAL record can represent deletion of any number of index tuples on a
- * single index page when *not* executed by VACUUM.
+ * single index page when *not* executed by VACUUM.  Deletion of a subset of
+ * the TIDs within a posting list tuple is not supported.
  *
  * Backup Blk 0: index page
  */
@@ -150,21 +211,43 @@ typedef struct xl_btree_reuse_page
 #define SizeOfBtreeReusePage	(sizeof(xl_btree_reuse_page))
 
 /*
- * This is what we need to know about vacuum of individual leaf index tuples.
- * The WAL record can represent deletion of any number of index tuples on a
- * single index page when executed by VACUUM.
+ * This is what we need to know about which TIDs to remove from an individual
+ * posting list tuple during vacuuming.  An array of these may appear at the
+ * end of xl_btree_vacuum records.
+ */
+typedef struct xl_btree_update
+{
+	uint16		ndeletedtids;
+
+	/* POSTING LIST uint16 OFFSETS TO A DELETED TID FOLLOW */
+} xl_btree_update;
+
+#define SizeOfBtreeUpdate	(offsetof(xl_btree_update, ndeletedtids) + sizeof(uint16))
+
+/*
+ * This is what we need to know about a VACUUM of a leaf page.  The WAL record
+ * can represent deletion of any number of index tuples on a single index page
+ * when executed by VACUUM.  It can also support "updates" of index tuples,
+ * which is how deletes of a subset of TIDs contained in an existing posting
+ * list tuple are implemented. (Updates are only used when there will be some
+ * remaining TIDs once VACUUM finishes; otherwise the posting list tuple can
+ * just be deleted).
  *
- * Note that the WAL record in any vacuum of an index must have at least one
- * item to delete.
+ * Updated posting list tuples are represented using xl_btree_update metadata.
+ * The REDO routine uses each xl_btree_update (plus its corresponding original
+ * index tuple from the target leaf page) to generate the final updated tuple.
  */
 typedef struct xl_btree_vacuum
 {
-	uint32		ndeleted;
+	uint16		ndeleted;
+	uint16		nupdated;
 
 	/* DELETED TARGET OFFSET NUMBERS FOLLOW */
+	/* UPDATED TARGET OFFSET NUMBERS FOLLOW */
+	/* UPDATED TUPLES METADATA ARRAY FOLLOWS */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, nupdated) + sizeof(uint16))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.
@@ -245,6 +328,8 @@ typedef struct xl_btree_newroot
 extern void btree_redo(XLogReaderState *record);
 extern void btree_desc(StringInfo buf, XLogReaderState *record);
 extern const char *btree_identify(uint8 info);
+extern void btree_xlog_startup(void);
+extern void btree_xlog_cleanup(void);
 extern void btree_mask(char *pagedata, BlockNumber blkno);
 
 #endif							/* NBTXLOG_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index c88dccfb8d..6c15df7e70 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -36,7 +36,7 @@ PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL,
 PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
 PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
 PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask)
 PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask)
 PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
 PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 79430d2b7b..f2b03a6cfc 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -158,6 +158,15 @@ static relopt_bool boolRelOpts[] =
 		},
 		true
 	},
+	{
+		{
+			"deduplicate_items",
+			"Enables deduplication on btree index leaf pages",
+			RELOPT_KIND_BTREE,
+			ShareUpdateExclusiveLock
+		},
+		true
+	},
 	/* list terminator */
 	{{NULL}}
 };
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index c16eb05416..dfba5ae39a 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -276,6 +276,10 @@ BuildIndexValueDescription(Relation indexRelation,
 /*
  * Get the latestRemovedXid from the table entries pointed at by the index
  * tuples being deleted.
+ *
+ * Note: index access methods that don't consistently use the standard
+ * IndexTuple + heap TID item pointer representation will need to provide
+ * their own version of this function.
  */
 TransactionId
 index_compute_xid_horizon_for_tuples(Relation irel,
diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index bf245f5dab..d69808e78c 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = \
 	nbtcompare.o \
+	nbtdedup.o \
 	nbtinsert.o \
 	nbtpage.o \
 	nbtree.o \
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index c60a4d0d9e..6499f5adb7 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
 like a hint bit for a heap tuple), but physically removing tuples requires
 exclusive lock.  In the current code we try to remove LP_DEAD tuples when
 we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already).  Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every table TID within the posting list is known dead.)
 
 This leaves the index in a state where it has no entry for a dead tuple
 that still exists in the heap.  This is not a problem for the current
@@ -726,6 +729,134 @@ if it must.  When a page that's already full of duplicates must be split,
 the fallback strategy assumes that duplicates are mostly inserted in
 ascending heap TID order.  The page is split in a way that leaves the left
 half of the page mostly full, and the right half of the page mostly empty.
+The overall effect is that leaf page splits gracefully adapt to inserts of
+large groups of duplicates, maximizing space utilization.  Note also that
+"trapping" large groups of duplicates on the same leaf page like this makes
+deduplication more efficient.  Deduplication can be performed infrequently,
+without merging together existing posting list tuples too often.
+
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid (or at least delay) page splits.  Note that the
+goals for deduplication in unique indexes are rather different; see later
+section for details.  Deduplication alters the physical representation of
+tuples without changing the logical contents of the index, and without
+adding overhead to read queries.  Non-pivot tuples are merged together
+into a single physical tuple with a posting list (a simple array of heap
+TIDs with the standard item pointer format).  Deduplication is always
+applied lazily, at the point where it would otherwise be necessary to
+perform a page split.  It occurs only when LP_DEAD items have been
+removed, as our last line of defense against splitting a leaf page.  We
+can set the LP_DEAD bit with posting list tuples, though only when all
+TIDs are known dead.
+
+Our lazy approach to deduplication allows the page space accounting used
+during page splits to have absolutely minimal special case logic for
+posting lists.  Posting lists can be thought of as extra payload that
+suffix truncation will reliably truncate away as needed during page
+splits, just like non-key columns from an INCLUDE index tuple.
+Incoming/new tuples can generally be treated as non-overlapping plain
+items (though see section on posting list splits for information about how
+overlapping new/incoming items are really handled).
+
+The representation of posting lists is almost identical to the posting
+lists used by GIN, so it would be straightforward to apply GIN's varbyte
+encoding compression scheme to individual posting lists.  Posting list
+compression would break the assumptions made by posting list splits about
+page space accounting (see later section), so it's not clear how
+compression could be integrated with nbtree.  Besides, posting list
+compression does not offer a compelling trade-off for nbtree, since in
+general nbtree is optimized for consistent performance with many
+concurrent readers and writers.
+
+A major goal of our lazy approach to deduplication is to limit the
+performance impact of deduplication with random updates.  Even concurrent
+append-only inserts of the same key value will tend to have inserts of
+individual index tuples in an order that doesn't quite match heap TID
+order.  Delaying deduplication minimizes page level fragmentation.
+
+Deduplication in unique indexes
+-------------------------------
+
+Very often, the range of values that can be placed on a given leaf page in
+a unique index is fixed and permanent.  For example, a primary key on an
+identity column will usually only have page splits caused by the insertion
+of new logical rows within the rightmost leaf page.  If there is a split
+of a non-rightmost leaf page, then the split must have been triggered by
+inserts associated with an UPDATE of an existing logical row.  Splitting a
+leaf page purely to store multiple versions should be considered
+pathological, since it permanently degrades the index structure in order
+to absorb a temporary burst of duplicates.  Deduplication in unique
+indexes helps to prevent these pathological page splits.  Storing
+duplicates in a space efficient manner is not the goal, since in the long
+run there won't be any duplicates anyway.  Rather, we're buying time for
+standard garbage collection mechanisms to run before a page split is
+needed.
+
+Unique index leaf pages only get a deduplication pass when an insertion
+(that might have to split the page) observed an existing duplicate on the
+page in passing.  This is based on the assumption that deduplication will
+only work out when _all_ new insertions are duplicates from UPDATEs.  This
+may mean that we miss an opportunity to delay a page split, but that's
+okay because our ultimate goal is to delay leaf page splits _indefinitely_
+(i.e. to prevent them altogether).  There is little point in trying to
+delay a split that is probably inevitable anyway.  This allows us to avoid
+the overhead of attempting to deduplicate with unique indexes that always
+have few or no duplicates.
+
+Posting list splits
+-------------------
+
+When the incoming tuple happens to overlap with an existing posting list,
+a posting list split is performed.  Like a page split, a posting list
+split resolves a situation where a new/incoming item "won't fit", while
+inserting the incoming item in passing (i.e. as part of the same atomic
+action).  It's possible (though not particularly likely) that an insert of
+a new item on to an almost-full page will overlap with a posting list,
+resulting in both a posting list split and a page split.  Even then, the
+atomic action that splits the posting list also inserts the new item
+(since page splits always insert the new item in passing).  Including the
+posting list split in the same atomic action as the insert avoids problems
+caused by concurrent inserts into the same posting list --  the exact
+details of how we change the posting list depend upon the new item, and
+vice-versa.  A single atomic action also minimizes the volume of extra
+WAL required for a posting list split, since we don't have to explicitly
+WAL-log the original posting list tuple.
+
+Despite piggy-backing on the same atomic action that inserts a new tuple,
+posting list splits can be thought of as a separate, extra action to the
+insert itself (or to the page split itself).  Posting list splits
+conceptually "rewrite" an insert that overlaps with an existing posting
+list into an insert that adds its final new item just to the right of the
+posting list instead.  The size of the posting list won't change, and so
+page space accounting code does not need to care about posting list splits
+at all.  This is an important upside of our design; the page split point
+choice logic is very subtle even without it needing to deal with posting
+list splits.
+
+Only a few isolated extra steps are required to preserve the illusion that
+the new item never overlapped with an existing posting list in the first
+place: the heap TID of the incoming tuple is swapped with the rightmost/max
+heap TID from the existing/originally overlapping posting list.  Also, the
+posting-split-with-page-split case must generate a new high key based on
+an imaginary version of the original page that has both the final new item
+and the after-list-split posting tuple (page splits usually just operate
+against an imaginary version that contains the new item/item that won't
+fit).
+
+This approach avoids inventing an "eager" atomic posting split operation
+that splits the posting list without simultaneously finishing the insert
+of the incoming item.  This alternative design might seem cleaner, but it
+creates subtle problems for page space accounting.  In general, there
+might not be enough free space on the page to split a posting list such
+that the incoming/new item no longer overlaps with either posting list
+half --- the operation could fail before the actual retail insert of the
+new item even begins.  We'd end up having to handle posting list splits
+that need a page split anyway.  Besides, supporting variable "split points"
+while splitting posting lists won't actually improve overall space
+utilization.
 
 Notes About Data Representation
 -------------------------------
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
new file mode 100644
index 0000000000..7ef5de1f4f
--- /dev/null
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -0,0 +1,823 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtdedup.c
+ *	  Deduplicate items in Postgres btrees.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/nbtree/nbtdedup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/nbtxlog.h"
+#include "miscadmin.h"
+#include "utils/rel.h"
+
+static bool _bt_do_singleval(Relation rel, Page page, BTDedupState state,
+							 OffsetNumber minoff, IndexTuple newitem);
+static void _bt_singleval_fillfactor(Page page, BTDedupState state,
+									 Size newitemsz);
+#ifdef USE_ASSERT_CHECKING
+static bool _bt_posting_valid(IndexTuple posting);
+#endif
+
+/*
+ * Try to deduplicate items to free at least enough space to avoid a page
+ * split.
+ *
+ * The general approach taken here is to perform as much deduplication as
+ * possible to free as much space as possible.  Note, however, that "single
+ * value" strategy is sometimes used for !checkingunique callers, in which
+ * case deduplication will leave a few tuples untouched at the end of the
+ * page.  The general idea is to prepare the page for an anticipated page
+ * split that uses nbtsplitloc.c's "single value" strategy to determine a
+ * split point.  (There is no reason to deduplicate items that will end up on
+ * the right half of the page after the anticipated page split; better to
+ * handle those if and when the anticipated right half page gets its own
+ * deduplication pass, following further inserts of duplicates.)
+ *
+ * This function should be called during insertion, when the page doesn't have
+ * enough space to fit an incoming newitem.  If the BTP_HAS_GARBAGE page flag
+ * was set, caller should have removed any LP_DEAD items by calling
+ * _bt_vacuum_one_page() before calling here.  We may still have to kill
+ * LP_DEAD items here when the page's BTP_HAS_GARBAGE hint is falsely unset,
+ * but that should be rare.  Also, _bt_vacuum_one_page() won't unset the
+ * BTP_HAS_GARBAGE flag when it finds no LP_DEAD items, so a successful
+ * deduplication pass will always clear it, just to keep things tidy.
+ */
+void
+_bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
+				   IndexTuple newitem, Size newitemsz, bool checkingunique)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buf);
+	BTPageOpaque opaque;
+	Page		newpage;
+	OffsetNumber deletable[MaxIndexTuplesPerPage];
+	BTDedupState state;
+	int			ndeletable = 0;
+	int			pagenitems = 0;
+	Size		pagesaving = 0;
+	bool		singlevalstrat = false;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+
+	/*
+	 * We can't assume that there are no LP_DEAD items.  For one thing, VACUUM
+	 * will clear the BTP_HAS_GARBAGE hint without reliably removing items
+	 * that are marked LP_DEAD.  We don't want to unnecessarily unset LP_DEAD
+	 * bits when deduplicating items.  Allowing it would be correct, though
+	 * wasteful.
+	 */
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	minoff = P_FIRSTDATAKEY(opaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+
+		if (ItemIdIsDead(itemid))
+			deletable[ndeletable++] = offnum;
+	}
+
+	if (ndeletable > 0)
+	{
+		_bt_delitems_delete(rel, buf, deletable, ndeletable, heapRel);
+
+		/*
+		 * Return when a split will be avoided.  This is equivalent to
+		 * avoiding a split using the usual _bt_vacuum_one_page() path.
+		 */
+		if (PageGetFreeSpace(page) >= newitemsz)
+			return;
+
+		/*
+		 * Reconsider number of items on page, in case _bt_delitems_delete()
+		 * managed to delete an item or two
+		 */
+		minoff = P_FIRSTDATAKEY(opaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+	}
+
+	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+	newitemsz += sizeof(ItemIdData);
+
+	/*
+	 * By here, it's clear that deduplication will definitely be attempted.
+	 * Initialize deduplication state.
+	 *
+	 * It would be possible for maxpostingsize (limit on posting list tuple
+	 * size) to be set to one third of the page.  However, it seems like a
+	 * good idea to limit the size of posting lists to one sixth of a page.
+	 * That ought to leave us with a good split point when pages full of
+	 * duplicates can be split several times.
+	 */
+	state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+	state->deduplicate = true;
+	state->maxpostingsize = Min(BTMaxItemSize(page) / 2, INDEX_SIZE_MASK);
+	/* Metadata about base tuple of current pending posting list */
+	state->base = NULL;
+	state->baseoff = InvalidOffsetNumber;
+	state->basetupsize = 0;
+	/* Metadata about current pending posting list TIDs */
+	state->htids = palloc(state->maxpostingsize);
+	state->nhtids = 0;
+	state->nitems = 0;
+	/* Size of all physical tuples to be replaced by pending posting list */
+	state->phystupsize = 0;
+	/* nintervals should be initialized to zero */
+	state->nintervals = 0;
+
+	/* Determine if "single value" strategy should be used */
+	if (!checkingunique)
+		singlevalstrat = _bt_do_singleval(rel, page, state, minoff, newitem);
+
+	/*
+	 * Deduplicate items from page, and write them to newpage.
+	 *
+	 * Copy the original page's LSN into newpage copy.  This will become the
+	 * updated version of the page.  We need this because XLogInsert will
+	 * examine the LSN and possibly dump it in a page image.
+	 */
+	newpage = PageGetTempPageCopySpecial(page);
+	PageSetLSN(newpage, PageGetLSN(page));
+
+	/* Copy high key, if any */
+	if (!P_RIGHTMOST(opaque))
+	{
+		ItemId		hitemid = PageGetItemId(page, P_HIKEY);
+		Size		hitemsz = ItemIdGetLength(hitemid);
+		IndexTuple	hitem = (IndexTuple) PageGetItem(page, hitemid);
+
+		if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to add highkey during deduplication");
+	}
+
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(!ItemIdIsDead(itemid));
+
+		if (offnum == minoff)
+		{
+			/*
+			 * No previous/base tuple for the data item -- use the data item
+			 * as base tuple of pending posting list
+			 */
+			_bt_dedup_start_pending(state, itup, offnum);
+		}
+		else if (state->deduplicate &&
+				 _bt_keep_natts_fast(rel, state->base, itup) > natts &&
+				 _bt_dedup_save_htid(state, itup))
+		{
+			/*
+			 * Tuple is equal to base tuple of pending posting list.  Heap
+			 * TID(s) for itup have been saved in state.
+			 */
+		}
+		else
+		{
+			/*
+			 * Tuple is not equal to pending posting list tuple, or
+			 * _bt_dedup_save_htid() opted to not merge current item into
+			 * pending posting list for some other reason (e.g., adding more
+			 * TIDs would have caused posting list to exceed current
+			 * maxpostingsize).
+			 *
+			 * If state contains pending posting list with more than one item,
+			 * form new posting tuple, and actually update the page.  Else
+			 * reset the state and move on without modifying the page.
+			 */
+			pagesaving += _bt_dedup_finish_pending(newpage, state);
+			pagenitems++;
+
+			if (singlevalstrat)
+			{
+				/*
+				 * Single value strategy's extra steps.
+				 *
+				 * Lower maxpostingsize for sixth and final item that might be
+				 * deduplicated by current deduplication pass.  When sixth
+				 * item formed/observed, stop deduplicating items.
+				 *
+				 * Note: It's possible that this will be reached even when
+				 * current deduplication pass has yet to merge together some
+				 * existing items.  It doesn't matter whether or not the
+				 * current call generated the maxpostingsize-capped duplicate
+				 * tuples at the start of the page.
+				 */
+				if (pagenitems == 5)
+					_bt_singleval_fillfactor(page, state, newitemsz);
+				else if (pagenitems == 6)
+				{
+					state->deduplicate = false;
+					singlevalstrat = false; /* won't be back here */
+				}
+			}
+
+			/* itup starts new pending posting list */
+			_bt_dedup_start_pending(state, itup, offnum);
+		}
+	}
+
+	/* Handle the last item */
+	pagesaving += _bt_dedup_finish_pending(newpage, state);
+	pagenitems++;
+
+	/*
+	 * If no items suitable for deduplication were found, newpage must be
+	 * exactly the same as the original page, so just return from function.
+	 */
+	if (state->nintervals == 0)
+	{
+		/* cannot leak memory here */
+		pfree(newpage);
+		pfree(state->htids);
+		pfree(state);
+		return;
+	}
+
+	/*
+	 * By here, it's clear that deduplication will definitely go ahead.
+	 *
+	 * Clear the BTP_HAS_GARBAGE gpage flag in the unlikely event that it is
+	 * still falsely set, just to keep things tidy.  (We can't rely on
+	 * _bt_vacuum_one_page() having done this already, and we can't rely on a
+	 * page split or VACUUM getting to it in the near future.)
+	 */
+	if (P_HAS_GARBAGE(opaque))
+	{
+		BTPageOpaque nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+	}
+
+	START_CRIT_SECTION();
+
+	PageRestoreTempPage(newpage, page);
+	MarkBufferDirty(buf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+		xl_btree_dedup xlrec_dedup;
+
+		xlrec_dedup.nintervals = state->nintervals;
+
+		XLogBeginInsert();
+		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+		XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+		/*
+		 * The intervals array is not in the buffer, but pretend that it is.
+		 * When XLogInsert stores the whole buffer, the array need not be
+		 * stored too.
+		 */
+		XLogRegisterBufData(0, (char *) state->intervals,
+							state->nintervals * sizeof(BTDedupInterval));
+
+		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP);
+
+		PageSetLSN(page, recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	/* Local space accounting should agree with page accounting */
+	Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
+
+	/* cannot leak memory here */
+	pfree(state->htids);
+	pfree(state);
+}
+
+/*
+ * Create a new pending posting list tuple based on caller's base tuple.
+ *
+ * Every tuple processed by deduplication either becomes the base tuple for a
+ * posting list, or gets its heap TID(s) accepted into a pending posting list.
+ * A tuple that starts out as the base tuple for a posting list will only
+ * actually be rewritten within _bt_dedup_finish_pending() when it turns out
+ * that there are duplicates that can be merged into the base tuple.
+ */
+void
+_bt_dedup_start_pending(BTDedupState state, IndexTuple base,
+						OffsetNumber baseoff)
+{
+	Assert(state->nhtids == 0);
+	Assert(state->nitems == 0);
+	Assert(!BTreeTupleIsPivot(base));
+
+	/*
+	 * Copy heap TID(s) from new base tuple for new candidate posting list
+	 * into working state's array
+	 */
+	if (!BTreeTupleIsPosting(base))
+	{
+		memcpy(state->htids, &base->t_tid, sizeof(ItemPointerData));
+		state->nhtids = 1;
+		state->basetupsize = IndexTupleSize(base);
+	}
+	else
+	{
+		int			nposting;
+
+		nposting = BTreeTupleGetNPosting(base);
+		memcpy(state->htids, BTreeTupleGetPosting(base),
+			   sizeof(ItemPointerData) * nposting);
+		state->nhtids = nposting;
+		/* basetupsize should not include existing posting list */
+		state->basetupsize = BTreeTupleGetPostingOffset(base);
+	}
+
+	/*
+	 * Save new base tuple itself -- it'll be needed if we actually create a
+	 * new posting list from new pending posting list.
+	 *
+	 * Must maintain physical size of all existing tuples (including line
+	 * pointer overhead) so that we can calculate space savings on page.
+	 */
+	state->nitems = 1;
+	state->base = base;
+	state->baseoff = baseoff;
+	state->phystupsize = MAXALIGN(IndexTupleSize(base)) + sizeof(ItemIdData);
+	/* Also save baseoff in pending state for interval */
+	state->intervals[state->nintervals].baseoff = state->baseoff;
+}
+
+/*
+ * Save itup heap TID(s) into pending posting list where possible.
+ *
+ * Returns bool indicating if the pending posting list managed by state now
+ * includes itup's heap TID(s).
+ */
+bool
+_bt_dedup_save_htid(BTDedupState state, IndexTuple itup)
+{
+	int			nhtids;
+	ItemPointer htids;
+	Size		mergedtupsz;
+
+	Assert(!BTreeTupleIsPivot(itup));
+
+	if (!BTreeTupleIsPosting(itup))
+	{
+		nhtids = 1;
+		htids = &itup->t_tid;
+	}
+	else
+	{
+		nhtids = BTreeTupleGetNPosting(itup);
+		htids = BTreeTupleGetPosting(itup);
+	}
+
+	/*
+	 * Don't append (have caller finish pending posting list as-is) if
+	 * appending heap TID(s) from itup would put us over maxpostingsize limit.
+	 *
+	 * This calculation needs to match the code used within _bt_form_posting()
+	 * for new posting list tuples.
+	 */
+	mergedtupsz = MAXALIGN(state->basetupsize +
+						   (state->nhtids + nhtids) * sizeof(ItemPointerData));
+
+	if (mergedtupsz > state->maxpostingsize)
+		return false;
+
+	/*
+	 * Save heap TIDs to pending posting list tuple -- itup can be merged into
+	 * pending posting list
+	 */
+	state->nitems++;
+	memcpy(state->htids + state->nhtids, htids,
+		   sizeof(ItemPointerData) * nhtids);
+	state->nhtids += nhtids;
+	state->phystupsize += MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
+
+	return true;
+}
+
+/*
+ * Finalize pending posting list tuple, and add it to the page.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * Returns space saving from deduplicating to make a new posting list tuple.
+ * Note that this includes line pointer overhead.  This is zero in the case
+ * where no deduplication was possible.
+ */
+Size
+_bt_dedup_finish_pending(Page newpage, BTDedupState state)
+{
+	OffsetNumber tupoff;
+	Size		tuplesz;
+	Size		spacesaving;
+
+	Assert(state->nitems > 0);
+	Assert(state->nitems <= state->nhtids);
+	Assert(state->intervals[state->nintervals].baseoff == state->baseoff);
+
+	tupoff = OffsetNumberNext(PageGetMaxOffsetNumber(newpage));
+	if (state->nitems == 1)
+	{
+		/* Use original, unchanged base tuple */
+		tuplesz = IndexTupleSize(state->base);
+		if (PageAddItem(newpage, (Item) state->base, tuplesz, tupoff,
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "deduplication failed to add tuple to page");
+
+		spacesaving = 0;
+	}
+	else
+	{
+		IndexTuple	final;
+
+		/* Form a tuple with a posting list */
+		final = _bt_form_posting(state->base, state->htids, state->nhtids);
+		tuplesz = IndexTupleSize(final);
+		Assert(tuplesz <= state->maxpostingsize);
+
+		/* Save final number of items for posting list */
+		state->intervals[state->nintervals].nitems = state->nitems;
+
+		Assert(tuplesz == MAXALIGN(IndexTupleSize(final)));
+		if (PageAddItem(newpage, (Item) final, tuplesz, tupoff, false,
+						false) == InvalidOffsetNumber)
+			elog(ERROR, "deduplication failed to add tuple to page");
+
+		pfree(final);
+		spacesaving = state->phystupsize - (tuplesz + sizeof(ItemIdData));
+		/* Increment nintervals, since we wrote a new posting list tuple */
+		state->nintervals++;
+		Assert(spacesaving > 0 && spacesaving < BLCKSZ);
+	}
+
+	/* Reset state for next pending posting list */
+	state->nhtids = 0;
+	state->nitems = 0;
+	state->phystupsize = 0;
+
+	return spacesaving;
+}
+
+/*
+ * Determine if page non-pivot tuples (data items) are all duplicates of the
+ * same value -- if they are, deduplication's "single value" strategy should
+ * be applied.  The general goal of this strategy is to ensure that
+ * nbtsplitloc.c (which uses its own single value strategy) will find a useful
+ * split point as further duplicates are inserted, and successive rightmost
+ * page splits occur among pages that store the same duplicate value.  When
+ * the page finally splits, it should end up BTREE_SINGLEVAL_FILLFACTOR% full,
+ * just like it would if deduplication were disabled.
+ *
+ * We expect that affected workloads will require _several_ single value
+ * strategy deduplication passes (over a page that only stores duplicates)
+ * before the page is finally split.  The first deduplication pass should only
+ * find regular non-pivot tuples.  Later deduplication passes will find
+ * existing maxpostingsize-capped posting list tuples, which must be skipped
+ * over.  The penultimate pass is generally the first pass that actually
+ * reaches _bt_singleval_fillfactor(), and so will deliberately leave behind a
+ * few untouched non-pivot tuples.  The final deduplication pass won't free
+ * any space -- it will skip over everything without merging anything (it
+ * retraces the steps of the penultimate pass).
+ *
+ * Fortunately, having several passes isn't too expensive.  Each pass (after
+ * the first pass) won't spend many cycles on the large posting list tuples
+ * left by previous passes.  Each pass will find a large contiguous group of
+ * smaller duplicate tuples to merge together at the end of the page.
+ *
+ * Note: We deliberately don't bother checking if the high key is a distinct
+ * value (prior to the TID tiebreaker column) before proceeding, unlike
+ * nbtsplitloc.c.  Its single value strategy only gets applied on the
+ * rightmost page of duplicates of the same value (other leaf pages full of
+ * duplicates will get a simple 50:50 page split instead of splitting towards
+ * the end of the page).  There is little point in making the same distinction
+ * here.
+ */
+static bool
+_bt_do_singleval(Relation rel, Page page, BTDedupState state,
+				 OffsetNumber minoff, IndexTuple newitem)
+{
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+	ItemId		itemid;
+	IndexTuple	itup;
+
+	itemid = PageGetItemId(page, minoff);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+
+	if (_bt_keep_natts_fast(rel, newitem, itup) > natts)
+	{
+		itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		if (_bt_keep_natts_fast(rel, newitem, itup) > natts)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Lower maxpostingsize when using "single value" strategy, to avoid a sixth
+ * and final maxpostingsize-capped tuple.  The sixth and final posting list
+ * tuple will end up somewhat smaller than the first five.  (Note: The first
+ * five tuples could actually just be very large duplicate tuples that
+ * couldn't be merged together at all.  Deduplication will simply not modify
+ * the page when that happens.)
+ *
+ * When there are six posting lists on the page (after current deduplication
+ * pass goes on to create/observe a sixth very large tuple), caller should end
+ * its deduplication pass.  It isn't useful to try to deduplicate items that
+ * are supposed to end up on the new right sibling page following the
+ * anticipated page split.  A future deduplication pass of future right
+ * sibling page might take care of it.  (This is why the first single value
+ * strategy deduplication pass for a given leaf page will generally find only
+ * plain non-pivot tuples -- see _bt_do_singleval() comments.)
+ */
+static void
+_bt_singleval_fillfactor(Page page, BTDedupState state, Size newitemsz)
+{
+	Size		leftfree;
+	int			reduction;
+
+	/* This calculation needs to match nbtsplitloc.c */
+	leftfree = PageGetPageSize(page) - SizeOfPageHeaderData -
+		MAXALIGN(sizeof(BTPageOpaqueData));
+	/* Subtract size of new high key (includes pivot heap TID space) */
+	leftfree -= newitemsz + MAXALIGN(sizeof(ItemPointerData));
+
+	/*
+	 * Reduce maxpostingsize by an amount equal to target free space on left
+	 * half of page
+	 */
+	reduction = leftfree * ((100 - BTREE_SINGLEVAL_FILLFACTOR) / 100.0);
+	if (state->maxpostingsize > reduction)
+		state->maxpostingsize -= reduction;
+	else
+		state->maxpostingsize = 0;
+}
+
+/*
+ * Build a posting list tuple based on caller's "base" index tuple and list of
+ * heap TIDs.  When nhtids == 1, builds a standard non-pivot tuple without a
+ * posting list. (Posting list tuples can never have a single heap TID, partly
+ * because that ensures that deduplication always reduces final MAXALIGN()'d
+ * size of entire tuple.)
+ *
+ * Convention is that posting list starts at a MAXALIGN()'d offset (rather
+ * than a SHORTALIGN()'d offset), in line with the approach taken when
+ * appending a heap TID to new pivot tuple/high key during suffix truncation.
+ * This sometimes wastes a little space that was only needed as alignment
+ * padding in the original tuple.  Following this convention simplifies the
+ * space accounting used when deduplicating a page (the same convention
+ * simplifies the accounting for choosing a point to split a page at).
+ *
+ * Note: Caller's "htids" array must be unique and already in ascending TID
+ * order.  Any existing heap TIDs from "base" won't automatically appear in
+ * returned posting list tuple (they must be included in htids array.)
+ */
+IndexTuple
+_bt_form_posting(IndexTuple base, ItemPointer htids, int nhtids)
+{
+	uint32		keysize,
+				newsize;
+	IndexTuple	itup;
+
+	if (BTreeTupleIsPosting(base))
+		keysize = BTreeTupleGetPostingOffset(base);
+	else
+		keysize = IndexTupleSize(base);
+
+	Assert(!BTreeTupleIsPivot(base));
+	Assert(nhtids > 0 && nhtids <= PG_UINT16_MAX);
+	Assert(keysize == MAXALIGN(keysize));
+
+	/* Determine final size of new tuple */
+	if (nhtids > 1)
+		newsize = MAXALIGN(keysize +
+						   nhtids * sizeof(ItemPointerData));
+	else
+		newsize = keysize;
+
+	Assert(newsize <= INDEX_SIZE_MASK);
+	Assert(newsize == MAXALIGN(newsize));
+
+	/* Allocate memory using palloc0() (matches index_form_tuple()) */
+	itup = palloc0(newsize);
+	memcpy(itup, base, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+	if (nhtids > 1)
+	{
+		/* Form posting list tuple */
+		BTreeTupleSetPosting(itup, nhtids, keysize);
+		memcpy(BTreeTupleGetPosting(itup), htids,
+			   sizeof(ItemPointerData) * nhtids);
+		Assert(_bt_posting_valid(itup));
+	}
+	else
+	{
+		/* Form standard non-pivot tuple */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		ItemPointerCopy(htids, &itup->t_tid);
+		Assert(ItemPointerIsValid(&itup->t_tid));
+	}
+
+	return itup;
+}
+
+/*
+ * Generate a replacement tuple by "updating" a posting list tuple so that it
+ * no longer has TIDs that need to be deleted.
+ *
+ * Used by VACUUM.  Caller's vacposting argument points to the existing
+ * posting list tuple to be updated.
+ *
+ * On return, caller's vacposting argument will point to final "updated"
+ * tuple, which will be palloc()'d in caller's memory context.
+ */
+void
+_bt_update_posting(BTVacuumPosting vacposting)
+{
+	IndexTuple	origtuple = vacposting->itup;
+	uint32		keysize,
+				newsize;
+	IndexTuple	itup;
+	int			nhtids;
+	int			ui,
+				d;
+	ItemPointer htids;
+
+	nhtids = BTreeTupleGetNPosting(origtuple) - vacposting->ndeletedtids;
+
+	Assert(_bt_posting_valid(origtuple));
+	Assert(nhtids > 0 && nhtids < BTreeTupleGetNPosting(origtuple));
+
+	if (BTreeTupleIsPosting(origtuple))
+		keysize = BTreeTupleGetPostingOffset(origtuple);
+	else
+		keysize = IndexTupleSize(origtuple);
+
+	/*
+	 * Determine final size of new tuple.
+	 *
+	 * This calculation needs to match the code used within _bt_form_posting()
+	 * for new posting list tuples.  We avoid calling _bt_form_posting() here
+	 * to save ourselves a second memory allocation for a htids workspace.
+	 */
+	if (nhtids > 1)
+		newsize = MAXALIGN(keysize +
+						   nhtids * sizeof(ItemPointerData));
+	else
+		newsize = keysize;
+
+	/* Allocate memory using palloc0() (matches index_form_tuple()) */
+	itup = palloc0(newsize);
+	memcpy(itup, origtuple, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+
+	if (nhtids > 1)
+	{
+		/* Form posting list tuple */
+		BTreeTupleSetPosting(itup, nhtids, keysize);
+		htids = BTreeTupleGetPosting(itup);
+	}
+	else
+	{
+		/* Form standard non-pivot tuple */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		htids = &itup->t_tid;
+	}
+
+	ui = 0;
+	d = 0;
+	for (int i = 0; i < BTreeTupleGetNPosting(origtuple); i++)
+	{
+		if (d < vacposting->ndeletedtids && vacposting->deletetids[d] == i)
+		{
+			d++;
+			continue;
+		}
+		htids[ui++] = *BTreeTupleGetPostingN(origtuple, i);
+	}
+	Assert(ui == nhtids);
+	Assert(d == vacposting->ndeletedtids);
+	Assert(nhtids == 1 || _bt_posting_valid(itup));
+
+	/* vacposting arg's itup will now point to updated version */
+	vacposting->itup = itup;
+}
+
+/*
+ * Prepare for a posting list split by swapping heap TID in newitem with heap
+ * TID from original posting list (the 'oposting' heap TID located at offset
+ * 'postingoff').  Modifies newitem, so caller should pass their own private
+ * copy that can safely be modified.
+ *
+ * Returns new posting list tuple, which is palloc()'d in caller's context.
+ * This is guaranteed to be the same size as 'oposting'.  Modified newitem is
+ * what caller actually inserts. (This happens inside the same critical
+ * section that performs an in-place update of old posting list using new
+ * posting list returned here.)
+ *
+ * While the keys from newitem and oposting must be opclass equal, and must
+ * generate identical output when run through the underlying type's output
+ * function, it doesn't follow that their representations match exactly.
+ * Caller must avoid assuming that there can't be representational differences
+ * that make datums from oposting bigger or smaller than the corresponding
+ * datums from newitem.  For example, differences in TOAST input state might
+ * break a faulty assumption about tuple size (the executor is entitled to
+ * apply TOAST compression based on its own criteria).  It also seems possible
+ * that further representational variation will be introduced in the future,
+ * in order to support nbtree features like page-level prefix compression.
+ *
+ * See nbtree/README for details on the design of posting list splits.
+ */
+IndexTuple
+_bt_swap_posting(IndexTuple newitem, IndexTuple oposting, int postingoff)
+{
+	int			nhtids;
+	char	   *replacepos;
+	char	   *replaceposright;
+	Size		nmovebytes;
+	IndexTuple	nposting;
+
+	nhtids = BTreeTupleGetNPosting(oposting);
+	Assert(_bt_posting_valid(oposting));
+	Assert(postingoff > 0 && postingoff < nhtids);
+
+	/*
+	 * Move item pointers in posting list to make a gap for the new item's
+	 * heap TID.  We shift TIDs one place to the right, losing original
+	 * rightmost TID. (nmovebytes must not include TIDs to the left of
+	 * postingoff, nor the existing rightmost/max TID that gets overwritten.)
+	 */
+	nposting = CopyIndexTuple(oposting);
+	replacepos = (char *) BTreeTupleGetPostingN(nposting, postingoff);
+	replaceposright = (char *) BTreeTupleGetPostingN(nposting, postingoff + 1);
+	nmovebytes = (nhtids - postingoff - 1) * sizeof(ItemPointerData);
+	memmove(replaceposright, replacepos, nmovebytes);
+
+	/* Fill the gap at postingoff with TID of new item (original new TID) */
+	Assert(!BTreeTupleIsPivot(newitem) && !BTreeTupleIsPosting(newitem));
+	ItemPointerCopy(&newitem->t_tid, (ItemPointer) replacepos);
+
+	/* Now copy oposting's rightmost/max TID into new item (final new TID) */
+	ItemPointerCopy(BTreeTupleGetMaxHeapTID(oposting), &newitem->t_tid);
+
+	Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(nposting),
+							  BTreeTupleGetHeapTID(newitem)) < 0);
+	Assert(_bt_posting_valid(nposting));
+
+	return nposting;
+}
+
+/*
+ * Verify posting list invariants for "posting", which must be a posting list
+ * tuple.  Used within assertions.
+ */
+#ifdef USE_ASSERT_CHECKING
+static bool
+_bt_posting_valid(IndexTuple posting)
+{
+	ItemPointerData last;
+	ItemPointer htid;
+
+	if (!BTreeTupleIsPosting(posting) || BTreeTupleGetNPosting(posting) < 2)
+		return false;
+
+	/* Remember first heap TID for loop */
+	ItemPointerCopy(BTreeTupleGetHeapTID(posting), &last);
+	if (!ItemPointerIsValid(&last))
+		return false;
+
+	/* Iterate, starting from second TID */
+	for (int i = 1; i < BTreeTupleGetNPosting(posting); i++)
+	{
+		htid = BTreeTupleGetPostingN(posting, i);
+
+		if (!ItemPointerIsValid(htid))
+			return false;
+		if (ItemPointerCompare(htid, &last) <= 0)
+			return false;
+		ItemPointerCopy(htid, &last);
+	}
+
+	return true;
+}
+#endif
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 4e5849ab8e..ff1711d6c0 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,10 +47,12 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   BTStack stack,
 						   IndexTuple itup,
 						   OffsetNumber newitemoff,
+						   int postingoff,
 						   bool split_only_page);
 static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 						Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
-						IndexTuple newitem);
+						IndexTuple newitem, IndexTuple orignewitem,
+						IndexTuple nposting, uint16 postingoff);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 							  BTStack stack, bool is_root, bool is_only);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
@@ -125,6 +127,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	insertstate.itup_key = itup_key;
 	insertstate.bounds_valid = false;
 	insertstate.buf = InvalidBuffer;
+	insertstate.postingoff = 0;
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -295,7 +298,7 @@ top:
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
 		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
-					   itup, newitemoff, false);
+					   itup, newitemoff, insertstate.postingoff, false);
 	}
 	else
 	{
@@ -340,6 +343,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 				 uint32 *speculativeToken)
 {
 	IndexTuple	itup = insertstate->itup;
+	IndexTuple	curitup;
+	ItemId		curitemid;
 	BTScanInsert itup_key = insertstate->itup_key;
 	SnapshotData SnapshotDirty;
 	OffsetNumber offset;
@@ -348,6 +353,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 	BTPageOpaque opaque;
 	Buffer		nbuf = InvalidBuffer;
 	bool		found = false;
+	bool		inposting = false;
+	bool		prevalldead = true;
+	int			curposti = 0;
 
 	/* Assume unique until we find a duplicate */
 	*is_unique = true;
@@ -375,13 +383,21 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 	Assert(itup_key->scantid == NULL);
 	for (;;)
 	{
-		ItemId		curitemid;
-		IndexTuple	curitup;
-		BlockNumber nblkno;
-
 		/*
-		 * make sure the offset points to an actual item before trying to
-		 * examine it...
+		 * Each iteration of the loop processes one heap TID, not one index
+		 * tuple.  Current offset number for page isn't usually advanced on
+		 * iterations that process heap TIDs from posting list tuples.
+		 *
+		 * "inposting" state is set when _inside_ a posting list --- not when
+		 * we're at the start (or end) of a posting list.  We advance curposti
+		 * at the end of the iteration when inside a posting list tuple.  In
+		 * general, every loop iteration either advances the page offset or
+		 * advances curposti --- an iteration that handles the rightmost/max
+		 * heap TID in a posting list finally advances the page offset (and
+		 * unsets "inposting").
+		 *
+		 * Make sure the offset points to an actual index tuple before trying
+		 * to examine it...
 		 */
 		if (offset <= maxoff)
 		{
@@ -406,31 +422,60 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 				break;
 			}
 
-			curitemid = PageGetItemId(page, offset);
-
 			/*
-			 * We can skip items that are marked killed.
+			 * We can skip items that are already marked killed.
 			 *
 			 * In the presence of heavy update activity an index may contain
 			 * many killed items with the same key; running _bt_compare() on
 			 * each killed item gets expensive.  Just advance over killed
 			 * items as quickly as we can.  We only apply _bt_compare() when
-			 * we get to a non-killed item.  Even those comparisons could be
-			 * avoided (in the common case where there is only one page to
-			 * visit) by reusing bounds, but just skipping dead items is fast
-			 * enough.
+			 * we get to a non-killed item.  We could reuse the bounds to
+			 * avoid _bt_compare() calls for known equal tuples, but it
+			 * doesn't seem worth it.  Workloads with heavy update activity
+			 * tend to have many deduplication passes, so we'll often avoid
+			 * most of those comparisons, too (we call _bt_compare() when the
+			 * posting list tuple is initially encountered, though not when
+			 * processing later TIDs from the same tuple).
 			 */
-			if (!ItemIdIsDead(curitemid))
+			if (!inposting)
+				curitemid = PageGetItemId(page, offset);
+			if (inposting || !ItemIdIsDead(curitemid))
 			{
 				ItemPointerData htid;
 				bool		all_dead;
 
-				if (_bt_compare(rel, itup_key, page, offset) != 0)
-					break;		/* we're past all the equal tuples */
+				if (!inposting)
+				{
+					/* Plain tuple, or first TID in posting list tuple */
+					if (_bt_compare(rel, itup_key, page, offset) != 0)
+						break;	/* we're past all the equal tuples */
 
-				/* okay, we gotta fetch the heap tuple ... */
-				curitup = (IndexTuple) PageGetItem(page, curitemid);
-				htid = curitup->t_tid;
+					/* Advanced curitup */
+					curitup = (IndexTuple) PageGetItem(page, curitemid);
+					Assert(!BTreeTupleIsPivot(curitup));
+				}
+
+				/* okay, we gotta fetch the heap tuple using htid ... */
+				if (!BTreeTupleIsPosting(curitup))
+				{
+					/* ... htid is from simple non-pivot tuple */
+					Assert(!inposting);
+					htid = curitup->t_tid;
+				}
+				else if (!inposting)
+				{
+					/* ... htid is first TID in new posting list */
+					inposting = true;
+					prevalldead = true;
+					curposti = 0;
+					htid = *BTreeTupleGetPostingN(curitup, 0);
+				}
+				else
+				{
+					/* ... htid is second or subsequent TID in posting list */
+					Assert(curposti > 0);
+					htid = *BTreeTupleGetPostingN(curitup, curposti);
+				}
 
 				/*
 				 * If we are doing a recheck, we expect to find the tuple we
@@ -506,8 +551,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					 * not part of this chain because it had a different index
 					 * entry.
 					 */
-					htid = itup->t_tid;
-					if (table_index_fetch_tuple_check(heapRel, &htid,
+					if (table_index_fetch_tuple_check(heapRel, &itup->t_tid,
 													  SnapshotSelf, NULL))
 					{
 						/* Normal case --- it's still live */
@@ -565,12 +609,14 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 													RelationGetRelationName(rel))));
 					}
 				}
-				else if (all_dead)
+				else if (all_dead && (!inposting ||
+									  (prevalldead &&
+									   curposti == BTreeTupleGetNPosting(curitup) - 1)))
 				{
 					/*
-					 * The conflicting tuple (or whole HOT chain) is dead to
-					 * everyone, so we may as well mark the index entry
-					 * killed.
+					 * The conflicting tuple (or all HOT chains pointed to by
+					 * all posting list TIDs) is dead to everyone, so mark the
+					 * index entry killed.
 					 */
 					ItemIdMarkDead(curitemid);
 					opaque->btpo_flags |= BTP_HAS_GARBAGE;
@@ -584,14 +630,29 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					else
 						MarkBufferDirtyHint(insertstate->buf, true);
 				}
+
+				/*
+				 * Remember if posting list tuple has even a single HOT chain
+				 * whose members are not all dead
+				 */
+				if (!all_dead && inposting)
+					prevalldead = false;
 			}
 		}
 
-		/*
-		 * Advance to next tuple to continue checking.
-		 */
-		if (offset < maxoff)
+		if (inposting && curposti < BTreeTupleGetNPosting(curitup) - 1)
+		{
+			/* Advance to next TID in same posting list */
+			curposti++;
+			continue;
+		}
+		else if (offset < maxoff)
+		{
+			/* Advance to next tuple */
+			curposti = 0;
+			inposting = false;
 			offset = OffsetNumberNext(offset);
+		}
 		else
 		{
 			int			highkeycmp;
@@ -606,7 +667,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 			/* Advance to next non-dead page --- there must be one */
 			for (;;)
 			{
-				nblkno = opaque->btpo_next;
+				BlockNumber nblkno = opaque->btpo_next;
+
 				nbuf = _bt_relandgetbuf(rel, nbuf, nblkno, BT_READ);
 				page = BufferGetPage(nbuf);
 				opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -616,6 +678,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					elog(ERROR, "fell off the end of index \"%s\"",
 						 RelationGetRelationName(rel));
 			}
+			/* Will also advance to next tuple */
+			curposti = 0;
+			inposting = false;
 			maxoff = PageGetMaxOffsetNumber(page);
 			offset = P_FIRSTDATAKEY(opaque);
 			/* Don't invalidate binary search bounds */
@@ -684,6 +749,7 @@ _bt_findinsertloc(Relation rel,
 	BTScanInsert itup_key = insertstate->itup_key;
 	Page		page = BufferGetPage(insertstate->buf);
 	BTPageOpaque lpageop;
+	OffsetNumber newitemoff;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -699,6 +765,9 @@ _bt_findinsertloc(Relation rel,
 
 	if (itup_key->heapkeyspace)
 	{
+		/* Keep track of whether checkingunique duplicate seen */
+		bool		uniquedup = false;
+
 		/*
 		 * If we're inserting into a unique index, we may have to walk right
 		 * through leaf pages to find the one leaf page that we must insert on
@@ -715,6 +784,13 @@ _bt_findinsertloc(Relation rel,
 		 */
 		if (checkingunique)
 		{
+			if (insertstate->low < insertstate->stricthigh)
+			{
+				/* Encountered a duplicate in _bt_check_unique() */
+				Assert(insertstate->bounds_valid);
+				uniquedup = true;
+			}
+
 			for (;;)
 			{
 				/*
@@ -741,18 +817,43 @@ _bt_findinsertloc(Relation rel,
 				/* Update local state after stepping right */
 				page = BufferGetPage(insertstate->buf);
 				lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+				/* Assume duplicates (if checkingunique) */
+				uniquedup = true;
 			}
 		}
 
 		/*
 		 * If the target page is full, see if we can obtain enough space by
-		 * erasing LP_DEAD items
+		 * erasing LP_DEAD items.  If that fails to free enough space, see if
+		 * we can avoid a page split by performing a deduplication pass over
+		 * the page.
+		 *
+		 * We only perform a deduplication pass for a checkingunique caller
+		 * when the incoming item is a duplicate of an existing item on the
+		 * leaf page.  This heuristic avoids wasting cycles -- we only expect
+		 * to benefit from deduplicating a unique index page when most or all
+		 * recently added items are duplicates.  See nbtree/README.
 		 */
-		if (PageGetFreeSpace(page) < insertstate->itemsz &&
-			P_HAS_GARBAGE(lpageop))
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
 		{
-			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
-			insertstate->bounds_valid = false;
+			if (P_HAS_GARBAGE(lpageop))
+			{
+				_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+				insertstate->bounds_valid = false;
+
+				/* Might as well assume duplicates (if checkingunique) */
+				uniquedup = true;
+			}
+
+			if (itup_key->safededup && BTGetDeduplicateItems(rel) &&
+				(!checkingunique || uniquedup) &&
+				PageGetFreeSpace(page) < insertstate->itemsz)
+			{
+				_bt_dedup_one_page(rel, insertstate->buf, heapRel,
+								   insertstate->itup, insertstate->itemsz,
+								   checkingunique);
+				insertstate->bounds_valid = false;
+			}
 		}
 	}
 	else
@@ -834,7 +935,30 @@ _bt_findinsertloc(Relation rel,
 	Assert(P_RIGHTMOST(lpageop) ||
 		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
 
-	return _bt_binsrch_insert(rel, insertstate);
+	newitemoff = _bt_binsrch_insert(rel, insertstate);
+
+	if (insertstate->postingoff == -1)
+	{
+		/*
+		 * There is an overlapping posting list tuple with its LP_DEAD bit
+		 * set.  We don't want to unnecessarily unset its LP_DEAD bit while
+		 * performing a posting list split, so delete all LP_DEAD items early.
+		 * Note that this is the only case where LP_DEAD deletes happen even
+		 * though there is space for newitem on the page.
+		 */
+		_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+
+		/*
+		 * Do new binary search.  New insert location cannot overlap with any
+		 * posting list now.
+		 */
+		insertstate->bounds_valid = false;
+		insertstate->postingoff = 0;
+		newitemoff = _bt_binsrch_insert(rel, insertstate);
+		Assert(insertstate->postingoff == 0);
+	}
+
+	return newitemoff;
 }
 
 /*
@@ -900,10 +1024,12 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		This recursive procedure does the following things:
  *
+ *			+  if postingoff != 0, splits existing posting list tuple
+ *			   (since it overlaps with new 'itup' tuple).
  *			+  if necessary, splits the target page, using 'itup_key' for
  *			   suffix truncation on leaf pages (caller passes NULL for
  *			   non-leaf pages).
- *			+  inserts the tuple.
+ *			+  inserts the new tuple (might be split from posting list).
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
  *			   right using information stored in the parent stack).
@@ -931,11 +1057,15 @@ _bt_insertonpg(Relation rel,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
+			   int postingoff,
 			   bool split_only_page)
 {
 	Page		page;
 	BTPageOpaque lpageop;
 	Size		itemsz;
+	IndexTuple	oposting;
+	IndexTuple	origitup = NULL;
+	IndexTuple	nposting = NULL;
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -949,6 +1079,7 @@ _bt_insertonpg(Relation rel,
 	Assert(P_ISLEAF(lpageop) ||
 		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
+	Assert(!BTreeTupleIsPosting(itup));
 
 	/* The caller should've finished any incomplete splits already. */
 	if (P_INCOMPLETE_SPLIT(lpageop))
@@ -959,6 +1090,34 @@ _bt_insertonpg(Relation rel,
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
+	/*
+	 * Do we need to split an existing posting list item?
+	 */
+	if (postingoff != 0)
+	{
+		ItemId		itemid = PageGetItemId(page, newitemoff);
+
+		/*
+		 * The new tuple is a duplicate with a heap TID that falls inside the
+		 * range of an existing posting list tuple on a leaf page.  Prepare to
+		 * split an existing posting list.  Overwriting the posting list with
+		 * its post-split version is treated as an extra step in either the
+		 * insert or page split critical section.
+		 */
+		Assert(P_ISLEAF(lpageop) && !ItemIdIsDead(itemid));
+		Assert(itup_key->heapkeyspace && itup_key->safededup);
+		oposting = (IndexTuple) PageGetItem(page, itemid);
+
+		/* use a mutable copy of itup as our itup from here on */
+		origitup = itup;
+		itup = CopyIndexTuple(origitup);
+		nposting = _bt_swap_posting(itup, oposting, postingoff);
+		/* itup now contains rightmost/max TID from oposting */
+
+		/* Alter offset so that newitem goes after posting list */
+		newitemoff = OffsetNumberNext(newitemoff);
+	}
+
 	/*
 	 * Do we need to split the page to fit the item on it?
 	 *
@@ -991,7 +1150,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+						 origitup, nposting, postingoff);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1066,6 +1226,9 @@ _bt_insertonpg(Relation rel,
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
+		if (postingoff != 0)
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
 		if (!_bt_pgaddtup(page, itemsz, itup, newitemoff))
 			elog(PANIC, "failed to add new item to block %u in index \"%s\"",
 				 itup_blkno, RelationGetRelationName(rel));
@@ -1115,8 +1278,19 @@ _bt_insertonpg(Relation rel,
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
 
-			if (P_ISLEAF(lpageop))
+			if (P_ISLEAF(lpageop) && postingoff == 0)
+			{
+				/* Simple leaf insert */
 				xlinfo = XLOG_BTREE_INSERT_LEAF;
+			}
+			else if (postingoff != 0)
+			{
+				/*
+				 * Leaf insert with posting list split.  Must include
+				 * postingoff field before newitem/orignewitem.
+				 */
+				xlinfo = XLOG_BTREE_INSERT_POST;
+			}
 			else
 			{
 				/*
@@ -1139,6 +1313,7 @@ _bt_insertonpg(Relation rel,
 				xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 				xlmeta.last_cleanup_num_heap_tuples =
 					metad->btm_last_cleanup_num_heap_tuples;
+				xlmeta.safededup = metad->btm_safededup;
 
 				XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 				XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));
@@ -1147,7 +1322,27 @@ _bt_insertonpg(Relation rel,
 			}
 
 			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
-			XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+			if (postingoff == 0)
+			{
+				/* Simple, common case -- log itup from caller */
+				XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+			}
+			else
+			{
+				/*
+				 * Insert with posting list split (XLOG_BTREE_INSERT_POST
+				 * record) case.
+				 *
+				 * Log postingoff.  Also log origitup, not itup.  REDO routine
+				 * must reconstruct final itup (as well as nposting) using
+				 * _bt_swap_posting().
+				 */
+				uint16		upostingoff = postingoff;
+
+				XLogRegisterBufData(0, (char *) &upostingoff, sizeof(uint16));
+				XLogRegisterBufData(0, (char *) origitup,
+									IndexTupleSize(origitup));
+			}
 
 			recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
@@ -1189,6 +1384,14 @@ _bt_insertonpg(Relation rel,
 			_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
 			RelationSetTargetBlock(rel, cachedBlock);
 	}
+
+	/* be tidy */
+	if (postingoff != 0)
+	{
+		/* itup is actually a modified copy of caller's original */
+		pfree(nposting);
+		pfree(itup);
+	}
 }
 
 /*
@@ -1204,12 +1407,24 @@ _bt_insertonpg(Relation rel,
  *		This function will clear the INCOMPLETE_SPLIT flag on it, and
  *		release the buffer.
  *
+ *		orignewitem, nposting, and postingoff are needed when an insert of
+ *		orignewitem results in both a posting list split and a page split.
+ *		These extra posting list split details are used here in the same
+ *		way as they are used in the more common case where a posting list
+ *		split does not coincide with a page split.  We need to deal with
+ *		posting list splits directly in order to ensure that everything
+ *		that follows from the insert of orignewitem is handled as a single
+ *		atomic operation (though caller's insert of a new pivot/downlink
+ *		into parent page will still be a separate operation).  See
+ *		nbtree/README for details on the design of posting list splits.
+ *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
  */
 static Buffer
 _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+		  IndexTuple orignewitem, IndexTuple nposting, uint16 postingoff)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1229,6 +1444,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	OffsetNumber leftoff,
 				rightoff;
 	OffsetNumber firstright;
+	OffsetNumber origpagepostingoff;
 	OffsetNumber maxoff;
 	OffsetNumber i;
 	bool		newitemonleft,
@@ -1298,6 +1514,34 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	PageSetLSN(leftpage, PageGetLSN(origpage));
 	isleaf = P_ISLEAF(oopaque);
 
+	/*
+	 * Determine page offset number of existing overlapped-with-orignewitem
+	 * posting list when it is necessary to perform a posting list split in
+	 * passing.  Note that newitem was already changed by caller (newitem no
+	 * longer has the orignewitem TID).
+	 *
+	 * This page offset number (origpagepostingoff) will be used to pretend
+	 * that the posting split has already taken place, even though the
+	 * required modifications to origpage won't occur until we reach the
+	 * critical section.  The lastleft and firstright tuples of our page split
+	 * point should, in effect, come from an imaginary version of origpage
+	 * that has the nposting tuple instead of the original posting list tuple.
+	 *
+	 * Note: _bt_findsplitloc() should have compensated for coinciding posting
+	 * list splits in just the same way, at least in theory.  It doesn't
+	 * bother with that, though.  In practice it won't affect its choice of
+	 * split point.
+	 */
+	origpagepostingoff = InvalidOffsetNumber;
+	if (postingoff != 0)
+	{
+		Assert(isleaf);
+		Assert(ItemPointerCompare(&orignewitem->t_tid,
+								  &newitem->t_tid) < 0);
+		Assert(BTreeTupleIsPosting(nposting));
+		origpagepostingoff = OffsetNumberPrev(newitemoff);
+	}
+
 	/*
 	 * The "high key" for the new left page will be the first key that's going
 	 * to go into the new right page, or a truncated version if this is a leaf
@@ -1335,6 +1579,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, firstright);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		if (firstright == origpagepostingoff)
+			item = nposting;
 	}
 
 	/*
@@ -1368,6 +1614,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
 			itemid = PageGetItemId(origpage, lastleftoff);
 			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+			if (lastleftoff == origpagepostingoff)
+				lastleft = nposting;
 		}
 
 		Assert(lastleft != item);
@@ -1383,6 +1631,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	 */
 	leftoff = P_HIKEY;
 
+	Assert(BTreeTupleIsPivot(lefthikey) || !itup_key->heapkeyspace);
 	Assert(BTreeTupleGetNAtts(lefthikey, rel) > 0);
 	Assert(BTreeTupleGetNAtts(lefthikey, rel) <= indnkeyatts);
 	if (PageAddItem(leftpage, (Item) lefthikey, itemsz, leftoff,
@@ -1447,6 +1696,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		Assert(BTreeTupleIsPivot(item) || !itup_key->heapkeyspace);
 		Assert(BTreeTupleGetNAtts(item, rel) > 0);
 		Assert(BTreeTupleGetNAtts(item, rel) <= indnkeyatts);
 		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
@@ -1475,8 +1725,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 
+		/* replace original item with nposting due to posting split? */
+		if (i == origpagepostingoff)
+		{
+			Assert(BTreeTupleIsPosting(item));
+			Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+			item = nposting;
+		}
+
 		/* does new item belong before this one? */
-		if (i == newitemoff)
+		else if (i == newitemoff)
 		{
 			if (newitemonleft)
 			{
@@ -1645,8 +1903,12 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		XLogRecPtr	recptr;
 
 		xlrec.level = ropaque->btpo.level;
+		/* See comments below on newitem, orignewitem, and posting lists */
 		xlrec.firstright = firstright;
 		xlrec.newitemoff = newitemoff;
+		xlrec.postingoff = 0;
+		if (postingoff != 0 && origpagepostingoff < firstright)
+			xlrec.postingoff = postingoff;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1665,11 +1927,35 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		 * because it's included with all the other items on the right page.)
 		 * Show the new item as belonging to the left page buffer, so that it
 		 * is not stored if XLogInsert decides it needs a full-page image of
-		 * the left page.  We store the offset anyway, though, to support
-		 * archive compression of these records.
+		 * the left page.  We always store newitemoff in the record, though.
+		 *
+		 * The details are sometimes slightly different for page splits that
+		 * coincide with a posting list split.  If both the replacement
+		 * posting list and newitem go on the right page, then we don't need
+		 * to log anything extra, just like the simple !newitemonleft
+		 * no-posting-split case (postingoff is set to zero in the WAL record,
+		 * so recovery doesn't need to process a posting list split at all).
+		 * Otherwise, we set postingoff and log orignewitem instead of
+		 * newitem, despite having actually inserted newitem.  REDO routine
+		 * must reconstruct nposting and newitem using _bt_swap_posting().
+		 *
+		 * Note: It's possible that our page split point is the point that
+		 * makes the posting list lastleft and newitem firstright.  This is
+		 * the only case where we log orignewitem/newitem despite newitem
+		 * going on the right page.  If XLogInsert decides that it can omit
+		 * orignewitem due to logging a full-page image of the left page,
+		 * everything still works out, since recovery only needs to log
+		 * orignewitem for items on the left page (just like the regular
+		 * newitem-logged case).
 		 */
-		if (newitemonleft)
+		if (newitemonleft && xlrec.postingoff == 0)
 			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+		else if (xlrec.postingoff != 0)
+		{
+			Assert(newitemonleft || firstright == newitemoff);
+			Assert(MAXALIGN(newitemsz) == IndexTupleSize(orignewitem));
+			XLogRegisterBufData(0, (char *) orignewitem, MAXALIGN(newitemsz));
+		}
 
 		/* Log the left page's new high key */
 		itemid = PageGetItemId(origpage, P_HIKEY);
@@ -1829,7 +2115,7 @@ _bt_insert_parent(Relation rel,
 
 		/* Recursively insert into the parent */
 		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
-					   new_item, stack->bts_offset + 1,
+					   new_item, stack->bts_offset + 1, 0,
 					   is_only);
 
 		/* be tidy */
@@ -2185,6 +2471,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 		md.fastlevel = metad->btm_level;
 		md.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 		md.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+		md.safededup = metad->btm_safededup;
 
 		XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -2265,7 +2552,7 @@ _bt_pgaddtup(Page page,
 static void
 _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 {
-	OffsetNumber deletable[MaxOffsetNumber];
+	OffsetNumber deletable[MaxIndexTuplesPerPage];
 	int			ndeletable = 0;
 	OffsetNumber offnum,
 				minoff,
@@ -2298,6 +2585,6 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * Note: if we didn't find any LP_DEAD items, then the page's
 	 * BTP_HAS_GARBAGE hint bit is falsely set.  We do not bother expending a
 	 * separate write to clear it, however.  We will clear it when we split
-	 * the page.
+	 * the page, or when deduplication runs.
 	 */
 }
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index f05cbe7467..362330539e 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -37,6 +38,8 @@ static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
 static bool _bt_mark_page_halfdead(Relation rel, Buffer buf, BTStack stack);
 static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
 									 bool *rightsib_empty);
+static TransactionId _bt_xid_horizon(Relation rel, Relation heapRel, Page page,
+									 OffsetNumber *deletable, int ndeletable);
 static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
 								   BTStack stack, Buffer *topparent, OffsetNumber *topoff,
 								   BlockNumber *target, BlockNumber *rightsib);
@@ -47,7 +50,8 @@ static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
  *	_bt_initmetapage() -- Fill a page buffer with a correct metapage image
  */
 void
-_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
+_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+				 bool safededup)
 {
 	BTMetaPageData *metad;
 	BTPageOpaque metaopaque;
@@ -63,6 +67,7 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
 	metad->btm_fastlevel = level;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+	metad->btm_safededup = safededup;
 
 	metaopaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	metaopaque->btpo_flags = BTP_META;
@@ -102,6 +107,9 @@ _bt_upgrademetapage(Page page)
 	metad->btm_version = BTREE_NOVAC_VERSION;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+	/* Only a REINDEX can set this field */
+	Assert(!metad->btm_safededup);
+	metad->btm_safededup = false;
 
 	/* Adjust pd_lower (see _bt_initmetapage() for details) */
 	((PageHeader) page)->pd_lower =
@@ -213,6 +221,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 		md.fastlevel = metad->btm_fastlevel;
 		md.oldest_btpo_xact = oldestBtpoXact;
 		md.last_cleanup_num_heap_tuples = numHeapTuples;
+		md.safededup = metad->btm_safededup;
 
 		XLogRegisterBufData(0, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -274,6 +283,8 @@ _bt_getroot(Relation rel, int access)
 		Assert(metad->btm_magic == BTREE_MAGIC);
 		Assert(metad->btm_version >= BTREE_MIN_VERSION);
 		Assert(metad->btm_version <= BTREE_VERSION);
+		Assert(!metad->btm_safededup ||
+			   metad->btm_version > BTREE_NOVAC_VERSION);
 		Assert(metad->btm_root != P_NONE);
 
 		rootblkno = metad->btm_fastroot;
@@ -394,6 +405,7 @@ _bt_getroot(Relation rel, int access)
 			md.fastlevel = 0;
 			md.oldest_btpo_xact = InvalidTransactionId;
 			md.last_cleanup_num_heap_tuples = -1.0;
+			md.safededup = metad->btm_safededup;
 
 			XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -618,22 +630,33 @@ _bt_getrootheight(Relation rel)
 	Assert(metad->btm_magic == BTREE_MAGIC);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
 	Assert(metad->btm_version <= BTREE_VERSION);
+	Assert(!metad->btm_safededup || metad->btm_version > BTREE_NOVAC_VERSION);
 	Assert(metad->btm_fastroot != P_NONE);
 
 	return metad->btm_fastlevel;
 }
 
 /*
- *	_bt_heapkeyspace() -- is heap TID being treated as a key?
+ *	_bt_metaversion() -- Get version/status info from metapage.
+ *
+ *		Sets caller's *heapkeyspace and *safededup arguments using data from
+ *		the B-Tree metapage (could be locally-cached version).  This
+ *		information needs to be stashed in insertion scankey, so we provide a
+ *		single function that fetches both at once.
  *
  *		This is used to determine the rules that must be used to descend a
  *		btree.  Version 4 indexes treat heap TID as a tiebreaker attribute.
  *		pg_upgrade'd version 3 indexes need extra steps to preserve reasonable
  *		performance when inserting a new BTScanInsert-wise duplicate tuple
  *		among many leaf pages already full of such duplicates.
+ *
+ *		Also sets field that indicates to caller whether or not it is safe to
+ *		apply deduplication within index.  Note that we rely on the assumption
+ *		that btm_safededup will be zero'ed on heapkeyspace indexes that were
+ *		pg_upgrade'd from Postgres 12.
  */
-bool
-_bt_heapkeyspace(Relation rel)
+void
+_bt_metaversion(Relation rel, bool *heapkeyspace, bool *safededup)
 {
 	BTMetaPageData *metad;
 
@@ -651,10 +674,11 @@ _bt_heapkeyspace(Relation rel)
 		 */
 		if (metad->btm_root == P_NONE)
 		{
-			uint32		btm_version = metad->btm_version;
+			*heapkeyspace = metad->btm_version > BTREE_NOVAC_VERSION;
+			*safededup = metad->btm_safededup;
 
 			_bt_relbuf(rel, metabuf);
-			return btm_version > BTREE_NOVAC_VERSION;
+			return;
 		}
 
 		/*
@@ -678,9 +702,11 @@ _bt_heapkeyspace(Relation rel)
 	Assert(metad->btm_magic == BTREE_MAGIC);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
 	Assert(metad->btm_version <= BTREE_VERSION);
+	Assert(!metad->btm_safededup || metad->btm_version > BTREE_NOVAC_VERSION);
 	Assert(metad->btm_fastroot != P_NONE);
 
-	return metad->btm_version > BTREE_NOVAC_VERSION;
+	*heapkeyspace = metad->btm_version > BTREE_NOVAC_VERSION;
+	*safededup = metad->btm_safededup;
 }
 
 /*
@@ -964,28 +990,106 @@ _bt_page_recyclable(Page page)
  * Delete item(s) from a btree leaf page during VACUUM.
  *
  * This routine assumes that the caller has a super-exclusive write lock on
- * the buffer.  Also, the given deletable array *must* be sorted in ascending
- * order.
+ * the buffer.  Also, the given deletable and updatable arrays *must* be
+ * sorted in ascending order.
+ *
+ * Routine deals with deleting TIDs when some (but not all) of the heap TIDs
+ * in an existing posting list item are to be removed by VACUUM.  This works
+ * by updating/overwriting an existing item with caller's new version of the
+ * item (a version that lacks the TIDs that are to be deleted).
  *
  * We record VACUUMs and b-tree deletes differently in WAL.  Deletes must
  * generate their own latestRemovedXid by accessing the heap directly, whereas
- * VACUUMs rely on the initial heap scan taking care of it indirectly.
+ * VACUUMs rely on the initial heap scan taking care of it indirectly.  Also,
+ * only VACUUM can perform granular deletes of individual TIDs in posting list
+ * tuples.
  */
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
-					OffsetNumber *deletable, int ndeletable)
+					OffsetNumber *deletable, int ndeletable,
+					BTVacuumPosting *updatable, int nupdatable)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	Size		itemsz;
+	char	   *updatedbuf = NULL;
+	Size		updatedbuflen = 0;
+	OffsetNumber updatedoffsets[MaxIndexTuplesPerPage];
 
 	/* Shouldn't be called unless there's something to do */
-	Assert(ndeletable > 0);
+	Assert(ndeletable > 0 || nupdatable > 0);
+
+	for (int i = 0; i < nupdatable; i++)
+	{
+		/* Replace work area IndexTuple with updated version */
+		_bt_update_posting(updatable[i]);
+
+		/* Maintain array of updatable page offsets for WAL record */
+		updatedoffsets[i] = updatable[i]->updatedoffset;
+	}
+
+	/* XLOG stuff -- allocate and fill buffer before critical section */
+	if (nupdatable > 0 && RelationNeedsWAL(rel))
+	{
+		Size		offset = 0;
+
+		for (int i = 0; i < nupdatable; i++)
+		{
+			BTVacuumPosting vacposting = updatable[i];
+
+			itemsz = SizeOfBtreeUpdate +
+				vacposting->ndeletedtids * sizeof(uint16);
+			updatedbuflen += itemsz;
+		}
+
+		updatedbuf = palloc(updatedbuflen);
+		for (int i = 0; i < nupdatable; i++)
+		{
+			BTVacuumPosting vacposting = updatable[i];
+			xl_btree_update update;
+
+			update.ndeletedtids = vacposting->ndeletedtids;
+			memcpy(updatedbuf + offset, &update.ndeletedtids,
+				   SizeOfBtreeUpdate);
+			offset += SizeOfBtreeUpdate;
+
+			itemsz = update.ndeletedtids * sizeof(uint16);
+			memcpy(updatedbuf + offset, vacposting->deletetids, itemsz);
+			offset += itemsz;
+		}
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
-	/* Fix the page */
-	PageIndexMultiDelete(page, deletable, ndeletable);
+	/*
+	 * Handle posting tuple updates.
+	 *
+	 * Deliberately do this before handling simple deletes.  If we did it the
+	 * other way around (i.e. WAL record order -- simple deletes before
+	 * updates) then we'd have to make compensating changes to the 'updatable'
+	 * array of offset numbers.
+	 *
+	 * PageIndexTupleOverwrite() won't unset each item's LP_DEAD bit when it
+	 * happens to already be set.  Although we unset the BTP_HAS_GARBAGE page
+	 * level flag, unsetting individual LP_DEAD bits should still be avoided.
+	 */
+	for (int i = 0; i < nupdatable; i++)
+	{
+		OffsetNumber updatedoffset = updatedoffsets[i];
+		IndexTuple	itup;
+
+		itup = updatable[i]->itup;
+		itemsz = MAXALIGN(IndexTupleSize(itup));
+		if (!PageIndexTupleOverwrite(page, updatedoffset, (Item) itup,
+									 itemsz))
+			elog(PANIC, "could not update partially dead item in block %u of index \"%s\"",
+				 BufferGetBlockNumber(buf), RelationGetRelationName(rel));
+	}
+
+	/* Now handle simple deletes of entire tuples */
+	if (ndeletable > 0)
+		PageIndexMultiDelete(page, deletable, ndeletable);
 
 	/*
 	 * We can clear the vacuum cycle ID since this page has certainly been
@@ -1006,7 +1110,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	 * limited, since we never falsely unset an LP_DEAD bit.  Workloads that
 	 * are particularly dependent on LP_DEAD bits being set quickly will
 	 * usually manage to set the BTP_HAS_GARBAGE flag before the page fills up
-	 * again anyway.
+	 * again anyway.  Furthermore, attempting a deduplication pass will remove
+	 * all LP_DEAD items, regardless of whether the BTP_HAS_GARBAGE hint bit
+	 * is set or not.
 	 */
 	opaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 
@@ -1019,18 +1125,22 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.ndeleted = ndeletable;
+		xlrec_vacuum.nupdated = nupdatable;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
 		XLogRegisterData((char *) &xlrec_vacuum, SizeOfBtreeVacuum);
 
-		/*
-		 * The deletable array is not in the buffer, but pretend that it is.
-		 * When XLogInsert stores the whole buffer, the array need not be
-		 * stored too.
-		 */
-		XLogRegisterBufData(0, (char *) deletable,
-							ndeletable * sizeof(OffsetNumber));
+		if (ndeletable > 0)
+			XLogRegisterBufData(0, (char *) deletable,
+								ndeletable * sizeof(OffsetNumber));
+
+		if (nupdatable > 0)
+		{
+			XLogRegisterBufData(0, (char *) updatedoffsets,
+								nupdatable * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, updatedbuf, updatedbuflen);
+		}
 
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
@@ -1038,6 +1148,13 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	}
 
 	END_CRIT_SECTION();
+
+	/* can't leak memory here */
+	if (updatedbuf != NULL)
+		pfree(updatedbuf);
+	/* free tuples generated by calling _bt_update_posting() */
+	for (int i = 0; i < nupdatable; i++)
+		pfree(updatable[i]->itup);
 }
 
 /*
@@ -1050,6 +1167,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
  * This is nearly the same as _bt_delitems_vacuum as far as what it does to
  * the page, but it needs to generate its own latestRemovedXid by accessing
  * the heap.  This is used by the REDO routine to generate recovery conflicts.
+ * Also, it doesn't handle posting list tuples unless the entire tuple can be
+ * deleted as a whole (since there is only one LP_DEAD bit per line pointer).
  */
 void
 _bt_delitems_delete(Relation rel, Buffer buf,
@@ -1065,8 +1184,7 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 
 	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
 		latestRemovedXid =
-			index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
-												 deletable, ndeletable);
+			_bt_xid_horizon(rel, heapRel, page, deletable, ndeletable);
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
@@ -1113,6 +1231,83 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 	END_CRIT_SECTION();
 }
 
+/*
+ * Get the latestRemovedXid from the table entries pointed to by the non-pivot
+ * tuples being deleted.
+ *
+ * This is a specialized version of index_compute_xid_horizon_for_tuples().
+ * It's needed because btree tuples don't always store table TID using the
+ * standard index tuple header field.
+ */
+static TransactionId
+_bt_xid_horizon(Relation rel, Relation heapRel, Page page,
+				OffsetNumber *deletable, int ndeletable)
+{
+	TransactionId latestRemovedXid = InvalidTransactionId;
+	int			spacenhtids;
+	int			nhtids;
+	ItemPointer htids;
+
+	/* Array will grow iff there are posting list tuples to consider */
+	spacenhtids = ndeletable;
+	nhtids = 0;
+	htids = (ItemPointer) palloc(sizeof(ItemPointerData) * spacenhtids);
+	for (int i = 0; i < ndeletable; i++)
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+
+		itemid = PageGetItemId(page, deletable[i]);
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(ItemIdIsDead(itemid));
+		Assert(!BTreeTupleIsPivot(itup));
+
+		if (!BTreeTupleIsPosting(itup))
+		{
+			if (nhtids + 1 > spacenhtids)
+			{
+				spacenhtids *= 2;
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * spacenhtids);
+			}
+
+			Assert(ItemPointerIsValid(&itup->t_tid));
+			ItemPointerCopy(&itup->t_tid, &htids[nhtids]);
+			nhtids++;
+		}
+		else
+		{
+			int			nposting = BTreeTupleGetNPosting(itup);
+
+			if (nhtids + nposting > spacenhtids)
+			{
+				spacenhtids = Max(spacenhtids * 2, nhtids + nposting);
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * spacenhtids);
+			}
+
+			for (int j = 0; j < nposting; j++)
+			{
+				ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+				Assert(ItemPointerIsValid(htid));
+				ItemPointerCopy(htid, &htids[nhtids]);
+				nhtids++;
+			}
+		}
+	}
+
+	Assert(nhtids >= ndeletable);
+
+	latestRemovedXid =
+		table_compute_xid_horizon_for_tuples(heapRel, htids, nhtids);
+
+	pfree(htids);
+
+	return latestRemovedXid;
+}
+
 /*
  * Returns true, if the given block has the half-dead flag set.
  */
@@ -2058,6 +2253,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 			xlmeta.fastlevel = metad->btm_fastlevel;
 			xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 			xlmeta.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+			xlmeta.safededup = metad->btm_safededup;
 
 			XLogRegisterBufData(4, (char *) &xlmeta, sizeof(xl_btree_metadata));
 			xlinfo = XLOG_BTREE_UNLINK_PAGE_META;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 5254bc7ef5..a04f508474 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -95,6 +95,10 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
+										  IndexTuple posting,
+										  OffsetNumber updatedoffset,
+										  int *nremaining);
 
 
 /*
@@ -161,7 +165,7 @@ btbuildempty(Relation index)
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, P_NONE, 0);
+	_bt_initmetapage(metapage, P_NONE, 0, _bt_opclasses_support_dedup(index));
 
 	/*
 	 * Write the page and log it.  It might seem that an immediate sync would
@@ -264,8 +268,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 				 */
 				if (so->killedItems == NULL)
 					so->killedItems = (int *)
-						palloc(MaxIndexTuplesPerPage * sizeof(int));
-				if (so->numKilled < MaxIndexTuplesPerPage)
+						palloc(MaxTIDsPerBTreePage * sizeof(int));
+				if (so->numKilled < MaxTIDsPerBTreePage)
 					so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 			}
 
@@ -1154,11 +1158,15 @@ restart:
 	}
 	else if (P_ISLEAF(opaque))
 	{
-		OffsetNumber deletable[MaxOffsetNumber];
+		OffsetNumber deletable[MaxIndexTuplesPerPage];
 		int			ndeletable;
+		BTVacuumPosting updatable[MaxIndexTuplesPerPage];
+		int			nupdatable;
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
+		int			nhtidsdead,
+					nhtidslive;
 
 		/*
 		 * Trade in the initial read lock for a super-exclusive write lock on
@@ -1190,8 +1198,11 @@ restart:
 		 * point using callback.
 		 */
 		ndeletable = 0;
+		nupdatable = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
+		nhtidsdead = 0;
+		nhtidslive = 0;
 		if (callback)
 		{
 			for (offnum = minoff;
@@ -1199,11 +1210,9 @@ restart:
 				 offnum = OffsetNumberNext(offnum))
 			{
 				IndexTuple	itup;
-				ItemPointer htup;
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
 				/*
 				 * Hot Standby assumes that it's okay that XLOG_BTREE_VACUUM
@@ -1226,22 +1235,82 @@ restart:
 				 * simple, and allows us to always avoid generating our own
 				 * conflicts.
 				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				Assert(!BTreeTupleIsPivot(itup));
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Regular tuple, standard table TID representation */
+					if (callback(&itup->t_tid, callback_state))
+					{
+						deletable[ndeletable++] = offnum;
+						nhtidsdead++;
+					}
+					else
+						nhtidslive++;
+				}
+				else
+				{
+					BTVacuumPosting vacposting;
+					int			nremaining;
+
+					/* Posting list tuple */
+					vacposting = btreevacuumposting(vstate, itup, offnum,
+													&nremaining);
+					if (vacposting == NULL)
+					{
+						/*
+						 * All table TIDs from the posting tuple remain, so no
+						 * delete or update required
+						 */
+						Assert(nremaining == BTreeTupleGetNPosting(itup));
+					}
+					else if (nremaining > 0)
+					{
+
+						/*
+						 * Store metadata about posting list tuple in
+						 * updatable array for entire page.  Existing tuple
+						 * will be updated during the later call to
+						 * _bt_delitems_vacuum().
+						 */
+						Assert(nremaining < BTreeTupleGetNPosting(itup));
+						updatable[nupdatable++] = vacposting;
+						nhtidsdead += BTreeTupleGetNPosting(itup) - nremaining;
+					}
+					else
+					{
+						/*
+						 * All table TIDs from the posting list must be
+						 * deleted.  We'll delete the index tuple completely
+						 * (no update required).
+						 */
+						Assert(nremaining == 0);
+						deletable[ndeletable++] = offnum;
+						nhtidsdead += BTreeTupleGetNPosting(itup);
+						pfree(vacposting);
+					}
+
+					nhtidslive += nremaining;
+				}
 			}
 		}
 
 		/*
-		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
-		 * call per page, so as to minimize WAL traffic.
+		 * Apply any needed deletes or updates.  We issue just one
+		 * _bt_delitems_vacuum() call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nupdatable > 0)
 		{
-			_bt_delitems_vacuum(rel, buf, deletable, ndeletable);
+			Assert(nhtidsdead >= Max(ndeletable, 1));
+			_bt_delitems_vacuum(rel, buf, deletable, ndeletable, updatable,
+								nupdatable);
 
-			stats->tuples_removed += ndeletable;
+			stats->tuples_removed += nhtidsdead;
 			/* must recompute maxoff */
 			maxoff = PageGetMaxOffsetNumber(page);
+
+			/* can't leak memory here */
+			for (int i = 0; i < nupdatable; i++)
+				pfree(updatable[i]);
 		}
 		else
 		{
@@ -1254,6 +1323,7 @@ restart:
 			 * We treat this like a hint-bit update because there's no need to
 			 * WAL-log it.
 			 */
+			Assert(nhtidsdead == 0);
 			if (vstate->cycleid != 0 &&
 				opaque->btpo_cycleid == vstate->cycleid)
 			{
@@ -1263,15 +1333,18 @@ restart:
 		}
 
 		/*
-		 * If it's now empty, try to delete; else count the live tuples. We
-		 * don't delete when recursing, though, to avoid putting entries into
-		 * freePages out-of-order (doesn't seem worth any extra code to handle
-		 * the case).
+		 * If it's now empty, try to delete; else count the live tuples (live
+		 * table TIDs in posting lists are counted as separate live tuples).
+		 * We don't delete when recursing, though, to avoid putting entries
+		 * into freePages out-of-order (doesn't seem worth any extra code to
+		 * handle the case).
 		 */
 		if (minoff > maxoff)
 			delete_now = (blkno == orig_blkno);
 		else
-			stats->num_index_tuples += maxoff - minoff + 1;
+			stats->num_index_tuples += nhtidslive;
+
+		Assert(!delete_now || nhtidslive == 0);
 	}
 
 	if (delete_now)
@@ -1303,9 +1376,10 @@ restart:
 	/*
 	 * This is really tail recursion, but if the compiler is too stupid to
 	 * optimize it as such, we'd eat an uncomfortably large amount of stack
-	 * space per recursion level (due to the deletable[] array). A failure is
-	 * improbable since the number of levels isn't likely to be large ... but
-	 * just in case, let's hand-optimize into a loop.
+	 * space per recursion level (due to the arrays used to track details of
+	 * deletable/updatable items).  A failure is improbable since the number
+	 * of levels isn't likely to be large ...  but just in case, let's
+	 * hand-optimize into a loop.
 	 */
 	if (recurse_to != P_NONE)
 	{
@@ -1314,6 +1388,61 @@ restart:
 	}
 }
 
+/*
+ * btreevacuumposting --- determine TIDs still needed in posting list
+ *
+ * Returns metadata describing how to build replacement tuple without the TIDs
+ * that VACUUM needs to delete.  Returned value is NULL in the common case
+ * where no changes are needed to caller's posting list tuple (we avoid
+ * allocating memory here as an optimization).
+ *
+ * The number of TIDs that should remain in the posting list tuple is set for
+ * caller in *nremaining.
+ */
+static BTVacuumPosting
+btreevacuumposting(BTVacState *vstate, IndexTuple posting,
+				   OffsetNumber updatedoffset, int *nremaining)
+{
+	int			live = 0;
+	int			nitem = BTreeTupleGetNPosting(posting);
+	ItemPointer items = BTreeTupleGetPosting(posting);
+	BTVacuumPosting vacposting = NULL;
+
+	for (int i = 0; i < nitem; i++)
+	{
+		if (!vstate->callback(items + i, vstate->callback_state))
+		{
+			/* Live table TID */
+			live++;
+		}
+		else if (vacposting == NULL)
+		{
+			/*
+			 * First dead table TID encountered.
+			 *
+			 * It's now clear that we need to delete one or more dead table
+			 * TIDs, so start maintaining metadata describing how to update
+			 * existing posting list tuple.
+			 */
+			vacposting = palloc(offsetof(BTVacuumPostingData, deletetids) +
+								nitem * sizeof(uint16));
+
+			vacposting->itup = posting;
+			vacposting->updatedoffset = updatedoffset;
+			vacposting->ndeletedtids = 0;
+			vacposting->deletetids[vacposting->ndeletedtids++] = i;
+		}
+		else
+		{
+			/* Second or subsequent dead table TID */
+			vacposting->deletetids[vacposting->ndeletedtids++] = i;
+		}
+	}
+
+	*nremaining = live;
+	return vacposting;
+}
+
 /*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c573814f01..c60f3bd6a0 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int	_bt_binsrch_posting(BTScanInsert key, Page page,
+								OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static int	_bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+								  OffsetNumber offnum, ItemPointer heapTid,
+								  IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+									   OffsetNumber offnum,
+									   ItemPointer heapTid, int tupleOffset);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -142,6 +150,7 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
 		offnum = _bt_binsrch(rel, key, *bufP);
 		itemid = PageGetItemId(page, offnum);
 		itup = (IndexTuple) PageGetItem(page, itemid);
+		Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
 		blkno = BTreeTupleGetDownLink(itup);
 		par_blkno = BufferGetBlockNumber(*bufP);
 
@@ -434,7 +443,10 @@ _bt_binsrch(Relation rel,
  * low) makes bounds invalid.
  *
  * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by a posting
+ * list split).
  */
 OffsetNumber
 _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +465,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 	Assert(P_ISLEAF(opaque));
 	Assert(!key->nextkey);
+	Assert(insertstate->postingoff == 0);
 
 	if (!insertstate->bounds_valid)
 	{
@@ -509,6 +522,16 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 			if (result != 0)
 				stricthigh = high;
 		}
+
+		/*
+		 * If tuple at offset located by binary search is a posting list whose
+		 * TID range overlaps with caller's scantid, perform posting list
+		 * binary search to set postingoff for caller.  Caller must split the
+		 * posting list when postingoff is set.  This should happen
+		 * infrequently.
+		 */
+		if (unlikely(result == 0 && key->scantid != NULL))
+			insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
 	}
 
 	/*
@@ -528,6 +551,73 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	return low;
 }
 
+/*----------
+ *	_bt_binsrch_posting() -- posting list binary search.
+ *
+ * Helper routine for _bt_binsrch_insert().
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+	IndexTuple	itup;
+	ItemId		itemid;
+	int			low,
+				high,
+				mid,
+				res;
+
+	/*
+	 * If this isn't a posting tuple, then the index must be corrupt (if it is
+	 * an ordinary non-pivot tuple then there must be an existing tuple with a
+	 * heap TID that equals inserter's new heap TID/scantid).  Defensively
+	 * check that tuple is a posting list tuple whose posting list range
+	 * includes caller's scantid.
+	 *
+	 * (This is also needed because contrib/amcheck's rootdescend option needs
+	 * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+	 */
+	itemid = PageGetItemId(page, offnum);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+	if (!BTreeTupleIsPosting(itup))
+		return 0;
+
+	Assert(key->heapkeyspace && key->safededup);
+
+	/*
+	 * In the event that posting list tuple has LP_DEAD bit set, indicate this
+	 * to _bt_binsrch_insert() caller by returning -1, a sentinel value.  A
+	 * second call to _bt_binsrch_insert() can take place when its caller has
+	 * removed the dead item.
+	 */
+	if (ItemIdIsDead(itemid))
+		return -1;
+
+	/* "high" is past end of posting list for loop invariant */
+	low = 0;
+	high = BTreeTupleGetNPosting(itup);
+	Assert(high >= 2);
+
+	while (high > low)
+	{
+		mid = low + ((high - low) / 2);
+		res = ItemPointerCompare(key->scantid,
+								 BTreeTupleGetPostingN(itup, mid));
+
+		if (res > 0)
+			low = mid + 1;
+		else if (res < 0)
+			high = mid;
+		else
+			return mid;
+	}
+
+	/* Exact match not found */
+	return low;
+}
+
 /*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
@@ -537,9 +627,14 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
  *			<0 if scankey < tuple at offnum;
  *			 0 if scankey == tuple at offnum;
  *			>0 if scankey > tuple at offnum.
- *		NULLs in the keys are treated as sortable values.  Therefore
- *		"equality" does not necessarily mean that the item should be
- *		returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values.  Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key.  Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid.  There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
  *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
@@ -563,6 +658,7 @@ _bt_compare(Relation rel,
 	ScanKey		scankey;
 	int			ncmpkey;
 	int			ntupatts;
+	int32		result;
 
 	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +693,6 @@ _bt_compare(Relation rel,
 	{
 		Datum		datum;
 		bool		isNull;
-		int32		result;
 
 		datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
 
@@ -713,8 +808,25 @@ _bt_compare(Relation rel,
 	if (heapTid == NULL)
 		return 1;
 
+	/*
+	 * Scankey must be treated as equal to a posting list tuple if its scantid
+	 * value falls within the range of the posting list.  In all other cases
+	 * there can only be a single heap TID value, which is compared directly
+	 * with scantid.
+	 */
 	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	return ItemPointerCompare(key->scantid, heapTid);
+	result = ItemPointerCompare(key->scantid, heapTid);
+	if (result <= 0 || !BTreeTupleIsPosting(itup))
+		return result;
+	else
+	{
+		result = ItemPointerCompare(key->scantid,
+									BTreeTupleGetMaxHeapTID(itup));
+		if (result > 0)
+			return 1;
+	}
+
+	return 0;
 }
 
 /*
@@ -1229,7 +1341,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	}
 
 	/* Initialize remaining insertion scan key fields */
-	inskey.heapkeyspace = _bt_heapkeyspace(rel);
+	_bt_metaversion(rel, &inskey.heapkeyspace, &inskey.safededup);
 	inskey.anynullkeys = false; /* unused */
 	inskey.nextkey = nextkey;
 	inskey.pivotsearch = false;
@@ -1484,9 +1596,35 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
-				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				/* tuple passes all scan key conditions */
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Remember it */
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+				else
+				{
+					int			tupleOffset;
+
+					/*
+					 * Set up state to return posting list, and remember first
+					 * TID
+					 */
+					tupleOffset =
+						_bt_setuppostingitems(so, itemIndex, offnum,
+											  BTreeTupleGetPostingN(itup, 0),
+											  itup);
+					itemIndex++;
+					/* Remember additional TIDs */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											tupleOffset);
+						itemIndex++;
+					}
+				}
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1519,7 +1657,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxTIDsPerBTreePage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1527,7 +1665,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxTIDsPerBTreePage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1568,9 +1706,41 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 										 &continuescan);
 			if (passes_quals && tuple_alive)
 			{
-				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				/* tuple passes all scan key conditions */
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Remember it */
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+				else
+				{
+					int			tupleOffset;
+
+					/*
+					 * Set up state to return posting list, and remember first
+					 * TID.
+					 *
+					 * Note that we deliberately save/return items from
+					 * posting lists in ascending heap TID order for backwards
+					 * scans.  This allows _bt_killitems() to make a
+					 * consistent assumption about the order of items
+					 * associated with the same posting list tuple.
+					 */
+					itemIndex--;
+					tupleOffset =
+						_bt_setuppostingitems(so, itemIndex, offnum,
+											  BTreeTupleGetPostingN(itup, 0),
+											  itup);
+					/* Remember additional TIDs */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						itemIndex--;
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											tupleOffset);
+					}
+				}
 			}
 			if (!continuescan)
 			{
@@ -1584,8 +1754,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
+		so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1768,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPivot(itup) && !BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1610,6 +1782,71 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 	}
 }
 
+/*
+ * Setup state to save TIDs/items from a single posting list tuple.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for TID that is
+ * returned to scan first.  Second or subsequent TIDs for posting list should
+ * be saved by calling _bt_savepostingitem().
+ *
+ * Returns an offset into tuple storage space that main tuple is stored at if
+ * needed.
+ */
+static int
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					  ItemPointer heapTid, IndexTuple itup)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	Assert(BTreeTupleIsPosting(itup));
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+	if (so->currTuples)
+	{
+		/* Save base IndexTuple (truncate posting list) */
+		IndexTuple	base;
+		Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+		itupsz = MAXALIGN(itupsz);
+		currItem->tupleOffset = so->currPos.nextTupleOffset;
+		base = (IndexTuple) (so->currTuples + so->currPos.nextTupleOffset);
+		memcpy(base, itup, itupsz);
+		/* Defensively reduce work area index tuple header size */
+		base->t_info &= ~INDEX_SIZE_MASK;
+		base->t_info |= itupsz;
+		so->currPos.nextTupleOffset += itupsz;
+
+		return currItem->tupleOffset;
+	}
+
+	return 0;
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for current posting
+ * tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.  Caller passes its return value as tupleOffset.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer heapTid, int tupleOffset)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+
+	/*
+	 * Have index-only scans return the same base IndexTuple for every TID
+	 * that originates from the same posting list
+	 */
+	if (so->currTuples)
+		currItem->tupleOffset = tupleOffset;
+}
+
 /*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index baec5de999..d9aec2784a 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -243,6 +243,7 @@ typedef struct BTPageState
 	BlockNumber btps_blkno;		/* block # to write this page at */
 	IndexTuple	btps_lowkey;	/* page's strict lower bound pivot tuple */
 	OffsetNumber btps_lastoff;	/* last item offset loaded */
+	Size		btps_lastextra; /* last item's extra posting list space */
 	uint32		btps_level;		/* tree level (0 = leaf) */
 	Size		btps_full;		/* "full" if less than this much free space */
 	struct BTPageState *btps_next;	/* link to parent level, if any */
@@ -277,7 +278,10 @@ static void _bt_slideleft(Page page);
 static void _bt_sortaddtup(Page page, Size itemsize,
 						   IndexTuple itup, OffsetNumber itup_off);
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
-						 IndexTuple itup);
+						 IndexTuple itup, Size truncextra);
+static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
+										  BTPageState *state,
+										  BTDedupState dstate);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
@@ -711,6 +715,7 @@ _bt_pagestate(BTWriteState *wstate, uint32 level)
 	state->btps_lowkey = NULL;
 	/* initialize lastoff so first item goes into P_FIRSTKEY */
 	state->btps_lastoff = P_HIKEY;
+	state->btps_lastextra = 0;
 	state->btps_level = level;
 	/* set "full" threshold based on level.  See notes at head of file. */
 	if (level > 0)
@@ -789,7 +794,8 @@ _bt_sortaddtup(Page page,
 }
 
 /*----------
- * Add an item to a disk page from the sort output.
+ * Add an item to a disk page from the sort output (or add a posting list
+ * item formed from the sort output).
  *
  * We must be careful to observe the page layout conventions of nbtsearch.c:
  * - rightmost pages start data items at P_HIKEY instead of at P_FIRSTKEY.
@@ -821,14 +827,27 @@ _bt_sortaddtup(Page page,
  * the truncated high key at offset 1.
  *
  * 'last' pointer indicates the last offset added to the page.
+ *
+ * 'truncextra' is the size of the posting list in itup, if any.  This
+ * information is stashed for the next call here, when we may benefit
+ * from considering the impact of truncating away the posting list on
+ * the page before deciding to finish the page off.  Posting lists are
+ * often relatively large, so it is worth going to the trouble of
+ * accounting for the saving from truncating away the posting list of
+ * the tuple that becomes the high key (that may be the only way to
+ * get close to target free space on the page).  Note that this is
+ * only used for the soft fillfactor-wise limit, not the critical hard
+ * limit.
  *----------
  */
 static void
-_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
+_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
+			 Size truncextra)
 {
 	Page		npage;
 	BlockNumber nblkno;
 	OffsetNumber last_off;
+	Size		last_truncextra;
 	Size		pgspc;
 	Size		itupsz;
 	bool		isleaf;
@@ -842,6 +861,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	npage = state->btps_page;
 	nblkno = state->btps_blkno;
 	last_off = state->btps_lastoff;
+	last_truncextra = state->btps_lastextra;
+	state->btps_lastextra = truncextra;
 
 	pgspc = PageGetFreeSpace(npage);
 	itupsz = IndexTupleSize(itup);
@@ -883,10 +904,10 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	 * page.  Disregard fillfactor and insert on "full" current page if we
 	 * don't have the minimum number of items yet.  (Note that we deliberately
 	 * assume that suffix truncation neither enlarges nor shrinks new high key
-	 * when applying soft limit.)
+	 * when applying soft limit, except when last tuple has a posting list.)
 	 */
 	if (pgspc < itupsz + (isleaf ? MAXALIGN(sizeof(ItemPointerData)) : 0) ||
-		(pgspc < state->btps_full && last_off > P_FIRSTKEY))
+		(pgspc + last_truncextra < state->btps_full && last_off > P_FIRSTKEY))
 	{
 		/*
 		 * Finish off the page and write it out.
@@ -944,11 +965,14 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * We don't try to bias our choice of split point to make it more
 			 * likely that _bt_truncate() can truncate away more attributes,
 			 * whereas the split point used within _bt_split() is chosen much
-			 * more delicately.  Suffix truncation is mostly useful because it
-			 * improves space utilization for workloads with random
-			 * insertions.  It doesn't seem worthwhile to add logic for
-			 * choosing a split point here for a benefit that is bound to be
-			 * much smaller.
+			 * more delicately.  Even still, the lastleft and firstright
+			 * tuples passed to _bt_truncate() here are at least not fully
+			 * equal to each other when deduplication is used, unless there is
+			 * a large group of duplicates (also, unique index builds usually
+			 * have few or no spool2 duplicates).  When the split point is
+			 * between two unequal tuples, _bt_truncate() will avoid including
+			 * a heap TID in the new high key, which is the most important
+			 * benefit of suffix truncation.
 			 *
 			 * Overwrite the old item with new truncated high key directly.
 			 * oitup is already located at the physical beginning of tuple
@@ -983,7 +1007,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(BTreeTupleGetNAtts(state->btps_lowkey, wstate->index) == 0 ||
 			   !P_LEFTMOST((BTPageOpaque) PageGetSpecialPointer(opage)));
 		BTreeTupleSetDownLink(state->btps_lowkey, oblkno);
-		_bt_buildadd(wstate, state->btps_next, state->btps_lowkey);
+		_bt_buildadd(wstate, state->btps_next, state->btps_lowkey, 0);
 		pfree(state->btps_lowkey);
 
 		/*
@@ -1045,6 +1069,43 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	state->btps_lastoff = last_off;
 }
 
+/*
+ * Finalize pending posting list tuple, and add it to the index.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * This is almost like _bt_dedup_finish_pending(), but it adds a new tuple
+ * using _bt_buildadd().
+ */
+static void
+_bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
+							  BTDedupState dstate)
+{
+	Assert(dstate->nitems > 0);
+
+	if (dstate->nitems == 1)
+		_bt_buildadd(wstate, state, dstate->base, 0);
+	else
+	{
+		IndexTuple	postingtuple;
+		Size		truncextra;
+
+		/* form a tuple with a posting list */
+		postingtuple = _bt_form_posting(dstate->base,
+										dstate->htids,
+										dstate->nhtids);
+		/* Calculate posting list overhead */
+		truncextra = IndexTupleSize(postingtuple) -
+			BTreeTupleGetPostingOffset(postingtuple);
+
+		_bt_buildadd(wstate, state, postingtuple, truncextra);
+		pfree(postingtuple);
+	}
+
+	dstate->nhtids = 0;
+	dstate->nitems = 0;
+	dstate->phystupsize = 0;
+}
+
 /*
  * Finish writing out the completed btree.
  */
@@ -1090,7 +1151,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 			Assert(BTreeTupleGetNAtts(s->btps_lowkey, wstate->index) == 0 ||
 				   !P_LEFTMOST(opaque));
 			BTreeTupleSetDownLink(s->btps_lowkey, blkno);
-			_bt_buildadd(wstate, s->btps_next, s->btps_lowkey);
+			_bt_buildadd(wstate, s->btps_next, s->btps_lowkey, 0);
 			pfree(s->btps_lowkey);
 			s->btps_lowkey = NULL;
 		}
@@ -1111,7 +1172,8 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 	 * by filling in a valid magic number in the metapage.
 	 */
 	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, rootblkno, rootlevel);
+	_bt_initmetapage(metapage, rootblkno, rootlevel,
+					 wstate->inskey->safededup);
 	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
 }
 
@@ -1132,6 +1194,10 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		deduplicate;
+
+	deduplicate = wstate->inskey->safededup &&
+		BTGetDeduplicateItems(wstate->index);
 
 	if (merge)
 	{
@@ -1228,12 +1294,12 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 
 			if (load1)
 			{
-				_bt_buildadd(wstate, state, itup);
+				_bt_buildadd(wstate, state, itup, 0);
 				itup = tuplesort_getindextuple(btspool->sortstate, true);
 			}
 			else
 			{
-				_bt_buildadd(wstate, state, itup2);
+				_bt_buildadd(wstate, state, itup2, 0);
 				itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
 			}
 
@@ -1243,9 +1309,100 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		}
 		pfree(sortKeys);
 	}
+	else if (deduplicate)
+	{
+		/* merge is unnecessary, deduplicate into posting lists */
+		BTDedupState dstate;
+
+		dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
+		dstate->deduplicate = true; /* unused */
+		dstate->maxpostingsize = 0; /* set later */
+		/* Metadata about base tuple of current pending posting list */
+		dstate->base = NULL;
+		dstate->baseoff = InvalidOffsetNumber;	/* unused */
+		dstate->basetupsize = 0;
+		/* Metadata about current pending posting list TIDs */
+		dstate->htids = NULL;
+		dstate->nhtids = 0;
+		dstate->nitems = 0;
+		dstate->phystupsize = 0;	/* unused */
+		dstate->nintervals = 0; /* unused */
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate,
+											   true)) != NULL)
+		{
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+			{
+				state = _bt_pagestate(wstate, 0);
+
+				/*
+				 * Limit size of posting list tuples to 1/10 space we want to
+				 * leave behind on the page, plus space for final item's line
+				 * pointer.  This is equal to the space that we'd like to
+				 * leave behind on each leaf page when fillfactor is 90,
+				 * allowing us to get close to fillfactor% space utilization
+				 * when there happen to be a great many duplicates.  (This
+				 * makes higher leaf fillfactor settings ineffective when
+				 * building indexes that have many duplicates, but packing
+				 * leaf pages full with few very large tuples doesn't seem
+				 * like a useful goal.)
+				 */
+				dstate->maxpostingsize = MAXALIGN_DOWN((BLCKSZ * 10 / 100)) -
+					sizeof(ItemIdData);
+				Assert(dstate->maxpostingsize <= BTMaxItemSize(state->btps_page) &&
+					   dstate->maxpostingsize <= INDEX_SIZE_MASK);
+				dstate->htids = palloc(dstate->maxpostingsize);
+
+				/* start new pending posting list with itup copy */
+				_bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
+										InvalidOffsetNumber);
+			}
+			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
+										 itup) > keysz &&
+					 _bt_dedup_save_htid(dstate, itup))
+			{
+				/*
+				 * Tuple is equal to base tuple of pending posting list.  Heap
+				 * TID from itup has been saved in state.
+				 */
+			}
+			else
+			{
+				/*
+				 * Tuple is not equal to pending posting list tuple, or
+				 * _bt_dedup_save_htid() opted to not merge current item into
+				 * pending posting list.
+				 */
+				_bt_sort_dedup_finish_pending(wstate, state, dstate);
+				pfree(dstate->base);
+
+				/* start new pending posting list with itup copy */
+				_bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
+										InvalidOffsetNumber);
+			}
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+
+		if (state)
+		{
+			/*
+			 * Handle the last item (there must be a last item when the
+			 * tuplesort returned one or more tuples)
+			 */
+			_bt_sort_dedup_finish_pending(wstate, state, dstate);
+			pfree(dstate->base);
+			pfree(dstate->htids);
+		}
+
+		pfree(dstate);
+	}
 	else
 	{
-		/* merge is unnecessary */
+		/* merging and deduplication are both unnecessary */
 		while ((itup = tuplesort_getindextuple(btspool->sortstate,
 											   true)) != NULL)
 		{
@@ -1253,7 +1410,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			if (state == NULL)
 				state = _bt_pagestate(wstate, 0);
 
-			_bt_buildadd(wstate, state, itup);
+			_bt_buildadd(wstate, state, itup, 0);
 
 			/* Report progress */
 			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 76c2d945c8..8ba055be9e 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
 	state.minfirstrightsz = SIZE_MAX;
 	state.newitemoff = newitemoff;
 
+	/* newitem cannot be a posting list item */
+	Assert(!BTreeTupleIsPosting(newitem));
+
 	/*
 	 * maxsplits should never exceed maxoff because there will be at most as
 	 * many candidate split points as there are points _between_ tuples, once
@@ -459,6 +462,7 @@ _bt_recsplitloc(FindSplitData *state,
 	int16		leftfree,
 				rightfree;
 	Size		firstrightitemsz;
+	Size		postingsz = 0;
 	bool		newitemisfirstonright;
 
 	/* Is the new item going to be the first item on the right page? */
@@ -468,8 +472,30 @@ _bt_recsplitloc(FindSplitData *state,
 	if (newitemisfirstonright)
 		firstrightitemsz = state->newitemsz;
 	else
+	{
 		firstrightitemsz = firstoldonrightsz;
 
+		/*
+		 * Calculate suffix truncation space saving when firstright is a
+		 * posting list tuple, though only when the firstright is over 64
+		 * bytes including line pointer overhead (arbitrary).  This avoids
+		 * accessing the tuple in cases where its posting list must be very
+		 * small (if firstright has one at all).
+		 */
+		if (state->is_leaf && firstrightitemsz > 64)
+		{
+			ItemId		itemid;
+			IndexTuple	newhighkey;
+
+			itemid = PageGetItemId(state->page, firstoldonright);
+			newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+			if (BTreeTupleIsPosting(newhighkey))
+				postingsz = IndexTupleSize(newhighkey) -
+					BTreeTupleGetPostingOffset(newhighkey);
+		}
+	}
+
 	/* Account for all the old tuples */
 	leftfree = state->leftspace - olddataitemstoleft;
 	rightfree = state->rightspace -
@@ -491,11 +517,17 @@ _bt_recsplitloc(FindSplitData *state,
 	 * If we are on the leaf level, assume that suffix truncation cannot avoid
 	 * adding a heap TID to the left half's new high key when splitting at the
 	 * leaf level.  In practice the new high key will often be smaller and
-	 * will rarely be larger, but conservatively assume the worst case.
+	 * will rarely be larger, but conservatively assume the worst case.  We do
+	 * go to the trouble of subtracting away posting list overhead, though
+	 * only when it looks like it will make an appreciable difference.
+	 * (Posting lists are the only case where truncation will typically make
+	 * the final high key far smaller than firstright, so being a bit more
+	 * precise there noticeably improves the balance of free space.)
 	 */
 	if (state->is_leaf)
 		leftfree -= (int16) (firstrightitemsz +
-							 MAXALIGN(sizeof(ItemPointerData)));
+							 MAXALIGN(sizeof(ItemPointerData)) -
+							 postingsz);
 	else
 		leftfree -= (int16) firstrightitemsz;
 
@@ -691,7 +723,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
 	tup = (IndexTuple) PageGetItem(state->page, itemid);
 	/* Do cheaper test first */
-	if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+	if (BTreeTupleIsPosting(tup) ||
+		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
 	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 5ab4e712f1..5ed09640ad 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -20,6 +20,7 @@
 #include "access/nbtree.h"
 #include "access/reloptions.h"
 #include "access/relscan.h"
+#include "catalog/catalog.h"
 #include "commands/progress.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -107,7 +108,13 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	 */
 	key = palloc(offsetof(BTScanInsertData, scankeys) +
 				 sizeof(ScanKeyData) * indnkeyatts);
-	key->heapkeyspace = itup == NULL || _bt_heapkeyspace(rel);
+	if (itup)
+		_bt_metaversion(rel, &key->heapkeyspace, &key->safededup);
+	else
+	{
+		key->heapkeyspace = true;
+		key->safededup = _bt_opclasses_support_dedup(rel);
+	}
 	key->anynullkeys = false;	/* initial assumption */
 	key->nextkey = false;
 	key->pivotsearch = false;
@@ -1373,6 +1380,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			continue;
 		}
 
@@ -1534,6 +1542,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			cmpresult = 0;
 			if (subkey->sk_flags & SK_ROW_END)
 				break;
@@ -1773,10 +1782,65 @@ _bt_killitems(IndexScanDesc scan)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
+			bool		killtuple = false;
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			if (BTreeTupleIsPosting(ituple))
 			{
-				/* found the item */
+				int			pi = i + 1;
+				int			nposting = BTreeTupleGetNPosting(ituple);
+				int			j;
+
+				/*
+				 * Note that we rely on the assumption that heap TIDs in the
+				 * scanpos items array are always in ascending heap TID order
+				 * within a posting list
+				 */
+				for (j = 0; j < nposting; j++)
+				{
+					ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+					if (!ItemPointerEquals(item, &kitem->heapTid))
+						break;	/* out of posting list loop */
+
+					/* kitem must have matching offnum when heap TIDs match */
+					Assert(kitem->indexOffset == offnum);
+
+					/*
+					 * Read-ahead to later kitems here.
+					 *
+					 * We rely on the assumption that not advancing kitem here
+					 * will prevent us from considering the posting list tuple
+					 * fully dead by not matching its next heap TID in next
+					 * loop iteration.
+					 *
+					 * If, on the other hand, this is the final heap TID in
+					 * the posting list tuple, then tuple gets killed
+					 * regardless (i.e. we handle the case where the last
+					 * kitem is also the last heap TID in the last index tuple
+					 * correctly -- posting tuple still gets killed).
+					 */
+					if (pi < numKilled)
+						kitem = &so->currPos.items[so->killedItems[pi++]];
+				}
+
+				/*
+				 * Don't bother advancing the outermost loop's int iterator to
+				 * avoid processing killed items that relate to the same
+				 * offnum/posting list tuple.  This micro-optimization hardly
+				 * seems worth it.  (Further iterations of the outermost loop
+				 * will fail to match on this same posting list's first heap
+				 * TID instead, so we'll advance to the next offnum/index
+				 * tuple pretty quickly.)
+				 */
+				if (j == nposting)
+					killtuple = true;
+			}
+			else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+				killtuple = true;
+
+			if (killtuple)
+			{
+				/* found the item/all posting list items */
 				ItemIdMarkDead(iid);
 				killedsomething = true;
 				break;			/* out of inner search loop */
@@ -2017,7 +2081,9 @@ btoptions(Datum reloptions, bool validate)
 	static const relopt_parse_elt tab[] = {
 		{"fillfactor", RELOPT_TYPE_INT, offsetof(BTOptions, fillfactor)},
 		{"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
-		offsetof(BTOptions, vacuum_cleanup_index_scale_factor)}
+		offsetof(BTOptions, vacuum_cleanup_index_scale_factor)},
+		{"deduplicate_items", RELOPT_TYPE_BOOL,
+		offsetof(BTOptions, deduplicate_items)}
 
 	};
 
@@ -2118,11 +2184,10 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	Size		newsize;
 
 	/*
-	 * We should only ever truncate leaf index tuples.  It's never okay to
-	 * truncate a second time.
+	 * We should only ever truncate non-pivot tuples from leaf pages.  It's
+	 * never okay to truncate when splitting an internal page.
 	 */
-	Assert(BTreeTupleGetNAtts(lastleft, rel) == natts);
-	Assert(BTreeTupleGetNAtts(firstright, rel) == natts);
+	Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
 
 	/* Determine how many attributes must be kept in truncated tuple */
 	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
@@ -2138,6 +2203,19 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(pivot))
+		{
+			/*
+			 * index_truncate_tuple() just returns a straight copy of
+			 * firstright when it has no key attributes to truncate.  We need
+			 * to truncate away the posting list ourselves.
+			 */
+			Assert(keepnatts == nkeyatts);
+			Assert(natts == nkeyatts);
+			pivot->t_info &= ~INDEX_SIZE_MASK;
+			pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+		}
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2154,6 +2232,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * attribute to the new pivot tuple.
 		 */
 		Assert(natts != nkeyatts);
+		Assert(!BTreeTupleIsPosting(lastleft) &&
+			   !BTreeTupleIsPosting(firstright));
 		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
 		tidpivot = palloc0(newsize);
 		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2171,6 +2251,19 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
 		pivot = palloc0(newsize);
 		memcpy(pivot, firstright, IndexTupleSize(firstright));
+
+		if (BTreeTupleIsPosting(firstright))
+		{
+			/*
+			 * New pivot tuple was copied from firstright, which happens to be
+			 * a posting list tuple.  We will have to include the max lastleft
+			 * heap TID in the final pivot tuple, but we can remove the
+			 * posting list now. (Pivot tuples should never contain a posting
+			 * list.)
+			 */
+			newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+				MAXALIGN(sizeof(ItemPointerData));
+		}
 	}
 
 	/*
@@ -2198,7 +2291,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2209,9 +2302,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
+							  BTreeTupleGetHeapTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #else
 
 	/*
@@ -2224,7 +2320,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2233,7 +2329,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2314,13 +2411,16 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
  * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal after detoasting.
+ * unless they're bitwise equal after detoasting.  When an index is considered
+ * deduplication-safe by _bt_opclasses_support_dedup, routine is guaranteed to
+ * give the same result as _bt_keep_natts would.
  *
- * These issues must be acceptable to callers, typically because they're only
- * concerned about making suffix truncation as effective as possible without
- * leaving excessive amounts of free space on either side of page split.
  * Callers can rely on the fact that attributes considered equal here are
- * definitely also equal according to _bt_keep_natts.
+ * definitely also equal according to _bt_keep_natts, even when the index uses
+ * an opclass or collation that is not deduplication-safe.  This weaker
+ * guarantee is good enough for nbtsplitloc.c caller, since false negatives
+ * generally only have the effect of making leaf page splits use a more
+ * balanced split point.
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2392,28 +2492,42 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * Mask allocated for number of keys in index tuple must be able to fit
 	 * maximum possible number of index attributes
 	 */
-	StaticAssertStmt(BT_N_KEYS_OFFSET_MASK >= INDEX_MAX_KEYS,
-					 "BT_N_KEYS_OFFSET_MASK can't fit INDEX_MAX_KEYS");
+	StaticAssertStmt(BT_OFFSET_MASK >= INDEX_MAX_KEYS,
+					 "BT_OFFSET_MASK can't fit INDEX_MAX_KEYS");
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
+	/* !heapkeyspace indexes do not support deduplication */
+	if (!heapkeyspace && BTreeTupleIsPosting(itup))
+		return false;
+
+	/* Posting list tuples should never have "pivot heap TID" bit set */
+	if (BTreeTupleIsPosting(itup) &&
+		(ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+		 BT_PIVOT_HEAP_TID_ATTR) != 0)
+		return false;
+
+	/* INCLUDE indexes do not support deduplication */
+	if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+		return false;
+
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
 			/*
-			 * Non-pivot tuples currently never use alternative heap TID
-			 * representation -- even those within heapkeyspace indexes
+			 * Non-pivot tuple should never be explicitly marked as a pivot
+			 * tuple
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated.  (Note that tupnatts must have been
-			 * inferred, rather than coming from an explicit on-disk
-			 * representation.)
+			 * inferred, even with a posting list tuple, because only pivot
+			 * tuples store tupnatts directly.)
 			 */
 			return tupnatts == natts;
 		}
@@ -2457,12 +2571,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * non-zero, or when there is no explicit representation and the
 			 * tuple is evidently not a pre-pg_upgrade tuple.
 			 *
-			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
-			 * that to decide if the tuple is a pre-v11 tuple.
+			 * Prior to v11, downlinks always had P_HIKEY as their offset.
+			 * Accept that as an alternative indication of a valid
+			 * !heapkeyspace negative infinity tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
-				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+				ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
 		}
 		else
 		{
@@ -2488,7 +2602,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
+		return false;
+
+	/* Pivot tuple should not use posting list representation (redundant) */
+	if (BTreeTupleIsPosting(itup))
 		return false;
 
 	/*
@@ -2558,11 +2676,53 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					BTMaxItemSizeNoHeapTid(page),
 					RelationGetRelationName(rel)),
 			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
-					   ItemPointerGetBlockNumber(&newtup->t_tid),
-					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+					   ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
 					   RelationGetRelationName(heap)),
 			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
 					 "Consider a function index of an MD5 hash of the value, "
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Is it safe to perform deduplication for an index, given the opclasses and
+ * collations used?
+ *
+ * Returned value is stored in index metapage during index builds.  Function
+ * does not account for incompatibilities caused by index being on an earlier
+ * nbtree version.
+ */
+bool
+_bt_opclasses_support_dedup(Relation index)
+{
+	/* INCLUDE indexes don't support deduplication */
+	if (IndexRelationGetNumberOfAttributes(index) !=
+		IndexRelationGetNumberOfKeyAttributes(index))
+		return false;
+
+	/*
+	 * There is no reason why deduplication cannot be used with system catalog
+	 * indexes.  However, we deem it generally unsafe because it's not clear
+	 * how it could be disabled.  (ALTER INDEX is not supported with system
+	 * catalog indexes, so users have no way to set the storage parameter.)
+	 */
+	if (IsCatalogRelation(index))
+		return false;
+
+	for (int i = 0; i < IndexRelationGetNumberOfKeyAttributes(index); i++)
+	{
+		Oid			opfamily = index->rd_opfamily[i];
+		Oid			collation = index->rd_indcollation[i];
+
+		/* TODO add adequate check of opclasses and collations */
+		elog(DEBUG4, "index %s column i %d opfamilyOid %u collationOid %u",
+			 RelationGetRelationName(index), i, opfamily, collation);
+
+		/* NUMERIC btree opfamily OID is 1988 */
+		if (opfamily == 1988)
+			return false;
+	}
+
+	return true;
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 2e5202c2d6..6640f33dd3 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -22,6 +22,9 @@
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/procarray.h"
+#include "utils/memutils.h"
+
+static MemoryContext opCtx;		/* working memory for operations */
 
 /*
  * _bt_restore_page -- re-enter all the index tuples on a page
@@ -111,6 +114,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
 	Assert(md->btm_version >= BTREE_NOVAC_VERSION);
 	md->btm_oldest_btpo_xact = xlrec->oldest_btpo_xact;
 	md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
+	md->btm_safededup = xlrec->safededup;
 
 	pageop = (BTPageOpaque) PageGetSpecialPointer(metapg);
 	pageop->btpo_flags = BTP_META;
@@ -156,7 +160,8 @@ _bt_clear_incomplete_split(XLogReaderState *record, uint8 block_id)
 }
 
 static void
-btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
+btree_xlog_insert(bool isleaf, bool ismeta, bool posting,
+				  XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	xl_btree_insert *xlrec = (xl_btree_insert *) XLogRecGetData(record);
@@ -181,9 +186,52 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 
 		page = BufferGetPage(buffer);
 
-		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
-						false, false) == InvalidOffsetNumber)
-			elog(PANIC, "btree_xlog_insert: failed to add item");
+		if (!posting)
+		{
+			/* Simple retail insertion */
+			if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+		}
+		else
+		{
+			ItemId		itemid;
+			IndexTuple	oposting,
+						newitem,
+						nposting;
+			uint16		postingoff;
+
+			/*
+			 * A posting list split occurred during leaf page insertion.  WAL
+			 * record data will start with an offset number representing the
+			 * point in an existing posting list that a split occurs at.
+			 *
+			 * Use _bt_swap_posting() to repeat posting list split steps from
+			 * primary.  Note that newitem from WAL record is 'orignewitem',
+			 * not the final version of newitem that is actually inserted on
+			 * page.
+			 */
+			postingoff = *((uint16 *) datapos);
+			datapos += sizeof(uint16);
+			datalen -= sizeof(uint16);
+
+			itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+			oposting = (IndexTuple) PageGetItem(page, itemid);
+
+			/* Use mutable, aligned newitem copy in _bt_swap_posting() */
+			Assert(isleaf && postingoff > 0);
+			newitem = CopyIndexTuple((IndexTuple) datapos);
+			nposting = _bt_swap_posting(newitem, oposting, postingoff);
+
+			/* Replace existing posting list with post-split version */
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+			/* Insert "final" new item (not orignewitem from WAL stream) */
+			Assert(IndexTupleSize(newitem) == datalen);
+			if (PageAddItem(page, (Item) newitem, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add posting split new item");
+		}
 
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
@@ -265,20 +313,38 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 		BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
 		OffsetNumber off;
 		IndexTuple	newitem = NULL,
-					left_hikey = NULL;
+					left_hikey = NULL,
+					nposting = NULL;
 		Size		newitemsz = 0,
 					left_hikeysz = 0;
 		Page		newlpage;
-		OffsetNumber leftoff;
+		OffsetNumber leftoff,
+					replacepostingoff = InvalidOffsetNumber;
 
 		datapos = XLogRecGetBlockData(record, 0, &datalen);
 
-		if (onleft)
+		if (onleft || xlrec->postingoff != 0)
 		{
 			newitem = (IndexTuple) datapos;
 			newitemsz = MAXALIGN(IndexTupleSize(newitem));
 			datapos += newitemsz;
 			datalen -= newitemsz;
+
+			if (xlrec->postingoff != 0)
+			{
+				ItemId		itemid;
+				IndexTuple	oposting;
+
+				/* Posting list must be at offset number before new item's */
+				replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+				/* Use mutable, aligned newitem copy in _bt_swap_posting() */
+				newitem = CopyIndexTuple(newitem);
+				itemid = PageGetItemId(lpage, replacepostingoff);
+				oposting = (IndexTuple) PageGetItem(lpage, itemid);
+				nposting = _bt_swap_posting(newitem, oposting,
+											xlrec->postingoff);
+			}
 		}
 
 		/*
@@ -308,8 +374,20 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 			Size		itemsz;
 			IndexTuple	item;
 
+			/* Add replacement posting list when required */
+			if (off == replacepostingoff)
+			{
+				Assert(onleft || xlrec->firstright == xlrec->newitemoff);
+				if (PageAddItem(newlpage, (Item) nposting,
+								MAXALIGN(IndexTupleSize(nposting)), leftoff,
+								false, false) == InvalidOffsetNumber)
+					elog(ERROR, "failed to add new posting list item to left page after split");
+				leftoff = OffsetNumberNext(leftoff);
+				continue;		/* don't insert oposting */
+			}
+
 			/* add the new item if it was inserted on left page */
-			if (onleft && off == xlrec->newitemoff)
+			else if (onleft && off == xlrec->newitemoff)
 			{
 				if (PageAddItem(newlpage, (Item) newitem, newitemsz, leftoff,
 								false, false) == InvalidOffsetNumber)
@@ -383,6 +461,98 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 	}
 }
 
+static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+	Buffer		buf;
+
+	if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+	{
+		char	   *ptr = XLogRecGetBlockData(record, 0, NULL);
+		Page		page = (Page) BufferGetPage(buf);
+		BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		OffsetNumber offnum,
+					minoff,
+					maxoff;
+		BTDedupState state;
+		BTDedupInterval *intervals;
+		Page		newpage;
+
+		state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+		state->deduplicate = true;	/* unused */
+		/* Conservatively use larger maxpostingsize than primary */
+		state->maxpostingsize = BTMaxItemSize(page);
+		state->base = NULL;
+		state->baseoff = InvalidOffsetNumber;
+		state->basetupsize = 0;
+		state->htids = palloc(state->maxpostingsize);
+		state->nhtids = 0;
+		state->nitems = 0;
+		state->phystupsize = 0;
+		state->nintervals = 0;
+
+		minoff = P_FIRSTDATAKEY(opaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+		newpage = PageGetTempPageCopySpecial(page);
+
+		if (!P_RIGHTMOST(opaque))
+		{
+			ItemId		itemid = PageGetItemId(page, P_HIKEY);
+			Size		itemsz = ItemIdGetLength(itemid);
+			IndexTuple	item = (IndexTuple) PageGetItem(page, itemid);
+
+			if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+							false, false) == InvalidOffsetNumber)
+				elog(ERROR, "failed to add highkey during deduplication");
+		}
+
+		intervals = (BTDedupInterval *) ptr;
+		for (offnum = minoff;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemid = PageGetItemId(page, offnum);
+			IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+			if (offnum == minoff)
+				_bt_dedup_start_pending(state, itup, offnum);
+			else if (state->nintervals < xlrec->nintervals &&
+					 state->baseoff == intervals[state->nintervals].baseoff &&
+					 state->nitems < intervals[state->nintervals].nitems)
+			{
+				if (!_bt_dedup_save_htid(state, itup))
+					elog(ERROR, "could not add heap tid to pending posting list");
+			}
+			else
+			{
+				_bt_dedup_finish_pending(newpage, state);
+				_bt_dedup_start_pending(state, itup, offnum);
+			}
+		}
+
+		_bt_dedup_finish_pending(newpage, state);
+		Assert(state->nintervals == xlrec->nintervals);
+		Assert(memcmp(state->intervals, intervals,
+					  state->nintervals * sizeof(BTDedupInterval)) == 0);
+
+		if (P_HAS_GARBAGE(opaque))
+		{
+			BTPageOpaque nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+			nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+		}
+
+		PageRestoreTempPage(newpage, page);
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buf);
+	}
+
+	if (BufferIsValid(buf))
+		UnlockReleaseBuffer(buf);
+}
+
 static void
 btree_xlog_vacuum(XLogReaderState *record)
 {
@@ -405,7 +575,56 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		page = (Page) BufferGetPage(buffer);
 
-		PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
+		if (xlrec->nupdated > 0)
+		{
+			OffsetNumber *updatedoffsets;
+			xl_btree_update *updates;
+
+			updatedoffsets = (OffsetNumber *)
+				(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+			updates = (xl_btree_update *) ((char *) updatedoffsets +
+										   xlrec->nupdated *
+										   sizeof(OffsetNumber));
+
+			for (int i = 0; i < xlrec->nupdated; i++)
+			{
+				BTVacuumPosting vacposting;
+				IndexTuple	origtuple;
+				ItemId		itemid;
+				Size		itemsz;
+
+				itemid = PageGetItemId(page, updatedoffsets[i]);
+				origtuple = (IndexTuple) PageGetItem(page, itemid);
+
+				vacposting = palloc(offsetof(BTVacuumPostingData, deletetids) +
+									updates->ndeletedtids * sizeof(uint16));
+				vacposting->updatedoffset = updatedoffsets[i];
+				vacposting->itup = origtuple;
+				vacposting->ndeletedtids = updates->ndeletedtids;
+				memcpy(vacposting->deletetids,
+					   (char *) updates + SizeOfBtreeUpdate,
+					   updates->ndeletedtids * sizeof(uint16));
+
+				_bt_update_posting(vacposting);
+
+				/* Overwrite updated version of tuple */
+				itemsz = MAXALIGN(IndexTupleSize(vacposting->itup));
+				if (!PageIndexTupleOverwrite(page, updatedoffsets[i],
+											 (Item) vacposting->itup, itemsz))
+					elog(PANIC, "could not update partially dead item");
+
+				pfree(vacposting->itup);
+				pfree(vacposting);
+
+				/* advance to next xl_btree_update/update */
+				updates = (xl_btree_update *)
+					((char *) updates + SizeOfBtreeUpdate +
+					 updates->ndeletedtids * sizeof(uint16));
+			}
+		}
+
+		if (xlrec->ndeleted > 0)
+			PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 
 		/*
 		 * Mark the page as not containing any LP_DEAD items --- see comments
@@ -724,17 +943,22 @@ void
 btree_redo(XLogReaderState *record)
 {
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	MemoryContext oldCtx;
 
+	oldCtx = MemoryContextSwitchTo(opCtx);
 	switch (info)
 	{
 		case XLOG_BTREE_INSERT_LEAF:
-			btree_xlog_insert(true, false, record);
+			btree_xlog_insert(true, false, false, record);
 			break;
 		case XLOG_BTREE_INSERT_UPPER:
-			btree_xlog_insert(false, false, record);
+			btree_xlog_insert(false, false, false, record);
 			break;
 		case XLOG_BTREE_INSERT_META:
-			btree_xlog_insert(false, true, record);
+			btree_xlog_insert(false, true, false, record);
+			break;
+		case XLOG_BTREE_INSERT_POST:
+			btree_xlog_insert(true, false, true, record);
 			break;
 		case XLOG_BTREE_SPLIT_L:
 			btree_xlog_split(true, record);
@@ -742,6 +966,9 @@ btree_redo(XLogReaderState *record)
 		case XLOG_BTREE_SPLIT_R:
 			btree_xlog_split(false, record);
 			break;
+		case XLOG_BTREE_DEDUP:
+			btree_xlog_dedup(record);
+			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
 			break;
@@ -767,6 +994,23 @@ btree_redo(XLogReaderState *record)
 		default:
 			elog(PANIC, "btree_redo: unknown op code %u", info);
 	}
+	MemoryContextSwitchTo(oldCtx);
+	MemoryContextReset(opCtx);
+}
+
+void
+btree_xlog_startup(void)
+{
+	opCtx = AllocSetContextCreate(CurrentMemoryContext,
+								  "Btree recovery temporary context",
+								  ALLOCSET_DEFAULT_SIZES);
+}
+
+void
+btree_xlog_cleanup(void)
+{
+	MemoryContextDelete(opCtx);
+	opCtx = NULL;
 }
 
 /*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 7d63a7124e..7bbe55c5cf 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -27,6 +27,7 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 		case XLOG_BTREE_INSERT_LEAF:
 		case XLOG_BTREE_INSERT_UPPER:
 		case XLOG_BTREE_INSERT_META:
+		case XLOG_BTREE_INSERT_POST:
 			{
 				xl_btree_insert *xlrec = (xl_btree_insert *) rec;
 
@@ -38,15 +39,24 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
-				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d",
-								 xlrec->level, xlrec->firstright, xlrec->newitemoff);
+				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, postingoff %d",
+								 xlrec->level, xlrec->firstright,
+								 xlrec->newitemoff, xlrec->postingoff);
+				break;
+			}
+		case XLOG_BTREE_DEDUP:
+			{
+				xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+				appendStringInfo(buf, "nintervals %u", xlrec->nintervals);
 				break;
 			}
 		case XLOG_BTREE_VACUUM:
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "ndeleted %u", xlrec->ndeleted);
+				appendStringInfo(buf, "ndeleted %u; nupdated %u",
+								 xlrec->ndeleted, xlrec->nupdated);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
@@ -130,6 +140,12 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
+		case XLOG_BTREE_DEDUP:
+			id = "DEDUP";
+			break;
+		case XLOG_BTREE_INSERT_POST:
+			id = "INSERT_POST";
+			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 4ea6ea7a3d..f57ea0a0e7 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1048,8 +1048,10 @@ PageIndexTupleDeleteNoCompact(Page page, OffsetNumber offnum)
  * This is better than deleting and reinserting the tuple, because it
  * avoids any data shifting when the tuple size doesn't change; and
  * even when it does, we avoid moving the line pointers around.
- * Conceivably this could also be of use to an index AM that cares about
- * the physical order of tuples as well as their ItemId order.
+ * This could be used by an index AM that doesn't want to unset the
+ * LP_DEAD bit when it happens to be set.  It could conceivably also be
+ * used by an index AM that cares about the physical order of tuples as
+ * well as their logical/ItemId order.
  *
  * If there's insufficient space for the new tuple, return false.  Other
  * errors represent data-corruption problems, so we just elog.
@@ -1135,7 +1137,8 @@ PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
 	}
 
 	/* Update the item's tuple length (other fields shouldn't change) */
-	ItemIdSetNormal(tupid, offset + size_diff, newsize);
+	tupid->lp_off = offset + size_diff;
+	tupid->lp_len = newsize;
 
 	/* Copy new tuple data onto page */
 	memcpy(PageGetItem(page, tupid), newtup, newsize);
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index dc03fbde13..b6b08d0ccb 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1731,14 +1731,14 @@ psql_completion(const char *text, int start, int end)
 	/* ALTER INDEX <foo> SET|RESET ( */
 	else if (Matches("ALTER", "INDEX", MatchAny, "RESET", "("))
 		COMPLETE_WITH("fillfactor",
-					  "vacuum_cleanup_index_scale_factor",	/* BTREE */
+					  "vacuum_cleanup_index_scale_factor", "deduplicate_items",	/* BTREE */
 					  "fastupdate", "gin_pending_list_limit",	/* GIN */
 					  "buffering",	/* GiST */
 					  "pages_per_range", "autosummarize"	/* BRIN */
 			);
 	else if (Matches("ALTER", "INDEX", MatchAny, "SET", "("))
 		COMPLETE_WITH("fillfactor =",
-					  "vacuum_cleanup_index_scale_factor =",	/* BTREE */
+					  "vacuum_cleanup_index_scale_factor =", "deduplicate_items =",	/* BTREE */
 					  "fastupdate =", "gin_pending_list_limit =",	/* GIN */
 					  "buffering =",	/* GiST */
 					  "pages_per_range =", "autosummarize ="	/* BRIN */
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 6a058ccdac..359b5c18dc 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -145,6 +145,7 @@ static void bt_tuple_present_callback(Relation index, ItemPointer tid,
 									  bool tupleIsAlive, void *checkstate);
 static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 									 IndexTuple itup);
+static inline IndexTuple bt_posting_plain_tuple(IndexTuple itup, int n);
 static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 											   OffsetNumber offset);
@@ -167,6 +168,7 @@ static ItemId PageGetItemIdCareful(BtreeCheckState *state, BlockNumber block,
 								   Page page, OffsetNumber offset);
 static inline ItemPointer BTreeTupleGetHeapTIDCareful(BtreeCheckState *state,
 													  IndexTuple itup, bool nonpivot);
+static inline ItemPointer BTreeTupleGetPointsToTID(IndexTuple itup);
 
 /*
  * bt_index_check(index regclass, heapallindexed boolean)
@@ -278,7 +280,8 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
 
 	if (btree_index_mainfork_expected(indrel))
 	{
-		bool	heapkeyspace;
+		bool		heapkeyspace,
+					safededup;
 
 		RelationOpenSmgr(indrel);
 		if (!smgrexists(indrel->rd_smgr, MAIN_FORKNUM))
@@ -288,7 +291,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
 							RelationGetRelationName(indrel))));
 
 		/* Check index, possibly against table it is an index on */
-		heapkeyspace = _bt_heapkeyspace(indrel);
+		_bt_metaversion(indrel, &heapkeyspace, &safededup);
 		bt_check_every_level(indrel, heaprel, heapkeyspace, parentcheck,
 							 heapallindexed, rootdescend);
 	}
@@ -419,12 +422,12 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		/*
 		 * Size Bloom filter based on estimated number of tuples in index,
 		 * while conservatively assuming that each block must contain at least
-		 * MaxIndexTuplesPerPage / 5 non-pivot tuples.  (Non-leaf pages cannot
-		 * contain non-pivot tuples.  That's okay because they generally make
-		 * up no more than about 1% of all pages in the index.)
+		 * MaxTIDsPerBTreePage / 3 "plain" tuples -- see
+		 * bt_posting_plain_tuple() for definition, and details of how posting
+		 * list tuples are handled.
 		 */
 		total_pages = RelationGetNumberOfBlocks(rel);
-		total_elems = Max(total_pages * (MaxIndexTuplesPerPage / 5),
+		total_elems = Max(total_pages * (MaxTIDsPerBTreePage / 3),
 						  (int64) state->rel->rd_rel->reltuples);
 		/* Random seed relies on backend srandom() call to avoid repetition */
 		seed = random();
@@ -924,6 +927,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -954,13 +958,15 @@ bt_target_page_check(BtreeCheckState *state)
 		if (!_bt_check_natts(state->rel, state->heapkeyspace, state->target,
 							 offset))
 		{
+			ItemPointer tid;
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetPointsToTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -994,18 +1000,20 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * TID, since the posting list itself is validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
+			ItemPointer tid = BTreeTupleGetPointsToTID(itup);
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
-			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+			htid = psprintf("(%u,%u)", ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1017,6 +1025,40 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is a posting list tuple, make sure posting list TIDs are
+		 * in order
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+				current = BTreeTupleGetPostingN(itup, i);
+
+				if (ItemPointerCompare(current, &last) <= 0)
+				{
+					char	   *itid = psprintf("(%u,%u)", state->targetblock, offset);
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s posting list offset=%d page lsn=%X/%X.",
+												itid, i,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1049,13 +1091,14 @@ bt_target_page_check(BtreeCheckState *state)
 		if (tupsize > (lowersizelimit ? BTMaxItemSize(state->target) :
 					   BTMaxItemSizeNoHeapTid(state->target)))
 		{
+			ItemPointer tid = BTreeTupleGetPointsToTID(itup);
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1074,12 +1117,32 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				/* Fingerprint all elements as distinct "plain" tuples */
+				for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					IndexTuple	logtuple;
+
+					logtuple = bt_posting_plain_tuple(itup, i);
+					norm = bt_normalize_tuple(state, logtuple);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != logtuple)
+						pfree(norm);
+					pfree(logtuple);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
@@ -1087,7 +1150,8 @@ bt_target_page_check(BtreeCheckState *state)
 		 *
 		 * If there is a high key (if this is not the rightmost page on its
 		 * entire level), check that high key actually is upper bound on all
-		 * page items.
+		 * page items.  If this is a posting list tuple, we'll need to set
+		 * scantid to be highest TID in posting list.
 		 *
 		 * We prefer to check all items against high key rather than checking
 		 * just the last and trusting that the operator class obeys the
@@ -1127,17 +1191,22 @@ bt_target_page_check(BtreeCheckState *state)
 		 * tuple. (See also: "Notes About Data Representation" in the nbtree
 		 * README.)
 		 */
+		scantid = skey->scantid;
+		if (state->heapkeyspace && BTreeTupleIsPosting(itup))
+			skey->scantid = BTreeTupleGetMaxHeapTID(itup);
+
 		if (!P_RIGHTMOST(topaque) &&
 			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
 			  invariant_l_offset(state, skey, P_HIKEY)))
 		{
+			ItemPointer tid = BTreeTupleGetPointsToTID(itup);
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1150,6 +1219,8 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
+		/* Reset, in case scantid was set to (itup) posting tuple's max TID */
+		skey->scantid = scantid;
 
 		/*
 		 * * Item order check *
@@ -1160,15 +1231,17 @@ bt_target_page_check(BtreeCheckState *state)
 		if (OffsetNumberNext(offset) <= max &&
 			!invariant_l_offset(state, skey, OffsetNumberNext(offset)))
 		{
+			ItemPointer tid;
 			char	   *itid,
 					   *htid,
 					   *nitid,
 					   *nhtid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetPointsToTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 			nitid = psprintf("(%u,%u)", state->targetblock,
 							 OffsetNumberNext(offset));
 
@@ -1177,9 +1250,10 @@ bt_target_page_check(BtreeCheckState *state)
 										  state->target,
 										  OffsetNumberNext(offset));
 			itup = (IndexTuple) PageGetItem(state->target, itemid);
+			tid = BTreeTupleGetPointsToTID(itup);
 			nhtid = psprintf("(%u,%u)",
-							 ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							 ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							 ItemPointerGetBlockNumberNoCheck(tid),
+							 ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1953,10 +2027,9 @@ bt_tuple_present_callback(Relation index, ItemPointer tid, Datum *values,
  * verification.  In particular, it won't try to normalize opclass-equal
  * datums with potentially distinct representations (e.g., btree/numeric_ops
  * index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though.  For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation.
  */
 static IndexTuple
 bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -1969,6 +2042,9 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	IndexTuple	reformed;
 	int			i;
 
+	/* Caller should only pass "logical" non-pivot tuples here */
+	Assert(!BTreeTupleIsPosting(itup) && !BTreeTupleIsPivot(itup));
+
 	/* Easy case: It's immediately clear that tuple has no varlena datums */
 	if (!IndexTupleHasVarwidths(itup))
 		return itup;
@@ -2031,6 +2107,29 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	return reformed;
 }
 
+/*
+ * Produce palloc()'d "plain" tuple for nth posting list entry/TID.
+ *
+ * In general, deduplication is not supposed to change the logical contents of
+ * an index.  Multiple index tuples are merged together into one equivalent
+ * posting list index tuple when convenient.
+ *
+ * heapallindexed verification must normalize-away this variation in
+ * representation by converting posting list tuples into two or more "plain"
+ * tuples.  Each tuple must be fingerprinted separately -- there must be one
+ * tuple for each corresponding Bloom filter probe during the heap scan.
+ *
+ * Note: Caller still needs to call bt_normalize_tuple() with returned tuple.
+ */
+static inline IndexTuple
+bt_posting_plain_tuple(IndexTuple itup, int n)
+{
+	Assert(BTreeTupleIsPosting(itup));
+
+	/* Returns non-posting-list tuple */
+	return _bt_form_posting(itup, BTreeTupleGetPostingN(itup, n), 1);
+}
+
 /*
  * Search for itup in index, starting from fast root page.  itup must be a
  * non-pivot tuple.  This is only supported with heapkeyspace indexes, since
@@ -2087,6 +2186,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		insertstate.itup = itup;
 		insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 		insertstate.itup_key = key;
+		insertstate.postingoff = 0;
 		insertstate.bounds_valid = false;
 		insertstate.buf = lbuf;
 
@@ -2094,7 +2194,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		offnum = _bt_binsrch_insert(state->rel, &insertstate);
 		/* Compare first >= matching item on leaf page, if any */
 		page = BufferGetPage(lbuf);
+		/* Should match on first heap TID when tuple has a posting list */
 		if (offnum <= PageGetMaxOffsetNumber(page) &&
+			insertstate.postingoff <= 0 &&
 			_bt_compare(state->rel, key, page, offnum) == 0)
 			exists = true;
 		_bt_relbuf(state->rel, lbuf);
@@ -2548,26 +2650,69 @@ PageGetItemIdCareful(BtreeCheckState *state, BlockNumber block, Page page,
 }
 
 /*
- * BTreeTupleGetHeapTID() wrapper that lets caller enforce that a heap TID must
- * be present in cases where that is mandatory.
- *
- * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
- * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
- * It may become more useful in the future, when non-pivot tuples support their
- * own alternative INDEX_ALT_TID_MASK representation.
+ * BTreeTupleGetHeapTID() wrapper that enforces that a heap TID is present in
+ * cases where that is mandatory (i.e. for non-pivot tuples)
  */
 static inline ItemPointer
 BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
 							bool nonpivot)
 {
-	ItemPointer result = BTreeTupleGetHeapTID(itup);
-	BlockNumber targetblock = state->targetblock;
+	ItemPointer htid;
 
-	if (result == NULL && nonpivot)
+	/*
+	 * Caller determines whether this is supposed to be a pivot or non-pivot
+	 * tuple using page type and item offset number.  Verify that tuple
+	 * metadata agrees with this.
+	 */
+	Assert(state->heapkeyspace);
+	if (BTreeTupleIsPivot(itup) && nonpivot)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg_internal("block %u or its right sibling block or child block in index \"%s\" has unexpected pivot tuple",
+								 state->targetblock,
+								 RelationGetRelationName(state->rel))));
+
+	if (!BTreeTupleIsPivot(itup) && !nonpivot)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg_internal("block %u or its right sibling block or child block in index \"%s\" has unexpected non-pivot tuple",
+								 state->targetblock,
+								 RelationGetRelationName(state->rel))));
+
+	htid = BTreeTupleGetHeapTID(itup);
+	if (!ItemPointerIsValid(htid) && nonpivot)
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
-						targetblock, RelationGetRelationName(state->rel))));
+						state->targetblock,
+						RelationGetRelationName(state->rel))));
 
-	return result;
+	return htid;
+}
+
+/*
+ * Return the "pointed to" TID for itup, which is used to generate a
+ * descriptive error message.  itup must be a "data item" tuple (it wouldn't
+ * make much sense to call here with a high key tuple, since there won't be a
+ * valid downlink/block number to display).
+ *
+ * Returns either a heap TID (which will be the first heap TID in posting list
+ * if itup is posting list tuple), or a TID that contains downlink block
+ * number, plus some encoded metadata (e.g., the number of attributes present
+ * in itup).
+ */
+static inline ItemPointer
+BTreeTupleGetPointsToTID(IndexTuple itup)
+{
+	/*
+	 * Rely on the assumption that !heapkeyspace internal page data items will
+	 * correctly return TID with downlink here -- BTreeTupleGetHeapTID() won't
+	 * recognize it as a pivot tuple, but everything still works out because
+	 * the t_tid field is still returned
+	 */
+	if (!BTreeTupleIsPivot(itup))
+		return BTreeTupleGetHeapTID(itup);
+
+	/* Pivot tuple returns TID with downlink block (heapkeyspace variant) */
+	return &itup->t_tid;
 }
diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
index 5881ea5dd6..de5b7c8db8 100644
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@@ -430,14 +430,142 @@ returns bool
 
 </sect1>
 
-<sect1 id="btree-implementation">
- <title>Implementation</title>
+<sect1 id="btree-storage">
+ <title>Physical Storage</title>
+
+ <para>
+  <productname>PostgreSQL</productname> B-Tree indexes are multi-level
+  tree structures, where each level of the tree can be used as a
+  doubly-linked list of pages.  A single metapage is stored in a fixed
+  position at the start of the first segment file of the index.  All
+  other pages are either leaf pages or internal pages.  Typically, the
+  vast majority of all pages are leaf pages unless the index is very
+  small.  Leaf pages are the pages on the lowest level of the tree.
+  All other levels consist of internal pages.  
+ </para>
+ <para>
+  Each leaf page contains tuples that point to table entries using a
+  heap item pointer.  Each tuple is considered unique internally,
+  since the item pointer is treated as a tiebreaker column.  Each
+  internal page contains tuples that point to the next level down in
+  the tree.  Both internal pages and leaf pages use the standard page
+  format described in <xref linkend="storage-page-layout"/>.  Index
+  scans use internal pages to locate the first leaf page that could
+  have matching tuples.
+ </para>
+
+ <sect2 id="btree-maintain-structure">
+  <title>Maintaining the Tree Structure</title>
+  <para>
+   New pages are added to a B-Tree index when an existing page becomes
+   full, and a <firstterm>page split</firstterm> is required to fit a
+   new item that belongs on the overflowing page.  New levels are
+   added to a B-Tree index when the root page becomes full, causing a
+   <firstterm>root page split</firstterm>.  Even the largest B-Tree
+   indexes rarely have more than four or five levels.
+  </para>
+  <para>
+   A much more technical guide to the B-Tree index implementation can
+   be found in <filename>src/backend/access/nbtree/README</filename>.
+  </para>
+ </sect2>
+
+ <sect2 id="btree-deduplication">
+  <title>Posting List Tuples and Deduplication</title>
+  <para>
+   B-Tree indexes can perform <firstterm>deduplication</firstterm>.  A
+   <firstterm>duplicate</firstterm> is a row where
+   <emphasis>all</emphasis> indexed key columns are equal to the
+   corresponding column values from some other row.  Existing
+   duplicate leaf page tuples are merged together into a single
+   <quote>posting list</quote> tuple during a deduplication pass.  The
+   keys appear only once in this representation,  followed by a sorted
+   array of heap item pointers.  The deduplication process occurs
+   <quote>lazily</quote>, when a new item is inserted that cannot fit
+   on an existing leaf page.  Deduplication significantly reduces the
+   storage size of indexes where each value (or each distinct set of
+   values) appears several times on average.  This is likely to reduce
+   the amount of I/O required by index scasns, which can noticeably
+   improve overall query throughput.  It also reduces the overhead of
+   routine index vacuuming.
+  </para>
+  <para>
+   The <literal>deduplicate_items</literal> storage parameter can be
+   used to control deduplication within individual indexes.  See <xref
+   linkend="sql-createindex-storage-parameters"/> from the
+   <command>CREATE INDEX</command> documentation for details.
+  </para>
+ </sect2>
+
+ <sect2 id="btree-versioning">
+  <title>MVCC Versioning and B-Tree Storage</title>
 
   <para>
-   An introduction to the btree index implementation can be found in
-   <filename>src/backend/access/nbtree/README</filename>.
+   It is sometimes necessary for B-Tree indexes to contain multiple
+   physical tuples for the same logical table row, even in unique
+   indexes.  HOT updated rows avoid the need to store additional
+   physical versions in indexes, but an update that cannot use the HOT
+   optimization must store new physical tuples in
+   <emphasis>all</emphasis> indexes, including indexes with unchanged
+   indexed key values.  Multiple equal physical tuples that are only
+   needed to point to corresponding versions of the same logical table
+   row are common in some applications.
+  </para>
+  <para>
+   Deduplication tends to avoid page splits that are only needed due
+   to a short-term increase in <quote>duplicate</quote> tuples that
+   all point to different versions of the same logical table row.
+   <command>VACUUM</command> or autovacuum will eventually remove dead
+   versions of tuples from every index in any case, but
+   <command>VACUUM</command> usually cannot reverse page splits (in
+   general, a leaf page must be completely empty before
+   <command>VACUUM</command> can <quote>delete</quote> it).  In
+   effect, deduplication delays <quote>version driven</quote> page
+   splits, which may give VACUUM enough time to run and prevent the
+   splits entirely.  Unique indexes make use of deduplication for this
+   reason.  Also, even unique indexes can have a set of
+   <quote>duplicate</quote> rows that are all visible to a given
+   <acronym>MVCC</acronym> snapshot, provided at least one column has
+   a NULL value.  In general, the implementation considers tuples with
+   NULL values to be duplicates for the purposes of deduplication.
   </para>
 
+ </sect2>
+
+ <sect2 id="btree-deduplication-limitations">
+  <title>Deduplication Limitations</title>
+
+  <para>
+   Workloads that don't benefit from deduplication due to having no
+   duplicate values in indexes will incur a small performance penalty
+   with mixed read-write workloads.  Unique indexes use a special
+   heuristic when considering whether to perform a deduplication pass,
+   avoiding this performance penalty in cases that cannot possibly
+   benefit from deduplication.  There is never any performance penalty
+   with read-only workloads, since reading from posting lists is at
+   least as efficient as reading the standard index tuple
+   representation.
+  </para>
+  <para>
+   Deduplication can only be used with a B-Tree index when
+   <emphasis>all</emphasis> columns use a deduplication-safe operator
+   class that explicitly indicates that deduplication is safe at
+   <command>CREATE INDEX</command> time.  In practice almost all
+   operator classes/datatypes support deduplication.
+   <type>numeric</type> is a notable exception (<quote>display
+   scale</quote> makes it impossible to enable deduplication without
+   losing useful information about equal <type>numeric</type> datums).
+   Some operator classes support deduplication conditionally.  For
+   example, deduplication of indexes on a <type>text</type> column
+   (with the default <literal>btree/text_ops</literal> operator class)
+   is not supported when the column uses a  nondeterministic
+   collation.
+  </para>
+  <para>
+   <literal>INCLUDE</literal> indexes do not support deduplication.
+  </para>
+
+ </sect2>
 </sect1>
 
 </chapter>
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 057a6bb81a..20cdfabd7b 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -928,10 +928,11 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
      nondeterministic collations give a more <quote>correct</quote> behavior,
      especially when considering the full power of Unicode and its many
      special cases, they also have some drawbacks.  Foremost, their use leads
-     to a performance penalty.  Also, certain operations are not possible with
-     nondeterministic collations, such as pattern matching operations.
-     Therefore, they should be used only in cases where they are specifically
-     wanted.
+     to a performance penalty.  Note, in particular, that B-tree cannot use
+     deduplication with indexes that use a nondeterministic collation.  Also,
+     certain operations are not possible with nondeterministic collations,
+     such as pattern matching operations.  Therefore, they should be used
+     only in cases where they are specifically wanted.
     </para>
    </sect3>
   </sect2>
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index ab362a0dc5..221adab8f9 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -171,6 +171,8 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
         maximum size allowed for the index type, data insertion will fail.
         In any case, non-key columns duplicate data from the index's table
         and bloat the size of the index, thus potentially slowing searches.
+        Moreover, B-tree deduplication is never used with indexes that
+        have a non-key column.
        </para>
 
        <para>
@@ -393,10 +395,39 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    </variablelist>
 
    <para>
-    B-tree indexes additionally accept this parameter:
+    B-tree indexes also accept these parameters:
    </para>
 
    <variablelist>
+   <varlistentry id="index-reloption-deduplication" xreflabel="deduplicate_items">
+    <term><literal>deduplicate_items</literal>
+     <indexterm>
+      <primary><varname>deduplicate_items</varname></primary>
+      <secondary>storage parameter</secondary>
+     </indexterm>
+    </term>
+    <listitem>
+    <para>
+      Controls usage of the B-tree deduplication technique described
+      in <xref linkend="btree-deduplication"/>.  Set to
+      <literal>ON</literal> or <literal>OFF</literal> to enable or
+      disable the optimization.  (Alternative spellings of
+      <literal>ON</literal> and <literal>OFF</literal> are allowed as
+      described in <xref linkend="config-setting"/>.) The default is
+      <literal>ON</literal>.
+    </para>
+
+    <note>
+     <para>
+      Turning <literal>deduplicate_items</literal> off via
+      <command>ALTER INDEX</command> prevents future insertions from
+      triggering deduplication, but does not in itself make existing
+      posting list tuples use the standard tuple representation.
+     </para>
+    </note>
+    </listitem>
+   </varlistentry>
+
    <varlistentry id="index-reloption-vacuum-cleanup-index-scale-factor" xreflabel="vacuum_cleanup_index_scale_factor">
     <term><literal>vacuum_cleanup_index_scale_factor</literal>
      <indexterm>
@@ -451,9 +482,7 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
      This setting controls usage of the fast update technique described in
      <xref linkend="gin-fast-update"/>.  It is a Boolean parameter:
      <literal>ON</literal> enables fast update, <literal>OFF</literal> disables it.
-     (Alternative spellings of <literal>ON</literal> and <literal>OFF</literal> are
-     allowed as described in <xref linkend="config-setting"/>.)  The
-     default is <literal>ON</literal>.
+     The default is <literal>ON</literal>.
     </para>
 
     <note>
diff --git a/src/test/regress/expected/btree_index.out b/src/test/regress/expected/btree_index.out
index f567117a46..1646deb092 100644
--- a/src/test/regress/expected/btree_index.out
+++ b/src/test/regress/expected/btree_index.out
@@ -200,7 +200,7 @@ reset enable_indexscan;
 reset enable_bitmapscan;
 -- Also check LIKE optimization with binary-compatible cases
 create temp table btree_bpchar (f1 text collate "C");
-create index on btree_bpchar(f1 bpchar_ops);
+create index on btree_bpchar(f1 bpchar_ops) WITH (deduplicate_items=on);
 insert into btree_bpchar values ('foo'), ('fool'), ('bar'), ('quux');
 -- doesn't match index:
 explain (costs off)
@@ -266,6 +266,24 @@ select * from btree_bpchar where f1::bpchar like 'foo%';
  fool
 (2 rows)
 
+-- get test coverage for "single value" deduplication strategy:
+insert into btree_bpchar select 'foo' from generate_series(1,1500);
+--
+-- Perform unique checking, with and without the use of deduplication
+--
+CREATE TABLE dedup_unique_test_table (a int) WITH (autovacuum_enabled=false);
+CREATE UNIQUE INDEX dedup_unique ON dedup_unique_test_table (a) WITH (deduplicate_items=on);
+CREATE UNIQUE INDEX plain_unique ON dedup_unique_test_table (a) WITH (deduplicate_items=off);
+-- Generate enough garbage tuples in index to ensure that even the unique index
+-- with deduplication enabled has to check multiple leaf pages during unique
+-- checking (at least with a BLCKSZ of 8192 or less)
+DO $$
+BEGIN
+    FOR r IN 1..1350 LOOP
+        DELETE FROM dedup_unique_test_table;
+        INSERT INTO dedup_unique_test_table SELECT 1;
+    END LOOP;
+END$$;
 --
 -- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
diff --git a/src/test/regress/sql/btree_index.sql b/src/test/regress/sql/btree_index.sql
index 558dcae0ec..6e14b935ce 100644
--- a/src/test/regress/sql/btree_index.sql
+++ b/src/test/regress/sql/btree_index.sql
@@ -86,7 +86,7 @@ reset enable_bitmapscan;
 -- Also check LIKE optimization with binary-compatible cases
 
 create temp table btree_bpchar (f1 text collate "C");
-create index on btree_bpchar(f1 bpchar_ops);
+create index on btree_bpchar(f1 bpchar_ops) WITH (deduplicate_items=on);
 insert into btree_bpchar values ('foo'), ('fool'), ('bar'), ('quux');
 -- doesn't match index:
 explain (costs off)
@@ -103,6 +103,26 @@ explain (costs off)
 select * from btree_bpchar where f1::bpchar like 'foo%';
 select * from btree_bpchar where f1::bpchar like 'foo%';
 
+-- get test coverage for "single value" deduplication strategy:
+insert into btree_bpchar select 'foo' from generate_series(1,1500);
+
+--
+-- Perform unique checking, with and without the use of deduplication
+--
+CREATE TABLE dedup_unique_test_table (a int) WITH (autovacuum_enabled=false);
+CREATE UNIQUE INDEX dedup_unique ON dedup_unique_test_table (a) WITH (deduplicate_items=on);
+CREATE UNIQUE INDEX plain_unique ON dedup_unique_test_table (a) WITH (deduplicate_items=off);
+-- Generate enough garbage tuples in index to ensure that even the unique index
+-- with deduplication enabled has to check multiple leaf pages during unique
+-- checking (at least with a BLCKSZ of 8192 or less)
+DO $$
+BEGIN
+    FOR r IN 1..1350 LOOP
+        DELETE FROM dedup_unique_test_table;
+        INSERT INTO dedup_unique_test_table SELECT 1;
+    END LOOP;
+END$$;
+
 --
 -- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
-- 
2.17.1

#133

Peter Geoghegan

pg@bowt.ie

almost 6 years ago

In reply to: Peter Geoghegan (#132)

4 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Thu, Feb 6, 2020 at 6:18 PM Peter Geoghegan <pg@bowt.ie> wrote:

Attached is v32, which is even closer to being committable.

Attached is v33, which adds the last piece we need: opclass
infrastructure that tells nbtree whether or not deduplication can be
applied safely. This is based on work by Anastasia that was shared
with me privately.

I may not end up committing 0001-* as a separate patch, but it makes
sense to post it that way to make review easier -- this is supposed to
be infrastructure that isn't just useful for the deduplication patch.
0001-* adds a new C function, _bt_allequalimage(), which only actually
gets called within code added by 0002-* (i.e. the patch that adds the
deduplication feature). At this point, my main concern is that I might
not have the API exactly right in a world where these new support
functions are used by more than just the nbtree deduplication feature.
I would like to get detailed review of the new opclass infrastructure
stuff, and have asked for it directly, but I don't think that
committing the patch needs to block on that.

I've now written a fair amount of documentation for both the feature
and the underlying opclass infrastructure. It probably needs a bit
more copy-editing, but I think that it's generally in fairly good
shape. It might be a good idea for those who would like to review the
opclass stuff to start with some of my btree.sgml changes, and work
backwards -- the shape of the API itself is the important thing within
the 0001-* patch.

New opclass proc
================

In general, supporting deduplication is the rule for B-Tree opclasses,
rather than the exception. Most can use the generic
btequalimagedatum() routine as their support function 4, which
unconditionally indicates that deduplication is safe. There is a new
test that tries to catch opclasses that omitted to do this. Here is
the opr_sanity.out changes added by the first patch:

Those types/opclasses that you see here with a "proc" that is NULL
cannot use deduplication under any circumstances -- they have no
pg_amproc entry for B-Tree support function 4. The other four rows at
the start (those with a non-NULL "proc") are for collatable types,
where using deduplication is conditioned on not using a
nondeterministic collation. The details are in the sgml docs for the
second patch, where I go into the issue with numeric display scale,
why nondeterministic collations disable the use of deduplication, etc.

Note that these "equalimage" procs don't take any arguments, which is
a first for an index AM support function. Even still, we can take a
collation at CREATE INDEX time using the standard PG_GET_COLLATION()
mechanism. I suppose that it's a little bit odd to have no arguments
but still call PG_GET_COLLATION() in certain support functions. Still,
it works just fine, at least as far as the needs of deduplication are
concerned.

Since using deduplication is supposed to pretty much be the norm from
now on, it seemed like it might make sense to add a NOTICE about it
during CREATE INDEX -- a notice letting the user know that it isn't
being used due to a lack of opclass support:

regression=# create table foo(bar numeric);
CREATE TABLE
regression=# create index on foo(bar);
NOTICE: index "foo_bar_idx" cannot use deduplication
CREATE INDEX

Note that this NOTICE isn't seen with an INCLUDE index, since that's
expected to not support deduplication.

I have a feeling that not everybody will like this, which is why I'm
pointing it out.

Thoughts?

--
Peter Geoghegan

Attachments:

v33-0001-Add-equalimage-B-Tree-opclass-support-functions.patchapplication/x-patch; name=v33-0001-Add-equalimage-B-Tree-opclass-support-functions.patchDownload

From 5696e551583712168a62f9dd9bed2c7bdcbfccf6 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 12 Feb 2020 10:17:25 -0800
Subject: [PATCH v33 1/4] Add equalimage B-Tree opclass support functions.

Invent the concept of a B-Tree equalimage ("equality is image equality")
support function, registered as support function 4.  A B-Tree operator
class may use such a support function to indicate that it is safe (or
not safe) to replace its equality operator with a generic "image
equality" function.  This means that two equal datums can only be equal
when bitwise identical after detoasting.  This is infrastructure for an
upcoming patch that adds B-Tree deduplication, though it is anticipated
that it will eventually be used in other areas.

Add an equalimage routine to almost all of the existing B-Tree
opclasses.  Most can just use a generic equalimage  routine that
indicates that deduplication is safe unconditionally.

Author: Peter Geoghegan, Anastasia Lubennikova
Discussion: https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
---
 src/include/access/nbtree.h                 | 23 +++++--
 src/include/catalog/pg_amproc.dat           | 62 +++++++++++++++++
 src/include/catalog/pg_proc.dat             |  8 +++
 src/backend/access/nbtree/nbtutils.c        | 73 +++++++++++++++++++++
 src/backend/access/nbtree/nbtvalidate.c     |  8 ++-
 src/backend/commands/opclasscmds.c          | 30 ++++++++-
 src/backend/utils/adt/datum.c               | 17 +++++
 src/backend/utils/adt/name.c                | 31 +++++++++
 src/backend/utils/adt/varchar.c             | 15 +++++
 src/backend/utils/adt/varlena.c             | 15 +++++
 doc/src/sgml/btree.sgml                     | 59 ++++++++++++++++-
 doc/src/sgml/ref/alter_opfamily.sgml        |  7 +-
 doc/src/sgml/xindex.sgml                    | 19 ++++--
 src/test/regress/expected/alter_generic.out |  4 +-
 src/test/regress/expected/opr_sanity.out    | 35 ++++++++++
 src/test/regress/sql/opr_sanity.sql         | 15 +++++
 16 files changed, 400 insertions(+), 21 deletions(-)

diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 20ace69dab..d520066914 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -380,19 +380,29 @@ typedef struct BTMetaPageData
  *	must return < 0, 0, > 0, respectively, in these three cases.
  *
  *	To facilitate accelerated sorting, an operator class may choose to
- *	offer a second procedure (BTSORTSUPPORT_PROC).  For full details, see
- *	src/include/utils/sortsupport.h.
+ *	offer a sortsupport amproc procedure (BTSORTSUPPORT_PROC).  For full
+ *	details, see src/include/utils/sortsupport.h.
  *
  *	To support window frames defined by "RANGE offset PRECEDING/FOLLOWING",
- *	an operator class may choose to offer a third amproc procedure
- *	(BTINRANGE_PROC), independently of whether it offers sortsupport.
- *	For full details, see doc/src/sgml/btree.sgml.
+ *	an operator class may choose to offer an in_range amproc procedure
+ *	(BTINRANGE_PROC).  For full details, see doc/src/sgml/btree.sgml.
+ *
+ *	To support B-Tree deduplication (and possibly other optimizations), an
+ *	operator class may choose to offer an "equality is image equality" proc
+ *	(BTEQUALIMAGE_PROC).  When the procedure returns true, core code can
+ *	assume that any two opclass-equal datums must also be equivalent in
+ *	every way.  When the procedure returns false (or when there is no
+ *	procedure for an opclass), deduplication cannot proceed because equal
+ *	index tuples might be visibly different (e.g. btree/numeric_ops indexes
+ *	can't support deduplication because "5" is equal to but distinct from
+ *	"5.00").  For full details, see doc/src/sgml/btree.sgml.
  */
 
 #define BTORDER_PROC		1
 #define BTSORTSUPPORT_PROC	2
 #define BTINRANGE_PROC		3
-#define BTNProcs			3
+#define BTEQUALIMAGE_PROC	4
+#define BTNProcs			4
 
 /*
  *	We need to be able to tell the difference between read and write
@@ -829,6 +839,7 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern bool _bt_allequalimage(Relation rel, bool debugmessage);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/catalog/pg_amproc.dat b/src/include/catalog/pg_amproc.dat
index c67768fcab..4728479978 100644
--- a/src/include/catalog/pg_amproc.dat
+++ b/src/include/catalog/pg_amproc.dat
@@ -17,23 +17,35 @@
   amprocrighttype => 'anyarray', amprocnum => '1', amproc => 'btarraycmp' },
 { amprocfamily => 'btree/bit_ops', amproclefttype => 'bit',
   amprocrighttype => 'bit', amprocnum => '1', amproc => 'bitcmp' },
+{ amprocfamily => 'btree/bit_ops', amproclefttype => 'bit',
+  amprocrighttype => 'bit', amprocnum => '4', amproc => 'btequalimagedatum' },
 { amprocfamily => 'btree/bool_ops', amproclefttype => 'bool',
   amprocrighttype => 'bool', amprocnum => '1', amproc => 'btboolcmp' },
+{ amprocfamily => 'btree/bool_ops', amproclefttype => 'bool',
+  amprocrighttype => 'bool', amprocnum => '4', amproc => 'btequalimagedatum' },
 { amprocfamily => 'btree/bpchar_ops', amproclefttype => 'bpchar',
   amprocrighttype => 'bpchar', amprocnum => '1', amproc => 'bpcharcmp' },
 { amprocfamily => 'btree/bpchar_ops', amproclefttype => 'bpchar',
   amprocrighttype => 'bpchar', amprocnum => '2',
   amproc => 'bpchar_sortsupport' },
+{ amprocfamily => 'btree/bpchar_ops', amproclefttype => 'bpchar',
+  amprocrighttype => 'bpchar', amprocnum => '4', amproc => 'bpchar_equalimage' },
 { amprocfamily => 'btree/bytea_ops', amproclefttype => 'bytea',
   amprocrighttype => 'bytea', amprocnum => '1', amproc => 'byteacmp' },
 { amprocfamily => 'btree/bytea_ops', amproclefttype => 'bytea',
   amprocrighttype => 'bytea', amprocnum => '2', amproc => 'bytea_sortsupport' },
+{ amprocfamily => 'btree/bytea_ops', amproclefttype => 'bytea',
+  amprocrighttype => 'bytea', amprocnum => '4', amproc => 'btequalimagedatum' },
 { amprocfamily => 'btree/char_ops', amproclefttype => 'char',
   amprocrighttype => 'char', amprocnum => '1', amproc => 'btcharcmp' },
+{ amprocfamily => 'btree/char_ops', amproclefttype => 'char',
+  amprocrighttype => 'char', amprocnum => '4', amproc => 'btequalimagedatum' },
 { amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
   amprocrighttype => 'date', amprocnum => '1', amproc => 'date_cmp' },
 { amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
   amprocrighttype => 'date', amprocnum => '2', amproc => 'date_sortsupport' },
+{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
+  amprocrighttype => 'date', amprocnum => '4', amproc => 'btequalimagedatum' },
 { amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
   amprocrighttype => 'timestamp', amprocnum => '1',
   amproc => 'date_cmp_timestamp' },
@@ -45,6 +57,9 @@
 { amprocfamily => 'btree/datetime_ops', amproclefttype => 'timestamp',
   amprocrighttype => 'timestamp', amprocnum => '2',
   amproc => 'timestamp_sortsupport' },
+{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'timestamp',
+  amprocrighttype => 'timestamp', amprocnum => '4',
+  amproc => 'btequalimagedatum' },
 { amprocfamily => 'btree/datetime_ops', amproclefttype => 'timestamp',
   amprocrighttype => 'date', amprocnum => '1', amproc => 'timestamp_cmp_date' },
 { amprocfamily => 'btree/datetime_ops', amproclefttype => 'timestamp',
@@ -56,6 +71,9 @@
 { amprocfamily => 'btree/datetime_ops', amproclefttype => 'timestamptz',
   amprocrighttype => 'timestamptz', amprocnum => '2',
   amproc => 'timestamp_sortsupport' },
+{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'timestamptz',
+  amprocrighttype => 'timestamptz', amprocnum => '4',
+  amproc => 'btequalimagedatum' },
 { amprocfamily => 'btree/datetime_ops', amproclefttype => 'timestamptz',
   amprocrighttype => 'date', amprocnum => '1',
   amproc => 'timestamptz_cmp_date' },
@@ -96,10 +114,15 @@
 { amprocfamily => 'btree/network_ops', amproclefttype => 'inet',
   amprocrighttype => 'inet', amprocnum => '2',
   amproc => 'network_sortsupport' },
+{ amprocfamily => 'btree/network_ops', amproclefttype => 'inet',
+  amprocrighttype => 'inet', amprocnum => '4',
+  amproc => 'btequalimagedatum' },
 { amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
   amprocrighttype => 'int2', amprocnum => '1', amproc => 'btint2cmp' },
 { amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
   amprocrighttype => 'int2', amprocnum => '2', amproc => 'btint2sortsupport' },
+{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
+  amprocrighttype => 'int2', amprocnum => '4', amproc => 'btequalimagedatum' },
 { amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
   amprocrighttype => 'int4', amprocnum => '1', amproc => 'btint24cmp' },
 { amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
@@ -117,6 +140,8 @@
   amprocrighttype => 'int4', amprocnum => '1', amproc => 'btint4cmp' },
 { amprocfamily => 'btree/integer_ops', amproclefttype => 'int4',
   amprocrighttype => 'int4', amprocnum => '2', amproc => 'btint4sortsupport' },
+{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int4',
+  amprocrighttype => 'int4', amprocnum => '4', amproc => 'btequalimagedatum' },
 { amprocfamily => 'btree/integer_ops', amproclefttype => 'int4',
   amprocrighttype => 'int8', amprocnum => '1', amproc => 'btint48cmp' },
 { amprocfamily => 'btree/integer_ops', amproclefttype => 'int4',
@@ -134,6 +159,8 @@
   amprocrighttype => 'int8', amprocnum => '1', amproc => 'btint8cmp' },
 { amprocfamily => 'btree/integer_ops', amproclefttype => 'int8',
   amprocrighttype => 'int8', amprocnum => '2', amproc => 'btint8sortsupport' },
+{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int8',
+  amprocrighttype => 'int8', amprocnum => '4', amproc => 'btequalimagedatum' },
 { amprocfamily => 'btree/integer_ops', amproclefttype => 'int8',
   amprocrighttype => 'int4', amprocnum => '1', amproc => 'btint84cmp' },
 { amprocfamily => 'btree/integer_ops', amproclefttype => 'int8',
@@ -146,11 +173,17 @@
 { amprocfamily => 'btree/interval_ops', amproclefttype => 'interval',
   amprocrighttype => 'interval', amprocnum => '3',
   amproc => 'in_range(interval,interval,interval,bool,bool)' },
+{ amprocfamily => 'btree/interval_ops', amproclefttype => 'interval',
+  amprocrighttype => 'interval', amprocnum => '4',
+  amproc => 'btequalimagedatum' },
 { amprocfamily => 'btree/macaddr_ops', amproclefttype => 'macaddr',
   amprocrighttype => 'macaddr', amprocnum => '1', amproc => 'macaddr_cmp' },
 { amprocfamily => 'btree/macaddr_ops', amproclefttype => 'macaddr',
   amprocrighttype => 'macaddr', amprocnum => '2',
   amproc => 'macaddr_sortsupport' },
+{ amprocfamily => 'btree/macaddr_ops', amproclefttype => 'macaddr',
+  amprocrighttype => 'macaddr', amprocnum => '4',
+  amproc => 'btequalimagedatum' },
 { amprocfamily => 'btree/numeric_ops', amproclefttype => 'numeric',
   amprocrighttype => 'numeric', amprocnum => '1', amproc => 'numeric_cmp' },
 { amprocfamily => 'btree/numeric_ops', amproclefttype => 'numeric',
@@ -163,60 +196,89 @@
   amprocrighttype => 'oid', amprocnum => '1', amproc => 'btoidcmp' },
 { amprocfamily => 'btree/oid_ops', amproclefttype => 'oid',
   amprocrighttype => 'oid', amprocnum => '2', amproc => 'btoidsortsupport' },
+{ amprocfamily => 'btree/oid_ops', amproclefttype => 'oid',
+  amprocrighttype => 'oid', amprocnum => '4', amproc => 'btequalimagedatum' },
 { amprocfamily => 'btree/oidvector_ops', amproclefttype => 'oidvector',
   amprocrighttype => 'oidvector', amprocnum => '1',
   amproc => 'btoidvectorcmp' },
+{ amprocfamily => 'btree/oidvector_ops', amproclefttype => 'oidvector',
+  amprocrighttype => 'oidvector', amprocnum => '4', amproc => 'btequalimagedatum' },
 { amprocfamily => 'btree/text_ops', amproclefttype => 'text',
   amprocrighttype => 'text', amprocnum => '1', amproc => 'bttextcmp' },
 { amprocfamily => 'btree/text_ops', amproclefttype => 'text',
   amprocrighttype => 'text', amprocnum => '2', amproc => 'bttextsortsupport' },
+{ amprocfamily => 'btree/text_ops', amproclefttype => 'text',
+  amprocrighttype => 'text', amprocnum => '4', amproc => 'bttextequalimage' },
 { amprocfamily => 'btree/text_ops', amproclefttype => 'name',
   amprocrighttype => 'name', amprocnum => '1', amproc => 'btnamecmp' },
 { amprocfamily => 'btree/text_ops', amproclefttype => 'name',
   amprocrighttype => 'name', amprocnum => '2', amproc => 'btnamesortsupport' },
+{ amprocfamily => 'btree/text_ops', amproclefttype => 'name',
+  amprocrighttype => 'name', amprocnum => '4', amproc => 'btnameequalimage' },
 { amprocfamily => 'btree/text_ops', amproclefttype => 'name',
   amprocrighttype => 'text', amprocnum => '1', amproc => 'btnametextcmp' },
 { amprocfamily => 'btree/text_ops', amproclefttype => 'text',
   amprocrighttype => 'name', amprocnum => '1', amproc => 'bttextnamecmp' },
 { amprocfamily => 'btree/time_ops', amproclefttype => 'time',
   amprocrighttype => 'time', amprocnum => '1', amproc => 'time_cmp' },
+{ amprocfamily => 'btree/time_ops', amproclefttype => 'time',
+  amprocrighttype => 'time', amprocnum => '4', amproc => 'btequalimagedatum' },
 { amprocfamily => 'btree/time_ops', amproclefttype => 'time',
   amprocrighttype => 'interval', amprocnum => '3',
   amproc => 'in_range(time,time,interval,bool,bool)' },
 { amprocfamily => 'btree/timetz_ops', amproclefttype => 'timetz',
   amprocrighttype => 'timetz', amprocnum => '1', amproc => 'timetz_cmp' },
+{ amprocfamily => 'btree/timetz_ops', amproclefttype => 'timetz',
+  amprocrighttype => 'timetz', amprocnum => '4',
+  amproc => 'btequalimagedatum' },
 { amprocfamily => 'btree/timetz_ops', amproclefttype => 'timetz',
   amprocrighttype => 'interval', amprocnum => '3',
   amproc => 'in_range(timetz,timetz,interval,bool,bool)' },
 { amprocfamily => 'btree/varbit_ops', amproclefttype => 'varbit',
   amprocrighttype => 'varbit', amprocnum => '1', amproc => 'varbitcmp' },
+{ amprocfamily => 'btree/varbit_ops', amproclefttype => 'varbit',
+  amprocrighttype => 'varbit', amprocnum => '4',
+  amproc => 'btequalimagedatum' },
 { amprocfamily => 'btree/text_pattern_ops', amproclefttype => 'text',
   amprocrighttype => 'text', amprocnum => '1', amproc => 'bttext_pattern_cmp' },
 { amprocfamily => 'btree/text_pattern_ops', amproclefttype => 'text',
   amprocrighttype => 'text', amprocnum => '2',
   amproc => 'bttext_pattern_sortsupport' },
+{ amprocfamily => 'btree/text_pattern_ops', amproclefttype => 'text',
+  amprocrighttype => 'text', amprocnum => '4', amproc => 'btequalimagedatum' },
 { amprocfamily => 'btree/bpchar_pattern_ops', amproclefttype => 'bpchar',
   amprocrighttype => 'bpchar', amprocnum => '1',
   amproc => 'btbpchar_pattern_cmp' },
 { amprocfamily => 'btree/bpchar_pattern_ops', amproclefttype => 'bpchar',
   amprocrighttype => 'bpchar', amprocnum => '2',
   amproc => 'btbpchar_pattern_sortsupport' },
+{ amprocfamily => 'btree/bpchar_pattern_ops', amproclefttype => 'bpchar',
+  amprocrighttype => 'bpchar', amprocnum => '4',
+  amproc => 'btequalimagedatum' },
 { amprocfamily => 'btree/money_ops', amproclefttype => 'money',
   amprocrighttype => 'money', amprocnum => '1', amproc => 'cash_cmp' },
 { amprocfamily => 'btree/tid_ops', amproclefttype => 'tid',
   amprocrighttype => 'tid', amprocnum => '1', amproc => 'bttidcmp' },
+{ amprocfamily => 'btree/tid_ops', amproclefttype => 'tid',
+  amprocrighttype => 'tid', amprocnum => '4', amproc => 'btequalimagedatum' },
 { amprocfamily => 'btree/uuid_ops', amproclefttype => 'uuid',
   amprocrighttype => 'uuid', amprocnum => '1', amproc => 'uuid_cmp' },
 { amprocfamily => 'btree/uuid_ops', amproclefttype => 'uuid',
   amprocrighttype => 'uuid', amprocnum => '2', amproc => 'uuid_sortsupport' },
+{ amprocfamily => 'btree/uuid_ops', amproclefttype => 'uuid',
+  amprocrighttype => 'uuid', amprocnum => '4', amproc => 'btequalimagedatum' },
 { amprocfamily => 'btree/record_ops', amproclefttype => 'record',
   amprocrighttype => 'record', amprocnum => '1', amproc => 'btrecordcmp' },
 { amprocfamily => 'btree/record_image_ops', amproclefttype => 'record',
   amprocrighttype => 'record', amprocnum => '1', amproc => 'btrecordimagecmp' },
 { amprocfamily => 'btree/pg_lsn_ops', amproclefttype => 'pg_lsn',
   amprocrighttype => 'pg_lsn', amprocnum => '1', amproc => 'pg_lsn_cmp' },
+{ amprocfamily => 'btree/pg_lsn_ops', amproclefttype => 'pg_lsn',
+  amprocrighttype => 'pg_lsn', amprocnum => '4', amproc => 'btequalimagedatum' },
 { amprocfamily => 'btree/macaddr8_ops', amproclefttype => 'macaddr8',
   amprocrighttype => 'macaddr8', amprocnum => '1', amproc => 'macaddr8_cmp' },
+{ amprocfamily => 'btree/macaddr8_ops', amproclefttype => 'macaddr8',
+  amprocrighttype => 'macaddr8', amprocnum => '4', amproc => 'btequalimagedatum' },
 { amprocfamily => 'btree/enum_ops', amproclefttype => 'anyenum',
   amprocrighttype => 'anyenum', amprocnum => '1', amproc => 'enum_cmp' },
 { amprocfamily => 'btree/tsvector_ops', amproclefttype => 'tsvector',
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 226c904c04..c8fcff0fde 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -1007,12 +1007,16 @@
 { oid => '3135', descr => 'sort support',
   proname => 'btnamesortsupport', prorettype => 'void',
   proargtypes => 'internal', prosrc => 'btnamesortsupport' },
+{ oid => '8504', descr => 'equal image', proname => 'btnameequalimage',
+  prorettype => 'bool', proargtypes => '', prosrc => 'btnameequalimage' },
 { oid => '360', descr => 'less-equal-greater',
   proname => 'bttextcmp', proleakproof => 't', prorettype => 'int4',
   proargtypes => 'text text', prosrc => 'bttextcmp' },
 { oid => '3255', descr => 'sort support',
   proname => 'bttextsortsupport', prorettype => 'void',
   proargtypes => 'internal', prosrc => 'bttextsortsupport' },
+{ oid => '8505', descr => 'equal image', proname => 'bttextequalimage',
+  prorettype => 'bool', proargtypes => '', prosrc => 'bttextequalimage' },
 { oid => '377', descr => 'less-equal-greater',
   proname => 'cash_cmp', proleakproof => 't', prorettype => 'int4',
   proargtypes => 'money money', prosrc => 'cash_cmp' },
@@ -2091,6 +2095,8 @@
 { oid => '3328', descr => 'sort support',
   proname => 'bpchar_sortsupport', prorettype => 'void',
   proargtypes => 'internal', prosrc => 'bpchar_sortsupport' },
+{ oid => '8506', descr => 'equal image', proname => 'bpchar_equalimage',
+  prorettype => 'bool', proargtypes => '', prosrc => 'bpchar_equalimage' },
 { oid => '1080', descr => 'hash',
   proname => 'hashbpchar', prorettype => 'int4', proargtypes => 'bpchar',
   prosrc => 'hashbpchar' },
@@ -9484,6 +9490,8 @@
 { oid => '3187', descr => 'less-equal-greater based on byte images',
   proname => 'btrecordimagecmp', prorettype => 'int4',
   proargtypes => 'record record', prosrc => 'btrecordimagecmp' },
+{ oid => '8507', descr => 'equal image', proname => 'btequalimagedatum',
+  prorettype => 'bool', proargtypes => '', prosrc => 'btequalimagedatum' },
 
 # Extensions
 { oid => '3082', descr => 'list available extensions',
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 5ab4e712f1..c9f0402f8e 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -20,6 +20,7 @@
 #include "access/nbtree.h"
 #include "access/reloptions.h"
 #include "access/relscan.h"
+#include "catalog/catalog.h"
 #include "commands/progress.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -2566,3 +2567,75 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Are all attributes in rel "equality is image equality" attributes?
+ *
+ * We use each attribute's BTEQUALIMAGE_PROC opclass procedure.  If any
+ * opclass either lacks a BTEQUALIMAGE_PROC procedure or returns false, we
+ * return false; otherwise we return true.
+ *
+ * Returned boolean value is stored in index metapage during index builds.
+ * Deduplication can only be used when we return true.
+ */
+bool
+_bt_allequalimage(Relation rel, bool debugmessage)
+{
+	bool		allequalimage = true;
+
+	/* INCLUDE indexes don't support deduplication */
+	if (IndexRelationGetNumberOfAttributes(rel) !=
+		IndexRelationGetNumberOfKeyAttributes(rel))
+		return false;
+
+	/*
+	 * There is no special reason why deduplication cannot work with system
+	 * relations (i.e. with system catalog indexes and TOAST indexes).  We
+	 * deem deduplication unsafe for these indexes all the same, since the
+	 * alternative is to force users to always use deduplication, without
+	 * being able to opt out.  (ALTER INDEX is not supported with system
+	 * indexes, so users would have no way to set the deduplicate_items
+	 * storage parameter to 'off'.)
+	 */
+	if (IsSystemRelation(rel))
+		return false;
+
+	for (int i = 0; i < IndexRelationGetNumberOfKeyAttributes(rel); i++)
+	{
+		Oid			opfamily = rel->rd_opfamily[i];
+		Oid			opcintype = rel->rd_opcintype[i];
+		Oid			collation = rel->rd_indcollation[i];
+		Oid			equalimageproc;
+
+		equalimageproc = get_opfamily_proc(opfamily, opcintype, opcintype,
+										   BTEQUALIMAGE_PROC);
+
+		/*
+		 * If there is no BTEQUALIMAGE_PROC then deduplication is assumed to
+		 * be unsafe.  Otherwise, actually call proc and see what it says.
+		 */
+		if (!OidIsValid(equalimageproc) ||
+			!DatumGetBool(OidFunctionCall0Coll(equalimageproc, collation)))
+		{
+			allequalimage = false;
+			break;
+		}
+	}
+
+	/*
+	 * Don't ereport() until here to avoid reporting on a system relation
+	 * index or an INCLUDE index
+	 */
+	if (debugmessage)
+	{
+		if (allequalimage)
+			elog(DEBUG1, "index \"%s\" can safely use deduplication",
+				 RelationGetRelationName(rel));
+		else
+			ereport(NOTICE,
+					(errmsg("index \"%s\" cannot use deduplication",
+							RelationGetRelationName(rel))));
+	}
+
+	return allequalimage;
+}
diff --git a/src/backend/access/nbtree/nbtvalidate.c b/src/backend/access/nbtree/nbtvalidate.c
index ff634b1649..effd04d32e 100644
--- a/src/backend/access/nbtree/nbtvalidate.c
+++ b/src/backend/access/nbtree/nbtvalidate.c
@@ -104,6 +104,10 @@ btvalidate(Oid opclassoid)
 											procform->amprocrighttype,
 											BOOLOID, BOOLOID);
 				break;
+			case BTEQUALIMAGE_PROC:
+				ok = check_amproc_signature(procform->amproc, BOOLOID, true,
+											0, 0);
+				break;
 			default:
 				ereport(INFO,
 						(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
@@ -211,8 +215,8 @@ btvalidate(Oid opclassoid)
 
 		/*
 		 * Complain if there seems to be an incomplete set of either operators
-		 * or support functions for this datatype pair.  The only things
-		 * considered optional are the sortsupport and in_range functions.
+		 * or support functions for this datatype pair.  The sortsupport,
+		 * in_range, and equalimage functions are considered optional.
 		 */
 		if (thisgroup->operatorset !=
 			((1 << BTLessStrategyNumber) |
diff --git a/src/backend/commands/opclasscmds.c b/src/backend/commands/opclasscmds.c
index e2c6de457c..f63c02d5e9 100644
--- a/src/backend/commands/opclasscmds.c
+++ b/src/backend/commands/opclasscmds.c
@@ -1143,9 +1143,10 @@ assignProcTypes(OpFamilyMember *member, Oid amoid, Oid typeoid)
 	/*
 	 * btree comparison procs must be 2-arg procs returning int4.  btree
 	 * sortsupport procs must take internal and return void.  btree in_range
-	 * procs must be 5-arg procs returning bool.  hash support proc 1 must be
-	 * a 1-arg proc returning int4, while proc 2 must be a 2-arg proc
-	 * returning int8.  Otherwise we don't know.
+	 * procs must be 5-arg procs returning bool.  btree equalimage procs must
+	 * take no args and return bool.  hash support proc 1 must be a 1-arg
+	 * proc returning int4, while proc 2 must be a 2-arg proc returning int8.
+	 * Otherwise we don't know.
 	 */
 	if (amoid == BTREE_AM_OID)
 	{
@@ -1205,6 +1206,29 @@ assignProcTypes(OpFamilyMember *member, Oid amoid, Oid typeoid)
 			if (!OidIsValid(member->righttype))
 				member->righttype = procform->proargtypes.values[2];
 		}
+		else if (member->number == BTEQUALIMAGE_PROC)
+		{
+			if (procform->pronargs != 0)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+						 errmsg("btree equal image functions must have no arguments")));
+			if (procform->prorettype != BOOLOID)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+						 errmsg("btree equal image functions must return boolean")));
+			/*
+			 * pg_amproc functions are indexed by (lefttype, righttype), but
+			 * an equalimage function can only be called at CREATE INDEX time.
+			 * The same opclass opcintype OID is always used for leftype and
+			 * righttype.  Providing a cross-type routine isn't sensible.
+			 * Reject cross-type ALTER OPERATOR FAMILY ...  ADD FUNCTION 4
+			 * statements here.
+			 */
+			if (member->lefttype != member->righttype)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+						 errmsg("btree equal image function cannot be a cross-type function")));
+		}
 	}
 	else if (amoid == HASH_AM_OID)
 	{
diff --git a/src/backend/utils/adt/datum.c b/src/backend/utils/adt/datum.c
index 4e81947352..2881736cd6 100644
--- a/src/backend/utils/adt/datum.c
+++ b/src/backend/utils/adt/datum.c
@@ -44,6 +44,7 @@
 
 #include "access/detoast.h"
 #include "fmgr.h"
+#include "utils/builtins.h"
 #include "utils/datum.h"
 #include "utils/expandeddatum.h"
 
@@ -323,6 +324,22 @@ datum_image_eq(Datum value1, Datum value2, bool typByVal, int typLen)
 	return result;
 }
 
+/*-------------------------------------------------------------------------
+ * btequalimagedatum
+ *
+ * Generic "equalimage" support function -- always returns true.
+ *
+ * B-Tree operator classes whose equality function could safely be replaced by
+ * datum_image_eq() in all cases can use this as their "equalimage" support
+ * function.
+ *-------------------------------------------------------------------------
+ */
+Datum
+btequalimagedatum(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_BOOL(true);
+}
+
 /*-------------------------------------------------------------------------
  * datumEstimateSpace
  *
diff --git a/src/backend/utils/adt/name.c b/src/backend/utils/adt/name.c
index 6749e75c89..16a60626be 100644
--- a/src/backend/utils/adt/name.c
+++ b/src/backend/utils/adt/name.c
@@ -29,6 +29,7 @@
 #include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
+#include "utils/pg_locale.h"
 #include "utils/varlena.h"
 
 
@@ -224,6 +225,36 @@ btnamesortsupport(PG_FUNCTION_ARGS)
 	PG_RETURN_VOID();
 }
 
+static void
+check_collation_set(Oid collid)
+{
+	if (!OidIsValid(collid))
+	{
+		/*
+		 * This typically means that the parser could not resolve a conflict
+		 * of implicit collations, so report it that way.
+		 */
+		ereport(ERROR,
+				(errcode(ERRCODE_INDETERMINATE_COLLATION),
+				 errmsg("could not determine which collation to use for string comparison"),
+				 errhint("Use the COLLATE clause to set the collation explicitly.")));
+	}
+}
+
+Datum
+btnameequalimage(PG_FUNCTION_ARGS)
+{
+	Oid			collid = PG_GET_COLLATION();
+
+	check_collation_set(collid);
+
+	if (lc_collate_is_c(collid) ||
+		collid == DEFAULT_COLLATION_OID ||
+		get_collation_isdeterministic(collid))
+		PG_RETURN_BOOL(true);
+	else
+		PG_RETURN_BOOL(false);
+}
 
 /*****************************************************************************
  *	 MISCELLANEOUS PUBLIC ROUTINES											 *
diff --git a/src/backend/utils/adt/varchar.c b/src/backend/utils/adt/varchar.c
index 1e1239a1ba..fbe28ffb05 100644
--- a/src/backend/utils/adt/varchar.c
+++ b/src/backend/utils/adt/varchar.c
@@ -936,6 +936,21 @@ bpchar_sortsupport(PG_FUNCTION_ARGS)
 	PG_RETURN_VOID();
 }
 
+Datum
+bpchar_equalimage(PG_FUNCTION_ARGS)
+{
+	Oid			collid = PG_GET_COLLATION();
+
+	check_collation_set(collid);
+
+	if (lc_collate_is_c(collid) ||
+		collid == DEFAULT_COLLATION_OID ||
+		get_collation_isdeterministic(collid))
+		PG_RETURN_BOOL(true);
+	else
+		PG_RETURN_BOOL(false);
+}
+
 Datum
 bpchar_larger(PG_FUNCTION_ARGS)
 {
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 1b351cbc68..cc39035c0e 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1953,6 +1953,21 @@ bttextsortsupport(PG_FUNCTION_ARGS)
 	PG_RETURN_VOID();
 }
 
+Datum
+bttextequalimage(PG_FUNCTION_ARGS)
+{
+	Oid			collid = PG_GET_COLLATION();
+
+	check_collation_set(collid);
+
+	if (lc_collate_is_c(collid) ||
+		collid == DEFAULT_COLLATION_OID ||
+		get_collation_isdeterministic(collid))
+		PG_RETURN_BOOL(true);
+	else
+		PG_RETURN_BOOL(false);
+}
+
 /*
  * Generic sortsupport interface for character type's operator classes.
  * Includes locale support, and support for BpChar semantics (i.e. removing
diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
index ac6c4423e6..e3c69f8de6 100644
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@@ -207,7 +207,7 @@
 
  <para>
   As shown in <xref linkend="xindex-btree-support-table"/>, btree defines
-  one required and two optional support functions.  The three
+  one required and three optional support functions.  The four
   user-defined methods are:
  </para>
  <variablelist>
@@ -456,6 +456,63 @@ returns bool
     </para>
    </listitem>
   </varlistentry>
+  <varlistentry>
+   <term><function>equalimage</function></term>
+   <listitem>
+    <para>
+     Optionally, a btree operator family may provide
+     <function>equalimage</function> (<quote>equality is image
+     equality</quote>) support functions, registered under support
+     function number 4.  These functions allow B-Tree to apply
+     optimizations that assume that any two datums considered equal by
+     a corresponding <function>order</function> method must also be
+     equivalent in every way.  For example, the output function for
+     the underlying type must always return the same
+     <type>cstring</type>.  When there is no
+     <function>equalimage</function> procedure, or when the procedure
+     returns <literal>false</literal>, the B-Tree implementation
+     assumes that equal index tuples might be visibly different.  This
+     prevents an affected B-Tree index from using the deduplication
+     optimization.
+    </para>
+    <para>
+     An <function>equalimage</function> function must have the
+     signature
+<synopsis>
+equalimage() returns bool
+</synopsis>
+     Note that there are no arguments to the function.  Even still, if
+     the indexed values are of a collatable data type, the appropriate
+     collation OID will be passed to the
+     <function>equalimage</function> function, using the standard
+     <function>PG_GET_COLLATION()</function> mechanism.
+    </para>
+    <para>
+     When an operator class's <function>equalimage</function> function
+     returns <literal>true</literal>, the binary representation of two
+     datums that are considered equal by the operator class's
+     <function>order</function> procedure are generally also bitwise
+     equal.  However, when indexing a varlena datatype, the on-disk
+     representation may not be identical due to inconsistent
+     application of <acronym>TOAST</acronym> compression on input.
+     Formally, when an operator class's
+     <function>equalimage</function> function returns
+     <literal>true</literal>, it is safe to assume that the
+     <literal>datum_image_eq()</literal> C function will always give
+     the same answer as the operator class's <literal>=</literal>
+     operator for any possible set of inputs (assuming the
+     <literal>=</literal> operator is invoked using the collation
+     reported to the <function>equalimage</function> function).
+    </para>
+    <para>
+     A generic <function>equalimage</function> function that returns
+     <literal>true</literal> unconditionally can often be used when
+     authoring a new operator class/family;
+     <function>btequalimagedatum()</function> is provided for this
+     purpose.
+    </para>
+   </listitem>
+  </varlistentry>
  </variablelist>
 
 </sect1>
diff --git a/doc/src/sgml/ref/alter_opfamily.sgml b/doc/src/sgml/ref/alter_opfamily.sgml
index 848156c9d7..4ac1cca95a 100644
--- a/doc/src/sgml/ref/alter_opfamily.sgml
+++ b/doc/src/sgml/ref/alter_opfamily.sgml
@@ -153,9 +153,10 @@ ALTER OPERATOR FAMILY <replaceable>name</replaceable> USING <replaceable class="
       and hash functions it is not necessary to specify <replaceable
       class="parameter">op_type</replaceable> since the function's input
       data type(s) are always the correct ones to use.  For B-tree sort
-      support functions and all functions in GiST, SP-GiST and GIN operator
-      classes, it is necessary to specify the operand data type(s) the function
-      is to be used with.
+      support functions, B-Tree equal image functions, and all
+      functions in GiST, SP-GiST and GIN operator classes, it is
+      necessary to specify the operand data type(s) the function is to
+      be used with.
      </para>
 
      <para>
diff --git a/doc/src/sgml/xindex.sgml b/doc/src/sgml/xindex.sgml
index ffb5164aaa..59b1e37163 100644
--- a/doc/src/sgml/xindex.sgml
+++ b/doc/src/sgml/xindex.sgml
@@ -402,7 +402,7 @@
 
   <para>
    B-trees require a comparison support function,
-   and allow two additional support functions to be
+   and allow three additional support functions to be
    supplied at the operator class author's option, as shown in <xref
    linkend="xindex-btree-support-table"/>.
    The requirements for these support functions are explained further in
@@ -441,6 +441,14 @@
        </entry>
        <entry>3</entry>
       </row>
+      <row>
+       <entry>
+        Determine if it is generally safe to apply optimizations that
+        assume that any two equal keys must also be "image equal";
+        this makes the two keys totally interchangeable (optional)
+       </entry>
+       <entry>4</entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
@@ -980,7 +988,8 @@ DEFAULT FOR TYPE int8 USING btree FAMILY integer_ops AS
   OPERATOR 5 > ,
   FUNCTION 1 btint8cmp(int8, int8) ,
   FUNCTION 2 btint8sortsupport(internal) ,
-  FUNCTION 3 in_range(int8, int8, int8, boolean, boolean) ;
+  FUNCTION 3 in_range(int8, int8, int8, boolean, boolean) ,
+  FUNCTION 4 btequalimagedatum() ;
 
 CREATE OPERATOR CLASS int4_ops
 DEFAULT FOR TYPE int4 USING btree FAMILY integer_ops AS
@@ -992,7 +1001,8 @@ DEFAULT FOR TYPE int4 USING btree FAMILY integer_ops AS
   OPERATOR 5 > ,
   FUNCTION 1 btint4cmp(int4, int4) ,
   FUNCTION 2 btint4sortsupport(internal) ,
-  FUNCTION 3 in_range(int4, int4, int4, boolean, boolean) ;
+  FUNCTION 3 in_range(int4, int4, int4, boolean, boolean) ,
+  FUNCTION 4 btequalimagedatum() ;
 
 CREATE OPERATOR CLASS int2_ops
 DEFAULT FOR TYPE int2 USING btree FAMILY integer_ops AS
@@ -1004,7 +1014,8 @@ DEFAULT FOR TYPE int2 USING btree FAMILY integer_ops AS
   OPERATOR 5 > ,
   FUNCTION 1 btint2cmp(int2, int2) ,
   FUNCTION 2 btint2sortsupport(internal) ,
-  FUNCTION 3 in_range(int2, int2, int2, boolean, boolean) ;
+  FUNCTION 3 in_range(int2, int2, int2, boolean, boolean) ,
+  FUNCTION 4 btequalimagedatum() ;
 
 ALTER OPERATOR FAMILY integer_ops USING btree ADD
   -- cross-type comparisons int8 vs int2
diff --git a/src/test/regress/expected/alter_generic.out b/src/test/regress/expected/alter_generic.out
index ac5183c90e..022dec82a3 100644
--- a/src/test/regress/expected/alter_generic.out
+++ b/src/test/regress/expected/alter_generic.out
@@ -354,9 +354,9 @@ ERROR:  invalid operator number 0, must be between 1 and 5
 ALTER OPERATOR FAMILY alt_opf4 USING btree ADD OPERATOR 1 < ; -- operator without argument types
 ERROR:  operator argument types must be specified in ALTER OPERATOR FAMILY
 ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 0 btint42cmp(int4, int2); -- function number should be between 1 and 5
-ERROR:  invalid function number 0, must be between 1 and 3
+ERROR:  invalid function number 0, must be between 1 and 4
 ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 6 btint42cmp(int4, int2); -- function number should be between 1 and 5
-ERROR:  invalid function number 6, must be between 1 and 3
+ERROR:  invalid function number 6, must be between 1 and 4
 ALTER OPERATOR FAMILY alt_opf4 USING btree ADD STORAGE invalid_storage; -- Ensure STORAGE is not a part of ALTER OPERATOR FAMILY
 ERROR:  STORAGE cannot be specified in ALTER OPERATOR FAMILY
 DROP OPERATOR FAMILY alt_opf4 USING btree;
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index c19740e5db..0974098fd3 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2111,6 +2111,41 @@ WHERE p1.amproc = p2.oid AND
 --------------+--------+--------
 (0 rows)
 
+-- Almost all Btree opclasses can use the generic btequalimagedatum function
+-- as their equalimage proc (support function 4).  Look for opclasses that
+-- don't do so; newly added Btree opclasses will usually be able to support
+-- deduplication with little trouble.
+SELECT amproc::regproc AS proc, opf.opfname AS opfamily_name,
+       opc.opcname AS opclass_name, opc.opcintype::regtype AS opcintype
+FROM pg_am am
+JOIN pg_opclass opc ON opc.opcmethod = am.oid
+JOIN pg_opfamily opf ON opc.opcfamily = opf.oid
+LEFT JOIN pg_amproc ON amprocfamily = opf.oid AND
+    amproclefttype = opcintype AND
+    amprocnum = 4
+WHERE am.amname = 'btree' AND
+    amproc IS DISTINCT FROM 'btequalimagedatum'::regproc
+ORDER BY amproc::regproc::text, opfamily_name, opclass_name;
+       proc        |  opfamily_name   |   opclass_name   |    opcintype     
+-------------------+------------------+------------------+------------------
+ bpchar_equalimage | bpchar_ops       | bpchar_ops       | character
+ btnameequalimage  | text_ops         | name_ops         | name
+ bttextequalimage  | text_ops         | text_ops         | text
+ bttextequalimage  | text_ops         | varchar_ops      | text
+                   | array_ops        | array_ops        | anyarray
+                   | enum_ops         | enum_ops         | anyenum
+                   | float_ops        | float4_ops       | real
+                   | float_ops        | float8_ops       | double precision
+                   | jsonb_ops        | jsonb_ops        | jsonb
+                   | money_ops        | money_ops        | money
+                   | numeric_ops      | numeric_ops      | numeric
+                   | range_ops        | range_ops        | anyrange
+                   | record_image_ops | record_image_ops | record
+                   | record_ops       | record_ops       | record
+                   | tsquery_ops      | tsquery_ops      | tsquery
+                   | tsvector_ops     | tsvector_ops     | tsvector
+(16 rows)
+
 -- **************** pg_index ****************
 -- Look for illegal values in pg_index fields.
 SELECT p1.indexrelid, p1.indrelid
diff --git a/src/test/regress/sql/opr_sanity.sql b/src/test/regress/sql/opr_sanity.sql
index 624bea46ce..bd8e79737b 100644
--- a/src/test/regress/sql/opr_sanity.sql
+++ b/src/test/regress/sql/opr_sanity.sql
@@ -1323,6 +1323,21 @@ WHERE p1.amproc = p2.oid AND
     p1.amproclefttype != p1.amprocrighttype AND
     p2.provolatile = 'v';
 
+-- Almost all Btree opclasses can use the generic btequalimagedatum function
+-- as their equalimage proc (support function 4).  Look for opclasses that
+-- don't do so; newly added Btree opclasses will usually be able to support
+-- deduplication with little trouble.
+SELECT amproc::regproc AS proc, opf.opfname AS opfamily_name,
+       opc.opcname AS opclass_name, opc.opcintype::regtype AS opcintype
+FROM pg_am am
+JOIN pg_opclass opc ON opc.opcmethod = am.oid
+JOIN pg_opfamily opf ON opc.opcfamily = opf.oid
+LEFT JOIN pg_amproc ON amprocfamily = opf.oid AND
+    amproclefttype = opcintype AND
+    amprocnum = 4
+WHERE am.amname = 'btree' AND
+    amproc IS DISTINCT FROM 'btequalimagedatum'::regproc
+ORDER BY amproc::regproc::text, opfamily_name, opclass_name;
 
 -- **************** pg_index ****************
 
-- 
2.17.1

v33-0003-Teach-pageinspect-about-nbtree-posting-lists.patchapplication/x-patch; name=v33-0003-Teach-pageinspect-about-nbtree-posting-lists.patchDownload

From 92c050502c90b8a3e212fc678ecb075d82671145 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v33 3/4] Teach pageinspect about nbtree posting lists.

Add a column for posting list TIDs to bt_page_items().  Also add a
column that displays a single heap TID value for each tuple, regardless
of whether or not "ctid" is used for heap TID.  In the case of posting
list tuples, the value is the lowest heap TID in the posting list.
Arguably I should have done this when commit dd299df8 went in, since
that added a pivot tuple representation that could have a heap TID but
didn't use ctid for that purpose.

Also add a boolean column that displays the LP_DEAD bit value for each
non-pivot tuple.

No version bump for the pageinspect extension, since there hasn't been a
stable release since the last version bump (see commit 58b4cb30).
---
 contrib/pageinspect/btreefuncs.c              | 119 +++++++++++++++---
 contrib/pageinspect/expected/btree.out        |   7 ++
 contrib/pageinspect/pageinspect--1.7--1.8.sql |  53 ++++++++
 doc/src/sgml/pageinspect.sgml                 |  83 ++++++------
 4 files changed, 207 insertions(+), 55 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 564c818558..f4aac890f5 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -31,9 +31,11 @@
 #include "access/relation.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_am.h"
+#include "catalog/pg_type.h"
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "pageinspect.h"
+#include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
 #include "utils/varlena.h"
@@ -45,6 +47,8 @@ PG_FUNCTION_INFO_V1(bt_page_stats);
 
 #define IS_INDEX(r) ((r)->rd_rel->relkind == RELKIND_INDEX)
 #define IS_BTREE(r) ((r)->rd_rel->relam == BTREE_AM_OID)
+#define DatumGetItemPointer(X)	 ((ItemPointer) DatumGetPointer(X))
+#define ItemPointerGetDatum(X)	 PointerGetDatum(X)
 
 /* note: BlockNumber is unsigned, hence can't be negative */
 #define CHECK_RELATION_BLOCK_RANGE(rel, blkno) { \
@@ -243,6 +247,9 @@ struct user_args
 {
 	Page		page;
 	OffsetNumber offset;
+	bool		leafpage;
+	bool		rightmost;
+	TupleDesc	tupd;
 };
 
 /*-------------------------------------------------------
@@ -252,17 +259,25 @@ struct user_args
  * ------------------------------------------------------
  */
 static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, struct user_args *uargs)
 {
-	char	   *values[6];
+	Page		page = uargs->page;
+	OffsetNumber offset = uargs->offset;
+	bool		leafpage = uargs->leafpage;
+	bool		rightmost = uargs->rightmost;
+	bool		pivotoffset;
+	Datum		values[9];
+	bool		nulls[9];
 	HeapTuple	tuple;
 	ItemId		id;
 	IndexTuple	itup;
 	int			j;
 	int			off;
 	int			dlen;
-	char	   *dump;
+	char	   *dump,
+			   *datacstring;
 	char	   *ptr;
+	ItemPointer htid;
 
 	id = PageGetItemId(page, offset);
 
@@ -272,18 +287,27 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	itup = (IndexTuple) PageGetItem(page, id);
 
 	j = 0;
-	values[j++] = psprintf("%d", offset);
-	values[j++] = psprintf("(%u,%u)",
-						   ItemPointerGetBlockNumberNoCheck(&itup->t_tid),
-						   ItemPointerGetOffsetNumberNoCheck(&itup->t_tid));
-	values[j++] = psprintf("%d", (int) IndexTupleSize(itup));
-	values[j++] = psprintf("%c", IndexTupleHasNulls(itup) ? 't' : 'f');
-	values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
+	memset(nulls, 0, sizeof(nulls));
+	values[j++] = DatumGetInt16(offset);
+	values[j++] = ItemPointerGetDatum(&itup->t_tid);
+	values[j++] = Int32GetDatum((int) IndexTupleSize(itup));
+	values[j++] = BoolGetDatum(IndexTupleHasNulls(itup));
+	values[j++] = BoolGetDatum(IndexTupleHasVarwidths(itup));
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
 	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+
+	/*
+	 * Make sure that "data" column does not include posting list or pivot
+	 * tuple representation of heap TID
+	 */
+	if (BTreeTupleIsPosting(itup))
+		dlen -= IndexTupleSize(itup) - BTreeTupleGetPostingOffset(itup);
+	else if (BTreeTupleIsPivot(itup) && BTreeTupleGetHeapTID(itup) != NULL)
+		dlen -= MAXALIGN(sizeof(ItemPointerData));
+
 	dump = palloc0(dlen * 3 + 1);
-	values[j] = dump;
+	datacstring = dump;
 	for (off = 0; off < dlen; off++)
 	{
 		if (off > 0)
@@ -291,8 +315,57 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 		sprintf(dump, "%02x", *(ptr + off) & 0xff);
 		dump += 2;
 	}
+	values[j++] = CStringGetTextDatum(datacstring);
+	pfree(datacstring);
 
-	tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
+	/*
+	 * Avoid indicating that pivot tuple from !heapkeyspace index (which won't
+	 * have v4+ status bit set) is dead or has a heap TID -- that can only
+	 * happen with non-pivot tuples.  (Most backend code can use the
+	 * heapkeyspace field from the metapage to figure out which representation
+	 * to expect, but we have to be a bit creative here.)
+	 */
+	pivotoffset = (!leafpage || (!rightmost && offset == P_HIKEY));
+
+	/* LP_DEAD status bit */
+	if (!pivotoffset)
+		values[j++] = BoolGetDatum(ItemIdIsDead(id));
+	else
+		nulls[j++] = true;
+
+	htid = BTreeTupleGetHeapTID(itup);
+	if (pivotoffset && !BTreeTupleIsPivot(itup))
+		htid = NULL;
+
+	if (htid)
+		values[j++] = ItemPointerGetDatum(htid);
+	else
+		nulls[j++] = true;
+
+	if (BTreeTupleIsPosting(itup))
+	{
+		/* build an array of item pointers */
+		ItemPointer tids;
+		Datum	   *tids_datum;
+		int			nposting;
+
+		tids = BTreeTupleGetPosting(itup);
+		nposting = BTreeTupleGetNPosting(itup);
+		tids_datum = (Datum *) palloc(nposting * sizeof(Datum));
+		for (int i = 0; i < nposting; i++)
+			tids_datum[i] = ItemPointerGetDatum(&tids[i]);
+		values[j++] = PointerGetDatum(construct_array(tids_datum,
+													  nposting,
+													  TIDOID,
+													  sizeof(ItemPointerData),
+													  false, 's'));
+		pfree(tids_datum);
+	}
+	else
+		nulls[j++] = true;
+
+	/* Build and return the result tuple */
+	tuple = heap_form_tuple(uargs->tupd, values, nulls);
 
 	return HeapTupleGetDatum(tuple);
 }
@@ -378,12 +451,13 @@ bt_page_items(PG_FUNCTION_ARGS)
 			elog(NOTICE, "page is deleted");
 
 		fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+		uargs->leafpage = P_ISLEAF(opaque);
+		uargs->rightmost = P_RIGHTMOST(opaque);
 
 		/* Build a tuple descriptor for our result type */
 		if (get_call_result_type(fcinfo, NULL, &tupleDesc) != TYPEFUNC_COMPOSITE)
 			elog(ERROR, "return type must be a row type");
-
-		fctx->attinmeta = TupleDescGetAttInMetadata(tupleDesc);
+		uargs->tupd = tupleDesc;
 
 		fctx->user_fctx = uargs;
 
@@ -395,7 +469,7 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
@@ -463,12 +537,13 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 			elog(NOTICE, "page is deleted");
 
 		fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+		uargs->leafpage = P_ISLEAF(opaque);
+		uargs->rightmost = P_RIGHTMOST(opaque);
 
 		/* Build a tuple descriptor for our result type */
 		if (get_call_result_type(fcinfo, NULL, &tupleDesc) != TYPEFUNC_COMPOSITE)
 			elog(ERROR, "return type must be a row type");
-
-		fctx->attinmeta = TupleDescGetAttInMetadata(tupleDesc);
+		uargs->tupd = tupleDesc;
 
 		fctx->user_fctx = uargs;
 
@@ -480,7 +555,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
@@ -510,7 +585,7 @@ bt_metap(PG_FUNCTION_ARGS)
 	BTMetaPageData *metad;
 	TupleDesc	tupleDesc;
 	int			j;
-	char	   *values[8];
+	char	   *values[9];
 	Buffer		buffer;
 	Page		page;
 	HeapTuple	tuple;
@@ -557,17 +632,21 @@ bt_metap(PG_FUNCTION_ARGS)
 
 	/*
 	 * Get values of extended metadata if available, use default values
-	 * otherwise.
+	 * otherwise.  Note that we rely on the assumption that btm_allequalimage
+	 * is initialized to zero on databases that were initdb'd before Postgres
+	 * 13.
 	 */
 	if (metad->btm_version >= BTREE_NOVAC_VERSION)
 	{
 		values[j++] = psprintf("%u", metad->btm_oldest_btpo_xact);
 		values[j++] = psprintf("%f", metad->btm_last_cleanup_num_heap_tuples);
+		values[j++] = metad->btm_allequalimage ? "t" : "f";
 	}
 	else
 	{
 		values[j++] = "0";
 		values[j++] = "-1";
+		values[j++] = "f";
 	}
 
 	tuple = BuildTupleFromCStrings(TupleDescGetAttInMetadata(tupleDesc),
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..17bf0c5470 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -12,6 +12,7 @@ fastroot                | 1
 fastlevel               | 0
 oldest_xact             | 0
 last_cleanup_num_tuples | -1
+allequalimage           | t
 
 SELECT * FROM bt_page_stats('test1_a_idx', 0);
 ERROR:  block 0 is a meta page
@@ -41,6 +42,9 @@ itemlen    | 16
 nulls      | f
 vars       | f
 data       | 01 00 00 00 00 00 00 01
+dead       | f
+htid       | (0,1)
+tids       | 
 
 SELECT * FROM bt_page_items('test1_a_idx', 2);
 ERROR:  block number out of range
@@ -54,6 +58,9 @@ itemlen    | 16
 nulls      | f
 vars       | f
 data       | 01 00 00 00 00 00 00 01
+dead       | f
+htid       | (0,1)
+tids       | 
 
 SELECT * FROM bt_page_items(get_raw_page('test1_a_idx', 2));
 ERROR:  block number 2 is out of range for relation "test1_a_idx"
diff --git a/contrib/pageinspect/pageinspect--1.7--1.8.sql b/contrib/pageinspect/pageinspect--1.7--1.8.sql
index 2a7c4b3516..e34c214c93 100644
--- a/contrib/pageinspect/pageinspect--1.7--1.8.sql
+++ b/contrib/pageinspect/pageinspect--1.7--1.8.sql
@@ -14,3 +14,56 @@ CREATE FUNCTION heap_tuple_infomask_flags(
 RETURNS record
 AS 'MODULE_PATHNAME', 'heap_tuple_infomask_flags'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_metap()
+--
+DROP FUNCTION bt_metap(text);
+CREATE FUNCTION bt_metap(IN relname text,
+    OUT magic int4,
+    OUT version int4,
+    OUT root int4,
+    OUT level int4,
+    OUT fastroot int4,
+    OUT fastlevel int4,
+    OUT oldest_xact int4,
+    OUT last_cleanup_num_tuples real,
+    OUT allequalimage boolean)
+AS 'MODULE_PATHNAME', 'bt_metap'
+LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items(text, int4)
+--
+DROP FUNCTION bt_page_items(text, int4);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT dead boolean,
+    OUT htid tid,
+    OUT tids tid[])
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items(bytea)
+--
+DROP FUNCTION bt_page_items(bytea);
+CREATE FUNCTION bt_page_items(IN page bytea,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT dead boolean,
+    OUT htid tid,
+    OUT tids tid[])
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items_bytea'
+LANGUAGE C STRICT PARALLEL SAFE;
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 7e2e1487d7..9558421c2f 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -300,13 +300,14 @@ test=# SELECT t_ctid, raw_flags, combined_flags
 test=# SELECT * FROM bt_metap('pg_cast_oid_index');
 -[ RECORD 1 ]-----------+-------
 magic                   | 340322
-version                 | 3
+version                 | 4
 root                    | 1
 level                   | 0
 fastroot                | 1
 fastlevel               | 0
 oldest_xact             | 582
 last_cleanup_num_tuples | 1000
+allequalimage           | f
 </screen>
      </para>
     </listitem>
@@ -329,11 +330,11 @@ test=# SELECT * FROM bt_page_stats('pg_cast_oid_index', 1);
 -[ RECORD 1 ]-+-----
 blkno         | 1
 type          | l
-live_items    | 256
+live_items    | 224
 dead_items    | 0
-avg_item_size | 12
+avg_item_size | 16
 page_size     | 8192
-free_size     | 4056
+free_size     | 3668
 btpo_prev     | 0
 btpo_next     | 0
 btpo          | 0
@@ -356,33 +357,45 @@ btpo_flags    | 3
       <function>bt_page_items</function> returns detailed information about
       all of the items on a B-tree index page.  For example:
 <screen>
-test=# SELECT * FROM bt_page_items('pg_cast_oid_index', 1);
- itemoffset |  ctid   | itemlen | nulls | vars |    data
-------------+---------+---------+-------+------+-------------
-          1 | (0,1)   |      12 | f     | f    | 23 27 00 00
-          2 | (0,2)   |      12 | f     | f    | 24 27 00 00
-          3 | (0,3)   |      12 | f     | f    | 25 27 00 00
-          4 | (0,4)   |      12 | f     | f    | 26 27 00 00
-          5 | (0,5)   |      12 | f     | f    | 27 27 00 00
-          6 | (0,6)   |      12 | f     | f    | 28 27 00 00
-          7 | (0,7)   |      12 | f     | f    | 29 27 00 00
-          8 | (0,8)   |      12 | f     | f    | 2a 27 00 00
+regression=# SELECT * FROM bt_page_items('tenk2_unique1', 5);
+ itemoffset |   ctid   | itemlen | nulls | vars |          data           | dead |   htid   | tids
+------------+----------+---------+-------+------+-------------------------+------+----------+------
+          1 | (40,1)   |      16 | f     | f    | b8 05 00 00 00 00 00 00 |      |          |
+          2 | (58,11)  |      16 | f     | f    | 4a 04 00 00 00 00 00 00 | f    | (58,11)  |
+          3 | (266,4)  |      16 | f     | f    | 4b 04 00 00 00 00 00 00 | f    | (266,4)  |
+          4 | (279,25) |      16 | f     | f    | 4c 04 00 00 00 00 00 00 | f    | (279,25) |
+          5 | (333,11) |      16 | f     | f    | 4d 04 00 00 00 00 00 00 | f    | (333,11) |
+          6 | (87,24)  |      16 | f     | f    | 4e 04 00 00 00 00 00 00 | f    | (87,24)  |
+          7 | (38,22)  |      16 | f     | f    | 4f 04 00 00 00 00 00 00 | f    | (38,22)  |
+          8 | (272,17) |      16 | f     | f    | 50 04 00 00 00 00 00 00 | f    | (272,17) |
 </screen>
-      In a B-tree leaf page, <structfield>ctid</structfield> points to a heap tuple.
-      In an internal page, the block number part of <structfield>ctid</structfield>
-      points to another page in the index itself, while the offset part
-      (the second number) is ignored and is usually 1.
+      In a B-tree leaf page, <structfield>ctid</structfield> usually
+      points to a heap tuple, and <structfield>dead</structfield> may
+      indicate that the item has its <literal>LP_DEAD</literal> bit
+      set.  In an internal page, the block number part of
+      <structfield>ctid</structfield> points to another page in the
+      index itself, while the offset part (the second number) encodes
+      metadata about the tuple.  Posting list tuples on leaf pages
+      also use <structfield>ctid</structfield> for metadata.
+      <structfield>htid</structfield> always shows a single heap TID
+      for the tuple, regardless of how it is represented (internal
+      page tuples may need to store a heap TID when there are many
+      duplicate tuples on descendent leaf pages).
+      <structfield>tids</structfield> is a list of TIDs that is stored
+      within posting list tuples (tuples created by deduplication).
      </para>
      <para>
       Note that the first item on any non-rightmost page (any page with
       a non-zero value in the <structfield>btpo_next</structfield> field) is the
       page's <quote>high key</quote>, meaning its <structfield>data</structfield>
       serves as an upper bound on all items appearing on the page, while
-      its <structfield>ctid</structfield> field is meaningless.  Also, on non-leaf
-      pages, the first real data item (the first item that is not a high
-      key) is a <quote>minus infinity</quote> item, with no actual value
-      in its <structfield>data</structfield> field.  Such an item does have a valid
-      downlink in its <structfield>ctid</structfield> field, however.
+      its <structfield>ctid</structfield> field does not point to
+      another block.  Also, on non-leaf pages, the first real data item
+      (the first item that is not a high key) is a <quote>minus
+      infinity</quote> item, with no actual value in its
+      <structfield>data</structfield> field.  Such an item does have a
+      valid downlink in its <structfield>ctid</structfield> field,
+      however.
      </para>
     </listitem>
    </varlistentry>
@@ -402,17 +415,17 @@ test=# SELECT * FROM bt_page_items('pg_cast_oid_index', 1);
       with <function>get_raw_page</function> should be passed as argument.  So
       the last example could also be rewritten like this:
 <screen>
-test=# SELECT * FROM bt_page_items(get_raw_page('pg_cast_oid_index', 1));
- itemoffset |  ctid   | itemlen | nulls | vars |    data
-------------+---------+---------+-------+------+-------------
-          1 | (0,1)   |      12 | f     | f    | 23 27 00 00
-          2 | (0,2)   |      12 | f     | f    | 24 27 00 00
-          3 | (0,3)   |      12 | f     | f    | 25 27 00 00
-          4 | (0,4)   |      12 | f     | f    | 26 27 00 00
-          5 | (0,5)   |      12 | f     | f    | 27 27 00 00
-          6 | (0,6)   |      12 | f     | f    | 28 27 00 00
-          7 | (0,7)   |      12 | f     | f    | 29 27 00 00
-          8 | (0,8)   |      12 | f     | f    | 2a 27 00 00
+regression=# SELECT * FROM bt_page_items(get_raw_page('tenk2_unique1', 5));
+ itemoffset |   ctid   | itemlen | nulls | vars |          data           | dead |   htid   | tids
+------------+----------+---------+-------+------+-------------------------+------+----------+------
+          1 | (40,1)   |      16 | f     | f    | b8 05 00 00 00 00 00 00 |      |          |
+          2 | (58,11)  |      16 | f     | f    | 4a 04 00 00 00 00 00 00 | f    | (58,11)  |
+          3 | (266,4)  |      16 | f     | f    | 4b 04 00 00 00 00 00 00 | f    | (266,4)  |
+          4 | (279,25) |      16 | f     | f    | 4c 04 00 00 00 00 00 00 | f    | (279,25) |
+          5 | (333,11) |      16 | f     | f    | 4d 04 00 00 00 00 00 00 | f    | (333,11) |
+          6 | (87,24)  |      16 | f     | f    | 4e 04 00 00 00 00 00 00 | f    | (87,24)  |
+          7 | (38,22)  |      16 | f     | f    | 4f 04 00 00 00 00 00 00 | f    | (38,22)  |
+          8 | (272,17) |      16 | f     | f    | 50 04 00 00 00 00 00 00 | f    | (272,17) |
 </screen>
       All the other details are the same as explained in the previous item.
      </para>
-- 
2.17.1

v33-0004-DEBUG-Show-index-values-in-pageinspect.patchapplication/x-patch; name=v33-0004-DEBUG-Show-index-values-in-pageinspect.patchDownload

From de5ed9814f18cf70683fd963c091723274b14982 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Nov 2019 19:35:30 -0800
Subject: [PATCH v33 4/4] DEBUG: Show index values in pageinspect

This is not intended for commit.  It is included as a convenience for
reviewers.
---
 contrib/pageinspect/btreefuncs.c       | 64 ++++++++++++++++++--------
 contrib/pageinspect/expected/btree.out |  2 +-
 2 files changed, 46 insertions(+), 20 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index f4aac890f5..9074033619 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -245,6 +245,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
  */
 struct user_args
 {
+	Relation	rel;
 	Page		page;
 	OffsetNumber offset;
 	bool		leafpage;
@@ -261,6 +262,7 @@ struct user_args
 static Datum
 bt_page_print_tuples(FuncCallContext *fctx, struct user_args *uargs)
 {
+	Relation	rel = uargs->rel;
 	Page		page = uargs->page;
 	OffsetNumber offset = uargs->offset;
 	bool		leafpage = uargs->leafpage;
@@ -295,26 +297,48 @@ bt_page_print_tuples(FuncCallContext *fctx, struct user_args *uargs)
 	values[j++] = BoolGetDatum(IndexTupleHasVarwidths(itup));
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
-	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
-
-	/*
-	 * Make sure that "data" column does not include posting list or pivot
-	 * tuple representation of heap TID
-	 */
-	if (BTreeTupleIsPosting(itup))
-		dlen -= IndexTupleSize(itup) - BTreeTupleGetPostingOffset(itup);
-	else if (BTreeTupleIsPivot(itup) && BTreeTupleGetHeapTID(itup) != NULL)
-		dlen -= MAXALIGN(sizeof(ItemPointerData));
-
-	dump = palloc0(dlen * 3 + 1);
-	datacstring = dump;
-	for (off = 0; off < dlen; off++)
+	if (rel)
 	{
-		if (off > 0)
-			*dump++ = ' ';
-		sprintf(dump, "%02x", *(ptr + off) & 0xff);
-		dump += 2;
+		TupleDesc	itupdesc = RelationGetDescr(rel);
+		Datum		datvalues[INDEX_MAX_KEYS];
+		bool		isnull[INDEX_MAX_KEYS];
+		int			natts;
+		int			indnkeyatts = rel->rd_index->indnkeyatts;
+
+		natts = BTreeTupleGetNAtts(itup, rel);
+
+		itupdesc->natts = Min(indnkeyatts, natts);
+		memset(&isnull, 0xFF, sizeof(isnull));
+		index_deform_tuple(itup, itupdesc, datvalues, isnull);
+		rel->rd_index->indnkeyatts = natts;
+		datacstring = BuildIndexValueDescription(rel, datvalues, isnull);
+		itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+		rel->rd_index->indnkeyatts = indnkeyatts;
 	}
+	else
+	{
+		dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+
+		/*
+		 * Make sure that "data" column does not include posting list or pivot
+		 * tuple representation of heap TID
+		 */
+		if (BTreeTupleIsPosting(itup))
+			dlen -= IndexTupleSize(itup) - BTreeTupleGetPostingOffset(itup);
+		else if (BTreeTupleIsPivot(itup) && BTreeTupleGetHeapTID(itup) != NULL)
+			dlen -= MAXALIGN(sizeof(ItemPointerData));
+
+		dump = palloc0(dlen * 3 + 1);
+		datacstring = dump;
+		for (off = 0; off < dlen; off++)
+		{
+			if (off > 0)
+				*dump++ = ' ';
+			sprintf(dump, "%02x", *(ptr + off) & 0xff);
+			dump += 2;
+		}
+	}
+
 	values[j++] = CStringGetTextDatum(datacstring);
 	pfree(datacstring);
 
@@ -437,11 +461,11 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = rel;
 		uargs->page = palloc(BLCKSZ);
 		memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
 
 		UnlockReleaseBuffer(buffer);
-		relation_close(rel, AccessShareLock);
 
 		uargs->offset = FirstOffsetNumber;
 
@@ -475,6 +499,7 @@ bt_page_items(PG_FUNCTION_ARGS)
 	}
 	else
 	{
+		relation_close(uargs->rel, AccessShareLock);
 		pfree(uargs->page);
 		pfree(uargs);
 		SRF_RETURN_DONE(fctx);
@@ -522,6 +547,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = NULL;
 		uargs->page = VARDATA(raw_page);
 
 		uargs->offset = FirstOffsetNumber;
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 17bf0c5470..92ad8eb1a9 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -41,7 +41,7 @@ ctid       | (0,1)
 itemlen    | 16
 nulls      | f
 vars       | f
-data       | 01 00 00 00 00 00 00 01
+data       | (a)=(72057594037927937)
 dead       | f
 htid       | (0,1)
 tids       | 
-- 
2.17.1

v33-0002-Add-deduplication-to-nbtree.patchapplication/x-patch; name=v33-0002-Add-deduplication-to-nbtree.patchDownload

From aa0d0b022f8791a46f7eba9e607845049714b403 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 25 Jan 2020 14:40:46 -0800
Subject: [PATCH v33 2/4] Add deduplication to nbtree.

Deduplication reduces the storage overhead of duplicates in indexes that
use the standard nbtree index access method.  The deduplication process
is applied lazily, after the point where opportunistic deletion of
LP_DEAD-marked index tuples occurs.  Deduplication is only applied at
the point where a leaf page split would otherwise be required.  New
"posting list tuples" are formed by merging together existing duplicate
tuples.  The physical representation of the items on an nbtree leaf page
is made more space efficient by deduplication, but the logical contents
of the page are not changed.

Deduplication merges together duplicates that happen to have been
created by an UPDATE that did not use an optimization like heapam's
Heap-only tuples (HOT).  Deduplication is effective at absorbing
"version bloat" without any special knowledge of row versions or of
MVCC.  Deduplication is applied within unique indexes for this reason,
though the criteria for triggering a deduplication is slightly
different.  Deduplication of a unique index is triggered only when the
incoming item is a duplicate of an existing item (and when the page
would otherwise split), which is a sure sign of "version bloat".

The lazy approach taken by nbtree has significant advantages over a
GIN style eager approach.  Most individual inserts of index tuples have
exactly the same overhead as before.  The extra overhead of
deduplication is amortized across insertions, just like the overhead of
page splits.  The key space of indexes works in the same way as it has
since commit dd299df8 (the commit which made heap TID a tiebreaker
column), since only the physical representation of tuples is changed.  A
new index storage parameter (deduplicate_items) controls the use of
deduplication.  The default setting is 'on', so all B-Tree indexes use
deduplication when only deduplication safe operator classes are used.
We should review this decision at the end of the Postgres 13 beta
period.

Testing has shown that nbtree deduplication can generally make indexes
with about 10 or 15 tuples for each distinct key value about 2.5X - 4X
smaller, even with single column integer indexes (e.g., an index on a
referencing column that accompanies a foreign key).  The final size of
single column nbtree indexes comes close to the final size of a similar
contrib/btree_gin index, at least in cases where GIN's posting list
compression isn't very effective.  This can significantly improve
transaction throughput, and significantly lessen the ongoing cost of
vacuuming indexes.

There is a regression of approximately 2% of transaction throughput with
workloads that consist of append-only inserts into a table with several
non-unique indexes, where all indexes have few or no repeated values.
This is tentatively considered to be an acceptable downside to enabling
deduplication by default.  Again, the final word on this will come at
the end of the beta period, when we get some feedback from users.

Bump XLOG_PAGE_MAGIC because xl_btree_vacuum changed.

No bump in BTREE_VERSION, since deduplication only affects the physical
representation of tuples.  However, users must still REINDEX a
pg_upgrade'd index to before its leaf page splits will apply
deduplication.  An index build is the only way to set the new nbtree
metapage flag indicating that deduplication is generally safe.

Author: Anastasia Lubennikova, Peter Geoghegan
Reviewed-By: Peter Geoghegan, Heikki Linnakangas
Discussion:
    https://postgr.es/m/55E4051B.7020209@postgrespro.ru
    https://postgr.es/m/4ab6e2db-bcee-f4cf-0916-3a06e6ccbb55@postgrespro.ru
---
 src/include/access/nbtree.h                   | 435 +++++++--
 src/include/access/nbtxlog.h                  | 117 ++-
 src/include/access/rmgrlist.h                 |   2 +-
 src/backend/access/common/reloptions.c        |   9 +
 src/backend/access/index/genam.c              |   4 +
 src/backend/access/nbtree/Makefile            |   1 +
 src/backend/access/nbtree/README              | 133 ++-
 src/backend/access/nbtree/nbtdedup.c          | 830 ++++++++++++++++++
 src/backend/access/nbtree/nbtinsert.c         | 387 ++++++--
 src/backend/access/nbtree/nbtpage.c           | 246 +++++-
 src/backend/access/nbtree/nbtree.c            | 171 +++-
 src/backend/access/nbtree/nbtsearch.c         | 271 +++++-
 src/backend/access/nbtree/nbtsort.c           | 193 +++-
 src/backend/access/nbtree/nbtsplitloc.c       |  39 +-
 src/backend/access/nbtree/nbtutils.c          | 197 ++++-
 src/backend/access/nbtree/nbtxlog.c           | 268 +++++-
 src/backend/access/rmgrdesc/nbtdesc.c         |  22 +-
 src/backend/storage/page/bufpage.c            |   9 +-
 src/bin/psql/tab-complete.c                   |   4 +-
 contrib/amcheck/verify_nbtree.c               | 231 ++++-
 contrib/citext/expected/citext_1.out          |   2 +
 contrib/hstore/expected/hstore.out            |   1 +
 contrib/ltree/expected/ltree.out              |   1 +
 doc/src/sgml/btree.sgml                       | 185 +++-
 doc/src/sgml/charset.sgml                     |   9 +-
 doc/src/sgml/func.sgml                        |   9 +-
 doc/src/sgml/ref/create_index.sgml            |  44 +-
 src/test/regress/expected/alter_table.out     |   4 +
 src/test/regress/expected/arrays.out          |   1 +
 src/test/regress/expected/btree_index.out     |  20 +-
 .../regress/expected/collate.icu.utf8.out     |   3 +
 src/test/regress/expected/create_index.out    |   4 +
 src/test/regress/expected/domain.out          |   2 +
 src/test/regress/expected/enum.out            |   2 +
 src/test/regress/expected/foreign_key.out     |   4 +
 src/test/regress/expected/indexing.out        |   1 +
 src/test/regress/expected/join.out            |   1 +
 src/test/regress/expected/jsonb.out           |   1 +
 src/test/regress/expected/matview.out         |   3 +
 src/test/regress/expected/psql.out            |   1 +
 src/test/regress/expected/rangetypes.out      |   2 +
 src/test/regress/expected/stats_ext.out       |   1 +
 src/test/regress/expected/transactions.out    |   1 +
 src/test/regress/expected/tsearch.out         |   1 +
 src/test/regress/sql/btree_index.sql          |  22 +-
 45 files changed, 3564 insertions(+), 330 deletions(-)
 create mode 100644 src/backend/access/nbtree/nbtdedup.c

diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index d520066914..5652ef2bcc 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -108,6 +108,7 @@ typedef struct BTMetaPageData
 										 * pages */
 	float8		btm_last_cleanup_num_heap_tuples;	/* number of heap tuples
 													 * during last cleanup */
+	bool		btm_allequalimage;	/* are all columns "equalimage"? */
 } BTMetaPageData;
 
 #define BTPageGetMeta(p) \
@@ -124,6 +125,14 @@ typedef struct BTMetaPageData
  * need to be immediately re-indexed at pg_upgrade.  In order to get the
  * new heapkeyspace semantics, however, a REINDEX is needed.
  *
+ * Deduplication is safe to use when the btm_allequalimage field is set to
+ * true.  It's safe to read the btm_allequalimage field on version 3, but
+ * only version 4 indexes make use of deduplication.  Even version 4
+ * indexes created on PostgreSQL v12 will need a REINDEX to make use of
+ * deduplication, though, since there is no other way to set
+ * btm_allequalimage to true (pg_upgrade hasn't been taught to set the
+ * metapage field).
+ *
  * Btree version 2 is mostly the same as version 3.  There are two new
  * fields in the metapage that were introduced in version 3.  A version 2
  * metapage will be automatically upgraded to version 3 on the first
@@ -156,6 +165,21 @@ typedef struct BTMetaPageData
 				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
 				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
 
+/*
+ * MaxTIDsPerBTreePage is an upper bound on the number of heap TIDs tuples
+ * that may be stored on a btree leaf page.  It is used to size the
+ * per-page temporary buffers used by index scans.)
+ *
+ * Note: we don't bother considering per-tuple overheads here to keep
+ * things simple (value is based on how many elements a single array of
+ * heap TIDs must have to fill the space between the page header and
+ * special area).  The value is slightly higher (i.e. more conservative)
+ * than necessary as a result, which is considered acceptable.
+ */
+#define MaxTIDsPerBTreePage \
+	(int) ((BLCKSZ - SizeOfPageHeaderData - sizeof(BTPageOpaqueData)) / \
+		   sizeof(ItemPointerData))
+
 /*
  * The leaf-page fillfactor defaults to 90% but is user-adjustable.
  * For pages above the leaf level, we use a fixed 70% fillfactor.
@@ -230,16 +254,15 @@ typedef struct BTMetaPageData
  * tuples (non-pivot tuples).  _bt_check_natts() enforces the rules
  * described here.
  *
- * Non-pivot tuple format:
+ * Non-pivot tuple format (plain/non-posting variant):
  *
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
- * All other types of index tuples ("pivot" tuples) only have key columns,
- * since pivot tuples only exist to represent how the key space is
+ * Non-pivot tuples complement pivot tuples, which only have key columns.
+ * The sole purpose of pivot tuples is to represent how the key space is
  * separated.  In general, any B-Tree index that has more than one level
  * (i.e. any index that does not just consist of a metapage and a single
  * leaf root page) must have some number of pivot tuples, since pivot
@@ -264,7 +287,8 @@ typedef struct BTMetaPageData
  * INDEX_ALT_TID_MASK bit is set, which doesn't count the trailing heap
  * TID column sometimes stored in pivot tuples -- that's represented by
  * the presence of BT_PIVOT_HEAP_TID_ATTR.  The INDEX_ALT_TID_MASK bit in
- * t_info is always set on BTREE_VERSION 4 pivot tuples.
+ * t_info is always set on BTREE_VERSION 4 pivot tuples, since
+ * BTreeTupleIsPivot() must work reliably on heapkeyspace versions.
  *
  * In version 3 indexes, the INDEX_ALT_TID_MASK flag might not be set in
  * pivot tuples.  In that case, the number of key columns is implicitly
@@ -279,90 +303,256 @@ typedef struct BTMetaPageData
  * The 12 least significant offset bits from t_tid are used to represent
  * the number of columns in INDEX_ALT_TID_MASK tuples, leaving 4 status
  * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
- * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
- * number of columns/attributes <= INDEX_MAX_KEYS.
+ * future use.  BT_OFFSET_MASK should be large enough to store any number
+ * of columns/attributes <= INDEX_MAX_KEYS.
+ *
+ * Sometimes non-pivot tuples also use a representation that repurposes
+ * t_tid to store metadata rather than a TID.  PostgreSQL v13 introduced a
+ * new non-pivot tuple format to support deduplication: posting list
+ * tuples.  Deduplication merges together multiple equal non-pivot tuples
+ * into a logically equivalent, space efficient representation.  A posting
+ * list is an array of ItemPointerData elements.  Non-pivot tuples are
+ * merged together to form posting list tuples lazily, at the point where
+ * we'd otherwise have to split a leaf page.
+ *
+ * Posting tuple format (alternative non-pivot tuple representation):
+ *
+ *  t_tid | t_info | key values | posting list (TID array)
+ *
+ * Posting list tuples are recognized as such by having the
+ * INDEX_ALT_TID_MASK status bit set in t_info and the BT_IS_POSTING status
+ * bit set in t_tid.  These flags redefine the content of the posting
+ * tuple's t_tid to store an offset to the posting list, as well as the
+ * total number of posting list array elements.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items present in the tuple, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.  Like any non-pivot tuple, the number of columns stored is
+ * always implicitly the total number in the index (in practice there can
+ * never be non-key columns stored, since deduplication is not supported
+ * with INCLUDE indexes).  BT_OFFSET_MASK should be large enough to store
+ * any number of posting list TIDs that might be present in a tuple (since
+ * tuple size is subject to the INDEX_SIZE_MASK limit).
  *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
-#define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_OFFSET_MASK				0x0FFF
 #define BT_PIVOT_HEAP_TID_ATTR		0x1000
-
-/* Get/set downlink block number in pivot tuple */
-#define BTreeTupleGetDownLink(itup) \
-	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
-#define BTreeTupleSetDownLink(itup, blkno) \
-	ItemPointerSetBlockNumber(&((itup)->t_tid), (blkno))
+#define BT_IS_POSTING				0x2000
 
 /*
- * Get/set leaf page highkey's link. During the second phase of deletion, the
- * target leaf page's high key may point to an ancestor page (at all other
- * times, the leaf level high key's link is not used).  See the nbtree README
- * for full details.
+ * Note: BTreeTupleIsPivot() can have false negatives (but not false
+ * positives) when used with !heapkeyspace indexes
  */
-#define BTreeTupleGetTopParent(itup) \
-	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
-#define BTreeTupleSetTopParent(itup, blkno)	\
-	do { \
-		ItemPointerSetBlockNumber(&((itup)->t_tid), (blkno)); \
-		BTreeTupleSetNAtts((itup), 0); \
-	} while(0)
+static inline bool
+BTreeTupleIsPivot(IndexTuple itup)
+{
+	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+		return false;
+	/* absence of BT_IS_POSTING in offset number indicates pivot tuple */
+	if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) & BT_IS_POSTING) != 0)
+		return false;
+
+	return true;
+}
+
+static inline bool
+BTreeTupleIsPosting(IndexTuple itup)
+{
+	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+		return false;
+	/* presence of BT_IS_POSTING in offset number indicates posting tuple */
+	if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) & BT_IS_POSTING) == 0)
+		return false;
+
+	return true;
+}
+
+static inline void
+BTreeTupleSetPosting(IndexTuple itup, int nhtids, int postingoffset)
+{
+	Assert(nhtids > 1 && (nhtids & BT_OFFSET_MASK) == nhtids);
+	Assert(postingoffset == MAXALIGN(postingoffset));
+	Assert(postingoffset < INDEX_SIZE_MASK);
+
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	ItemPointerSetOffsetNumber(&itup->t_tid, (nhtids | BT_IS_POSTING));
+	ItemPointerSetBlockNumber(&itup->t_tid, postingoffset);
+}
+
+static inline uint16
+BTreeTupleGetNPosting(IndexTuple posting)
+{
+	OffsetNumber existing;
+
+	Assert(BTreeTupleIsPosting(posting));
+
+	existing = ItemPointerGetOffsetNumberNoCheck(&posting->t_tid);
+	return (existing & BT_OFFSET_MASK);
+}
+
+static inline uint32
+BTreeTupleGetPostingOffset(IndexTuple posting)
+{
+	Assert(BTreeTupleIsPosting(posting));
+
+	return ItemPointerGetBlockNumberNoCheck(&posting->t_tid);
+}
+
+static inline ItemPointer
+BTreeTupleGetPosting(IndexTuple posting)
+{
+	return (ItemPointer) ((char *) posting +
+						  BTreeTupleGetPostingOffset(posting));
+}
+
+static inline ItemPointer
+BTreeTupleGetPostingN(IndexTuple posting, int n)
+{
+	return BTreeTupleGetPosting(posting) + n;
+}
 
 /*
- * Get/set number of attributes within B-tree index tuple.
+ * Get/set downlink block number in pivot tuple.
+ *
+ * Note: Cannot assert that tuple is a pivot tuple.  If we did so then
+ * !heapkeyspace indexes would exhibit false positive assertion failures.
+ */
+static inline BlockNumber
+BTreeTupleGetDownLink(IndexTuple pivot)
+{
+	return ItemPointerGetBlockNumberNoCheck(&pivot->t_tid);
+}
+
+static inline void
+BTreeTupleSetDownLink(IndexTuple pivot, BlockNumber blkno)
+{
+	ItemPointerSetBlockNumber(&pivot->t_tid, blkno);
+}
+
+/*
+ * Get number of attributes within tuple.
  *
  * Note that this does not include an implicit tiebreaker heap TID
  * attribute, if any.  Note also that the number of key attributes must be
  * explicitly represented in all heapkeyspace pivot tuples.
+ *
+ * Note: This is defined as a macro rather than an inline function to
+ * avoid including rel.h.
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		(BTreeTupleIsPivot(itup)) ? \
 		( \
-			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
+			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_OFFSET_MASK \
 		) \
 		: \
 		IndexRelationGetNumberOfAttributes(rel) \
 	)
-#define BTreeTupleSetNAtts(itup, n) \
-	do { \
-		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
-	} while(0)
 
 /*
- * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Set number of attributes in tuple, making it into a pivot tuple
  */
-#define BTreeTupleGetHeapTID(itup) \
-	( \
-	  (itup)->t_info & INDEX_ALT_TID_MASK && \
-	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_PIVOT_HEAP_TID_ATTR) != 0 ? \
-	  ( \
-		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
-					   sizeof(ItemPointerData)) \
-	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
-	)
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int natts)
+{
+	Assert(natts <= INDEX_MAX_KEYS);
+
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	/* BT_IS_POSTING bit may be unset -- tuple always becomes a pivot tuple */
+	ItemPointerSetOffsetNumber(&itup->t_tid, natts);
+	Assert(BTreeTupleIsPivot(itup));
+}
+
 /*
- * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * Set the bit indicating heap TID attribute present in pivot tuple
  */
-#define BTreeTupleSetAltHeapTID(itup) \
-	do { \
-		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
-								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_PIVOT_HEAP_TID_ATTR); \
-	} while(0)
+static inline void
+BTreeTupleSetAltHeapTID(IndexTuple pivot)
+{
+	OffsetNumber existing;
+
+	Assert(BTreeTupleIsPivot(pivot));
+
+	existing = ItemPointerGetOffsetNumberNoCheck(&pivot->t_tid);
+	ItemPointerSetOffsetNumber(&pivot->t_tid,
+							   existing | BT_PIVOT_HEAP_TID_ATTR);
+}
+
+/*
+ * Get/set leaf page's "top parent" link from its high key.  Used during page
+ * deletion.
+ *
+ * Note: Cannot assert that tuple is a pivot tuple.  If we did so then
+ * !heapkeyspace indexes would exhibit false positive assertion failures.
+ */
+static inline BlockNumber
+BTreeTupleGetTopParent(IndexTuple leafhikey)
+{
+	return ItemPointerGetBlockNumberNoCheck(&leafhikey->t_tid);
+}
+
+static inline void
+BTreeTupleSetTopParent(IndexTuple leafhikey, BlockNumber blkno)
+{
+	ItemPointerSetBlockNumber(&leafhikey->t_tid, blkno);
+	BTreeTupleSetNAtts(leafhikey, 0);
+}
+
+/*
+ * Get tiebreaker heap TID attribute, if any.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
+ */
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+	if (BTreeTupleIsPivot(itup))
+	{
+		/* Pivot tuple heap TID representation? */
+		if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+			 BT_PIVOT_HEAP_TID_ATTR) != 0)
+			return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+								  sizeof(ItemPointerData));
+
+		/* Heap TID attribute was truncated */
+		return NULL;
+	}
+	else if (BTreeTupleIsPosting(itup))
+		return BTreeTupleGetPosting(itup);
+
+	return &itup->t_tid;
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list tuple.
+ *
+ * Works with non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxHeapTID(IndexTuple itup)
+{
+	Assert(!BTreeTupleIsPivot(itup));
+
+	if (BTreeTupleIsPosting(itup))
+	{
+		uint16		nposting = BTreeTupleGetNPosting(itup);
+
+		return BTreeTupleGetPostingN(itup, nposting - 1);
+	}
+
+	return &itup->t_tid;
+}
 
 /*
  *	Operator strategy numbers for B-tree have been moved to access/stratnum.h,
@@ -444,6 +634,9 @@ typedef BTStackData *BTStack;
  * indexes whose version is >= version 4.  It's convenient to keep this close
  * by, rather than accessing the metapage repeatedly.
  *
+ * allequalimage is set to indicate that deduplication is safe for the index.
+ * This is also a property of the index relation rather than an indexscan.
+ *
  * anynullkeys indicates if any of the keys had NULL value when scankey was
  * built from index tuple (note that already-truncated tuple key attributes
  * set NULL as a placeholder key value, which also affects value of
@@ -479,6 +672,7 @@ typedef BTStackData *BTStack;
 typedef struct BTScanInsertData
 {
 	bool		heapkeyspace;
+	bool		allequalimage;
 	bool		anynullkeys;
 	bool		nextkey;
 	bool		pivotsearch;
@@ -517,10 +711,94 @@ typedef struct BTInsertStateData
 	bool		bounds_valid;
 	OffsetNumber low;
 	OffsetNumber stricthigh;
+
+	/*
+	 * if _bt_binsrch_insert found the location inside existing posting list,
+	 * save the position inside the list.  -1 sentinel value indicates overlap
+	 * with an existing posting list tuple that has its LP_DEAD bit set.
+	 */
+	int			postingoff;
 } BTInsertStateData;
 
 typedef BTInsertStateData *BTInsertState;
 
+/*
+ * State used to representing an individual pending tuple during
+ * deduplication.
+ */
+typedef struct BTDedupInterval
+{
+	OffsetNumber baseoff;
+	uint16		nitems;
+} BTDedupInterval;
+
+/*
+ * BTDedupStateData is a working area used during deduplication.
+ *
+ * The status info fields track the state of a whole-page deduplication pass.
+ * State about the current pending posting list is also tracked.
+ *
+ * A pending posting list is comprised of a contiguous group of equal items
+ * from the page, starting from page offset number 'baseoff'.  This is the
+ * offset number of the "base" tuple for new posting list.  'nitems' is the
+ * current total number of existing items from the page that will be merged to
+ * make a new posting list tuple, including the base tuple item.  (Existing
+ * items may themselves be posting list tuples, or regular non-pivot tuples.)
+ *
+ * The total size of the existing tuples to be freed when pending posting list
+ * is processed gets tracked by 'phystupsize'.  This information allows
+ * deduplication to calculate the space saving for each new posting list
+ * tuple, and for the entire pass over the page as a whole.
+ */
+typedef struct BTDedupStateData
+{
+	/* Deduplication status info for entire pass over page */
+	bool		deduplicate;	/* Still deduplicating page? */
+	Size		maxpostingsize; /* Limit on size of final tuple */
+
+	/* Metadata about base tuple of current pending posting list */
+	IndexTuple	base;			/* Use to form new posting list */
+	OffsetNumber baseoff;		/* page offset of base */
+	Size		basetupsize;	/* base size without original posting list */
+
+	/* Other metadata about pending posting list */
+	ItemPointer htids;			/* Heap TIDs in pending posting list */
+	int			nhtids;			/* Number of heap TIDs in nhtids array */
+	int			nitems;			/* Number of existing tuples/line pointers */
+	Size		phystupsize;	/* Includes line pointer overhead */
+
+	/*
+	 * Array of tuples to go on new version of the page.  Contains one entry
+	 * for each group of consecutive items.  Note that existing tuples that
+	 * will not become posting list tuples do not appear in the array (they
+	 * are implicitly unchanged by deduplication pass).
+	 */
+	int			nintervals;		/* current size of intervals array */
+	BTDedupInterval intervals[MaxIndexTuplesPerPage];
+} BTDedupStateData;
+
+typedef BTDedupStateData *BTDedupState;
+
+/*
+ * BTVacuumPostingData is state that represents how to VACUUM a posting list
+ * tuple when some (though not all) of its TIDs are to be deleted.
+ *
+ * Convention is that itup field is the original posting list tuple on input,
+ * and palloc()'d final tuple used to overwrite existing tuple on output.
+ */
+typedef struct BTVacuumPostingData
+{
+	/* Tuple that will be/was updated */
+	IndexTuple	itup;
+	OffsetNumber updatedoffset;
+
+	/* State needed to describe final itup in WAL */
+	uint16		ndeletedtids;
+	uint16		deletetids[FLEXIBLE_ARRAY_MEMBER];
+} BTVacuumPostingData;
+
+typedef BTVacuumPostingData *BTVacuumPosting;
+
 /*
  * BTScanOpaqueData is the btree-private state needed for an indexscan.
  * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
@@ -544,7 +822,9 @@ typedef BTInsertStateData *BTInsertState;
  * If we are doing an index-only scan, we save the entire IndexTuple for each
  * matched item, otherwise only its heap TID and offset.  The IndexTuples go
  * into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array.  Posting list tuples store a "base" tuple once,
+ * allowing the same key to be returned for each TID in the posting list
+ * tuple.
  */
 
 typedef struct BTScanPosItem	/* what we remember about each match */
@@ -588,7 +868,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxTIDsPerBTreePage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -696,6 +976,7 @@ typedef struct BTOptions
 	int			fillfactor;		/* page fill factor in percent (0..100) */
 	/* fraction of newly inserted tuples prior to trigger index cleanup */
 	float8		vacuum_cleanup_index_scale_factor;
+	bool		deduplicate_items;	/* Try to deduplicate items? */
 } BTOptions;
 
 #define BTGetFillFactor(relation) \
@@ -706,6 +987,11 @@ typedef struct BTOptions
 	 BTREE_DEFAULT_FILLFACTOR)
 #define BTGetTargetPageFreeSpace(relation) \
 	(BLCKSZ * (100 - BTGetFillFactor(relation)) / 100)
+#define BTGetDeduplicateItems(relation) \
+	(AssertMacro(relation->rd_rel->relkind == RELKIND_INDEX && \
+				 relation->rd_rel->relam == BTREE_AM_OID), \
+	((relation)->rd_options ? \
+	 ((BTOptions *) (relation)->rd_options)->deduplicate_items : true))
 
 /*
  * Constant definition for progress reporting.  Phase numbers must match
@@ -752,6 +1038,22 @@ extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
 extern void _bt_parallel_done(IndexScanDesc scan);
 extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
 
+/*
+ * prototypes for functions in nbtdedup.c
+ */
+extern void _bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
+							   IndexTuple newitem, Size newitemsz,
+							   bool checkingunique);
+extern void _bt_dedup_start_pending(BTDedupState state, IndexTuple base,
+									OffsetNumber baseoff);
+extern bool _bt_dedup_save_htid(BTDedupState state, IndexTuple itup);
+extern Size _bt_dedup_finish_pending(Page newpage, BTDedupState state);
+extern IndexTuple _bt_form_posting(IndexTuple base, ItemPointer htids,
+								   int nhtids);
+extern void _bt_update_posting(BTVacuumPosting vacposting);
+extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
+								   int postingoff);
+
 /*
  * prototypes for functions in nbtinsert.c
  */
@@ -770,14 +1072,16 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
 /*
  * prototypes for functions in nbtpage.c
  */
-extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level);
+extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+							 bool allequalimage);
 extern void _bt_update_meta_cleanup_info(Relation rel,
 										 TransactionId oldestBtpoXact, float8 numHeapTuples);
 extern void _bt_upgrademetapage(Page page);
 extern Buffer _bt_getroot(Relation rel, int access);
 extern Buffer _bt_gettrueroot(Relation rel);
 extern int	_bt_getrootheight(Relation rel);
-extern bool _bt_heapkeyspace(Relation rel);
+extern void _bt_metaversion(Relation rel, bool *heapkeyspace,
+							bool *allequalimage);
 extern void _bt_checkpage(Relation rel, Buffer buf);
 extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
 extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -786,7 +1090,8 @@ extern void _bt_relbuf(Relation rel, Buffer buf);
 extern void _bt_pageinit(Page page, Size size);
 extern bool _bt_page_recyclable(Page page);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
-								OffsetNumber *deletable, int ndeletable);
+								OffsetNumber *deletable, int ndeletable,
+								BTVacuumPosting *updatable, int nupdatable);
 extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *deletable, int ndeletable,
 								Relation heapRel);
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 776a9bd723..251846f304 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP		0x50	/* deduplicate tuples on leaf page */
+#define XLOG_BTREE_INSERT_POST	0x60	/* add index tuple with posting split */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -53,21 +54,34 @@ typedef struct xl_btree_metadata
 	uint32		fastlevel;
 	TransactionId oldest_btpo_xact;
 	float8		last_cleanup_num_heap_tuples;
+	bool		allequalimage;
 } xl_btree_metadata;
 
 /*
  * This is what we need to know about simple (without split) insert.
  *
- * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
- * Note that INSERT_META implies it's not a leaf page.
+ * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META, and
+ * INSERT_POST.  Note that INSERT_META and INSERT_UPPER implies it's not a
+ * leaf page, while INSERT_POST and INSERT_LEAF imply that it must be a leaf
+ * page.
  *
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page
  * Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
  * Backup Blk 2: xl_btree_metadata, if INSERT_META
+ *
+ * Note: The new tuple is actually the "original" new item in the posting
+ * list split insert case (i.e. the INSERT_POST case).  A split offset for
+ * the posting list is logged before the original new item.  Recovery needs
+ * both, since it must do an in-place update of the existing posting list
+ * that was split as an extra step.  Also, recovery generates a "final"
+ * newitem.  See _bt_swap_posting() for details on posting list splits.
  */
 typedef struct xl_btree_insert
 {
 	OffsetNumber offnum;
+
+	/* POSTING SPLIT OFFSET FOLLOWS (INSERT_POST case) */
+	/* NEW TUPLE ALWAYS FOLLOWS AT THE END */
 } xl_btree_insert;
 
 #define SizeOfBtreeInsert	(offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
@@ -92,8 +106,37 @@ typedef struct xl_btree_insert
  * Backup Blk 0: original page / new left page
  *
  * The left page's data portion contains the new item, if it's the _L variant.
- * An IndexTuple representing the high key of the left page must follow with
- * either variant.
+ * _R variant split records generally do not have a newitem (_R variant leaf
+ * page split records that must deal with a posting list split will include an
+ * explicit newitem, though it is never used on the right page -- it is
+ * actually an orignewitem needed to update existing posting list).  The new
+ * high key of the left/original page appears last of all (and must always be
+ * present).
+ *
+ * Page split records that need the REDO routine to deal with a posting list
+ * split directly will have an explicit newitem, which is actually an
+ * orignewitem (the newitem as it was before the posting list split, not
+ * after).  A posting list split always has a newitem that comes immediately
+ * after the posting list being split (which would have overlapped with
+ * orignewitem prior to split).  Usually REDO must deal with posting list
+ * splits with an _L variant page split record, and usually both the new
+ * posting list and the final newitem go on the left page (the existing
+ * posting list will be inserted instead of the old, and the final newitem
+ * will be inserted next to that).  However, _R variant split records will
+ * include an orignewitem when the split point for the page happens to have a
+ * lastleft tuple that is also the posting list being split (leaving newitem
+ * as the page split's firstright tuple).  The existence of this corner case
+ * does not change the basic fact about newitem/orignewitem for the REDO
+ * routine: it is always state used for the left page alone.  (This is why the
+ * record's postingoff field isn't a reliable indicator of whether or not a
+ * posting list split occurred during the page split; a non-zero value merely
+ * indicates that the REDO routine must reconstruct a new posting list tuple
+ * that is needed for the left page.)
+ *
+ * This posting list split handling is equivalent to the xl_btree_insert REDO
+ * routine's INSERT_POST handling.  While the details are more complicated
+ * here, the concept and goals are exactly the same.  See _bt_swap_posting()
+ * for details on posting list splits.
  *
  * Backup Blk 1: new right page
  *
@@ -111,15 +154,33 @@ typedef struct xl_btree_split
 {
 	uint32		level;			/* tree level of page being split */
 	OffsetNumber firstright;	/* first item moved to right page */
-	OffsetNumber newitemoff;	/* new item's offset (useful for _L variant) */
+	OffsetNumber newitemoff;	/* new item's offset */
+	uint16		postingoff;		/* offset inside orig posting tuple */
 } xl_btree_split;
 
-#define SizeOfBtreeSplit	(offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit	(offsetof(xl_btree_split, postingoff) + sizeof(uint16))
+
+/*
+ * When page is deduplicated, consecutive groups of tuples with equal keys are
+ * merged together into posting list tuples.
+ *
+ * The WAL record represents a deduplication pass for a leaf page.  An array
+ * of BTDedupInterval structs follows.
+ */
+typedef struct xl_btree_dedup
+{
+	uint16		nintervals;
+
+	/* DEDUPLICATION INTERVALS FOLLOW */
+} xl_btree_dedup;
+
+#define SizeOfBtreeDedup 	(offsetof(xl_btree_dedup, nintervals) + sizeof(uint16))
 
 /*
  * This is what we need to know about delete of individual leaf index tuples.
  * The WAL record can represent deletion of any number of index tuples on a
- * single index page when *not* executed by VACUUM.
+ * single index page when *not* executed by VACUUM.  Deletion of a subset of
+ * the TIDs within a posting list tuple is not supported.
  *
  * Backup Blk 0: index page
  */
@@ -150,21 +211,43 @@ typedef struct xl_btree_reuse_page
 #define SizeOfBtreeReusePage	(sizeof(xl_btree_reuse_page))
 
 /*
- * This is what we need to know about vacuum of individual leaf index tuples.
- * The WAL record can represent deletion of any number of index tuples on a
- * single index page when executed by VACUUM.
+ * This is what we need to know about which TIDs to remove from an individual
+ * posting list tuple during vacuuming.  An array of these may appear at the
+ * end of xl_btree_vacuum records.
+ */
+typedef struct xl_btree_update
+{
+	uint16		ndeletedtids;
+
+	/* POSTING LIST uint16 OFFSETS TO A DELETED TID FOLLOW */
+} xl_btree_update;
+
+#define SizeOfBtreeUpdate	(offsetof(xl_btree_update, ndeletedtids) + sizeof(uint16))
+
+/*
+ * This is what we need to know about a VACUUM of a leaf page.  The WAL record
+ * can represent deletion of any number of index tuples on a single index page
+ * when executed by VACUUM.  It can also support "updates" of index tuples,
+ * which is how deletes of a subset of TIDs contained in an existing posting
+ * list tuple are implemented. (Updates are only used when there will be some
+ * remaining TIDs once VACUUM finishes; otherwise the posting list tuple can
+ * just be deleted).
  *
- * Note that the WAL record in any vacuum of an index must have at least one
- * item to delete.
+ * Updated posting list tuples are represented using xl_btree_update metadata.
+ * The REDO routine uses each xl_btree_update (plus its corresponding original
+ * index tuple from the target leaf page) to generate the final updated tuple.
  */
 typedef struct xl_btree_vacuum
 {
-	uint32		ndeleted;
+	uint16		ndeleted;
+	uint16		nupdated;
 
 	/* DELETED TARGET OFFSET NUMBERS FOLLOW */
+	/* UPDATED TARGET OFFSET NUMBERS FOLLOW */
+	/* UPDATED TUPLES METADATA ARRAY FOLLOWS */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, nupdated) + sizeof(uint16))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.
@@ -245,6 +328,8 @@ typedef struct xl_btree_newroot
 extern void btree_redo(XLogReaderState *record);
 extern void btree_desc(StringInfo buf, XLogReaderState *record);
 extern const char *btree_identify(uint8 info);
+extern void btree_xlog_startup(void);
+extern void btree_xlog_cleanup(void);
 extern void btree_mask(char *pagedata, BlockNumber blkno);
 
 #endif							/* NBTXLOG_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index c88dccfb8d..6c15df7e70 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -36,7 +36,7 @@ PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL,
 PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
 PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
 PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask)
 PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask)
 PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
 PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 79430d2b7b..f2b03a6cfc 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -158,6 +158,15 @@ static relopt_bool boolRelOpts[] =
 		},
 		true
 	},
+	{
+		{
+			"deduplicate_items",
+			"Enables deduplication on btree index leaf pages",
+			RELOPT_KIND_BTREE,
+			ShareUpdateExclusiveLock
+		},
+		true
+	},
 	/* list terminator */
 	{{NULL}}
 };
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index c16eb05416..dfba5ae39a 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -276,6 +276,10 @@ BuildIndexValueDescription(Relation indexRelation,
 /*
  * Get the latestRemovedXid from the table entries pointed at by the index
  * tuples being deleted.
+ *
+ * Note: index access methods that don't consistently use the standard
+ * IndexTuple + heap TID item pointer representation will need to provide
+ * their own version of this function.
  */
 TransactionId
 index_compute_xid_horizon_for_tuples(Relation irel,
diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index bf245f5dab..d69808e78c 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = \
 	nbtcompare.o \
+	nbtdedup.o \
 	nbtinsert.o \
 	nbtpage.o \
 	nbtree.o \
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index c60a4d0d9e..6499f5adb7 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
 like a hint bit for a heap tuple), but physically removing tuples requires
 exclusive lock.  In the current code we try to remove LP_DEAD tuples when
 we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already).  Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every table TID within the posting list is known dead.)
 
 This leaves the index in a state where it has no entry for a dead tuple
 that still exists in the heap.  This is not a problem for the current
@@ -726,6 +729,134 @@ if it must.  When a page that's already full of duplicates must be split,
 the fallback strategy assumes that duplicates are mostly inserted in
 ascending heap TID order.  The page is split in a way that leaves the left
 half of the page mostly full, and the right half of the page mostly empty.
+The overall effect is that leaf page splits gracefully adapt to inserts of
+large groups of duplicates, maximizing space utilization.  Note also that
+"trapping" large groups of duplicates on the same leaf page like this makes
+deduplication more efficient.  Deduplication can be performed infrequently,
+without merging together existing posting list tuples too often.
+
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid (or at least delay) page splits.  Note that the
+goals for deduplication in unique indexes are rather different; see later
+section for details.  Deduplication alters the physical representation of
+tuples without changing the logical contents of the index, and without
+adding overhead to read queries.  Non-pivot tuples are merged together
+into a single physical tuple with a posting list (a simple array of heap
+TIDs with the standard item pointer format).  Deduplication is always
+applied lazily, at the point where it would otherwise be necessary to
+perform a page split.  It occurs only when LP_DEAD items have been
+removed, as our last line of defense against splitting a leaf page.  We
+can set the LP_DEAD bit with posting list tuples, though only when all
+TIDs are known dead.
+
+Our lazy approach to deduplication allows the page space accounting used
+during page splits to have absolutely minimal special case logic for
+posting lists.  Posting lists can be thought of as extra payload that
+suffix truncation will reliably truncate away as needed during page
+splits, just like non-key columns from an INCLUDE index tuple.
+Incoming/new tuples can generally be treated as non-overlapping plain
+items (though see section on posting list splits for information about how
+overlapping new/incoming items are really handled).
+
+The representation of posting lists is almost identical to the posting
+lists used by GIN, so it would be straightforward to apply GIN's varbyte
+encoding compression scheme to individual posting lists.  Posting list
+compression would break the assumptions made by posting list splits about
+page space accounting (see later section), so it's not clear how
+compression could be integrated with nbtree.  Besides, posting list
+compression does not offer a compelling trade-off for nbtree, since in
+general nbtree is optimized for consistent performance with many
+concurrent readers and writers.
+
+A major goal of our lazy approach to deduplication is to limit the
+performance impact of deduplication with random updates.  Even concurrent
+append-only inserts of the same key value will tend to have inserts of
+individual index tuples in an order that doesn't quite match heap TID
+order.  Delaying deduplication minimizes page level fragmentation.
+
+Deduplication in unique indexes
+-------------------------------
+
+Very often, the range of values that can be placed on a given leaf page in
+a unique index is fixed and permanent.  For example, a primary key on an
+identity column will usually only have page splits caused by the insertion
+of new logical rows within the rightmost leaf page.  If there is a split
+of a non-rightmost leaf page, then the split must have been triggered by
+inserts associated with an UPDATE of an existing logical row.  Splitting a
+leaf page purely to store multiple versions should be considered
+pathological, since it permanently degrades the index structure in order
+to absorb a temporary burst of duplicates.  Deduplication in unique
+indexes helps to prevent these pathological page splits.  Storing
+duplicates in a space efficient manner is not the goal, since in the long
+run there won't be any duplicates anyway.  Rather, we're buying time for
+standard garbage collection mechanisms to run before a page split is
+needed.
+
+Unique index leaf pages only get a deduplication pass when an insertion
+(that might have to split the page) observed an existing duplicate on the
+page in passing.  This is based on the assumption that deduplication will
+only work out when _all_ new insertions are duplicates from UPDATEs.  This
+may mean that we miss an opportunity to delay a page split, but that's
+okay because our ultimate goal is to delay leaf page splits _indefinitely_
+(i.e. to prevent them altogether).  There is little point in trying to
+delay a split that is probably inevitable anyway.  This allows us to avoid
+the overhead of attempting to deduplicate with unique indexes that always
+have few or no duplicates.
+
+Posting list splits
+-------------------
+
+When the incoming tuple happens to overlap with an existing posting list,
+a posting list split is performed.  Like a page split, a posting list
+split resolves a situation where a new/incoming item "won't fit", while
+inserting the incoming item in passing (i.e. as part of the same atomic
+action).  It's possible (though not particularly likely) that an insert of
+a new item on to an almost-full page will overlap with a posting list,
+resulting in both a posting list split and a page split.  Even then, the
+atomic action that splits the posting list also inserts the new item
+(since page splits always insert the new item in passing).  Including the
+posting list split in the same atomic action as the insert avoids problems
+caused by concurrent inserts into the same posting list --  the exact
+details of how we change the posting list depend upon the new item, and
+vice-versa.  A single atomic action also minimizes the volume of extra
+WAL required for a posting list split, since we don't have to explicitly
+WAL-log the original posting list tuple.
+
+Despite piggy-backing on the same atomic action that inserts a new tuple,
+posting list splits can be thought of as a separate, extra action to the
+insert itself (or to the page split itself).  Posting list splits
+conceptually "rewrite" an insert that overlaps with an existing posting
+list into an insert that adds its final new item just to the right of the
+posting list instead.  The size of the posting list won't change, and so
+page space accounting code does not need to care about posting list splits
+at all.  This is an important upside of our design; the page split point
+choice logic is very subtle even without it needing to deal with posting
+list splits.
+
+Only a few isolated extra steps are required to preserve the illusion that
+the new item never overlapped with an existing posting list in the first
+place: the heap TID of the incoming tuple is swapped with the rightmost/max
+heap TID from the existing/originally overlapping posting list.  Also, the
+posting-split-with-page-split case must generate a new high key based on
+an imaginary version of the original page that has both the final new item
+and the after-list-split posting tuple (page splits usually just operate
+against an imaginary version that contains the new item/item that won't
+fit).
+
+This approach avoids inventing an "eager" atomic posting split operation
+that splits the posting list without simultaneously finishing the insert
+of the incoming item.  This alternative design might seem cleaner, but it
+creates subtle problems for page space accounting.  In general, there
+might not be enough free space on the page to split a posting list such
+that the incoming/new item no longer overlaps with either posting list
+half --- the operation could fail before the actual retail insert of the
+new item even begins.  We'd end up having to handle posting list splits
+that need a page split anyway.  Besides, supporting variable "split points"
+while splitting posting lists won't actually improve overall space
+utilization.
 
 Notes About Data Representation
 -------------------------------
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
new file mode 100644
index 0000000000..12d5150844
--- /dev/null
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -0,0 +1,830 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtdedup.c
+ *	  Deduplicate items in Postgres btrees.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/nbtree/nbtdedup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/nbtxlog.h"
+#include "miscadmin.h"
+#include "utils/rel.h"
+
+static bool _bt_do_singleval(Relation rel, Page page, BTDedupState state,
+							 OffsetNumber minoff, IndexTuple newitem);
+static void _bt_singleval_fillfactor(Page page, BTDedupState state,
+									 Size newitemsz);
+#ifdef USE_ASSERT_CHECKING
+static bool _bt_posting_valid(IndexTuple posting);
+#endif
+
+/*
+ * Deduplicate items on a leaf page.  The page will have to be split by caller
+ * if we cannot succesfully free at least newitemsz (we also need space for
+ * newitem's line pointer, which isn't included in caller's newitemsz).
+ *
+ * The general approach taken here is to perform as much deduplication as
+ * possible to free as much space as possible.  Note, however, that "single
+ * value" strategy is sometimes used for !checkingunique callers, in which
+ * case deduplication will leave a few tuples untouched at the end of the
+ * page.  The general idea is to prepare the page for an anticipated page
+ * split that uses nbtsplitloc.c's "single value" strategy to determine a
+ * split point.  (There is no reason to deduplicate items that will end up on
+ * the right half of the page after the anticipated page split; better to
+ * handle those if and when the anticipated right half page gets its own
+ * deduplication pass, following further inserts of duplicates.)
+ *
+ * This function should be called during insertion, when the page doesn't have
+ * enough space to fit an incoming newitem.  If the BTP_HAS_GARBAGE page flag
+ * was set, caller should have removed any LP_DEAD items by calling
+ * _bt_vacuum_one_page() before calling here.  We may still have to kill
+ * LP_DEAD items here when the page's BTP_HAS_GARBAGE hint is falsely unset,
+ * but that should be rare.  Also, _bt_vacuum_one_page() won't unset the
+ * BTP_HAS_GARBAGE flag when it finds no LP_DEAD items, so a successful
+ * deduplication pass will always clear it, just to keep things tidy.
+ */
+void
+_bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
+				   IndexTuple newitem, Size newitemsz, bool checkingunique)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buf);
+	BTPageOpaque opaque;
+	Page		newpage;
+	int			newpagendataitems = 0;
+	OffsetNumber deletable[MaxIndexTuplesPerPage];
+	BTDedupState state;
+	int			ndeletable = 0;
+	Size		pagesaving = 0;
+	bool		singlevalstrat = false;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+
+	/*
+	 * We can't assume that there are no LP_DEAD items.  For one thing, VACUUM
+	 * will clear the BTP_HAS_GARBAGE hint without reliably removing items
+	 * that are marked LP_DEAD.  We don't want to unnecessarily unset LP_DEAD
+	 * bits when deduplicating items.  Allowing it would be correct, though
+	 * wasteful.
+	 */
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	minoff = P_FIRSTDATAKEY(opaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+
+		if (ItemIdIsDead(itemid))
+			deletable[ndeletable++] = offnum;
+	}
+
+	if (ndeletable > 0)
+	{
+		_bt_delitems_delete(rel, buf, deletable, ndeletable, heapRel);
+
+		/*
+		 * Return when a split will be avoided.  This is equivalent to
+		 * avoiding a split using the usual _bt_vacuum_one_page() path.
+		 */
+		if (PageGetFreeSpace(page) >= newitemsz)
+			return;
+
+		/*
+		 * Reconsider number of items on page, in case _bt_delitems_delete()
+		 * managed to delete an item or two
+		 */
+		minoff = P_FIRSTDATAKEY(opaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+	}
+
+	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+	newitemsz += sizeof(ItemIdData);
+
+	/*
+	 * By here, it's clear that deduplication will definitely be attempted.
+	 * Initialize deduplication state.
+	 *
+	 * It would be possible for maxpostingsize (limit on posting list tuple
+	 * size) to be set to one third of the page.  However, it seems like a
+	 * good idea to limit the size of posting lists to one sixth of a page.
+	 * That ought to leave us with a good split point when pages full of
+	 * duplicates can be split several times.
+	 */
+	state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+	state->deduplicate = true;
+	state->maxpostingsize = Min(BTMaxItemSize(page) / 2, INDEX_SIZE_MASK);
+	/* Metadata about base tuple of current pending posting list */
+	state->base = NULL;
+	state->baseoff = InvalidOffsetNumber;
+	state->basetupsize = 0;
+	/* Metadata about current pending posting list TIDs */
+	state->htids = palloc(state->maxpostingsize);
+	state->nhtids = 0;
+	state->nitems = 0;
+	/* Size of all physical tuples to be replaced by pending posting list */
+	state->phystupsize = 0;
+	/* nintervals should be initialized to zero */
+	state->nintervals = 0;
+
+	/* Determine if "single value" strategy should be used */
+	if (!checkingunique)
+		singlevalstrat = _bt_do_singleval(rel, page, state, minoff, newitem);
+
+	/*
+	 * Deduplicate items from page, and write them to newpage.
+	 *
+	 * Copy the original page's LSN into newpage copy.  This will become the
+	 * updated version of the page.  We need this because XLogInsert will
+	 * examine the LSN and possibly dump it in a page image.
+	 */
+	newpage = PageGetTempPageCopySpecial(page);
+	PageSetLSN(newpage, PageGetLSN(page));
+
+	/* Copy high key, if any */
+	if (!P_RIGHTMOST(opaque))
+	{
+		ItemId		hitemid = PageGetItemId(page, P_HIKEY);
+		Size		hitemsz = ItemIdGetLength(hitemid);
+		IndexTuple	hitem = (IndexTuple) PageGetItem(page, hitemid);
+
+		if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to add highkey during deduplication");
+	}
+
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(!ItemIdIsDead(itemid));
+
+		if (offnum == minoff)
+		{
+			/*
+			 * No previous/base tuple for the data item -- use the data item
+			 * as base tuple of pending posting list
+			 */
+			_bt_dedup_start_pending(state, itup, offnum);
+		}
+		else if (state->deduplicate &&
+				 _bt_keep_natts_fast(rel, state->base, itup) > natts &&
+				 _bt_dedup_save_htid(state, itup))
+		{
+			/*
+			 * Tuple is equal to base tuple of pending posting list.  Heap
+			 * TID(s) for itup have been saved in state.
+			 */
+		}
+		else
+		{
+			/*
+			 * Tuple is not equal to pending posting list tuple, or
+			 * _bt_dedup_save_htid() opted to not merge current item into
+			 * pending posting list for some other reason (e.g., adding more
+			 * TIDs would have caused posting list to exceed current
+			 * maxpostingsize).
+			 *
+			 * If state contains pending posting list with more than one item,
+			 * form new posting tuple, and actually update the page.  Else
+			 * reset the state and move on without modifying the page.
+			 */
+			pagesaving += _bt_dedup_finish_pending(newpage, state);
+			newpagendataitems++;
+
+			if (singlevalstrat)
+			{
+				/*
+				 * Single value strategy's extra steps.
+				 *
+				 * Lower maxpostingsize for sixth and final item that might be
+				 * deduplicated by current deduplication pass.  When sixth
+				 * item formed/observed, stop deduplicating items.
+				 *
+				 * Note: It's possible that this will be reached even when
+				 * current deduplication pass has yet to merge together some
+				 * existing items.  It doesn't matter whether or not the
+				 * current call generated the maxpostingsize-capped duplicate
+				 * tuples at the start of the page.
+				 */
+				if (newpagendataitems == 5)
+					_bt_singleval_fillfactor(page, state, newitemsz);
+				else if (newpagendataitems == 6)
+				{
+					state->deduplicate = false;
+					singlevalstrat = false; /* won't be back here */
+				}
+			}
+
+			/* itup starts new pending posting list */
+			_bt_dedup_start_pending(state, itup, offnum);
+		}
+	}
+
+	/* Handle the last item */
+	pagesaving += _bt_dedup_finish_pending(newpage, state);
+	newpagendataitems++;
+
+	/*
+	 * If no items suitable for deduplication were found, newpage must be
+	 * exactly the same as the original page, so just return from function.
+	 *
+	 * We could determine whether or not to proceed on the basis the space
+	 * savings being sufficient to avoid an immediate page split instead.  We
+	 * don't do that because there is some small value in nbtsplitloc.c always
+	 * operating against a page that is fully deduplicated (apart from
+	 * newitem).  Besides, most of the cost has already been paid.
+	 */
+	if (state->nintervals == 0)
+	{
+		/* cannot leak memory here */
+		pfree(newpage);
+		pfree(state->htids);
+		pfree(state);
+		return;
+	}
+
+	/*
+	 * By here, it's clear that deduplication will definitely go ahead.
+	 *
+	 * Clear the BTP_HAS_GARBAGE gpage flag in the unlikely event that it is
+	 * still falsely set, just to keep things tidy.  (We can't rely on
+	 * _bt_vacuum_one_page() having done this already, and we can't rely on a
+	 * page split or VACUUM getting to it in the near future.)
+	 */
+	if (P_HAS_GARBAGE(opaque))
+	{
+		BTPageOpaque nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+	}
+
+	START_CRIT_SECTION();
+
+	PageRestoreTempPage(newpage, page);
+	MarkBufferDirty(buf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+		xl_btree_dedup xlrec_dedup;
+
+		xlrec_dedup.nintervals = state->nintervals;
+
+		XLogBeginInsert();
+		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+		XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+		/*
+		 * The intervals array is not in the buffer, but pretend that it is.
+		 * When XLogInsert stores the whole buffer, the array need not be
+		 * stored too.
+		 */
+		XLogRegisterBufData(0, (char *) state->intervals,
+							state->nintervals * sizeof(BTDedupInterval));
+
+		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP);
+
+		PageSetLSN(page, recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	/* Local space accounting should agree with page accounting */
+	Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
+
+	/* cannot leak memory here */
+	pfree(state->htids);
+	pfree(state);
+}
+
+/*
+ * Create a new pending posting list tuple based on caller's base tuple.
+ *
+ * Every tuple processed by deduplication either becomes the base tuple for a
+ * posting list, or gets its heap TID(s) accepted into a pending posting list.
+ * A tuple that starts out as the base tuple for a posting list will only
+ * actually be rewritten within _bt_dedup_finish_pending() when it turns out
+ * that there are duplicates that can be merged into the base tuple.
+ */
+void
+_bt_dedup_start_pending(BTDedupState state, IndexTuple base,
+						OffsetNumber baseoff)
+{
+	Assert(state->nhtids == 0);
+	Assert(state->nitems == 0);
+	Assert(!BTreeTupleIsPivot(base));
+
+	/*
+	 * Copy heap TID(s) from new base tuple for new candidate posting list
+	 * into working state's array
+	 */
+	if (!BTreeTupleIsPosting(base))
+	{
+		memcpy(state->htids, &base->t_tid, sizeof(ItemPointerData));
+		state->nhtids = 1;
+		state->basetupsize = IndexTupleSize(base);
+	}
+	else
+	{
+		int			nposting;
+
+		nposting = BTreeTupleGetNPosting(base);
+		memcpy(state->htids, BTreeTupleGetPosting(base),
+			   sizeof(ItemPointerData) * nposting);
+		state->nhtids = nposting;
+		/* basetupsize should not include existing posting list */
+		state->basetupsize = BTreeTupleGetPostingOffset(base);
+	}
+
+	/*
+	 * Save new base tuple itself -- it'll be needed if we actually create a
+	 * new posting list from new pending posting list.
+	 *
+	 * Must maintain physical size of all existing tuples (including line
+	 * pointer overhead) so that we can calculate space savings on page.
+	 */
+	state->nitems = 1;
+	state->base = base;
+	state->baseoff = baseoff;
+	state->phystupsize = MAXALIGN(IndexTupleSize(base)) + sizeof(ItemIdData);
+	/* Also save baseoff in pending state for interval */
+	state->intervals[state->nintervals].baseoff = state->baseoff;
+}
+
+/*
+ * Save itup heap TID(s) into pending posting list where possible.
+ *
+ * Returns bool indicating if the pending posting list managed by state now
+ * includes itup's heap TID(s).
+ */
+bool
+_bt_dedup_save_htid(BTDedupState state, IndexTuple itup)
+{
+	int			nhtids;
+	ItemPointer htids;
+	Size		mergedtupsz;
+
+	Assert(!BTreeTupleIsPivot(itup));
+
+	if (!BTreeTupleIsPosting(itup))
+	{
+		nhtids = 1;
+		htids = &itup->t_tid;
+	}
+	else
+	{
+		nhtids = BTreeTupleGetNPosting(itup);
+		htids = BTreeTupleGetPosting(itup);
+	}
+
+	/*
+	 * Don't append (have caller finish pending posting list as-is) if
+	 * appending heap TID(s) from itup would put us over maxpostingsize limit.
+	 *
+	 * This calculation needs to match the code used within _bt_form_posting()
+	 * for new posting list tuples.
+	 */
+	mergedtupsz = MAXALIGN(state->basetupsize +
+						   (state->nhtids + nhtids) * sizeof(ItemPointerData));
+
+	if (mergedtupsz > state->maxpostingsize)
+		return false;
+
+	/*
+	 * Save heap TIDs to pending posting list tuple -- itup can be merged into
+	 * pending posting list
+	 */
+	state->nitems++;
+	memcpy(state->htids + state->nhtids, htids,
+		   sizeof(ItemPointerData) * nhtids);
+	state->nhtids += nhtids;
+	state->phystupsize += MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
+
+	return true;
+}
+
+/*
+ * Finalize pending posting list tuple, and add it to the page.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * Returns space saving from deduplicating to make a new posting list tuple.
+ * Note that this includes line pointer overhead.  This is zero in the case
+ * where no deduplication was possible.
+ */
+Size
+_bt_dedup_finish_pending(Page newpage, BTDedupState state)
+{
+	OffsetNumber tupoff;
+	Size		tuplesz;
+	Size		spacesaving;
+
+	Assert(state->nitems > 0);
+	Assert(state->nitems <= state->nhtids);
+	Assert(state->intervals[state->nintervals].baseoff == state->baseoff);
+
+	tupoff = OffsetNumberNext(PageGetMaxOffsetNumber(newpage));
+	if (state->nitems == 1)
+	{
+		/* Use original, unchanged base tuple */
+		tuplesz = IndexTupleSize(state->base);
+		if (PageAddItem(newpage, (Item) state->base, tuplesz, tupoff,
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "deduplication failed to add tuple to page");
+
+		spacesaving = 0;
+	}
+	else
+	{
+		IndexTuple	final;
+
+		/* Form a tuple with a posting list */
+		final = _bt_form_posting(state->base, state->htids, state->nhtids);
+		tuplesz = IndexTupleSize(final);
+		Assert(tuplesz <= state->maxpostingsize);
+
+		/* Save final number of items for posting list */
+		state->intervals[state->nintervals].nitems = state->nitems;
+
+		Assert(tuplesz == MAXALIGN(IndexTupleSize(final)));
+		if (PageAddItem(newpage, (Item) final, tuplesz, tupoff, false,
+						false) == InvalidOffsetNumber)
+			elog(ERROR, "deduplication failed to add tuple to page");
+
+		pfree(final);
+		spacesaving = state->phystupsize - (tuplesz + sizeof(ItemIdData));
+		/* Increment nintervals, since we wrote a new posting list tuple */
+		state->nintervals++;
+		Assert(spacesaving > 0 && spacesaving < BLCKSZ);
+	}
+
+	/* Reset state for next pending posting list */
+	state->nhtids = 0;
+	state->nitems = 0;
+	state->phystupsize = 0;
+
+	return spacesaving;
+}
+
+/*
+ * Determine if page non-pivot tuples (data items) are all duplicates of the
+ * same value -- if they are, deduplication's "single value" strategy should
+ * be applied.  The general goal of this strategy is to ensure that
+ * nbtsplitloc.c (which uses its own single value strategy) will find a useful
+ * split point as further duplicates are inserted, and successive rightmost
+ * page splits occur among pages that store the same duplicate value.  When
+ * the page finally splits, it should end up BTREE_SINGLEVAL_FILLFACTOR% full,
+ * just like it would if deduplication were disabled.
+ *
+ * We expect that affected workloads will require _several_ single value
+ * strategy deduplication passes (over a page that only stores duplicates)
+ * before the page is finally split.  The first deduplication pass should only
+ * find regular non-pivot tuples.  Later deduplication passes will find
+ * existing maxpostingsize-capped posting list tuples, which must be skipped
+ * over.  The penultimate pass is generally the first pass that actually
+ * reaches _bt_singleval_fillfactor(), and so will deliberately leave behind a
+ * few untouched non-pivot tuples.  The final deduplication pass won't free
+ * any space -- it will skip over everything without merging anything (it
+ * retraces the steps of the penultimate pass).
+ *
+ * Fortunately, having several passes isn't too expensive.  Each pass (after
+ * the first pass) won't spend many cycles on the large posting list tuples
+ * left by previous passes.  Each pass will find a large contiguous group of
+ * smaller duplicate tuples to merge together at the end of the page.
+ *
+ * Note: We deliberately don't bother checking if the high key is a distinct
+ * value (prior to the TID tiebreaker column) before proceeding, unlike
+ * nbtsplitloc.c.  Its single value strategy only gets applied on the
+ * rightmost page of duplicates of the same value (other leaf pages full of
+ * duplicates will get a simple 50:50 page split instead of splitting towards
+ * the end of the page).  There is little point in making the same distinction
+ * here.
+ */
+static bool
+_bt_do_singleval(Relation rel, Page page, BTDedupState state,
+				 OffsetNumber minoff, IndexTuple newitem)
+{
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+	ItemId		itemid;
+	IndexTuple	itup;
+
+	itemid = PageGetItemId(page, minoff);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+
+	if (_bt_keep_natts_fast(rel, newitem, itup) > natts)
+	{
+		itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		if (_bt_keep_natts_fast(rel, newitem, itup) > natts)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Lower maxpostingsize when using "single value" strategy, to avoid a sixth
+ * and final maxpostingsize-capped tuple.  The sixth and final posting list
+ * tuple will end up somewhat smaller than the first five.  (Note: The first
+ * five tuples could actually just be very large duplicate tuples that
+ * couldn't be merged together at all.  Deduplication will simply not modify
+ * the page when that happens.)
+ *
+ * When there are six posting lists on the page (after current deduplication
+ * pass goes on to create/observe a sixth very large tuple), caller should end
+ * its deduplication pass.  It isn't useful to try to deduplicate items that
+ * are supposed to end up on the new right sibling page following the
+ * anticipated page split.  A future deduplication pass of future right
+ * sibling page might take care of it.  (This is why the first single value
+ * strategy deduplication pass for a given leaf page will generally find only
+ * plain non-pivot tuples -- see _bt_do_singleval() comments.)
+ */
+static void
+_bt_singleval_fillfactor(Page page, BTDedupState state, Size newitemsz)
+{
+	Size		leftfree;
+	int			reduction;
+
+	/* This calculation needs to match nbtsplitloc.c */
+	leftfree = PageGetPageSize(page) - SizeOfPageHeaderData -
+		MAXALIGN(sizeof(BTPageOpaqueData));
+	/* Subtract size of new high key (includes pivot heap TID space) */
+	leftfree -= newitemsz + MAXALIGN(sizeof(ItemPointerData));
+
+	/*
+	 * Reduce maxpostingsize by an amount equal to target free space on left
+	 * half of page
+	 */
+	reduction = leftfree * ((100 - BTREE_SINGLEVAL_FILLFACTOR) / 100.0);
+	if (state->maxpostingsize > reduction)
+		state->maxpostingsize -= reduction;
+	else
+		state->maxpostingsize = 0;
+}
+
+/*
+ * Build a posting list tuple based on caller's "base" index tuple and list of
+ * heap TIDs.  When nhtids == 1, builds a standard non-pivot tuple without a
+ * posting list. (Posting list tuples can never have a single heap TID, partly
+ * because that ensures that deduplication always reduces final MAXALIGN()'d
+ * size of entire tuple.)
+ *
+ * Convention is that posting list starts at a MAXALIGN()'d offset (rather
+ * than a SHORTALIGN()'d offset), in line with the approach taken when
+ * appending a heap TID to new pivot tuple/high key during suffix truncation.
+ * This sometimes wastes a little space that was only needed as alignment
+ * padding in the original tuple.  Following this convention simplifies the
+ * space accounting used when deduplicating a page (the same convention
+ * simplifies the accounting for choosing a point to split a page at).
+ *
+ * Note: Caller's "htids" array must be unique and already in ascending TID
+ * order.  Any existing heap TIDs from "base" won't automatically appear in
+ * returned posting list tuple (they must be included in htids array.)
+ */
+IndexTuple
+_bt_form_posting(IndexTuple base, ItemPointer htids, int nhtids)
+{
+	uint32		keysize,
+				newsize;
+	IndexTuple	itup;
+
+	if (BTreeTupleIsPosting(base))
+		keysize = BTreeTupleGetPostingOffset(base);
+	else
+		keysize = IndexTupleSize(base);
+
+	Assert(!BTreeTupleIsPivot(base));
+	Assert(nhtids > 0 && nhtids <= PG_UINT16_MAX);
+	Assert(keysize == MAXALIGN(keysize));
+
+	/* Determine final size of new tuple */
+	if (nhtids > 1)
+		newsize = MAXALIGN(keysize +
+						   nhtids * sizeof(ItemPointerData));
+	else
+		newsize = keysize;
+
+	Assert(newsize <= INDEX_SIZE_MASK);
+	Assert(newsize == MAXALIGN(newsize));
+
+	/* Allocate memory using palloc0() (matches index_form_tuple()) */
+	itup = palloc0(newsize);
+	memcpy(itup, base, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+	if (nhtids > 1)
+	{
+		/* Form posting list tuple */
+		BTreeTupleSetPosting(itup, nhtids, keysize);
+		memcpy(BTreeTupleGetPosting(itup), htids,
+			   sizeof(ItemPointerData) * nhtids);
+		Assert(_bt_posting_valid(itup));
+	}
+	else
+	{
+		/* Form standard non-pivot tuple */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		ItemPointerCopy(htids, &itup->t_tid);
+		Assert(ItemPointerIsValid(&itup->t_tid));
+	}
+
+	return itup;
+}
+
+/*
+ * Generate a replacement tuple by "updating" a posting list tuple so that it
+ * no longer has TIDs that need to be deleted.
+ *
+ * Used by VACUUM.  Caller's vacposting argument points to the existing
+ * posting list tuple to be updated.
+ *
+ * On return, caller's vacposting argument will point to final "updated"
+ * tuple, which will be palloc()'d in caller's memory context.
+ */
+void
+_bt_update_posting(BTVacuumPosting vacposting)
+{
+	IndexTuple	origtuple = vacposting->itup;
+	uint32		keysize,
+				newsize;
+	IndexTuple	itup;
+	int			nhtids;
+	int			ui,
+				d;
+	ItemPointer htids;
+
+	nhtids = BTreeTupleGetNPosting(origtuple) - vacposting->ndeletedtids;
+
+	Assert(_bt_posting_valid(origtuple));
+	Assert(nhtids > 0 && nhtids < BTreeTupleGetNPosting(origtuple));
+
+	if (BTreeTupleIsPosting(origtuple))
+		keysize = BTreeTupleGetPostingOffset(origtuple);
+	else
+		keysize = IndexTupleSize(origtuple);
+
+	/*
+	 * Determine final size of new tuple.
+	 *
+	 * This calculation needs to match the code used within _bt_form_posting()
+	 * for new posting list tuples.  We avoid calling _bt_form_posting() here
+	 * to save ourselves a second memory allocation for a htids workspace.
+	 */
+	if (nhtids > 1)
+		newsize = MAXALIGN(keysize +
+						   nhtids * sizeof(ItemPointerData));
+	else
+		newsize = keysize;
+
+	/* Allocate memory using palloc0() (matches index_form_tuple()) */
+	itup = palloc0(newsize);
+	memcpy(itup, origtuple, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+
+	if (nhtids > 1)
+	{
+		/* Form posting list tuple */
+		BTreeTupleSetPosting(itup, nhtids, keysize);
+		htids = BTreeTupleGetPosting(itup);
+	}
+	else
+	{
+		/* Form standard non-pivot tuple */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		htids = &itup->t_tid;
+	}
+
+	ui = 0;
+	d = 0;
+	for (int i = 0; i < BTreeTupleGetNPosting(origtuple); i++)
+	{
+		if (d < vacposting->ndeletedtids && vacposting->deletetids[d] == i)
+		{
+			d++;
+			continue;
+		}
+		htids[ui++] = *BTreeTupleGetPostingN(origtuple, i);
+	}
+	Assert(ui == nhtids);
+	Assert(d == vacposting->ndeletedtids);
+	Assert(nhtids == 1 || _bt_posting_valid(itup));
+
+	/* vacposting arg's itup will now point to updated version */
+	vacposting->itup = itup;
+}
+
+/*
+ * Prepare for a posting list split by swapping heap TID in newitem with heap
+ * TID from original posting list (the 'oposting' heap TID located at offset
+ * 'postingoff').  Modifies newitem, so caller should pass their own private
+ * copy that can safely be modified.
+ *
+ * Returns new posting list tuple, which is palloc()'d in caller's context.
+ * This is guaranteed to be the same size as 'oposting'.  Modified newitem is
+ * what caller actually inserts. (This happens inside the same critical
+ * section that performs an in-place update of old posting list using new
+ * posting list returned here.)
+ *
+ * While the keys from newitem and oposting must be opclass equal, and must
+ * generate identical output when run through the underlying type's output
+ * function, it doesn't follow that their representations match exactly.
+ * Caller must avoid assuming that there can't be representational differences
+ * that make datums from oposting bigger or smaller than the corresponding
+ * datums from newitem.  For example, differences in TOAST input state might
+ * break a faulty assumption about tuple size (the executor is entitled to
+ * apply TOAST compression based on its own criteria).  It also seems possible
+ * that further representational variation will be introduced in the future,
+ * in order to support nbtree features like page-level prefix compression.
+ *
+ * See nbtree/README for details on the design of posting list splits.
+ */
+IndexTuple
+_bt_swap_posting(IndexTuple newitem, IndexTuple oposting, int postingoff)
+{
+	int			nhtids;
+	char	   *replacepos;
+	char	   *replaceposright;
+	Size		nmovebytes;
+	IndexTuple	nposting;
+
+	nhtids = BTreeTupleGetNPosting(oposting);
+	Assert(_bt_posting_valid(oposting));
+	Assert(postingoff > 0 && postingoff < nhtids);
+
+	/*
+	 * Move item pointers in posting list to make a gap for the new item's
+	 * heap TID.  We shift TIDs one place to the right, losing original
+	 * rightmost TID. (nmovebytes must not include TIDs to the left of
+	 * postingoff, nor the existing rightmost/max TID that gets overwritten.)
+	 */
+	nposting = CopyIndexTuple(oposting);
+	replacepos = (char *) BTreeTupleGetPostingN(nposting, postingoff);
+	replaceposright = (char *) BTreeTupleGetPostingN(nposting, postingoff + 1);
+	nmovebytes = (nhtids - postingoff - 1) * sizeof(ItemPointerData);
+	memmove(replaceposright, replacepos, nmovebytes);
+
+	/* Fill the gap at postingoff with TID of new item (original new TID) */
+	Assert(!BTreeTupleIsPivot(newitem) && !BTreeTupleIsPosting(newitem));
+	ItemPointerCopy(&newitem->t_tid, (ItemPointer) replacepos);
+
+	/* Now copy oposting's rightmost/max TID into new item (final new TID) */
+	ItemPointerCopy(BTreeTupleGetMaxHeapTID(oposting), &newitem->t_tid);
+
+	Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(nposting),
+							  BTreeTupleGetHeapTID(newitem)) < 0);
+	Assert(_bt_posting_valid(nposting));
+
+	return nposting;
+}
+
+/*
+ * Verify posting list invariants for "posting", which must be a posting list
+ * tuple.  Used within assertions.
+ */
+#ifdef USE_ASSERT_CHECKING
+static bool
+_bt_posting_valid(IndexTuple posting)
+{
+	ItemPointerData last;
+	ItemPointer htid;
+
+	if (!BTreeTupleIsPosting(posting) || BTreeTupleGetNPosting(posting) < 2)
+		return false;
+
+	/* Remember first heap TID for loop */
+	ItemPointerCopy(BTreeTupleGetHeapTID(posting), &last);
+	if (!ItemPointerIsValid(&last))
+		return false;
+
+	/* Iterate, starting from second TID */
+	for (int i = 1; i < BTreeTupleGetNPosting(posting); i++)
+	{
+		htid = BTreeTupleGetPostingN(posting, i);
+
+		if (!ItemPointerIsValid(htid))
+			return false;
+		if (ItemPointerCompare(htid, &last) <= 0)
+			return false;
+		ItemPointerCopy(htid, &last);
+	}
+
+	return true;
+}
+#endif
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 4e5849ab8e..0648ffa37c 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,10 +47,12 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   BTStack stack,
 						   IndexTuple itup,
 						   OffsetNumber newitemoff,
+						   int postingoff,
 						   bool split_only_page);
 static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 						Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
-						IndexTuple newitem);
+						IndexTuple newitem, IndexTuple orignewitem,
+						IndexTuple nposting, uint16 postingoff);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 							  BTStack stack, bool is_root, bool is_only);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
@@ -125,6 +127,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	insertstate.itup_key = itup_key;
 	insertstate.bounds_valid = false;
 	insertstate.buf = InvalidBuffer;
+	insertstate.postingoff = 0;
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -295,7 +298,7 @@ top:
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
 		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
-					   itup, newitemoff, false);
+					   itup, newitemoff, insertstate.postingoff, false);
 	}
 	else
 	{
@@ -340,6 +343,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 				 uint32 *speculativeToken)
 {
 	IndexTuple	itup = insertstate->itup;
+	IndexTuple	curitup;
+	ItemId		curitemid;
 	BTScanInsert itup_key = insertstate->itup_key;
 	SnapshotData SnapshotDirty;
 	OffsetNumber offset;
@@ -348,6 +353,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 	BTPageOpaque opaque;
 	Buffer		nbuf = InvalidBuffer;
 	bool		found = false;
+	bool		inposting = false;
+	bool		prevalldead = true;
+	int			curposti = 0;
 
 	/* Assume unique until we find a duplicate */
 	*is_unique = true;
@@ -375,13 +383,21 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 	Assert(itup_key->scantid == NULL);
 	for (;;)
 	{
-		ItemId		curitemid;
-		IndexTuple	curitup;
-		BlockNumber nblkno;
-
 		/*
-		 * make sure the offset points to an actual item before trying to
-		 * examine it...
+		 * Each iteration of the loop processes one heap TID, not one index
+		 * tuple.  Current offset number for page isn't usually advanced on
+		 * iterations that process heap TIDs from posting list tuples.
+		 *
+		 * "inposting" state is set when _inside_ a posting list --- not when
+		 * we're at the start (or end) of a posting list.  We advance curposti
+		 * at the end of the iteration when inside a posting list tuple.  In
+		 * general, every loop iteration either advances the page offset or
+		 * advances curposti --- an iteration that handles the rightmost/max
+		 * heap TID in a posting list finally advances the page offset (and
+		 * unsets "inposting").
+		 *
+		 * Make sure the offset points to an actual index tuple before trying
+		 * to examine it...
 		 */
 		if (offset <= maxoff)
 		{
@@ -406,31 +422,60 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 				break;
 			}
 
-			curitemid = PageGetItemId(page, offset);
-
 			/*
-			 * We can skip items that are marked killed.
+			 * We can skip items that are already marked killed.
 			 *
 			 * In the presence of heavy update activity an index may contain
 			 * many killed items with the same key; running _bt_compare() on
 			 * each killed item gets expensive.  Just advance over killed
 			 * items as quickly as we can.  We only apply _bt_compare() when
-			 * we get to a non-killed item.  Even those comparisons could be
-			 * avoided (in the common case where there is only one page to
-			 * visit) by reusing bounds, but just skipping dead items is fast
-			 * enough.
+			 * we get to a non-killed item.  We could reuse the bounds to
+			 * avoid _bt_compare() calls for known equal tuples, but it
+			 * doesn't seem worth it.  Workloads with heavy update activity
+			 * tend to have many deduplication passes, so we'll often avoid
+			 * most of those comparisons, too (we call _bt_compare() when the
+			 * posting list tuple is initially encountered, though not when
+			 * processing later TIDs from the same tuple).
 			 */
-			if (!ItemIdIsDead(curitemid))
+			if (!inposting)
+				curitemid = PageGetItemId(page, offset);
+			if (inposting || !ItemIdIsDead(curitemid))
 			{
 				ItemPointerData htid;
 				bool		all_dead;
 
-				if (_bt_compare(rel, itup_key, page, offset) != 0)
-					break;		/* we're past all the equal tuples */
+				if (!inposting)
+				{
+					/* Plain tuple, or first TID in posting list tuple */
+					if (_bt_compare(rel, itup_key, page, offset) != 0)
+						break;	/* we're past all the equal tuples */
 
-				/* okay, we gotta fetch the heap tuple ... */
-				curitup = (IndexTuple) PageGetItem(page, curitemid);
-				htid = curitup->t_tid;
+					/* Advanced curitup */
+					curitup = (IndexTuple) PageGetItem(page, curitemid);
+					Assert(!BTreeTupleIsPivot(curitup));
+				}
+
+				/* okay, we gotta fetch the heap tuple using htid ... */
+				if (!BTreeTupleIsPosting(curitup))
+				{
+					/* ... htid is from simple non-pivot tuple */
+					Assert(!inposting);
+					htid = curitup->t_tid;
+				}
+				else if (!inposting)
+				{
+					/* ... htid is first TID in new posting list */
+					inposting = true;
+					prevalldead = true;
+					curposti = 0;
+					htid = *BTreeTupleGetPostingN(curitup, 0);
+				}
+				else
+				{
+					/* ... htid is second or subsequent TID in posting list */
+					Assert(curposti > 0);
+					htid = *BTreeTupleGetPostingN(curitup, curposti);
+				}
 
 				/*
 				 * If we are doing a recheck, we expect to find the tuple we
@@ -506,8 +551,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					 * not part of this chain because it had a different index
 					 * entry.
 					 */
-					htid = itup->t_tid;
-					if (table_index_fetch_tuple_check(heapRel, &htid,
+					if (table_index_fetch_tuple_check(heapRel, &itup->t_tid,
 													  SnapshotSelf, NULL))
 					{
 						/* Normal case --- it's still live */
@@ -565,12 +609,14 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 													RelationGetRelationName(rel))));
 					}
 				}
-				else if (all_dead)
+				else if (all_dead && (!inposting ||
+									  (prevalldead &&
+									   curposti == BTreeTupleGetNPosting(curitup) - 1)))
 				{
 					/*
-					 * The conflicting tuple (or whole HOT chain) is dead to
-					 * everyone, so we may as well mark the index entry
-					 * killed.
+					 * The conflicting tuple (or all HOT chains pointed to by
+					 * all posting list TIDs) is dead to everyone, so mark the
+					 * index entry killed.
 					 */
 					ItemIdMarkDead(curitemid);
 					opaque->btpo_flags |= BTP_HAS_GARBAGE;
@@ -584,14 +630,29 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					else
 						MarkBufferDirtyHint(insertstate->buf, true);
 				}
+
+				/*
+				 * Remember if posting list tuple has even a single HOT chain
+				 * whose members are not all dead
+				 */
+				if (!all_dead && inposting)
+					prevalldead = false;
 			}
 		}
 
-		/*
-		 * Advance to next tuple to continue checking.
-		 */
-		if (offset < maxoff)
+		if (inposting && curposti < BTreeTupleGetNPosting(curitup) - 1)
+		{
+			/* Advance to next TID in same posting list */
+			curposti++;
+			continue;
+		}
+		else if (offset < maxoff)
+		{
+			/* Advance to next tuple */
+			curposti = 0;
+			inposting = false;
 			offset = OffsetNumberNext(offset);
+		}
 		else
 		{
 			int			highkeycmp;
@@ -606,7 +667,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 			/* Advance to next non-dead page --- there must be one */
 			for (;;)
 			{
-				nblkno = opaque->btpo_next;
+				BlockNumber nblkno = opaque->btpo_next;
+
 				nbuf = _bt_relandgetbuf(rel, nbuf, nblkno, BT_READ);
 				page = BufferGetPage(nbuf);
 				opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -616,6 +678,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					elog(ERROR, "fell off the end of index \"%s\"",
 						 RelationGetRelationName(rel));
 			}
+			/* Will also advance to next tuple */
+			curposti = 0;
+			inposting = false;
 			maxoff = PageGetMaxOffsetNumber(page);
 			offset = P_FIRSTDATAKEY(opaque);
 			/* Don't invalidate binary search bounds */
@@ -684,6 +749,7 @@ _bt_findinsertloc(Relation rel,
 	BTScanInsert itup_key = insertstate->itup_key;
 	Page		page = BufferGetPage(insertstate->buf);
 	BTPageOpaque lpageop;
+	OffsetNumber newitemoff;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -699,6 +765,9 @@ _bt_findinsertloc(Relation rel,
 
 	if (itup_key->heapkeyspace)
 	{
+		/* Keep track of whether checkingunique duplicate seen */
+		bool		uniquedup = false;
+
 		/*
 		 * If we're inserting into a unique index, we may have to walk right
 		 * through leaf pages to find the one leaf page that we must insert on
@@ -715,6 +784,13 @@ _bt_findinsertloc(Relation rel,
 		 */
 		if (checkingunique)
 		{
+			if (insertstate->low < insertstate->stricthigh)
+			{
+				/* Encountered a duplicate in _bt_check_unique() */
+				Assert(insertstate->bounds_valid);
+				uniquedup = true;
+			}
+
 			for (;;)
 			{
 				/*
@@ -741,18 +817,43 @@ _bt_findinsertloc(Relation rel,
 				/* Update local state after stepping right */
 				page = BufferGetPage(insertstate->buf);
 				lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+				/* Assume duplicates (if checkingunique) */
+				uniquedup = true;
 			}
 		}
 
 		/*
 		 * If the target page is full, see if we can obtain enough space by
-		 * erasing LP_DEAD items
+		 * erasing LP_DEAD items.  If that fails to free enough space, see if
+		 * we can avoid a page split by performing a deduplication pass over
+		 * the page.
+		 *
+		 * We only perform a deduplication pass for a checkingunique caller
+		 * when the incoming item is a duplicate of an existing item on the
+		 * leaf page.  This heuristic avoids wasting cycles -- we only expect
+		 * to benefit from deduplicating a unique index page when most or all
+		 * recently added items are duplicates.  See nbtree/README.
 		 */
-		if (PageGetFreeSpace(page) < insertstate->itemsz &&
-			P_HAS_GARBAGE(lpageop))
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
 		{
-			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
-			insertstate->bounds_valid = false;
+			if (P_HAS_GARBAGE(lpageop))
+			{
+				_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+				insertstate->bounds_valid = false;
+
+				/* Might as well assume duplicates (if checkingunique) */
+				uniquedup = true;
+			}
+
+			if (itup_key->allequalimage && BTGetDeduplicateItems(rel) &&
+				(!checkingunique || uniquedup) &&
+				PageGetFreeSpace(page) < insertstate->itemsz)
+			{
+				_bt_dedup_one_page(rel, insertstate->buf, heapRel,
+								   insertstate->itup, insertstate->itemsz,
+								   checkingunique);
+				insertstate->bounds_valid = false;
+			}
 		}
 	}
 	else
@@ -834,7 +935,30 @@ _bt_findinsertloc(Relation rel,
 	Assert(P_RIGHTMOST(lpageop) ||
 		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
 
-	return _bt_binsrch_insert(rel, insertstate);
+	newitemoff = _bt_binsrch_insert(rel, insertstate);
+
+	if (insertstate->postingoff == -1)
+	{
+		/*
+		 * There is an overlapping posting list tuple with its LP_DEAD bit
+		 * set.  We don't want to unnecessarily unset its LP_DEAD bit while
+		 * performing a posting list split, so delete all LP_DEAD items early.
+		 * Note that this is the only case where LP_DEAD deletes happen even
+		 * though there is space for newitem on the page.
+		 */
+		_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+
+		/*
+		 * Do new binary search.  New insert location cannot overlap with any
+		 * posting list now.
+		 */
+		insertstate->bounds_valid = false;
+		insertstate->postingoff = 0;
+		newitemoff = _bt_binsrch_insert(rel, insertstate);
+		Assert(insertstate->postingoff == 0);
+	}
+
+	return newitemoff;
 }
 
 /*
@@ -900,10 +1024,12 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		This recursive procedure does the following things:
  *
+ *			+  if postingoff != 0, splits existing posting list tuple
+ *			   (since it overlaps with new 'itup' tuple).
  *			+  if necessary, splits the target page, using 'itup_key' for
  *			   suffix truncation on leaf pages (caller passes NULL for
  *			   non-leaf pages).
- *			+  inserts the tuple.
+ *			+  inserts the new tuple (might be split from posting list).
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
  *			   right using information stored in the parent stack).
@@ -931,11 +1057,15 @@ _bt_insertonpg(Relation rel,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
+			   int postingoff,
 			   bool split_only_page)
 {
 	Page		page;
 	BTPageOpaque lpageop;
 	Size		itemsz;
+	IndexTuple	oposting;
+	IndexTuple	origitup = NULL;
+	IndexTuple	nposting = NULL;
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -949,6 +1079,7 @@ _bt_insertonpg(Relation rel,
 	Assert(P_ISLEAF(lpageop) ||
 		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
+	Assert(!BTreeTupleIsPosting(itup));
 
 	/* The caller should've finished any incomplete splits already. */
 	if (P_INCOMPLETE_SPLIT(lpageop))
@@ -959,6 +1090,34 @@ _bt_insertonpg(Relation rel,
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
+	/*
+	 * Do we need to split an existing posting list item?
+	 */
+	if (postingoff != 0)
+	{
+		ItemId		itemid = PageGetItemId(page, newitemoff);
+
+		/*
+		 * The new tuple is a duplicate with a heap TID that falls inside the
+		 * range of an existing posting list tuple on a leaf page.  Prepare to
+		 * split an existing posting list.  Overwriting the posting list with
+		 * its post-split version is treated as an extra step in either the
+		 * insert or page split critical section.
+		 */
+		Assert(P_ISLEAF(lpageop) && !ItemIdIsDead(itemid));
+		Assert(itup_key->heapkeyspace && itup_key->allequalimage);
+		oposting = (IndexTuple) PageGetItem(page, itemid);
+
+		/* use a mutable copy of itup as our itup from here on */
+		origitup = itup;
+		itup = CopyIndexTuple(origitup);
+		nposting = _bt_swap_posting(itup, oposting, postingoff);
+		/* itup now contains rightmost/max TID from oposting */
+
+		/* Alter offset so that newitem goes after posting list */
+		newitemoff = OffsetNumberNext(newitemoff);
+	}
+
 	/*
 	 * Do we need to split the page to fit the item on it?
 	 *
@@ -991,7 +1150,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+						 origitup, nposting, postingoff);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1066,6 +1226,9 @@ _bt_insertonpg(Relation rel,
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
+		if (postingoff != 0)
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
 		if (!_bt_pgaddtup(page, itemsz, itup, newitemoff))
 			elog(PANIC, "failed to add new item to block %u in index \"%s\"",
 				 itup_blkno, RelationGetRelationName(rel));
@@ -1115,8 +1278,19 @@ _bt_insertonpg(Relation rel,
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
 
-			if (P_ISLEAF(lpageop))
+			if (P_ISLEAF(lpageop) && postingoff == 0)
+			{
+				/* Simple leaf insert */
 				xlinfo = XLOG_BTREE_INSERT_LEAF;
+			}
+			else if (postingoff != 0)
+			{
+				/*
+				 * Leaf insert with posting list split.  Must include
+				 * postingoff field before newitem/orignewitem.
+				 */
+				xlinfo = XLOG_BTREE_INSERT_POST;
+			}
 			else
 			{
 				/*
@@ -1139,6 +1313,7 @@ _bt_insertonpg(Relation rel,
 				xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 				xlmeta.last_cleanup_num_heap_tuples =
 					metad->btm_last_cleanup_num_heap_tuples;
+				xlmeta.allequalimage = metad->btm_allequalimage;
 
 				XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 				XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));
@@ -1147,7 +1322,27 @@ _bt_insertonpg(Relation rel,
 			}
 
 			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
-			XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+			if (postingoff == 0)
+			{
+				/* Simple, common case -- log itup from caller */
+				XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+			}
+			else
+			{
+				/*
+				 * Insert with posting list split (XLOG_BTREE_INSERT_POST
+				 * record) case.
+				 *
+				 * Log postingoff.  Also log origitup, not itup.  REDO routine
+				 * must reconstruct final itup (as well as nposting) using
+				 * _bt_swap_posting().
+				 */
+				uint16		upostingoff = postingoff;
+
+				XLogRegisterBufData(0, (char *) &upostingoff, sizeof(uint16));
+				XLogRegisterBufData(0, (char *) origitup,
+									IndexTupleSize(origitup));
+			}
 
 			recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
@@ -1189,6 +1384,14 @@ _bt_insertonpg(Relation rel,
 			_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
 			RelationSetTargetBlock(rel, cachedBlock);
 	}
+
+	/* be tidy */
+	if (postingoff != 0)
+	{
+		/* itup is actually a modified copy of caller's original */
+		pfree(nposting);
+		pfree(itup);
+	}
 }
 
 /*
@@ -1204,12 +1407,24 @@ _bt_insertonpg(Relation rel,
  *		This function will clear the INCOMPLETE_SPLIT flag on it, and
  *		release the buffer.
  *
+ *		orignewitem, nposting, and postingoff are needed when an insert of
+ *		orignewitem results in both a posting list split and a page split.
+ *		These extra posting list split details are used here in the same
+ *		way as they are used in the more common case where a posting list
+ *		split does not coincide with a page split.  We need to deal with
+ *		posting list splits directly in order to ensure that everything
+ *		that follows from the insert of orignewitem is handled as a single
+ *		atomic operation (though caller's insert of a new pivot/downlink
+ *		into parent page will still be a separate operation).  See
+ *		nbtree/README for details on the design of posting list splits.
+ *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
  */
 static Buffer
 _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+		  IndexTuple orignewitem, IndexTuple nposting, uint16 postingoff)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1229,6 +1444,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	OffsetNumber leftoff,
 				rightoff;
 	OffsetNumber firstright;
+	OffsetNumber origpagepostingoff;
 	OffsetNumber maxoff;
 	OffsetNumber i;
 	bool		newitemonleft,
@@ -1298,6 +1514,34 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	PageSetLSN(leftpage, PageGetLSN(origpage));
 	isleaf = P_ISLEAF(oopaque);
 
+	/*
+	 * Determine page offset number of existing overlapped-with-orignewitem
+	 * posting list when it is necessary to perform a posting list split in
+	 * passing.  Note that newitem was already changed by caller (newitem no
+	 * longer has the orignewitem TID).
+	 *
+	 * This page offset number (origpagepostingoff) will be used to pretend
+	 * that the posting split has already taken place, even though the
+	 * required modifications to origpage won't occur until we reach the
+	 * critical section.  The lastleft and firstright tuples of our page split
+	 * point should, in effect, come from an imaginary version of origpage
+	 * that has the nposting tuple instead of the original posting list tuple.
+	 *
+	 * Note: _bt_findsplitloc() should have compensated for coinciding posting
+	 * list splits in just the same way, at least in theory.  It doesn't
+	 * bother with that, though.  In practice it won't affect its choice of
+	 * split point.
+	 */
+	origpagepostingoff = InvalidOffsetNumber;
+	if (postingoff != 0)
+	{
+		Assert(isleaf);
+		Assert(ItemPointerCompare(&orignewitem->t_tid,
+								  &newitem->t_tid) < 0);
+		Assert(BTreeTupleIsPosting(nposting));
+		origpagepostingoff = OffsetNumberPrev(newitemoff);
+	}
+
 	/*
 	 * The "high key" for the new left page will be the first key that's going
 	 * to go into the new right page, or a truncated version if this is a leaf
@@ -1335,6 +1579,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, firstright);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		if (firstright == origpagepostingoff)
+			item = nposting;
 	}
 
 	/*
@@ -1368,6 +1614,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
 			itemid = PageGetItemId(origpage, lastleftoff);
 			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+			if (lastleftoff == origpagepostingoff)
+				lastleft = nposting;
 		}
 
 		Assert(lastleft != item);
@@ -1383,6 +1631,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	 */
 	leftoff = P_HIKEY;
 
+	Assert(BTreeTupleIsPivot(lefthikey) || !itup_key->heapkeyspace);
 	Assert(BTreeTupleGetNAtts(lefthikey, rel) > 0);
 	Assert(BTreeTupleGetNAtts(lefthikey, rel) <= indnkeyatts);
 	if (PageAddItem(leftpage, (Item) lefthikey, itemsz, leftoff,
@@ -1447,6 +1696,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		Assert(BTreeTupleIsPivot(item) || !itup_key->heapkeyspace);
 		Assert(BTreeTupleGetNAtts(item, rel) > 0);
 		Assert(BTreeTupleGetNAtts(item, rel) <= indnkeyatts);
 		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
@@ -1475,8 +1725,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 
+		/* replace original item with nposting due to posting split? */
+		if (i == origpagepostingoff)
+		{
+			Assert(BTreeTupleIsPosting(item));
+			Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+			item = nposting;
+		}
+
 		/* does new item belong before this one? */
-		if (i == newitemoff)
+		else if (i == newitemoff)
 		{
 			if (newitemonleft)
 			{
@@ -1645,8 +1903,12 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		XLogRecPtr	recptr;
 
 		xlrec.level = ropaque->btpo.level;
+		/* See comments below on newitem, orignewitem, and posting lists */
 		xlrec.firstright = firstright;
 		xlrec.newitemoff = newitemoff;
+		xlrec.postingoff = 0;
+		if (postingoff != 0 && origpagepostingoff < firstright)
+			xlrec.postingoff = postingoff;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1665,11 +1927,35 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		 * because it's included with all the other items on the right page.)
 		 * Show the new item as belonging to the left page buffer, so that it
 		 * is not stored if XLogInsert decides it needs a full-page image of
-		 * the left page.  We store the offset anyway, though, to support
-		 * archive compression of these records.
+		 * the left page.  We always store newitemoff in the record, though.
+		 *
+		 * The details are sometimes slightly different for page splits that
+		 * coincide with a posting list split.  If both the replacement
+		 * posting list and newitem go on the right page, then we don't need
+		 * to log anything extra, just like the simple !newitemonleft
+		 * no-posting-split case (postingoff is set to zero in the WAL record,
+		 * so recovery doesn't need to process a posting list split at all).
+		 * Otherwise, we set postingoff and log orignewitem instead of
+		 * newitem, despite having actually inserted newitem.  REDO routine
+		 * must reconstruct nposting and newitem using _bt_swap_posting().
+		 *
+		 * Note: It's possible that our page split point is the point that
+		 * makes the posting list lastleft and newitem firstright.  This is
+		 * the only case where we log orignewitem/newitem despite newitem
+		 * going on the right page.  If XLogInsert decides that it can omit
+		 * orignewitem due to logging a full-page image of the left page,
+		 * everything still works out, since recovery only needs to log
+		 * orignewitem for items on the left page (just like the regular
+		 * newitem-logged case).
 		 */
-		if (newitemonleft)
+		if (newitemonleft && xlrec.postingoff == 0)
 			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+		else if (xlrec.postingoff != 0)
+		{
+			Assert(newitemonleft || firstright == newitemoff);
+			Assert(MAXALIGN(newitemsz) == IndexTupleSize(orignewitem));
+			XLogRegisterBufData(0, (char *) orignewitem, MAXALIGN(newitemsz));
+		}
 
 		/* Log the left page's new high key */
 		itemid = PageGetItemId(origpage, P_HIKEY);
@@ -1829,7 +2115,7 @@ _bt_insert_parent(Relation rel,
 
 		/* Recursively insert into the parent */
 		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
-					   new_item, stack->bts_offset + 1,
+					   new_item, stack->bts_offset + 1, 0,
 					   is_only);
 
 		/* be tidy */
@@ -2185,6 +2471,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 		md.fastlevel = metad->btm_level;
 		md.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 		md.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+		md.allequalimage = metad->btm_allequalimage;
 
 		XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -2265,7 +2552,7 @@ _bt_pgaddtup(Page page,
 static void
 _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 {
-	OffsetNumber deletable[MaxOffsetNumber];
+	OffsetNumber deletable[MaxIndexTuplesPerPage];
 	int			ndeletable = 0;
 	OffsetNumber offnum,
 				minoff,
@@ -2298,6 +2585,6 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * Note: if we didn't find any LP_DEAD items, then the page's
 	 * BTP_HAS_GARBAGE hint bit is falsely set.  We do not bother expending a
 	 * separate write to clear it, however.  We will clear it when we split
-	 * the page.
+	 * the page, or when deduplication runs.
 	 */
 }
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index f05cbe7467..529eed027b 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -37,6 +38,8 @@ static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
 static bool _bt_mark_page_halfdead(Relation rel, Buffer buf, BTStack stack);
 static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
 									 bool *rightsib_empty);
+static TransactionId _bt_xid_horizon(Relation rel, Relation heapRel, Page page,
+									 OffsetNumber *deletable, int ndeletable);
 static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
 								   BTStack stack, Buffer *topparent, OffsetNumber *topoff,
 								   BlockNumber *target, BlockNumber *rightsib);
@@ -47,7 +50,8 @@ static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
  *	_bt_initmetapage() -- Fill a page buffer with a correct metapage image
  */
 void
-_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
+_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+				 bool allequalimage)
 {
 	BTMetaPageData *metad;
 	BTPageOpaque metaopaque;
@@ -63,6 +67,7 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
 	metad->btm_fastlevel = level;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+	metad->btm_allequalimage = allequalimage;
 
 	metaopaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	metaopaque->btpo_flags = BTP_META;
@@ -102,6 +107,9 @@ _bt_upgrademetapage(Page page)
 	metad->btm_version = BTREE_NOVAC_VERSION;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+	/* Only a REINDEX can set this field */
+	Assert(!metad->btm_allequalimage);
+	metad->btm_allequalimage = false;
 
 	/* Adjust pd_lower (see _bt_initmetapage() for details) */
 	((PageHeader) page)->pd_lower =
@@ -213,6 +221,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 		md.fastlevel = metad->btm_fastlevel;
 		md.oldest_btpo_xact = oldestBtpoXact;
 		md.last_cleanup_num_heap_tuples = numHeapTuples;
+		md.allequalimage = metad->btm_allequalimage;
 
 		XLogRegisterBufData(0, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -274,6 +283,8 @@ _bt_getroot(Relation rel, int access)
 		Assert(metad->btm_magic == BTREE_MAGIC);
 		Assert(metad->btm_version >= BTREE_MIN_VERSION);
 		Assert(metad->btm_version <= BTREE_VERSION);
+		Assert(!metad->btm_allequalimage ||
+			   metad->btm_version > BTREE_NOVAC_VERSION);
 		Assert(metad->btm_root != P_NONE);
 
 		rootblkno = metad->btm_fastroot;
@@ -394,6 +405,7 @@ _bt_getroot(Relation rel, int access)
 			md.fastlevel = 0;
 			md.oldest_btpo_xact = InvalidTransactionId;
 			md.last_cleanup_num_heap_tuples = -1.0;
+			md.allequalimage = metad->btm_allequalimage;
 
 			XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -618,22 +630,34 @@ _bt_getrootheight(Relation rel)
 	Assert(metad->btm_magic == BTREE_MAGIC);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
 	Assert(metad->btm_version <= BTREE_VERSION);
+	Assert(!metad->btm_allequalimage ||
+		   metad->btm_version > BTREE_NOVAC_VERSION);
 	Assert(metad->btm_fastroot != P_NONE);
 
 	return metad->btm_fastlevel;
 }
 
 /*
- *	_bt_heapkeyspace() -- is heap TID being treated as a key?
+ *	_bt_metaversion() -- Get version/status info from metapage.
+ *
+ *		Sets caller's *heapkeyspace and *allequalimage arguments using data
+ *		from the B-Tree metapage (could be locally-cached version).  This
+ *		information needs to be stashed in insertion scankey, so we provide a
+ *		single function that fetches both at once.
  *
  *		This is used to determine the rules that must be used to descend a
  *		btree.  Version 4 indexes treat heap TID as a tiebreaker attribute.
  *		pg_upgrade'd version 3 indexes need extra steps to preserve reasonable
  *		performance when inserting a new BTScanInsert-wise duplicate tuple
  *		among many leaf pages already full of such duplicates.
+ *
+ *		Also sets allequalimage field, which indicates whether or not it is
+ *		safe to apply deduplication.  We rely on the assumption that
+ *		btm_allequalimage will be zero'ed on heapkeyspace indexes that were
+ *		pg_upgrade'd from Postgres 12.
  */
-bool
-_bt_heapkeyspace(Relation rel)
+void
+_bt_metaversion(Relation rel, bool *heapkeyspace, bool *allequalimage)
 {
 	BTMetaPageData *metad;
 
@@ -651,10 +675,11 @@ _bt_heapkeyspace(Relation rel)
 		 */
 		if (metad->btm_root == P_NONE)
 		{
-			uint32		btm_version = metad->btm_version;
+			*heapkeyspace = metad->btm_version > BTREE_NOVAC_VERSION;
+			*allequalimage = metad->btm_allequalimage;
 
 			_bt_relbuf(rel, metabuf);
-			return btm_version > BTREE_NOVAC_VERSION;
+			return;
 		}
 
 		/*
@@ -678,9 +703,12 @@ _bt_heapkeyspace(Relation rel)
 	Assert(metad->btm_magic == BTREE_MAGIC);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
 	Assert(metad->btm_version <= BTREE_VERSION);
+	Assert(!metad->btm_allequalimage ||
+		   metad->btm_version > BTREE_NOVAC_VERSION);
 	Assert(metad->btm_fastroot != P_NONE);
 
-	return metad->btm_version > BTREE_NOVAC_VERSION;
+	*heapkeyspace = metad->btm_version > BTREE_NOVAC_VERSION;
+	*allequalimage = metad->btm_allequalimage;
 }
 
 /*
@@ -964,28 +992,106 @@ _bt_page_recyclable(Page page)
  * Delete item(s) from a btree leaf page during VACUUM.
  *
  * This routine assumes that the caller has a super-exclusive write lock on
- * the buffer.  Also, the given deletable array *must* be sorted in ascending
- * order.
+ * the buffer.  Also, the given deletable and updatable arrays *must* be
+ * sorted in ascending order.
+ *
+ * Routine deals with deleting TIDs when some (but not all) of the heap TIDs
+ * in an existing posting list item are to be removed by VACUUM.  This works
+ * by updating/overwriting an existing item with caller's new version of the
+ * item (a version that lacks the TIDs that are to be deleted).
  *
  * We record VACUUMs and b-tree deletes differently in WAL.  Deletes must
  * generate their own latestRemovedXid by accessing the heap directly, whereas
- * VACUUMs rely on the initial heap scan taking care of it indirectly.
+ * VACUUMs rely on the initial heap scan taking care of it indirectly.  Also,
+ * only VACUUM can perform granular deletes of individual TIDs in posting list
+ * tuples.
  */
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
-					OffsetNumber *deletable, int ndeletable)
+					OffsetNumber *deletable, int ndeletable,
+					BTVacuumPosting *updatable, int nupdatable)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	Size		itemsz;
+	char	   *updatedbuf = NULL;
+	Size		updatedbuflen = 0;
+	OffsetNumber updatedoffsets[MaxIndexTuplesPerPage];
 
 	/* Shouldn't be called unless there's something to do */
-	Assert(ndeletable > 0);
+	Assert(ndeletable > 0 || nupdatable > 0);
+
+	for (int i = 0; i < nupdatable; i++)
+	{
+		/* Replace work area IndexTuple with updated version */
+		_bt_update_posting(updatable[i]);
+
+		/* Maintain array of updatable page offsets for WAL record */
+		updatedoffsets[i] = updatable[i]->updatedoffset;
+	}
+
+	/* XLOG stuff -- allocate and fill buffer before critical section */
+	if (nupdatable > 0 && RelationNeedsWAL(rel))
+	{
+		Size		offset = 0;
+
+		for (int i = 0; i < nupdatable; i++)
+		{
+			BTVacuumPosting vacposting = updatable[i];
+
+			itemsz = SizeOfBtreeUpdate +
+				vacposting->ndeletedtids * sizeof(uint16);
+			updatedbuflen += itemsz;
+		}
+
+		updatedbuf = palloc(updatedbuflen);
+		for (int i = 0; i < nupdatable; i++)
+		{
+			BTVacuumPosting vacposting = updatable[i];
+			xl_btree_update update;
+
+			update.ndeletedtids = vacposting->ndeletedtids;
+			memcpy(updatedbuf + offset, &update.ndeletedtids,
+				   SizeOfBtreeUpdate);
+			offset += SizeOfBtreeUpdate;
+
+			itemsz = update.ndeletedtids * sizeof(uint16);
+			memcpy(updatedbuf + offset, vacposting->deletetids, itemsz);
+			offset += itemsz;
+		}
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
-	/* Fix the page */
-	PageIndexMultiDelete(page, deletable, ndeletable);
+	/*
+	 * Handle posting tuple updates.
+	 *
+	 * Deliberately do this before handling simple deletes.  If we did it the
+	 * other way around (i.e. WAL record order -- simple deletes before
+	 * updates) then we'd have to make compensating changes to the 'updatable'
+	 * array of offset numbers.
+	 *
+	 * PageIndexTupleOverwrite() won't unset each item's LP_DEAD bit when it
+	 * happens to already be set.  Although we unset the BTP_HAS_GARBAGE page
+	 * level flag, unsetting individual LP_DEAD bits should still be avoided.
+	 */
+	for (int i = 0; i < nupdatable; i++)
+	{
+		OffsetNumber updatedoffset = updatedoffsets[i];
+		IndexTuple	itup;
+
+		itup = updatable[i]->itup;
+		itemsz = MAXALIGN(IndexTupleSize(itup));
+		if (!PageIndexTupleOverwrite(page, updatedoffset, (Item) itup,
+									 itemsz))
+			elog(PANIC, "could not update partially dead item in block %u of index \"%s\"",
+				 BufferGetBlockNumber(buf), RelationGetRelationName(rel));
+	}
+
+	/* Now handle simple deletes of entire tuples */
+	if (ndeletable > 0)
+		PageIndexMultiDelete(page, deletable, ndeletable);
 
 	/*
 	 * We can clear the vacuum cycle ID since this page has certainly been
@@ -1006,7 +1112,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	 * limited, since we never falsely unset an LP_DEAD bit.  Workloads that
 	 * are particularly dependent on LP_DEAD bits being set quickly will
 	 * usually manage to set the BTP_HAS_GARBAGE flag before the page fills up
-	 * again anyway.
+	 * again anyway.  Furthermore, attempting a deduplication pass will remove
+	 * all LP_DEAD items, regardless of whether the BTP_HAS_GARBAGE hint bit
+	 * is set or not.
 	 */
 	opaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 
@@ -1019,18 +1127,22 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.ndeleted = ndeletable;
+		xlrec_vacuum.nupdated = nupdatable;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
 		XLogRegisterData((char *) &xlrec_vacuum, SizeOfBtreeVacuum);
 
-		/*
-		 * The deletable array is not in the buffer, but pretend that it is.
-		 * When XLogInsert stores the whole buffer, the array need not be
-		 * stored too.
-		 */
-		XLogRegisterBufData(0, (char *) deletable,
-							ndeletable * sizeof(OffsetNumber));
+		if (ndeletable > 0)
+			XLogRegisterBufData(0, (char *) deletable,
+								ndeletable * sizeof(OffsetNumber));
+
+		if (nupdatable > 0)
+		{
+			XLogRegisterBufData(0, (char *) updatedoffsets,
+								nupdatable * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, updatedbuf, updatedbuflen);
+		}
 
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
@@ -1038,6 +1150,13 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	}
 
 	END_CRIT_SECTION();
+
+	/* can't leak memory here */
+	if (updatedbuf != NULL)
+		pfree(updatedbuf);
+	/* free tuples generated by calling _bt_update_posting() */
+	for (int i = 0; i < nupdatable; i++)
+		pfree(updatable[i]->itup);
 }
 
 /*
@@ -1050,6 +1169,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
  * This is nearly the same as _bt_delitems_vacuum as far as what it does to
  * the page, but it needs to generate its own latestRemovedXid by accessing
  * the heap.  This is used by the REDO routine to generate recovery conflicts.
+ * Also, it doesn't handle posting list tuples unless the entire tuple can be
+ * deleted as a whole (since there is only one LP_DEAD bit per line pointer).
  */
 void
 _bt_delitems_delete(Relation rel, Buffer buf,
@@ -1065,8 +1186,7 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 
 	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
 		latestRemovedXid =
-			index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
-												 deletable, ndeletable);
+			_bt_xid_horizon(rel, heapRel, page, deletable, ndeletable);
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
@@ -1113,6 +1233,83 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 	END_CRIT_SECTION();
 }
 
+/*
+ * Get the latestRemovedXid from the table entries pointed to by the non-pivot
+ * tuples being deleted.
+ *
+ * This is a specialized version of index_compute_xid_horizon_for_tuples().
+ * It's needed because btree tuples don't always store table TID using the
+ * standard index tuple header field.
+ */
+static TransactionId
+_bt_xid_horizon(Relation rel, Relation heapRel, Page page,
+				OffsetNumber *deletable, int ndeletable)
+{
+	TransactionId latestRemovedXid = InvalidTransactionId;
+	int			spacenhtids;
+	int			nhtids;
+	ItemPointer htids;
+
+	/* Array will grow iff there are posting list tuples to consider */
+	spacenhtids = ndeletable;
+	nhtids = 0;
+	htids = (ItemPointer) palloc(sizeof(ItemPointerData) * spacenhtids);
+	for (int i = 0; i < ndeletable; i++)
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+
+		itemid = PageGetItemId(page, deletable[i]);
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(ItemIdIsDead(itemid));
+		Assert(!BTreeTupleIsPivot(itup));
+
+		if (!BTreeTupleIsPosting(itup))
+		{
+			if (nhtids + 1 > spacenhtids)
+			{
+				spacenhtids *= 2;
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * spacenhtids);
+			}
+
+			Assert(ItemPointerIsValid(&itup->t_tid));
+			ItemPointerCopy(&itup->t_tid, &htids[nhtids]);
+			nhtids++;
+		}
+		else
+		{
+			int			nposting = BTreeTupleGetNPosting(itup);
+
+			if (nhtids + nposting > spacenhtids)
+			{
+				spacenhtids = Max(spacenhtids * 2, nhtids + nposting);
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * spacenhtids);
+			}
+
+			for (int j = 0; j < nposting; j++)
+			{
+				ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+				Assert(ItemPointerIsValid(htid));
+				ItemPointerCopy(htid, &htids[nhtids]);
+				nhtids++;
+			}
+		}
+	}
+
+	Assert(nhtids >= ndeletable);
+
+	latestRemovedXid =
+		table_compute_xid_horizon_for_tuples(heapRel, htids, nhtids);
+
+	pfree(htids);
+
+	return latestRemovedXid;
+}
+
 /*
  * Returns true, if the given block has the half-dead flag set.
  */
@@ -2058,6 +2255,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 			xlmeta.fastlevel = metad->btm_fastlevel;
 			xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 			xlmeta.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+			xlmeta.allequalimage = metad->btm_allequalimage;
 
 			XLogRegisterBufData(4, (char *) &xlmeta, sizeof(xl_btree_metadata));
 			xlinfo = XLOG_BTREE_UNLINK_PAGE_META;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 5254bc7ef5..4bb16297c3 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -95,6 +95,10 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
+										  IndexTuple posting,
+										  OffsetNumber updatedoffset,
+										  int *nremaining);
 
 
 /*
@@ -161,7 +165,7 @@ btbuildempty(Relation index)
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, P_NONE, 0);
+	_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
 
 	/*
 	 * Write the page and log it.  It might seem that an immediate sync would
@@ -264,8 +268,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 				 */
 				if (so->killedItems == NULL)
 					so->killedItems = (int *)
-						palloc(MaxIndexTuplesPerPage * sizeof(int));
-				if (so->numKilled < MaxIndexTuplesPerPage)
+						palloc(MaxTIDsPerBTreePage * sizeof(int));
+				if (so->numKilled < MaxTIDsPerBTreePage)
 					so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 			}
 
@@ -1154,11 +1158,15 @@ restart:
 	}
 	else if (P_ISLEAF(opaque))
 	{
-		OffsetNumber deletable[MaxOffsetNumber];
+		OffsetNumber deletable[MaxIndexTuplesPerPage];
 		int			ndeletable;
+		BTVacuumPosting updatable[MaxIndexTuplesPerPage];
+		int			nupdatable;
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
+		int			nhtidsdead,
+					nhtidslive;
 
 		/*
 		 * Trade in the initial read lock for a super-exclusive write lock on
@@ -1190,8 +1198,11 @@ restart:
 		 * point using callback.
 		 */
 		ndeletable = 0;
+		nupdatable = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
+		nhtidsdead = 0;
+		nhtidslive = 0;
 		if (callback)
 		{
 			for (offnum = minoff;
@@ -1199,11 +1210,9 @@ restart:
 				 offnum = OffsetNumberNext(offnum))
 			{
 				IndexTuple	itup;
-				ItemPointer htup;
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
 				/*
 				 * Hot Standby assumes that it's okay that XLOG_BTREE_VACUUM
@@ -1226,22 +1235,82 @@ restart:
 				 * simple, and allows us to always avoid generating our own
 				 * conflicts.
 				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				Assert(!BTreeTupleIsPivot(itup));
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Regular tuple, standard table TID representation */
+					if (callback(&itup->t_tid, callback_state))
+					{
+						deletable[ndeletable++] = offnum;
+						nhtidsdead++;
+					}
+					else
+						nhtidslive++;
+				}
+				else
+				{
+					BTVacuumPosting vacposting;
+					int			nremaining;
+
+					/* Posting list tuple */
+					vacposting = btreevacuumposting(vstate, itup, offnum,
+													&nremaining);
+					if (vacposting == NULL)
+					{
+						/*
+						 * All table TIDs from the posting tuple remain, so no
+						 * delete or update required
+						 */
+						Assert(nremaining == BTreeTupleGetNPosting(itup));
+					}
+					else if (nremaining > 0)
+					{
+
+						/*
+						 * Store metadata about posting list tuple in
+						 * updatable array for entire page.  Existing tuple
+						 * will be updated during the later call to
+						 * _bt_delitems_vacuum().
+						 */
+						Assert(nremaining < BTreeTupleGetNPosting(itup));
+						updatable[nupdatable++] = vacposting;
+						nhtidsdead += BTreeTupleGetNPosting(itup) - nremaining;
+					}
+					else
+					{
+						/*
+						 * All table TIDs from the posting list must be
+						 * deleted.  We'll delete the index tuple completely
+						 * (no update required).
+						 */
+						Assert(nremaining == 0);
+						deletable[ndeletable++] = offnum;
+						nhtidsdead += BTreeTupleGetNPosting(itup);
+						pfree(vacposting);
+					}
+
+					nhtidslive += nremaining;
+				}
 			}
 		}
 
 		/*
-		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
-		 * call per page, so as to minimize WAL traffic.
+		 * Apply any needed deletes or updates.  We issue just one
+		 * _bt_delitems_vacuum() call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nupdatable > 0)
 		{
-			_bt_delitems_vacuum(rel, buf, deletable, ndeletable);
+			Assert(nhtidsdead >= Max(ndeletable, 1));
+			_bt_delitems_vacuum(rel, buf, deletable, ndeletable, updatable,
+								nupdatable);
 
-			stats->tuples_removed += ndeletable;
+			stats->tuples_removed += nhtidsdead;
 			/* must recompute maxoff */
 			maxoff = PageGetMaxOffsetNumber(page);
+
+			/* can't leak memory here */
+			for (int i = 0; i < nupdatable; i++)
+				pfree(updatable[i]);
 		}
 		else
 		{
@@ -1254,6 +1323,7 @@ restart:
 			 * We treat this like a hint-bit update because there's no need to
 			 * WAL-log it.
 			 */
+			Assert(nhtidsdead == 0);
 			if (vstate->cycleid != 0 &&
 				opaque->btpo_cycleid == vstate->cycleid)
 			{
@@ -1263,15 +1333,18 @@ restart:
 		}
 
 		/*
-		 * If it's now empty, try to delete; else count the live tuples. We
-		 * don't delete when recursing, though, to avoid putting entries into
-		 * freePages out-of-order (doesn't seem worth any extra code to handle
-		 * the case).
+		 * If it's now empty, try to delete; else count the live tuples (live
+		 * table TIDs in posting lists are counted as separate live tuples).
+		 * We don't delete when recursing, though, to avoid putting entries
+		 * into freePages out-of-order (doesn't seem worth any extra code to
+		 * handle the case).
 		 */
 		if (minoff > maxoff)
 			delete_now = (blkno == orig_blkno);
 		else
-			stats->num_index_tuples += maxoff - minoff + 1;
+			stats->num_index_tuples += nhtidslive;
+
+		Assert(!delete_now || nhtidslive == 0);
 	}
 
 	if (delete_now)
@@ -1303,9 +1376,10 @@ restart:
 	/*
 	 * This is really tail recursion, but if the compiler is too stupid to
 	 * optimize it as such, we'd eat an uncomfortably large amount of stack
-	 * space per recursion level (due to the deletable[] array). A failure is
-	 * improbable since the number of levels isn't likely to be large ... but
-	 * just in case, let's hand-optimize into a loop.
+	 * space per recursion level (due to the arrays used to track details of
+	 * deletable/updatable items).  A failure is improbable since the number
+	 * of levels isn't likely to be large ...  but just in case, let's
+	 * hand-optimize into a loop.
 	 */
 	if (recurse_to != P_NONE)
 	{
@@ -1314,6 +1388,61 @@ restart:
 	}
 }
 
+/*
+ * btreevacuumposting --- determine TIDs still needed in posting list
+ *
+ * Returns metadata describing how to build replacement tuple without the TIDs
+ * that VACUUM needs to delete.  Returned value is NULL in the common case
+ * where no changes are needed to caller's posting list tuple (we avoid
+ * allocating memory here as an optimization).
+ *
+ * The number of TIDs that should remain in the posting list tuple is set for
+ * caller in *nremaining.
+ */
+static BTVacuumPosting
+btreevacuumposting(BTVacState *vstate, IndexTuple posting,
+				   OffsetNumber updatedoffset, int *nremaining)
+{
+	int			live = 0;
+	int			nitem = BTreeTupleGetNPosting(posting);
+	ItemPointer items = BTreeTupleGetPosting(posting);
+	BTVacuumPosting vacposting = NULL;
+
+	for (int i = 0; i < nitem; i++)
+	{
+		if (!vstate->callback(items + i, vstate->callback_state))
+		{
+			/* Live table TID */
+			live++;
+		}
+		else if (vacposting == NULL)
+		{
+			/*
+			 * First dead table TID encountered.
+			 *
+			 * It's now clear that we need to delete one or more dead table
+			 * TIDs, so start maintaining metadata describing how to update
+			 * existing posting list tuple.
+			 */
+			vacposting = palloc(offsetof(BTVacuumPostingData, deletetids) +
+								nitem * sizeof(uint16));
+
+			vacposting->itup = posting;
+			vacposting->updatedoffset = updatedoffset;
+			vacposting->ndeletedtids = 0;
+			vacposting->deletetids[vacposting->ndeletedtids++] = i;
+		}
+		else
+		{
+			/* Second or subsequent dead table TID */
+			vacposting->deletetids[vacposting->ndeletedtids++] = i;
+		}
+	}
+
+	*nremaining = live;
+	return vacposting;
+}
+
 /*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c573814f01..e743cfb3a2 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int	_bt_binsrch_posting(BTScanInsert key, Page page,
+								OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static int	_bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+								  OffsetNumber offnum, ItemPointer heapTid,
+								  IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+									   OffsetNumber offnum,
+									   ItemPointer heapTid, int tupleOffset);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -142,6 +150,7 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
 		offnum = _bt_binsrch(rel, key, *bufP);
 		itemid = PageGetItemId(page, offnum);
 		itup = (IndexTuple) PageGetItem(page, itemid);
+		Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
 		blkno = BTreeTupleGetDownLink(itup);
 		par_blkno = BufferGetBlockNumber(*bufP);
 
@@ -434,7 +443,10 @@ _bt_binsrch(Relation rel,
  * low) makes bounds invalid.
  *
  * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by a posting
+ * list split).
  */
 OffsetNumber
 _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +465,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 	Assert(P_ISLEAF(opaque));
 	Assert(!key->nextkey);
+	Assert(insertstate->postingoff == 0);
 
 	if (!insertstate->bounds_valid)
 	{
@@ -509,6 +522,16 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 			if (result != 0)
 				stricthigh = high;
 		}
+
+		/*
+		 * If tuple at offset located by binary search is a posting list whose
+		 * TID range overlaps with caller's scantid, perform posting list
+		 * binary search to set postingoff for caller.  Caller must split the
+		 * posting list when postingoff is set.  This should happen
+		 * infrequently.
+		 */
+		if (unlikely(result == 0 && key->scantid != NULL))
+			insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
 	}
 
 	/*
@@ -528,6 +551,73 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	return low;
 }
 
+/*----------
+ *	_bt_binsrch_posting() -- posting list binary search.
+ *
+ * Helper routine for _bt_binsrch_insert().
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+	IndexTuple	itup;
+	ItemId		itemid;
+	int			low,
+				high,
+				mid,
+				res;
+
+	/*
+	 * If this isn't a posting tuple, then the index must be corrupt (if it is
+	 * an ordinary non-pivot tuple then there must be an existing tuple with a
+	 * heap TID that equals inserter's new heap TID/scantid).  Defensively
+	 * check that tuple is a posting list tuple whose posting list range
+	 * includes caller's scantid.
+	 *
+	 * (This is also needed because contrib/amcheck's rootdescend option needs
+	 * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+	 */
+	itemid = PageGetItemId(page, offnum);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+	if (!BTreeTupleIsPosting(itup))
+		return 0;
+
+	Assert(key->heapkeyspace && key->allequalimage);
+
+	/*
+	 * In the event that posting list tuple has LP_DEAD bit set, indicate this
+	 * to _bt_binsrch_insert() caller by returning -1, a sentinel value.  A
+	 * second call to _bt_binsrch_insert() can take place when its caller has
+	 * removed the dead item.
+	 */
+	if (ItemIdIsDead(itemid))
+		return -1;
+
+	/* "high" is past end of posting list for loop invariant */
+	low = 0;
+	high = BTreeTupleGetNPosting(itup);
+	Assert(high >= 2);
+
+	while (high > low)
+	{
+		mid = low + ((high - low) / 2);
+		res = ItemPointerCompare(key->scantid,
+								 BTreeTupleGetPostingN(itup, mid));
+
+		if (res > 0)
+			low = mid + 1;
+		else if (res < 0)
+			high = mid;
+		else
+			return mid;
+	}
+
+	/* Exact match not found */
+	return low;
+}
+
 /*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
@@ -537,9 +627,14 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
  *			<0 if scankey < tuple at offnum;
  *			 0 if scankey == tuple at offnum;
  *			>0 if scankey > tuple at offnum.
- *		NULLs in the keys are treated as sortable values.  Therefore
- *		"equality" does not necessarily mean that the item should be
- *		returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values.  Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key.  Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid.  There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
  *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
@@ -563,6 +658,7 @@ _bt_compare(Relation rel,
 	ScanKey		scankey;
 	int			ncmpkey;
 	int			ntupatts;
+	int32		result;
 
 	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +693,6 @@ _bt_compare(Relation rel,
 	{
 		Datum		datum;
 		bool		isNull;
-		int32		result;
 
 		datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
 
@@ -713,8 +808,25 @@ _bt_compare(Relation rel,
 	if (heapTid == NULL)
 		return 1;
 
+	/*
+	 * Scankey must be treated as equal to a posting list tuple if its scantid
+	 * value falls within the range of the posting list.  In all other cases
+	 * there can only be a single heap TID value, which is compared directly
+	 * with scantid.
+	 */
 	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	return ItemPointerCompare(key->scantid, heapTid);
+	result = ItemPointerCompare(key->scantid, heapTid);
+	if (result <= 0 || !BTreeTupleIsPosting(itup))
+		return result;
+	else
+	{
+		result = ItemPointerCompare(key->scantid,
+									BTreeTupleGetMaxHeapTID(itup));
+		if (result > 0)
+			return 1;
+	}
+
+	return 0;
 }
 
 /*
@@ -1229,7 +1341,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	}
 
 	/* Initialize remaining insertion scan key fields */
-	inskey.heapkeyspace = _bt_heapkeyspace(rel);
+	_bt_metaversion(rel, &inskey.heapkeyspace, &inskey.allequalimage);
 	inskey.anynullkeys = false; /* unused */
 	inskey.nextkey = nextkey;
 	inskey.pivotsearch = false;
@@ -1484,9 +1596,35 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
-				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				/* tuple passes all scan key conditions */
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Remember it */
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+				else
+				{
+					int			tupleOffset;
+
+					/*
+					 * Set up state to return posting list, and remember first
+					 * TID
+					 */
+					tupleOffset =
+						_bt_setuppostingitems(so, itemIndex, offnum,
+											  BTreeTupleGetPostingN(itup, 0),
+											  itup);
+					itemIndex++;
+					/* Remember additional TIDs */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											tupleOffset);
+						itemIndex++;
+					}
+				}
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1519,7 +1657,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxTIDsPerBTreePage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1527,7 +1665,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxTIDsPerBTreePage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1568,9 +1706,41 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 										 &continuescan);
 			if (passes_quals && tuple_alive)
 			{
-				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				/* tuple passes all scan key conditions */
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Remember it */
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+				else
+				{
+					int			tupleOffset;
+
+					/*
+					 * Set up state to return posting list, and remember first
+					 * TID.
+					 *
+					 * Note that we deliberately save/return items from
+					 * posting lists in ascending heap TID order for backwards
+					 * scans.  This allows _bt_killitems() to make a
+					 * consistent assumption about the order of items
+					 * associated with the same posting list tuple.
+					 */
+					itemIndex--;
+					tupleOffset =
+						_bt_setuppostingitems(so, itemIndex, offnum,
+											  BTreeTupleGetPostingN(itup, 0),
+											  itup);
+					/* Remember additional TIDs */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						itemIndex--;
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											tupleOffset);
+					}
+				}
 			}
 			if (!continuescan)
 			{
@@ -1584,8 +1754,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
+		so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1768,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPivot(itup) && !BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1610,6 +1782,71 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 	}
 }
 
+/*
+ * Setup state to save TIDs/items from a single posting list tuple.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for TID that is
+ * returned to scan first.  Second or subsequent TIDs for posting list should
+ * be saved by calling _bt_savepostingitem().
+ *
+ * Returns an offset into tuple storage space that main tuple is stored at if
+ * needed.
+ */
+static int
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					  ItemPointer heapTid, IndexTuple itup)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	Assert(BTreeTupleIsPosting(itup));
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+	if (so->currTuples)
+	{
+		/* Save base IndexTuple (truncate posting list) */
+		IndexTuple	base;
+		Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+		itupsz = MAXALIGN(itupsz);
+		currItem->tupleOffset = so->currPos.nextTupleOffset;
+		base = (IndexTuple) (so->currTuples + so->currPos.nextTupleOffset);
+		memcpy(base, itup, itupsz);
+		/* Defensively reduce work area index tuple header size */
+		base->t_info &= ~INDEX_SIZE_MASK;
+		base->t_info |= itupsz;
+		so->currPos.nextTupleOffset += itupsz;
+
+		return currItem->tupleOffset;
+	}
+
+	return 0;
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for current posting
+ * tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.  Caller passes its return value as tupleOffset.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer heapTid, int tupleOffset)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+
+	/*
+	 * Have index-only scans return the same base IndexTuple for every TID
+	 * that originates from the same posting list
+	 */
+	if (so->currTuples)
+		currItem->tupleOffset = tupleOffset;
+}
+
 /*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index baec5de999..e66cd36dfa 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -243,6 +243,7 @@ typedef struct BTPageState
 	BlockNumber btps_blkno;		/* block # to write this page at */
 	IndexTuple	btps_lowkey;	/* page's strict lower bound pivot tuple */
 	OffsetNumber btps_lastoff;	/* last item offset loaded */
+	Size		btps_lastextra; /* last item's extra posting list space */
 	uint32		btps_level;		/* tree level (0 = leaf) */
 	Size		btps_full;		/* "full" if less than this much free space */
 	struct BTPageState *btps_next;	/* link to parent level, if any */
@@ -277,7 +278,10 @@ static void _bt_slideleft(Page page);
 static void _bt_sortaddtup(Page page, Size itemsize,
 						   IndexTuple itup, OffsetNumber itup_off);
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
-						 IndexTuple itup);
+						 IndexTuple itup, Size truncextra);
+static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
+										  BTPageState *state,
+										  BTDedupState dstate);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
@@ -563,6 +567,8 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
 	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
+	/* _bt_mkscankey() won't set allequalimage without metapage */
+	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
 
 	/*
 	 * We need to log index creation in WAL iff WAL archiving/streaming is
@@ -711,6 +717,7 @@ _bt_pagestate(BTWriteState *wstate, uint32 level)
 	state->btps_lowkey = NULL;
 	/* initialize lastoff so first item goes into P_FIRSTKEY */
 	state->btps_lastoff = P_HIKEY;
+	state->btps_lastextra = 0;
 	state->btps_level = level;
 	/* set "full" threshold based on level.  See notes at head of file. */
 	if (level > 0)
@@ -789,7 +796,8 @@ _bt_sortaddtup(Page page,
 }
 
 /*----------
- * Add an item to a disk page from the sort output.
+ * Add an item to a disk page from the sort output (or add a posting list
+ * item formed from the sort output).
  *
  * We must be careful to observe the page layout conventions of nbtsearch.c:
  * - rightmost pages start data items at P_HIKEY instead of at P_FIRSTKEY.
@@ -821,14 +829,27 @@ _bt_sortaddtup(Page page,
  * the truncated high key at offset 1.
  *
  * 'last' pointer indicates the last offset added to the page.
+ *
+ * 'truncextra' is the size of the posting list in itup, if any.  This
+ * information is stashed for the next call here, when we may benefit
+ * from considering the impact of truncating away the posting list on
+ * the page before deciding to finish the page off.  Posting lists are
+ * often relatively large, so it is worth going to the trouble of
+ * accounting for the saving from truncating away the posting list of
+ * the tuple that becomes the high key (that may be the only way to
+ * get close to target free space on the page).  Note that this is
+ * only used for the soft fillfactor-wise limit, not the critical hard
+ * limit.
  *----------
  */
 static void
-_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
+_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
+			 Size truncextra)
 {
 	Page		npage;
 	BlockNumber nblkno;
 	OffsetNumber last_off;
+	Size		last_truncextra;
 	Size		pgspc;
 	Size		itupsz;
 	bool		isleaf;
@@ -842,6 +863,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	npage = state->btps_page;
 	nblkno = state->btps_blkno;
 	last_off = state->btps_lastoff;
+	last_truncextra = state->btps_lastextra;
+	state->btps_lastextra = truncextra;
 
 	pgspc = PageGetFreeSpace(npage);
 	itupsz = IndexTupleSize(itup);
@@ -883,10 +906,10 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	 * page.  Disregard fillfactor and insert on "full" current page if we
 	 * don't have the minimum number of items yet.  (Note that we deliberately
 	 * assume that suffix truncation neither enlarges nor shrinks new high key
-	 * when applying soft limit.)
+	 * when applying soft limit, except when last tuple has a posting list.)
 	 */
 	if (pgspc < itupsz + (isleaf ? MAXALIGN(sizeof(ItemPointerData)) : 0) ||
-		(pgspc < state->btps_full && last_off > P_FIRSTKEY))
+		(pgspc + last_truncextra < state->btps_full && last_off > P_FIRSTKEY))
 	{
 		/*
 		 * Finish off the page and write it out.
@@ -944,11 +967,14 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * We don't try to bias our choice of split point to make it more
 			 * likely that _bt_truncate() can truncate away more attributes,
 			 * whereas the split point used within _bt_split() is chosen much
-			 * more delicately.  Suffix truncation is mostly useful because it
-			 * improves space utilization for workloads with random
-			 * insertions.  It doesn't seem worthwhile to add logic for
-			 * choosing a split point here for a benefit that is bound to be
-			 * much smaller.
+			 * more delicately.  Even still, the lastleft and firstright
+			 * tuples passed to _bt_truncate() here are at least not fully
+			 * equal to each other when deduplication is used, unless there is
+			 * a large group of duplicates (also, unique index builds usually
+			 * have few or no spool2 duplicates).  When the split point is
+			 * between two unequal tuples, _bt_truncate() will avoid including
+			 * a heap TID in the new high key, which is the most important
+			 * benefit of suffix truncation.
 			 *
 			 * Overwrite the old item with new truncated high key directly.
 			 * oitup is already located at the physical beginning of tuple
@@ -983,7 +1009,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(BTreeTupleGetNAtts(state->btps_lowkey, wstate->index) == 0 ||
 			   !P_LEFTMOST((BTPageOpaque) PageGetSpecialPointer(opage)));
 		BTreeTupleSetDownLink(state->btps_lowkey, oblkno);
-		_bt_buildadd(wstate, state->btps_next, state->btps_lowkey);
+		_bt_buildadd(wstate, state->btps_next, state->btps_lowkey, 0);
 		pfree(state->btps_lowkey);
 
 		/*
@@ -1045,6 +1071,43 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	state->btps_lastoff = last_off;
 }
 
+/*
+ * Finalize pending posting list tuple, and add it to the index.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * This is almost like _bt_dedup_finish_pending(), but it adds a new tuple
+ * using _bt_buildadd().
+ */
+static void
+_bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
+							  BTDedupState dstate)
+{
+	Assert(dstate->nitems > 0);
+
+	if (dstate->nitems == 1)
+		_bt_buildadd(wstate, state, dstate->base, 0);
+	else
+	{
+		IndexTuple	postingtuple;
+		Size		truncextra;
+
+		/* form a tuple with a posting list */
+		postingtuple = _bt_form_posting(dstate->base,
+										dstate->htids,
+										dstate->nhtids);
+		/* Calculate posting list overhead */
+		truncextra = IndexTupleSize(postingtuple) -
+			BTreeTupleGetPostingOffset(postingtuple);
+
+		_bt_buildadd(wstate, state, postingtuple, truncextra);
+		pfree(postingtuple);
+	}
+
+	dstate->nhtids = 0;
+	dstate->nitems = 0;
+	dstate->phystupsize = 0;
+}
+
 /*
  * Finish writing out the completed btree.
  */
@@ -1090,7 +1153,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 			Assert(BTreeTupleGetNAtts(s->btps_lowkey, wstate->index) == 0 ||
 				   !P_LEFTMOST(opaque));
 			BTreeTupleSetDownLink(s->btps_lowkey, blkno);
-			_bt_buildadd(wstate, s->btps_next, s->btps_lowkey);
+			_bt_buildadd(wstate, s->btps_next, s->btps_lowkey, 0);
 			pfree(s->btps_lowkey);
 			s->btps_lowkey = NULL;
 		}
@@ -1111,7 +1174,8 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 	 * by filling in a valid magic number in the metapage.
 	 */
 	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, rootblkno, rootlevel);
+	_bt_initmetapage(metapage, rootblkno, rootlevel,
+					 wstate->inskey->allequalimage);
 	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
 }
 
@@ -1132,6 +1196,10 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		deduplicate;
+
+	deduplicate = wstate->inskey->allequalimage &&
+		BTGetDeduplicateItems(wstate->index);
 
 	if (merge)
 	{
@@ -1228,12 +1296,12 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 
 			if (load1)
 			{
-				_bt_buildadd(wstate, state, itup);
+				_bt_buildadd(wstate, state, itup, 0);
 				itup = tuplesort_getindextuple(btspool->sortstate, true);
 			}
 			else
 			{
-				_bt_buildadd(wstate, state, itup2);
+				_bt_buildadd(wstate, state, itup2, 0);
 				itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
 			}
 
@@ -1243,9 +1311,100 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		}
 		pfree(sortKeys);
 	}
+	else if (deduplicate)
+	{
+		/* merge is unnecessary, deduplicate into posting lists */
+		BTDedupState dstate;
+
+		dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
+		dstate->deduplicate = true; /* unused */
+		dstate->maxpostingsize = 0; /* set later */
+		/* Metadata about base tuple of current pending posting list */
+		dstate->base = NULL;
+		dstate->baseoff = InvalidOffsetNumber;	/* unused */
+		dstate->basetupsize = 0;
+		/* Metadata about current pending posting list TIDs */
+		dstate->htids = NULL;
+		dstate->nhtids = 0;
+		dstate->nitems = 0;
+		dstate->phystupsize = 0;	/* unused */
+		dstate->nintervals = 0; /* unused */
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate,
+											   true)) != NULL)
+		{
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+			{
+				state = _bt_pagestate(wstate, 0);
+
+				/*
+				 * Limit size of posting list tuples to 1/10 space we want to
+				 * leave behind on the page, plus space for final item's line
+				 * pointer.  This is equal to the space that we'd like to
+				 * leave behind on each leaf page when fillfactor is 90,
+				 * allowing us to get close to fillfactor% space utilization
+				 * when there happen to be a great many duplicates.  (This
+				 * makes higher leaf fillfactor settings ineffective when
+				 * building indexes that have many duplicates, but packing
+				 * leaf pages full with few very large tuples doesn't seem
+				 * like a useful goal.)
+				 */
+				dstate->maxpostingsize = MAXALIGN_DOWN((BLCKSZ * 10 / 100)) -
+					sizeof(ItemIdData);
+				Assert(dstate->maxpostingsize <= BTMaxItemSize(state->btps_page) &&
+					   dstate->maxpostingsize <= INDEX_SIZE_MASK);
+				dstate->htids = palloc(dstate->maxpostingsize);
+
+				/* start new pending posting list with itup copy */
+				_bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
+										InvalidOffsetNumber);
+			}
+			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
+										 itup) > keysz &&
+					 _bt_dedup_save_htid(dstate, itup))
+			{
+				/*
+				 * Tuple is equal to base tuple of pending posting list.  Heap
+				 * TID from itup has been saved in state.
+				 */
+			}
+			else
+			{
+				/*
+				 * Tuple is not equal to pending posting list tuple, or
+				 * _bt_dedup_save_htid() opted to not merge current item into
+				 * pending posting list.
+				 */
+				_bt_sort_dedup_finish_pending(wstate, state, dstate);
+				pfree(dstate->base);
+
+				/* start new pending posting list with itup copy */
+				_bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
+										InvalidOffsetNumber);
+			}
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+
+		if (state)
+		{
+			/*
+			 * Handle the last item (there must be a last item when the
+			 * tuplesort returned one or more tuples)
+			 */
+			_bt_sort_dedup_finish_pending(wstate, state, dstate);
+			pfree(dstate->base);
+			pfree(dstate->htids);
+		}
+
+		pfree(dstate);
+	}
 	else
 	{
-		/* merge is unnecessary */
+		/* merging and deduplication are both unnecessary */
 		while ((itup = tuplesort_getindextuple(btspool->sortstate,
 											   true)) != NULL)
 		{
@@ -1253,7 +1412,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			if (state == NULL)
 				state = _bt_pagestate(wstate, 0);
 
-			_bt_buildadd(wstate, state, itup);
+			_bt_buildadd(wstate, state, itup, 0);
 
 			/* Report progress */
 			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 76c2d945c8..8ba055be9e 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
 	state.minfirstrightsz = SIZE_MAX;
 	state.newitemoff = newitemoff;
 
+	/* newitem cannot be a posting list item */
+	Assert(!BTreeTupleIsPosting(newitem));
+
 	/*
 	 * maxsplits should never exceed maxoff because there will be at most as
 	 * many candidate split points as there are points _between_ tuples, once
@@ -459,6 +462,7 @@ _bt_recsplitloc(FindSplitData *state,
 	int16		leftfree,
 				rightfree;
 	Size		firstrightitemsz;
+	Size		postingsz = 0;
 	bool		newitemisfirstonright;
 
 	/* Is the new item going to be the first item on the right page? */
@@ -468,8 +472,30 @@ _bt_recsplitloc(FindSplitData *state,
 	if (newitemisfirstonright)
 		firstrightitemsz = state->newitemsz;
 	else
+	{
 		firstrightitemsz = firstoldonrightsz;
 
+		/*
+		 * Calculate suffix truncation space saving when firstright is a
+		 * posting list tuple, though only when the firstright is over 64
+		 * bytes including line pointer overhead (arbitrary).  This avoids
+		 * accessing the tuple in cases where its posting list must be very
+		 * small (if firstright has one at all).
+		 */
+		if (state->is_leaf && firstrightitemsz > 64)
+		{
+			ItemId		itemid;
+			IndexTuple	newhighkey;
+
+			itemid = PageGetItemId(state->page, firstoldonright);
+			newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+			if (BTreeTupleIsPosting(newhighkey))
+				postingsz = IndexTupleSize(newhighkey) -
+					BTreeTupleGetPostingOffset(newhighkey);
+		}
+	}
+
 	/* Account for all the old tuples */
 	leftfree = state->leftspace - olddataitemstoleft;
 	rightfree = state->rightspace -
@@ -491,11 +517,17 @@ _bt_recsplitloc(FindSplitData *state,
 	 * If we are on the leaf level, assume that suffix truncation cannot avoid
 	 * adding a heap TID to the left half's new high key when splitting at the
 	 * leaf level.  In practice the new high key will often be smaller and
-	 * will rarely be larger, but conservatively assume the worst case.
+	 * will rarely be larger, but conservatively assume the worst case.  We do
+	 * go to the trouble of subtracting away posting list overhead, though
+	 * only when it looks like it will make an appreciable difference.
+	 * (Posting lists are the only case where truncation will typically make
+	 * the final high key far smaller than firstright, so being a bit more
+	 * precise there noticeably improves the balance of free space.)
 	 */
 	if (state->is_leaf)
 		leftfree -= (int16) (firstrightitemsz +
-							 MAXALIGN(sizeof(ItemPointerData)));
+							 MAXALIGN(sizeof(ItemPointerData)) -
+							 postingsz);
 	else
 		leftfree -= (int16) firstrightitemsz;
 
@@ -691,7 +723,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
 	tup = (IndexTuple) PageGetItem(state->page, itemid);
 	/* Do cheaper test first */
-	if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+	if (BTreeTupleIsPosting(tup) ||
+		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
 	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index c9f0402f8e..e759d72eff 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -81,7 +81,10 @@ static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
  *		determine whether or not the keys in the index are expected to be
  *		unique (i.e. if this is a "heapkeyspace" index).  We assume a
  *		heapkeyspace index when caller passes a NULL tuple, allowing index
- *		build callers to avoid accessing the non-existent metapage.
+ *		build callers to avoid accessing the non-existent metapage.  We
+ *		also assume that the index is _not_ allequalimage when a NULL tuple
+ *		is passed; CREATE INDEX callers call _bt_allequalimage() to set the
+ *		field themselves.
  */
 BTScanInsert
 _bt_mkscankey(Relation rel, IndexTuple itup)
@@ -108,7 +111,14 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	 */
 	key = palloc(offsetof(BTScanInsertData, scankeys) +
 				 sizeof(ScanKeyData) * indnkeyatts);
-	key->heapkeyspace = itup == NULL || _bt_heapkeyspace(rel);
+	if (itup)
+		_bt_metaversion(rel, &key->heapkeyspace, &key->allequalimage);
+	else
+	{
+		/* Utility statement callers can set these fields themselves */
+		key->heapkeyspace = true;
+		key->allequalimage = false;
+	}
 	key->anynullkeys = false;	/* initial assumption */
 	key->nextkey = false;
 	key->pivotsearch = false;
@@ -1374,6 +1384,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			continue;
 		}
 
@@ -1535,6 +1546,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			cmpresult = 0;
 			if (subkey->sk_flags & SK_ROW_END)
 				break;
@@ -1774,10 +1786,65 @@ _bt_killitems(IndexScanDesc scan)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
+			bool		killtuple = false;
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			if (BTreeTupleIsPosting(ituple))
 			{
-				/* found the item */
+				int			pi = i + 1;
+				int			nposting = BTreeTupleGetNPosting(ituple);
+				int			j;
+
+				/*
+				 * Note that we rely on the assumption that heap TIDs in the
+				 * scanpos items array are always in ascending heap TID order
+				 * within a posting list
+				 */
+				for (j = 0; j < nposting; j++)
+				{
+					ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+					if (!ItemPointerEquals(item, &kitem->heapTid))
+						break;	/* out of posting list loop */
+
+					/* kitem must have matching offnum when heap TIDs match */
+					Assert(kitem->indexOffset == offnum);
+
+					/*
+					 * Read-ahead to later kitems here.
+					 *
+					 * We rely on the assumption that not advancing kitem here
+					 * will prevent us from considering the posting list tuple
+					 * fully dead by not matching its next heap TID in next
+					 * loop iteration.
+					 *
+					 * If, on the other hand, this is the final heap TID in
+					 * the posting list tuple, then tuple gets killed
+					 * regardless (i.e. we handle the case where the last
+					 * kitem is also the last heap TID in the last index tuple
+					 * correctly -- posting tuple still gets killed).
+					 */
+					if (pi < numKilled)
+						kitem = &so->currPos.items[so->killedItems[pi++]];
+				}
+
+				/*
+				 * Don't bother advancing the outermost loop's int iterator to
+				 * avoid processing killed items that relate to the same
+				 * offnum/posting list tuple.  This micro-optimization hardly
+				 * seems worth it.  (Further iterations of the outermost loop
+				 * will fail to match on this same posting list's first heap
+				 * TID instead, so we'll advance to the next offnum/index
+				 * tuple pretty quickly.)
+				 */
+				if (j == nposting)
+					killtuple = true;
+			}
+			else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+				killtuple = true;
+
+			if (killtuple)
+			{
+				/* found the item/all posting list items */
 				ItemIdMarkDead(iid);
 				killedsomething = true;
 				break;			/* out of inner search loop */
@@ -2018,7 +2085,9 @@ btoptions(Datum reloptions, bool validate)
 	static const relopt_parse_elt tab[] = {
 		{"fillfactor", RELOPT_TYPE_INT, offsetof(BTOptions, fillfactor)},
 		{"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
-		offsetof(BTOptions, vacuum_cleanup_index_scale_factor)}
+		offsetof(BTOptions, vacuum_cleanup_index_scale_factor)},
+		{"deduplicate_items", RELOPT_TYPE_BOOL,
+		offsetof(BTOptions, deduplicate_items)}
 
 	};
 
@@ -2119,15 +2188,22 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	Size		newsize;
 
 	/*
-	 * We should only ever truncate leaf index tuples.  It's never okay to
-	 * truncate a second time.
+	 * We should only ever truncate non-pivot tuples from leaf pages.  It's
+	 * never okay to truncate when splitting an internal page.
 	 */
-	Assert(BTreeTupleGetNAtts(lastleft, rel) == natts);
-	Assert(BTreeTupleGetNAtts(firstright, rel) == natts);
+	Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
 
 	/* Determine how many attributes must be kept in truncated tuple */
 	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
 
+	/*
+	 * Deduplication should only be considered safe when _bt_keep_natts() and
+	 * _bt_keep_natts_fast() will always give the same answer.  Assert that
+	 * this condition is met for allequalimage indexes in passing.
+	 */
+	Assert(!itup_key->allequalimage ||
+		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+
 #ifdef DEBUG_NO_TRUNCATE
 	/* Force truncation to be ineffective for testing purposes */
 	keepnatts = nkeyatts + 1;
@@ -2139,6 +2215,19 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(pivot))
+		{
+			/*
+			 * index_truncate_tuple() just returns a straight copy of
+			 * firstright when it has no key attributes to truncate.  We need
+			 * to truncate away the posting list ourselves.
+			 */
+			Assert(keepnatts == nkeyatts);
+			Assert(natts == nkeyatts);
+			pivot->t_info &= ~INDEX_SIZE_MASK;
+			pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+		}
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2155,6 +2244,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * attribute to the new pivot tuple.
 		 */
 		Assert(natts != nkeyatts);
+		Assert(!BTreeTupleIsPosting(lastleft) &&
+			   !BTreeTupleIsPosting(firstright));
 		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
 		tidpivot = palloc0(newsize);
 		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2172,6 +2263,19 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
 		pivot = palloc0(newsize);
 		memcpy(pivot, firstright, IndexTupleSize(firstright));
+
+		if (BTreeTupleIsPosting(firstright))
+		{
+			/*
+			 * New pivot tuple was copied from firstright, which happens to be
+			 * a posting list tuple.  We will have to include the max lastleft
+			 * heap TID in the final pivot tuple, but we can remove the
+			 * posting list now. (Pivot tuples should never contain a posting
+			 * list.)
+			 */
+			newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+				MAXALIGN(sizeof(ItemPointerData));
+		}
 	}
 
 	/*
@@ -2199,7 +2303,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2210,9 +2314,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
+							  BTreeTupleGetHeapTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #else
 
 	/*
@@ -2225,7 +2332,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2234,7 +2341,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2315,13 +2423,16 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
  * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal after detoasting.
+ * unless they're bitwise equal after detoasting.  When an index only has
+ * "equal image" columns, routine is guaranteed to give the same result as
+ * _bt_keep_natts would.
  *
- * These issues must be acceptable to callers, typically because they're only
- * concerned about making suffix truncation as effective as possible without
- * leaving excessive amounts of free space on either side of page split.
  * Callers can rely on the fact that attributes considered equal here are
- * definitely also equal according to _bt_keep_natts.
+ * definitely also equal according to _bt_keep_natts, even when the index uses
+ * an opclass or collation that is not "allequalimage"/deduplication-safe.
+ * This weaker guarantee is good enough for nbtsplitloc.c caller, since false
+ * negatives generally only have the effect of making leaf page splits use a
+ * more balanced split point.
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2393,28 +2504,42 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * Mask allocated for number of keys in index tuple must be able to fit
 	 * maximum possible number of index attributes
 	 */
-	StaticAssertStmt(BT_N_KEYS_OFFSET_MASK >= INDEX_MAX_KEYS,
-					 "BT_N_KEYS_OFFSET_MASK can't fit INDEX_MAX_KEYS");
+	StaticAssertStmt(BT_OFFSET_MASK >= INDEX_MAX_KEYS,
+					 "BT_OFFSET_MASK can't fit INDEX_MAX_KEYS");
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
+	/* !heapkeyspace indexes do not support deduplication */
+	if (!heapkeyspace && BTreeTupleIsPosting(itup))
+		return false;
+
+	/* Posting list tuples should never have "pivot heap TID" bit set */
+	if (BTreeTupleIsPosting(itup) &&
+		(ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+		 BT_PIVOT_HEAP_TID_ATTR) != 0)
+		return false;
+
+	/* INCLUDE indexes do not support deduplication */
+	if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+		return false;
+
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
 			/*
-			 * Non-pivot tuples currently never use alternative heap TID
-			 * representation -- even those within heapkeyspace indexes
+			 * Non-pivot tuple should never be explicitly marked as a pivot
+			 * tuple
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated.  (Note that tupnatts must have been
-			 * inferred, rather than coming from an explicit on-disk
-			 * representation.)
+			 * inferred, even with a posting list tuple, because only pivot
+			 * tuples store tupnatts directly.)
 			 */
 			return tupnatts == natts;
 		}
@@ -2458,12 +2583,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * non-zero, or when there is no explicit representation and the
 			 * tuple is evidently not a pre-pg_upgrade tuple.
 			 *
-			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
-			 * that to decide if the tuple is a pre-v11 tuple.
+			 * Prior to v11, downlinks always had P_HIKEY as their offset.
+			 * Accept that as an alternative indication of a valid
+			 * !heapkeyspace negative infinity tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
-				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+				ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
 		}
 		else
 		{
@@ -2489,7 +2614,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
+		return false;
+
+	/* Pivot tuple should not use posting list representation (redundant) */
+	if (BTreeTupleIsPosting(itup))
 		return false;
 
 	/*
@@ -2559,8 +2688,8 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					BTMaxItemSizeNoHeapTid(page),
 					RelationGetRelationName(rel)),
 			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
-					   ItemPointerGetBlockNumber(&newtup->t_tid),
-					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+					   ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
 					   RelationGetRelationName(heap)),
 			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
 					 "Consider a function index of an MD5 hash of the value, "
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 2e5202c2d6..4d1170ff47 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -22,6 +22,9 @@
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/procarray.h"
+#include "utils/memutils.h"
+
+static MemoryContext opCtx;		/* working memory for operations */
 
 /*
  * _bt_restore_page -- re-enter all the index tuples on a page
@@ -111,6 +114,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
 	Assert(md->btm_version >= BTREE_NOVAC_VERSION);
 	md->btm_oldest_btpo_xact = xlrec->oldest_btpo_xact;
 	md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
+	md->btm_allequalimage = xlrec->allequalimage;
 
 	pageop = (BTPageOpaque) PageGetSpecialPointer(metapg);
 	pageop->btpo_flags = BTP_META;
@@ -156,7 +160,8 @@ _bt_clear_incomplete_split(XLogReaderState *record, uint8 block_id)
 }
 
 static void
-btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
+btree_xlog_insert(bool isleaf, bool ismeta, bool posting,
+				  XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	xl_btree_insert *xlrec = (xl_btree_insert *) XLogRecGetData(record);
@@ -181,9 +186,52 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 
 		page = BufferGetPage(buffer);
 
-		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
-						false, false) == InvalidOffsetNumber)
-			elog(PANIC, "btree_xlog_insert: failed to add item");
+		if (!posting)
+		{
+			/* Simple retail insertion */
+			if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+		}
+		else
+		{
+			ItemId		itemid;
+			IndexTuple	oposting,
+						newitem,
+						nposting;
+			uint16		postingoff;
+
+			/*
+			 * A posting list split occurred during leaf page insertion.  WAL
+			 * record data will start with an offset number representing the
+			 * point in an existing posting list that a split occurs at.
+			 *
+			 * Use _bt_swap_posting() to repeat posting list split steps from
+			 * primary.  Note that newitem from WAL record is 'orignewitem',
+			 * not the final version of newitem that is actually inserted on
+			 * page.
+			 */
+			postingoff = *((uint16 *) datapos);
+			datapos += sizeof(uint16);
+			datalen -= sizeof(uint16);
+
+			itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+			oposting = (IndexTuple) PageGetItem(page, itemid);
+
+			/* Use mutable, aligned newitem copy in _bt_swap_posting() */
+			Assert(isleaf && postingoff > 0);
+			newitem = CopyIndexTuple((IndexTuple) datapos);
+			nposting = _bt_swap_posting(newitem, oposting, postingoff);
+
+			/* Replace existing posting list with post-split version */
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+			/* Insert "final" new item (not orignewitem from WAL stream) */
+			Assert(IndexTupleSize(newitem) == datalen);
+			if (PageAddItem(page, (Item) newitem, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add posting split new item");
+		}
 
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
@@ -265,20 +313,38 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 		BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
 		OffsetNumber off;
 		IndexTuple	newitem = NULL,
-					left_hikey = NULL;
+					left_hikey = NULL,
+					nposting = NULL;
 		Size		newitemsz = 0,
 					left_hikeysz = 0;
 		Page		newlpage;
-		OffsetNumber leftoff;
+		OffsetNumber leftoff,
+					replacepostingoff = InvalidOffsetNumber;
 
 		datapos = XLogRecGetBlockData(record, 0, &datalen);
 
-		if (onleft)
+		if (onleft || xlrec->postingoff != 0)
 		{
 			newitem = (IndexTuple) datapos;
 			newitemsz = MAXALIGN(IndexTupleSize(newitem));
 			datapos += newitemsz;
 			datalen -= newitemsz;
+
+			if (xlrec->postingoff != 0)
+			{
+				ItemId		itemid;
+				IndexTuple	oposting;
+
+				/* Posting list must be at offset number before new item's */
+				replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+				/* Use mutable, aligned newitem copy in _bt_swap_posting() */
+				newitem = CopyIndexTuple(newitem);
+				itemid = PageGetItemId(lpage, replacepostingoff);
+				oposting = (IndexTuple) PageGetItem(lpage, itemid);
+				nposting = _bt_swap_posting(newitem, oposting,
+											xlrec->postingoff);
+			}
 		}
 
 		/*
@@ -308,8 +374,20 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 			Size		itemsz;
 			IndexTuple	item;
 
+			/* Add replacement posting list when required */
+			if (off == replacepostingoff)
+			{
+				Assert(onleft || xlrec->firstright == xlrec->newitemoff);
+				if (PageAddItem(newlpage, (Item) nposting,
+								MAXALIGN(IndexTupleSize(nposting)), leftoff,
+								false, false) == InvalidOffsetNumber)
+					elog(ERROR, "failed to add new posting list item to left page after split");
+				leftoff = OffsetNumberNext(leftoff);
+				continue;		/* don't insert oposting */
+			}
+
 			/* add the new item if it was inserted on left page */
-			if (onleft && off == xlrec->newitemoff)
+			else if (onleft && off == xlrec->newitemoff)
 			{
 				if (PageAddItem(newlpage, (Item) newitem, newitemsz, leftoff,
 								false, false) == InvalidOffsetNumber)
@@ -383,6 +461,98 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 	}
 }
 
+static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+	Buffer		buf;
+
+	if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+	{
+		char	   *ptr = XLogRecGetBlockData(record, 0, NULL);
+		Page		page = (Page) BufferGetPage(buf);
+		BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		OffsetNumber offnum,
+					minoff,
+					maxoff;
+		BTDedupState state;
+		BTDedupInterval *intervals;
+		Page		newpage;
+
+		state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+		state->deduplicate = true;	/* unused */
+		/* Conservatively use larger maxpostingsize than primary */
+		state->maxpostingsize = BTMaxItemSize(page);
+		state->base = NULL;
+		state->baseoff = InvalidOffsetNumber;
+		state->basetupsize = 0;
+		state->htids = palloc(state->maxpostingsize);
+		state->nhtids = 0;
+		state->nitems = 0;
+		state->phystupsize = 0;
+		state->nintervals = 0;
+
+		minoff = P_FIRSTDATAKEY(opaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+		newpage = PageGetTempPageCopySpecial(page);
+
+		if (!P_RIGHTMOST(opaque))
+		{
+			ItemId		itemid = PageGetItemId(page, P_HIKEY);
+			Size		itemsz = ItemIdGetLength(itemid);
+			IndexTuple	item = (IndexTuple) PageGetItem(page, itemid);
+
+			if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+							false, false) == InvalidOffsetNumber)
+				elog(ERROR, "failed to add highkey during deduplication");
+		}
+
+		intervals = (BTDedupInterval *) ptr;
+		for (offnum = minoff;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemid = PageGetItemId(page, offnum);
+			IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+			if (offnum == minoff)
+				_bt_dedup_start_pending(state, itup, offnum);
+			else if (state->nintervals < xlrec->nintervals &&
+					 state->baseoff == intervals[state->nintervals].baseoff &&
+					 state->nitems < intervals[state->nintervals].nitems)
+			{
+				if (!_bt_dedup_save_htid(state, itup))
+					elog(ERROR, "could not add heap tid to pending posting list");
+			}
+			else
+			{
+				_bt_dedup_finish_pending(newpage, state);
+				_bt_dedup_start_pending(state, itup, offnum);
+			}
+		}
+
+		_bt_dedup_finish_pending(newpage, state);
+		Assert(state->nintervals == xlrec->nintervals);
+		Assert(memcmp(state->intervals, intervals,
+					  state->nintervals * sizeof(BTDedupInterval)) == 0);
+
+		if (P_HAS_GARBAGE(opaque))
+		{
+			BTPageOpaque nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+			nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+		}
+
+		PageRestoreTempPage(newpage, page);
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buf);
+	}
+
+	if (BufferIsValid(buf))
+		UnlockReleaseBuffer(buf);
+}
+
 static void
 btree_xlog_vacuum(XLogReaderState *record)
 {
@@ -405,7 +575,56 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		page = (Page) BufferGetPage(buffer);
 
-		PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
+		if (xlrec->nupdated > 0)
+		{
+			OffsetNumber *updatedoffsets;
+			xl_btree_update *updates;
+
+			updatedoffsets = (OffsetNumber *)
+				(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+			updates = (xl_btree_update *) ((char *) updatedoffsets +
+										   xlrec->nupdated *
+										   sizeof(OffsetNumber));
+
+			for (int i = 0; i < xlrec->nupdated; i++)
+			{
+				BTVacuumPosting vacposting;
+				IndexTuple	origtuple;
+				ItemId		itemid;
+				Size		itemsz;
+
+				itemid = PageGetItemId(page, updatedoffsets[i]);
+				origtuple = (IndexTuple) PageGetItem(page, itemid);
+
+				vacposting = palloc(offsetof(BTVacuumPostingData, deletetids) +
+									updates->ndeletedtids * sizeof(uint16));
+				vacposting->updatedoffset = updatedoffsets[i];
+				vacposting->itup = origtuple;
+				vacposting->ndeletedtids = updates->ndeletedtids;
+				memcpy(vacposting->deletetids,
+					   (char *) updates + SizeOfBtreeUpdate,
+					   updates->ndeletedtids * sizeof(uint16));
+
+				_bt_update_posting(vacposting);
+
+				/* Overwrite updated version of tuple */
+				itemsz = MAXALIGN(IndexTupleSize(vacposting->itup));
+				if (!PageIndexTupleOverwrite(page, updatedoffsets[i],
+											 (Item) vacposting->itup, itemsz))
+					elog(PANIC, "could not update partially dead item");
+
+				pfree(vacposting->itup);
+				pfree(vacposting);
+
+				/* advance to next xl_btree_update/update */
+				updates = (xl_btree_update *)
+					((char *) updates + SizeOfBtreeUpdate +
+					 updates->ndeletedtids * sizeof(uint16));
+			}
+		}
+
+		if (xlrec->ndeleted > 0)
+			PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 
 		/*
 		 * Mark the page as not containing any LP_DEAD items --- see comments
@@ -724,17 +943,22 @@ void
 btree_redo(XLogReaderState *record)
 {
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	MemoryContext oldCtx;
 
+	oldCtx = MemoryContextSwitchTo(opCtx);
 	switch (info)
 	{
 		case XLOG_BTREE_INSERT_LEAF:
-			btree_xlog_insert(true, false, record);
+			btree_xlog_insert(true, false, false, record);
 			break;
 		case XLOG_BTREE_INSERT_UPPER:
-			btree_xlog_insert(false, false, record);
+			btree_xlog_insert(false, false, false, record);
 			break;
 		case XLOG_BTREE_INSERT_META:
-			btree_xlog_insert(false, true, record);
+			btree_xlog_insert(false, true, false, record);
+			break;
+		case XLOG_BTREE_INSERT_POST:
+			btree_xlog_insert(true, false, true, record);
 			break;
 		case XLOG_BTREE_SPLIT_L:
 			btree_xlog_split(true, record);
@@ -742,6 +966,9 @@ btree_redo(XLogReaderState *record)
 		case XLOG_BTREE_SPLIT_R:
 			btree_xlog_split(false, record);
 			break;
+		case XLOG_BTREE_DEDUP:
+			btree_xlog_dedup(record);
+			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
 			break;
@@ -767,6 +994,23 @@ btree_redo(XLogReaderState *record)
 		default:
 			elog(PANIC, "btree_redo: unknown op code %u", info);
 	}
+	MemoryContextSwitchTo(oldCtx);
+	MemoryContextReset(opCtx);
+}
+
+void
+btree_xlog_startup(void)
+{
+	opCtx = AllocSetContextCreate(CurrentMemoryContext,
+								  "Btree recovery temporary context",
+								  ALLOCSET_DEFAULT_SIZES);
+}
+
+void
+btree_xlog_cleanup(void)
+{
+	MemoryContextDelete(opCtx);
+	opCtx = NULL;
 }
 
 /*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 7d63a7124e..7bbe55c5cf 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -27,6 +27,7 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 		case XLOG_BTREE_INSERT_LEAF:
 		case XLOG_BTREE_INSERT_UPPER:
 		case XLOG_BTREE_INSERT_META:
+		case XLOG_BTREE_INSERT_POST:
 			{
 				xl_btree_insert *xlrec = (xl_btree_insert *) rec;
 
@@ -38,15 +39,24 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
-				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d",
-								 xlrec->level, xlrec->firstright, xlrec->newitemoff);
+				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, postingoff %d",
+								 xlrec->level, xlrec->firstright,
+								 xlrec->newitemoff, xlrec->postingoff);
+				break;
+			}
+		case XLOG_BTREE_DEDUP:
+			{
+				xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+				appendStringInfo(buf, "nintervals %u", xlrec->nintervals);
 				break;
 			}
 		case XLOG_BTREE_VACUUM:
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "ndeleted %u", xlrec->ndeleted);
+				appendStringInfo(buf, "ndeleted %u; nupdated %u",
+								 xlrec->ndeleted, xlrec->nupdated);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
@@ -130,6 +140,12 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
+		case XLOG_BTREE_DEDUP:
+			id = "DEDUP";
+			break;
+		case XLOG_BTREE_INSERT_POST:
+			id = "INSERT_POST";
+			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 4ea6ea7a3d..f57ea0a0e7 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1048,8 +1048,10 @@ PageIndexTupleDeleteNoCompact(Page page, OffsetNumber offnum)
  * This is better than deleting and reinserting the tuple, because it
  * avoids any data shifting when the tuple size doesn't change; and
  * even when it does, we avoid moving the line pointers around.
- * Conceivably this could also be of use to an index AM that cares about
- * the physical order of tuples as well as their ItemId order.
+ * This could be used by an index AM that doesn't want to unset the
+ * LP_DEAD bit when it happens to be set.  It could conceivably also be
+ * used by an index AM that cares about the physical order of tuples as
+ * well as their logical/ItemId order.
  *
  * If there's insufficient space for the new tuple, return false.  Other
  * errors represent data-corruption problems, so we just elog.
@@ -1135,7 +1137,8 @@ PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
 	}
 
 	/* Update the item's tuple length (other fields shouldn't change) */
-	ItemIdSetNormal(tupid, offset + size_diff, newsize);
+	tupid->lp_off = offset + size_diff;
+	tupid->lp_len = newsize;
 
 	/* Copy new tuple data onto page */
 	memcpy(PageGetItem(page, tupid), newtup, newsize);
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index dc03fbde13..b6b08d0ccb 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1731,14 +1731,14 @@ psql_completion(const char *text, int start, int end)
 	/* ALTER INDEX <foo> SET|RESET ( */
 	else if (Matches("ALTER", "INDEX", MatchAny, "RESET", "("))
 		COMPLETE_WITH("fillfactor",
-					  "vacuum_cleanup_index_scale_factor",	/* BTREE */
+					  "vacuum_cleanup_index_scale_factor", "deduplicate_items",	/* BTREE */
 					  "fastupdate", "gin_pending_list_limit",	/* GIN */
 					  "buffering",	/* GiST */
 					  "pages_per_range", "autosummarize"	/* BRIN */
 			);
 	else if (Matches("ALTER", "INDEX", MatchAny, "SET", "("))
 		COMPLETE_WITH("fillfactor =",
-					  "vacuum_cleanup_index_scale_factor =",	/* BTREE */
+					  "vacuum_cleanup_index_scale_factor =", "deduplicate_items =",	/* BTREE */
 					  "fastupdate =", "gin_pending_list_limit =",	/* GIN */
 					  "buffering =",	/* GiST */
 					  "pages_per_range =", "autosummarize ="	/* BRIN */
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 6a058ccdac..8a830e570c 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -145,6 +145,7 @@ static void bt_tuple_present_callback(Relation index, ItemPointer tid,
 									  bool tupleIsAlive, void *checkstate);
 static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 									 IndexTuple itup);
+static inline IndexTuple bt_posting_plain_tuple(IndexTuple itup, int n);
 static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 											   OffsetNumber offset);
@@ -167,6 +168,7 @@ static ItemId PageGetItemIdCareful(BtreeCheckState *state, BlockNumber block,
 								   Page page, OffsetNumber offset);
 static inline ItemPointer BTreeTupleGetHeapTIDCareful(BtreeCheckState *state,
 													  IndexTuple itup, bool nonpivot);
+static inline ItemPointer BTreeTupleGetPointsToTID(IndexTuple itup);
 
 /*
  * bt_index_check(index regclass, heapallindexed boolean)
@@ -278,7 +280,8 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
 
 	if (btree_index_mainfork_expected(indrel))
 	{
-		bool	heapkeyspace;
+		bool		heapkeyspace,
+					allequalimage;
 
 		RelationOpenSmgr(indrel);
 		if (!smgrexists(indrel->rd_smgr, MAIN_FORKNUM))
@@ -288,7 +291,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
 							RelationGetRelationName(indrel))));
 
 		/* Check index, possibly against table it is an index on */
-		heapkeyspace = _bt_heapkeyspace(indrel);
+		_bt_metaversion(indrel, &heapkeyspace, &allequalimage);
 		bt_check_every_level(indrel, heaprel, heapkeyspace, parentcheck,
 							 heapallindexed, rootdescend);
 	}
@@ -419,12 +422,12 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		/*
 		 * Size Bloom filter based on estimated number of tuples in index,
 		 * while conservatively assuming that each block must contain at least
-		 * MaxIndexTuplesPerPage / 5 non-pivot tuples.  (Non-leaf pages cannot
-		 * contain non-pivot tuples.  That's okay because they generally make
-		 * up no more than about 1% of all pages in the index.)
+		 * MaxTIDsPerBTreePage / 3 "plain" tuples -- see
+		 * bt_posting_plain_tuple() for definition, and details of how posting
+		 * list tuples are handled.
 		 */
 		total_pages = RelationGetNumberOfBlocks(rel);
-		total_elems = Max(total_pages * (MaxIndexTuplesPerPage / 5),
+		total_elems = Max(total_pages * (MaxTIDsPerBTreePage / 3),
 						  (int64) state->rel->rd_rel->reltuples);
 		/* Random seed relies on backend srandom() call to avoid repetition */
 		seed = random();
@@ -924,6 +927,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -954,13 +958,15 @@ bt_target_page_check(BtreeCheckState *state)
 		if (!_bt_check_natts(state->rel, state->heapkeyspace, state->target,
 							 offset))
 		{
+			ItemPointer tid;
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetPointsToTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -994,18 +1000,20 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * TID, since the posting list itself is validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
+			ItemPointer tid = BTreeTupleGetPointsToTID(itup);
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
-			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+			htid = psprintf("(%u,%u)", ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1017,6 +1025,40 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is a posting list tuple, make sure posting list TIDs are
+		 * in order
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+				current = BTreeTupleGetPostingN(itup, i);
+
+				if (ItemPointerCompare(current, &last) <= 0)
+				{
+					char	   *itid = psprintf("(%u,%u)", state->targetblock, offset);
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s posting list offset=%d page lsn=%X/%X.",
+												itid, i,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1049,13 +1091,14 @@ bt_target_page_check(BtreeCheckState *state)
 		if (tupsize > (lowersizelimit ? BTMaxItemSize(state->target) :
 					   BTMaxItemSizeNoHeapTid(state->target)))
 		{
+			ItemPointer tid = BTreeTupleGetPointsToTID(itup);
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1074,12 +1117,32 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				/* Fingerprint all elements as distinct "plain" tuples */
+				for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					IndexTuple	logtuple;
+
+					logtuple = bt_posting_plain_tuple(itup, i);
+					norm = bt_normalize_tuple(state, logtuple);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != logtuple)
+						pfree(norm);
+					pfree(logtuple);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
@@ -1087,7 +1150,8 @@ bt_target_page_check(BtreeCheckState *state)
 		 *
 		 * If there is a high key (if this is not the rightmost page on its
 		 * entire level), check that high key actually is upper bound on all
-		 * page items.
+		 * page items.  If this is a posting list tuple, we'll need to set
+		 * scantid to be highest TID in posting list.
 		 *
 		 * We prefer to check all items against high key rather than checking
 		 * just the last and trusting that the operator class obeys the
@@ -1127,17 +1191,22 @@ bt_target_page_check(BtreeCheckState *state)
 		 * tuple. (See also: "Notes About Data Representation" in the nbtree
 		 * README.)
 		 */
+		scantid = skey->scantid;
+		if (state->heapkeyspace && BTreeTupleIsPosting(itup))
+			skey->scantid = BTreeTupleGetMaxHeapTID(itup);
+
 		if (!P_RIGHTMOST(topaque) &&
 			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
 			  invariant_l_offset(state, skey, P_HIKEY)))
 		{
+			ItemPointer tid = BTreeTupleGetPointsToTID(itup);
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1150,6 +1219,8 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
+		/* Reset, in case scantid was set to (itup) posting tuple's max TID */
+		skey->scantid = scantid;
 
 		/*
 		 * * Item order check *
@@ -1160,15 +1231,17 @@ bt_target_page_check(BtreeCheckState *state)
 		if (OffsetNumberNext(offset) <= max &&
 			!invariant_l_offset(state, skey, OffsetNumberNext(offset)))
 		{
+			ItemPointer tid;
 			char	   *itid,
 					   *htid,
 					   *nitid,
 					   *nhtid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetPointsToTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 			nitid = psprintf("(%u,%u)", state->targetblock,
 							 OffsetNumberNext(offset));
 
@@ -1177,9 +1250,10 @@ bt_target_page_check(BtreeCheckState *state)
 										  state->target,
 										  OffsetNumberNext(offset));
 			itup = (IndexTuple) PageGetItem(state->target, itemid);
+			tid = BTreeTupleGetPointsToTID(itup);
 			nhtid = psprintf("(%u,%u)",
-							 ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							 ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							 ItemPointerGetBlockNumberNoCheck(tid),
+							 ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1953,10 +2027,9 @@ bt_tuple_present_callback(Relation index, ItemPointer tid, Datum *values,
  * verification.  In particular, it won't try to normalize opclass-equal
  * datums with potentially distinct representations (e.g., btree/numeric_ops
  * index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though.  For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation.
  */
 static IndexTuple
 bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -1969,6 +2042,9 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	IndexTuple	reformed;
 	int			i;
 
+	/* Caller should only pass "logical" non-pivot tuples here */
+	Assert(!BTreeTupleIsPosting(itup) && !BTreeTupleIsPivot(itup));
+
 	/* Easy case: It's immediately clear that tuple has no varlena datums */
 	if (!IndexTupleHasVarwidths(itup))
 		return itup;
@@ -2031,6 +2107,29 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	return reformed;
 }
 
+/*
+ * Produce palloc()'d "plain" tuple for nth posting list entry/TID.
+ *
+ * In general, deduplication is not supposed to change the logical contents of
+ * an index.  Multiple index tuples are merged together into one equivalent
+ * posting list index tuple when convenient.
+ *
+ * heapallindexed verification must normalize-away this variation in
+ * representation by converting posting list tuples into two or more "plain"
+ * tuples.  Each tuple must be fingerprinted separately -- there must be one
+ * tuple for each corresponding Bloom filter probe during the heap scan.
+ *
+ * Note: Caller still needs to call bt_normalize_tuple() with returned tuple.
+ */
+static inline IndexTuple
+bt_posting_plain_tuple(IndexTuple itup, int n)
+{
+	Assert(BTreeTupleIsPosting(itup));
+
+	/* Returns non-posting-list tuple */
+	return _bt_form_posting(itup, BTreeTupleGetPostingN(itup, n), 1);
+}
+
 /*
  * Search for itup in index, starting from fast root page.  itup must be a
  * non-pivot tuple.  This is only supported with heapkeyspace indexes, since
@@ -2087,6 +2186,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		insertstate.itup = itup;
 		insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 		insertstate.itup_key = key;
+		insertstate.postingoff = 0;
 		insertstate.bounds_valid = false;
 		insertstate.buf = lbuf;
 
@@ -2094,7 +2194,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		offnum = _bt_binsrch_insert(state->rel, &insertstate);
 		/* Compare first >= matching item on leaf page, if any */
 		page = BufferGetPage(lbuf);
+		/* Should match on first heap TID when tuple has a posting list */
 		if (offnum <= PageGetMaxOffsetNumber(page) &&
+			insertstate.postingoff <= 0 &&
 			_bt_compare(state->rel, key, page, offnum) == 0)
 			exists = true;
 		_bt_relbuf(state->rel, lbuf);
@@ -2548,26 +2650,69 @@ PageGetItemIdCareful(BtreeCheckState *state, BlockNumber block, Page page,
 }
 
 /*
- * BTreeTupleGetHeapTID() wrapper that lets caller enforce that a heap TID must
- * be present in cases where that is mandatory.
- *
- * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
- * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
- * It may become more useful in the future, when non-pivot tuples support their
- * own alternative INDEX_ALT_TID_MASK representation.
+ * BTreeTupleGetHeapTID() wrapper that enforces that a heap TID is present in
+ * cases where that is mandatory (i.e. for non-pivot tuples)
  */
 static inline ItemPointer
 BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
 							bool nonpivot)
 {
-	ItemPointer result = BTreeTupleGetHeapTID(itup);
-	BlockNumber targetblock = state->targetblock;
+	ItemPointer htid;
 
-	if (result == NULL && nonpivot)
+	/*
+	 * Caller determines whether this is supposed to be a pivot or non-pivot
+	 * tuple using page type and item offset number.  Verify that tuple
+	 * metadata agrees with this.
+	 */
+	Assert(state->heapkeyspace);
+	if (BTreeTupleIsPivot(itup) && nonpivot)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg_internal("block %u or its right sibling block or child block in index \"%s\" has unexpected pivot tuple",
+								 state->targetblock,
+								 RelationGetRelationName(state->rel))));
+
+	if (!BTreeTupleIsPivot(itup) && !nonpivot)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg_internal("block %u or its right sibling block or child block in index \"%s\" has unexpected non-pivot tuple",
+								 state->targetblock,
+								 RelationGetRelationName(state->rel))));
+
+	htid = BTreeTupleGetHeapTID(itup);
+	if (!ItemPointerIsValid(htid) && nonpivot)
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
-						targetblock, RelationGetRelationName(state->rel))));
+						state->targetblock,
+						RelationGetRelationName(state->rel))));
 
-	return result;
+	return htid;
+}
+
+/*
+ * Return the "pointed to" TID for itup, which is used to generate a
+ * descriptive error message.  itup must be a "data item" tuple (it wouldn't
+ * make much sense to call here with a high key tuple, since there won't be a
+ * valid downlink/block number to display).
+ *
+ * Returns either a heap TID (which will be the first heap TID in posting list
+ * if itup is posting list tuple), or a TID that contains downlink block
+ * number, plus some encoded metadata (e.g., the number of attributes present
+ * in itup).
+ */
+static inline ItemPointer
+BTreeTupleGetPointsToTID(IndexTuple itup)
+{
+	/*
+	 * Rely on the assumption that !heapkeyspace internal page data items will
+	 * correctly return TID with downlink here -- BTreeTupleGetHeapTID() won't
+	 * recognize it as a pivot tuple, but everything still works out because
+	 * the t_tid field is still returned
+	 */
+	if (!BTreeTupleIsPivot(itup))
+		return BTreeTupleGetHeapTID(itup);
+
+	/* Pivot tuple returns TID with downlink block (heapkeyspace variant) */
+	return &itup->t_tid;
 }
diff --git a/contrib/citext/expected/citext_1.out b/contrib/citext/expected/citext_1.out
index 33e3676d3c..4be8cccdaa 100644
--- a/contrib/citext/expected/citext_1.out
+++ b/contrib/citext/expected/citext_1.out
@@ -238,6 +238,7 @@ WHERE  citext_hash(v)::bit(32) != citext_hash_extended(v, 0)::bit(32)
 CREATE TEMP TABLE try (
    name citext PRIMARY KEY
 );
+NOTICE:  index "try_pkey" cannot use deduplication
 INSERT INTO try (name)
 VALUES ('a'), ('ab'), ('â'), ('aba'), ('b'), ('ba'), ('bab'), ('AZ');
 SELECT name, 'a' = name AS eq_a   FROM try WHERE name <> 'â';
@@ -345,6 +346,7 @@ VALUES ('abb'),
        ('ABC'),
        ('abd');
 CREATE INDEX srt_name ON srt (name);
+NOTICE:  index "srt_name" cannot use deduplication
 -- Check the min() and max() aggregates, with and without index.
 set enable_seqscan = off;
 SELECT MIN(name) AS "ABA" FROM srt;
diff --git a/contrib/hstore/expected/hstore.out b/contrib/hstore/expected/hstore.out
index 4f1db01b3e..67a705a32f 100644
--- a/contrib/hstore/expected/hstore.out
+++ b/contrib/hstore/expected/hstore.out
@@ -1448,6 +1448,7 @@ set enable_sort = true;
 -- btree
 drop index hidx;
 create index hidx on testhstore using btree (h);
+NOTICE:  index "hidx" cannot use deduplication
 set enable_seqscan=off;
 select count(*) from testhstore where h #># 'p=>1';
  count 
diff --git a/contrib/ltree/expected/ltree.out b/contrib/ltree/expected/ltree.out
index 8226930905..01344e8b19 100644
--- a/contrib/ltree/expected/ltree.out
+++ b/contrib/ltree/expected/ltree.out
@@ -3348,6 +3348,7 @@ SELECT * FROM ltreetest WHERE t ? '{23.*.1,23.*.2}' order by t asc;
 (4 rows)
 
 create unique index tstidx on ltreetest (t);
+NOTICE:  index "tstidx" cannot use deduplication
 set enable_seqscan=off;
 SELECT * FROM ltreetest WHERE t <  '12.3' order by t asc;
                 t                 
diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
index e3c69f8de6..a402f33341 100644
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@@ -517,14 +517,191 @@ equalimage() returns bool
 
 </sect1>
 
-<sect1 id="btree-implementation">
- <title>Implementation</title>
+<sect1 id="btree-storage">
+ <title>Physical Storage</title>
+
+ <para>
+  <productname>PostgreSQL</productname> B-Tree indexes are multi-level
+  tree structures, where each level of the tree can be used as a
+  doubly-linked list of pages.  A single metapage is stored in a fixed
+  position at the start of the first segment file of the index.  All
+  other pages are either leaf pages or internal pages.  Typically, the
+  vast majority of all pages are leaf pages unless the index is very
+  small.  Leaf pages are the pages on the lowest level of the tree.
+  All other levels consist of internal pages.
+ </para>
+ <para>
+  Each leaf page contains tuples that point to table entries using a
+  heap item pointer.  Each tuple is considered unique internally,
+  since the item pointer is treated as a tiebreaker column.  Each
+  internal page contains tuples that point to the next level down in
+  the tree.  Both internal pages and leaf pages use the standard page
+  format described in <xref linkend="storage-page-layout"/>.  Index
+  scans use internal pages to locate the first leaf page that could
+  have matching tuples.
+ </para>
+
+ <sect2 id="btree-maintain-structure">
+  <title>Maintaining the Tree Structure</title>
+  <para>
+   New pages are added to a B-Tree index when an existing page becomes
+   full, and a <firstterm>page split</firstterm> is required to fit a
+   new item that belongs on the overflowing page.  New levels are
+   added to a B-Tree index when the root page becomes full, causing a
+   <firstterm>root page split</firstterm>.  Even the largest B-Tree
+   indexes rarely have more than four or five levels.
+  </para>
+  <para>
+   A much more technical guide to the B-Tree index implementation can
+   be found in <filename>src/backend/access/nbtree/README</filename>.
+  </para>
+ </sect2>
+
+ <sect2 id="btree-deduplication">
+  <title>Posting List Tuples and Deduplication</title>
+  <para>
+   B-Tree indexes can perform <firstterm>deduplication</firstterm>.  A
+   <firstterm>duplicate</firstterm> is a row where
+   <emphasis>all</emphasis> indexed key columns are equal to the
+   corresponding column values from some other row.  Existing
+   duplicate leaf page tuples are merged together into a single
+   <quote>posting list</quote> tuple during a deduplication pass.  The
+   keys appear only once in this representation,  followed by a sorted
+   array of heap item pointers.  The deduplication process occurs
+   <quote>lazily</quote>, when a new item is inserted that cannot fit
+   on an existing leaf page.  Deduplication significantly reduces the
+   storage size of indexes where each value (or each distinct set of
+   values) appears several times on average.  This is likely to reduce
+   the amount of I/O required by index scans, which can noticeably
+   improve overall query throughput.  It also reduces the overhead of
+   routine index vacuuming.
+  </para>
+  <para>
+   Workloads that don't benefit from deduplication due to having no
+   duplicate values in indexes will incur a small performance penalty
+   with mixed read-write workloads (unless deduplication is explicitly
+   disabled).  The <literal>deduplicate_items</literal> storage
+   parameter can be used to disable deduplication within individual
+   indexes.  See <xref linkend="sql-createindex-storage-parameters"/>
+   from the <command>CREATE INDEX</command> documentation for details.
+   There is never any performance penalty with read-only workloads,
+   since reading from posting lists is at least as efficient as
+   reading the standard index tuple representation.
+  </para>
+ </sect2>
+
+ <sect2 id="btree-versioning">
+  <title>MVCC Versioning and B-Tree Storage</title>
 
   <para>
-   An introduction to the btree index implementation can be found in
-   <filename>src/backend/access/nbtree/README</filename>.
+   It is sometimes necessary for B-Tree indexes to contain multiple
+   physical tuples for the same logical table row, even in unique
+   indexes.  HOT updated rows avoid the need to store additional
+   physical versions in indexes, but an update that cannot use the HOT
+   optimization must store new physical tuples in
+   <emphasis>all</emphasis> indexes, including indexes with unchanged
+   indexed key values.  Multiple equal physical tuples that are only
+   needed to point to corresponding versions of the same logical table
+   row are common in some applications.
+  </para>
+  <para>
+   Deduplication tends to avoid page splits that are only needed due
+   to a short-term increase in <quote>duplicate</quote> tuples that
+   all point to different versions of the same logical table row.
+   <command>VACUUM</command> or autovacuum will eventually remove dead
+   versions of tuples from every index in any case, but
+   <command>VACUUM</command> usually cannot reverse page splits (in
+   general, a leaf page must be completely empty before
+   <command>VACUUM</command> can <quote>delete</quote> it).  In
+   effect, deduplication delays <quote>version driven</quote> page
+   splits, which may give VACUUM enough time to run and prevent the
+   splits entirely.  Unique indexes make use of deduplication for this
+   reason.  Also, even unique indexes can have a set of
+   <quote>duplicate</quote> rows that are all visible to a given
+   <acronym>MVCC</acronym> snapshot, provided at least one column has
+   a NULL value.  In general, the implementation considers tuples with
+   NULL values to be duplicates for the purposes of deduplication.
+  </para>
+  <para>
+   Unique indexes can only contain non-NULL duplicates because of
+   <quote>version churn</quote>.  The implementation applies a special
+   heuristic when considering whether to attempt deduplication in a
+   unique index.  This heuristic virtually avoids the possibility of a
+   performance penalty in unique indexes.
   </para>
 
+ </sect2>
+
+ <sect2 id="btree-deduplication-limitations">
+  <title>Deduplication Limitations</title>
+
+  <para>
+   In general, the B-Tree implementation treats non-key columns from
+   <literal>INCLUDE</literal> indexes as opaque payload.
+   <literal>INCLUDE</literal> indexes can never use deduplication for
+   this reason.
+  </para>
+  <para>
+   Deduplication can only be used within B-Tree indexes where
+   <emphasis>all</emphasis> columns use a deduplication-safe operator
+   class and collation.  Note that deduplication cannot be used in the
+   following cases:
+  </para>
+  <para>
+   <itemizedlist>
+    <listitem>
+     <para>
+      <type>numeric</type> cannot use deduplication.  In general, a
+      pair of equal <type>numeric</type> datums may still have
+      different <quote>display scales</quote>.  These diferences must
+      be preserved.
+     </para>
+    </listitem>
+
+    <listitem>
+     <para>
+      <type>jsonb</type> cannot use deduplication, since the
+      <type>jsonb</type> B-Tree operator class uses
+      <type>numeric</type> internally.
+     </para>
+    </listitem>
+
+    <listitem>
+     <para>
+      <type>float4</type>, <type>float8</type> and <type>money</type>
+      cannot use deduplication.  Each of these types has
+      representations for both <literal>-0</literal> and
+      <literal>0</literal> that are treated as equal.
+     </para>
+    </listitem>
+
+    <listitem>
+     <para>
+      <type>text</type>, <type>varchar</type>, <type>bpchar</type> and
+      <type>name</type> cannot use deduplication when the collation is
+      a <emphasis>nondeterministic</emphasis> collation.
+     </para>
+    </listitem>
+
+    <listitem>
+     <para>
+      <type>tsvector</type> and <type>enum</type> cannot use
+      deduplication.
+     </para>
+    </listitem>
+
+    <listitem>
+     <para>
+      Container types (such as composite types, arrays, or range
+      types) cannot use deduplication.  This is an
+      implementation-level restriction that may be lifted in a future
+      version of <productname>PostgreSQL</productname>.
+     </para>
+    </listitem>
+   </itemizedlist>
+  </para>
+
+ </sect2>
 </sect1>
 
 </chapter>
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 057a6bb81a..20cdfabd7b 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -928,10 +928,11 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
      nondeterministic collations give a more <quote>correct</quote> behavior,
      especially when considering the full power of Unicode and its many
      special cases, they also have some drawbacks.  Foremost, their use leads
-     to a performance penalty.  Also, certain operations are not possible with
-     nondeterministic collations, such as pattern matching operations.
-     Therefore, they should be used only in cases where they are specifically
-     wanted.
+     to a performance penalty.  Note, in particular, that B-tree cannot use
+     deduplication with indexes that use a nondeterministic collation.  Also,
+     certain operations are not possible with nondeterministic collations,
+     such as pattern matching operations.  Therefore, they should be used
+     only in cases where they are specifically wanted.
     </para>
    </sect3>
   </sect2>
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index ceda48e0fc..28035f1635 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -16561,10 +16561,11 @@ AND
    rows.  Two rows might have a different binary representation even
    though comparisons of the two rows with the equality operator is true.
    The ordering of rows under these comparison operators is deterministic
-   but not otherwise meaningful.  These operators are used internally for
-   materialized views and might be useful for other specialized purposes
-   such as replication but are not intended to be generally useful for
-   writing queries.
+   but not otherwise meaningful.  These operators are used internally
+   for materialized views and might be useful for other specialized
+   purposes such as replication and B-Tree deduplication (see <xref
+   linkend="btree-deduplication"/>).  They are not intended to be
+   generally useful for writing queries, though.
   </para>
   </sect2>
  </sect1>
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index ab362a0dc5..a05e2e6b9c 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -171,6 +171,8 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
         maximum size allowed for the index type, data insertion will fail.
         In any case, non-key columns duplicate data from the index's table
         and bloat the size of the index, thus potentially slowing searches.
+        Furthermore, B-tree deduplication is never used with indexes
+        that have a non-key column.
        </para>
 
        <para>
@@ -393,10 +395,39 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    </variablelist>
 
    <para>
-    B-tree indexes additionally accept this parameter:
+    B-tree indexes also accept these parameters:
    </para>
 
    <variablelist>
+   <varlistentry id="index-reloption-deduplication" xreflabel="deduplicate_items">
+    <term><literal>deduplicate_items</literal>
+     <indexterm>
+      <primary><varname>deduplicate_items</varname></primary>
+      <secondary>storage parameter</secondary>
+     </indexterm>
+    </term>
+    <listitem>
+    <para>
+      Controls usage of the B-tree deduplication technique described
+      in <xref linkend="btree-deduplication"/>.  Set to
+      <literal>ON</literal> or <literal>OFF</literal> to enable or
+      disable the optimization.  (Alternative spellings of
+      <literal>ON</literal> and <literal>OFF</literal> are allowed as
+      described in <xref linkend="config-setting"/>.) The default is
+      <literal>ON</literal>.
+    </para>
+
+    <note>
+     <para>
+      Turning <literal>deduplicate_items</literal> off via
+      <command>ALTER INDEX</command> prevents future insertions from
+      triggering deduplication, but does not in itself make existing
+      posting list tuples use the standard tuple representation.
+     </para>
+    </note>
+    </listitem>
+   </varlistentry>
+
    <varlistentry id="index-reloption-vacuum-cleanup-index-scale-factor" xreflabel="vacuum_cleanup_index_scale_factor">
     <term><literal>vacuum_cleanup_index_scale_factor</literal>
      <indexterm>
@@ -451,9 +482,7 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
      This setting controls usage of the fast update technique described in
      <xref linkend="gin-fast-update"/>.  It is a Boolean parameter:
      <literal>ON</literal> enables fast update, <literal>OFF</literal> disables it.
-     (Alternative spellings of <literal>ON</literal> and <literal>OFF</literal> are
-     allowed as described in <xref linkend="config-setting"/>.)  The
-     default is <literal>ON</literal>.
+     The default is <literal>ON</literal>.
     </para>
 
     <note>
@@ -805,6 +834,13 @@ CREATE UNIQUE INDEX title_idx ON films (title) INCLUDE (director, rating);
 </programlisting>
   </para>
 
+  <para>
+   To create a B-Tree index with deduplication disabled:
+<programlisting>
+CREATE INDEX title_idx ON films (title) WITH (deduplicate_items = off);
+</programlisting>
+  </para>
+
   <para>
    To create an index on the expression <literal>lower(title)</literal>,
    allowing efficient case-insensitive searches:
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index fb6d86a269..5e97723dab 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -96,6 +96,7 @@ SELECT * FROM attmp;
 (1 row)
 
 CREATE INDEX attmp_idx ON attmp (a, (d + e), b);
+NOTICE:  index "attmp_idx" cannot use deduplication
 ALTER INDEX attmp_idx ALTER COLUMN 0 SET STATISTICS 1000;
 ERROR:  column number must be in range from 1 to 32767
 LINE 1: ALTER INDEX attmp_idx ALTER COLUMN 0 SET STATISTICS 1000;
@@ -640,6 +641,7 @@ DROP TABLE PKTABLE;
 -- On the other hand, this should work because int implicitly promotes to
 -- numeric, and we allow promotion on the FK side
 CREATE TEMP TABLE PKTABLE (ptest1 numeric PRIMARY KEY);
+NOTICE:  index "pktable_pkey" cannot use deduplication
 INSERT INTO PKTABLE VALUES(42);
 CREATE TEMP TABLE FKTABLE (ftest1 int);
 ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1) references pktable;
@@ -2028,6 +2030,8 @@ Indexes:
     "at_part_2_b_idx" btree (b)
 
 alter table at_partitioned alter column b type numeric using b::numeric;
+NOTICE:  index "at_part_1_b_idx" cannot use deduplication
+NOTICE:  index "at_part_2_b_idx" cannot use deduplication
 \d at_part_1
              Table "public.at_part_1"
  Column |  Type   | Collation | Nullable | Default 
diff --git a/src/test/regress/expected/arrays.out b/src/test/regress/expected/arrays.out
index c730563f03..49e62b73dc 100644
--- a/src/test/regress/expected/arrays.out
+++ b/src/test/regress/expected/arrays.out
@@ -1297,6 +1297,7 @@ SELECT -1 != ALL(ARRAY(SELECT NULLIF(g.i, 900) FROM generate_series(1,1000) g(i)
 
 -- test indexes on arrays
 create temp table arr_tbl (f1 int[] unique);
+NOTICE:  index "arr_tbl_f1_key" cannot use deduplication
 insert into arr_tbl values ('{1,2,3}');
 insert into arr_tbl values ('{1,2}');
 -- failure expected:
diff --git a/src/test/regress/expected/btree_index.out b/src/test/regress/expected/btree_index.out
index f567117a46..1646deb092 100644
--- a/src/test/regress/expected/btree_index.out
+++ b/src/test/regress/expected/btree_index.out
@@ -200,7 +200,7 @@ reset enable_indexscan;
 reset enable_bitmapscan;
 -- Also check LIKE optimization with binary-compatible cases
 create temp table btree_bpchar (f1 text collate "C");
-create index on btree_bpchar(f1 bpchar_ops);
+create index on btree_bpchar(f1 bpchar_ops) WITH (deduplicate_items=on);
 insert into btree_bpchar values ('foo'), ('fool'), ('bar'), ('quux');
 -- doesn't match index:
 explain (costs off)
@@ -266,6 +266,24 @@ select * from btree_bpchar where f1::bpchar like 'foo%';
  fool
 (2 rows)
 
+-- get test coverage for "single value" deduplication strategy:
+insert into btree_bpchar select 'foo' from generate_series(1,1500);
+--
+-- Perform unique checking, with and without the use of deduplication
+--
+CREATE TABLE dedup_unique_test_table (a int) WITH (autovacuum_enabled=false);
+CREATE UNIQUE INDEX dedup_unique ON dedup_unique_test_table (a) WITH (deduplicate_items=on);
+CREATE UNIQUE INDEX plain_unique ON dedup_unique_test_table (a) WITH (deduplicate_items=off);
+-- Generate enough garbage tuples in index to ensure that even the unique index
+-- with deduplication enabled has to check multiple leaf pages during unique
+-- checking (at least with a BLCKSZ of 8192 or less)
+DO $$
+BEGIN
+    FOR r IN 1..1350 LOOP
+        DELETE FROM dedup_unique_test_table;
+        INSERT INTO dedup_unique_test_table SELECT 1;
+    END LOOP;
+END$$;
 --
 -- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 2b86ce9028..6c9fa043f6 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1468,6 +1468,7 @@ SELECT x, row_number() OVER (ORDER BY x), rank() OVER (ORDER BY x) FROM test3ci
 (4 rows)
 
 CREATE UNIQUE INDEX ON test1ci (x);  -- ok
+NOTICE:  index "test1ci_x_idx" cannot use deduplication
 INSERT INTO test1ci VALUES ('ABC');  -- error
 ERROR:  duplicate key value violates unique constraint "test1ci_x_idx"
 DETAIL:  Key (x)=(ABC) already exists.
@@ -1586,6 +1587,7 @@ SELECT x, row_number() OVER (ORDER BY x), rank() OVER (ORDER BY x) FROM test3bpc
 (4 rows)
 
 CREATE UNIQUE INDEX ON test1bpci (x);  -- ok
+NOTICE:  index "test1bpci_x_idx" cannot use deduplication
 INSERT INTO test1bpci VALUES ('ABC');  -- error
 ERROR:  duplicate key value violates unique constraint "test1bpci_x_idx"
 DETAIL:  Key (x)=(ABC) already exists.
@@ -1763,6 +1765,7 @@ SELECT * FROM test10fk;
 
 -- PK is case-insensitive, FK is case-sensitive
 CREATE TABLE test11pk (x text COLLATE case_insensitive PRIMARY KEY);
+NOTICE:  index "test11pk_pkey" cannot use deduplication
 INSERT INTO test11pk VALUES ('abc'), ('def'), ('ghi');
 CREATE TABLE test11fk (x text COLLATE case_sensitive REFERENCES test11pk (x) ON UPDATE CASCADE ON DELETE CASCADE);
 INSERT INTO test11fk VALUES ('abc');  -- ok
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 6ddf3a63c3..4dc00340ba 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -42,6 +42,7 @@ CREATE INDEX bt_i4_index ON bt_i4_heap USING btree (seqno int4_ops);
 CREATE INDEX bt_name_index ON bt_name_heap USING btree (seqno name_ops);
 CREATE INDEX bt_txt_index ON bt_txt_heap USING btree (seqno text_ops);
 CREATE INDEX bt_f8_index ON bt_f8_heap USING btree (seqno float8_ops);
+NOTICE:  index "bt_f8_index" cannot use deduplication
 --
 -- BTREE partial indices
 --
@@ -1357,8 +1358,11 @@ DROP TABLE covering_index_heap;
 -- tables that already contain data.
 --
 create unique index hash_f8_index_1 on hash_f8_heap(abs(random));
+NOTICE:  index "hash_f8_index_1" cannot use deduplication
 create unique index hash_f8_index_2 on hash_f8_heap((seqno + 1), random);
+NOTICE:  index "hash_f8_index_2" cannot use deduplication
 create unique index hash_f8_index_3 on hash_f8_heap(random) where seqno > 1000;
+NOTICE:  index "hash_f8_index_3" cannot use deduplication
 --
 -- Try some concurrent index builds
 --
diff --git a/src/test/regress/expected/domain.out b/src/test/regress/expected/domain.out
index 2a033a6e11..6d05aba216 100644
--- a/src/test/regress/expected/domain.out
+++ b/src/test/regress/expected/domain.out
@@ -202,6 +202,7 @@ drop domain dia;
 create type comptype as (r float8, i float8);
 create domain dcomptype as comptype;
 create table dcomptable (d1 dcomptype unique);
+NOTICE:  index "dcomptable_d1_key" cannot use deduplication
 insert into dcomptable values (row(1,2)::dcomptype);
 insert into dcomptable values (row(3,4)::comptype);
 insert into dcomptable values (row(1,2)::dcomptype);  -- fail on uniqueness
@@ -315,6 +316,7 @@ NOTICE:  drop cascades to type dcomptype
 create type comptype as (r float8, i float8);
 create domain dcomptypea as comptype[];
 create table dcomptable (d1 dcomptypea unique);
+NOTICE:  index "dcomptable_d1_key" cannot use deduplication
 insert into dcomptable values (array[row(1,2)]::dcomptypea);
 insert into dcomptable values (array[row(3,4), row(5,6)]::comptype[]);
 insert into dcomptable values (array[row(7,8)::comptype, row(9,10)::comptype]);
diff --git a/src/test/regress/expected/enum.out b/src/test/regress/expected/enum.out
index dffff88928..96e722c5b0 100644
--- a/src/test/regress/expected/enum.out
+++ b/src/test/regress/expected/enum.out
@@ -310,6 +310,7 @@ SET enable_bitmapscan = off;
 -- Btree index / opclass with the various operators
 --
 CREATE UNIQUE INDEX enumtest_btree ON enumtest USING btree (col);
+NOTICE:  index "enumtest_btree" cannot use deduplication
 SELECT * FROM enumtest WHERE col = 'orange';
   col   
 --------
@@ -538,6 +539,7 @@ DROP FUNCTION echo_me(rainbow);
 -- RI triggers on enum types
 --
 CREATE TABLE enumtest_parent (id rainbow PRIMARY KEY);
+NOTICE:  index "enumtest_parent_pkey" cannot use deduplication
 CREATE TABLE enumtest_child (parent rainbow REFERENCES enumtest_parent);
 INSERT INTO enumtest_parent VALUES ('red');
 INSERT INTO enumtest_child VALUES ('red');
diff --git a/src/test/regress/expected/foreign_key.out b/src/test/regress/expected/foreign_key.out
index 9e1d749601..3da471e153 100644
--- a/src/test/regress/expected/foreign_key.out
+++ b/src/test/regress/expected/foreign_key.out
@@ -812,6 +812,7 @@ DROP TABLE PKTABLE;
 -- On the other hand, this should work because int implicitly promotes to
 -- numeric, and we allow promotion on the FK side
 CREATE TABLE PKTABLE (ptest1 numeric PRIMARY KEY);
+NOTICE:  index "pktable_pkey" cannot use deduplication
 INSERT INTO PKTABLE VALUES(42);
 CREATE TABLE FKTABLE (ftest1 int REFERENCES pktable);
 -- Check it actually works
@@ -1102,6 +1103,8 @@ CREATE TEMP TABLE pktable (
         id3     REAL UNIQUE,
         UNIQUE(id1, id2, id3)
 );
+NOTICE:  index "pktable_id3_key" cannot use deduplication
+NOTICE:  index "pktable_id1_id2_id3_key" cannot use deduplication
 CREATE TEMP TABLE fktable (
         x1      INT4 REFERENCES pktable(id1),
         x2      VARCHAR(4) REFERENCES pktable(id2),
@@ -1499,6 +1502,7 @@ drop table pktable2, fktable2;
 -- Test keys that "look" different but compare as equal
 --
 create table pktable2 (a float8, b float8, primary key (a, b));
+NOTICE:  index "pktable2_pkey" cannot use deduplication
 create table fktable2 (x float8, y float8, foreign key (x, y) references pktable2 (a, b) on update cascade);
 insert into pktable2 values ('-0', '-0');
 insert into fktable2 values ('-0', '-0');
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index ec1d4eaef4..213fdf59ad 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -74,6 +74,7 @@ CREATE TABLE idxpart1 PARTITION OF idxpart FOR VALUES FROM (MINVALUE) TO (MAXVAL
 CREATE INDEX partidx_abc_idx ON idxpart (a, b, c);
 INSERT INTO idxpart (a, b, c) SELECT i, i, i FROM generate_series(1, 50) i;
 ALTER TABLE idxpart ALTER COLUMN c TYPE numeric;
+NOTICE:  index "idxpart1_a_b_c_idx" cannot use deduplication
 DROP TABLE idxpart;
 -- If a table without index is attached as partition to a table with
 -- an index, the index is automatically created
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 761376b007..0963221392 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -2695,6 +2695,7 @@ begin;
 create type mycomptype as (id int, v bigint);
 create temp table tidv (idv mycomptype);
 create index on tidv (idv);
+NOTICE:  index "tidv_idv_idx" cannot use deduplication
 explain (costs off)
 select a.idv, b.idv from tidv a, tidv b where a.idv = b.idv;
                         QUERY PLAN                        
diff --git a/src/test/regress/expected/jsonb.out b/src/test/regress/expected/jsonb.out
index a70cd0b7c1..41f842de91 100644
--- a/src/test/regress/expected/jsonb.out
+++ b/src/test/regress/expected/jsonb.out
@@ -3263,6 +3263,7 @@ DROP INDEX jidx;
 DROP INDEX jidx_array;
 -- btree
 CREATE INDEX jidx ON testjsonb USING btree (j);
+NOTICE:  index "jidx" cannot use deduplication
 SET enable_seqscan = off;
 SELECT count(*) FROM testjsonb WHERE j > '{"p":1}';
  count 
diff --git a/src/test/regress/expected/matview.out b/src/test/regress/expected/matview.out
index d0121a7b0b..00394f1278 100644
--- a/src/test/regress/expected/matview.out
+++ b/src/test/regress/expected/matview.out
@@ -77,6 +77,7 @@ CREATE MATERIALIZED VIEW mvtest_tmm AS SELECT sum(totamt) AS grandtot FROM mvtes
 CREATE MATERIALIZED VIEW mvtest_tvmm AS SELECT sum(totamt) AS grandtot FROM mvtest_tvm;
 CREATE UNIQUE INDEX mvtest_tvmm_expr ON mvtest_tvmm ((grandtot > 0));
 CREATE UNIQUE INDEX mvtest_tvmm_pred ON mvtest_tvmm (grandtot) WHERE grandtot < 0;
+NOTICE:  index "mvtest_tvmm_pred" cannot use deduplication
 CREATE VIEW mvtest_tvv AS SELECT sum(totamt) AS grandtot FROM mvtest_tv;
 EXPLAIN (costs off)
   CREATE MATERIALIZED VIEW mvtest_tvvm AS SELECT * FROM mvtest_tvv;
@@ -92,6 +93,7 @@ CREATE MATERIALIZED VIEW mvtest_tvvm AS SELECT * FROM mvtest_tvv;
 CREATE VIEW mvtest_tvvmv AS SELECT * FROM mvtest_tvvm;
 CREATE MATERIALIZED VIEW mvtest_bb AS SELECT * FROM mvtest_tvvmv;
 CREATE INDEX mvtest_aa ON mvtest_bb (grandtot);
+NOTICE:  index "mvtest_aa" cannot use deduplication
 -- check that plans seem reasonable
 \d+ mvtest_tvm
                            Materialized view "public.mvtest_tvm"
@@ -249,6 +251,7 @@ REFRESH MATERIALIZED VIEW CONCURRENTLY mvtest_tvmm;
 ERROR:  cannot refresh materialized view "public.mvtest_tvmm" concurrently
 HINT:  Create a unique index with no WHERE clause on one or more columns of the materialized view.
 REFRESH MATERIALIZED VIEW mvtest_tvmm;
+NOTICE:  index "mvtest_tvmm_pred" cannot use deduplication
 REFRESH MATERIALIZED VIEW mvtest_tvvm;
 EXPLAIN (costs off)
   SELECT * FROM mvtest_tmm;
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index 242f817163..1ab9890e1e 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -222,6 +222,7 @@ create index on gexec_test(a)
 create index on gexec_test(b)
 create index on gexec_test(c)
 create index on gexec_test(d)
+NOTICE:  index "gexec_test_d_idx" cannot use deduplication
 -- \gexec should work in FETCH_COUNT mode too
 -- (though the fetch limit applies to the executed queries not the meta query)
 \set FETCH_COUNT 1
diff --git a/src/test/regress/expected/rangetypes.out b/src/test/regress/expected/rangetypes.out
index 220f2d96cb..a4c4f57514 100644
--- a/src/test/regress/expected/rangetypes.out
+++ b/src/test/regress/expected/rangetypes.out
@@ -186,6 +186,7 @@ select '(a,a)'::textrange;
 --
 CREATE TABLE numrange_test (nr NUMRANGE);
 create index numrange_test_btree on numrange_test(nr);
+NOTICE:  index "numrange_test_btree" cannot use deduplication
 INSERT INTO numrange_test VALUES('[,)');
 INSERT INTO numrange_test VALUES('[3,]');
 INSERT INTO numrange_test VALUES('[, 5)');
@@ -589,6 +590,7 @@ DROP TABLE numrange_test2;
 --
 CREATE TABLE textrange_test (tr textrange);
 create index textrange_test_btree on textrange_test(tr);
+NOTICE:  index "textrange_test_btree" cannot use deduplication
 INSERT INTO textrange_test VALUES('[,)');
 INSERT INTO textrange_test VALUES('["a",]');
 INSERT INTO textrange_test VALUES('[,"q")');
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index 61237dfb11..deac6148f1 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -438,6 +438,7 @@ SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE
 
 -- check change of column type doesn't break it
 ALTER TABLE functional_dependencies ALTER COLUMN c TYPE numeric;
+NOTICE:  index "fdeps_abc_idx" cannot use deduplication
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = 1 AND b = ''1'' AND c = 1');
  estimated | actual 
 -----------+--------
diff --git a/src/test/regress/expected/transactions.out b/src/test/regress/expected/transactions.out
index 1b03310029..2099bf0abc 100644
--- a/src/test/regress/expected/transactions.out
+++ b/src/test/regress/expected/transactions.out
@@ -562,6 +562,7 @@ exception
   when division_by_zero then return 0;
 end$$ language plpgsql volatile;
 create table revalidate_bug (c float8 unique);
+NOTICE:  index "revalidate_bug_c_key" cannot use deduplication
 insert into revalidate_bug values (1);
 insert into revalidate_bug values (inverse(0));
 drop table revalidate_bug;
diff --git a/src/test/regress/expected/tsearch.out b/src/test/regress/expected/tsearch.out
index fe1cd9deb0..49e06037fb 100644
--- a/src/test/regress/expected/tsearch.out
+++ b/src/test/regress/expected/tsearch.out
@@ -1329,6 +1329,7 @@ SELECT COUNT(*) FROM test_tsquery WHERE keyword >  'new & york';
 (1 row)
 
 CREATE UNIQUE INDEX bt_tsq ON test_tsquery (keyword);
+NOTICE:  index "bt_tsq" cannot use deduplication
 SET enable_seqscan=OFF;
 SELECT COUNT(*) FROM test_tsquery WHERE keyword <  'new & york';
  count 
diff --git a/src/test/regress/sql/btree_index.sql b/src/test/regress/sql/btree_index.sql
index 558dcae0ec..6e14b935ce 100644
--- a/src/test/regress/sql/btree_index.sql
+++ b/src/test/regress/sql/btree_index.sql
@@ -86,7 +86,7 @@ reset enable_bitmapscan;
 -- Also check LIKE optimization with binary-compatible cases
 
 create temp table btree_bpchar (f1 text collate "C");
-create index on btree_bpchar(f1 bpchar_ops);
+create index on btree_bpchar(f1 bpchar_ops) WITH (deduplicate_items=on);
 insert into btree_bpchar values ('foo'), ('fool'), ('bar'), ('quux');
 -- doesn't match index:
 explain (costs off)
@@ -103,6 +103,26 @@ explain (costs off)
 select * from btree_bpchar where f1::bpchar like 'foo%';
 select * from btree_bpchar where f1::bpchar like 'foo%';
 
+-- get test coverage for "single value" deduplication strategy:
+insert into btree_bpchar select 'foo' from generate_series(1,1500);
+
+--
+-- Perform unique checking, with and without the use of deduplication
+--
+CREATE TABLE dedup_unique_test_table (a int) WITH (autovacuum_enabled=false);
+CREATE UNIQUE INDEX dedup_unique ON dedup_unique_test_table (a) WITH (deduplicate_items=on);
+CREATE UNIQUE INDEX plain_unique ON dedup_unique_test_table (a) WITH (deduplicate_items=off);
+-- Generate enough garbage tuples in index to ensure that even the unique index
+-- with deduplication enabled has to check multiple leaf pages during unique
+-- checking (at least with a BLCKSZ of 8192 or less)
+DO $$
+BEGIN
+    FOR r IN 1..1350 LOOP
+        DELETE FROM dedup_unique_test_table;
+        INSERT INTO dedup_unique_test_table SELECT 1;
+    END LOOP;
+END$$;
+
 --
 -- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
-- 
2.17.1

#134

Anastasia Lubennikova

a.lubennikova@postgrespro.ru

almost 6 years ago

In reply to: Peter Geoghegan (#133)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On 14.02.2020 05:57, Peter Geoghegan wrote:

Attached is v33, which adds the last piece we need: opclass
infrastructure that tells nbtree whether or not deduplication can be
applied safely. This is based on work by Anastasia that was shared
with me privately.

Thank you for this work. I've looked through the patches and they seem
to be ready for commit.
I haven't yet read recent documentation and readme changes, so maybe
I'll send some more feedback tomorrow.

New opclass proc
================

In general, supporting deduplication is the rule for B-Tree opclasses,
rather than the exception. Most can use the generic
btequalimagedatum() routine as their support function 4, which
unconditionally indicates that deduplication is safe. There is a new
test that tries to catch opclasses that omitted to do this. Here is
the opr_sanity.out changes added by the first patch:

-- Almost all Btree opclasses can use the generic btequalimagedatum function
-- as their equalimage proc (support function 4). Look for opclasses that
-- don't do so; newly added Btree opclasses will usually be able to support
-- deduplication with little trouble.
SELECT amproc::regproc AS proc, opf.opfname AS opfamily_name,
opc.opcname AS opclass_name, opc.opcintype::regtype AS opcintype
FROM pg_am am
JOIN pg_opclass opc ON opc.opcmethod = am.oid
JOIN pg_opfamily opf ON opc.opcfamily = opf.oid
LEFT JOIN pg_amproc ON amprocfamily = opf.oid AND
amproclefttype = opcintype AND
amprocnum = 4
WHERE am.amname = 'btree' AND
amproc IS DISTINCT FROM 'btequalimagedatum'::regproc
ORDER BY amproc::regproc::text, opfamily_name, opclass_name;
proc | opfamily_name | opclass_name | opcintype
-------------------+------------------+------------------+------------------
bpchar_equalimage | bpchar_ops | bpchar_ops | character
btnameequalimage | text_ops | name_ops | name
bttextequalimage | text_ops | text_ops | text
bttextequalimage | text_ops | varchar_ops | text
| array_ops | array_ops | anyarray
| enum_ops | enum_ops | anyenum
| float_ops | float4_ops | real
| float_ops | float8_ops | double precision
| jsonb_ops | jsonb_ops | jsonb
| money_ops | money_ops | money
| numeric_ops | numeric_ops | numeric
| range_ops | range_ops | anyrange
| record_image_ops | record_image_ops | record
| record_ops | record_ops | record
| tsquery_ops | tsquery_ops | tsquery
| tsvector_ops | tsvector_ops | tsvector
(16 rows)

Is there any specific reason, why we need separate btnameequalimage,
bpchar_equalimage and bttextequalimage functions?
As far as I see, they have the same implementation.

Since using deduplication is supposed to pretty much be the norm from
now on, it seemed like it might make sense to add a NOTICE about it
during CREATE INDEX -- a notice letting the user know that it isn't
being used due to a lack of opclass support:

regression=# create table foo(bar numeric);
CREATE TABLE
regression=# create index on foo(bar);
NOTICE: index "foo_bar_idx" cannot use deduplication
CREATE INDEX

Note that this NOTICE isn't seen with an INCLUDE index, since that's
expected to not support deduplication.

I have a feeling that not everybody will like this, which is why I'm
pointing it out.

Thoughts?

I would simply move it to debug level for all cases. Since from user's
perspective it doesn't differ that much from the case where
deduplication is applicable in general, but not very efficient due to
data distribution.
I also noticed that this is not consistent with ALTER index. For
example, alter index idx_n set (deduplicate_items =true); won't show any
message about deduplication.

I've tried several combinations with an index on a numeric column:

1) postgres=# create index idx_nd on tbl (n) with (deduplicate_items =
true);
NOTICE: index "idx_nd" cannot use deduplication
CREATE INDEX

Here the message seems appropriate. I don't think, we should restrict
creation of the index even when deduplicate_items parameter is set
explicitly, rather we may warn the user that it won't be efficient.

2) postgres=# create index idx_n on tbl (n) with (deduplicate_items =
false);
NOTICE: index "idx_n" cannot use deduplication
CREATE INDEX

In this case the message seems slightly strange to me.
Why should we show a notice about the fact that deduplication is not
possible if that is exactly what was requested?

3)
postgres=# create index idx on tbl (n);
NOTICE: index "idx" cannot use deduplication

In my opinion, this message is too specific for default behavior. It
exposes internal details without explanation and may look to user like
something went wrong.

#135

Peter Geoghegan

pg@bowt.ie

almost 6 years ago

In reply to: Anastasia Lubennikova (#134)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Wed, Feb 19, 2020 at 8:14 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

Thank you for this work. I've looked through the patches and they seem
to be ready for commit.
I haven't yet read recent documentation and readme changes, so maybe
I'll send some more feedback tomorrow.

Great.

Is there any specific reason, why we need separate btnameequalimage,
bpchar_equalimage and bttextequalimage functions?
As far as I see, they have the same implementation.

Not really. This approach allows us to reverse the decision to enable
deduplication in a point release, which is theoretically useful. OTOH,
if that's so important, why not have many more support function 4
implementations (one per opclass)?

I suspect that we would just disable deduplication in a hard-coded
fashion if we needed to disable it due to some issue that transpired.
For example, we could do this by modifying _bt_allequalimage() itself.

I would simply move it to debug level for all cases. Since from user's
perspective it doesn't differ that much from the case where
deduplication is applicable in general, but not very efficient due to
data distribution.

I was more concerned about cases where the user would really like to
use deduplication, but wants to make sure that it gets used. And
doesn't want to install pageinspect to find out.

I also noticed that this is not consistent with ALTER index. For
example, alter index idx_n set (deduplicate_items =true); won't show any
message about deduplication.

But that's a change in the user's preference. Not a change in whether
or not it's safe in principle.

In my opinion, this message is too specific for default behavior. It
exposes internal details without explanation and may look to user like
something went wrong.

You're probably right about that. I just wish that there was some way
of showing the same information that was discoverable, and didn't
require the use of pageinspect. If I make it a DEBUG1 message, then it
cannot really be documented.

--
Peter Geoghegan

#136

Peter Geoghegan

pg@bowt.ie

almost 6 years ago

In reply to: Peter Geoghegan (#135)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Wed, Feb 19, 2020 at 11:16 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Feb 19, 2020 at 8:14 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

Thank you for this work. I've looked through the patches and they seem
to be ready for commit.
I haven't yet read recent documentation and readme changes, so maybe
I'll send some more feedback tomorrow.

Great.

I should add: I plan to commit the patch within the next 7 days.

I believe that the design of deduplication itself is solid; it has
many more strengths than weaknesses. It works in a way that
complements the existing approach to page splits. The optimization can
easily be turned off (and easily turned back on again).
contrib/amcheck can detect almost any possible form of corruption that
could affect a B-Tree index that has posting list tuples. I have spent
months microbenchmarking every little aspect of this patch in
isolation. I've also spent a lot of time on conventional benchmarking.

It seems quite possible that somebody won't like some aspect of the
user interface. I am more than willing to work with other contributors
on any issue in that area that comes to light. I don't see any point
in waiting for other hackers to speak up before the patch is
committed, though. Anastasia posted the first version of this patch in
August of 2015, and there have been over 30 revisions of it since the
project was revived in 2019. Everyone has been given ample opportunity
to offer input.

--
Peter Geoghegan

#137

Anastasia Lubennikova

a.lubennikova@postgrespro.ru

almost 6 years ago

In reply to: Peter Geoghegan (#135)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On 19.02.2020 22:16, Peter Geoghegan wrote:

On Wed, Feb 19, 2020 at 8:14 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

Thank you for this work. I've looked through the patches and they seem
to be ready for commit.
I haven't yet read recent documentation and readme changes, so maybe
I'll send some more feedback tomorrow.

The only thing I found is a typo in the comment

+ int nhtids; /* Number of heap TIDs in nhtids array */

s/nhtids/htids

I don't think this patch really needs more nitpicking )

In my opinion, this message is too specific for default behavior. It
exposes internal details without explanation and may look to user like
something went wrong.

You're probably right about that. I just wish that there was some way
of showing the same information that was discoverable, and didn't
require the use of pageinspect. If I make it a DEBUG1 message, then it
cannot really be documented.

User can discover this with a complex query to pg_index and pg_opclass.
To simplify this, we can probably wrap this into function or some field
in pg_indexes.
Anyway, I would wait for feedback from pre-release testers.

#138

Peter Geoghegan

pg@bowt.ie

almost 6 years ago

In reply to: Anastasia Lubennikova (#137)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Thu, Feb 20, 2020 at 7:38 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

I don't think this patch really needs more nitpicking )

But when has that ever stopped it? :-)

User can discover this with a complex query to pg_index and pg_opclass.
To simplify this, we can probably wrap this into function or some field
in pg_indexes.

A function isn't a real user interface, though -- it probably won't be noticed.

I think that there is a good chance that it just won't matter. The
number of indexes that won't be able to support deduplication will be
very small in practice. The important exceptions are INCLUDE indexes
and nondeterministic collations. These exceptions make sense
intuitively, and will be documented as limitations of those other
features.

The numeric/float thing doesn't really make intuitive sense, and
numeric is an important datatype. Still, numeric columns and float
columns seem to rarely get indexed. That just leaves container type
opclasses, like anyarray and jsonb.

Nobody cares about indexing container types with a B-Tree index, with
the possible exception of expression indexes on a jsonb column. I
don't see a way around that, but it doesn't seem all that important.
Again, applications are unlikely to have more than one or two of
those. The *overall* space saving will probably be almost as good as
if the limitation did not exist.

Anyway, I would wait for feedback from pre-release testers.

Right -- let's delay making a final decision on it. Just like the
decision to enable it by default. It will work this way in the
committed version, but that isn't supposed to be the final word on it.

--
Peter Geoghegan

#139

Peter Geoghegan

pg@bowt.ie

almost 6 years ago

In reply to: Peter Geoghegan (#138)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Thu, Feb 20, 2020 at 10:58 AM Peter Geoghegan <pg@bowt.ie> wrote:

I think that there is a good chance that it just won't matter. The
number of indexes that won't be able to support deduplication will be
very small in practice. The important exceptions are INCLUDE indexes
and nondeterministic collations. These exceptions make sense
intuitively, and will be documented as limitations of those other
features.

I wasn't clear about the implication of what I was saying here, which
is: I will make the NOTICE a DEBUG1 message, and leave everything else
as-is in the initial committed version.

--
Peter Geoghegan

#140

Peter Geoghegan

pg@bowt.ie

almost 6 years ago

In reply to: Peter Geoghegan (#139)

4 attachment(s)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Thu, Feb 20, 2020 at 12:59 PM Peter Geoghegan <pg@bowt.ie> wrote:

I wasn't clear about the implication of what I was saying here, which
is: I will make the NOTICE a DEBUG1 message, and leave everything else
as-is in the initial committed version.

Attached is v34, which has this change. My plan is to commit something
very close to this on Wednesday morning (barring any objections).

Other changes:

* Now, equalimage functions take a pg_type OID argument, allowing us
to reuse the same generic pg_proc-wise function across many of the
operator classes from the core distribution.

* Rewrote the docs for equalimage functions in the 0001-* patch.

* Lots of copy-editing of the "Implementation" section of the B-Tree
doc chapter, most of which is about deduplication specifically.

--
Peter Geoghegan

Attachments:

v34-0003-Teach-pageinspect-about-nbtree-posting-lists.patchapplication/octet-stream; name=v34-0003-Teach-pageinspect-about-nbtree-posting-lists.patchDownload

From 22d5a2f78a4c2f4ff2ea5bddd4743cf483a258ab Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 16 Feb 2020 01:16:02 -0800
Subject: [PATCH v34 3/4] Teach pageinspect about nbtree posting lists.

Add a column for posting list TIDs to bt_page_items().  Also add a
column that displays a single heap TID value for each tuple, regardless
of whether or not "ctid" is used for heap TID.  In the case of posting
list tuples, the value is the lowest heap TID in the posting list.
Arguably I should have done this when commit dd299df8 went in, since
that added a pivot tuple representation that could have a heap TID but
didn't use ctid for that purpose.

Also add a boolean column that displays the LP_DEAD bit value for each
non-pivot tuple.

No version bump for the pageinspect extension, since there hasn't been a
stable release since the last version bump (see commit 58b4cb30).
---
 contrib/pageinspect/btreefuncs.c              | 119 +++++++++++++++---
 contrib/pageinspect/expected/btree.out        |   7 ++
 contrib/pageinspect/pageinspect--1.7--1.8.sql |  53 ++++++++
 doc/src/sgml/pageinspect.sgml                 |  83 ++++++------
 4 files changed, 207 insertions(+), 55 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 564c818558..f4aac890f5 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -31,9 +31,11 @@
 #include "access/relation.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_am.h"
+#include "catalog/pg_type.h"
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "pageinspect.h"
+#include "utils/array.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
 #include "utils/varlena.h"
@@ -45,6 +47,8 @@ PG_FUNCTION_INFO_V1(bt_page_stats);
 
 #define IS_INDEX(r) ((r)->rd_rel->relkind == RELKIND_INDEX)
 #define IS_BTREE(r) ((r)->rd_rel->relam == BTREE_AM_OID)
+#define DatumGetItemPointer(X)	 ((ItemPointer) DatumGetPointer(X))
+#define ItemPointerGetDatum(X)	 PointerGetDatum(X)
 
 /* note: BlockNumber is unsigned, hence can't be negative */
 #define CHECK_RELATION_BLOCK_RANGE(rel, blkno) { \
@@ -243,6 +247,9 @@ struct user_args
 {
 	Page		page;
 	OffsetNumber offset;
+	bool		leafpage;
+	bool		rightmost;
+	TupleDesc	tupd;
 };
 
 /*-------------------------------------------------------
@@ -252,17 +259,25 @@ struct user_args
  * ------------------------------------------------------
  */
 static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, struct user_args *uargs)
 {
-	char	   *values[6];
+	Page		page = uargs->page;
+	OffsetNumber offset = uargs->offset;
+	bool		leafpage = uargs->leafpage;
+	bool		rightmost = uargs->rightmost;
+	bool		pivotoffset;
+	Datum		values[9];
+	bool		nulls[9];
 	HeapTuple	tuple;
 	ItemId		id;
 	IndexTuple	itup;
 	int			j;
 	int			off;
 	int			dlen;
-	char	   *dump;
+	char	   *dump,
+			   *datacstring;
 	char	   *ptr;
+	ItemPointer htid;
 
 	id = PageGetItemId(page, offset);
 
@@ -272,18 +287,27 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	itup = (IndexTuple) PageGetItem(page, id);
 
 	j = 0;
-	values[j++] = psprintf("%d", offset);
-	values[j++] = psprintf("(%u,%u)",
-						   ItemPointerGetBlockNumberNoCheck(&itup->t_tid),
-						   ItemPointerGetOffsetNumberNoCheck(&itup->t_tid));
-	values[j++] = psprintf("%d", (int) IndexTupleSize(itup));
-	values[j++] = psprintf("%c", IndexTupleHasNulls(itup) ? 't' : 'f');
-	values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
+	memset(nulls, 0, sizeof(nulls));
+	values[j++] = DatumGetInt16(offset);
+	values[j++] = ItemPointerGetDatum(&itup->t_tid);
+	values[j++] = Int32GetDatum((int) IndexTupleSize(itup));
+	values[j++] = BoolGetDatum(IndexTupleHasNulls(itup));
+	values[j++] = BoolGetDatum(IndexTupleHasVarwidths(itup));
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
 	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+
+	/*
+	 * Make sure that "data" column does not include posting list or pivot
+	 * tuple representation of heap TID
+	 */
+	if (BTreeTupleIsPosting(itup))
+		dlen -= IndexTupleSize(itup) - BTreeTupleGetPostingOffset(itup);
+	else if (BTreeTupleIsPivot(itup) && BTreeTupleGetHeapTID(itup) != NULL)
+		dlen -= MAXALIGN(sizeof(ItemPointerData));
+
 	dump = palloc0(dlen * 3 + 1);
-	values[j] = dump;
+	datacstring = dump;
 	for (off = 0; off < dlen; off++)
 	{
 		if (off > 0)
@@ -291,8 +315,57 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 		sprintf(dump, "%02x", *(ptr + off) & 0xff);
 		dump += 2;
 	}
+	values[j++] = CStringGetTextDatum(datacstring);
+	pfree(datacstring);
 
-	tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
+	/*
+	 * Avoid indicating that pivot tuple from !heapkeyspace index (which won't
+	 * have v4+ status bit set) is dead or has a heap TID -- that can only
+	 * happen with non-pivot tuples.  (Most backend code can use the
+	 * heapkeyspace field from the metapage to figure out which representation
+	 * to expect, but we have to be a bit creative here.)
+	 */
+	pivotoffset = (!leafpage || (!rightmost && offset == P_HIKEY));
+
+	/* LP_DEAD status bit */
+	if (!pivotoffset)
+		values[j++] = BoolGetDatum(ItemIdIsDead(id));
+	else
+		nulls[j++] = true;
+
+	htid = BTreeTupleGetHeapTID(itup);
+	if (pivotoffset && !BTreeTupleIsPivot(itup))
+		htid = NULL;
+
+	if (htid)
+		values[j++] = ItemPointerGetDatum(htid);
+	else
+		nulls[j++] = true;
+
+	if (BTreeTupleIsPosting(itup))
+	{
+		/* build an array of item pointers */
+		ItemPointer tids;
+		Datum	   *tids_datum;
+		int			nposting;
+
+		tids = BTreeTupleGetPosting(itup);
+		nposting = BTreeTupleGetNPosting(itup);
+		tids_datum = (Datum *) palloc(nposting * sizeof(Datum));
+		for (int i = 0; i < nposting; i++)
+			tids_datum[i] = ItemPointerGetDatum(&tids[i]);
+		values[j++] = PointerGetDatum(construct_array(tids_datum,
+													  nposting,
+													  TIDOID,
+													  sizeof(ItemPointerData),
+													  false, 's'));
+		pfree(tids_datum);
+	}
+	else
+		nulls[j++] = true;
+
+	/* Build and return the result tuple */
+	tuple = heap_form_tuple(uargs->tupd, values, nulls);
 
 	return HeapTupleGetDatum(tuple);
 }
@@ -378,12 +451,13 @@ bt_page_items(PG_FUNCTION_ARGS)
 			elog(NOTICE, "page is deleted");
 
 		fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+		uargs->leafpage = P_ISLEAF(opaque);
+		uargs->rightmost = P_RIGHTMOST(opaque);
 
 		/* Build a tuple descriptor for our result type */
 		if (get_call_result_type(fcinfo, NULL, &tupleDesc) != TYPEFUNC_COMPOSITE)
 			elog(ERROR, "return type must be a row type");
-
-		fctx->attinmeta = TupleDescGetAttInMetadata(tupleDesc);
+		uargs->tupd = tupleDesc;
 
 		fctx->user_fctx = uargs;
 
@@ -395,7 +469,7 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
@@ -463,12 +537,13 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 			elog(NOTICE, "page is deleted");
 
 		fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+		uargs->leafpage = P_ISLEAF(opaque);
+		uargs->rightmost = P_RIGHTMOST(opaque);
 
 		/* Build a tuple descriptor for our result type */
 		if (get_call_result_type(fcinfo, NULL, &tupleDesc) != TYPEFUNC_COMPOSITE)
 			elog(ERROR, "return type must be a row type");
-
-		fctx->attinmeta = TupleDescGetAttInMetadata(tupleDesc);
+		uargs->tupd = tupleDesc;
 
 		fctx->user_fctx = uargs;
 
@@ -480,7 +555,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
@@ -510,7 +585,7 @@ bt_metap(PG_FUNCTION_ARGS)
 	BTMetaPageData *metad;
 	TupleDesc	tupleDesc;
 	int			j;
-	char	   *values[8];
+	char	   *values[9];
 	Buffer		buffer;
 	Page		page;
 	HeapTuple	tuple;
@@ -557,17 +632,21 @@ bt_metap(PG_FUNCTION_ARGS)
 
 	/*
 	 * Get values of extended metadata if available, use default values
-	 * otherwise.
+	 * otherwise.  Note that we rely on the assumption that btm_allequalimage
+	 * is initialized to zero on databases that were initdb'd before Postgres
+	 * 13.
 	 */
 	if (metad->btm_version >= BTREE_NOVAC_VERSION)
 	{
 		values[j++] = psprintf("%u", metad->btm_oldest_btpo_xact);
 		values[j++] = psprintf("%f", metad->btm_last_cleanup_num_heap_tuples);
+		values[j++] = metad->btm_allequalimage ? "t" : "f";
 	}
 	else
 	{
 		values[j++] = "0";
 		values[j++] = "-1";
+		values[j++] = "f";
 	}
 
 	tuple = BuildTupleFromCStrings(TupleDescGetAttInMetadata(tupleDesc),
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..17bf0c5470 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -12,6 +12,7 @@ fastroot                | 1
 fastlevel               | 0
 oldest_xact             | 0
 last_cleanup_num_tuples | -1
+allequalimage           | t
 
 SELECT * FROM bt_page_stats('test1_a_idx', 0);
 ERROR:  block 0 is a meta page
@@ -41,6 +42,9 @@ itemlen    | 16
 nulls      | f
 vars       | f
 data       | 01 00 00 00 00 00 00 01
+dead       | f
+htid       | (0,1)
+tids       | 
 
 SELECT * FROM bt_page_items('test1_a_idx', 2);
 ERROR:  block number out of range
@@ -54,6 +58,9 @@ itemlen    | 16
 nulls      | f
 vars       | f
 data       | 01 00 00 00 00 00 00 01
+dead       | f
+htid       | (0,1)
+tids       | 
 
 SELECT * FROM bt_page_items(get_raw_page('test1_a_idx', 2));
 ERROR:  block number 2 is out of range for relation "test1_a_idx"
diff --git a/contrib/pageinspect/pageinspect--1.7--1.8.sql b/contrib/pageinspect/pageinspect--1.7--1.8.sql
index 2a7c4b3516..e34c214c93 100644
--- a/contrib/pageinspect/pageinspect--1.7--1.8.sql
+++ b/contrib/pageinspect/pageinspect--1.7--1.8.sql
@@ -14,3 +14,56 @@ CREATE FUNCTION heap_tuple_infomask_flags(
 RETURNS record
 AS 'MODULE_PATHNAME', 'heap_tuple_infomask_flags'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_metap()
+--
+DROP FUNCTION bt_metap(text);
+CREATE FUNCTION bt_metap(IN relname text,
+    OUT magic int4,
+    OUT version int4,
+    OUT root int4,
+    OUT level int4,
+    OUT fastroot int4,
+    OUT fastlevel int4,
+    OUT oldest_xact int4,
+    OUT last_cleanup_num_tuples real,
+    OUT allequalimage boolean)
+AS 'MODULE_PATHNAME', 'bt_metap'
+LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items(text, int4)
+--
+DROP FUNCTION bt_page_items(text, int4);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT dead boolean,
+    OUT htid tid,
+    OUT tids tid[])
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items(bytea)
+--
+DROP FUNCTION bt_page_items(bytea);
+CREATE FUNCTION bt_page_items(IN page bytea,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT dead boolean,
+    OUT htid tid,
+    OUT tids tid[])
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items_bytea'
+LANGUAGE C STRICT PARALLEL SAFE;
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 7e2e1487d7..9558421c2f 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -300,13 +300,14 @@ test=# SELECT t_ctid, raw_flags, combined_flags
 test=# SELECT * FROM bt_metap('pg_cast_oid_index');
 -[ RECORD 1 ]-----------+-------
 magic                   | 340322
-version                 | 3
+version                 | 4
 root                    | 1
 level                   | 0
 fastroot                | 1
 fastlevel               | 0
 oldest_xact             | 582
 last_cleanup_num_tuples | 1000
+allequalimage           | f
 </screen>
      </para>
     </listitem>
@@ -329,11 +330,11 @@ test=# SELECT * FROM bt_page_stats('pg_cast_oid_index', 1);
 -[ RECORD 1 ]-+-----
 blkno         | 1
 type          | l
-live_items    | 256
+live_items    | 224
 dead_items    | 0
-avg_item_size | 12
+avg_item_size | 16
 page_size     | 8192
-free_size     | 4056
+free_size     | 3668
 btpo_prev     | 0
 btpo_next     | 0
 btpo          | 0
@@ -356,33 +357,45 @@ btpo_flags    | 3
       <function>bt_page_items</function> returns detailed information about
       all of the items on a B-tree index page.  For example:
 <screen>
-test=# SELECT * FROM bt_page_items('pg_cast_oid_index', 1);
- itemoffset |  ctid   | itemlen | nulls | vars |    data
-------------+---------+---------+-------+------+-------------
-          1 | (0,1)   |      12 | f     | f    | 23 27 00 00
-          2 | (0,2)   |      12 | f     | f    | 24 27 00 00
-          3 | (0,3)   |      12 | f     | f    | 25 27 00 00
-          4 | (0,4)   |      12 | f     | f    | 26 27 00 00
-          5 | (0,5)   |      12 | f     | f    | 27 27 00 00
-          6 | (0,6)   |      12 | f     | f    | 28 27 00 00
-          7 | (0,7)   |      12 | f     | f    | 29 27 00 00
-          8 | (0,8)   |      12 | f     | f    | 2a 27 00 00
+regression=# SELECT * FROM bt_page_items('tenk2_unique1', 5);
+ itemoffset |   ctid   | itemlen | nulls | vars |          data           | dead |   htid   | tids
+------------+----------+---------+-------+------+-------------------------+------+----------+------
+          1 | (40,1)   |      16 | f     | f    | b8 05 00 00 00 00 00 00 |      |          |
+          2 | (58,11)  |      16 | f     | f    | 4a 04 00 00 00 00 00 00 | f    | (58,11)  |
+          3 | (266,4)  |      16 | f     | f    | 4b 04 00 00 00 00 00 00 | f    | (266,4)  |
+          4 | (279,25) |      16 | f     | f    | 4c 04 00 00 00 00 00 00 | f    | (279,25) |
+          5 | (333,11) |      16 | f     | f    | 4d 04 00 00 00 00 00 00 | f    | (333,11) |
+          6 | (87,24)  |      16 | f     | f    | 4e 04 00 00 00 00 00 00 | f    | (87,24)  |
+          7 | (38,22)  |      16 | f     | f    | 4f 04 00 00 00 00 00 00 | f    | (38,22)  |
+          8 | (272,17) |      16 | f     | f    | 50 04 00 00 00 00 00 00 | f    | (272,17) |
 </screen>
-      In a B-tree leaf page, <structfield>ctid</structfield> points to a heap tuple.
-      In an internal page, the block number part of <structfield>ctid</structfield>
-      points to another page in the index itself, while the offset part
-      (the second number) is ignored and is usually 1.
+      In a B-tree leaf page, <structfield>ctid</structfield> usually
+      points to a heap tuple, and <structfield>dead</structfield> may
+      indicate that the item has its <literal>LP_DEAD</literal> bit
+      set.  In an internal page, the block number part of
+      <structfield>ctid</structfield> points to another page in the
+      index itself, while the offset part (the second number) encodes
+      metadata about the tuple.  Posting list tuples on leaf pages
+      also use <structfield>ctid</structfield> for metadata.
+      <structfield>htid</structfield> always shows a single heap TID
+      for the tuple, regardless of how it is represented (internal
+      page tuples may need to store a heap TID when there are many
+      duplicate tuples on descendent leaf pages).
+      <structfield>tids</structfield> is a list of TIDs that is stored
+      within posting list tuples (tuples created by deduplication).
      </para>
      <para>
       Note that the first item on any non-rightmost page (any page with
       a non-zero value in the <structfield>btpo_next</structfield> field) is the
       page's <quote>high key</quote>, meaning its <structfield>data</structfield>
       serves as an upper bound on all items appearing on the page, while
-      its <structfield>ctid</structfield> field is meaningless.  Also, on non-leaf
-      pages, the first real data item (the first item that is not a high
-      key) is a <quote>minus infinity</quote> item, with no actual value
-      in its <structfield>data</structfield> field.  Such an item does have a valid
-      downlink in its <structfield>ctid</structfield> field, however.
+      its <structfield>ctid</structfield> field does not point to
+      another block.  Also, on non-leaf pages, the first real data item
+      (the first item that is not a high key) is a <quote>minus
+      infinity</quote> item, with no actual value in its
+      <structfield>data</structfield> field.  Such an item does have a
+      valid downlink in its <structfield>ctid</structfield> field,
+      however.
      </para>
     </listitem>
    </varlistentry>
@@ -402,17 +415,17 @@ test=# SELECT * FROM bt_page_items('pg_cast_oid_index', 1);
       with <function>get_raw_page</function> should be passed as argument.  So
       the last example could also be rewritten like this:
 <screen>
-test=# SELECT * FROM bt_page_items(get_raw_page('pg_cast_oid_index', 1));
- itemoffset |  ctid   | itemlen | nulls | vars |    data
-------------+---------+---------+-------+------+-------------
-          1 | (0,1)   |      12 | f     | f    | 23 27 00 00
-          2 | (0,2)   |      12 | f     | f    | 24 27 00 00
-          3 | (0,3)   |      12 | f     | f    | 25 27 00 00
-          4 | (0,4)   |      12 | f     | f    | 26 27 00 00
-          5 | (0,5)   |      12 | f     | f    | 27 27 00 00
-          6 | (0,6)   |      12 | f     | f    | 28 27 00 00
-          7 | (0,7)   |      12 | f     | f    | 29 27 00 00
-          8 | (0,8)   |      12 | f     | f    | 2a 27 00 00
+regression=# SELECT * FROM bt_page_items(get_raw_page('tenk2_unique1', 5));
+ itemoffset |   ctid   | itemlen | nulls | vars |          data           | dead |   htid   | tids
+------------+----------+---------+-------+------+-------------------------+------+----------+------
+          1 | (40,1)   |      16 | f     | f    | b8 05 00 00 00 00 00 00 |      |          |
+          2 | (58,11)  |      16 | f     | f    | 4a 04 00 00 00 00 00 00 | f    | (58,11)  |
+          3 | (266,4)  |      16 | f     | f    | 4b 04 00 00 00 00 00 00 | f    | (266,4)  |
+          4 | (279,25) |      16 | f     | f    | 4c 04 00 00 00 00 00 00 | f    | (279,25) |
+          5 | (333,11) |      16 | f     | f    | 4d 04 00 00 00 00 00 00 | f    | (333,11) |
+          6 | (87,24)  |      16 | f     | f    | 4e 04 00 00 00 00 00 00 | f    | (87,24)  |
+          7 | (38,22)  |      16 | f     | f    | 4f 04 00 00 00 00 00 00 | f    | (38,22)  |
+          8 | (272,17) |      16 | f     | f    | 50 04 00 00 00 00 00 00 | f    | (272,17) |
 </screen>
       All the other details are the same as explained in the previous item.
      </para>
-- 
2.17.1

v34-0004-DEBUG-Show-index-values-in-pageinspect.patchapplication/octet-stream; name=v34-0004-DEBUG-Show-index-values-in-pageinspect.patchDownload

From 572f7d7752ac223c3babbbbc3b9538cbac30fa70 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 16 Feb 2020 01:16:02 -0800
Subject: [PATCH v34 4/4] DEBUG: Show index values in pageinspect

This is not intended for commit.  It is included as a convenience for
reviewers.
---
 contrib/pageinspect/btreefuncs.c       | 64 ++++++++++++++++++--------
 contrib/pageinspect/expected/btree.out |  2 +-
 2 files changed, 46 insertions(+), 20 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index f4aac890f5..9074033619 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -245,6 +245,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
  */
 struct user_args
 {
+	Relation	rel;
 	Page		page;
 	OffsetNumber offset;
 	bool		leafpage;
@@ -261,6 +262,7 @@ struct user_args
 static Datum
 bt_page_print_tuples(FuncCallContext *fctx, struct user_args *uargs)
 {
+	Relation	rel = uargs->rel;
 	Page		page = uargs->page;
 	OffsetNumber offset = uargs->offset;
 	bool		leafpage = uargs->leafpage;
@@ -295,26 +297,48 @@ bt_page_print_tuples(FuncCallContext *fctx, struct user_args *uargs)
 	values[j++] = BoolGetDatum(IndexTupleHasVarwidths(itup));
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
-	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
-
-	/*
-	 * Make sure that "data" column does not include posting list or pivot
-	 * tuple representation of heap TID
-	 */
-	if (BTreeTupleIsPosting(itup))
-		dlen -= IndexTupleSize(itup) - BTreeTupleGetPostingOffset(itup);
-	else if (BTreeTupleIsPivot(itup) && BTreeTupleGetHeapTID(itup) != NULL)
-		dlen -= MAXALIGN(sizeof(ItemPointerData));
-
-	dump = palloc0(dlen * 3 + 1);
-	datacstring = dump;
-	for (off = 0; off < dlen; off++)
+	if (rel)
 	{
-		if (off > 0)
-			*dump++ = ' ';
-		sprintf(dump, "%02x", *(ptr + off) & 0xff);
-		dump += 2;
+		TupleDesc	itupdesc = RelationGetDescr(rel);
+		Datum		datvalues[INDEX_MAX_KEYS];
+		bool		isnull[INDEX_MAX_KEYS];
+		int			natts;
+		int			indnkeyatts = rel->rd_index->indnkeyatts;
+
+		natts = BTreeTupleGetNAtts(itup, rel);
+
+		itupdesc->natts = Min(indnkeyatts, natts);
+		memset(&isnull, 0xFF, sizeof(isnull));
+		index_deform_tuple(itup, itupdesc, datvalues, isnull);
+		rel->rd_index->indnkeyatts = natts;
+		datacstring = BuildIndexValueDescription(rel, datvalues, isnull);
+		itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+		rel->rd_index->indnkeyatts = indnkeyatts;
 	}
+	else
+	{
+		dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+
+		/*
+		 * Make sure that "data" column does not include posting list or pivot
+		 * tuple representation of heap TID
+		 */
+		if (BTreeTupleIsPosting(itup))
+			dlen -= IndexTupleSize(itup) - BTreeTupleGetPostingOffset(itup);
+		else if (BTreeTupleIsPivot(itup) && BTreeTupleGetHeapTID(itup) != NULL)
+			dlen -= MAXALIGN(sizeof(ItemPointerData));
+
+		dump = palloc0(dlen * 3 + 1);
+		datacstring = dump;
+		for (off = 0; off < dlen; off++)
+		{
+			if (off > 0)
+				*dump++ = ' ';
+			sprintf(dump, "%02x", *(ptr + off) & 0xff);
+			dump += 2;
+		}
+	}
+
 	values[j++] = CStringGetTextDatum(datacstring);
 	pfree(datacstring);
 
@@ -437,11 +461,11 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = rel;
 		uargs->page = palloc(BLCKSZ);
 		memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
 
 		UnlockReleaseBuffer(buffer);
-		relation_close(rel, AccessShareLock);
 
 		uargs->offset = FirstOffsetNumber;
 
@@ -475,6 +499,7 @@ bt_page_items(PG_FUNCTION_ARGS)
 	}
 	else
 	{
+		relation_close(uargs->rel, AccessShareLock);
 		pfree(uargs->page);
 		pfree(uargs);
 		SRF_RETURN_DONE(fctx);
@@ -522,6 +547,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = NULL;
 		uargs->page = VARDATA(raw_page);
 
 		uargs->offset = FirstOffsetNumber;
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 17bf0c5470..92ad8eb1a9 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -41,7 +41,7 @@ ctid       | (0,1)
 itemlen    | 16
 nulls      | f
 vars       | f
-data       | 01 00 00 00 00 00 00 01
+data       | (a)=(72057594037927937)
 dead       | f
 htid       | (0,1)
 tids       | 
-- 
2.17.1

v34-0002-Add-deduplication-to-nbtree.patchapplication/octet-stream; name=v34-0002-Add-deduplication-to-nbtree.patchDownload

From 9b249d0c41a110f315dfe9d2f361e39c54254f9f Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 16 Feb 2020 01:16:02 -0800
Subject: [PATCH v34 2/4] Add deduplication to nbtree.

Deduplication reduces the storage overhead of duplicates in indexes that
use the standard nbtree index access method.  The deduplication process
is applied lazily, after the point where opportunistic deletion of
LP_DEAD-marked index tuples occurs.  Deduplication is only applied at
the point where a leaf page split would otherwise be required.  New
"posting list tuples" are formed by merging together existing duplicate
tuples.  The physical representation of the items on an nbtree leaf page
is made more space efficient by deduplication, but the logical contents
of the page are not changed.

Deduplication merges together duplicates that happen to have been
created by an UPDATE that did not use an optimization like heapam's
Heap-only tuples (HOT).  Deduplication is effective at absorbing
"version bloat" without any special knowledge of row versions or of
MVCC.  Deduplication is applied within unique indexes for this reason,
though the criteria for triggering a deduplication is slightly
different.  Deduplication of a unique index is triggered only when the
incoming item is a duplicate of an existing item (and when the page
would otherwise split), which is a sure sign of "version bloat".

The lazy approach taken by nbtree has significant advantages over a
GIN style eager approach.  Most individual inserts of index tuples have
exactly the same overhead as before.  The extra overhead of
deduplication is amortized across insertions, just like the overhead of
page splits.  The key space of indexes works in the same way as it has
since commit dd299df8 (the commit which made heap TID a tiebreaker
column), since only the physical representation of tuples is changed.  A
new index storage parameter (deduplicate_items) controls the use of
deduplication.  The default setting is 'on', so all B-Tree indexes use
deduplication when only "equalimage" operator classes are used.  We
should review this decision at the end of the Postgres 13 beta period.

Testing has shown that nbtree deduplication can generally make indexes
with about 10 or 15 tuples for each distinct key value about 2.5X - 4X
smaller, even with single column integer indexes (e.g., an index on a
referencing column that accompanies a foreign key).  The final size of
single column nbtree indexes comes close to the final size of a similar
contrib/btree_gin index, at least in cases where GIN's posting list
compression isn't very effective.  This can significantly improve
transaction throughput, and significantly reduce the cost of vacuuming
indexes.

There is a regression of approximately 2% of transaction throughput with
workloads that consist of append-only inserts into a table with several
non-unique indexes, where all indexes have few or no repeated values.
The underlying issue is that cycles are wasted on unsuccessful attempts
at deduplicating items in non-unique indexes.  There doesn't seem to be
a way around it short of disabling deduplication entirely.  Note that
deduplication of items in unique indexes is fairly well targeted in
general, which avoids wasting cycles in the insert path of unique
indexes.

Bump XLOG_PAGE_MAGIC because xl_btree_vacuum changed.

No bump in BTREE_VERSION, since deduplication only affects the physical
representation of tuples.  However, users must still REINDEX a
pg_upgrade'd index to use deduplication.  This is the only way to set
the new nbtree metapage flag indicating that deduplication is generally
safe.

Author: Anastasia Lubennikova, Peter Geoghegan
Reviewed-By: Peter Geoghegan, Heikki Linnakangas
Discussion:
    https://postgr.es/m/55E4051B.7020209@postgrespro.ru
    https://postgr.es/m/4ab6e2db-bcee-f4cf-0916-3a06e6ccbb55@postgrespro.ru
---
 src/include/access/nbtree.h               | 435 ++++++++++--
 src/include/access/nbtxlog.h              | 117 ++-
 src/include/access/rmgrlist.h             |   2 +-
 src/backend/access/common/reloptions.c    |   9 +
 src/backend/access/index/genam.c          |   4 +
 src/backend/access/nbtree/Makefile        |   1 +
 src/backend/access/nbtree/README          | 133 +++-
 src/backend/access/nbtree/nbtdedup.c      | 830 ++++++++++++++++++++++
 src/backend/access/nbtree/nbtinsert.c     | 388 ++++++++--
 src/backend/access/nbtree/nbtpage.c       | 246 ++++++-
 src/backend/access/nbtree/nbtree.c        | 171 ++++-
 src/backend/access/nbtree/nbtsearch.c     | 272 ++++++-
 src/backend/access/nbtree/nbtsort.c       | 193 ++++-
 src/backend/access/nbtree/nbtsplitloc.c   |  39 +-
 src/backend/access/nbtree/nbtutils.c      | 201 +++++-
 src/backend/access/nbtree/nbtxlog.c       | 268 ++++++-
 src/backend/access/rmgrdesc/nbtdesc.c     |  22 +-
 src/backend/storage/page/bufpage.c        |  11 +-
 src/bin/psql/tab-complete.c               |   4 +-
 contrib/amcheck/verify_nbtree.c           | 231 ++++--
 doc/src/sgml/btree.sgml                   | 219 +++++-
 doc/src/sgml/charset.sgml                 |   9 +-
 doc/src/sgml/citext.sgml                  |   7 +-
 doc/src/sgml/func.sgml                    |   9 +-
 doc/src/sgml/ref/create_index.sgml        |  44 +-
 src/test/regress/expected/btree_index.out |  20 +-
 src/test/regress/sql/btree_index.sql      |  22 +-
 27 files changed, 3572 insertions(+), 335 deletions(-)
 create mode 100644 src/backend/access/nbtree/nbtdedup.c

diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index d520066914..13d4eea84b 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -108,6 +108,7 @@ typedef struct BTMetaPageData
 										 * pages */
 	float8		btm_last_cleanup_num_heap_tuples;	/* number of heap tuples
 													 * during last cleanup */
+	bool		btm_allequalimage;	/* are all columns "equalimage"? */
 } BTMetaPageData;
 
 #define BTPageGetMeta(p) \
@@ -124,6 +125,14 @@ typedef struct BTMetaPageData
  * need to be immediately re-indexed at pg_upgrade.  In order to get the
  * new heapkeyspace semantics, however, a REINDEX is needed.
  *
+ * Deduplication is safe to use when the btm_allequalimage field is set to
+ * true.  It's safe to read the btm_allequalimage field on version 3, but
+ * only version 4 indexes make use of deduplication.  Even version 4
+ * indexes created on PostgreSQL v12 will need a REINDEX to make use of
+ * deduplication, though, since there is no other way to set
+ * btm_allequalimage to true (pg_upgrade hasn't been taught to set the
+ * metapage field).
+ *
  * Btree version 2 is mostly the same as version 3.  There are two new
  * fields in the metapage that were introduced in version 3.  A version 2
  * metapage will be automatically upgraded to version 3 on the first
@@ -156,6 +165,21 @@ typedef struct BTMetaPageData
 				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
 				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
 
+/*
+ * MaxTIDsPerBTreePage is an upper bound on the number of heap TIDs tuples
+ * that may be stored on a btree leaf page.  It is used to size the
+ * per-page temporary buffers used by index scans.)
+ *
+ * Note: we don't bother considering per-tuple overheads here to keep
+ * things simple (value is based on how many elements a single array of
+ * heap TIDs must have to fill the space between the page header and
+ * special area).  The value is slightly higher (i.e. more conservative)
+ * than necessary as a result, which is considered acceptable.
+ */
+#define MaxTIDsPerBTreePage \
+	(int) ((BLCKSZ - SizeOfPageHeaderData - sizeof(BTPageOpaqueData)) / \
+		   sizeof(ItemPointerData))
+
 /*
  * The leaf-page fillfactor defaults to 90% but is user-adjustable.
  * For pages above the leaf level, we use a fixed 70% fillfactor.
@@ -230,16 +254,15 @@ typedef struct BTMetaPageData
  * tuples (non-pivot tuples).  _bt_check_natts() enforces the rules
  * described here.
  *
- * Non-pivot tuple format:
+ * Non-pivot tuple format (plain/non-posting variant):
  *
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
- * All other types of index tuples ("pivot" tuples) only have key columns,
- * since pivot tuples only exist to represent how the key space is
+ * Non-pivot tuples complement pivot tuples, which only have key columns.
+ * The sole purpose of pivot tuples is to represent how the key space is
  * separated.  In general, any B-Tree index that has more than one level
  * (i.e. any index that does not just consist of a metapage and a single
  * leaf root page) must have some number of pivot tuples, since pivot
@@ -264,7 +287,8 @@ typedef struct BTMetaPageData
  * INDEX_ALT_TID_MASK bit is set, which doesn't count the trailing heap
  * TID column sometimes stored in pivot tuples -- that's represented by
  * the presence of BT_PIVOT_HEAP_TID_ATTR.  The INDEX_ALT_TID_MASK bit in
- * t_info is always set on BTREE_VERSION 4 pivot tuples.
+ * t_info is always set on BTREE_VERSION 4 pivot tuples, since
+ * BTreeTupleIsPivot() must work reliably on heapkeyspace versions.
  *
  * In version 3 indexes, the INDEX_ALT_TID_MASK flag might not be set in
  * pivot tuples.  In that case, the number of key columns is implicitly
@@ -279,90 +303,256 @@ typedef struct BTMetaPageData
  * The 12 least significant offset bits from t_tid are used to represent
  * the number of columns in INDEX_ALT_TID_MASK tuples, leaving 4 status
  * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
- * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
- * number of columns/attributes <= INDEX_MAX_KEYS.
+ * future use.  BT_OFFSET_MASK should be large enough to store any number
+ * of columns/attributes <= INDEX_MAX_KEYS.
+ *
+ * Sometimes non-pivot tuples also use a representation that repurposes
+ * t_tid to store metadata rather than a TID.  PostgreSQL v13 introduced a
+ * new non-pivot tuple format to support deduplication: posting list
+ * tuples.  Deduplication merges together multiple equal non-pivot tuples
+ * into a logically equivalent, space efficient representation.  A posting
+ * list is an array of ItemPointerData elements.  Non-pivot tuples are
+ * merged together to form posting list tuples lazily, at the point where
+ * we'd otherwise have to split a leaf page.
+ *
+ * Posting tuple format (alternative non-pivot tuple representation):
+ *
+ *  t_tid | t_info | key values | posting list (TID array)
+ *
+ * Posting list tuples are recognized as such by having the
+ * INDEX_ALT_TID_MASK status bit set in t_info and the BT_IS_POSTING status
+ * bit set in t_tid.  These flags redefine the content of the posting
+ * tuple's t_tid to store an offset to the posting list, as well as the
+ * total number of posting list array elements.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items present in the tuple, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.  Like any non-pivot tuple, the number of columns stored is
+ * always implicitly the total number in the index (in practice there can
+ * never be non-key columns stored, since deduplication is not supported
+ * with INCLUDE indexes).  BT_OFFSET_MASK should be large enough to store
+ * any number of posting list TIDs that might be present in a tuple (since
+ * tuple size is subject to the INDEX_SIZE_MASK limit).
  *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
-#define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_OFFSET_MASK				0x0FFF
 #define BT_PIVOT_HEAP_TID_ATTR		0x1000
-
-/* Get/set downlink block number in pivot tuple */
-#define BTreeTupleGetDownLink(itup) \
-	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
-#define BTreeTupleSetDownLink(itup, blkno) \
-	ItemPointerSetBlockNumber(&((itup)->t_tid), (blkno))
+#define BT_IS_POSTING				0x2000
 
 /*
- * Get/set leaf page highkey's link. During the second phase of deletion, the
- * target leaf page's high key may point to an ancestor page (at all other
- * times, the leaf level high key's link is not used).  See the nbtree README
- * for full details.
+ * Note: BTreeTupleIsPivot() can have false negatives (but not false
+ * positives) when used with !heapkeyspace indexes
  */
-#define BTreeTupleGetTopParent(itup) \
-	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
-#define BTreeTupleSetTopParent(itup, blkno)	\
-	do { \
-		ItemPointerSetBlockNumber(&((itup)->t_tid), (blkno)); \
-		BTreeTupleSetNAtts((itup), 0); \
-	} while(0)
+static inline bool
+BTreeTupleIsPivot(IndexTuple itup)
+{
+	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+		return false;
+	/* absence of BT_IS_POSTING in offset number indicates pivot tuple */
+	if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) & BT_IS_POSTING) != 0)
+		return false;
+
+	return true;
+}
+
+static inline bool
+BTreeTupleIsPosting(IndexTuple itup)
+{
+	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+		return false;
+	/* presence of BT_IS_POSTING in offset number indicates posting tuple */
+	if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) & BT_IS_POSTING) == 0)
+		return false;
+
+	return true;
+}
+
+static inline void
+BTreeTupleSetPosting(IndexTuple itup, int nhtids, int postingoffset)
+{
+	Assert(nhtids > 1 && (nhtids & BT_OFFSET_MASK) == nhtids);
+	Assert(postingoffset == MAXALIGN(postingoffset));
+	Assert(postingoffset < INDEX_SIZE_MASK);
+
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	ItemPointerSetOffsetNumber(&itup->t_tid, (nhtids | BT_IS_POSTING));
+	ItemPointerSetBlockNumber(&itup->t_tid, postingoffset);
+}
+
+static inline uint16
+BTreeTupleGetNPosting(IndexTuple posting)
+{
+	OffsetNumber existing;
+
+	Assert(BTreeTupleIsPosting(posting));
+
+	existing = ItemPointerGetOffsetNumberNoCheck(&posting->t_tid);
+	return (existing & BT_OFFSET_MASK);
+}
+
+static inline uint32
+BTreeTupleGetPostingOffset(IndexTuple posting)
+{
+	Assert(BTreeTupleIsPosting(posting));
+
+	return ItemPointerGetBlockNumberNoCheck(&posting->t_tid);
+}
+
+static inline ItemPointer
+BTreeTupleGetPosting(IndexTuple posting)
+{
+	return (ItemPointer) ((char *) posting +
+						  BTreeTupleGetPostingOffset(posting));
+}
+
+static inline ItemPointer
+BTreeTupleGetPostingN(IndexTuple posting, int n)
+{
+	return BTreeTupleGetPosting(posting) + n;
+}
 
 /*
- * Get/set number of attributes within B-tree index tuple.
+ * Get/set downlink block number in pivot tuple.
+ *
+ * Note: Cannot assert that tuple is a pivot tuple.  If we did so then
+ * !heapkeyspace indexes would exhibit false positive assertion failures.
+ */
+static inline BlockNumber
+BTreeTupleGetDownLink(IndexTuple pivot)
+{
+	return ItemPointerGetBlockNumberNoCheck(&pivot->t_tid);
+}
+
+static inline void
+BTreeTupleSetDownLink(IndexTuple pivot, BlockNumber blkno)
+{
+	ItemPointerSetBlockNumber(&pivot->t_tid, blkno);
+}
+
+/*
+ * Get number of attributes within tuple.
  *
  * Note that this does not include an implicit tiebreaker heap TID
  * attribute, if any.  Note also that the number of key attributes must be
  * explicitly represented in all heapkeyspace pivot tuples.
+ *
+ * Note: This is defined as a macro rather than an inline function to
+ * avoid including rel.h.
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		(BTreeTupleIsPivot(itup)) ? \
 		( \
-			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
+			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_OFFSET_MASK \
 		) \
 		: \
 		IndexRelationGetNumberOfAttributes(rel) \
 	)
-#define BTreeTupleSetNAtts(itup, n) \
-	do { \
-		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
-	} while(0)
 
 /*
- * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Set number of attributes in tuple, making it into a pivot tuple
  */
-#define BTreeTupleGetHeapTID(itup) \
-	( \
-	  (itup)->t_info & INDEX_ALT_TID_MASK && \
-	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_PIVOT_HEAP_TID_ATTR) != 0 ? \
-	  ( \
-		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
-					   sizeof(ItemPointerData)) \
-	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
-	)
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int natts)
+{
+	Assert(natts <= INDEX_MAX_KEYS);
+
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	/* BT_IS_POSTING bit may be unset -- tuple always becomes a pivot tuple */
+	ItemPointerSetOffsetNumber(&itup->t_tid, natts);
+	Assert(BTreeTupleIsPivot(itup));
+}
+
 /*
- * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * Set the bit indicating heap TID attribute present in pivot tuple
  */
-#define BTreeTupleSetAltHeapTID(itup) \
-	do { \
-		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
-								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_PIVOT_HEAP_TID_ATTR); \
-	} while(0)
+static inline void
+BTreeTupleSetAltHeapTID(IndexTuple pivot)
+{
+	OffsetNumber existing;
+
+	Assert(BTreeTupleIsPivot(pivot));
+
+	existing = ItemPointerGetOffsetNumberNoCheck(&pivot->t_tid);
+	ItemPointerSetOffsetNumber(&pivot->t_tid,
+							   existing | BT_PIVOT_HEAP_TID_ATTR);
+}
+
+/*
+ * Get/set leaf page's "top parent" link from its high key.  Used during page
+ * deletion.
+ *
+ * Note: Cannot assert that tuple is a pivot tuple.  If we did so then
+ * !heapkeyspace indexes would exhibit false positive assertion failures.
+ */
+static inline BlockNumber
+BTreeTupleGetTopParent(IndexTuple leafhikey)
+{
+	return ItemPointerGetBlockNumberNoCheck(&leafhikey->t_tid);
+}
+
+static inline void
+BTreeTupleSetTopParent(IndexTuple leafhikey, BlockNumber blkno)
+{
+	ItemPointerSetBlockNumber(&leafhikey->t_tid, blkno);
+	BTreeTupleSetNAtts(leafhikey, 0);
+}
+
+/*
+ * Get tiebreaker heap TID attribute, if any.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
+ */
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+	if (BTreeTupleIsPivot(itup))
+	{
+		/* Pivot tuple heap TID representation? */
+		if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+			 BT_PIVOT_HEAP_TID_ATTR) != 0)
+			return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+								  sizeof(ItemPointerData));
+
+		/* Heap TID attribute was truncated */
+		return NULL;
+	}
+	else if (BTreeTupleIsPosting(itup))
+		return BTreeTupleGetPosting(itup);
+
+	return &itup->t_tid;
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list tuple.
+ *
+ * Works with non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxHeapTID(IndexTuple itup)
+{
+	Assert(!BTreeTupleIsPivot(itup));
+
+	if (BTreeTupleIsPosting(itup))
+	{
+		uint16		nposting = BTreeTupleGetNPosting(itup);
+
+		return BTreeTupleGetPostingN(itup, nposting - 1);
+	}
+
+	return &itup->t_tid;
+}
 
 /*
  *	Operator strategy numbers for B-tree have been moved to access/stratnum.h,
@@ -444,6 +634,9 @@ typedef BTStackData *BTStack;
  * indexes whose version is >= version 4.  It's convenient to keep this close
  * by, rather than accessing the metapage repeatedly.
  *
+ * allequalimage is set to indicate that deduplication is safe for the index.
+ * This is also a property of the index relation rather than an indexscan.
+ *
  * anynullkeys indicates if any of the keys had NULL value when scankey was
  * built from index tuple (note that already-truncated tuple key attributes
  * set NULL as a placeholder key value, which also affects value of
@@ -479,6 +672,7 @@ typedef BTStackData *BTStack;
 typedef struct BTScanInsertData
 {
 	bool		heapkeyspace;
+	bool		allequalimage;
 	bool		anynullkeys;
 	bool		nextkey;
 	bool		pivotsearch;
@@ -517,10 +711,94 @@ typedef struct BTInsertStateData
 	bool		bounds_valid;
 	OffsetNumber low;
 	OffsetNumber stricthigh;
+
+	/*
+	 * if _bt_binsrch_insert found the location inside existing posting list,
+	 * save the position inside the list.  -1 sentinel value indicates overlap
+	 * with an existing posting list tuple that has its LP_DEAD bit set.
+	 */
+	int			postingoff;
 } BTInsertStateData;
 
 typedef BTInsertStateData *BTInsertState;
 
+/*
+ * State used to representing an individual pending tuple during
+ * deduplication.
+ */
+typedef struct BTDedupInterval
+{
+	OffsetNumber baseoff;
+	uint16		nitems;
+} BTDedupInterval;
+
+/*
+ * BTDedupStateData is a working area used during deduplication.
+ *
+ * The status info fields track the state of a whole-page deduplication pass.
+ * State about the current pending posting list is also tracked.
+ *
+ * A pending posting list is comprised of a contiguous group of equal items
+ * from the page, starting from page offset number 'baseoff'.  This is the
+ * offset number of the "base" tuple for new posting list.  'nitems' is the
+ * current total number of existing items from the page that will be merged to
+ * make a new posting list tuple, including the base tuple item.  (Existing
+ * items may themselves be posting list tuples, or regular non-pivot tuples.)
+ *
+ * The total size of the existing tuples to be freed when pending posting list
+ * is processed gets tracked by 'phystupsize'.  This information allows
+ * deduplication to calculate the space saving for each new posting list
+ * tuple, and for the entire pass over the page as a whole.
+ */
+typedef struct BTDedupStateData
+{
+	/* Deduplication status info for entire pass over page */
+	bool		deduplicate;	/* Still deduplicating page? */
+	Size		maxpostingsize; /* Limit on size of final tuple */
+
+	/* Metadata about base tuple of current pending posting list */
+	IndexTuple	base;			/* Use to form new posting list */
+	OffsetNumber baseoff;		/* page offset of base */
+	Size		basetupsize;	/* base size without original posting list */
+
+	/* Other metadata about pending posting list */
+	ItemPointer htids;			/* Heap TIDs in pending posting list */
+	int			nhtids;			/* Number of heap TIDs in htids array */
+	int			nitems;			/* Number of existing tuples/line pointers */
+	Size		phystupsize;	/* Includes line pointer overhead */
+
+	/*
+	 * Array of tuples to go on new version of the page.  Contains one entry
+	 * for each group of consecutive items.  Note that existing tuples that
+	 * will not become posting list tuples do not appear in the array (they
+	 * are implicitly unchanged by deduplication pass).
+	 */
+	int			nintervals;		/* current size of intervals array */
+	BTDedupInterval intervals[MaxIndexTuplesPerPage];
+} BTDedupStateData;
+
+typedef BTDedupStateData *BTDedupState;
+
+/*
+ * BTVacuumPostingData is state that represents how to VACUUM a posting list
+ * tuple when some (though not all) of its TIDs are to be deleted.
+ *
+ * Convention is that itup field is the original posting list tuple on input,
+ * and palloc()'d final tuple used to overwrite existing tuple on output.
+ */
+typedef struct BTVacuumPostingData
+{
+	/* Tuple that will be/was updated */
+	IndexTuple	itup;
+	OffsetNumber updatedoffset;
+
+	/* State needed to describe final itup in WAL */
+	uint16		ndeletedtids;
+	uint16		deletetids[FLEXIBLE_ARRAY_MEMBER];
+} BTVacuumPostingData;
+
+typedef BTVacuumPostingData *BTVacuumPosting;
+
 /*
  * BTScanOpaqueData is the btree-private state needed for an indexscan.
  * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
@@ -544,7 +822,9 @@ typedef BTInsertStateData *BTInsertState;
  * If we are doing an index-only scan, we save the entire IndexTuple for each
  * matched item, otherwise only its heap TID and offset.  The IndexTuples go
  * into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array.  Posting list tuples store a "base" tuple once,
+ * allowing the same key to be returned for each TID in the posting list
+ * tuple.
  */
 
 typedef struct BTScanPosItem	/* what we remember about each match */
@@ -588,7 +868,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxTIDsPerBTreePage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -696,6 +976,7 @@ typedef struct BTOptions
 	int			fillfactor;		/* page fill factor in percent (0..100) */
 	/* fraction of newly inserted tuples prior to trigger index cleanup */
 	float8		vacuum_cleanup_index_scale_factor;
+	bool		deduplicate_items;	/* Try to deduplicate items? */
 } BTOptions;
 
 #define BTGetFillFactor(relation) \
@@ -706,6 +987,11 @@ typedef struct BTOptions
 	 BTREE_DEFAULT_FILLFACTOR)
 #define BTGetTargetPageFreeSpace(relation) \
 	(BLCKSZ * (100 - BTGetFillFactor(relation)) / 100)
+#define BTGetDeduplicateItems(relation) \
+	(AssertMacro(relation->rd_rel->relkind == RELKIND_INDEX && \
+				 relation->rd_rel->relam == BTREE_AM_OID), \
+	((relation)->rd_options ? \
+	 ((BTOptions *) (relation)->rd_options)->deduplicate_items : true))
 
 /*
  * Constant definition for progress reporting.  Phase numbers must match
@@ -752,6 +1038,22 @@ extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
 extern void _bt_parallel_done(IndexScanDesc scan);
 extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
 
+/*
+ * prototypes for functions in nbtdedup.c
+ */
+extern void _bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
+							   IndexTuple newitem, Size newitemsz,
+							   bool checkingunique);
+extern void _bt_dedup_start_pending(BTDedupState state, IndexTuple base,
+									OffsetNumber baseoff);
+extern bool _bt_dedup_save_htid(BTDedupState state, IndexTuple itup);
+extern Size _bt_dedup_finish_pending(Page newpage, BTDedupState state);
+extern IndexTuple _bt_form_posting(IndexTuple base, ItemPointer htids,
+								   int nhtids);
+extern void _bt_update_posting(BTVacuumPosting vacposting);
+extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
+								   int postingoff);
+
 /*
  * prototypes for functions in nbtinsert.c
  */
@@ -770,14 +1072,16 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
 /*
  * prototypes for functions in nbtpage.c
  */
-extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level);
+extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+							 bool allequalimage);
 extern void _bt_update_meta_cleanup_info(Relation rel,
 										 TransactionId oldestBtpoXact, float8 numHeapTuples);
 extern void _bt_upgrademetapage(Page page);
 extern Buffer _bt_getroot(Relation rel, int access);
 extern Buffer _bt_gettrueroot(Relation rel);
 extern int	_bt_getrootheight(Relation rel);
-extern bool _bt_heapkeyspace(Relation rel);
+extern void _bt_metaversion(Relation rel, bool *heapkeyspace,
+							bool *allequalimage);
 extern void _bt_checkpage(Relation rel, Buffer buf);
 extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
 extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -786,7 +1090,8 @@ extern void _bt_relbuf(Relation rel, Buffer buf);
 extern void _bt_pageinit(Page page, Size size);
 extern bool _bt_page_recyclable(Page page);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
-								OffsetNumber *deletable, int ndeletable);
+								OffsetNumber *deletable, int ndeletable,
+								BTVacuumPosting *updatable, int nupdatable);
 extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *deletable, int ndeletable,
 								Relation heapRel);
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 776a9bd723..347976c532 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_INSERT_POST	0x50	/* add index tuple with posting split */
+#define XLOG_BTREE_DEDUP		0x60	/* deduplicate tuples for a page */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -53,21 +54,34 @@ typedef struct xl_btree_metadata
 	uint32		fastlevel;
 	TransactionId oldest_btpo_xact;
 	float8		last_cleanup_num_heap_tuples;
+	bool		allequalimage;
 } xl_btree_metadata;
 
 /*
  * This is what we need to know about simple (without split) insert.
  *
- * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
- * Note that INSERT_META implies it's not a leaf page.
+ * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META, and
+ * INSERT_POST.  Note that INSERT_META and INSERT_UPPER implies it's not a
+ * leaf page, while INSERT_POST and INSERT_LEAF imply that it must be a leaf
+ * page.
  *
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page
  * Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
  * Backup Blk 2: xl_btree_metadata, if INSERT_META
+ *
+ * Note: The new tuple is actually the "original" new item in the posting
+ * list split insert case (i.e. the INSERT_POST case).  A split offset for
+ * the posting list is logged before the original new item.  Recovery needs
+ * both, since it must do an in-place update of the existing posting list
+ * that was split as an extra step.  Also, recovery generates a "final"
+ * newitem.  See _bt_swap_posting() for details on posting list splits.
  */
 typedef struct xl_btree_insert
 {
 	OffsetNumber offnum;
+
+	/* POSTING SPLIT OFFSET FOLLOWS (INSERT_POST case) */
+	/* NEW TUPLE ALWAYS FOLLOWS AT THE END */
 } xl_btree_insert;
 
 #define SizeOfBtreeInsert	(offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
@@ -92,8 +106,37 @@ typedef struct xl_btree_insert
  * Backup Blk 0: original page / new left page
  *
  * The left page's data portion contains the new item, if it's the _L variant.
- * An IndexTuple representing the high key of the left page must follow with
- * either variant.
+ * _R variant split records generally do not have a newitem (_R variant leaf
+ * page split records that must deal with a posting list split will include an
+ * explicit newitem, though it is never used on the right page -- it is
+ * actually an orignewitem needed to update existing posting list).  The new
+ * high key of the left/original page appears last of all (and must always be
+ * present).
+ *
+ * Page split records that need the REDO routine to deal with a posting list
+ * split directly will have an explicit newitem, which is actually an
+ * orignewitem (the newitem as it was before the posting list split, not
+ * after).  A posting list split always has a newitem that comes immediately
+ * after the posting list being split (which would have overlapped with
+ * orignewitem prior to split).  Usually REDO must deal with posting list
+ * splits with an _L variant page split record, and usually both the new
+ * posting list and the final newitem go on the left page (the existing
+ * posting list will be inserted instead of the old, and the final newitem
+ * will be inserted next to that).  However, _R variant split records will
+ * include an orignewitem when the split point for the page happens to have a
+ * lastleft tuple that is also the posting list being split (leaving newitem
+ * as the page split's firstright tuple).  The existence of this corner case
+ * does not change the basic fact about newitem/orignewitem for the REDO
+ * routine: it is always state used for the left page alone.  (This is why the
+ * record's postingoff field isn't a reliable indicator of whether or not a
+ * posting list split occurred during the page split; a non-zero value merely
+ * indicates that the REDO routine must reconstruct a new posting list tuple
+ * that is needed for the left page.)
+ *
+ * This posting list split handling is equivalent to the xl_btree_insert REDO
+ * routine's INSERT_POST handling.  While the details are more complicated
+ * here, the concept and goals are exactly the same.  See _bt_swap_posting()
+ * for details on posting list splits.
  *
  * Backup Blk 1: new right page
  *
@@ -111,15 +154,33 @@ typedef struct xl_btree_split
 {
 	uint32		level;			/* tree level of page being split */
 	OffsetNumber firstright;	/* first item moved to right page */
-	OffsetNumber newitemoff;	/* new item's offset (useful for _L variant) */
+	OffsetNumber newitemoff;	/* new item's offset */
+	uint16		postingoff;		/* offset inside orig posting tuple */
 } xl_btree_split;
 
-#define SizeOfBtreeSplit	(offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit	(offsetof(xl_btree_split, postingoff) + sizeof(uint16))
+
+/*
+ * When page is deduplicated, consecutive groups of tuples with equal keys are
+ * merged together into posting list tuples.
+ *
+ * The WAL record represents a deduplication pass for a leaf page.  An array
+ * of BTDedupInterval structs follows.
+ */
+typedef struct xl_btree_dedup
+{
+	uint16		nintervals;
+
+	/* DEDUPLICATION INTERVALS FOLLOW */
+} xl_btree_dedup;
+
+#define SizeOfBtreeDedup 	(offsetof(xl_btree_dedup, nintervals) + sizeof(uint16))
 
 /*
  * This is what we need to know about delete of individual leaf index tuples.
  * The WAL record can represent deletion of any number of index tuples on a
- * single index page when *not* executed by VACUUM.
+ * single index page when *not* executed by VACUUM.  Deletion of a subset of
+ * the TIDs within a posting list tuple is not supported.
  *
  * Backup Blk 0: index page
  */
@@ -150,21 +211,43 @@ typedef struct xl_btree_reuse_page
 #define SizeOfBtreeReusePage	(sizeof(xl_btree_reuse_page))
 
 /*
- * This is what we need to know about vacuum of individual leaf index tuples.
- * The WAL record can represent deletion of any number of index tuples on a
- * single index page when executed by VACUUM.
+ * This is what we need to know about which TIDs to remove from an individual
+ * posting list tuple during vacuuming.  An array of these may appear at the
+ * end of xl_btree_vacuum records.
+ */
+typedef struct xl_btree_update
+{
+	uint16		ndeletedtids;
+
+	/* POSTING LIST uint16 OFFSETS TO A DELETED TID FOLLOW */
+} xl_btree_update;
+
+#define SizeOfBtreeUpdate	(offsetof(xl_btree_update, ndeletedtids) + sizeof(uint16))
+
+/*
+ * This is what we need to know about a VACUUM of a leaf page.  The WAL record
+ * can represent deletion of any number of index tuples on a single index page
+ * when executed by VACUUM.  It can also support "updates" of index tuples,
+ * which is how deletes of a subset of TIDs contained in an existing posting
+ * list tuple are implemented. (Updates are only used when there will be some
+ * remaining TIDs once VACUUM finishes; otherwise the posting list tuple can
+ * just be deleted).
  *
- * Note that the WAL record in any vacuum of an index must have at least one
- * item to delete.
+ * Updated posting list tuples are represented using xl_btree_update metadata.
+ * The REDO routine uses each xl_btree_update (plus its corresponding original
+ * index tuple from the target leaf page) to generate the final updated tuple.
  */
 typedef struct xl_btree_vacuum
 {
-	uint32		ndeleted;
+	uint16		ndeleted;
+	uint16		nupdated;
 
 	/* DELETED TARGET OFFSET NUMBERS FOLLOW */
+	/* UPDATED TARGET OFFSET NUMBERS FOLLOW */
+	/* UPDATED TUPLES METADATA ARRAY FOLLOWS */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, nupdated) + sizeof(uint16))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.
@@ -245,6 +328,8 @@ typedef struct xl_btree_newroot
 extern void btree_redo(XLogReaderState *record);
 extern void btree_desc(StringInfo buf, XLogReaderState *record);
 extern const char *btree_identify(uint8 info);
+extern void btree_xlog_startup(void);
+extern void btree_xlog_cleanup(void);
 extern void btree_mask(char *pagedata, BlockNumber blkno);
 
 #endif							/* NBTXLOG_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index c88dccfb8d..6c15df7e70 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -36,7 +36,7 @@ PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL,
 PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
 PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
 PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask)
 PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask)
 PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
 PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 79430d2b7b..f2b03a6cfc 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -158,6 +158,15 @@ static relopt_bool boolRelOpts[] =
 		},
 		true
 	},
+	{
+		{
+			"deduplicate_items",
+			"Enables deduplication on btree index leaf pages",
+			RELOPT_KIND_BTREE,
+			ShareUpdateExclusiveLock
+		},
+		true
+	},
 	/* list terminator */
 	{{NULL}}
 };
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index c16eb05416..dfba5ae39a 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -276,6 +276,10 @@ BuildIndexValueDescription(Relation indexRelation,
 /*
  * Get the latestRemovedXid from the table entries pointed at by the index
  * tuples being deleted.
+ *
+ * Note: index access methods that don't consistently use the standard
+ * IndexTuple + heap TID item pointer representation will need to provide
+ * their own version of this function.
  */
 TransactionId
 index_compute_xid_horizon_for_tuples(Relation irel,
diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index bf245f5dab..d69808e78c 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = \
 	nbtcompare.o \
+	nbtdedup.o \
 	nbtinsert.o \
 	nbtpage.o \
 	nbtree.o \
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index c60a4d0d9e..6499f5adb7 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
 like a hint bit for a heap tuple), but physically removing tuples requires
 exclusive lock.  In the current code we try to remove LP_DEAD tuples when
 we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already).  Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every table TID within the posting list is known dead.)
 
 This leaves the index in a state where it has no entry for a dead tuple
 that still exists in the heap.  This is not a problem for the current
@@ -726,6 +729,134 @@ if it must.  When a page that's already full of duplicates must be split,
 the fallback strategy assumes that duplicates are mostly inserted in
 ascending heap TID order.  The page is split in a way that leaves the left
 half of the page mostly full, and the right half of the page mostly empty.
+The overall effect is that leaf page splits gracefully adapt to inserts of
+large groups of duplicates, maximizing space utilization.  Note also that
+"trapping" large groups of duplicates on the same leaf page like this makes
+deduplication more efficient.  Deduplication can be performed infrequently,
+without merging together existing posting list tuples too often.
+
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid (or at least delay) page splits.  Note that the
+goals for deduplication in unique indexes are rather different; see later
+section for details.  Deduplication alters the physical representation of
+tuples without changing the logical contents of the index, and without
+adding overhead to read queries.  Non-pivot tuples are merged together
+into a single physical tuple with a posting list (a simple array of heap
+TIDs with the standard item pointer format).  Deduplication is always
+applied lazily, at the point where it would otherwise be necessary to
+perform a page split.  It occurs only when LP_DEAD items have been
+removed, as our last line of defense against splitting a leaf page.  We
+can set the LP_DEAD bit with posting list tuples, though only when all
+TIDs are known dead.
+
+Our lazy approach to deduplication allows the page space accounting used
+during page splits to have absolutely minimal special case logic for
+posting lists.  Posting lists can be thought of as extra payload that
+suffix truncation will reliably truncate away as needed during page
+splits, just like non-key columns from an INCLUDE index tuple.
+Incoming/new tuples can generally be treated as non-overlapping plain
+items (though see section on posting list splits for information about how
+overlapping new/incoming items are really handled).
+
+The representation of posting lists is almost identical to the posting
+lists used by GIN, so it would be straightforward to apply GIN's varbyte
+encoding compression scheme to individual posting lists.  Posting list
+compression would break the assumptions made by posting list splits about
+page space accounting (see later section), so it's not clear how
+compression could be integrated with nbtree.  Besides, posting list
+compression does not offer a compelling trade-off for nbtree, since in
+general nbtree is optimized for consistent performance with many
+concurrent readers and writers.
+
+A major goal of our lazy approach to deduplication is to limit the
+performance impact of deduplication with random updates.  Even concurrent
+append-only inserts of the same key value will tend to have inserts of
+individual index tuples in an order that doesn't quite match heap TID
+order.  Delaying deduplication minimizes page level fragmentation.
+
+Deduplication in unique indexes
+-------------------------------
+
+Very often, the range of values that can be placed on a given leaf page in
+a unique index is fixed and permanent.  For example, a primary key on an
+identity column will usually only have page splits caused by the insertion
+of new logical rows within the rightmost leaf page.  If there is a split
+of a non-rightmost leaf page, then the split must have been triggered by
+inserts associated with an UPDATE of an existing logical row.  Splitting a
+leaf page purely to store multiple versions should be considered
+pathological, since it permanently degrades the index structure in order
+to absorb a temporary burst of duplicates.  Deduplication in unique
+indexes helps to prevent these pathological page splits.  Storing
+duplicates in a space efficient manner is not the goal, since in the long
+run there won't be any duplicates anyway.  Rather, we're buying time for
+standard garbage collection mechanisms to run before a page split is
+needed.
+
+Unique index leaf pages only get a deduplication pass when an insertion
+(that might have to split the page) observed an existing duplicate on the
+page in passing.  This is based on the assumption that deduplication will
+only work out when _all_ new insertions are duplicates from UPDATEs.  This
+may mean that we miss an opportunity to delay a page split, but that's
+okay because our ultimate goal is to delay leaf page splits _indefinitely_
+(i.e. to prevent them altogether).  There is little point in trying to
+delay a split that is probably inevitable anyway.  This allows us to avoid
+the overhead of attempting to deduplicate with unique indexes that always
+have few or no duplicates.
+
+Posting list splits
+-------------------
+
+When the incoming tuple happens to overlap with an existing posting list,
+a posting list split is performed.  Like a page split, a posting list
+split resolves a situation where a new/incoming item "won't fit", while
+inserting the incoming item in passing (i.e. as part of the same atomic
+action).  It's possible (though not particularly likely) that an insert of
+a new item on to an almost-full page will overlap with a posting list,
+resulting in both a posting list split and a page split.  Even then, the
+atomic action that splits the posting list also inserts the new item
+(since page splits always insert the new item in passing).  Including the
+posting list split in the same atomic action as the insert avoids problems
+caused by concurrent inserts into the same posting list --  the exact
+details of how we change the posting list depend upon the new item, and
+vice-versa.  A single atomic action also minimizes the volume of extra
+WAL required for a posting list split, since we don't have to explicitly
+WAL-log the original posting list tuple.
+
+Despite piggy-backing on the same atomic action that inserts a new tuple,
+posting list splits can be thought of as a separate, extra action to the
+insert itself (or to the page split itself).  Posting list splits
+conceptually "rewrite" an insert that overlaps with an existing posting
+list into an insert that adds its final new item just to the right of the
+posting list instead.  The size of the posting list won't change, and so
+page space accounting code does not need to care about posting list splits
+at all.  This is an important upside of our design; the page split point
+choice logic is very subtle even without it needing to deal with posting
+list splits.
+
+Only a few isolated extra steps are required to preserve the illusion that
+the new item never overlapped with an existing posting list in the first
+place: the heap TID of the incoming tuple is swapped with the rightmost/max
+heap TID from the existing/originally overlapping posting list.  Also, the
+posting-split-with-page-split case must generate a new high key based on
+an imaginary version of the original page that has both the final new item
+and the after-list-split posting tuple (page splits usually just operate
+against an imaginary version that contains the new item/item that won't
+fit).
+
+This approach avoids inventing an "eager" atomic posting split operation
+that splits the posting list without simultaneously finishing the insert
+of the incoming item.  This alternative design might seem cleaner, but it
+creates subtle problems for page space accounting.  In general, there
+might not be enough free space on the page to split a posting list such
+that the incoming/new item no longer overlaps with either posting list
+half --- the operation could fail before the actual retail insert of the
+new item even begins.  We'd end up having to handle posting list splits
+that need a page split anyway.  Besides, supporting variable "split points"
+while splitting posting lists won't actually improve overall space
+utilization.
 
 Notes About Data Representation
 -------------------------------
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
new file mode 100644
index 0000000000..1ef73ddf70
--- /dev/null
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -0,0 +1,830 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtdedup.c
+ *	  Deduplicate items in Postgres btrees.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/nbtree/nbtdedup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/nbtxlog.h"
+#include "miscadmin.h"
+#include "utils/rel.h"
+
+static bool _bt_do_singleval(Relation rel, Page page, BTDedupState state,
+							 OffsetNumber minoff, IndexTuple newitem);
+static void _bt_singleval_fillfactor(Page page, BTDedupState state,
+									 Size newitemsz);
+#ifdef USE_ASSERT_CHECKING
+static bool _bt_posting_valid(IndexTuple posting);
+#endif
+
+/*
+ * Deduplicate items on a leaf page.  The page will have to be split by caller
+ * if we cannot successfully free at least newitemsz (we also need space for
+ * newitem's line pointer, which isn't included in caller's newitemsz).
+ *
+ * The general approach taken here is to perform as much deduplication as
+ * possible to free as much space as possible.  Note, however, that "single
+ * value" strategy is sometimes used for !checkingunique callers, in which
+ * case deduplication will leave a few tuples untouched at the end of the
+ * page.  The general idea is to prepare the page for an anticipated page
+ * split that uses nbtsplitloc.c's "single value" strategy to determine a
+ * split point.  (There is no reason to deduplicate items that will end up on
+ * the right half of the page after the anticipated page split; better to
+ * handle those if and when the anticipated right half page gets its own
+ * deduplication pass, following further inserts of duplicates.)
+ *
+ * This function should be called during insertion, when the page doesn't have
+ * enough space to fit an incoming newitem.  If the BTP_HAS_GARBAGE page flag
+ * was set, caller should have removed any LP_DEAD items by calling
+ * _bt_vacuum_one_page() before calling here.  We may still have to kill
+ * LP_DEAD items here when the page's BTP_HAS_GARBAGE hint is falsely unset,
+ * but that should be rare.  Also, _bt_vacuum_one_page() won't unset the
+ * BTP_HAS_GARBAGE flag when it finds no LP_DEAD items, so a successful
+ * deduplication pass will always clear it, just to keep things tidy.
+ */
+void
+_bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
+				   IndexTuple newitem, Size newitemsz, bool checkingunique)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buf);
+	BTPageOpaque opaque;
+	Page		newpage;
+	int			newpagendataitems = 0;
+	OffsetNumber deletable[MaxIndexTuplesPerPage];
+	BTDedupState state;
+	int			ndeletable = 0;
+	Size		pagesaving = 0;
+	bool		singlevalstrat = false;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+
+	/*
+	 * We can't assume that there are no LP_DEAD items.  For one thing, VACUUM
+	 * will clear the BTP_HAS_GARBAGE hint without reliably removing items
+	 * that are marked LP_DEAD.  We don't want to unnecessarily unset LP_DEAD
+	 * bits when deduplicating items.  Allowing it would be correct, though
+	 * wasteful.
+	 */
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	minoff = P_FIRSTDATAKEY(opaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+
+		if (ItemIdIsDead(itemid))
+			deletable[ndeletable++] = offnum;
+	}
+
+	if (ndeletable > 0)
+	{
+		_bt_delitems_delete(rel, buf, deletable, ndeletable, heapRel);
+
+		/*
+		 * Return when a split will be avoided.  This is equivalent to
+		 * avoiding a split using the usual _bt_vacuum_one_page() path.
+		 */
+		if (PageGetFreeSpace(page) >= newitemsz)
+			return;
+
+		/*
+		 * Reconsider number of items on page, in case _bt_delitems_delete()
+		 * managed to delete an item or two
+		 */
+		minoff = P_FIRSTDATAKEY(opaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+	}
+
+	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+	newitemsz += sizeof(ItemIdData);
+
+	/*
+	 * By here, it's clear that deduplication will definitely be attempted.
+	 * Initialize deduplication state.
+	 *
+	 * It would be possible for maxpostingsize (limit on posting list tuple
+	 * size) to be set to one third of the page.  However, it seems like a
+	 * good idea to limit the size of posting lists to one sixth of a page.
+	 * That ought to leave us with a good split point when pages full of
+	 * duplicates can be split several times.
+	 */
+	state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+	state->deduplicate = true;
+	state->maxpostingsize = Min(BTMaxItemSize(page) / 2, INDEX_SIZE_MASK);
+	/* Metadata about base tuple of current pending posting list */
+	state->base = NULL;
+	state->baseoff = InvalidOffsetNumber;
+	state->basetupsize = 0;
+	/* Metadata about current pending posting list TIDs */
+	state->htids = palloc(state->maxpostingsize);
+	state->nhtids = 0;
+	state->nitems = 0;
+	/* Size of all physical tuples to be replaced by pending posting list */
+	state->phystupsize = 0;
+	/* nintervals should be initialized to zero */
+	state->nintervals = 0;
+
+	/* Determine if "single value" strategy should be used */
+	if (!checkingunique)
+		singlevalstrat = _bt_do_singleval(rel, page, state, minoff, newitem);
+
+	/*
+	 * Deduplicate items from page, and write them to newpage.
+	 *
+	 * Copy the original page's LSN into newpage copy.  This will become the
+	 * updated version of the page.  We need this because XLogInsert will
+	 * examine the LSN and possibly dump it in a page image.
+	 */
+	newpage = PageGetTempPageCopySpecial(page);
+	PageSetLSN(newpage, PageGetLSN(page));
+
+	/* Copy high key, if any */
+	if (!P_RIGHTMOST(opaque))
+	{
+		ItemId		hitemid = PageGetItemId(page, P_HIKEY);
+		Size		hitemsz = ItemIdGetLength(hitemid);
+		IndexTuple	hitem = (IndexTuple) PageGetItem(page, hitemid);
+
+		if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to add highkey during deduplication");
+	}
+
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(!ItemIdIsDead(itemid));
+
+		if (offnum == minoff)
+		{
+			/*
+			 * No previous/base tuple for the data item -- use the data item
+			 * as base tuple of pending posting list
+			 */
+			_bt_dedup_start_pending(state, itup, offnum);
+		}
+		else if (state->deduplicate &&
+				 _bt_keep_natts_fast(rel, state->base, itup) > natts &&
+				 _bt_dedup_save_htid(state, itup))
+		{
+			/*
+			 * Tuple is equal to base tuple of pending posting list.  Heap
+			 * TID(s) for itup have been saved in state.
+			 */
+		}
+		else
+		{
+			/*
+			 * Tuple is not equal to pending posting list tuple, or
+			 * _bt_dedup_save_htid() opted to not merge current item into
+			 * pending posting list for some other reason (e.g., adding more
+			 * TIDs would have caused posting list to exceed current
+			 * maxpostingsize).
+			 *
+			 * If state contains pending posting list with more than one item,
+			 * form new posting tuple, and actually update the page.  Else
+			 * reset the state and move on without modifying the page.
+			 */
+			pagesaving += _bt_dedup_finish_pending(newpage, state);
+			newpagendataitems++;
+
+			if (singlevalstrat)
+			{
+				/*
+				 * Single value strategy's extra steps.
+				 *
+				 * Lower maxpostingsize for sixth and final item that might be
+				 * deduplicated by current deduplication pass.  When sixth
+				 * item formed/observed, stop deduplicating items.
+				 *
+				 * Note: It's possible that this will be reached even when
+				 * current deduplication pass has yet to merge together some
+				 * existing items.  It doesn't matter whether or not the
+				 * current call generated the maxpostingsize-capped duplicate
+				 * tuples at the start of the page.
+				 */
+				if (newpagendataitems == 5)
+					_bt_singleval_fillfactor(page, state, newitemsz);
+				else if (newpagendataitems == 6)
+				{
+					state->deduplicate = false;
+					singlevalstrat = false; /* won't be back here */
+				}
+			}
+
+			/* itup starts new pending posting list */
+			_bt_dedup_start_pending(state, itup, offnum);
+		}
+	}
+
+	/* Handle the last item */
+	pagesaving += _bt_dedup_finish_pending(newpage, state);
+	newpagendataitems++;
+
+	/*
+	 * If no items suitable for deduplication were found, newpage must be
+	 * exactly the same as the original page, so just return from function.
+	 *
+	 * We could determine whether or not to proceed on the basis the space
+	 * savings being sufficient to avoid an immediate page split instead.  We
+	 * don't do that because there is some small value in nbtsplitloc.c always
+	 * operating against a page that is fully deduplicated (apart from
+	 * newitem).  Besides, most of the cost has already been paid.
+	 */
+	if (state->nintervals == 0)
+	{
+		/* cannot leak memory here */
+		pfree(newpage);
+		pfree(state->htids);
+		pfree(state);
+		return;
+	}
+
+	/*
+	 * By here, it's clear that deduplication will definitely go ahead.
+	 *
+	 * Clear the BTP_HAS_GARBAGE page flag in the unlikely event that it is
+	 * still falsely set, just to keep things tidy.  (We can't rely on
+	 * _bt_vacuum_one_page() having done this already, and we can't rely on a
+	 * page split or VACUUM getting to it in the near future.)
+	 */
+	if (P_HAS_GARBAGE(opaque))
+	{
+		BTPageOpaque nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+	}
+
+	START_CRIT_SECTION();
+
+	PageRestoreTempPage(newpage, page);
+	MarkBufferDirty(buf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+		xl_btree_dedup xlrec_dedup;
+
+		xlrec_dedup.nintervals = state->nintervals;
+
+		XLogBeginInsert();
+		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+		XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+		/*
+		 * The intervals array is not in the buffer, but pretend that it is.
+		 * When XLogInsert stores the whole buffer, the array need not be
+		 * stored too.
+		 */
+		XLogRegisterBufData(0, (char *) state->intervals,
+							state->nintervals * sizeof(BTDedupInterval));
+
+		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP);
+
+		PageSetLSN(page, recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	/* Local space accounting should agree with page accounting */
+	Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
+
+	/* cannot leak memory here */
+	pfree(state->htids);
+	pfree(state);
+}
+
+/*
+ * Create a new pending posting list tuple based on caller's base tuple.
+ *
+ * Every tuple processed by deduplication either becomes the base tuple for a
+ * posting list, or gets its heap TID(s) accepted into a pending posting list.
+ * A tuple that starts out as the base tuple for a posting list will only
+ * actually be rewritten within _bt_dedup_finish_pending() when it turns out
+ * that there are duplicates that can be merged into the base tuple.
+ */
+void
+_bt_dedup_start_pending(BTDedupState state, IndexTuple base,
+						OffsetNumber baseoff)
+{
+	Assert(state->nhtids == 0);
+	Assert(state->nitems == 0);
+	Assert(!BTreeTupleIsPivot(base));
+
+	/*
+	 * Copy heap TID(s) from new base tuple for new candidate posting list
+	 * into working state's array
+	 */
+	if (!BTreeTupleIsPosting(base))
+	{
+		memcpy(state->htids, &base->t_tid, sizeof(ItemPointerData));
+		state->nhtids = 1;
+		state->basetupsize = IndexTupleSize(base);
+	}
+	else
+	{
+		int			nposting;
+
+		nposting = BTreeTupleGetNPosting(base);
+		memcpy(state->htids, BTreeTupleGetPosting(base),
+			   sizeof(ItemPointerData) * nposting);
+		state->nhtids = nposting;
+		/* basetupsize should not include existing posting list */
+		state->basetupsize = BTreeTupleGetPostingOffset(base);
+	}
+
+	/*
+	 * Save new base tuple itself -- it'll be needed if we actually create a
+	 * new posting list from new pending posting list.
+	 *
+	 * Must maintain physical size of all existing tuples (including line
+	 * pointer overhead) so that we can calculate space savings on page.
+	 */
+	state->nitems = 1;
+	state->base = base;
+	state->baseoff = baseoff;
+	state->phystupsize = MAXALIGN(IndexTupleSize(base)) + sizeof(ItemIdData);
+	/* Also save baseoff in pending state for interval */
+	state->intervals[state->nintervals].baseoff = state->baseoff;
+}
+
+/*
+ * Save itup heap TID(s) into pending posting list where possible.
+ *
+ * Returns bool indicating if the pending posting list managed by state now
+ * includes itup's heap TID(s).
+ */
+bool
+_bt_dedup_save_htid(BTDedupState state, IndexTuple itup)
+{
+	int			nhtids;
+	ItemPointer htids;
+	Size		mergedtupsz;
+
+	Assert(!BTreeTupleIsPivot(itup));
+
+	if (!BTreeTupleIsPosting(itup))
+	{
+		nhtids = 1;
+		htids = &itup->t_tid;
+	}
+	else
+	{
+		nhtids = BTreeTupleGetNPosting(itup);
+		htids = BTreeTupleGetPosting(itup);
+	}
+
+	/*
+	 * Don't append (have caller finish pending posting list as-is) if
+	 * appending heap TID(s) from itup would put us over maxpostingsize limit.
+	 *
+	 * This calculation needs to match the code used within _bt_form_posting()
+	 * for new posting list tuples.
+	 */
+	mergedtupsz = MAXALIGN(state->basetupsize +
+						   (state->nhtids + nhtids) * sizeof(ItemPointerData));
+
+	if (mergedtupsz > state->maxpostingsize)
+		return false;
+
+	/*
+	 * Save heap TIDs to pending posting list tuple -- itup can be merged into
+	 * pending posting list
+	 */
+	state->nitems++;
+	memcpy(state->htids + state->nhtids, htids,
+		   sizeof(ItemPointerData) * nhtids);
+	state->nhtids += nhtids;
+	state->phystupsize += MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
+
+	return true;
+}
+
+/*
+ * Finalize pending posting list tuple, and add it to the page.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * Returns space saving from deduplicating to make a new posting list tuple.
+ * Note that this includes line pointer overhead.  This is zero in the case
+ * where no deduplication was possible.
+ */
+Size
+_bt_dedup_finish_pending(Page newpage, BTDedupState state)
+{
+	OffsetNumber tupoff;
+	Size		tuplesz;
+	Size		spacesaving;
+
+	Assert(state->nitems > 0);
+	Assert(state->nitems <= state->nhtids);
+	Assert(state->intervals[state->nintervals].baseoff == state->baseoff);
+
+	tupoff = OffsetNumberNext(PageGetMaxOffsetNumber(newpage));
+	if (state->nitems == 1)
+	{
+		/* Use original, unchanged base tuple */
+		tuplesz = IndexTupleSize(state->base);
+		if (PageAddItem(newpage, (Item) state->base, tuplesz, tupoff,
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "deduplication failed to add tuple to page");
+
+		spacesaving = 0;
+	}
+	else
+	{
+		IndexTuple	final;
+
+		/* Form a tuple with a posting list */
+		final = _bt_form_posting(state->base, state->htids, state->nhtids);
+		tuplesz = IndexTupleSize(final);
+		Assert(tuplesz <= state->maxpostingsize);
+
+		/* Save final number of items for posting list */
+		state->intervals[state->nintervals].nitems = state->nitems;
+
+		Assert(tuplesz == MAXALIGN(IndexTupleSize(final)));
+		if (PageAddItem(newpage, (Item) final, tuplesz, tupoff, false,
+						false) == InvalidOffsetNumber)
+			elog(ERROR, "deduplication failed to add tuple to page");
+
+		pfree(final);
+		spacesaving = state->phystupsize - (tuplesz + sizeof(ItemIdData));
+		/* Increment nintervals, since we wrote a new posting list tuple */
+		state->nintervals++;
+		Assert(spacesaving > 0 && spacesaving < BLCKSZ);
+	}
+
+	/* Reset state for next pending posting list */
+	state->nhtids = 0;
+	state->nitems = 0;
+	state->phystupsize = 0;
+
+	return spacesaving;
+}
+
+/*
+ * Determine if page non-pivot tuples (data items) are all duplicates of the
+ * same value -- if they are, deduplication's "single value" strategy should
+ * be applied.  The general goal of this strategy is to ensure that
+ * nbtsplitloc.c (which uses its own single value strategy) will find a useful
+ * split point as further duplicates are inserted, and successive rightmost
+ * page splits occur among pages that store the same duplicate value.  When
+ * the page finally splits, it should end up BTREE_SINGLEVAL_FILLFACTOR% full,
+ * just like it would if deduplication were disabled.
+ *
+ * We expect that affected workloads will require _several_ single value
+ * strategy deduplication passes (over a page that only stores duplicates)
+ * before the page is finally split.  The first deduplication pass should only
+ * find regular non-pivot tuples.  Later deduplication passes will find
+ * existing maxpostingsize-capped posting list tuples, which must be skipped
+ * over.  The penultimate pass is generally the first pass that actually
+ * reaches _bt_singleval_fillfactor(), and so will deliberately leave behind a
+ * few untouched non-pivot tuples.  The final deduplication pass won't free
+ * any space -- it will skip over everything without merging anything (it
+ * retraces the steps of the penultimate pass).
+ *
+ * Fortunately, having several passes isn't too expensive.  Each pass (after
+ * the first pass) won't spend many cycles on the large posting list tuples
+ * left by previous passes.  Each pass will find a large contiguous group of
+ * smaller duplicate tuples to merge together at the end of the page.
+ *
+ * Note: We deliberately don't bother checking if the high key is a distinct
+ * value (prior to the TID tiebreaker column) before proceeding, unlike
+ * nbtsplitloc.c.  Its single value strategy only gets applied on the
+ * rightmost page of duplicates of the same value (other leaf pages full of
+ * duplicates will get a simple 50:50 page split instead of splitting towards
+ * the end of the page).  There is little point in making the same distinction
+ * here.
+ */
+static bool
+_bt_do_singleval(Relation rel, Page page, BTDedupState state,
+				 OffsetNumber minoff, IndexTuple newitem)
+{
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+	ItemId		itemid;
+	IndexTuple	itup;
+
+	itemid = PageGetItemId(page, minoff);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+
+	if (_bt_keep_natts_fast(rel, newitem, itup) > natts)
+	{
+		itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		if (_bt_keep_natts_fast(rel, newitem, itup) > natts)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * Lower maxpostingsize when using "single value" strategy, to avoid a sixth
+ * and final maxpostingsize-capped tuple.  The sixth and final posting list
+ * tuple will end up somewhat smaller than the first five.  (Note: The first
+ * five tuples could actually just be very large duplicate tuples that
+ * couldn't be merged together at all.  Deduplication will simply not modify
+ * the page when that happens.)
+ *
+ * When there are six posting lists on the page (after current deduplication
+ * pass goes on to create/observe a sixth very large tuple), caller should end
+ * its deduplication pass.  It isn't useful to try to deduplicate items that
+ * are supposed to end up on the new right sibling page following the
+ * anticipated page split.  A future deduplication pass of future right
+ * sibling page might take care of it.  (This is why the first single value
+ * strategy deduplication pass for a given leaf page will generally find only
+ * plain non-pivot tuples -- see _bt_do_singleval() comments.)
+ */
+static void
+_bt_singleval_fillfactor(Page page, BTDedupState state, Size newitemsz)
+{
+	Size		leftfree;
+	int			reduction;
+
+	/* This calculation needs to match nbtsplitloc.c */
+	leftfree = PageGetPageSize(page) - SizeOfPageHeaderData -
+		MAXALIGN(sizeof(BTPageOpaqueData));
+	/* Subtract size of new high key (includes pivot heap TID space) */
+	leftfree -= newitemsz + MAXALIGN(sizeof(ItemPointerData));
+
+	/*
+	 * Reduce maxpostingsize by an amount equal to target free space on left
+	 * half of page
+	 */
+	reduction = leftfree * ((100 - BTREE_SINGLEVAL_FILLFACTOR) / 100.0);
+	if (state->maxpostingsize > reduction)
+		state->maxpostingsize -= reduction;
+	else
+		state->maxpostingsize = 0;
+}
+
+/*
+ * Build a posting list tuple based on caller's "base" index tuple and list of
+ * heap TIDs.  When nhtids == 1, builds a standard non-pivot tuple without a
+ * posting list. (Posting list tuples can never have a single heap TID, partly
+ * because that ensures that deduplication always reduces final MAXALIGN()'d
+ * size of entire tuple.)
+ *
+ * Convention is that posting list starts at a MAXALIGN()'d offset (rather
+ * than a SHORTALIGN()'d offset), in line with the approach taken when
+ * appending a heap TID to new pivot tuple/high key during suffix truncation.
+ * This sometimes wastes a little space that was only needed as alignment
+ * padding in the original tuple.  Following this convention simplifies the
+ * space accounting used when deduplicating a page (the same convention
+ * simplifies the accounting for choosing a point to split a page at).
+ *
+ * Note: Caller's "htids" array must be unique and already in ascending TID
+ * order.  Any existing heap TIDs from "base" won't automatically appear in
+ * returned posting list tuple (they must be included in htids array.)
+ */
+IndexTuple
+_bt_form_posting(IndexTuple base, ItemPointer htids, int nhtids)
+{
+	uint32		keysize,
+				newsize;
+	IndexTuple	itup;
+
+	if (BTreeTupleIsPosting(base))
+		keysize = BTreeTupleGetPostingOffset(base);
+	else
+		keysize = IndexTupleSize(base);
+
+	Assert(!BTreeTupleIsPivot(base));
+	Assert(nhtids > 0 && nhtids <= PG_UINT16_MAX);
+	Assert(keysize == MAXALIGN(keysize));
+
+	/* Determine final size of new tuple */
+	if (nhtids > 1)
+		newsize = MAXALIGN(keysize +
+						   nhtids * sizeof(ItemPointerData));
+	else
+		newsize = keysize;
+
+	Assert(newsize <= INDEX_SIZE_MASK);
+	Assert(newsize == MAXALIGN(newsize));
+
+	/* Allocate memory using palloc0() (matches index_form_tuple()) */
+	itup = palloc0(newsize);
+	memcpy(itup, base, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+	if (nhtids > 1)
+	{
+		/* Form posting list tuple */
+		BTreeTupleSetPosting(itup, nhtids, keysize);
+		memcpy(BTreeTupleGetPosting(itup), htids,
+			   sizeof(ItemPointerData) * nhtids);
+		Assert(_bt_posting_valid(itup));
+	}
+	else
+	{
+		/* Form standard non-pivot tuple */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		ItemPointerCopy(htids, &itup->t_tid);
+		Assert(ItemPointerIsValid(&itup->t_tid));
+	}
+
+	return itup;
+}
+
+/*
+ * Generate a replacement tuple by "updating" a posting list tuple so that it
+ * no longer has TIDs that need to be deleted.
+ *
+ * Used by VACUUM.  Caller's vacposting argument points to the existing
+ * posting list tuple to be updated.
+ *
+ * On return, caller's vacposting argument will point to final "updated"
+ * tuple, which will be palloc()'d in caller's memory context.
+ */
+void
+_bt_update_posting(BTVacuumPosting vacposting)
+{
+	IndexTuple	origtuple = vacposting->itup;
+	uint32		keysize,
+				newsize;
+	IndexTuple	itup;
+	int			nhtids;
+	int			ui,
+				d;
+	ItemPointer htids;
+
+	nhtids = BTreeTupleGetNPosting(origtuple) - vacposting->ndeletedtids;
+
+	Assert(_bt_posting_valid(origtuple));
+	Assert(nhtids > 0 && nhtids < BTreeTupleGetNPosting(origtuple));
+
+	if (BTreeTupleIsPosting(origtuple))
+		keysize = BTreeTupleGetPostingOffset(origtuple);
+	else
+		keysize = IndexTupleSize(origtuple);
+
+	/*
+	 * Determine final size of new tuple.
+	 *
+	 * This calculation needs to match the code used within _bt_form_posting()
+	 * for new posting list tuples.  We avoid calling _bt_form_posting() here
+	 * to save ourselves a second memory allocation for a htids workspace.
+	 */
+	if (nhtids > 1)
+		newsize = MAXALIGN(keysize +
+						   nhtids * sizeof(ItemPointerData));
+	else
+		newsize = keysize;
+
+	/* Allocate memory using palloc0() (matches index_form_tuple()) */
+	itup = palloc0(newsize);
+	memcpy(itup, origtuple, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+
+	if (nhtids > 1)
+	{
+		/* Form posting list tuple */
+		BTreeTupleSetPosting(itup, nhtids, keysize);
+		htids = BTreeTupleGetPosting(itup);
+	}
+	else
+	{
+		/* Form standard non-pivot tuple */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		htids = &itup->t_tid;
+	}
+
+	ui = 0;
+	d = 0;
+	for (int i = 0; i < BTreeTupleGetNPosting(origtuple); i++)
+	{
+		if (d < vacposting->ndeletedtids && vacposting->deletetids[d] == i)
+		{
+			d++;
+			continue;
+		}
+		htids[ui++] = *BTreeTupleGetPostingN(origtuple, i);
+	}
+	Assert(ui == nhtids);
+	Assert(d == vacposting->ndeletedtids);
+	Assert(nhtids == 1 || _bt_posting_valid(itup));
+
+	/* vacposting arg's itup will now point to updated version */
+	vacposting->itup = itup;
+}
+
+/*
+ * Prepare for a posting list split by swapping heap TID in newitem with heap
+ * TID from original posting list (the 'oposting' heap TID located at offset
+ * 'postingoff').  Modifies newitem, so caller should pass their own private
+ * copy that can safely be modified.
+ *
+ * Returns new posting list tuple, which is palloc()'d in caller's context.
+ * This is guaranteed to be the same size as 'oposting'.  Modified newitem is
+ * what caller actually inserts. (This happens inside the same critical
+ * section that performs an in-place update of old posting list using new
+ * posting list returned here.)
+ *
+ * While the keys from newitem and oposting must be opclass equal, and must
+ * generate identical output when run through the underlying type's output
+ * function, it doesn't follow that their representations match exactly.
+ * Caller must avoid assuming that there can't be representational differences
+ * that make datums from oposting bigger or smaller than the corresponding
+ * datums from newitem.  For example, differences in TOAST input state might
+ * break a faulty assumption about tuple size (the executor is entitled to
+ * apply TOAST compression based on its own criteria).  It also seems possible
+ * that further representational variation will be introduced in the future,
+ * in order to support nbtree features like page-level prefix compression.
+ *
+ * See nbtree/README for details on the design of posting list splits.
+ */
+IndexTuple
+_bt_swap_posting(IndexTuple newitem, IndexTuple oposting, int postingoff)
+{
+	int			nhtids;
+	char	   *replacepos;
+	char	   *replaceposright;
+	Size		nmovebytes;
+	IndexTuple	nposting;
+
+	nhtids = BTreeTupleGetNPosting(oposting);
+	Assert(_bt_posting_valid(oposting));
+	Assert(postingoff > 0 && postingoff < nhtids);
+
+	/*
+	 * Move item pointers in posting list to make a gap for the new item's
+	 * heap TID.  We shift TIDs one place to the right, losing original
+	 * rightmost TID. (nmovebytes must not include TIDs to the left of
+	 * postingoff, nor the existing rightmost/max TID that gets overwritten.)
+	 */
+	nposting = CopyIndexTuple(oposting);
+	replacepos = (char *) BTreeTupleGetPostingN(nposting, postingoff);
+	replaceposright = (char *) BTreeTupleGetPostingN(nposting, postingoff + 1);
+	nmovebytes = (nhtids - postingoff - 1) * sizeof(ItemPointerData);
+	memmove(replaceposright, replacepos, nmovebytes);
+
+	/* Fill the gap at postingoff with TID of new item (original new TID) */
+	Assert(!BTreeTupleIsPivot(newitem) && !BTreeTupleIsPosting(newitem));
+	ItemPointerCopy(&newitem->t_tid, (ItemPointer) replacepos);
+
+	/* Now copy oposting's rightmost/max TID into new item (final new TID) */
+	ItemPointerCopy(BTreeTupleGetMaxHeapTID(oposting), &newitem->t_tid);
+
+	Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(nposting),
+							  BTreeTupleGetHeapTID(newitem)) < 0);
+	Assert(_bt_posting_valid(nposting));
+
+	return nposting;
+}
+
+/*
+ * Verify posting list invariants for "posting", which must be a posting list
+ * tuple.  Used within assertions.
+ */
+#ifdef USE_ASSERT_CHECKING
+static bool
+_bt_posting_valid(IndexTuple posting)
+{
+	ItemPointerData last;
+	ItemPointer htid;
+
+	if (!BTreeTupleIsPosting(posting) || BTreeTupleGetNPosting(posting) < 2)
+		return false;
+
+	/* Remember first heap TID for loop */
+	ItemPointerCopy(BTreeTupleGetHeapTID(posting), &last);
+	if (!ItemPointerIsValid(&last))
+		return false;
+
+	/* Iterate, starting from second TID */
+	for (int i = 1; i < BTreeTupleGetNPosting(posting); i++)
+	{
+		htid = BTreeTupleGetPostingN(posting, i);
+
+		if (!ItemPointerIsValid(htid))
+			return false;
+		if (ItemPointerCompare(htid, &last) <= 0)
+			return false;
+		ItemPointerCopy(htid, &last);
+	}
+
+	return true;
+}
+#endif
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 4e5849ab8e..b913543221 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,10 +47,12 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   BTStack stack,
 						   IndexTuple itup,
 						   OffsetNumber newitemoff,
+						   int postingoff,
 						   bool split_only_page);
 static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 						Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
-						IndexTuple newitem);
+						IndexTuple newitem, IndexTuple orignewitem,
+						IndexTuple nposting, uint16 postingoff);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 							  BTStack stack, bool is_root, bool is_only);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
@@ -125,6 +127,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	insertstate.itup_key = itup_key;
 	insertstate.bounds_valid = false;
 	insertstate.buf = InvalidBuffer;
+	insertstate.postingoff = 0;
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -295,7 +298,7 @@ top:
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
 		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
-					   itup, newitemoff, false);
+					   itup, newitemoff, insertstate.postingoff, false);
 	}
 	else
 	{
@@ -340,6 +343,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 				 uint32 *speculativeToken)
 {
 	IndexTuple	itup = insertstate->itup;
+	IndexTuple	curitup;
+	ItemId		curitemid;
 	BTScanInsert itup_key = insertstate->itup_key;
 	SnapshotData SnapshotDirty;
 	OffsetNumber offset;
@@ -348,6 +353,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 	BTPageOpaque opaque;
 	Buffer		nbuf = InvalidBuffer;
 	bool		found = false;
+	bool		inposting = false;
+	bool		prevalldead = true;
+	int			curposti = 0;
 
 	/* Assume unique until we find a duplicate */
 	*is_unique = true;
@@ -375,13 +383,21 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 	Assert(itup_key->scantid == NULL);
 	for (;;)
 	{
-		ItemId		curitemid;
-		IndexTuple	curitup;
-		BlockNumber nblkno;
-
 		/*
-		 * make sure the offset points to an actual item before trying to
-		 * examine it...
+		 * Each iteration of the loop processes one heap TID, not one index
+		 * tuple.  Current offset number for page isn't usually advanced on
+		 * iterations that process heap TIDs from posting list tuples.
+		 *
+		 * "inposting" state is set when _inside_ a posting list --- not when
+		 * we're at the start (or end) of a posting list.  We advance curposti
+		 * at the end of the iteration when inside a posting list tuple.  In
+		 * general, every loop iteration either advances the page offset or
+		 * advances curposti --- an iteration that handles the rightmost/max
+		 * heap TID in a posting list finally advances the page offset (and
+		 * unsets "inposting").
+		 *
+		 * Make sure the offset points to an actual index tuple before trying
+		 * to examine it...
 		 */
 		if (offset <= maxoff)
 		{
@@ -406,31 +422,60 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 				break;
 			}
 
-			curitemid = PageGetItemId(page, offset);
-
 			/*
-			 * We can skip items that are marked killed.
+			 * We can skip items that are already marked killed.
 			 *
 			 * In the presence of heavy update activity an index may contain
 			 * many killed items with the same key; running _bt_compare() on
 			 * each killed item gets expensive.  Just advance over killed
 			 * items as quickly as we can.  We only apply _bt_compare() when
-			 * we get to a non-killed item.  Even those comparisons could be
-			 * avoided (in the common case where there is only one page to
-			 * visit) by reusing bounds, but just skipping dead items is fast
-			 * enough.
+			 * we get to a non-killed item.  We could reuse the bounds to
+			 * avoid _bt_compare() calls for known equal tuples, but it
+			 * doesn't seem worth it.  Workloads with heavy update activity
+			 * tend to have many deduplication passes, so we'll often avoid
+			 * most of those comparisons, too (we call _bt_compare() when the
+			 * posting list tuple is initially encountered, though not when
+			 * processing later TIDs from the same tuple).
 			 */
-			if (!ItemIdIsDead(curitemid))
+			if (!inposting)
+				curitemid = PageGetItemId(page, offset);
+			if (inposting || !ItemIdIsDead(curitemid))
 			{
 				ItemPointerData htid;
 				bool		all_dead;
 
-				if (_bt_compare(rel, itup_key, page, offset) != 0)
-					break;		/* we're past all the equal tuples */
+				if (!inposting)
+				{
+					/* Plain tuple, or first TID in posting list tuple */
+					if (_bt_compare(rel, itup_key, page, offset) != 0)
+						break;	/* we're past all the equal tuples */
 
-				/* okay, we gotta fetch the heap tuple ... */
-				curitup = (IndexTuple) PageGetItem(page, curitemid);
-				htid = curitup->t_tid;
+					/* Advanced curitup */
+					curitup = (IndexTuple) PageGetItem(page, curitemid);
+					Assert(!BTreeTupleIsPivot(curitup));
+				}
+
+				/* okay, we gotta fetch the heap tuple using htid ... */
+				if (!BTreeTupleIsPosting(curitup))
+				{
+					/* ... htid is from simple non-pivot tuple */
+					Assert(!inposting);
+					htid = curitup->t_tid;
+				}
+				else if (!inposting)
+				{
+					/* ... htid is first TID in new posting list */
+					inposting = true;
+					prevalldead = true;
+					curposti = 0;
+					htid = *BTreeTupleGetPostingN(curitup, 0);
+				}
+				else
+				{
+					/* ... htid is second or subsequent TID in posting list */
+					Assert(curposti > 0);
+					htid = *BTreeTupleGetPostingN(curitup, curposti);
+				}
 
 				/*
 				 * If we are doing a recheck, we expect to find the tuple we
@@ -506,8 +551,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					 * not part of this chain because it had a different index
 					 * entry.
 					 */
-					htid = itup->t_tid;
-					if (table_index_fetch_tuple_check(heapRel, &htid,
+					if (table_index_fetch_tuple_check(heapRel, &itup->t_tid,
 													  SnapshotSelf, NULL))
 					{
 						/* Normal case --- it's still live */
@@ -565,12 +609,14 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 													RelationGetRelationName(rel))));
 					}
 				}
-				else if (all_dead)
+				else if (all_dead && (!inposting ||
+									  (prevalldead &&
+									   curposti == BTreeTupleGetNPosting(curitup) - 1)))
 				{
 					/*
-					 * The conflicting tuple (or whole HOT chain) is dead to
-					 * everyone, so we may as well mark the index entry
-					 * killed.
+					 * The conflicting tuple (or all HOT chains pointed to by
+					 * all posting list TIDs) is dead to everyone, so mark the
+					 * index entry killed.
 					 */
 					ItemIdMarkDead(curitemid);
 					opaque->btpo_flags |= BTP_HAS_GARBAGE;
@@ -584,14 +630,29 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					else
 						MarkBufferDirtyHint(insertstate->buf, true);
 				}
+
+				/*
+				 * Remember if posting list tuple has even a single HOT chain
+				 * whose members are not all dead
+				 */
+				if (!all_dead && inposting)
+					prevalldead = false;
 			}
 		}
 
-		/*
-		 * Advance to next tuple to continue checking.
-		 */
-		if (offset < maxoff)
+		if (inposting && curposti < BTreeTupleGetNPosting(curitup) - 1)
+		{
+			/* Advance to next TID in same posting list */
+			curposti++;
+			continue;
+		}
+		else if (offset < maxoff)
+		{
+			/* Advance to next tuple */
+			curposti = 0;
+			inposting = false;
 			offset = OffsetNumberNext(offset);
+		}
 		else
 		{
 			int			highkeycmp;
@@ -606,7 +667,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 			/* Advance to next non-dead page --- there must be one */
 			for (;;)
 			{
-				nblkno = opaque->btpo_next;
+				BlockNumber nblkno = opaque->btpo_next;
+
 				nbuf = _bt_relandgetbuf(rel, nbuf, nblkno, BT_READ);
 				page = BufferGetPage(nbuf);
 				opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -616,6 +678,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					elog(ERROR, "fell off the end of index \"%s\"",
 						 RelationGetRelationName(rel));
 			}
+			/* Will also advance to next tuple */
+			curposti = 0;
+			inposting = false;
 			maxoff = PageGetMaxOffsetNumber(page);
 			offset = P_FIRSTDATAKEY(opaque);
 			/* Don't invalidate binary search bounds */
@@ -684,6 +749,7 @@ _bt_findinsertloc(Relation rel,
 	BTScanInsert itup_key = insertstate->itup_key;
 	Page		page = BufferGetPage(insertstate->buf);
 	BTPageOpaque lpageop;
+	OffsetNumber newitemoff;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -696,9 +762,13 @@ _bt_findinsertloc(Relation rel,
 	Assert(!insertstate->bounds_valid || checkingunique);
 	Assert(!itup_key->heapkeyspace || itup_key->scantid != NULL);
 	Assert(itup_key->heapkeyspace || itup_key->scantid == NULL);
+	Assert(!itup_key->allequalimage || itup_key->heapkeyspace);
 
 	if (itup_key->heapkeyspace)
 	{
+		/* Keep track of whether checkingunique duplicate seen */
+		bool		uniquedup = false;
+
 		/*
 		 * If we're inserting into a unique index, we may have to walk right
 		 * through leaf pages to find the one leaf page that we must insert on
@@ -715,6 +785,13 @@ _bt_findinsertloc(Relation rel,
 		 */
 		if (checkingunique)
 		{
+			if (insertstate->low < insertstate->stricthigh)
+			{
+				/* Encountered a duplicate in _bt_check_unique() */
+				Assert(insertstate->bounds_valid);
+				uniquedup = true;
+			}
+
 			for (;;)
 			{
 				/*
@@ -741,18 +818,43 @@ _bt_findinsertloc(Relation rel,
 				/* Update local state after stepping right */
 				page = BufferGetPage(insertstate->buf);
 				lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+				/* Assume duplicates (if checkingunique) */
+				uniquedup = true;
 			}
 		}
 
 		/*
 		 * If the target page is full, see if we can obtain enough space by
-		 * erasing LP_DEAD items
+		 * erasing LP_DEAD items.  If that fails to free enough space, see if
+		 * we can avoid a page split by performing a deduplication pass over
+		 * the page.
+		 *
+		 * We only perform a deduplication pass for a checkingunique caller
+		 * when the incoming item is a duplicate of an existing item on the
+		 * leaf page.  This heuristic avoids wasting cycles -- we only expect
+		 * to benefit from deduplicating a unique index page when most or all
+		 * recently added items are duplicates.  See nbtree/README.
 		 */
-		if (PageGetFreeSpace(page) < insertstate->itemsz &&
-			P_HAS_GARBAGE(lpageop))
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
 		{
-			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
-			insertstate->bounds_valid = false;
+			if (P_HAS_GARBAGE(lpageop))
+			{
+				_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+				insertstate->bounds_valid = false;
+
+				/* Might as well assume duplicates (if checkingunique) */
+				uniquedup = true;
+			}
+
+			if (itup_key->allequalimage && BTGetDeduplicateItems(rel) &&
+				(!checkingunique || uniquedup) &&
+				PageGetFreeSpace(page) < insertstate->itemsz)
+			{
+				_bt_dedup_one_page(rel, insertstate->buf, heapRel,
+								   insertstate->itup, insertstate->itemsz,
+								   checkingunique);
+				insertstate->bounds_valid = false;
+			}
 		}
 	}
 	else
@@ -834,7 +936,30 @@ _bt_findinsertloc(Relation rel,
 	Assert(P_RIGHTMOST(lpageop) ||
 		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
 
-	return _bt_binsrch_insert(rel, insertstate);
+	newitemoff = _bt_binsrch_insert(rel, insertstate);
+
+	if (insertstate->postingoff == -1)
+	{
+		/*
+		 * There is an overlapping posting list tuple with its LP_DEAD bit
+		 * set.  We don't want to unnecessarily unset its LP_DEAD bit while
+		 * performing a posting list split, so delete all LP_DEAD items early.
+		 * This is the only case where LP_DEAD deletes happen even though
+		 * there is space for newitem on the page.
+		 */
+		_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+
+		/*
+		 * Do new binary search.  New insert location cannot overlap with any
+		 * posting list now.
+		 */
+		insertstate->bounds_valid = false;
+		insertstate->postingoff = 0;
+		newitemoff = _bt_binsrch_insert(rel, insertstate);
+		Assert(insertstate->postingoff == 0);
+	}
+
+	return newitemoff;
 }
 
 /*
@@ -900,10 +1025,12 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		This recursive procedure does the following things:
  *
+ *			+  if postingoff != 0, splits existing posting list tuple
+ *			   (since it overlaps with new 'itup' tuple).
  *			+  if necessary, splits the target page, using 'itup_key' for
  *			   suffix truncation on leaf pages (caller passes NULL for
  *			   non-leaf pages).
- *			+  inserts the tuple.
+ *			+  inserts the new tuple (might be split from posting list).
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
  *			   right using information stored in the parent stack).
@@ -931,11 +1058,15 @@ _bt_insertonpg(Relation rel,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
+			   int postingoff,
 			   bool split_only_page)
 {
 	Page		page;
 	BTPageOpaque lpageop;
 	Size		itemsz;
+	IndexTuple	oposting;
+	IndexTuple	origitup = NULL;
+	IndexTuple	nposting = NULL;
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -949,6 +1080,7 @@ _bt_insertonpg(Relation rel,
 	Assert(P_ISLEAF(lpageop) ||
 		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
+	Assert(!BTreeTupleIsPosting(itup));
 
 	/* The caller should've finished any incomplete splits already. */
 	if (P_INCOMPLETE_SPLIT(lpageop))
@@ -959,6 +1091,34 @@ _bt_insertonpg(Relation rel,
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
+	/*
+	 * Do we need to split an existing posting list item?
+	 */
+	if (postingoff != 0)
+	{
+		ItemId		itemid = PageGetItemId(page, newitemoff);
+
+		/*
+		 * The new tuple is a duplicate with a heap TID that falls inside the
+		 * range of an existing posting list tuple on a leaf page.  Prepare to
+		 * split an existing posting list.  Overwriting the posting list with
+		 * its post-split version is treated as an extra step in either the
+		 * insert or page split critical section.
+		 */
+		Assert(P_ISLEAF(lpageop) && !ItemIdIsDead(itemid));
+		Assert(itup_key->heapkeyspace && itup_key->allequalimage);
+		oposting = (IndexTuple) PageGetItem(page, itemid);
+
+		/* use a mutable copy of itup as our itup from here on */
+		origitup = itup;
+		itup = CopyIndexTuple(origitup);
+		nposting = _bt_swap_posting(itup, oposting, postingoff);
+		/* itup now contains rightmost/max TID from oposting */
+
+		/* Alter offset so that newitem goes after posting list */
+		newitemoff = OffsetNumberNext(newitemoff);
+	}
+
 	/*
 	 * Do we need to split the page to fit the item on it?
 	 *
@@ -991,7 +1151,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+						 origitup, nposting, postingoff);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1066,6 +1227,9 @@ _bt_insertonpg(Relation rel,
 		/* Do the update.  No ereport(ERROR) until changes are logged */
 		START_CRIT_SECTION();
 
+		if (postingoff != 0)
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
 		if (!_bt_pgaddtup(page, itemsz, itup, newitemoff))
 			elog(PANIC, "failed to add new item to block %u in index \"%s\"",
 				 itup_blkno, RelationGetRelationName(rel));
@@ -1115,8 +1279,19 @@ _bt_insertonpg(Relation rel,
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
 
-			if (P_ISLEAF(lpageop))
+			if (P_ISLEAF(lpageop) && postingoff == 0)
+			{
+				/* Simple leaf insert */
 				xlinfo = XLOG_BTREE_INSERT_LEAF;
+			}
+			else if (postingoff != 0)
+			{
+				/*
+				 * Leaf insert with posting list split.  Must include
+				 * postingoff field before newitem/orignewitem.
+				 */
+				xlinfo = XLOG_BTREE_INSERT_POST;
+			}
 			else
 			{
 				/*
@@ -1139,6 +1314,7 @@ _bt_insertonpg(Relation rel,
 				xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 				xlmeta.last_cleanup_num_heap_tuples =
 					metad->btm_last_cleanup_num_heap_tuples;
+				xlmeta.allequalimage = metad->btm_allequalimage;
 
 				XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 				XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));
@@ -1147,7 +1323,27 @@ _bt_insertonpg(Relation rel,
 			}
 
 			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
-			XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+			if (postingoff == 0)
+			{
+				/* Simple, common case -- log itup from caller */
+				XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+			}
+			else
+			{
+				/*
+				 * Insert with posting list split (XLOG_BTREE_INSERT_POST
+				 * record) case.
+				 *
+				 * Log postingoff.  Also log origitup, not itup.  REDO routine
+				 * must reconstruct final itup (as well as nposting) using
+				 * _bt_swap_posting().
+				 */
+				uint16		upostingoff = postingoff;
+
+				XLogRegisterBufData(0, (char *) &upostingoff, sizeof(uint16));
+				XLogRegisterBufData(0, (char *) origitup,
+									IndexTupleSize(origitup));
+			}
 
 			recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
@@ -1189,6 +1385,14 @@ _bt_insertonpg(Relation rel,
 			_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
 			RelationSetTargetBlock(rel, cachedBlock);
 	}
+
+	/* be tidy */
+	if (postingoff != 0)
+	{
+		/* itup is actually a modified copy of caller's original */
+		pfree(nposting);
+		pfree(itup);
+	}
 }
 
 /*
@@ -1204,12 +1408,24 @@ _bt_insertonpg(Relation rel,
  *		This function will clear the INCOMPLETE_SPLIT flag on it, and
  *		release the buffer.
  *
+ *		orignewitem, nposting, and postingoff are needed when an insert of
+ *		orignewitem results in both a posting list split and a page split.
+ *		These extra posting list split details are used here in the same
+ *		way as they are used in the more common case where a posting list
+ *		split does not coincide with a page split.  We need to deal with
+ *		posting list splits directly in order to ensure that everything
+ *		that follows from the insert of orignewitem is handled as a single
+ *		atomic operation (though caller's insert of a new pivot/downlink
+ *		into parent page will still be a separate operation).  See
+ *		nbtree/README for details on the design of posting list splits.
+ *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
  */
 static Buffer
 _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+		  IndexTuple orignewitem, IndexTuple nposting, uint16 postingoff)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1229,6 +1445,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	OffsetNumber leftoff,
 				rightoff;
 	OffsetNumber firstright;
+	OffsetNumber origpagepostingoff;
 	OffsetNumber maxoff;
 	OffsetNumber i;
 	bool		newitemonleft,
@@ -1298,6 +1515,34 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	PageSetLSN(leftpage, PageGetLSN(origpage));
 	isleaf = P_ISLEAF(oopaque);
 
+	/*
+	 * Determine page offset number of existing overlapped-with-orignewitem
+	 * posting list when it is necessary to perform a posting list split in
+	 * passing.  Note that newitem was already changed by caller (newitem no
+	 * longer has the orignewitem TID).
+	 *
+	 * This page offset number (origpagepostingoff) will be used to pretend
+	 * that the posting split has already taken place, even though the
+	 * required modifications to origpage won't occur until we reach the
+	 * critical section.  The lastleft and firstright tuples of our page split
+	 * point should, in effect, come from an imaginary version of origpage
+	 * that has the nposting tuple instead of the original posting list tuple.
+	 *
+	 * Note: _bt_findsplitloc() should have compensated for coinciding posting
+	 * list splits in just the same way, at least in theory.  It doesn't
+	 * bother with that, though.  In practice it won't affect its choice of
+	 * split point.
+	 */
+	origpagepostingoff = InvalidOffsetNumber;
+	if (postingoff != 0)
+	{
+		Assert(isleaf);
+		Assert(ItemPointerCompare(&orignewitem->t_tid,
+								  &newitem->t_tid) < 0);
+		Assert(BTreeTupleIsPosting(nposting));
+		origpagepostingoff = OffsetNumberPrev(newitemoff);
+	}
+
 	/*
 	 * The "high key" for the new left page will be the first key that's going
 	 * to go into the new right page, or a truncated version if this is a leaf
@@ -1335,6 +1580,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, firstright);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		if (firstright == origpagepostingoff)
+			item = nposting;
 	}
 
 	/*
@@ -1368,6 +1615,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
 			itemid = PageGetItemId(origpage, lastleftoff);
 			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+			if (lastleftoff == origpagepostingoff)
+				lastleft = nposting;
 		}
 
 		Assert(lastleft != item);
@@ -1383,6 +1632,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	 */
 	leftoff = P_HIKEY;
 
+	Assert(BTreeTupleIsPivot(lefthikey) || !itup_key->heapkeyspace);
 	Assert(BTreeTupleGetNAtts(lefthikey, rel) > 0);
 	Assert(BTreeTupleGetNAtts(lefthikey, rel) <= indnkeyatts);
 	if (PageAddItem(leftpage, (Item) lefthikey, itemsz, leftoff,
@@ -1447,6 +1697,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		Assert(BTreeTupleIsPivot(item) || !itup_key->heapkeyspace);
 		Assert(BTreeTupleGetNAtts(item, rel) > 0);
 		Assert(BTreeTupleGetNAtts(item, rel) <= indnkeyatts);
 		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
@@ -1475,8 +1726,16 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 
+		/* replace original item with nposting due to posting split? */
+		if (i == origpagepostingoff)
+		{
+			Assert(BTreeTupleIsPosting(item));
+			Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+			item = nposting;
+		}
+
 		/* does new item belong before this one? */
-		if (i == newitemoff)
+		else if (i == newitemoff)
 		{
 			if (newitemonleft)
 			{
@@ -1645,8 +1904,12 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		XLogRecPtr	recptr;
 
 		xlrec.level = ropaque->btpo.level;
+		/* See comments below on newitem, orignewitem, and posting lists */
 		xlrec.firstright = firstright;
 		xlrec.newitemoff = newitemoff;
+		xlrec.postingoff = 0;
+		if (postingoff != 0 && origpagepostingoff < firstright)
+			xlrec.postingoff = postingoff;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1665,11 +1928,35 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		 * because it's included with all the other items on the right page.)
 		 * Show the new item as belonging to the left page buffer, so that it
 		 * is not stored if XLogInsert decides it needs a full-page image of
-		 * the left page.  We store the offset anyway, though, to support
-		 * archive compression of these records.
+		 * the left page.  We always store newitemoff in the record, though.
+		 *
+		 * The details are sometimes slightly different for page splits that
+		 * coincide with a posting list split.  If both the replacement
+		 * posting list and newitem go on the right page, then we don't need
+		 * to log anything extra, just like the simple !newitemonleft
+		 * no-posting-split case (postingoff is set to zero in the WAL record,
+		 * so recovery doesn't need to process a posting list split at all).
+		 * Otherwise, we set postingoff and log orignewitem instead of
+		 * newitem, despite having actually inserted newitem.  REDO routine
+		 * must reconstruct nposting and newitem using _bt_swap_posting().
+		 *
+		 * Note: It's possible that our page split point is the point that
+		 * makes the posting list lastleft and newitem firstright.  This is
+		 * the only case where we log orignewitem/newitem despite newitem
+		 * going on the right page.  If XLogInsert decides that it can omit
+		 * orignewitem due to logging a full-page image of the left page,
+		 * everything still works out, since recovery only needs to log
+		 * orignewitem for items on the left page (just like the regular
+		 * newitem-logged case).
 		 */
-		if (newitemonleft)
+		if (newitemonleft && xlrec.postingoff == 0)
 			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+		else if (xlrec.postingoff != 0)
+		{
+			Assert(newitemonleft || firstright == newitemoff);
+			Assert(MAXALIGN(newitemsz) == IndexTupleSize(orignewitem));
+			XLogRegisterBufData(0, (char *) orignewitem, MAXALIGN(newitemsz));
+		}
 
 		/* Log the left page's new high key */
 		itemid = PageGetItemId(origpage, P_HIKEY);
@@ -1829,7 +2116,7 @@ _bt_insert_parent(Relation rel,
 
 		/* Recursively insert into the parent */
 		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
-					   new_item, stack->bts_offset + 1,
+					   new_item, stack->bts_offset + 1, 0,
 					   is_only);
 
 		/* be tidy */
@@ -2185,6 +2472,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 		md.fastlevel = metad->btm_level;
 		md.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 		md.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+		md.allequalimage = metad->btm_allequalimage;
 
 		XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -2265,7 +2553,7 @@ _bt_pgaddtup(Page page,
 static void
 _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 {
-	OffsetNumber deletable[MaxOffsetNumber];
+	OffsetNumber deletable[MaxIndexTuplesPerPage];
 	int			ndeletable = 0;
 	OffsetNumber offnum,
 				minoff,
@@ -2298,6 +2586,6 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * Note: if we didn't find any LP_DEAD items, then the page's
 	 * BTP_HAS_GARBAGE hint bit is falsely set.  We do not bother expending a
 	 * separate write to clear it, however.  We will clear it when we split
-	 * the page.
+	 * the page, or when deduplication runs.
 	 */
 }
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index f05cbe7467..529eed027b 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -37,6 +38,8 @@ static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
 static bool _bt_mark_page_halfdead(Relation rel, Buffer buf, BTStack stack);
 static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
 									 bool *rightsib_empty);
+static TransactionId _bt_xid_horizon(Relation rel, Relation heapRel, Page page,
+									 OffsetNumber *deletable, int ndeletable);
 static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
 								   BTStack stack, Buffer *topparent, OffsetNumber *topoff,
 								   BlockNumber *target, BlockNumber *rightsib);
@@ -47,7 +50,8 @@ static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
  *	_bt_initmetapage() -- Fill a page buffer with a correct metapage image
  */
 void
-_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
+_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+				 bool allequalimage)
 {
 	BTMetaPageData *metad;
 	BTPageOpaque metaopaque;
@@ -63,6 +67,7 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
 	metad->btm_fastlevel = level;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+	metad->btm_allequalimage = allequalimage;
 
 	metaopaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	metaopaque->btpo_flags = BTP_META;
@@ -102,6 +107,9 @@ _bt_upgrademetapage(Page page)
 	metad->btm_version = BTREE_NOVAC_VERSION;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+	/* Only a REINDEX can set this field */
+	Assert(!metad->btm_allequalimage);
+	metad->btm_allequalimage = false;
 
 	/* Adjust pd_lower (see _bt_initmetapage() for details) */
 	((PageHeader) page)->pd_lower =
@@ -213,6 +221,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 		md.fastlevel = metad->btm_fastlevel;
 		md.oldest_btpo_xact = oldestBtpoXact;
 		md.last_cleanup_num_heap_tuples = numHeapTuples;
+		md.allequalimage = metad->btm_allequalimage;
 
 		XLogRegisterBufData(0, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -274,6 +283,8 @@ _bt_getroot(Relation rel, int access)
 		Assert(metad->btm_magic == BTREE_MAGIC);
 		Assert(metad->btm_version >= BTREE_MIN_VERSION);
 		Assert(metad->btm_version <= BTREE_VERSION);
+		Assert(!metad->btm_allequalimage ||
+			   metad->btm_version > BTREE_NOVAC_VERSION);
 		Assert(metad->btm_root != P_NONE);
 
 		rootblkno = metad->btm_fastroot;
@@ -394,6 +405,7 @@ _bt_getroot(Relation rel, int access)
 			md.fastlevel = 0;
 			md.oldest_btpo_xact = InvalidTransactionId;
 			md.last_cleanup_num_heap_tuples = -1.0;
+			md.allequalimage = metad->btm_allequalimage;
 
 			XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
 
@@ -618,22 +630,34 @@ _bt_getrootheight(Relation rel)
 	Assert(metad->btm_magic == BTREE_MAGIC);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
 	Assert(metad->btm_version <= BTREE_VERSION);
+	Assert(!metad->btm_allequalimage ||
+		   metad->btm_version > BTREE_NOVAC_VERSION);
 	Assert(metad->btm_fastroot != P_NONE);
 
 	return metad->btm_fastlevel;
 }
 
 /*
- *	_bt_heapkeyspace() -- is heap TID being treated as a key?
+ *	_bt_metaversion() -- Get version/status info from metapage.
+ *
+ *		Sets caller's *heapkeyspace and *allequalimage arguments using data
+ *		from the B-Tree metapage (could be locally-cached version).  This
+ *		information needs to be stashed in insertion scankey, so we provide a
+ *		single function that fetches both at once.
  *
  *		This is used to determine the rules that must be used to descend a
  *		btree.  Version 4 indexes treat heap TID as a tiebreaker attribute.
  *		pg_upgrade'd version 3 indexes need extra steps to preserve reasonable
  *		performance when inserting a new BTScanInsert-wise duplicate tuple
  *		among many leaf pages already full of such duplicates.
+ *
+ *		Also sets allequalimage field, which indicates whether or not it is
+ *		safe to apply deduplication.  We rely on the assumption that
+ *		btm_allequalimage will be zero'ed on heapkeyspace indexes that were
+ *		pg_upgrade'd from Postgres 12.
  */
-bool
-_bt_heapkeyspace(Relation rel)
+void
+_bt_metaversion(Relation rel, bool *heapkeyspace, bool *allequalimage)
 {
 	BTMetaPageData *metad;
 
@@ -651,10 +675,11 @@ _bt_heapkeyspace(Relation rel)
 		 */
 		if (metad->btm_root == P_NONE)
 		{
-			uint32		btm_version = metad->btm_version;
+			*heapkeyspace = metad->btm_version > BTREE_NOVAC_VERSION;
+			*allequalimage = metad->btm_allequalimage;
 
 			_bt_relbuf(rel, metabuf);
-			return btm_version > BTREE_NOVAC_VERSION;
+			return;
 		}
 
 		/*
@@ -678,9 +703,12 @@ _bt_heapkeyspace(Relation rel)
 	Assert(metad->btm_magic == BTREE_MAGIC);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
 	Assert(metad->btm_version <= BTREE_VERSION);
+	Assert(!metad->btm_allequalimage ||
+		   metad->btm_version > BTREE_NOVAC_VERSION);
 	Assert(metad->btm_fastroot != P_NONE);
 
-	return metad->btm_version > BTREE_NOVAC_VERSION;
+	*heapkeyspace = metad->btm_version > BTREE_NOVAC_VERSION;
+	*allequalimage = metad->btm_allequalimage;
 }
 
 /*
@@ -964,28 +992,106 @@ _bt_page_recyclable(Page page)
  * Delete item(s) from a btree leaf page during VACUUM.
  *
  * This routine assumes that the caller has a super-exclusive write lock on
- * the buffer.  Also, the given deletable array *must* be sorted in ascending
- * order.
+ * the buffer.  Also, the given deletable and updatable arrays *must* be
+ * sorted in ascending order.
+ *
+ * Routine deals with deleting TIDs when some (but not all) of the heap TIDs
+ * in an existing posting list item are to be removed by VACUUM.  This works
+ * by updating/overwriting an existing item with caller's new version of the
+ * item (a version that lacks the TIDs that are to be deleted).
  *
  * We record VACUUMs and b-tree deletes differently in WAL.  Deletes must
  * generate their own latestRemovedXid by accessing the heap directly, whereas
- * VACUUMs rely on the initial heap scan taking care of it indirectly.
+ * VACUUMs rely on the initial heap scan taking care of it indirectly.  Also,
+ * only VACUUM can perform granular deletes of individual TIDs in posting list
+ * tuples.
  */
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
-					OffsetNumber *deletable, int ndeletable)
+					OffsetNumber *deletable, int ndeletable,
+					BTVacuumPosting *updatable, int nupdatable)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	Size		itemsz;
+	char	   *updatedbuf = NULL;
+	Size		updatedbuflen = 0;
+	OffsetNumber updatedoffsets[MaxIndexTuplesPerPage];
 
 	/* Shouldn't be called unless there's something to do */
-	Assert(ndeletable > 0);
+	Assert(ndeletable > 0 || nupdatable > 0);
+
+	for (int i = 0; i < nupdatable; i++)
+	{
+		/* Replace work area IndexTuple with updated version */
+		_bt_update_posting(updatable[i]);
+
+		/* Maintain array of updatable page offsets for WAL record */
+		updatedoffsets[i] = updatable[i]->updatedoffset;
+	}
+
+	/* XLOG stuff -- allocate and fill buffer before critical section */
+	if (nupdatable > 0 && RelationNeedsWAL(rel))
+	{
+		Size		offset = 0;
+
+		for (int i = 0; i < nupdatable; i++)
+		{
+			BTVacuumPosting vacposting = updatable[i];
+
+			itemsz = SizeOfBtreeUpdate +
+				vacposting->ndeletedtids * sizeof(uint16);
+			updatedbuflen += itemsz;
+		}
+
+		updatedbuf = palloc(updatedbuflen);
+		for (int i = 0; i < nupdatable; i++)
+		{
+			BTVacuumPosting vacposting = updatable[i];
+			xl_btree_update update;
+
+			update.ndeletedtids = vacposting->ndeletedtids;
+			memcpy(updatedbuf + offset, &update.ndeletedtids,
+				   SizeOfBtreeUpdate);
+			offset += SizeOfBtreeUpdate;
+
+			itemsz = update.ndeletedtids * sizeof(uint16);
+			memcpy(updatedbuf + offset, vacposting->deletetids, itemsz);
+			offset += itemsz;
+		}
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
-	/* Fix the page */
-	PageIndexMultiDelete(page, deletable, ndeletable);
+	/*
+	 * Handle posting tuple updates.
+	 *
+	 * Deliberately do this before handling simple deletes.  If we did it the
+	 * other way around (i.e. WAL record order -- simple deletes before
+	 * updates) then we'd have to make compensating changes to the 'updatable'
+	 * array of offset numbers.
+	 *
+	 * PageIndexTupleOverwrite() won't unset each item's LP_DEAD bit when it
+	 * happens to already be set.  Although we unset the BTP_HAS_GARBAGE page
+	 * level flag, unsetting individual LP_DEAD bits should still be avoided.
+	 */
+	for (int i = 0; i < nupdatable; i++)
+	{
+		OffsetNumber updatedoffset = updatedoffsets[i];
+		IndexTuple	itup;
+
+		itup = updatable[i]->itup;
+		itemsz = MAXALIGN(IndexTupleSize(itup));
+		if (!PageIndexTupleOverwrite(page, updatedoffset, (Item) itup,
+									 itemsz))
+			elog(PANIC, "could not update partially dead item in block %u of index \"%s\"",
+				 BufferGetBlockNumber(buf), RelationGetRelationName(rel));
+	}
+
+	/* Now handle simple deletes of entire tuples */
+	if (ndeletable > 0)
+		PageIndexMultiDelete(page, deletable, ndeletable);
 
 	/*
 	 * We can clear the vacuum cycle ID since this page has certainly been
@@ -1006,7 +1112,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	 * limited, since we never falsely unset an LP_DEAD bit.  Workloads that
 	 * are particularly dependent on LP_DEAD bits being set quickly will
 	 * usually manage to set the BTP_HAS_GARBAGE flag before the page fills up
-	 * again anyway.
+	 * again anyway.  Furthermore, attempting a deduplication pass will remove
+	 * all LP_DEAD items, regardless of whether the BTP_HAS_GARBAGE hint bit
+	 * is set or not.
 	 */
 	opaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 
@@ -1019,18 +1127,22 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.ndeleted = ndeletable;
+		xlrec_vacuum.nupdated = nupdatable;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
 		XLogRegisterData((char *) &xlrec_vacuum, SizeOfBtreeVacuum);
 
-		/*
-		 * The deletable array is not in the buffer, but pretend that it is.
-		 * When XLogInsert stores the whole buffer, the array need not be
-		 * stored too.
-		 */
-		XLogRegisterBufData(0, (char *) deletable,
-							ndeletable * sizeof(OffsetNumber));
+		if (ndeletable > 0)
+			XLogRegisterBufData(0, (char *) deletable,
+								ndeletable * sizeof(OffsetNumber));
+
+		if (nupdatable > 0)
+		{
+			XLogRegisterBufData(0, (char *) updatedoffsets,
+								nupdatable * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, updatedbuf, updatedbuflen);
+		}
 
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
@@ -1038,6 +1150,13 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 	}
 
 	END_CRIT_SECTION();
+
+	/* can't leak memory here */
+	if (updatedbuf != NULL)
+		pfree(updatedbuf);
+	/* free tuples generated by calling _bt_update_posting() */
+	for (int i = 0; i < nupdatable; i++)
+		pfree(updatable[i]->itup);
 }
 
 /*
@@ -1050,6 +1169,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
  * This is nearly the same as _bt_delitems_vacuum as far as what it does to
  * the page, but it needs to generate its own latestRemovedXid by accessing
  * the heap.  This is used by the REDO routine to generate recovery conflicts.
+ * Also, it doesn't handle posting list tuples unless the entire tuple can be
+ * deleted as a whole (since there is only one LP_DEAD bit per line pointer).
  */
 void
 _bt_delitems_delete(Relation rel, Buffer buf,
@@ -1065,8 +1186,7 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 
 	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
 		latestRemovedXid =
-			index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
-												 deletable, ndeletable);
+			_bt_xid_horizon(rel, heapRel, page, deletable, ndeletable);
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
@@ -1113,6 +1233,83 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 	END_CRIT_SECTION();
 }
 
+/*
+ * Get the latestRemovedXid from the table entries pointed to by the non-pivot
+ * tuples being deleted.
+ *
+ * This is a specialized version of index_compute_xid_horizon_for_tuples().
+ * It's needed because btree tuples don't always store table TID using the
+ * standard index tuple header field.
+ */
+static TransactionId
+_bt_xid_horizon(Relation rel, Relation heapRel, Page page,
+				OffsetNumber *deletable, int ndeletable)
+{
+	TransactionId latestRemovedXid = InvalidTransactionId;
+	int			spacenhtids;
+	int			nhtids;
+	ItemPointer htids;
+
+	/* Array will grow iff there are posting list tuples to consider */
+	spacenhtids = ndeletable;
+	nhtids = 0;
+	htids = (ItemPointer) palloc(sizeof(ItemPointerData) * spacenhtids);
+	for (int i = 0; i < ndeletable; i++)
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+
+		itemid = PageGetItemId(page, deletable[i]);
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(ItemIdIsDead(itemid));
+		Assert(!BTreeTupleIsPivot(itup));
+
+		if (!BTreeTupleIsPosting(itup))
+		{
+			if (nhtids + 1 > spacenhtids)
+			{
+				spacenhtids *= 2;
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * spacenhtids);
+			}
+
+			Assert(ItemPointerIsValid(&itup->t_tid));
+			ItemPointerCopy(&itup->t_tid, &htids[nhtids]);
+			nhtids++;
+		}
+		else
+		{
+			int			nposting = BTreeTupleGetNPosting(itup);
+
+			if (nhtids + nposting > spacenhtids)
+			{
+				spacenhtids = Max(spacenhtids * 2, nhtids + nposting);
+				htids = (ItemPointer)
+					repalloc(htids, sizeof(ItemPointerData) * spacenhtids);
+			}
+
+			for (int j = 0; j < nposting; j++)
+			{
+				ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+				Assert(ItemPointerIsValid(htid));
+				ItemPointerCopy(htid, &htids[nhtids]);
+				nhtids++;
+			}
+		}
+	}
+
+	Assert(nhtids >= ndeletable);
+
+	latestRemovedXid =
+		table_compute_xid_horizon_for_tuples(heapRel, htids, nhtids);
+
+	pfree(htids);
+
+	return latestRemovedXid;
+}
+
 /*
  * Returns true, if the given block has the half-dead flag set.
  */
@@ -2058,6 +2255,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 			xlmeta.fastlevel = metad->btm_fastlevel;
 			xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
 			xlmeta.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+			xlmeta.allequalimage = metad->btm_allequalimage;
 
 			XLogRegisterBufData(4, (char *) &xlmeta, sizeof(xl_btree_metadata));
 			xlinfo = XLOG_BTREE_UNLINK_PAGE_META;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 5254bc7ef5..4bb16297c3 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -95,6 +95,10 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
+										  IndexTuple posting,
+										  OffsetNumber updatedoffset,
+										  int *nremaining);
 
 
 /*
@@ -161,7 +165,7 @@ btbuildempty(Relation index)
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, P_NONE, 0);
+	_bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
 
 	/*
 	 * Write the page and log it.  It might seem that an immediate sync would
@@ -264,8 +268,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 				 */
 				if (so->killedItems == NULL)
 					so->killedItems = (int *)
-						palloc(MaxIndexTuplesPerPage * sizeof(int));
-				if (so->numKilled < MaxIndexTuplesPerPage)
+						palloc(MaxTIDsPerBTreePage * sizeof(int));
+				if (so->numKilled < MaxTIDsPerBTreePage)
 					so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 			}
 
@@ -1154,11 +1158,15 @@ restart:
 	}
 	else if (P_ISLEAF(opaque))
 	{
-		OffsetNumber deletable[MaxOffsetNumber];
+		OffsetNumber deletable[MaxIndexTuplesPerPage];
 		int			ndeletable;
+		BTVacuumPosting updatable[MaxIndexTuplesPerPage];
+		int			nupdatable;
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
+		int			nhtidsdead,
+					nhtidslive;
 
 		/*
 		 * Trade in the initial read lock for a super-exclusive write lock on
@@ -1190,8 +1198,11 @@ restart:
 		 * point using callback.
 		 */
 		ndeletable = 0;
+		nupdatable = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
+		nhtidsdead = 0;
+		nhtidslive = 0;
 		if (callback)
 		{
 			for (offnum = minoff;
@@ -1199,11 +1210,9 @@ restart:
 				 offnum = OffsetNumberNext(offnum))
 			{
 				IndexTuple	itup;
-				ItemPointer htup;
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
 				/*
 				 * Hot Standby assumes that it's okay that XLOG_BTREE_VACUUM
@@ -1226,22 +1235,82 @@ restart:
 				 * simple, and allows us to always avoid generating our own
 				 * conflicts.
 				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				Assert(!BTreeTupleIsPivot(itup));
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Regular tuple, standard table TID representation */
+					if (callback(&itup->t_tid, callback_state))
+					{
+						deletable[ndeletable++] = offnum;
+						nhtidsdead++;
+					}
+					else
+						nhtidslive++;
+				}
+				else
+				{
+					BTVacuumPosting vacposting;
+					int			nremaining;
+
+					/* Posting list tuple */
+					vacposting = btreevacuumposting(vstate, itup, offnum,
+													&nremaining);
+					if (vacposting == NULL)
+					{
+						/*
+						 * All table TIDs from the posting tuple remain, so no
+						 * delete or update required
+						 */
+						Assert(nremaining == BTreeTupleGetNPosting(itup));
+					}
+					else if (nremaining > 0)
+					{
+
+						/*
+						 * Store metadata about posting list tuple in
+						 * updatable array for entire page.  Existing tuple
+						 * will be updated during the later call to
+						 * _bt_delitems_vacuum().
+						 */
+						Assert(nremaining < BTreeTupleGetNPosting(itup));
+						updatable[nupdatable++] = vacposting;
+						nhtidsdead += BTreeTupleGetNPosting(itup) - nremaining;
+					}
+					else
+					{
+						/*
+						 * All table TIDs from the posting list must be
+						 * deleted.  We'll delete the index tuple completely
+						 * (no update required).
+						 */
+						Assert(nremaining == 0);
+						deletable[ndeletable++] = offnum;
+						nhtidsdead += BTreeTupleGetNPosting(itup);
+						pfree(vacposting);
+					}
+
+					nhtidslive += nremaining;
+				}
 			}
 		}
 
 		/*
-		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
-		 * call per page, so as to minimize WAL traffic.
+		 * Apply any needed deletes or updates.  We issue just one
+		 * _bt_delitems_vacuum() call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nupdatable > 0)
 		{
-			_bt_delitems_vacuum(rel, buf, deletable, ndeletable);
+			Assert(nhtidsdead >= Max(ndeletable, 1));
+			_bt_delitems_vacuum(rel, buf, deletable, ndeletable, updatable,
+								nupdatable);
 
-			stats->tuples_removed += ndeletable;
+			stats->tuples_removed += nhtidsdead;
 			/* must recompute maxoff */
 			maxoff = PageGetMaxOffsetNumber(page);
+
+			/* can't leak memory here */
+			for (int i = 0; i < nupdatable; i++)
+				pfree(updatable[i]);
 		}
 		else
 		{
@@ -1254,6 +1323,7 @@ restart:
 			 * We treat this like a hint-bit update because there's no need to
 			 * WAL-log it.
 			 */
+			Assert(nhtidsdead == 0);
 			if (vstate->cycleid != 0 &&
 				opaque->btpo_cycleid == vstate->cycleid)
 			{
@@ -1263,15 +1333,18 @@ restart:
 		}
 
 		/*
-		 * If it's now empty, try to delete; else count the live tuples. We
-		 * don't delete when recursing, though, to avoid putting entries into
-		 * freePages out-of-order (doesn't seem worth any extra code to handle
-		 * the case).
+		 * If it's now empty, try to delete; else count the live tuples (live
+		 * table TIDs in posting lists are counted as separate live tuples).
+		 * We don't delete when recursing, though, to avoid putting entries
+		 * into freePages out-of-order (doesn't seem worth any extra code to
+		 * handle the case).
 		 */
 		if (minoff > maxoff)
 			delete_now = (blkno == orig_blkno);
 		else
-			stats->num_index_tuples += maxoff - minoff + 1;
+			stats->num_index_tuples += nhtidslive;
+
+		Assert(!delete_now || nhtidslive == 0);
 	}
 
 	if (delete_now)
@@ -1303,9 +1376,10 @@ restart:
 	/*
 	 * This is really tail recursion, but if the compiler is too stupid to
 	 * optimize it as such, we'd eat an uncomfortably large amount of stack
-	 * space per recursion level (due to the deletable[] array). A failure is
-	 * improbable since the number of levels isn't likely to be large ... but
-	 * just in case, let's hand-optimize into a loop.
+	 * space per recursion level (due to the arrays used to track details of
+	 * deletable/updatable items).  A failure is improbable since the number
+	 * of levels isn't likely to be large ...  but just in case, let's
+	 * hand-optimize into a loop.
 	 */
 	if (recurse_to != P_NONE)
 	{
@@ -1314,6 +1388,61 @@ restart:
 	}
 }
 
+/*
+ * btreevacuumposting --- determine TIDs still needed in posting list
+ *
+ * Returns metadata describing how to build replacement tuple without the TIDs
+ * that VACUUM needs to delete.  Returned value is NULL in the common case
+ * where no changes are needed to caller's posting list tuple (we avoid
+ * allocating memory here as an optimization).
+ *
+ * The number of TIDs that should remain in the posting list tuple is set for
+ * caller in *nremaining.
+ */
+static BTVacuumPosting
+btreevacuumposting(BTVacState *vstate, IndexTuple posting,
+				   OffsetNumber updatedoffset, int *nremaining)
+{
+	int			live = 0;
+	int			nitem = BTreeTupleGetNPosting(posting);
+	ItemPointer items = BTreeTupleGetPosting(posting);
+	BTVacuumPosting vacposting = NULL;
+
+	for (int i = 0; i < nitem; i++)
+	{
+		if (!vstate->callback(items + i, vstate->callback_state))
+		{
+			/* Live table TID */
+			live++;
+		}
+		else if (vacposting == NULL)
+		{
+			/*
+			 * First dead table TID encountered.
+			 *
+			 * It's now clear that we need to delete one or more dead table
+			 * TIDs, so start maintaining metadata describing how to update
+			 * existing posting list tuple.
+			 */
+			vacposting = palloc(offsetof(BTVacuumPostingData, deletetids) +
+								nitem * sizeof(uint16));
+
+			vacposting->itup = posting;
+			vacposting->updatedoffset = updatedoffset;
+			vacposting->ndeletedtids = 0;
+			vacposting->deletetids[vacposting->ndeletedtids++] = i;
+		}
+		else
+		{
+			/* Second or subsequent dead table TID */
+			vacposting->deletetids[vacposting->ndeletedtids++] = i;
+		}
+	}
+
+	*nremaining = live;
+	return vacposting;
+}
+
 /*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index df065d72f8..7aaa8c17b0 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int	_bt_binsrch_posting(BTScanInsert key, Page page,
+								OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static int	_bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+								  OffsetNumber offnum, ItemPointer heapTid,
+								  IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+									   OffsetNumber offnum,
+									   ItemPointer heapTid, int tupleOffset);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -142,6 +150,7 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
 		offnum = _bt_binsrch(rel, key, *bufP);
 		itemid = PageGetItemId(page, offnum);
 		itup = (IndexTuple) PageGetItem(page, itemid);
+		Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
 		blkno = BTreeTupleGetDownLink(itup);
 		par_blkno = BufferGetBlockNumber(*bufP);
 
@@ -434,7 +443,10 @@ _bt_binsrch(Relation rel,
  * low) makes bounds invalid.
  *
  * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by a posting
+ * list split).
  */
 OffsetNumber
 _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +465,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 	Assert(P_ISLEAF(opaque));
 	Assert(!key->nextkey);
+	Assert(insertstate->postingoff == 0);
 
 	if (!insertstate->bounds_valid)
 	{
@@ -509,6 +522,16 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 			if (result != 0)
 				stricthigh = high;
 		}
+
+		/*
+		 * If tuple at offset located by binary search is a posting list whose
+		 * TID range overlaps with caller's scantid, perform posting list
+		 * binary search to set postingoff for caller.  Caller must split the
+		 * posting list when postingoff is set.  This should happen
+		 * infrequently.
+		 */
+		if (unlikely(result == 0 && key->scantid != NULL))
+			insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
 	}
 
 	/*
@@ -528,6 +551,73 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 	return low;
 }
 
+/*----------
+ *	_bt_binsrch_posting() -- posting list binary search.
+ *
+ * Helper routine for _bt_binsrch_insert().
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+	IndexTuple	itup;
+	ItemId		itemid;
+	int			low,
+				high,
+				mid,
+				res;
+
+	/*
+	 * If this isn't a posting tuple, then the index must be corrupt (if it is
+	 * an ordinary non-pivot tuple then there must be an existing tuple with a
+	 * heap TID that equals inserter's new heap TID/scantid).  Defensively
+	 * check that tuple is a posting list tuple whose posting list range
+	 * includes caller's scantid.
+	 *
+	 * (This is also needed because contrib/amcheck's rootdescend option needs
+	 * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+	 */
+	itemid = PageGetItemId(page, offnum);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+	if (!BTreeTupleIsPosting(itup))
+		return 0;
+
+	Assert(key->heapkeyspace && key->allequalimage);
+
+	/*
+	 * In the event that posting list tuple has LP_DEAD bit set, indicate this
+	 * to _bt_binsrch_insert() caller by returning -1, a sentinel value.  A
+	 * second call to _bt_binsrch_insert() can take place when its caller has
+	 * removed the dead item.
+	 */
+	if (ItemIdIsDead(itemid))
+		return -1;
+
+	/* "high" is past end of posting list for loop invariant */
+	low = 0;
+	high = BTreeTupleGetNPosting(itup);
+	Assert(high >= 2);
+
+	while (high > low)
+	{
+		mid = low + ((high - low) / 2);
+		res = ItemPointerCompare(key->scantid,
+								 BTreeTupleGetPostingN(itup, mid));
+
+		if (res > 0)
+			low = mid + 1;
+		else if (res < 0)
+			high = mid;
+		else
+			return mid;
+	}
+
+	/* Exact match not found */
+	return low;
+}
+
 /*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
@@ -537,9 +627,14 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
  *			<0 if scankey < tuple at offnum;
  *			 0 if scankey == tuple at offnum;
  *			>0 if scankey > tuple at offnum.
- *		NULLs in the keys are treated as sortable values.  Therefore
- *		"equality" does not necessarily mean that the item should be
- *		returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values.  Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key.  Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid.  There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
  *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
@@ -563,6 +658,7 @@ _bt_compare(Relation rel,
 	ScanKey		scankey;
 	int			ncmpkey;
 	int			ntupatts;
+	int32		result;
 
 	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -592,12 +688,12 @@ _bt_compare(Relation rel,
 
 	ncmpkey = Min(ntupatts, key->keysz);
 	Assert(key->heapkeyspace || ncmpkey == key->keysz);
+	Assert(!BTreeTupleIsPosting(itup) || key->allequalimage);
 	scankey = key->scankeys;
 	for (int i = 1; i <= ncmpkey; i++)
 	{
 		Datum		datum;
 		bool		isNull;
-		int32		result;
 
 		datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
 
@@ -712,8 +808,25 @@ _bt_compare(Relation rel,
 	if (heapTid == NULL)
 		return 1;
 
+	/*
+	 * Scankey must be treated as equal to a posting list tuple if its scantid
+	 * value falls within the range of the posting list.  In all other cases
+	 * there can only be a single heap TID value, which is compared directly
+	 * with scantid.
+	 */
 	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	return ItemPointerCompare(key->scantid, heapTid);
+	result = ItemPointerCompare(key->scantid, heapTid);
+	if (result <= 0 || !BTreeTupleIsPosting(itup))
+		return result;
+	else
+	{
+		result = ItemPointerCompare(key->scantid,
+									BTreeTupleGetMaxHeapTID(itup));
+		if (result > 0)
+			return 1;
+	}
+
+	return 0;
 }
 
 /*
@@ -1228,7 +1341,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	}
 
 	/* Initialize remaining insertion scan key fields */
-	inskey.heapkeyspace = _bt_heapkeyspace(rel);
+	_bt_metaversion(rel, &inskey.heapkeyspace, &inskey.allequalimage);
 	inskey.anynullkeys = false; /* unused */
 	inskey.nextkey = nextkey;
 	inskey.pivotsearch = false;
@@ -1483,9 +1596,35 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
-				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				/* tuple passes all scan key conditions */
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Remember it */
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+				else
+				{
+					int			tupleOffset;
+
+					/*
+					 * Set up state to return posting list, and remember first
+					 * TID
+					 */
+					tupleOffset =
+						_bt_setuppostingitems(so, itemIndex, offnum,
+											  BTreeTupleGetPostingN(itup, 0),
+											  itup);
+					itemIndex++;
+					/* Remember additional TIDs */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											tupleOffset);
+						itemIndex++;
+					}
+				}
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1518,7 +1657,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxTIDsPerBTreePage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1526,7 +1665,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxTIDsPerBTreePage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1567,9 +1706,41 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 										 &continuescan);
 			if (passes_quals && tuple_alive)
 			{
-				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				/* tuple passes all scan key conditions */
+				if (!BTreeTupleIsPosting(itup))
+				{
+					/* Remember it */
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+				else
+				{
+					int			tupleOffset;
+
+					/*
+					 * Set up state to return posting list, and remember first
+					 * TID.
+					 *
+					 * Note that we deliberately save/return items from
+					 * posting lists in ascending heap TID order for backwards
+					 * scans.  This allows _bt_killitems() to make a
+					 * consistent assumption about the order of items
+					 * associated with the same posting list tuple.
+					 */
+					itemIndex--;
+					tupleOffset =
+						_bt_setuppostingitems(so, itemIndex, offnum,
+											  BTreeTupleGetPostingN(itup, 0),
+											  itup);
+					/* Remember additional TIDs */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						itemIndex--;
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											tupleOffset);
+					}
+				}
 			}
 			if (!continuescan)
 			{
@@ -1583,8 +1754,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxTIDsPerBTreePage - 1;
+		so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1597,6 +1768,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPivot(itup) && !BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1609,6 +1782,71 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 	}
 }
 
+/*
+ * Setup state to save TIDs/items from a single posting list tuple.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for TID that is
+ * returned to scan first.  Second or subsequent TIDs for posting list should
+ * be saved by calling _bt_savepostingitem().
+ *
+ * Returns an offset into tuple storage space that main tuple is stored at if
+ * needed.
+ */
+static int
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					  ItemPointer heapTid, IndexTuple itup)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	Assert(BTreeTupleIsPosting(itup));
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+	if (so->currTuples)
+	{
+		/* Save base IndexTuple (truncate posting list) */
+		IndexTuple	base;
+		Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+		itupsz = MAXALIGN(itupsz);
+		currItem->tupleOffset = so->currPos.nextTupleOffset;
+		base = (IndexTuple) (so->currTuples + so->currPos.nextTupleOffset);
+		memcpy(base, itup, itupsz);
+		/* Defensively reduce work area index tuple header size */
+		base->t_info &= ~INDEX_SIZE_MASK;
+		base->t_info |= itupsz;
+		so->currPos.nextTupleOffset += itupsz;
+
+		return currItem->tupleOffset;
+	}
+
+	return 0;
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for current posting
+ * tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.  Caller passes its return value as tupleOffset.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer heapTid, int tupleOffset)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *heapTid;
+	currItem->indexOffset = offnum;
+
+	/*
+	 * Have index-only scans return the same base IndexTuple for every TID
+	 * that originates from the same posting list
+	 */
+	if (so->currTuples)
+		currItem->tupleOffset = tupleOffset;
+}
+
 /*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index baec5de999..e66cd36dfa 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -243,6 +243,7 @@ typedef struct BTPageState
 	BlockNumber btps_blkno;		/* block # to write this page at */
 	IndexTuple	btps_lowkey;	/* page's strict lower bound pivot tuple */
 	OffsetNumber btps_lastoff;	/* last item offset loaded */
+	Size		btps_lastextra; /* last item's extra posting list space */
 	uint32		btps_level;		/* tree level (0 = leaf) */
 	Size		btps_full;		/* "full" if less than this much free space */
 	struct BTPageState *btps_next;	/* link to parent level, if any */
@@ -277,7 +278,10 @@ static void _bt_slideleft(Page page);
 static void _bt_sortaddtup(Page page, Size itemsize,
 						   IndexTuple itup, OffsetNumber itup_off);
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
-						 IndexTuple itup);
+						 IndexTuple itup, Size truncextra);
+static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
+										  BTPageState *state,
+										  BTDedupState dstate);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
@@ -563,6 +567,8 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
 	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
+	/* _bt_mkscankey() won't set allequalimage without metapage */
+	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
 
 	/*
 	 * We need to log index creation in WAL iff WAL archiving/streaming is
@@ -711,6 +717,7 @@ _bt_pagestate(BTWriteState *wstate, uint32 level)
 	state->btps_lowkey = NULL;
 	/* initialize lastoff so first item goes into P_FIRSTKEY */
 	state->btps_lastoff = P_HIKEY;
+	state->btps_lastextra = 0;
 	state->btps_level = level;
 	/* set "full" threshold based on level.  See notes at head of file. */
 	if (level > 0)
@@ -789,7 +796,8 @@ _bt_sortaddtup(Page page,
 }
 
 /*----------
- * Add an item to a disk page from the sort output.
+ * Add an item to a disk page from the sort output (or add a posting list
+ * item formed from the sort output).
  *
  * We must be careful to observe the page layout conventions of nbtsearch.c:
  * - rightmost pages start data items at P_HIKEY instead of at P_FIRSTKEY.
@@ -821,14 +829,27 @@ _bt_sortaddtup(Page page,
  * the truncated high key at offset 1.
  *
  * 'last' pointer indicates the last offset added to the page.
+ *
+ * 'truncextra' is the size of the posting list in itup, if any.  This
+ * information is stashed for the next call here, when we may benefit
+ * from considering the impact of truncating away the posting list on
+ * the page before deciding to finish the page off.  Posting lists are
+ * often relatively large, so it is worth going to the trouble of
+ * accounting for the saving from truncating away the posting list of
+ * the tuple that becomes the high key (that may be the only way to
+ * get close to target free space on the page).  Note that this is
+ * only used for the soft fillfactor-wise limit, not the critical hard
+ * limit.
  *----------
  */
 static void
-_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
+_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
+			 Size truncextra)
 {
 	Page		npage;
 	BlockNumber nblkno;
 	OffsetNumber last_off;
+	Size		last_truncextra;
 	Size		pgspc;
 	Size		itupsz;
 	bool		isleaf;
@@ -842,6 +863,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	npage = state->btps_page;
 	nblkno = state->btps_blkno;
 	last_off = state->btps_lastoff;
+	last_truncextra = state->btps_lastextra;
+	state->btps_lastextra = truncextra;
 
 	pgspc = PageGetFreeSpace(npage);
 	itupsz = IndexTupleSize(itup);
@@ -883,10 +906,10 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	 * page.  Disregard fillfactor and insert on "full" current page if we
 	 * don't have the minimum number of items yet.  (Note that we deliberately
 	 * assume that suffix truncation neither enlarges nor shrinks new high key
-	 * when applying soft limit.)
+	 * when applying soft limit, except when last tuple has a posting list.)
 	 */
 	if (pgspc < itupsz + (isleaf ? MAXALIGN(sizeof(ItemPointerData)) : 0) ||
-		(pgspc < state->btps_full && last_off > P_FIRSTKEY))
+		(pgspc + last_truncextra < state->btps_full && last_off > P_FIRSTKEY))
 	{
 		/*
 		 * Finish off the page and write it out.
@@ -944,11 +967,14 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * We don't try to bias our choice of split point to make it more
 			 * likely that _bt_truncate() can truncate away more attributes,
 			 * whereas the split point used within _bt_split() is chosen much
-			 * more delicately.  Suffix truncation is mostly useful because it
-			 * improves space utilization for workloads with random
-			 * insertions.  It doesn't seem worthwhile to add logic for
-			 * choosing a split point here for a benefit that is bound to be
-			 * much smaller.
+			 * more delicately.  Even still, the lastleft and firstright
+			 * tuples passed to _bt_truncate() here are at least not fully
+			 * equal to each other when deduplication is used, unless there is
+			 * a large group of duplicates (also, unique index builds usually
+			 * have few or no spool2 duplicates).  When the split point is
+			 * between two unequal tuples, _bt_truncate() will avoid including
+			 * a heap TID in the new high key, which is the most important
+			 * benefit of suffix truncation.
 			 *
 			 * Overwrite the old item with new truncated high key directly.
 			 * oitup is already located at the physical beginning of tuple
@@ -983,7 +1009,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(BTreeTupleGetNAtts(state->btps_lowkey, wstate->index) == 0 ||
 			   !P_LEFTMOST((BTPageOpaque) PageGetSpecialPointer(opage)));
 		BTreeTupleSetDownLink(state->btps_lowkey, oblkno);
-		_bt_buildadd(wstate, state->btps_next, state->btps_lowkey);
+		_bt_buildadd(wstate, state->btps_next, state->btps_lowkey, 0);
 		pfree(state->btps_lowkey);
 
 		/*
@@ -1045,6 +1071,43 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	state->btps_lastoff = last_off;
 }
 
+/*
+ * Finalize pending posting list tuple, and add it to the index.  Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * This is almost like _bt_dedup_finish_pending(), but it adds a new tuple
+ * using _bt_buildadd().
+ */
+static void
+_bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
+							  BTDedupState dstate)
+{
+	Assert(dstate->nitems > 0);
+
+	if (dstate->nitems == 1)
+		_bt_buildadd(wstate, state, dstate->base, 0);
+	else
+	{
+		IndexTuple	postingtuple;
+		Size		truncextra;
+
+		/* form a tuple with a posting list */
+		postingtuple = _bt_form_posting(dstate->base,
+										dstate->htids,
+										dstate->nhtids);
+		/* Calculate posting list overhead */
+		truncextra = IndexTupleSize(postingtuple) -
+			BTreeTupleGetPostingOffset(postingtuple);
+
+		_bt_buildadd(wstate, state, postingtuple, truncextra);
+		pfree(postingtuple);
+	}
+
+	dstate->nhtids = 0;
+	dstate->nitems = 0;
+	dstate->phystupsize = 0;
+}
+
 /*
  * Finish writing out the completed btree.
  */
@@ -1090,7 +1153,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 			Assert(BTreeTupleGetNAtts(s->btps_lowkey, wstate->index) == 0 ||
 				   !P_LEFTMOST(opaque));
 			BTreeTupleSetDownLink(s->btps_lowkey, blkno);
-			_bt_buildadd(wstate, s->btps_next, s->btps_lowkey);
+			_bt_buildadd(wstate, s->btps_next, s->btps_lowkey, 0);
 			pfree(s->btps_lowkey);
 			s->btps_lowkey = NULL;
 		}
@@ -1111,7 +1174,8 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 	 * by filling in a valid magic number in the metapage.
 	 */
 	metapage = (Page) palloc(BLCKSZ);
-	_bt_initmetapage(metapage, rootblkno, rootlevel);
+	_bt_initmetapage(metapage, rootblkno, rootlevel,
+					 wstate->inskey->allequalimage);
 	_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
 }
 
@@ -1132,6 +1196,10 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		deduplicate;
+
+	deduplicate = wstate->inskey->allequalimage &&
+		BTGetDeduplicateItems(wstate->index);
 
 	if (merge)
 	{
@@ -1228,12 +1296,12 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 
 			if (load1)
 			{
-				_bt_buildadd(wstate, state, itup);
+				_bt_buildadd(wstate, state, itup, 0);
 				itup = tuplesort_getindextuple(btspool->sortstate, true);
 			}
 			else
 			{
-				_bt_buildadd(wstate, state, itup2);
+				_bt_buildadd(wstate, state, itup2, 0);
 				itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
 			}
 
@@ -1243,9 +1311,100 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		}
 		pfree(sortKeys);
 	}
+	else if (deduplicate)
+	{
+		/* merge is unnecessary, deduplicate into posting lists */
+		BTDedupState dstate;
+
+		dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
+		dstate->deduplicate = true; /* unused */
+		dstate->maxpostingsize = 0; /* set later */
+		/* Metadata about base tuple of current pending posting list */
+		dstate->base = NULL;
+		dstate->baseoff = InvalidOffsetNumber;	/* unused */
+		dstate->basetupsize = 0;
+		/* Metadata about current pending posting list TIDs */
+		dstate->htids = NULL;
+		dstate->nhtids = 0;
+		dstate->nitems = 0;
+		dstate->phystupsize = 0;	/* unused */
+		dstate->nintervals = 0; /* unused */
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate,
+											   true)) != NULL)
+		{
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+			{
+				state = _bt_pagestate(wstate, 0);
+
+				/*
+				 * Limit size of posting list tuples to 1/10 space we want to
+				 * leave behind on the page, plus space for final item's line
+				 * pointer.  This is equal to the space that we'd like to
+				 * leave behind on each leaf page when fillfactor is 90,
+				 * allowing us to get close to fillfactor% space utilization
+				 * when there happen to be a great many duplicates.  (This
+				 * makes higher leaf fillfactor settings ineffective when
+				 * building indexes that have many duplicates, but packing
+				 * leaf pages full with few very large tuples doesn't seem
+				 * like a useful goal.)
+				 */
+				dstate->maxpostingsize = MAXALIGN_DOWN((BLCKSZ * 10 / 100)) -
+					sizeof(ItemIdData);
+				Assert(dstate->maxpostingsize <= BTMaxItemSize(state->btps_page) &&
+					   dstate->maxpostingsize <= INDEX_SIZE_MASK);
+				dstate->htids = palloc(dstate->maxpostingsize);
+
+				/* start new pending posting list with itup copy */
+				_bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
+										InvalidOffsetNumber);
+			}
+			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
+										 itup) > keysz &&
+					 _bt_dedup_save_htid(dstate, itup))
+			{
+				/*
+				 * Tuple is equal to base tuple of pending posting list.  Heap
+				 * TID from itup has been saved in state.
+				 */
+			}
+			else
+			{
+				/*
+				 * Tuple is not equal to pending posting list tuple, or
+				 * _bt_dedup_save_htid() opted to not merge current item into
+				 * pending posting list.
+				 */
+				_bt_sort_dedup_finish_pending(wstate, state, dstate);
+				pfree(dstate->base);
+
+				/* start new pending posting list with itup copy */
+				_bt_dedup_start_pending(dstate, CopyIndexTuple(itup),
+										InvalidOffsetNumber);
+			}
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+
+		if (state)
+		{
+			/*
+			 * Handle the last item (there must be a last item when the
+			 * tuplesort returned one or more tuples)
+			 */
+			_bt_sort_dedup_finish_pending(wstate, state, dstate);
+			pfree(dstate->base);
+			pfree(dstate->htids);
+		}
+
+		pfree(dstate);
+	}
 	else
 	{
-		/* merge is unnecessary */
+		/* merging and deduplication are both unnecessary */
 		while ((itup = tuplesort_getindextuple(btspool->sortstate,
 											   true)) != NULL)
 		{
@@ -1253,7 +1412,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			if (state == NULL)
 				state = _bt_pagestate(wstate, 0);
 
-			_bt_buildadd(wstate, state, itup);
+			_bt_buildadd(wstate, state, itup, 0);
 
 			/* Report progress */
 			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 76c2d945c8..8ba055be9e 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
 	state.minfirstrightsz = SIZE_MAX;
 	state.newitemoff = newitemoff;
 
+	/* newitem cannot be a posting list item */
+	Assert(!BTreeTupleIsPosting(newitem));
+
 	/*
 	 * maxsplits should never exceed maxoff because there will be at most as
 	 * many candidate split points as there are points _between_ tuples, once
@@ -459,6 +462,7 @@ _bt_recsplitloc(FindSplitData *state,
 	int16		leftfree,
 				rightfree;
 	Size		firstrightitemsz;
+	Size		postingsz = 0;
 	bool		newitemisfirstonright;
 
 	/* Is the new item going to be the first item on the right page? */
@@ -468,8 +472,30 @@ _bt_recsplitloc(FindSplitData *state,
 	if (newitemisfirstonright)
 		firstrightitemsz = state->newitemsz;
 	else
+	{
 		firstrightitemsz = firstoldonrightsz;
 
+		/*
+		 * Calculate suffix truncation space saving when firstright is a
+		 * posting list tuple, though only when the firstright is over 64
+		 * bytes including line pointer overhead (arbitrary).  This avoids
+		 * accessing the tuple in cases where its posting list must be very
+		 * small (if firstright has one at all).
+		 */
+		if (state->is_leaf && firstrightitemsz > 64)
+		{
+			ItemId		itemid;
+			IndexTuple	newhighkey;
+
+			itemid = PageGetItemId(state->page, firstoldonright);
+			newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+			if (BTreeTupleIsPosting(newhighkey))
+				postingsz = IndexTupleSize(newhighkey) -
+					BTreeTupleGetPostingOffset(newhighkey);
+		}
+	}
+
 	/* Account for all the old tuples */
 	leftfree = state->leftspace - olddataitemstoleft;
 	rightfree = state->rightspace -
@@ -491,11 +517,17 @@ _bt_recsplitloc(FindSplitData *state,
 	 * If we are on the leaf level, assume that suffix truncation cannot avoid
 	 * adding a heap TID to the left half's new high key when splitting at the
 	 * leaf level.  In practice the new high key will often be smaller and
-	 * will rarely be larger, but conservatively assume the worst case.
+	 * will rarely be larger, but conservatively assume the worst case.  We do
+	 * go to the trouble of subtracting away posting list overhead, though
+	 * only when it looks like it will make an appreciable difference.
+	 * (Posting lists are the only case where truncation will typically make
+	 * the final high key far smaller than firstright, so being a bit more
+	 * precise there noticeably improves the balance of free space.)
 	 */
 	if (state->is_leaf)
 		leftfree -= (int16) (firstrightitemsz +
-							 MAXALIGN(sizeof(ItemPointerData)));
+							 MAXALIGN(sizeof(ItemPointerData)) -
+							 postingsz);
 	else
 		leftfree -= (int16) firstrightitemsz;
 
@@ -691,7 +723,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
 	tup = (IndexTuple) PageGetItem(state->page, itemid);
 	/* Do cheaper test first */
-	if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+	if (BTreeTupleIsPosting(tup) ||
+		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
 	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 74d1f5dd1e..80fe1ac004 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -81,7 +81,10 @@ static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
  *		determine whether or not the keys in the index are expected to be
  *		unique (i.e. if this is a "heapkeyspace" index).  We assume a
  *		heapkeyspace index when caller passes a NULL tuple, allowing index
- *		build callers to avoid accessing the non-existent metapage.
+ *		build callers to avoid accessing the non-existent metapage.  We
+ *		also assume that the index is _not_ allequalimage when a NULL tuple
+ *		is passed; CREATE INDEX callers call _bt_allequalimage() to set the
+ *		field themselves.
  */
 BTScanInsert
 _bt_mkscankey(Relation rel, IndexTuple itup)
@@ -108,7 +111,14 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	 */
 	key = palloc(offsetof(BTScanInsertData, scankeys) +
 				 sizeof(ScanKeyData) * indnkeyatts);
-	key->heapkeyspace = itup == NULL || _bt_heapkeyspace(rel);
+	if (itup)
+		_bt_metaversion(rel, &key->heapkeyspace, &key->allequalimage);
+	else
+	{
+		/* Utility statement callers can set these fields themselves */
+		key->heapkeyspace = true;
+		key->allequalimage = false;
+	}
 	key->anynullkeys = false;	/* initial assumption */
 	key->nextkey = false;
 	key->pivotsearch = false;
@@ -1374,6 +1384,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			continue;
 		}
 
@@ -1535,6 +1546,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
 			 * attribute passes the qual.
 			 */
 			Assert(ScanDirectionIsForward(dir));
+			Assert(BTreeTupleIsPivot(tuple));
 			cmpresult = 0;
 			if (subkey->sk_flags & SK_ROW_END)
 				break;
@@ -1774,10 +1786,65 @@ _bt_killitems(IndexScanDesc scan)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
+			bool		killtuple = false;
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			if (BTreeTupleIsPosting(ituple))
 			{
-				/* found the item */
+				int			pi = i + 1;
+				int			nposting = BTreeTupleGetNPosting(ituple);
+				int			j;
+
+				/*
+				 * Note that we rely on the assumption that heap TIDs in the
+				 * scanpos items array are always in ascending heap TID order
+				 * within a posting list
+				 */
+				for (j = 0; j < nposting; j++)
+				{
+					ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+					if (!ItemPointerEquals(item, &kitem->heapTid))
+						break;	/* out of posting list loop */
+
+					/* kitem must have matching offnum when heap TIDs match */
+					Assert(kitem->indexOffset == offnum);
+
+					/*
+					 * Read-ahead to later kitems here.
+					 *
+					 * We rely on the assumption that not advancing kitem here
+					 * will prevent us from considering the posting list tuple
+					 * fully dead by not matching its next heap TID in next
+					 * loop iteration.
+					 *
+					 * If, on the other hand, this is the final heap TID in
+					 * the posting list tuple, then tuple gets killed
+					 * regardless (i.e. we handle the case where the last
+					 * kitem is also the last heap TID in the last index tuple
+					 * correctly -- posting tuple still gets killed).
+					 */
+					if (pi < numKilled)
+						kitem = &so->currPos.items[so->killedItems[pi++]];
+				}
+
+				/*
+				 * Don't bother advancing the outermost loop's int iterator to
+				 * avoid processing killed items that relate to the same
+				 * offnum/posting list tuple.  This micro-optimization hardly
+				 * seems worth it.  (Further iterations of the outermost loop
+				 * will fail to match on this same posting list's first heap
+				 * TID instead, so we'll advance to the next offnum/index
+				 * tuple pretty quickly.)
+				 */
+				if (j == nposting)
+					killtuple = true;
+			}
+			else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+				killtuple = true;
+
+			if (killtuple)
+			{
+				/* found the item/all posting list items */
 				ItemIdMarkDead(iid);
 				killedsomething = true;
 				break;			/* out of inner search loop */
@@ -2018,7 +2085,9 @@ btoptions(Datum reloptions, bool validate)
 	static const relopt_parse_elt tab[] = {
 		{"fillfactor", RELOPT_TYPE_INT, offsetof(BTOptions, fillfactor)},
 		{"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
-		offsetof(BTOptions, vacuum_cleanup_index_scale_factor)}
+		offsetof(BTOptions, vacuum_cleanup_index_scale_factor)},
+		{"deduplicate_items", RELOPT_TYPE_BOOL,
+		offsetof(BTOptions, deduplicate_items)}
 
 	};
 
@@ -2119,11 +2188,10 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	Size		newsize;
 
 	/*
-	 * We should only ever truncate leaf index tuples.  It's never okay to
-	 * truncate a second time.
+	 * We should only ever truncate non-pivot tuples from leaf pages.  It's
+	 * never okay to truncate when splitting an internal page.
 	 */
-	Assert(BTreeTupleGetNAtts(lastleft, rel) == natts);
-	Assert(BTreeTupleGetNAtts(firstright, rel) == natts);
+	Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
 
 	/* Determine how many attributes must be kept in truncated tuple */
 	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
@@ -2139,6 +2207,19 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(pivot))
+		{
+			/*
+			 * index_truncate_tuple() just returns a straight copy of
+			 * firstright when it has no key attributes to truncate.  We need
+			 * to truncate away the posting list ourselves.
+			 */
+			Assert(keepnatts == nkeyatts);
+			Assert(natts == nkeyatts);
+			pivot->t_info &= ~INDEX_SIZE_MASK;
+			pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+		}
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2155,6 +2236,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * attribute to the new pivot tuple.
 		 */
 		Assert(natts != nkeyatts);
+		Assert(!BTreeTupleIsPosting(lastleft) &&
+			   !BTreeTupleIsPosting(firstright));
 		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
 		tidpivot = palloc0(newsize);
 		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2172,6 +2255,19 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
 		pivot = palloc0(newsize);
 		memcpy(pivot, firstright, IndexTupleSize(firstright));
+
+		if (BTreeTupleIsPosting(firstright))
+		{
+			/*
+			 * New pivot tuple was copied from firstright, which happens to be
+			 * a posting list tuple.  We will have to include the max lastleft
+			 * heap TID in the final pivot tuple, but we can remove the
+			 * posting list now. (Pivot tuples should never contain a posting
+			 * list.)
+			 */
+			newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+				MAXALIGN(sizeof(ItemPointerData));
+		}
 	}
 
 	/*
@@ -2199,7 +2295,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2210,9 +2306,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
+							  BTreeTupleGetHeapTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #else
 
 	/*
@@ -2225,7 +2324,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2234,7 +2333,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2301,6 +2401,13 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		keepnatts++;
 	}
 
+	/*
+	 * Assert that _bt_keep_natts_fast() agrees with us in passing.  This is
+	 * expected in an allequalimage index.
+	 */
+	Assert(!itup_key->allequalimage ||
+		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+
 	return keepnatts;
 }
 
@@ -2315,13 +2422,16 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
  * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal after detoasting.
+ * unless they're bitwise equal after detoasting.  When an index only has
+ * "equal image" columns, routine is guaranteed to give the same result as
+ * _bt_keep_natts would.
  *
- * These issues must be acceptable to callers, typically because they're only
- * concerned about making suffix truncation as effective as possible without
- * leaving excessive amounts of free space on either side of page split.
  * Callers can rely on the fact that attributes considered equal here are
- * definitely also equal according to _bt_keep_natts.
+ * definitely also equal according to _bt_keep_natts, even when the index uses
+ * an opclass or collation that is not "allequalimage"/deduplication-safe.
+ * This weaker guarantee is good enough for nbtsplitloc.c caller, since false
+ * negatives generally only have the effect of making leaf page splits use a
+ * more balanced split point.
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2393,28 +2503,42 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * Mask allocated for number of keys in index tuple must be able to fit
 	 * maximum possible number of index attributes
 	 */
-	StaticAssertStmt(BT_N_KEYS_OFFSET_MASK >= INDEX_MAX_KEYS,
-					 "BT_N_KEYS_OFFSET_MASK can't fit INDEX_MAX_KEYS");
+	StaticAssertStmt(BT_OFFSET_MASK >= INDEX_MAX_KEYS,
+					 "BT_OFFSET_MASK can't fit INDEX_MAX_KEYS");
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
+	/* !heapkeyspace indexes do not support deduplication */
+	if (!heapkeyspace && BTreeTupleIsPosting(itup))
+		return false;
+
+	/* Posting list tuples should never have "pivot heap TID" bit set */
+	if (BTreeTupleIsPosting(itup) &&
+		(ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+		 BT_PIVOT_HEAP_TID_ATTR) != 0)
+		return false;
+
+	/* INCLUDE indexes do not support deduplication */
+	if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+		return false;
+
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
 			/*
-			 * Non-pivot tuples currently never use alternative heap TID
-			 * representation -- even those within heapkeyspace indexes
+			 * Non-pivot tuple should never be explicitly marked as a pivot
+			 * tuple
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated.  (Note that tupnatts must have been
-			 * inferred, rather than coming from an explicit on-disk
-			 * representation.)
+			 * inferred, even with a posting list tuple, because only pivot
+			 * tuples store tupnatts directly.)
 			 */
 			return tupnatts == natts;
 		}
@@ -2458,12 +2582,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * non-zero, or when there is no explicit representation and the
 			 * tuple is evidently not a pre-pg_upgrade tuple.
 			 *
-			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
-			 * that to decide if the tuple is a pre-v11 tuple.
+			 * Prior to v11, downlinks always had P_HIKEY as their offset.
+			 * Accept that as an alternative indication of a valid
+			 * !heapkeyspace negative infinity tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
-				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+				ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
 		}
 		else
 		{
@@ -2489,7 +2613,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
+		return false;
+
+	/* Pivot tuple should not use posting list representation (redundant) */
+	if (BTreeTupleIsPosting(itup))
 		return false;
 
 	/*
@@ -2559,8 +2687,8 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					BTMaxItemSizeNoHeapTid(page),
 					RelationGetRelationName(rel)),
 			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
-					   ItemPointerGetBlockNumber(&newtup->t_tid),
-					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+					   ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
 					   RelationGetRelationName(heap)),
 			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
 					 "Consider a function index of an MD5 hash of the value, "
@@ -2633,9 +2761,8 @@ _bt_allequalimage(Relation rel, bool debugmessage)
 			elog(DEBUG1, "index \"%s\" can safely use deduplication",
 				 RelationGetRelationName(rel));
 		else
-			ereport(NOTICE,
-					(errmsg("index \"%s\" cannot use deduplication",
-							RelationGetRelationName(rel))));
+			elog(DEBUG1, "index \"%s\" cannot use deduplication",
+				 RelationGetRelationName(rel));
 	}
 
 	return allequalimage;
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 2e5202c2d6..72d3b63f3c 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -22,6 +22,9 @@
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "storage/procarray.h"
+#include "utils/memutils.h"
+
+static MemoryContext opCtx;		/* working memory for operations */
 
 /*
  * _bt_restore_page -- re-enter all the index tuples on a page
@@ -111,6 +114,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
 	Assert(md->btm_version >= BTREE_NOVAC_VERSION);
 	md->btm_oldest_btpo_xact = xlrec->oldest_btpo_xact;
 	md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
+	md->btm_allequalimage = xlrec->allequalimage;
 
 	pageop = (BTPageOpaque) PageGetSpecialPointer(metapg);
 	pageop->btpo_flags = BTP_META;
@@ -156,7 +160,8 @@ _bt_clear_incomplete_split(XLogReaderState *record, uint8 block_id)
 }
 
 static void
-btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
+btree_xlog_insert(bool isleaf, bool ismeta, bool posting,
+				  XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	xl_btree_insert *xlrec = (xl_btree_insert *) XLogRecGetData(record);
@@ -181,9 +186,52 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 
 		page = BufferGetPage(buffer);
 
-		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
-						false, false) == InvalidOffsetNumber)
-			elog(PANIC, "btree_xlog_insert: failed to add item");
+		if (!posting)
+		{
+			/* Simple retail insertion */
+			if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+		}
+		else
+		{
+			ItemId		itemid;
+			IndexTuple	oposting,
+						newitem,
+						nposting;
+			uint16		postingoff;
+
+			/*
+			 * A posting list split occurred during leaf page insertion.  WAL
+			 * record data will start with an offset number representing the
+			 * point in an existing posting list that a split occurs at.
+			 *
+			 * Use _bt_swap_posting() to repeat posting list split steps from
+			 * primary.  Note that newitem from WAL record is 'orignewitem',
+			 * not the final version of newitem that is actually inserted on
+			 * page.
+			 */
+			postingoff = *((uint16 *) datapos);
+			datapos += sizeof(uint16);
+			datalen -= sizeof(uint16);
+
+			itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+			oposting = (IndexTuple) PageGetItem(page, itemid);
+
+			/* Use mutable, aligned newitem copy in _bt_swap_posting() */
+			Assert(isleaf && postingoff > 0);
+			newitem = CopyIndexTuple((IndexTuple) datapos);
+			nposting = _bt_swap_posting(newitem, oposting, postingoff);
+
+			/* Replace existing posting list with post-split version */
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+			/* Insert "final" new item (not orignewitem from WAL stream) */
+			Assert(IndexTupleSize(newitem) == datalen);
+			if (PageAddItem(page, (Item) newitem, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add posting split new item");
+		}
 
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
@@ -265,20 +313,38 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 		BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
 		OffsetNumber off;
 		IndexTuple	newitem = NULL,
-					left_hikey = NULL;
+					left_hikey = NULL,
+					nposting = NULL;
 		Size		newitemsz = 0,
 					left_hikeysz = 0;
 		Page		newlpage;
-		OffsetNumber leftoff;
+		OffsetNumber leftoff,
+					replacepostingoff = InvalidOffsetNumber;
 
 		datapos = XLogRecGetBlockData(record, 0, &datalen);
 
-		if (onleft)
+		if (onleft || xlrec->postingoff != 0)
 		{
 			newitem = (IndexTuple) datapos;
 			newitemsz = MAXALIGN(IndexTupleSize(newitem));
 			datapos += newitemsz;
 			datalen -= newitemsz;
+
+			if (xlrec->postingoff != 0)
+			{
+				ItemId		itemid;
+				IndexTuple	oposting;
+
+				/* Posting list must be at offset number before new item's */
+				replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+				/* Use mutable, aligned newitem copy in _bt_swap_posting() */
+				newitem = CopyIndexTuple(newitem);
+				itemid = PageGetItemId(lpage, replacepostingoff);
+				oposting = (IndexTuple) PageGetItem(lpage, itemid);
+				nposting = _bt_swap_posting(newitem, oposting,
+											xlrec->postingoff);
+			}
 		}
 
 		/*
@@ -308,8 +374,20 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 			Size		itemsz;
 			IndexTuple	item;
 
+			/* Add replacement posting list when required */
+			if (off == replacepostingoff)
+			{
+				Assert(onleft || xlrec->firstright == xlrec->newitemoff);
+				if (PageAddItem(newlpage, (Item) nposting,
+								MAXALIGN(IndexTupleSize(nposting)), leftoff,
+								false, false) == InvalidOffsetNumber)
+					elog(ERROR, "failed to add new posting list item to left page after split");
+				leftoff = OffsetNumberNext(leftoff);
+				continue;		/* don't insert oposting */
+			}
+
 			/* add the new item if it was inserted on left page */
-			if (onleft && off == xlrec->newitemoff)
+			else if (onleft && off == xlrec->newitemoff)
 			{
 				if (PageAddItem(newlpage, (Item) newitem, newitemsz, leftoff,
 								false, false) == InvalidOffsetNumber)
@@ -383,6 +461,98 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 	}
 }
 
+static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+	Buffer		buf;
+
+	if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+	{
+		char	   *ptr = XLogRecGetBlockData(record, 0, NULL);
+		Page		page = (Page) BufferGetPage(buf);
+		BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		OffsetNumber offnum,
+					minoff,
+					maxoff;
+		BTDedupState state;
+		BTDedupInterval *intervals;
+		Page		newpage;
+
+		state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+		state->deduplicate = true;	/* unused */
+		/* Conservatively use larger maxpostingsize than primary */
+		state->maxpostingsize = BTMaxItemSize(page);
+		state->base = NULL;
+		state->baseoff = InvalidOffsetNumber;
+		state->basetupsize = 0;
+		state->htids = palloc(state->maxpostingsize);
+		state->nhtids = 0;
+		state->nitems = 0;
+		state->phystupsize = 0;
+		state->nintervals = 0;
+
+		minoff = P_FIRSTDATAKEY(opaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+		newpage = PageGetTempPageCopySpecial(page);
+
+		if (!P_RIGHTMOST(opaque))
+		{
+			ItemId		itemid = PageGetItemId(page, P_HIKEY);
+			Size		itemsz = ItemIdGetLength(itemid);
+			IndexTuple	item = (IndexTuple) PageGetItem(page, itemid);
+
+			if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+							false, false) == InvalidOffsetNumber)
+				elog(ERROR, "failed to add highkey during deduplication");
+		}
+
+		intervals = (BTDedupInterval *) ptr;
+		for (offnum = minoff;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemid = PageGetItemId(page, offnum);
+			IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+			if (offnum == minoff)
+				_bt_dedup_start_pending(state, itup, offnum);
+			else if (state->nintervals < xlrec->nintervals &&
+					 state->baseoff == intervals[state->nintervals].baseoff &&
+					 state->nitems < intervals[state->nintervals].nitems)
+			{
+				if (!_bt_dedup_save_htid(state, itup))
+					elog(ERROR, "could not add heap tid to pending posting list");
+			}
+			else
+			{
+				_bt_dedup_finish_pending(newpage, state);
+				_bt_dedup_start_pending(state, itup, offnum);
+			}
+		}
+
+		_bt_dedup_finish_pending(newpage, state);
+		Assert(state->nintervals == xlrec->nintervals);
+		Assert(memcmp(state->intervals, intervals,
+					  state->nintervals * sizeof(BTDedupInterval)) == 0);
+
+		if (P_HAS_GARBAGE(opaque))
+		{
+			BTPageOpaque nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+			nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+		}
+
+		PageRestoreTempPage(newpage, page);
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buf);
+	}
+
+	if (BufferIsValid(buf))
+		UnlockReleaseBuffer(buf);
+}
+
 static void
 btree_xlog_vacuum(XLogReaderState *record)
 {
@@ -405,7 +575,56 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		page = (Page) BufferGetPage(buffer);
 
-		PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
+		if (xlrec->nupdated > 0)
+		{
+			OffsetNumber *updatedoffsets;
+			xl_btree_update *updates;
+
+			updatedoffsets = (OffsetNumber *)
+				(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+			updates = (xl_btree_update *) ((char *) updatedoffsets +
+										   xlrec->nupdated *
+										   sizeof(OffsetNumber));
+
+			for (int i = 0; i < xlrec->nupdated; i++)
+			{
+				BTVacuumPosting vacposting;
+				IndexTuple	origtuple;
+				ItemId		itemid;
+				Size		itemsz;
+
+				itemid = PageGetItemId(page, updatedoffsets[i]);
+				origtuple = (IndexTuple) PageGetItem(page, itemid);
+
+				vacposting = palloc(offsetof(BTVacuumPostingData, deletetids) +
+									updates->ndeletedtids * sizeof(uint16));
+				vacposting->updatedoffset = updatedoffsets[i];
+				vacposting->itup = origtuple;
+				vacposting->ndeletedtids = updates->ndeletedtids;
+				memcpy(vacposting->deletetids,
+					   (char *) updates + SizeOfBtreeUpdate,
+					   updates->ndeletedtids * sizeof(uint16));
+
+				_bt_update_posting(vacposting);
+
+				/* Overwrite updated version of tuple */
+				itemsz = MAXALIGN(IndexTupleSize(vacposting->itup));
+				if (!PageIndexTupleOverwrite(page, updatedoffsets[i],
+											 (Item) vacposting->itup, itemsz))
+					elog(PANIC, "could not update partially dead item");
+
+				pfree(vacposting->itup);
+				pfree(vacposting);
+
+				/* advance to next xl_btree_update/update */
+				updates = (xl_btree_update *)
+					((char *) updates + SizeOfBtreeUpdate +
+					 updates->ndeletedtids * sizeof(uint16));
+			}
+		}
+
+		if (xlrec->ndeleted > 0)
+			PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 
 		/*
 		 * Mark the page as not containing any LP_DEAD items --- see comments
@@ -724,17 +943,19 @@ void
 btree_redo(XLogReaderState *record)
 {
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	MemoryContext oldCtx;
 
+	oldCtx = MemoryContextSwitchTo(opCtx);
 	switch (info)
 	{
 		case XLOG_BTREE_INSERT_LEAF:
-			btree_xlog_insert(true, false, record);
+			btree_xlog_insert(true, false, false, record);
 			break;
 		case XLOG_BTREE_INSERT_UPPER:
-			btree_xlog_insert(false, false, record);
+			btree_xlog_insert(false, false, false, record);
 			break;
 		case XLOG_BTREE_INSERT_META:
-			btree_xlog_insert(false, true, record);
+			btree_xlog_insert(false, true, false, record);
 			break;
 		case XLOG_BTREE_SPLIT_L:
 			btree_xlog_split(true, record);
@@ -742,6 +963,12 @@ btree_redo(XLogReaderState *record)
 		case XLOG_BTREE_SPLIT_R:
 			btree_xlog_split(false, record);
 			break;
+		case XLOG_BTREE_INSERT_POST:
+			btree_xlog_insert(true, false, true, record);
+			break;
+		case XLOG_BTREE_DEDUP:
+			btree_xlog_dedup(record);
+			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
 			break;
@@ -767,6 +994,23 @@ btree_redo(XLogReaderState *record)
 		default:
 			elog(PANIC, "btree_redo: unknown op code %u", info);
 	}
+	MemoryContextSwitchTo(oldCtx);
+	MemoryContextReset(opCtx);
+}
+
+void
+btree_xlog_startup(void)
+{
+	opCtx = AllocSetContextCreate(CurrentMemoryContext,
+								  "Btree recovery temporary context",
+								  ALLOCSET_DEFAULT_SIZES);
+}
+
+void
+btree_xlog_cleanup(void)
+{
+	MemoryContextDelete(opCtx);
+	opCtx = NULL;
 }
 
 /*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 7d63a7124e..7a1616f371 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -27,6 +27,7 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 		case XLOG_BTREE_INSERT_LEAF:
 		case XLOG_BTREE_INSERT_UPPER:
 		case XLOG_BTREE_INSERT_META:
+		case XLOG_BTREE_INSERT_POST:
 			{
 				xl_btree_insert *xlrec = (xl_btree_insert *) rec;
 
@@ -38,15 +39,24 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
-				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d",
-								 xlrec->level, xlrec->firstright, xlrec->newitemoff);
+				appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, postingoff %d",
+								 xlrec->level, xlrec->firstright,
+								 xlrec->newitemoff, xlrec->postingoff);
+				break;
+			}
+		case XLOG_BTREE_DEDUP:
+			{
+				xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+				appendStringInfo(buf, "nintervals %u", xlrec->nintervals);
 				break;
 			}
 		case XLOG_BTREE_VACUUM:
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "ndeleted %u", xlrec->ndeleted);
+				appendStringInfo(buf, "ndeleted %u; nupdated %u",
+								 xlrec->ndeleted, xlrec->nupdated);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
@@ -130,6 +140,12 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
+		case XLOG_BTREE_INSERT_POST:
+			id = "INSERT_POST";
+			break;
+		case XLOG_BTREE_DEDUP:
+			id = "DEDUP";
+			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 4ea6ea7a3d..cb7b8c8a63 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1048,8 +1048,10 @@ PageIndexTupleDeleteNoCompact(Page page, OffsetNumber offnum)
  * This is better than deleting and reinserting the tuple, because it
  * avoids any data shifting when the tuple size doesn't change; and
  * even when it does, we avoid moving the line pointers around.
- * Conceivably this could also be of use to an index AM that cares about
- * the physical order of tuples as well as their ItemId order.
+ * This could be used by an index AM that doesn't want to unset the
+ * LP_DEAD bit when it happens to be set.  It could conceivably also be
+ * used by an index AM that cares about the physical order of tuples as
+ * well as their logical/ItemId order.
  *
  * If there's insufficient space for the new tuple, return false.  Other
  * errors represent data-corruption problems, so we just elog.
@@ -1134,8 +1136,9 @@ PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
 		}
 	}
 
-	/* Update the item's tuple length (other fields shouldn't change) */
-	ItemIdSetNormal(tupid, offset + size_diff, newsize);
+	/* Update the item's tuple length without changing its lp_flags field */
+	tupid->lp_off = offset + size_diff;
+	tupid->lp_len = newsize;
 
 	/* Copy new tuple data onto page */
 	memcpy(PageGetItem(page, tupid), newtup, newsize);
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index dc03fbde13..b6b08d0ccb 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1731,14 +1731,14 @@ psql_completion(const char *text, int start, int end)
 	/* ALTER INDEX <foo> SET|RESET ( */
 	else if (Matches("ALTER", "INDEX", MatchAny, "RESET", "("))
 		COMPLETE_WITH("fillfactor",
-					  "vacuum_cleanup_index_scale_factor",	/* BTREE */
+					  "vacuum_cleanup_index_scale_factor", "deduplicate_items",	/* BTREE */
 					  "fastupdate", "gin_pending_list_limit",	/* GIN */
 					  "buffering",	/* GiST */
 					  "pages_per_range", "autosummarize"	/* BRIN */
 			);
 	else if (Matches("ALTER", "INDEX", MatchAny, "SET", "("))
 		COMPLETE_WITH("fillfactor =",
-					  "vacuum_cleanup_index_scale_factor =",	/* BTREE */
+					  "vacuum_cleanup_index_scale_factor =", "deduplicate_items =",	/* BTREE */
 					  "fastupdate =", "gin_pending_list_limit =",	/* GIN */
 					  "buffering =",	/* GiST */
 					  "pages_per_range =", "autosummarize ="	/* BRIN */
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 6a058ccdac..8a830e570c 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -145,6 +145,7 @@ static void bt_tuple_present_callback(Relation index, ItemPointer tid,
 									  bool tupleIsAlive, void *checkstate);
 static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 									 IndexTuple itup);
+static inline IndexTuple bt_posting_plain_tuple(IndexTuple itup, int n);
 static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 											   OffsetNumber offset);
@@ -167,6 +168,7 @@ static ItemId PageGetItemIdCareful(BtreeCheckState *state, BlockNumber block,
 								   Page page, OffsetNumber offset);
 static inline ItemPointer BTreeTupleGetHeapTIDCareful(BtreeCheckState *state,
 													  IndexTuple itup, bool nonpivot);
+static inline ItemPointer BTreeTupleGetPointsToTID(IndexTuple itup);
 
 /*
  * bt_index_check(index regclass, heapallindexed boolean)
@@ -278,7 +280,8 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
 
 	if (btree_index_mainfork_expected(indrel))
 	{
-		bool	heapkeyspace;
+		bool		heapkeyspace,
+					allequalimage;
 
 		RelationOpenSmgr(indrel);
 		if (!smgrexists(indrel->rd_smgr, MAIN_FORKNUM))
@@ -288,7 +291,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
 							RelationGetRelationName(indrel))));
 
 		/* Check index, possibly against table it is an index on */
-		heapkeyspace = _bt_heapkeyspace(indrel);
+		_bt_metaversion(indrel, &heapkeyspace, &allequalimage);
 		bt_check_every_level(indrel, heaprel, heapkeyspace, parentcheck,
 							 heapallindexed, rootdescend);
 	}
@@ -419,12 +422,12 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		/*
 		 * Size Bloom filter based on estimated number of tuples in index,
 		 * while conservatively assuming that each block must contain at least
-		 * MaxIndexTuplesPerPage / 5 non-pivot tuples.  (Non-leaf pages cannot
-		 * contain non-pivot tuples.  That's okay because they generally make
-		 * up no more than about 1% of all pages in the index.)
+		 * MaxTIDsPerBTreePage / 3 "plain" tuples -- see
+		 * bt_posting_plain_tuple() for definition, and details of how posting
+		 * list tuples are handled.
 		 */
 		total_pages = RelationGetNumberOfBlocks(rel);
-		total_elems = Max(total_pages * (MaxIndexTuplesPerPage / 5),
+		total_elems = Max(total_pages * (MaxTIDsPerBTreePage / 3),
 						  (int64) state->rel->rd_rel->reltuples);
 		/* Random seed relies on backend srandom() call to avoid repetition */
 		seed = random();
@@ -924,6 +927,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -954,13 +958,15 @@ bt_target_page_check(BtreeCheckState *state)
 		if (!_bt_check_natts(state->rel, state->heapkeyspace, state->target,
 							 offset))
 		{
+			ItemPointer tid;
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetPointsToTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -994,18 +1000,20 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * TID, since the posting list itself is validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
+			ItemPointer tid = BTreeTupleGetPointsToTID(itup);
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
-			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+			htid = psprintf("(%u,%u)", ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1017,6 +1025,40 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is a posting list tuple, make sure posting list TIDs are
+		 * in order
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+				current = BTreeTupleGetPostingN(itup, i);
+
+				if (ItemPointerCompare(current, &last) <= 0)
+				{
+					char	   *itid = psprintf("(%u,%u)", state->targetblock, offset);
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s posting list offset=%d page lsn=%X/%X.",
+												itid, i,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1049,13 +1091,14 @@ bt_target_page_check(BtreeCheckState *state)
 		if (tupsize > (lowersizelimit ? BTMaxItemSize(state->target) :
 					   BTMaxItemSizeNoHeapTid(state->target)))
 		{
+			ItemPointer tid = BTreeTupleGetPointsToTID(itup);
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1074,12 +1117,32 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				/* Fingerprint all elements as distinct "plain" tuples */
+				for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					IndexTuple	logtuple;
+
+					logtuple = bt_posting_plain_tuple(itup, i);
+					norm = bt_normalize_tuple(state, logtuple);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != logtuple)
+						pfree(norm);
+					pfree(logtuple);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
@@ -1087,7 +1150,8 @@ bt_target_page_check(BtreeCheckState *state)
 		 *
 		 * If there is a high key (if this is not the rightmost page on its
 		 * entire level), check that high key actually is upper bound on all
-		 * page items.
+		 * page items.  If this is a posting list tuple, we'll need to set
+		 * scantid to be highest TID in posting list.
 		 *
 		 * We prefer to check all items against high key rather than checking
 		 * just the last and trusting that the operator class obeys the
@@ -1127,17 +1191,22 @@ bt_target_page_check(BtreeCheckState *state)
 		 * tuple. (See also: "Notes About Data Representation" in the nbtree
 		 * README.)
 		 */
+		scantid = skey->scantid;
+		if (state->heapkeyspace && BTreeTupleIsPosting(itup))
+			skey->scantid = BTreeTupleGetMaxHeapTID(itup);
+
 		if (!P_RIGHTMOST(topaque) &&
 			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
 			  invariant_l_offset(state, skey, P_HIKEY)))
 		{
+			ItemPointer tid = BTreeTupleGetPointsToTID(itup);
 			char	   *itid,
 					   *htid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1150,6 +1219,8 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
+		/* Reset, in case scantid was set to (itup) posting tuple's max TID */
+		skey->scantid = scantid;
 
 		/*
 		 * * Item order check *
@@ -1160,15 +1231,17 @@ bt_target_page_check(BtreeCheckState *state)
 		if (OffsetNumberNext(offset) <= max &&
 			!invariant_l_offset(state, skey, OffsetNumberNext(offset)))
 		{
+			ItemPointer tid;
 			char	   *itid,
 					   *htid,
 					   *nitid,
 					   *nhtid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetPointsToTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 			nitid = psprintf("(%u,%u)", state->targetblock,
 							 OffsetNumberNext(offset));
 
@@ -1177,9 +1250,10 @@ bt_target_page_check(BtreeCheckState *state)
 										  state->target,
 										  OffsetNumberNext(offset));
 			itup = (IndexTuple) PageGetItem(state->target, itemid);
+			tid = BTreeTupleGetPointsToTID(itup);
 			nhtid = psprintf("(%u,%u)",
-							 ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							 ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							 ItemPointerGetBlockNumberNoCheck(tid),
+							 ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1953,10 +2027,9 @@ bt_tuple_present_callback(Relation index, ItemPointer tid, Datum *values,
  * verification.  In particular, it won't try to normalize opclass-equal
  * datums with potentially distinct representations (e.g., btree/numeric_ops
  * index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though.  For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation.
  */
 static IndexTuple
 bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -1969,6 +2042,9 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	IndexTuple	reformed;
 	int			i;
 
+	/* Caller should only pass "logical" non-pivot tuples here */
+	Assert(!BTreeTupleIsPosting(itup) && !BTreeTupleIsPivot(itup));
+
 	/* Easy case: It's immediately clear that tuple has no varlena datums */
 	if (!IndexTupleHasVarwidths(itup))
 		return itup;
@@ -2031,6 +2107,29 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	return reformed;
 }
 
+/*
+ * Produce palloc()'d "plain" tuple for nth posting list entry/TID.
+ *
+ * In general, deduplication is not supposed to change the logical contents of
+ * an index.  Multiple index tuples are merged together into one equivalent
+ * posting list index tuple when convenient.
+ *
+ * heapallindexed verification must normalize-away this variation in
+ * representation by converting posting list tuples into two or more "plain"
+ * tuples.  Each tuple must be fingerprinted separately -- there must be one
+ * tuple for each corresponding Bloom filter probe during the heap scan.
+ *
+ * Note: Caller still needs to call bt_normalize_tuple() with returned tuple.
+ */
+static inline IndexTuple
+bt_posting_plain_tuple(IndexTuple itup, int n)
+{
+	Assert(BTreeTupleIsPosting(itup));
+
+	/* Returns non-posting-list tuple */
+	return _bt_form_posting(itup, BTreeTupleGetPostingN(itup, n), 1);
+}
+
 /*
  * Search for itup in index, starting from fast root page.  itup must be a
  * non-pivot tuple.  This is only supported with heapkeyspace indexes, since
@@ -2087,6 +2186,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		insertstate.itup = itup;
 		insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 		insertstate.itup_key = key;
+		insertstate.postingoff = 0;
 		insertstate.bounds_valid = false;
 		insertstate.buf = lbuf;
 
@@ -2094,7 +2194,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		offnum = _bt_binsrch_insert(state->rel, &insertstate);
 		/* Compare first >= matching item on leaf page, if any */
 		page = BufferGetPage(lbuf);
+		/* Should match on first heap TID when tuple has a posting list */
 		if (offnum <= PageGetMaxOffsetNumber(page) &&
+			insertstate.postingoff <= 0 &&
 			_bt_compare(state->rel, key, page, offnum) == 0)
 			exists = true;
 		_bt_relbuf(state->rel, lbuf);
@@ -2548,26 +2650,69 @@ PageGetItemIdCareful(BtreeCheckState *state, BlockNumber block, Page page,
 }
 
 /*
- * BTreeTupleGetHeapTID() wrapper that lets caller enforce that a heap TID must
- * be present in cases where that is mandatory.
- *
- * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
- * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
- * It may become more useful in the future, when non-pivot tuples support their
- * own alternative INDEX_ALT_TID_MASK representation.
+ * BTreeTupleGetHeapTID() wrapper that enforces that a heap TID is present in
+ * cases where that is mandatory (i.e. for non-pivot tuples)
  */
 static inline ItemPointer
 BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
 							bool nonpivot)
 {
-	ItemPointer result = BTreeTupleGetHeapTID(itup);
-	BlockNumber targetblock = state->targetblock;
+	ItemPointer htid;
 
-	if (result == NULL && nonpivot)
+	/*
+	 * Caller determines whether this is supposed to be a pivot or non-pivot
+	 * tuple using page type and item offset number.  Verify that tuple
+	 * metadata agrees with this.
+	 */
+	Assert(state->heapkeyspace);
+	if (BTreeTupleIsPivot(itup) && nonpivot)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg_internal("block %u or its right sibling block or child block in index \"%s\" has unexpected pivot tuple",
+								 state->targetblock,
+								 RelationGetRelationName(state->rel))));
+
+	if (!BTreeTupleIsPivot(itup) && !nonpivot)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg_internal("block %u or its right sibling block or child block in index \"%s\" has unexpected non-pivot tuple",
+								 state->targetblock,
+								 RelationGetRelationName(state->rel))));
+
+	htid = BTreeTupleGetHeapTID(itup);
+	if (!ItemPointerIsValid(htid) && nonpivot)
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
-						targetblock, RelationGetRelationName(state->rel))));
+						state->targetblock,
+						RelationGetRelationName(state->rel))));
 
-	return result;
+	return htid;
+}
+
+/*
+ * Return the "pointed to" TID for itup, which is used to generate a
+ * descriptive error message.  itup must be a "data item" tuple (it wouldn't
+ * make much sense to call here with a high key tuple, since there won't be a
+ * valid downlink/block number to display).
+ *
+ * Returns either a heap TID (which will be the first heap TID in posting list
+ * if itup is posting list tuple), or a TID that contains downlink block
+ * number, plus some encoded metadata (e.g., the number of attributes present
+ * in itup).
+ */
+static inline ItemPointer
+BTreeTupleGetPointsToTID(IndexTuple itup)
+{
+	/*
+	 * Rely on the assumption that !heapkeyspace internal page data items will
+	 * correctly return TID with downlink here -- BTreeTupleGetHeapTID() won't
+	 * recognize it as a pivot tuple, but everything still works out because
+	 * the t_tid field is still returned
+	 */
+	if (!BTreeTupleIsPivot(itup))
+		return BTreeTupleGetHeapTID(itup);
+
+	/* Pivot tuple returns TID with downlink block (heapkeyspace variant) */
+	return &itup->t_tid;
 }
diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
index f9526ac19c..ea896ff847 100644
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@@ -546,11 +546,226 @@ equalimage(<replaceable>opcintype</replaceable> <type>oid</type>) returns bool
 <sect1 id="btree-implementation">
  <title>Implementation</title>
 
+ <para>
+  This section covers B-Tree index implementation details that may be
+  of use to advanced users.  See
+  <filename>src/backend/access/nbtree/README</filename> in the source
+  distribution for a much more detailed, internals-focused description
+  of the B-Tree implementation.
+ </para>
+ <sect2 id="btree-structure">
+  <title>Structure</title>
   <para>
-   An introduction to the btree index implementation can be found in
-   <filename>src/backend/access/nbtree/README</filename>.
+   <productname>PostgreSQL</productname> B-Tree indexes are
+   multi-level tree structures, where each level of the tree can be
+   used as a doubly-linked list of pages.  A single metapage is stored
+   in a fixed position at the start of the first segment file of the
+   index.  All other pages are either leaf pages or internal pages.
+   Leaf pages are the pages on the lowest level of the tree.  All
+   other levels consist of internal pages.  Each leaf page contains
+   tuples that point to table entries using a heap item pointer.  Each
+   internal page contains tuples that point to the next level down in
+   the tree.  Typically, over 99% of all pages are leaf pages.  Both
+   internal pages and leaf pages use the standard page format
+   described in <xref linkend="storage-page-layout"/>.
+  </para>
+  <para>
+   New pages are added to a B-Tree index when an existing page becomes
+   full.  A <firstterm>page split</firstterm> is performed, which
+   makes room for items that belong on the overflowing page by moving
+   a portion of the items to a new page.  Splits in leaf pages insert
+   a new tuple into the original page's parent page, which may cause
+   the parent page to split in turn.  Page splits <quote>cascade
+    upwards</quote> in a recursive fashion.  When the root page cannot
+   fit a new item, a <firstterm>root page split</firstterm> is
+   performed.  This adds a new level to the tree structure.
+  </para>
+ </sect2>
+
+ <sect2 id="btree-deduplication">
+  <title>Deduplication</title>
+  <para>
+   A duplicate is a tuple where <emphasis>all</emphasis> indexed key
+   columns have values that match corresponding column values from at
+   least one other tuple in the same index.  In practice, duplicate
+   tuples are quite common.  B-Tree has an optimization that stores
+   duplicates using a space efficient representation: deduplication.
+   Deduplication periodically replaces each contiguous group of
+   duplicate tuples with a single equivalent posting list tuple.  The
+   keys appear only once in this representation, followed by a sorted
+   array of heap item pointers.  This significantly reduces the
+   storage size of indexes where each value (or each distinct set of
+   column values) appears several times on average.  The latency of
+   queries can be reduced significantly.  Overall query throughput may
+   increase significantly.  The overhead of routine index vacuuming
+   may also be significantly reduced.
+  </para>
+  <note>
+   <para>
+    While NULL is generally not considered to be equal to any other
+    value, including NULL, NULL is nevertheless treated as just
+    another value from the domain of indexed values by the
+    B-Tree implementation.  B-Tree deduplication is just as effective
+    with <quote>duplicates</quote> that contain a NULL value.
+   </para>
+  </note>
+  <para>
+   The deduplication process occurs <quote>lazily</quote>, when a new
+   item is inserted that cannot fit on an existing leaf page.  This
+   prevents (or at least delays) leaf page splits.  Unlike GIN posting
+   list tuples, B-Tree posting list tuples do not need to expand every
+   time a new duplicate is inserted; they are merely an alternative
+   physical representation of the original logical contents found on
+   the page.  B-Tree indexes can store duplicates efficiently, without
+   adding overhead to read operators or to most individual write
+   operations.
+  </para>
+  <para>
+   Workloads that don't benefit from deduplication due to having no
+   duplicate values in indexes will incur a small, fixed performance
+   penalty with write heavy workloads (unless deduplication is
+   explicitly disabled).  The <literal>deduplicate_items</literal>
+   storage parameter can be used to disable deduplication within
+   individual indexes.  See <xref
+    linkend="sql-createindex-storage-parameters"/> from the
+   <command>CREATE INDEX</command> documentation for details.  There
+   is never any performance penalty with read-only workloads, since
+   reading from posting lists is at least as efficient as reading the
+   standard index tuple representation.
+  </para>
+  <para>
+   <productname>PostgreSQL</productname> uses <acronym>MVCC</acronym>
+   to maintain data consistency.  This impacts nearly every major
+   subsystem, including the B-Tree index access method.  B-Tree
+   indexes may contain multiple physical tuples for the same logical
+   table row, even in unique indexes.  Note that
+   <command>UPDATE</command> statements that avoid modifying most (but
+   not all) of the columns that are covered by indexes will generally
+   still need succesor index tuples that point to a new physical row
+   version <emphasis>for each and every index</emphasis>.  These
+   implementation-level duplicates are sometimes a significant source
+   of index bloat.
+  </para>
+  <para>
+   Deduplication tends to avoid page splits that are only needed due
+   to a short-term increase in <quote>duplicate</quote> tuples that
+   all point to different versions of the same logical table row.
+   <command>VACUUM</command> or autovacuum will eventually remove dead
+   versions of tuples from every index in any case, but
+   <command>VACUUM</command> usually cannot reverse page splits (in
+   general, a leaf page must be completely empty before
+   <command>VACUUM</command> can <quote>delete</quote> it).  In
+   effect, deduplication delays <quote>version driven</quote> page
+   splits, which may give VACUUM enough time to run and prevent the
+   splits entirely.  Unique indexes make use of deduplication for this
+   reason.  Also, even unique indexes can have a set of
+   <quote>duplicate</quote> rows that are all visible to a given
+   <acronym>MVCC</acronym> snapshot, provided at least one column has
+   a NULL value.  In general, the implementation considers tuples with
+   NULL values to be duplicates for the purposes of deduplication.
+  </para>
+  <para>
+   Unique indexes can only contain non-NULL duplicates because of
+   version churn.  The implementation applies a special heuristic when
+   considering whether to attempt deduplication in a unique index.
+   This heuristic virtually avoids the possibility of a performance
+   penalty in unique indexes.
+  </para>
+  <note>
+   <para>
+    Like all <productname>PostgreSQL</productname> index access
+    methods, B-Tree does not have direct access to visibility
+    information.  B-Tree deduplication does not distinguish duplicates
+    caused by <command>UPDATE</command> statements (that needed
+    successor versions) from duplicates that were created by
+    <command>INSERT</command> statements.
+   </para>
+  </note>
+  <para>
+   Typically, most B-Tree indexes can make use of deduplication
+   without any special configuration on the administrator's part.
+   Note, however, that deduplication cannot be used in all cases.
+   Deduplication is only deemed safe when <emphasis>all</emphasis>
+   indexed columns use an operator class that has an
+   <function>equalimage</function> function, and the function returns
+   <literal>true</literal>.  Deduplication safety is determined when
+   <command>CREATE INDEX</command> or <command>REINDEX</command> run.
+   Note that deduplication cannot be used in the following cases:
+  </para>
+  <para>
+   <itemizedlist>
+    <listitem>
+     <para>
+      <type>text</type>, <type>varchar</type>, <type>bpchar</type> and
+      <type>name</type> cannot use deduplication when the collation is
+      a <emphasis>nondeterministic</emphasis> collation.
+     </para>
+    </listitem>
+
+    <listitem>
+     <para>
+      <type>numeric</type> cannot use deduplication.  In general, a
+      pair of equal <type>numeric</type> datums may still have
+      different <quote>display scales</quote>.  These differences must
+      be preserved.
+     </para>
+    </listitem>
+
+    <listitem>
+     <para>
+      <type>jsonb</type> cannot use deduplication, since the
+      <type>jsonb</type> B-Tree operator class uses
+      <type>numeric</type> internally.
+     </para>
+    </listitem>
+
+    <listitem>
+     <para>
+      <type>float4</type>, <type>float8</type> and <type>money</type>
+      cannot use deduplication.  Each of these types has
+      representations for both <literal>-0</literal> and
+      <literal>0</literal> that are treated as equal.
+     </para>
+    </listitem>
+   </itemizedlist>
+  </para>
+  <para>
+   There are several implementation-level restriction that may be
+   lifted in a future version of
+   <productname>PostgreSQL</productname>:
+  </para>
+  <para>
+   <itemizedlist>
+    <listitem>
+     <para>
+      enum types cannot use deduplication.
+     </para>
+    </listitem>
+
+    <listitem>
+     <para>
+      Container types (such as composite types, arrays, or range
+      types) cannot use deduplication.
+     </para>
+    </listitem>
+   </itemizedlist>
+  </para>
+  <para>
+   There is a further implementation-level restriction that prevents
+   the use of deduplication, regardless of the types, operator
+   classes, or collations that are used by an index:
+  </para>
+  <para>
+   <itemizedlist>
+    <listitem>
+     <para>
+      <literal>INCLUDE</literal> indexes can never use deduplication.
+     </para>
+    </listitem>
+   </itemizedlist>
   </para>
 
+ </sect2>
 </sect1>
 
 </chapter>
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 057a6bb81a..20cdfabd7b 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -928,10 +928,11 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
      nondeterministic collations give a more <quote>correct</quote> behavior,
      especially when considering the full power of Unicode and its many
      special cases, they also have some drawbacks.  Foremost, their use leads
-     to a performance penalty.  Also, certain operations are not possible with
-     nondeterministic collations, such as pattern matching operations.
-     Therefore, they should be used only in cases where they are specifically
-     wanted.
+     to a performance penalty.  Note, in particular, that B-tree cannot use
+     deduplication with indexes that use a nondeterministic collation.  Also,
+     certain operations are not possible with nondeterministic collations,
+     such as pattern matching operations.  Therefore, they should be used
+     only in cases where they are specifically wanted.
     </para>
    </sect3>
   </sect2>
diff --git a/doc/src/sgml/citext.sgml b/doc/src/sgml/citext.sgml
index 667824fb0b..5986601327 100644
--- a/doc/src/sgml/citext.sgml
+++ b/doc/src/sgml/citext.sgml
@@ -233,9 +233,10 @@ SELECT * FROM users WHERE nick = 'Larry';
      <para>
        <type>citext</type> is not as efficient as <type>text</type> because the
        operator functions and the B-tree comparison functions must make copies
-       of the data and convert it to lower case for comparisons. It is,
-       however, slightly more efficient than using <function>lower</function> to get
-       case-insensitive matching.
+       of the data and convert it to lower case for comparisons.  Also, only
+       <type>text</type> can support B-Tree deduplication.  However,
+       <type>citext</type> is slightly more efficient than using
+       <function>lower</function> to get case-insensitive matching.
      </para>
     </listitem>
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index ceda48e0fc..28035f1635 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -16561,10 +16561,11 @@ AND
    rows.  Two rows might have a different binary representation even
    though comparisons of the two rows with the equality operator is true.
    The ordering of rows under these comparison operators is deterministic
-   but not otherwise meaningful.  These operators are used internally for
-   materialized views and might be useful for other specialized purposes
-   such as replication but are not intended to be generally useful for
-   writing queries.
+   but not otherwise meaningful.  These operators are used internally
+   for materialized views and might be useful for other specialized
+   purposes such as replication and B-Tree deduplication (see <xref
+   linkend="btree-deduplication"/>).  They are not intended to be
+   generally useful for writing queries, though.
   </para>
   </sect2>
  </sect1>
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index ab362a0dc5..a05e2e6b9c 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -171,6 +171,8 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
         maximum size allowed for the index type, data insertion will fail.
         In any case, non-key columns duplicate data from the index's table
         and bloat the size of the index, thus potentially slowing searches.
+        Furthermore, B-tree deduplication is never used with indexes
+        that have a non-key column.
        </para>
 
        <para>
@@ -393,10 +395,39 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    </variablelist>
 
    <para>
-    B-tree indexes additionally accept this parameter:
+    B-tree indexes also accept these parameters:
    </para>
 
    <variablelist>
+   <varlistentry id="index-reloption-deduplication" xreflabel="deduplicate_items">
+    <term><literal>deduplicate_items</literal>
+     <indexterm>
+      <primary><varname>deduplicate_items</varname></primary>
+      <secondary>storage parameter</secondary>
+     </indexterm>
+    </term>
+    <listitem>
+    <para>
+      Controls usage of the B-tree deduplication technique described
+      in <xref linkend="btree-deduplication"/>.  Set to
+      <literal>ON</literal> or <literal>OFF</literal> to enable or
+      disable the optimization.  (Alternative spellings of
+      <literal>ON</literal> and <literal>OFF</literal> are allowed as
+      described in <xref linkend="config-setting"/>.) The default is
+      <literal>ON</literal>.
+    </para>
+
+    <note>
+     <para>
+      Turning <literal>deduplicate_items</literal> off via
+      <command>ALTER INDEX</command> prevents future insertions from
+      triggering deduplication, but does not in itself make existing
+      posting list tuples use the standard tuple representation.
+     </para>
+    </note>
+    </listitem>
+   </varlistentry>
+
    <varlistentry id="index-reloption-vacuum-cleanup-index-scale-factor" xreflabel="vacuum_cleanup_index_scale_factor">
     <term><literal>vacuum_cleanup_index_scale_factor</literal>
      <indexterm>
@@ -451,9 +482,7 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
      This setting controls usage of the fast update technique described in
      <xref linkend="gin-fast-update"/>.  It is a Boolean parameter:
      <literal>ON</literal> enables fast update, <literal>OFF</literal> disables it.
-     (Alternative spellings of <literal>ON</literal> and <literal>OFF</literal> are
-     allowed as described in <xref linkend="config-setting"/>.)  The
-     default is <literal>ON</literal>.
+     The default is <literal>ON</literal>.
     </para>
 
     <note>
@@ -805,6 +834,13 @@ CREATE UNIQUE INDEX title_idx ON films (title) INCLUDE (director, rating);
 </programlisting>
   </para>
 
+  <para>
+   To create a B-Tree index with deduplication disabled:
+<programlisting>
+CREATE INDEX title_idx ON films (title) WITH (deduplicate_items = off);
+</programlisting>
+  </para>
+
   <para>
    To create an index on the expression <literal>lower(title)</literal>,
    allowing efficient case-insensitive searches:
diff --git a/src/test/regress/expected/btree_index.out b/src/test/regress/expected/btree_index.out
index f567117a46..1646deb092 100644
--- a/src/test/regress/expected/btree_index.out
+++ b/src/test/regress/expected/btree_index.out
@@ -200,7 +200,7 @@ reset enable_indexscan;
 reset enable_bitmapscan;
 -- Also check LIKE optimization with binary-compatible cases
 create temp table btree_bpchar (f1 text collate "C");
-create index on btree_bpchar(f1 bpchar_ops);
+create index on btree_bpchar(f1 bpchar_ops) WITH (deduplicate_items=on);
 insert into btree_bpchar values ('foo'), ('fool'), ('bar'), ('quux');
 -- doesn't match index:
 explain (costs off)
@@ -266,6 +266,24 @@ select * from btree_bpchar where f1::bpchar like 'foo%';
  fool
 (2 rows)
 
+-- get test coverage for "single value" deduplication strategy:
+insert into btree_bpchar select 'foo' from generate_series(1,1500);
+--
+-- Perform unique checking, with and without the use of deduplication
+--
+CREATE TABLE dedup_unique_test_table (a int) WITH (autovacuum_enabled=false);
+CREATE UNIQUE INDEX dedup_unique ON dedup_unique_test_table (a) WITH (deduplicate_items=on);
+CREATE UNIQUE INDEX plain_unique ON dedup_unique_test_table (a) WITH (deduplicate_items=off);
+-- Generate enough garbage tuples in index to ensure that even the unique index
+-- with deduplication enabled has to check multiple leaf pages during unique
+-- checking (at least with a BLCKSZ of 8192 or less)
+DO $$
+BEGIN
+    FOR r IN 1..1350 LOOP
+        DELETE FROM dedup_unique_test_table;
+        INSERT INTO dedup_unique_test_table SELECT 1;
+    END LOOP;
+END$$;
 --
 -- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
diff --git a/src/test/regress/sql/btree_index.sql b/src/test/regress/sql/btree_index.sql
index 558dcae0ec..6e14b935ce 100644
--- a/src/test/regress/sql/btree_index.sql
+++ b/src/test/regress/sql/btree_index.sql
@@ -86,7 +86,7 @@ reset enable_bitmapscan;
 -- Also check LIKE optimization with binary-compatible cases
 
 create temp table btree_bpchar (f1 text collate "C");
-create index on btree_bpchar(f1 bpchar_ops);
+create index on btree_bpchar(f1 bpchar_ops) WITH (deduplicate_items=on);
 insert into btree_bpchar values ('foo'), ('fool'), ('bar'), ('quux');
 -- doesn't match index:
 explain (costs off)
@@ -103,6 +103,26 @@ explain (costs off)
 select * from btree_bpchar where f1::bpchar like 'foo%';
 select * from btree_bpchar where f1::bpchar like 'foo%';
 
+-- get test coverage for "single value" deduplication strategy:
+insert into btree_bpchar select 'foo' from generate_series(1,1500);
+
+--
+-- Perform unique checking, with and without the use of deduplication
+--
+CREATE TABLE dedup_unique_test_table (a int) WITH (autovacuum_enabled=false);
+CREATE UNIQUE INDEX dedup_unique ON dedup_unique_test_table (a) WITH (deduplicate_items=on);
+CREATE UNIQUE INDEX plain_unique ON dedup_unique_test_table (a) WITH (deduplicate_items=off);
+-- Generate enough garbage tuples in index to ensure that even the unique index
+-- with deduplication enabled has to check multiple leaf pages during unique
+-- checking (at least with a BLCKSZ of 8192 or less)
+DO $$
+BEGIN
+    FOR r IN 1..1350 LOOP
+        DELETE FROM dedup_unique_test_table;
+        INSERT INTO dedup_unique_test_table SELECT 1;
+    END LOOP;
+END$$;
+
 --
 -- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
-- 
2.17.1

v34-0001-Add-equalimage-B-Tree-support-functions.patchapplication/octet-stream; name=v34-0001-Add-equalimage-B-Tree-support-functions.patchDownload

From dbbf43cf2747446eea8a23b03c05badd5da5f43e Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 16 Feb 2020 01:16:02 -0800
Subject: [PATCH v34 1/4] Add equalimage B-Tree support functions.

Invent the concept of a B-Tree equalimage ("equality is image equality")
support function, registered as support function 4.  This indicates
whether it is safe (or not safe) to apply optimizations that assume that
any two datums considered equal by an operator class's order method must
also be equivalent in every way.  This is static information about an
operator class and its underlying type (plus a collation when dealing
with a collatable type).

Register an equalimage routine for almost all of the existing B-Tree
opclasses.  We only need two trivial routines for all of the opclasses
that are included with the core distribution.  There is one routine for
opclasses that index non-collatable types (which returns 'true'
unconditionally), plus another routine for collatable types (which
returns 'true' when the collation is a deterministic collation).

This patch is infrastructure for an upcoming patch that adds B-Tree
deduplication.  It is anticipated that this infrastructure will
eventually be used within the planner.

Author: Peter Geoghegan, Anastasia Lubennikova
Discussion: https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
---
 src/include/access/nbtree.h                 | 23 ++++--
 src/include/catalog/pg_amproc.dat           | 56 ++++++++++++++
 src/include/catalog/pg_proc.dat             |  6 ++
 src/backend/access/nbtree/nbtutils.c        | 74 ++++++++++++++++++
 src/backend/access/nbtree/nbtvalidate.c     |  8 +-
 src/backend/commands/opclasscmds.c          | 30 +++++++-
 src/backend/utils/adt/datum.c               | 26 +++++++
 src/backend/utils/adt/varlena.c             | 20 +++++
 src/bin/pg_dump/t/002_pg_dump.pl            | 12 ++-
 doc/src/sgml/btree.sgml                     | 85 ++++++++++++++++++++-
 doc/src/sgml/ref/alter_opfamily.sgml        |  7 +-
 doc/src/sgml/ref/create_opclass.sgml        | 14 ++--
 doc/src/sgml/xindex.sgml                    | 19 ++++-
 src/test/regress/expected/alter_generic.out |  8 +-
 src/test/regress/expected/opr_sanity.out    | 38 +++++++++
 src/test/regress/sql/alter_generic.sql      |  3 +
 src/test/regress/sql/opr_sanity.sql         | 18 +++++
 17 files changed, 416 insertions(+), 31 deletions(-)

diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 20ace69dab..d520066914 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -380,19 +380,29 @@ typedef struct BTMetaPageData
  *	must return < 0, 0, > 0, respectively, in these three cases.
  *
  *	To facilitate accelerated sorting, an operator class may choose to
- *	offer a second procedure (BTSORTSUPPORT_PROC).  For full details, see
- *	src/include/utils/sortsupport.h.
+ *	offer a sortsupport amproc procedure (BTSORTSUPPORT_PROC).  For full
+ *	details, see src/include/utils/sortsupport.h.
  *
  *	To support window frames defined by "RANGE offset PRECEDING/FOLLOWING",
- *	an operator class may choose to offer a third amproc procedure
- *	(BTINRANGE_PROC), independently of whether it offers sortsupport.
- *	For full details, see doc/src/sgml/btree.sgml.
+ *	an operator class may choose to offer an in_range amproc procedure
+ *	(BTINRANGE_PROC).  For full details, see doc/src/sgml/btree.sgml.
+ *
+ *	To support B-Tree deduplication (and possibly other optimizations), an
+ *	operator class may choose to offer an "equality is image equality" proc
+ *	(BTEQUALIMAGE_PROC).  When the procedure returns true, core code can
+ *	assume that any two opclass-equal datums must also be equivalent in
+ *	every way.  When the procedure returns false (or when there is no
+ *	procedure for an opclass), deduplication cannot proceed because equal
+ *	index tuples might be visibly different (e.g. btree/numeric_ops indexes
+ *	can't support deduplication because "5" is equal to but distinct from
+ *	"5.00").  For full details, see doc/src/sgml/btree.sgml.
  */
 
 #define BTORDER_PROC		1
 #define BTSORTSUPPORT_PROC	2
 #define BTINRANGE_PROC		3
-#define BTNProcs			3
+#define BTEQUALIMAGE_PROC	4
+#define BTNProcs			4
 
 /*
  *	We need to be able to tell the difference between read and write
@@ -829,6 +839,7 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern bool _bt_allequalimage(Relation rel, bool debugmessage);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/catalog/pg_amproc.dat b/src/include/catalog/pg_amproc.dat
index c67768fcab..59413507f9 100644
--- a/src/include/catalog/pg_amproc.dat
+++ b/src/include/catalog/pg_amproc.dat
@@ -17,23 +17,36 @@
   amprocrighttype => 'anyarray', amprocnum => '1', amproc => 'btarraycmp' },
 { amprocfamily => 'btree/bit_ops', amproclefttype => 'bit',
   amprocrighttype => 'bit', amprocnum => '1', amproc => 'bitcmp' },
+{ amprocfamily => 'btree/bit_ops', amproclefttype => 'bit',
+  amprocrighttype => 'bit', amprocnum => '4', amproc => 'btequalimage' },
 { amprocfamily => 'btree/bool_ops', amproclefttype => 'bool',
   amprocrighttype => 'bool', amprocnum => '1', amproc => 'btboolcmp' },
+{ amprocfamily => 'btree/bool_ops', amproclefttype => 'bool',
+  amprocrighttype => 'bool', amprocnum => '4', amproc => 'btequalimage' },
 { amprocfamily => 'btree/bpchar_ops', amproclefttype => 'bpchar',
   amprocrighttype => 'bpchar', amprocnum => '1', amproc => 'bpcharcmp' },
 { amprocfamily => 'btree/bpchar_ops', amproclefttype => 'bpchar',
   amprocrighttype => 'bpchar', amprocnum => '2',
   amproc => 'bpchar_sortsupport' },
+{ amprocfamily => 'btree/bpchar_ops', amproclefttype => 'bpchar',
+  amprocrighttype => 'bpchar', amprocnum => '4',
+  amproc => 'btvarstrequalimage' },
 { amprocfamily => 'btree/bytea_ops', amproclefttype => 'bytea',
   amprocrighttype => 'bytea', amprocnum => '1', amproc => 'byteacmp' },
 { amprocfamily => 'btree/bytea_ops', amproclefttype => 'bytea',
   amprocrighttype => 'bytea', amprocnum => '2', amproc => 'bytea_sortsupport' },
+{ amprocfamily => 'btree/bytea_ops', amproclefttype => 'bytea',
+  amprocrighttype => 'bytea', amprocnum => '4', amproc => 'btequalimage' },
 { amprocfamily => 'btree/char_ops', amproclefttype => 'char',
   amprocrighttype => 'char', amprocnum => '1', amproc => 'btcharcmp' },
+{ amprocfamily => 'btree/char_ops', amproclefttype => 'char',
+  amprocrighttype => 'char', amprocnum => '4', amproc => 'btequalimage' },
 { amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
   amprocrighttype => 'date', amprocnum => '1', amproc => 'date_cmp' },
 { amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
   amprocrighttype => 'date', amprocnum => '2', amproc => 'date_sortsupport' },
+{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
+  amprocrighttype => 'date', amprocnum => '4', amproc => 'btequalimage' },
 { amprocfamily => 'btree/datetime_ops', amproclefttype => 'date',
   amprocrighttype => 'timestamp', amprocnum => '1',
   amproc => 'date_cmp_timestamp' },
@@ -45,6 +58,8 @@
 { amprocfamily => 'btree/datetime_ops', amproclefttype => 'timestamp',
   amprocrighttype => 'timestamp', amprocnum => '2',
   amproc => 'timestamp_sortsupport' },
+{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'timestamp',
+  amprocrighttype => 'timestamp', amprocnum => '4', amproc => 'btequalimage' },
 { amprocfamily => 'btree/datetime_ops', amproclefttype => 'timestamp',
   amprocrighttype => 'date', amprocnum => '1', amproc => 'timestamp_cmp_date' },
 { amprocfamily => 'btree/datetime_ops', amproclefttype => 'timestamp',
@@ -56,6 +71,9 @@
 { amprocfamily => 'btree/datetime_ops', amproclefttype => 'timestamptz',
   amprocrighttype => 'timestamptz', amprocnum => '2',
   amproc => 'timestamp_sortsupport' },
+{ amprocfamily => 'btree/datetime_ops', amproclefttype => 'timestamptz',
+  amprocrighttype => 'timestamptz', amprocnum => '4',
+  amproc => 'btequalimage' },
 { amprocfamily => 'btree/datetime_ops', amproclefttype => 'timestamptz',
   amprocrighttype => 'date', amprocnum => '1',
   amproc => 'timestamptz_cmp_date' },
@@ -96,10 +114,14 @@
 { amprocfamily => 'btree/network_ops', amproclefttype => 'inet',
   amprocrighttype => 'inet', amprocnum => '2',
   amproc => 'network_sortsupport' },
+{ amprocfamily => 'btree/network_ops', amproclefttype => 'inet',
+  amprocrighttype => 'inet', amprocnum => '4', amproc => 'btequalimage' },
 { amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
   amprocrighttype => 'int2', amprocnum => '1', amproc => 'btint2cmp' },
 { amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
   amprocrighttype => 'int2', amprocnum => '2', amproc => 'btint2sortsupport' },
+{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
+  amprocrighttype => 'int2', amprocnum => '4', amproc => 'btequalimage' },
 { amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
   amprocrighttype => 'int4', amprocnum => '1', amproc => 'btint24cmp' },
 { amprocfamily => 'btree/integer_ops', amproclefttype => 'int2',
@@ -117,6 +139,8 @@
   amprocrighttype => 'int4', amprocnum => '1', amproc => 'btint4cmp' },
 { amprocfamily => 'btree/integer_ops', amproclefttype => 'int4',
   amprocrighttype => 'int4', amprocnum => '2', amproc => 'btint4sortsupport' },
+{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int4',
+  amprocrighttype => 'int4', amprocnum => '4', amproc => 'btequalimage' },
 { amprocfamily => 'btree/integer_ops', amproclefttype => 'int4',
   amprocrighttype => 'int8', amprocnum => '1', amproc => 'btint48cmp' },
 { amprocfamily => 'btree/integer_ops', amproclefttype => 'int4',
@@ -134,6 +158,8 @@
   amprocrighttype => 'int8', amprocnum => '1', amproc => 'btint8cmp' },
 { amprocfamily => 'btree/integer_ops', amproclefttype => 'int8',
   amprocrighttype => 'int8', amprocnum => '2', amproc => 'btint8sortsupport' },
+{ amprocfamily => 'btree/integer_ops', amproclefttype => 'int8',
+  amprocrighttype => 'int8', amprocnum => '4', amproc => 'btequalimage' },
 { amprocfamily => 'btree/integer_ops', amproclefttype => 'int8',
   amprocrighttype => 'int4', amprocnum => '1', amproc => 'btint84cmp' },
 { amprocfamily => 'btree/integer_ops', amproclefttype => 'int8',
@@ -146,11 +172,15 @@
 { amprocfamily => 'btree/interval_ops', amproclefttype => 'interval',
   amprocrighttype => 'interval', amprocnum => '3',
   amproc => 'in_range(interval,interval,interval,bool,bool)' },
+{ amprocfamily => 'btree/interval_ops', amproclefttype => 'interval',
+  amprocrighttype => 'interval', amprocnum => '4', amproc => 'btequalimage' },
 { amprocfamily => 'btree/macaddr_ops', amproclefttype => 'macaddr',
   amprocrighttype => 'macaddr', amprocnum => '1', amproc => 'macaddr_cmp' },
 { amprocfamily => 'btree/macaddr_ops', amproclefttype => 'macaddr',
   amprocrighttype => 'macaddr', amprocnum => '2',
   amproc => 'macaddr_sortsupport' },
+{ amprocfamily => 'btree/macaddr_ops', amproclefttype => 'macaddr',
+  amprocrighttype => 'macaddr', amprocnum => '4', amproc => 'btequalimage' },
 { amprocfamily => 'btree/numeric_ops', amproclefttype => 'numeric',
   amprocrighttype => 'numeric', amprocnum => '1', amproc => 'numeric_cmp' },
 { amprocfamily => 'btree/numeric_ops', amproclefttype => 'numeric',
@@ -163,60 +193,86 @@
   amprocrighttype => 'oid', amprocnum => '1', amproc => 'btoidcmp' },
 { amprocfamily => 'btree/oid_ops', amproclefttype => 'oid',
   amprocrighttype => 'oid', amprocnum => '2', amproc => 'btoidsortsupport' },
+{ amprocfamily => 'btree/oid_ops', amproclefttype => 'oid',
+  amprocrighttype => 'oid', amprocnum => '4', amproc => 'btequalimage' },
 { amprocfamily => 'btree/oidvector_ops', amproclefttype => 'oidvector',
   amprocrighttype => 'oidvector', amprocnum => '1',
   amproc => 'btoidvectorcmp' },
+{ amprocfamily => 'btree/oidvector_ops', amproclefttype => 'oidvector',
+  amprocrighttype => 'oidvector', amprocnum => '4', amproc => 'btequalimage' },
 { amprocfamily => 'btree/text_ops', amproclefttype => 'text',
   amprocrighttype => 'text', amprocnum => '1', amproc => 'bttextcmp' },
 { amprocfamily => 'btree/text_ops', amproclefttype => 'text',
   amprocrighttype => 'text', amprocnum => '2', amproc => 'bttextsortsupport' },
+{ amprocfamily => 'btree/text_ops', amproclefttype => 'text',
+  amprocrighttype => 'text', amprocnum => '4', amproc => 'btvarstrequalimage' },
 { amprocfamily => 'btree/text_ops', amproclefttype => 'name',
   amprocrighttype => 'name', amprocnum => '1', amproc => 'btnamecmp' },
 { amprocfamily => 'btree/text_ops', amproclefttype => 'name',
   amprocrighttype => 'name', amprocnum => '2', amproc => 'btnamesortsupport' },
+{ amprocfamily => 'btree/text_ops', amproclefttype => 'name',
+  amprocrighttype => 'name', amprocnum => '4', amproc => 'btvarstrequalimage' },
 { amprocfamily => 'btree/text_ops', amproclefttype => 'name',
   amprocrighttype => 'text', amprocnum => '1', amproc => 'btnametextcmp' },
 { amprocfamily => 'btree/text_ops', amproclefttype => 'text',
   amprocrighttype => 'name', amprocnum => '1', amproc => 'bttextnamecmp' },
 { amprocfamily => 'btree/time_ops', amproclefttype => 'time',
   amprocrighttype => 'time', amprocnum => '1', amproc => 'time_cmp' },
+{ amprocfamily => 'btree/time_ops', amproclefttype => 'time',
+  amprocrighttype => 'time', amprocnum => '4', amproc => 'btequalimage' },
 { amprocfamily => 'btree/time_ops', amproclefttype => 'time',
   amprocrighttype => 'interval', amprocnum => '3',
   amproc => 'in_range(time,time,interval,bool,bool)' },
 { amprocfamily => 'btree/timetz_ops', amproclefttype => 'timetz',
   amprocrighttype => 'timetz', amprocnum => '1', amproc => 'timetz_cmp' },
+{ amprocfamily => 'btree/timetz_ops', amproclefttype => 'timetz',
+  amprocrighttype => 'timetz', amprocnum => '4', amproc => 'btequalimage' },
 { amprocfamily => 'btree/timetz_ops', amproclefttype => 'timetz',
   amprocrighttype => 'interval', amprocnum => '3',
   amproc => 'in_range(timetz,timetz,interval,bool,bool)' },
 { amprocfamily => 'btree/varbit_ops', amproclefttype => 'varbit',
   amprocrighttype => 'varbit', amprocnum => '1', amproc => 'varbitcmp' },
+{ amprocfamily => 'btree/varbit_ops', amproclefttype => 'varbit',
+  amprocrighttype => 'varbit', amprocnum => '4', amproc => 'btequalimage' },
 { amprocfamily => 'btree/text_pattern_ops', amproclefttype => 'text',
   amprocrighttype => 'text', amprocnum => '1', amproc => 'bttext_pattern_cmp' },
 { amprocfamily => 'btree/text_pattern_ops', amproclefttype => 'text',
   amprocrighttype => 'text', amprocnum => '2',
   amproc => 'bttext_pattern_sortsupport' },
+{ amprocfamily => 'btree/text_pattern_ops', amproclefttype => 'text',
+  amprocrighttype => 'text', amprocnum => '4', amproc => 'btequalimage' },
 { amprocfamily => 'btree/bpchar_pattern_ops', amproclefttype => 'bpchar',
   amprocrighttype => 'bpchar', amprocnum => '1',
   amproc => 'btbpchar_pattern_cmp' },
 { amprocfamily => 'btree/bpchar_pattern_ops', amproclefttype => 'bpchar',
   amprocrighttype => 'bpchar', amprocnum => '2',
   amproc => 'btbpchar_pattern_sortsupport' },
+{ amprocfamily => 'btree/bpchar_pattern_ops', amproclefttype => 'bpchar',
+  amprocrighttype => 'bpchar', amprocnum => '4', amproc => 'btequalimage' },
 { amprocfamily => 'btree/money_ops', amproclefttype => 'money',
   amprocrighttype => 'money', amprocnum => '1', amproc => 'cash_cmp' },
 { amprocfamily => 'btree/tid_ops', amproclefttype => 'tid',
   amprocrighttype => 'tid', amprocnum => '1', amproc => 'bttidcmp' },
+{ amprocfamily => 'btree/tid_ops', amproclefttype => 'tid',
+  amprocrighttype => 'tid', amprocnum => '4', amproc => 'btequalimage' },
 { amprocfamily => 'btree/uuid_ops', amproclefttype => 'uuid',
   amprocrighttype => 'uuid', amprocnum => '1', amproc => 'uuid_cmp' },
 { amprocfamily => 'btree/uuid_ops', amproclefttype => 'uuid',
   amprocrighttype => 'uuid', amprocnum => '2', amproc => 'uuid_sortsupport' },
+{ amprocfamily => 'btree/uuid_ops', amproclefttype => 'uuid',
+  amprocrighttype => 'uuid', amprocnum => '4', amproc => 'btequalimage' },
 { amprocfamily => 'btree/record_ops', amproclefttype => 'record',
   amprocrighttype => 'record', amprocnum => '1', amproc => 'btrecordcmp' },
 { amprocfamily => 'btree/record_image_ops', amproclefttype => 'record',
   amprocrighttype => 'record', amprocnum => '1', amproc => 'btrecordimagecmp' },
 { amprocfamily => 'btree/pg_lsn_ops', amproclefttype => 'pg_lsn',
   amprocrighttype => 'pg_lsn', amprocnum => '1', amproc => 'pg_lsn_cmp' },
+{ amprocfamily => 'btree/pg_lsn_ops', amproclefttype => 'pg_lsn',
+  amprocrighttype => 'pg_lsn', amprocnum => '4', amproc => 'btequalimage' },
 { amprocfamily => 'btree/macaddr8_ops', amproclefttype => 'macaddr8',
   amprocrighttype => 'macaddr8', amprocnum => '1', amproc => 'macaddr8_cmp' },
+{ amprocfamily => 'btree/macaddr8_ops', amproclefttype => 'macaddr8',
+  amprocrighttype => 'macaddr8', amprocnum => '4', amproc => 'btequalimage' },
 { amprocfamily => 'btree/enum_ops', amproclefttype => 'anyenum',
   amprocrighttype => 'anyenum', amprocnum => '1', amproc => 'enum_cmp' },
 { amprocfamily => 'btree/tsvector_ops', amproclefttype => 'tsvector',
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index eb3c1a88d1..07a86c7b7b 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -1013,6 +1013,9 @@
 { oid => '3255', descr => 'sort support',
   proname => 'bttextsortsupport', prorettype => 'void',
   proargtypes => 'internal', prosrc => 'bttextsortsupport' },
+{ oid => '8505', descr => 'equal image',
+  proname => 'btvarstrequalimage', prorettype => 'bool', proargtypes => 'oid',
+  prosrc => 'btvarstrequalimage' },
 { oid => '377', descr => 'less-equal-greater',
   proname => 'cash_cmp', proleakproof => 't', prorettype => 'int4',
   proargtypes => 'money money', prosrc => 'cash_cmp' },
@@ -9483,6 +9486,9 @@
 { oid => '3187', descr => 'less-equal-greater based on byte images',
   proname => 'btrecordimagecmp', prorettype => 'int4',
   proargtypes => 'record record', prosrc => 'btrecordimagecmp' },
+{ oid => '8506', descr => 'equal image',
+  proname => 'btequalimage', prorettype => 'bool', proargtypes => 'oid',
+  prosrc => 'btequalimage' },
 
 # Extensions
 { oid => '3082', descr => 'list available extensions',
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 5ab4e712f1..74d1f5dd1e 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -20,6 +20,7 @@
 #include "access/nbtree.h"
 #include "access/reloptions.h"
 #include "access/relscan.h"
+#include "catalog/catalog.h"
 #include "commands/progress.h"
 #include "lib/qunique.h"
 #include "miscadmin.h"
@@ -2566,3 +2567,76 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Are all attributes in rel "equality is image equality" attributes?
+ *
+ * We use each attribute's BTEQUALIMAGE_PROC opclass procedure.  If any
+ * opclass either lacks a BTEQUALIMAGE_PROC procedure or returns false, we
+ * return false; otherwise we return true.
+ *
+ * Returned boolean value is stored in index metapage during index builds.
+ * Deduplication can only be used when we return true.
+ */
+bool
+_bt_allequalimage(Relation rel, bool debugmessage)
+{
+	bool		allequalimage = true;
+
+	/* INCLUDE indexes don't support deduplication */
+	if (IndexRelationGetNumberOfAttributes(rel) !=
+		IndexRelationGetNumberOfKeyAttributes(rel))
+		return false;
+
+	/*
+	 * There is no special reason why deduplication cannot work with system
+	 * relations (i.e. with system catalog indexes and TOAST indexes).  We
+	 * deem deduplication unsafe for these indexes all the same, since the
+	 * alternative is to force users to always use deduplication, without
+	 * being able to opt out.  (ALTER INDEX is not supported with system
+	 * indexes, so users would have no way to set the deduplicate_items
+	 * storage parameter to 'off'.)
+	 */
+	if (IsSystemRelation(rel))
+		return false;
+
+	for (int i = 0; i < IndexRelationGetNumberOfKeyAttributes(rel); i++)
+	{
+		Oid			opfamily = rel->rd_opfamily[i];
+		Oid			opcintype = rel->rd_opcintype[i];
+		Oid			collation = rel->rd_indcollation[i];
+		Oid			equalimageproc;
+
+		equalimageproc = get_opfamily_proc(opfamily, opcintype, opcintype,
+										   BTEQUALIMAGE_PROC);
+
+		/*
+		 * If there is no BTEQUALIMAGE_PROC then deduplication is assumed to
+		 * be unsafe.  Otherwise, actually call proc and see what it says.
+		 */
+		if (!OidIsValid(equalimageproc) ||
+			!DatumGetBool(OidFunctionCall1Coll(equalimageproc, collation,
+											   ObjectIdGetDatum(opcintype))))
+		{
+			allequalimage = false;
+			break;
+		}
+	}
+
+	/*
+	 * Don't ereport() until here to avoid reporting on a system relation
+	 * index or an INCLUDE index
+	 */
+	if (debugmessage)
+	{
+		if (allequalimage)
+			elog(DEBUG1, "index \"%s\" can safely use deduplication",
+				 RelationGetRelationName(rel));
+		else
+			ereport(NOTICE,
+					(errmsg("index \"%s\" cannot use deduplication",
+							RelationGetRelationName(rel))));
+	}
+
+	return allequalimage;
+}
diff --git a/src/backend/access/nbtree/nbtvalidate.c b/src/backend/access/nbtree/nbtvalidate.c
index ff634b1649..627f74407a 100644
--- a/src/backend/access/nbtree/nbtvalidate.c
+++ b/src/backend/access/nbtree/nbtvalidate.c
@@ -104,6 +104,10 @@ btvalidate(Oid opclassoid)
 											procform->amprocrighttype,
 											BOOLOID, BOOLOID);
 				break;
+			case BTEQUALIMAGE_PROC:
+				ok = check_amproc_signature(procform->amproc, BOOLOID, true,
+											1, 1, OIDOID);
+				break;
 			default:
 				ereport(INFO,
 						(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
@@ -211,8 +215,8 @@ btvalidate(Oid opclassoid)
 
 		/*
 		 * Complain if there seems to be an incomplete set of either operators
-		 * or support functions for this datatype pair.  The only things
-		 * considered optional are the sortsupport and in_range functions.
+		 * or support functions for this datatype pair.  The sortsupport,
+		 * in_range, and equalimage functions are considered optional.
 		 */
 		if (thisgroup->operatorset !=
 			((1 << BTLessStrategyNumber) |
diff --git a/src/backend/commands/opclasscmds.c b/src/backend/commands/opclasscmds.c
index e2c6de457c..743511bdf2 100644
--- a/src/backend/commands/opclasscmds.c
+++ b/src/backend/commands/opclasscmds.c
@@ -1143,9 +1143,10 @@ assignProcTypes(OpFamilyMember *member, Oid amoid, Oid typeoid)
 	/*
 	 * btree comparison procs must be 2-arg procs returning int4.  btree
 	 * sortsupport procs must take internal and return void.  btree in_range
-	 * procs must be 5-arg procs returning bool.  hash support proc 1 must be
-	 * a 1-arg proc returning int4, while proc 2 must be a 2-arg proc
-	 * returning int8.  Otherwise we don't know.
+	 * procs must be 5-arg procs returning bool.  btree equalimage procs must
+	 * take 1 arg and return bool.  hash support proc 1 must be a 1-arg proc
+	 * returning int4, while proc 2 must be a 2-arg proc returning int8.
+	 * Otherwise we don't know.
 	 */
 	if (amoid == BTREE_AM_OID)
 	{
@@ -1205,6 +1206,29 @@ assignProcTypes(OpFamilyMember *member, Oid amoid, Oid typeoid)
 			if (!OidIsValid(member->righttype))
 				member->righttype = procform->proargtypes.values[2];
 		}
+		else if (member->number == BTEQUALIMAGE_PROC)
+		{
+			if (procform->pronargs != 1)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+						 errmsg("btree equal image functions must have one argument")));
+			if (procform->prorettype != BOOLOID)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+						 errmsg("btree equal image functions must return boolean")));
+			/*
+			 * pg_amproc functions are indexed by (lefttype, righttype), but
+			 * an equalimage function can only be called at CREATE INDEX time.
+			 * The same opclass opcintype OID is always used for leftype and
+			 * righttype.  Providing a cross-type routine isn't sensible.
+			 * Reject cross-type ALTER OPERATOR FAMILY ...  ADD FUNCTION 4
+			 * statements here.
+			 */
+			if (member->lefttype != member->righttype)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+						 errmsg("btree equal image functions must not be cross-type")));
+		}
 	}
 	else if (amoid == HASH_AM_OID)
 	{
diff --git a/src/backend/utils/adt/datum.c b/src/backend/utils/adt/datum.c
index 4e81947352..34cdde1bb9 100644
--- a/src/backend/utils/adt/datum.c
+++ b/src/backend/utils/adt/datum.c
@@ -44,6 +44,7 @@
 
 #include "access/detoast.h"
 #include "fmgr.h"
+#include "utils/builtins.h"
 #include "utils/datum.h"
 #include "utils/expandeddatum.h"
 
@@ -323,6 +324,31 @@ datum_image_eq(Datum value1, Datum value2, bool typByVal, int typLen)
 	return result;
 }
 
+/*-------------------------------------------------------------------------
+ * btequalimage
+ *
+ * Generic "equalimage" support function.
+ *
+ * B-Tree operator classes whose equality function could safely be replaced by
+ * datum_image_eq() in all cases can use this as their "equalimage" support
+ * function.
+ *
+ * Currently, we unconditionally assume that any B-Tree operator class that
+ * registers btequalimage as its support function 4 must be able to safely use
+ * optimizations like deduplication (i.e. we return true unconditionally).  If
+ * it ever proved necessary to rescind support for an operator class, we could
+ * do that in a targeted fashion by doing something with the opcintype
+ * argument.
+ *-------------------------------------------------------------------------
+ */
+Datum
+btequalimage(PG_FUNCTION_ARGS)
+{
+	/* Oid		opcintype = PG_GETARG_OID(0); */
+
+	PG_RETURN_BOOL(true);
+}
+
 /*-------------------------------------------------------------------------
  * datumEstimateSpace
  *
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 1b351cbc68..875b02d643 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -2783,6 +2783,26 @@ varstr_abbrev_abort(int memtupcount, SortSupport ssup)
 	return true;
 }
 
+/*
+ * Generic equalimage support function for character type's operator classes.
+ * Disables the use of deduplication with nondeterministic collations.
+ */
+Datum
+btvarstrequalimage(PG_FUNCTION_ARGS)
+{
+	/* Oid		opcintype = PG_GETARG_OID(0); */
+	Oid			collid = PG_GET_COLLATION();
+
+	check_collation_set(collid);
+
+	if (lc_collate_is_c(collid) ||
+		collid == DEFAULT_COLLATION_OID ||
+		get_collation_isdeterministic(collid))
+		PG_RETURN_BOOL(true);
+	else
+		PG_RETURN_BOOL(false);
+}
+
 Datum
 text_larger(PG_FUNCTION_ARGS)
 {
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index 4a9764c2d2..1b90cbd9b5 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -522,7 +522,8 @@ my %tests = (
 						 OPERATOR 4 >=(bigint,int4),
 						 OPERATOR 5 >(bigint,int4),
 						 FUNCTION 1 (int4, int4) btint4cmp(int4,int4),
-						 FUNCTION 2 (int4, int4) btint4sortsupport(internal);',
+						 FUNCTION 2 (int4, int4) btint4sortsupport(internal),
+						 FUNCTION 4 (int4, int4) btequalimage(oid);',
 		regexp => qr/^
 			\QALTER OPERATOR FAMILY dump_test.op_family USING btree ADD\E\n\s+
 			\QOPERATOR 1 <(bigint,integer) ,\E\n\s+
@@ -531,7 +532,8 @@ my %tests = (
 			\QOPERATOR 4 >=(bigint,integer) ,\E\n\s+
 			\QOPERATOR 5 >(bigint,integer) ,\E\n\s+
 			\QFUNCTION 1 (integer, integer) btint4cmp(integer,integer) ,\E\n\s+
-			\QFUNCTION 2 (integer, integer) btint4sortsupport(internal);\E
+			\QFUNCTION 2 (integer, integer) btint4sortsupport(internal) ,\E\n\s+
+			\QFUNCTION 4 (integer, integer) btequalimage(oid);\E
 			/xm,
 		like =>
 		  { %full_runs, %dump_test_schema_runs, section_pre_data => 1, },
@@ -1554,7 +1556,8 @@ my %tests = (
 						 OPERATOR 4 >=(bigint,bigint),
 						 OPERATOR 5 >(bigint,bigint),
 						 FUNCTION 1 btint8cmp(bigint,bigint),
-						 FUNCTION 2 btint8sortsupport(internal);',
+						 FUNCTION 2 btint8sortsupport(internal),
+						 FUNCTION 4 btequalimage(oid);',
 		regexp => qr/^
 			\QCREATE OPERATOR CLASS dump_test.op_class\E\n\s+
 			\QFOR TYPE bigint USING btree FAMILY dump_test.op_family AS\E\n\s+
@@ -1564,7 +1567,8 @@ my %tests = (
 			\QOPERATOR 4 >=(bigint,bigint) ,\E\n\s+
 			\QOPERATOR 5 >(bigint,bigint) ,\E\n\s+
 			\QFUNCTION 1 (bigint, bigint) btint8cmp(bigint,bigint) ,\E\n\s+
-			\QFUNCTION 2 (bigint, bigint) btint8sortsupport(internal);\E
+			\QFUNCTION 2 (bigint, bigint) btint8sortsupport(internal) ,\E\n\s+
+			\QFUNCTION 4 (bigint, bigint) btequalimage(oid);\E
 			/xm,
 		like =>
 		  { %full_runs, %dump_test_schema_runs, section_pre_data => 1, },
diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
index ac6c4423e6..f9526ac19c 100644
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@@ -207,7 +207,7 @@
 
  <para>
   As shown in <xref linkend="xindex-btree-support-table"/>, btree defines
-  one required and two optional support functions.  The three
+  one required and three optional support functions.  The four
   user-defined methods are:
  </para>
  <variablelist>
@@ -456,6 +456,89 @@ returns bool
     </para>
    </listitem>
   </varlistentry>
+  <varlistentry>
+   <term><function>equalimage</function></term>
+   <listitem>
+    <para>
+     Optionally, a btree operator family may provide
+     <function>equalimage</function> (<quote>equality is image
+      equality</quote>) support functions, registered under support
+     function number 4.  These functions allow the implementation to
+     determine when it is safe to apply the btree deduplication
+     optimization.  Currently, <function>equalimage</function>
+     functions are only called when building or rebuilding an index.
+    </para>
+    <para>
+     An <function>equalimage</function> function must have the
+     signature
+<synopsis>
+equalimage(<replaceable>opcintype</replaceable> <type>oid</type>) returns bool
+</synopsis>
+     The return value is static information about an operator class
+     and collation.  Returning <literal>true</literal> indicates that
+     any two non-null values that could possibly be used as arguments
+     to the <function>order</function>/comparison function must be
+     equivalent in every way if and only if the return value is
+     <literal>0</literal>.  This condition implies that deduplication
+     is safe, provided that there is no other factor that makes
+     deduplication unsafe.  Not registering an
+     <function>equalimage</function> function or returning
+     <literal>false</literal> indicates that deduplication is unsafe.
+     Every column in an index must have an
+     <function>equalimage</function> function that returns
+     <literal>true</literal> before deduplication can be used.
+    </para>
+    <para>
+     The <replaceable>opcintype</replaceable> argument is the
+     <literal><structname>pg_type</structname>.oid</literal> of the
+     data type that is indexed.  This is a convenience that allows
+     reuse of the same underlying <function>equalimage</function>
+     function across operator classes.  If the indexed values are of a
+     collatable data type, the appropriate collation OID will be
+     passed to the <function>equalimage</function> function, using the
+     standard <function>PG_GET_COLLATION()</function> mechanism.
+    </para>
+    <para>
+     The convention followed by all operator classes with an
+     <function>equalimage</function> function that are included with
+     the core distribution is to register one of two generic
+     functions, rather than registering their own custom function.
+     Most register <function>btequalimage()</function>, which
+     unconditionally indicates that deduplication is safe.  Operator
+     classes for collatable data types such as <type>text</type>
+     register the generic <function>btvarstrequalimage()</function>
+     function to indicate that deduplication is safe with
+     deterministic collations.  Best practice for third-party
+     extensions is to register their own custom function to retain
+     control.
+    </para>
+    <para>
+     <quote>Image</quote> equality is <emphasis>almost</emphasis> the
+     same condition as simple bitwise equality.  There is one subtle
+     difference.  When indexing a varlena data type, the on-disk
+     representation of two image equal datums may not be bitwise equal
+     due to inconsistent application of <acronym>TOAST</acronym>
+     compression on input.  Formally, when an operator class's
+     <function>equalimage</function> function returns
+     <literal>true</literal>, it is safe to assume that the
+     <literal>datum_image_eq()</literal> C function will always agree
+     with the operator class's <function>order</function> function
+     (provided that the same collation OID is passed to both the
+     <function>equalimage</function> and <function>order</function>
+     functions).
+    </para>
+    <para>
+     It is not possible for the core system to deduce anything about
+     an operator class within a multiple-data-type family based only
+     on the fact that some other operator class in the same family has
+     an <function>equalimage</function> function.  It is not sensible
+     to provide a cross-type <function>equalimage</function> function,
+     and attempting to do so will result in an error (that is,
+     <function>equalimage</function> functions cannot be
+     <quote>loose</quote> within an operator family).
+    </para>
+   </listitem>
+  </varlistentry>
  </variablelist>
 
 </sect1>
diff --git a/doc/src/sgml/ref/alter_opfamily.sgml b/doc/src/sgml/ref/alter_opfamily.sgml
index 848156c9d7..4ac1cca95a 100644
--- a/doc/src/sgml/ref/alter_opfamily.sgml
+++ b/doc/src/sgml/ref/alter_opfamily.sgml
@@ -153,9 +153,10 @@ ALTER OPERATOR FAMILY <replaceable>name</replaceable> USING <replaceable class="
       and hash functions it is not necessary to specify <replaceable
       class="parameter">op_type</replaceable> since the function's input
       data type(s) are always the correct ones to use.  For B-tree sort
-      support functions and all functions in GiST, SP-GiST and GIN operator
-      classes, it is necessary to specify the operand data type(s) the function
-      is to be used with.
+      support functions, B-Tree equal image functions, and all
+      functions in GiST, SP-GiST and GIN operator classes, it is
+      necessary to specify the operand data type(s) the function is to
+      be used with.
      </para>
 
      <para>
diff --git a/doc/src/sgml/ref/create_opclass.sgml b/doc/src/sgml/ref/create_opclass.sgml
index dd5252fd97..f42fb6494c 100644
--- a/doc/src/sgml/ref/create_opclass.sgml
+++ b/doc/src/sgml/ref/create_opclass.sgml
@@ -171,12 +171,14 @@ CREATE OPERATOR CLASS <replaceable class="parameter">name</replaceable> [ DEFAUL
       function is intended to support, if different from
       the input data type(s) of the function (for B-tree comparison functions
       and hash functions)
-      or the class's data type (for B-tree sort support functions and all
-      functions in GiST, SP-GiST, GIN and BRIN operator classes).  These defaults
-      are correct, and so <replaceable
-      class="parameter">op_type</replaceable> need not be specified in
-      <literal>FUNCTION</literal> clauses, except for the case of a B-tree sort
-      support function that is meant to support cross-data-type comparisons.
+      or the class's data type (for B-tree sort support functions,
+      B-tree equal image functions, and all functions in GiST,
+      SP-GiST, GIN and BRIN operator classes).  These defaults are
+      correct, and so <replaceable
+       class="parameter">op_type</replaceable> need not be specified
+      in <literal>FUNCTION</literal> clauses, except for the case of a
+      B-tree sort support function that is meant to support
+      cross-data-type comparisons.
      </para>
     </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/xindex.sgml b/doc/src/sgml/xindex.sgml
index ffb5164aaa..b9ca85f9a1 100644
--- a/doc/src/sgml/xindex.sgml
+++ b/doc/src/sgml/xindex.sgml
@@ -402,7 +402,7 @@
 
   <para>
    B-trees require a comparison support function,
-   and allow two additional support functions to be
+   and allow three additional support functions to be
    supplied at the operator class author's option, as shown in <xref
    linkend="xindex-btree-support-table"/>.
    The requirements for these support functions are explained further in
@@ -441,6 +441,14 @@
        </entry>
        <entry>3</entry>
       </row>
+      <row>
+       <entry>
+        Determine if it is generally safe to apply optimizations that
+        assume that any two equal keys must also be "image equal";
+        this makes the two keys totally interchangeable (optional)
+       </entry>
+       <entry>4</entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
@@ -980,7 +988,8 @@ DEFAULT FOR TYPE int8 USING btree FAMILY integer_ops AS
   OPERATOR 5 > ,
   FUNCTION 1 btint8cmp(int8, int8) ,
   FUNCTION 2 btint8sortsupport(internal) ,
-  FUNCTION 3 in_range(int8, int8, int8, boolean, boolean) ;
+  FUNCTION 3 in_range(int8, int8, int8, boolean, boolean) ,
+  FUNCTION 4 btequalimage(oid) ;
 
 CREATE OPERATOR CLASS int4_ops
 DEFAULT FOR TYPE int4 USING btree FAMILY integer_ops AS
@@ -992,7 +1001,8 @@ DEFAULT FOR TYPE int4 USING btree FAMILY integer_ops AS
   OPERATOR 5 > ,
   FUNCTION 1 btint4cmp(int4, int4) ,
   FUNCTION 2 btint4sortsupport(internal) ,
-  FUNCTION 3 in_range(int4, int4, int4, boolean, boolean) ;
+  FUNCTION 3 in_range(int4, int4, int4, boolean, boolean) ,
+  FUNCTION 4 btequalimage(oid) ;
 
 CREATE OPERATOR CLASS int2_ops
 DEFAULT FOR TYPE int2 USING btree FAMILY integer_ops AS
@@ -1004,7 +1014,8 @@ DEFAULT FOR TYPE int2 USING btree FAMILY integer_ops AS
   OPERATOR 5 > ,
   FUNCTION 1 btint2cmp(int2, int2) ,
   FUNCTION 2 btint2sortsupport(internal) ,
-  FUNCTION 3 in_range(int2, int2, int2, boolean, boolean) ;
+  FUNCTION 3 in_range(int2, int2, int2, boolean, boolean) ,
+  FUNCTION 4 btequalimage(oid) ;
 
 ALTER OPERATOR FAMILY integer_ops USING btree ADD
   -- cross-type comparisons int8 vs int2
diff --git a/src/test/regress/expected/alter_generic.out b/src/test/regress/expected/alter_generic.out
index ac5183c90e..ba5ce7a17e 100644
--- a/src/test/regress/expected/alter_generic.out
+++ b/src/test/regress/expected/alter_generic.out
@@ -354,9 +354,9 @@ ERROR:  invalid operator number 0, must be between 1 and 5
 ALTER OPERATOR FAMILY alt_opf4 USING btree ADD OPERATOR 1 < ; -- operator without argument types
 ERROR:  operator argument types must be specified in ALTER OPERATOR FAMILY
 ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 0 btint42cmp(int4, int2); -- function number should be between 1 and 5
-ERROR:  invalid function number 0, must be between 1 and 3
+ERROR:  invalid function number 0, must be between 1 and 4
 ALTER OPERATOR FAMILY alt_opf4 USING btree ADD FUNCTION 6 btint42cmp(int4, int2); -- function number should be between 1 and 5
-ERROR:  invalid function number 6, must be between 1 and 3
+ERROR:  invalid function number 6, must be between 1 and 4
 ALTER OPERATOR FAMILY alt_opf4 USING btree ADD STORAGE invalid_storage; -- Ensure STORAGE is not a part of ALTER OPERATOR FAMILY
 ERROR:  STORAGE cannot be specified in ALTER OPERATOR FAMILY
 DROP OPERATOR FAMILY alt_opf4 USING btree;
@@ -493,6 +493,10 @@ ALTER OPERATOR FAMILY alt_opf18 USING btree ADD
   OPERATOR 4 >= (int4, int2) ,
   OPERATOR 5 > (int4, int2) ,
   FUNCTION 1 btint42cmp(int4, int2);
+-- Should fail. Not allowed to have cross-type equalimage function.
+ALTER OPERATOR FAMILY alt_opf18 USING btree
+  ADD FUNCTION 4 (int4, int2) btequalimage(oid);
+ERROR:  btree equal image functions must not be cross-type
 ALTER OPERATOR FAMILY alt_opf18 USING btree DROP FUNCTION 2 (int4, int4);
 ERROR:  function 2(integer,integer) does not exist in operator family "alt_opf18"
 DROP OPERATOR FAMILY alt_opf18 USING btree;
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index c19740e5db..4a95815bb1 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2111,6 +2111,44 @@ WHERE p1.amproc = p2.oid AND
 --------------+--------+--------
 (0 rows)
 
+-- Almost all of the core distribution's Btree opclasses can use one of the
+-- two generic "equalimage" functions as their support function 4.  Look for
+-- opclasses that don't allow deduplication unconditionally here.
+--
+-- Newly added Btree opclasses don't have to support deduplication.  It will
+-- usually be trivial to add support, though.  Note that the expected output
+-- of this part of the test will need to be updated when a new opclass does
+-- not or cannot support deduplication.
+SELECT amp.amproc::regproc AS proc, opf.opfname AS opfamily_name,
+       opc.opcname AS opclass_name, opc.opcintype::regtype AS opcintype
+FROM pg_am am
+JOIN pg_opclass AS opc ON opc.opcmethod = am.oid
+JOIN pg_opfamily AS opf ON opc.opcfamily = opf.oid
+LEFT JOIN pg_amproc AS amp ON amp.amprocfamily = opf.oid AND
+    amp.amproclefttype = opc.opcintype AND amp.amprocnum = 4
+WHERE am.amname = 'btree' AND
+    amp.amproc IS DISTINCT FROM 'btequalimage'::regproc
+ORDER BY 1, 2, 3;
+        proc        |  opfamily_name   |   opclass_name   |    opcintype     
+--------------------+------------------+------------------+------------------
+ btvarstrequalimage | bpchar_ops       | bpchar_ops       | character
+ btvarstrequalimage | text_ops         | name_ops         | name
+ btvarstrequalimage | text_ops         | text_ops         | text
+ btvarstrequalimage | text_ops         | varchar_ops      | text
+                    | array_ops        | array_ops        | anyarray
+                    | enum_ops         | enum_ops         | anyenum
+                    | float_ops        | float4_ops       | real
+                    | float_ops        | float8_ops       | double precision
+                    | jsonb_ops        | jsonb_ops        | jsonb
+                    | money_ops        | money_ops        | money
+                    | numeric_ops      | numeric_ops      | numeric
+                    | range_ops        | range_ops        | anyrange
+                    | record_image_ops | record_image_ops | record
+                    | record_ops       | record_ops       | record
+                    | tsquery_ops      | tsquery_ops      | tsquery
+                    | tsvector_ops     | tsvector_ops     | tsvector
+(16 rows)
+
 -- **************** pg_index ****************
 -- Look for illegal values in pg_index fields.
 SELECT p1.indexrelid, p1.indrelid
diff --git a/src/test/regress/sql/alter_generic.sql b/src/test/regress/sql/alter_generic.sql
index 9eeea2a87e..223d66bc2d 100644
--- a/src/test/regress/sql/alter_generic.sql
+++ b/src/test/regress/sql/alter_generic.sql
@@ -430,6 +430,9 @@ ALTER OPERATOR FAMILY alt_opf18 USING btree ADD
   OPERATOR 4 >= (int4, int2) ,
   OPERATOR 5 > (int4, int2) ,
   FUNCTION 1 btint42cmp(int4, int2);
+-- Should fail. Not allowed to have cross-type equalimage function.
+ALTER OPERATOR FAMILY alt_opf18 USING btree
+  ADD FUNCTION 4 (int4, int2) btequalimage(oid);
 ALTER OPERATOR FAMILY alt_opf18 USING btree DROP FUNCTION 2 (int4, int4);
 DROP OPERATOR FAMILY alt_opf18 USING btree;
 
diff --git a/src/test/regress/sql/opr_sanity.sql b/src/test/regress/sql/opr_sanity.sql
index 624bea46ce..31485a434c 100644
--- a/src/test/regress/sql/opr_sanity.sql
+++ b/src/test/regress/sql/opr_sanity.sql
@@ -1323,6 +1323,24 @@ WHERE p1.amproc = p2.oid AND
     p1.amproclefttype != p1.amprocrighttype AND
     p2.provolatile = 'v';
 
+-- Almost all of the core distribution's Btree opclasses can use one of the
+-- two generic "equalimage" functions as their support function 4.  Look for
+-- opclasses that don't allow deduplication unconditionally here.
+--
+-- Newly added Btree opclasses don't have to support deduplication.  It will
+-- usually be trivial to add support, though.  Note that the expected output
+-- of this part of the test will need to be updated when a new opclass does
+-- not or cannot support deduplication.
+SELECT amp.amproc::regproc AS proc, opf.opfname AS opfamily_name,
+       opc.opcname AS opclass_name, opc.opcintype::regtype AS opcintype
+FROM pg_am am
+JOIN pg_opclass AS opc ON opc.opcmethod = am.oid
+JOIN pg_opfamily AS opf ON opc.opcfamily = opf.oid
+LEFT JOIN pg_amproc AS amp ON amp.amprocfamily = opf.oid AND
+    amp.amproclefttype = opc.opcintype AND amp.amprocnum = 4
+WHERE am.amname = 'btree' AND
+    amp.amproc IS DISTINCT FROM 'btequalimage'::regproc
+ORDER BY 1, 2, 3;
 
 -- **************** pg_index ****************
 
-- 
2.17.1

#141

Peter Geoghegan

pg@bowt.ie

almost 6 years ago

In reply to: Peter Geoghegan (#140)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Mon, Feb 24, 2020 at 4:54 PM Peter Geoghegan <pg@bowt.ie> wrote:

Attached is v34, which has this change. My plan is to commit something
very close to this on Wednesday morning (barring any objections).

Pushed.

I'm going to delay committing the pageinspect patch until tomorrow,
since I haven't thought about that aspect of the project in a while.
Seems like a good idea to go through it one more time, once it's clear
that the buildfarm is stable. The buildfarm appears to be stable now,
though there was an issue with a compiler warning earlier. I quickly
pushed a fix for that, and can see that longfin is green/passing now.

Thanks for sticking with this project, Anastasia.
--
Peter Geoghegan

#142

Fujii Masao

masao.fujii@oss.nttdata.com

almost 6 years ago

In reply to: Peter Geoghegan (#141)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On 2020/02/27 7:43, Peter Geoghegan wrote:

On Mon, Feb 24, 2020 at 4:54 PM Peter Geoghegan <pg@bowt.ie> wrote:

Attached is v34, which has this change. My plan is to commit something
very close to this on Wednesday morning (barring any objections).

Pushed.

Thanks for committing this nice feature!

Here is one minor comment.

+      <primary><varname>deduplicate_items</varname></primary>
+      <secondary>storage parameter</secondary>

This should be

<primary><varname>deduplicate_items</varname> storage parameter</primary>

<secondary> for reloption is necessary only when the GUC parameter
with the same name of the reloption exists. So, for example, you can
see that <secondary> is used in vacuum_cleanup_index_scale_factor
but not in buffering reloption.

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters

#143

Peter Geoghegan

pg@bowt.ie

almost 6 years ago

In reply to: Fujii Masao (#142)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Wed, Feb 26, 2020 at 10:03 PM Fujii Masao
<masao.fujii@oss.nttdata.com> wrote:

Thanks for committing this nice feature!

You're welcome!

Here is one minor comment.
+      <primary><varname>deduplicate_items</varname></primary>
+      <secondary>storage parameter</secondary>
This should be

<primary><varname>deduplicate_items</varname> storage parameter</primary>

I pushed a fix for this.

Thanks
--
Peter Geoghegan

#144

Andres Freund

andres@anarazel.de

almost 6 years ago

In reply to: Peter Geoghegan (#141)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On 2020-02-26 14:43:27 -0800, Peter Geoghegan wrote:

On Mon, Feb 24, 2020 at 4:54 PM Peter Geoghegan <pg@bowt.ie> wrote:

Attached is v34, which has this change. My plan is to commit something
very close to this on Wednesday morning (barring any objections).

Pushed.

Congrats!

#145

Peter Geoghegan

pg@bowt.ie

almost 6 years ago

In reply to: Andres Freund (#144)

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

On Fri, Mar 6, 2020 at 11:00 AM Andres Freund <andres@anarazel.de> wrote:

Pushed.

Congrats!

Thanks Andres!

--
Peter Geoghegan